Reliable Extraction of Text from Video

Open Access
Author:
Antani, Sameer
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
May 31, 2001
Committee Members:
  • Soundar Rajan Tirupatikumara, Committee Member
  • Amanda Spink, Committee Member
  • Lee David Coraor, Committee Member
  • Rajeev Sharma, Committee Member
  • Rangachar Kasturi, Committee Chair
Keywords:
  • Text Extraction
  • Video Indexing
  • Performance Evaluation
The detection and extraction of scene and caption text from unconstrained, general-purpose video is an important research problem in the context of content-based retrieval and summarization of visual information. The current state of the art for extracting text from video either makes simplistic assumptions about the nature of the text to be found or restricts itself to a subclass of the wide variety of text that can occur in broadcast video. Most published methods work only on artificial text (captions) composited onto the video frame. Moreover, these methods were developed for extracting text from still images and have simply been applied to video frames; they do not use the additional temporal information in video to good effect. In addition, no comprehensive system has been developed that can robustly detect a large variety of text in video.
This thesis presents a reliable system for detecting, localizing, extracting, tracking, and binarizing text from unconstrained, general-purpose video. In developing methods for extracting text from video, it was observed that no single algorithm could detect all forms of text. The strategy is therefore a multi-pronged approach, one that involves multiple methods and algorithms operating in functional parallelism. The system also utilizes the temporal information available in video. Given the greatly varied nature of video and of the text appearing in it, this approach minimizes the risk of failure while maximizing the potential payoff. The system can operate on JPEG images, MPEG-1 bitstreams, and live video feeds. The methods can also be run individually and independently. It was also observed that many methods published in the literature restrict their results to bounded text regions in frame images and lack a thorough evaluation on general-purpose video data.
This thesis addresses all the above issues by presenting a novel text detection method, a strategy for fusion of text extraction algorithms, and a thorough evaluation of methods for extraction of text from video.
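As a rough illustration of running several detectors in functional parallelism and exploiting temporal information, the following minimal Python sketch may help. It is not the fusion strategy developed in the thesis. The box representation, the `Detector` callable type, the overlap-based merge rule, and the persistence filter are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: an axis-aligned box (x1, y1, x2, y2).
Box = Tuple[int, int, int, int]
# Hypothetical detector interface: takes a frame, returns candidate text boxes.
Detector = Callable[[object], List[Box]]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def fuse_frame(frame: object, detectors: Sequence[Detector],
               thresh: float = 0.5) -> List[Box]:
    """Run every detector on one frame and merge overlapping candidates."""
    fused: List[Box] = []
    for detect in detectors:
        for box in detect(frame):
            for i, kept in enumerate(fused):
                if iou(box, kept) >= thresh:
                    # Same region reported twice: keep the enclosing rectangle.
                    fused[i] = (min(box[0], kept[0]), min(box[1], kept[1]),
                                max(box[2], kept[2]), max(box[3], kept[3]))
                    break
            else:
                # No existing box overlaps enough: treat it as a new region.
                fused.append(box)
    return fused


def temporally_stable(per_frame: Sequence[List[Box]], min_frames: int = 3,
                      thresh: float = 0.5) -> List[Box]:
    """Keep boxes of the latest frame that also appear in each of the
    preceding min_frames - 1 frames (assumed persistence filter)."""
    if len(per_frame) < min_frames:
        return []
    window = per_frame[-min_frames:]
    return [box for box in window[-1]
            if all(any(iou(box, other) >= thresh for other in boxes)
                   for boxes in window[:-1])]
```

In this sketch, any detector can be dropped from or added to the `detectors` list without touching the others, which mirrors the idea that the methods can run individually. The persistence filter is one simple way to use temporal information: text usually stays on screen for several frames, while many false alarms do not.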