Scene Text Understanding in Natural Images With Convolutional Neural Networks

Open Access
- Author:
- He, Dafang
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- April 19, 2019
- Committee Members:
- Clyde Lee Giles, Dissertation Advisor/Co-Advisor
Clyde Lee Giles, Committee Chair/Co-Chair
Zihan Zhou, Committee Member
James Z Wang, Committee Member
Daniel Kifer, Outside Member
- Keywords:
- Deep Learning
Convolutional Neural Networks
Scene Text
Detection
- Abstract:
- Text in images contains rich semantic information. The ability to read text can be used in many applications, including autonomous driving, image and video indexing, and assistive technology for visually impaired people. This problem is typically called scene text understanding. It comprises several sub-problems: (1) scene text detection, (2) scene text recognition, and (3) scene text verification or retrieval. In this dissertation, I investigate scene text understanding with a focus on text detection and text verification. Scene text detection aims to find the location of each text instance; usually the model is expected to predict a bounding box for each one. It shares several difficulties with regular object detection, such as noisy images and scale variation. However, one major difference between regular object detection and scene text detection is that the latter usually requires predicting an oriented or even curved bounding box for each text instance. Scene text recognition usually follows scene text detection in an end-to-end text reading system: the model needs to transcribe each text instance. Scene text verification determines whether text is present in a natural image. It is the most critical component of a scene text retrieval system.
In this dissertation, I explore various methods for scene text detection and verification with convolutional neural networks (CNNs). Specifically, for scene text detection, I propose three algorithms and one training framework. The first algorithm adopts a traditional region proposal method with a novel CNN classifier that aggregates local context into classification. The second algorithm uses a fully convolutional neural network for semantic text segmentation, and a novel instance-aware segmentation is proposed to further split each extracted text block into text instances. The third work focuses on arbitrarily oriented scene text detection. It proposes a general and novel framework called Detect-Associate-Segment (DAS) for detecting arbitrarily oriented text. A keypoint-based model designed on this framework achieves state-of-the-art performance on various benchmark datasets. In addition to the detection algorithms, this dissertation explores a new training framework for scene text detection, in which a novel contour task is introduced to assist detection and improve final performance. For scene text verification, this dissertation studies a new end-to-end model design that outperforms traditional algorithms by a large margin. This is demonstrated on a large-scale scene text dataset with millions of street view images.