Information Extraction and Retrieval from Digital Screenshots – Archiving in situ Media Behavior

Open Access
- Author:
- Chiatti, Agnese
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- July 19, 2019
- Committee Members:
- Prasenjit Mitra, Thesis Advisor/Co-Advisor
- Nilam Ram, Committee Member
- Xiang Zhang, Committee Member
- Keywords:
- information extraction
- information retrieval
- digital screenshots
- Abstract:
- A significant proportion of individuals' daily activities is experienced through digital devices. Smartphones, specifically, have become one of the preferred interfaces for content consumption and social interaction. Identifying the content that appears on smartphone screens and the rapid switches in content over time is thus a crucial prerequisite to studying media behavior and the potential impacts of screen content on physical, psychological and social health and well-being. A need thus arises to effectively extract the content captured in digital screenshots and represent it in a machine-readable and efficiently retrievable form. Moreover, screenshot images can depict heterogeneous content and applications, making the a priori definition of adequate taxonomies a cumbersome task, even for humans. Because the sensitive data captured on screens must be kept private, the annotation effort cannot be crowd-sourced, and the costs of manual annotation are high. Thus, there is a need to examine the utility of unsupervised and semi-supervised methods for classifying digital screenshots. This work examines the implications of applying clustering to large screenshot sets when only a few labeled data points are available. We present an end-to-end framework implemented to: (i) extract text from digital screenshots, (ii) index the extracted text through Elasticsearch, (iii) store it in a MongoDB collection of JSON documents, together with the associated metadata, and (iv) classify the screenshot content through a combination of semi-supervised clustering and active learning.
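
Step (iv) of the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the text has already been extracted from each screenshot, uses a seeded, k-means-style assignment with cosine similarity over bag-of-words counts as a stand-in for the semi-supervised clustering, and uses a smallest-margin rule as a stand-in for the active learning query. All function names, document ids, labels and sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one screenshot's extracted text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Summed term counts; cosine similarity ignores the overall scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def seeded_clusters(docs, seeds, iterations=5):
    """Cluster docs ({id: text}) around a few labeled seeds ({id: label}).

    Assumption: labeled seeds keep their label in every iteration.
    """
    vecs = {d: vectorize(t) for d, t in docs.items()}
    labels = sorted(set(seeds.values()))
    assign = dict(seeds)
    cents = {}
    for _ in range(iterations):
        cents = {l: centroid([vecs[d] for d, a in assign.items() if a == l])
                 for l in labels}
        assign = {d: seeds[d] if d in seeds
                  else max(labels, key=lambda l: cosine(vecs[d], cents[l]))
                  for d in vecs}
    return assign, cents, vecs


def query_next(cents, vecs, seeds):
    """Pick the unlabeled doc whose two best clusters are closest (needs >= 2 labels)."""
    def margin(d):
        sims = sorted((cosine(vecs[d], c) for c in cents.values()), reverse=True)
        return sims[0] - sims[1]
    return min((d for d in vecs if d not in seeds), key=margin)


# Hypothetical extracted text for five screenshots, two of them labeled.
docs = {
    "s1": "inbox compose reply send email",
    "s2": "like share comment follow post",
    "s3": "reply email draft send",
    "s4": "follow post like photo",
    "s5": "send post email",
}
seeds = {"s1": "mail", "s2": "social"}

assign, cents, vecs = seeded_clusters(docs, seeds)
next_id = query_next(cents, vecs, seeds)
print(assign, next_id)
```

In this sketch the annotator would be asked to label `next_id`, the new label would be added to `seeds`, and the clustering would be rerun. That loop is one way to spend a small annotation budget on the screenshots the clustering is least sure about.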