Extending Video Action Recognition in the Compressed Domain

Open Access
- Author:
- Abrams, Samuel
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 14, 2022
- Committee Members:
- Vijaykrishnan Narayanan, Thesis Advisor/Co-Advisor
John Morgan Sampson, Committee Member
Chitaranjan Das, Program Head/Chair - Keywords:
- Action Recognition
Neural Networks
Video Compression - Abstract:
- As the internet continues to extend its reach into every facet of society, video is becoming one of the most common mediums for the communication of ideas and information. As of 2021, video traffic makes up 81% of all internet traffic worldwide. It is, therefore, clear that being able to understand video autonomously and on a mass scale provides huge potential for technologies that will shape the future. This could immediately impact the entertainment industry, surveillance, data collection, human-computer interaction, and more. Much work has been done in an attempt to take the successful strategies that convolutional neural networks have had on image classification and apply them to video understanding. One instrumental approach leverages the inherent redundancy in video by utilizing the existing format in which video is commonly stored: its compressed state. By learning directly on compressed video, models are exposed only to the most pertinent spatial information as well as motion information that is present without any computation. While the present methods generate near-SOTA results, there are many areas for improvement. All notable works in this topic have been performed on video compressed with the MPEG-4 part-2 codec which is dated in part due to its non-optimized compression ratio. An important contribution made to this discipline was showing that using this non-optimized codec the abundance of uncompressed frames allowed for the complete disregard for the motion information in the stream using a process of temporal shifting between the spatial frames. This generated improved accuracy with system energy savings of over 11x over the traditional method. Another contribution has been modifying the methods of compressed recognition on the archaic aforementioned codec and applying them to the modern, highly-optimized H.264. With this codec, the standard for platforms such as YouTube and Twitch, developing a method for performing recognition on H.264 directly is of immense importance for applying video understanding in the real world. Through my exploration of strategies in processing video in this format, I have developed a method that provides comparable accuracy to the methods on MPEG4 part-2 even given the constraints of utilizing H.264 directly.