A Prosody Based Approach for Automated Understanding of Coverbal Gestures
Open Access
- Author:
- Kettebekov, Sanshzar
- Graduate Program:
- Industrial Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- March 23, 2004
- Committee Members:
- Soundar Kumara, Committee Chair/Co-Chair
Rajeev Sharma, Committee Chair/Co-Chair
Ernest Emory Enscore Jr., Committee Member
Richard Donovan Koubek, Committee Member
Rangachar Kasturi, Committee Member
Mohammed Yeasin, Committee Member
- Keywords:
- gesture recognition
multimodal
HCI
- Abstract:
- Although both speech and gesture recognition have been studied extensively, vision-based acquisition of natural gestures remains a challenging problem. Poor recognition accuracy for spontaneous conversational gestures is one of the reasons their application has been restricted in human-computer interaction, biometrics, animated character synthesis, etc. To date, attempts to improve that accuracy have mostly been limited to multimodal frameworks that exploit gesture-word level associations. Such semantically motivated schemas inherit the complexity of natural language and gesture understanding and have proved hard to apply outside restricted domains. This thesis proposes a new approach that explores prosodic levels of structuring during coverbal gesture production for the disambiguation of gestures. It is based on a multimodal co-analysis of hand kinematics and various prosodic manifestations in speech, such as changes in intonation, loudness, and pause duration. Continuous hand movement is represented as a sequence of distinct gesture primitives extracted from the visual signal. Two types of gesticulation are considered: deictic gestures in Weather Channel broadcasts and spontaneous beats in presentation monologue videos. For deictic gestures, we define separate computational frameworks to model the co-articulation phenomenon and production constraints. For conversational beats, a set of gesture-speech articulations is defined to represent different functions within a narrative discourse. In this analysis, observable perturbations in vocal production are related to the concurrent beat-like gestures. The efficacy of the proposed approach is demonstrated on a large multimodal corpus by a significant improvement in gesture recognition accuracy.