Prototype Workflow for Semi-Automatic Processing of Demonstrative Pronouns and Pointing Gestures
Integrating Speech and Gesture Processing for Linguistic Analysis using Python
Understanding the interplay between speech and gesture is crucial for linguistic and cognitive research. The current prototype, available on GitHub, aims to automate the analysis of temporal alignment between spoken demonstrative pronouns and pointing gestures in video recordings. By integrating computer vision (via Google’s MediaPipe) and speech recognition (via language-specific Vosk models) in Python, the workflow produces enriched video annotations and alignment data, offering valuable insights into deictic communication.
For reference, the GitHub repository includes an ELAN folder containing output from a traditional annotation process using the ELAN program. Ultimately, the performance of the semi-automated prototype must be validated against these ELAN-based annotations.
How It Works: Running the Program
The prototype system requires primary data in the form of video and corresponding audio files, which should be placed in mnt/primary data. Video-audio pairs should share the same basename (e.g., 1.mp4 and 1.wav). The video should feature a person in a medium or medium close-up shot.
Running the following command initiates the processing pipeline:
python main.py --audio_folder "mnt/primary data/audio" \
--video_folder "mnt/primary data/video" \
--model "mnt/primary data/vosk-model-de-0.21" \
--demonstratives "der,die,das,den,dem,denen,dessen,deren,dieser,diese,dieses,diesen,diesem" \
--output "mnt/output" \
--max_time_diff 800
This command processes the data and stores results in the designated output directory.
Breaking Down the Processing Pipeline
1. Audio Transcription and Word Onset Extraction (audio_processing.py)
The speech recognition model transcribes spoken content and identifies demonstrative pronouns from a predefined list.
Onset times of these pronouns are extracted to facilitate alignment analysis.
Outputs include a plain text transcript and a WebVTT subtitle file.
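As a rough illustration, the onset-extraction step with Vosk could look like the sketch below; the function name extract_onsets and the streaming loop are assumptions made for illustration, not the repository's actual audio_processing.py code.

# A minimal sketch of word-onset extraction with Vosk; names are illustrative.
import json
import wave
from vosk import Model, KaldiRecognizer

def extract_onsets(wav_path, model_path, words_of_interest):
    """Return (word, onset_in_seconds) pairs for the listed demonstratives."""
    wf = wave.open(wav_path, "rb")  # expects 16-bit mono PCM audio
    model = Model(model_path)
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)  # ask Vosk for per-word timestamps

    onsets = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            for w in json.loads(rec.Result()).get("result", []):
                if w["word"].lower() in words_of_interest:
                    onsets.append((w["word"], w["start"]))
    # flush the final partial utterance
    for w in json.loads(rec.FinalResult()).get("result", []):
        if w["word"].lower() in words_of_interest:
            onsets.append((w["word"], w["start"]))
    return onsets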
2. Gesture Detection (video_processing.py)
MediaPipe’s hand landmark estimation detects pointing gestures based on the position of the wrist (landmark 0) and the tip of the index finger (landmark 8); the online demonstration is worth checking out. A pointing gesture is recognised at the moment when these landmarks are maximally distant from each other.
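A simplified sketch of this idea, using MediaPipe's classic solutions API and OpenCV, is given below; the function gesture_apexes and the per-frame bookkeeping are illustrative assumptions rather than the actual video_processing.py implementation.

# A sketch of per-frame wrist-to-index-tip distances; a local maximum marks a candidate apex.
import math
import cv2
import mediapipe as mp

def gesture_apexes(video_path):
    """Return (time_in_seconds, distance) per frame with a detected hand."""
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    distances = []

    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            lm = result.multi_hand_landmarks[0].landmark
            wrist, tip = lm[0], lm[8]  # landmark 0 = wrist, landmark 8 = index fingertip
            d = math.hypot(tip.x - wrist.x, tip.y - wrist.y)
            distances.append((frame_idx / fps, d))
        frame_idx += 1

    cap.release()
    hands.close()
    return distances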
3. Alignment Analysis (alignment_analysis.py)
The extracted demonstrative pronoun onsets are compared with detected gesture apexes. Onsets and apexes are paired on a case-by-case basis if the distance between them is smaller than the maximum gap (max_time_diff). Temporal differences between speech and gesture events are calculated.
A CSV file containing word-gesture alignment data is generated.
Visualisations, including histograms and scatter plots, illustrate alignment patterns.
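A minimal pairing routine along these lines might look as follows; the function align, the millisecond units, and the CSV column names are assumptions chosen for illustration.

# An illustrative pairing routine: match each pronoun onset to the nearest apex within max_time_diff ms.
import csv

def align(word_onsets, gesture_apexes, max_time_diff=800, out_csv="alignment.csv"):
    """word_onsets: (word, onset_ms) pairs; gesture_apexes: apex timestamps in ms."""
    rows = []
    for word, onset in word_onsets:
        nearest = min(gesture_apexes, key=lambda apex: abs(apex - onset), default=None)
        if nearest is not None and abs(nearest - onset) <= max_time_diff:
            rows.append({"word": word, "word_onset_ms": onset,
                         "gesture_apex_ms": nearest,
                         "difference_ms": nearest - onset})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["word", "word_onset_ms",
                                               "gesture_apex_ms", "difference_ms"])
        writer.writeheader()
        writer.writerows(rows)
    return rows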
4. Video Processing and Annotation (video_editing.py)
The system overlays the transcribed speech as subtitles.
Gesture peaks are highlighted to make alignment patterns visible.
The original audio is merged into the video for reference.
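One way to realise these steps is to burn the WebVTT subtitles and re-attach the original audio with ffmpeg called from Python, as in the hedged sketch below; the actual video_editing.py may rely on a different toolchain, and the gesture-peak highlighting is omitted here.

# A sketch of subtitle burning and audio muxing via ffmpeg; paths and codec choices are illustrative.
import subprocess

def annotate_video(video_in, audio_in, subtitles_vtt, video_out):
    """Overlay the transcript as subtitles and merge the original audio track."""
    cmd = [
        "ffmpeg", "-y",
        "-i", video_in,                       # annotated video frames
        "-i", audio_in,                       # original audio for reference
        "-vf", f"subtitles={subtitles_vtt}",  # burn the transcript as subtitles
        "-map", "0:v", "-map", "1:a",         # video from input 0, audio from input 1
        "-c:a", "aac",
        video_out,
    ]
    subprocess.run(cmd, check=True)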
5. Automated Execution (main.py)
The main script coordinates the entire process.
Multiple audio-video file pairs can be processed simultaneously.
Results are systematically organised in the output directory.
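A minimal, sequential coordination sketch in the spirit of main.py is shown below; it reuses the hypothetical helpers sketched above (extract_onsets, gesture_apexes, align), omits video annotation and visualisation, and processes pairs one after another, whereas the actual script may batch them.

# An illustrative orchestration loop over matching audio-video basenames.
from pathlib import Path

def run_pipeline(audio_folder, video_folder, model, demonstratives,
                 output, max_time_diff):
    out_dir = Path(output)
    out_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(audio_folder).glob("*.wav")):
        mp4 = Path(video_folder) / (wav.stem + ".mp4")  # pairs share a basename
        if not mp4.exists():
            continue  # skip audio files without a matching video
        onsets = extract_onsets(str(wav), model, set(demonstratives))
        dists = gesture_apexes(str(mp4))
        # crude apex picking: local maxima of the wrist-to-fingertip distance
        apex_ms = [t * 1000 for i, (t, d) in enumerate(dists)
                   if 0 < i < len(dists) - 1
                   and dists[i - 1][1] < d >= dists[i + 1][1]]
        onsets_ms = [(w, s * 1000) for w, s in onsets]  # Vosk reports seconds
        align(onsets_ms, apex_ms, max_time_diff,
              out_csv=str(out_dir / f"{wav.stem}_alignment.csv"))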
Current Challenges and Next Steps
1. Improving Pronoun Identification
One limitation of the current system is the overidentification of demonstrative pronouns. In languages such as English, French and German, many definite articles are mistakenly included because they share the same form as demonstrative pronouns. This issue could be addressed by replacing the current fuzzy words_of_interest list with a more precise list in which each pronoun is contextualised by its preceding and subsequent words, as sketched below.
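A hedged sketch of such a contextual filter is given below; the pattern set and the helper is_demonstrative are invented placeholders for illustration, not a validated list for any language.

# Match (previous word, candidate, next word) patterns instead of bare word forms.
CONTEXT_PATTERNS = {
    # ("previous", "candidate", "next") — None acts as a wildcard; entries are placeholders
    (None, "das", "da"),
    (None, "der", "dort"),
    ("guck", "die", None),
}

def is_demonstrative(prev_word, candidate, next_word):
    """Accept a candidate only if some context pattern matches."""
    for prev_p, cand_p, next_p in CONTEXT_PATTERNS:
        if cand_p != candidate.lower():
            continue
        if prev_p is not None and prev_p != (prev_word or "").lower():
            continue
        if next_p is not None and next_p != (next_word or "").lower():
            continue
        return True
    return False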
2. Enhancing Gesture Detection Accuracy
The system underidentifies pointing gestures, which impacts the overall analysis. Improving MediaPipe’s detection implementation and incorporating additional filtering methods—such as movement velocity thresholds—could significantly enhance accuracy.
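A sketch of a velocity-based filter along these lines is shown below; the stroke-then-hold heuristic and the thresholds are illustrative assumptions that would need tuning against the ELAN reference annotations.

# Keep an apex candidate only if a fast stroke precedes it and a brief hold follows it.
import math

def filter_apexes(positions, fps, candidates,
                  stroke_velocity=0.5, hold_velocity=0.05, window=3):
    """positions: per-frame (x, y) of the index fingertip in normalised units.
    candidates: frame indices of apex candidates."""
    # frame-to-frame velocity in normalised units per second
    vel = [math.hypot(x2 - x1, y2 - y1) * fps
           for (x1, y1), (x2, y2) in zip(positions, positions[1:])]
    kept = []
    for f in candidates:
        before = vel[max(0, f - window):f] or [0.0]
        after = vel[f:f + window] or [0.0]
        if max(before) >= stroke_velocity and max(after) <= hold_velocity:
            kept.append(f)
    return kept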
Conclusion
This prototype represents an important step towards automating the analysis of speech-gesture interactions. By bridging linguistic and computer vision technologies, the system offers a scalable method for studying deictic communication, paving the way for further refinements in multimodal linguistic analysis.