Overview

The LAFCam project addresses the scenario of shooting home videos. When you have a lot of footage, can the cameraman's laughter indicate points of interest? To explore this possibility, we built a system that recognizes the cameraman's laughter and uses it to index the video in an editing application.

Harvard Square

We collected data by having the cameraman wear a speech-recognition microphone and a Sony digital voice recorder. The microphone is a noise-canceling headset boom microphone, chosen to maximize the signal-to-noise ratio. The audio was transferred to a PC and segmented into 1-4 second clips of laughter and speech.

Hidden Markov Models

With this data we then trained two Hidden Markov Models (HMMs): one for laughter and one for all other speech. The observation at each state of the model is the 64 spectral coefficients of the spectrogram of the audio signal at that point in time. The laughter model has three states, with the output modeled as a mixture of two Gaussian distributions. The speech model has five states and its output is also modeled as a mixture of two Gaussians.
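
As a rough illustration of this setup (the original models were built in Matlab; the sketch below uses Python with scipy and hmmlearn, and the exact spectrogram settings, training options, and file lists are assumptions), each frame becomes one 64-dimensional log-spectral vector and each class gets its own GMM-HMM:

    # Sketch of the two-model setup, not the original Matlab code.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram
    from hmmlearn.hmm import GMMHMM

    N_PERSEG = 126  # 126 // 2 + 1 = 64 spectral coefficients per frame

    def load_clip(path):
        """Read a wav file and return a mono waveform plus its sample rate."""
        fs, x = wavfile.read(path)
        if x.ndim > 1:
            x = x.mean(axis=1)          # mix down to mono
        return x.astype(float), fs

    def spectral_features(x, fs):
        """Waveform -> (frames, 64) array of log-spectral coefficients."""
        _, _, sxx = spectrogram(x, fs=fs, nperseg=N_PERSEG)
        return np.log(sxx + 1e-10).T

    def train_hmm(paths, n_states):
        """Train one GMM-HMM with two diagonal-covariance mixtures per state."""
        feats = [spectral_features(*load_clip(p)) for p in paths]
        X = np.concatenate(feats)
        lengths = [len(f) for f in feats]
        model = GMMHMM(n_components=n_states, n_mix=2,
                       covariance_type="diag", n_iter=25)
        model.fit(X, lengths)
        return model

    # Three states for laughter, five for all other speech, as described above;
    # laughter_clip_paths / speech_clip_paths are hypothetical lists of wav files.
    laughter_hmm = train_hmm(laughter_clip_paths, n_states=3)
    speech_hmm = train_hmm(speech_clip_paths, n_states=5)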

Laughter Spectrogram

We performed two analyses of these HMMs. In the first, we took a corpus of 40 laughter examples and 210 speech examples and trained the models on 70% of the data. The remaining 30% was used for testing, and we achieved 88% accuracy. The problem with this analysis is that it deals with "ideal" data: the test data was segmented by hand and was guaranteed to fall into one of the two classes.
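
This first analysis can be sketched as follows (again in Python, reusing the hypothetical load_clip, spectral_features, and train_hmm helpers from the previous sketch): hold out 30% of the labeled clips, then classify each held-out clip by whichever model assigns it the higher log-likelihood.

    import random

    def split(paths, train_frac=0.7, seed=0):
        """Shuffle a list of clips and split it into training and test portions."""
        paths = list(paths)
        random.Random(seed).shuffle(paths)
        cut = int(len(paths) * train_frac)
        return paths[:cut], paths[cut:]

    laugh_train, laugh_test = split(laughter_clip_paths)
    speech_train, speech_test = split(speech_clip_paths)

    laughter_hmm = train_hmm(laugh_train, n_states=3)
    speech_hmm = train_hmm(speech_train, n_states=5)

    def is_laughter(path):
        """Classify one clip by comparing log-likelihoods under the two models."""
        feats = spectral_features(*load_clip(path))
        return laughter_hmm.score(feats) > speech_hmm.score(feats)

    correct = sum(is_laughter(p) for p in laugh_test) + \
              sum(not is_laughter(p) for p in speech_test)
    print("held-out accuracy:", correct / (len(laugh_test) + len(speech_test)))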

Speech Spectrogram

For use in video editing, we need automatic segmentation and classification of a continuous audio file. To do this we go through the file in Matlab, testing consecutive two-second windows and sliding the window by half a second on each trial. The window size was chosen from the average length of the training examples; the half-second hop was chosen arbitrarily but turned out to work reasonably well. This method of segmentation and classification did not perform as well as the first analysis, resulting in a 35% false-positive rate. The segments incorrectly labeled as laughter were mostly sounds that fall outside the model vocabulary: there were very few cases of speech being labeled as laughter, but coughs, loud background noises such as cars and trains, and other unknown sounds were misclassified. This problem could be addressed by filtering the audio file before processing, training additional models for the unknown sounds, or defining a criterion for labeling a segment as "out of vocabulary".
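
A sketch of that scan is shown below. The window and hop handling is an assumption about the original Matlab code, and the optional margin is one possible way to implement an "out of vocabulary" rejection criterion rather than the method actually used.

    WINDOW_S, HOP_S = 2.0, 0.5   # two-second window, half-second hop

    def detect_laughter(path, margin=0.0):
        """Return (start, end) times, in seconds, of windows classified as laughter."""
        x, fs = load_clip(path)                  # hypothetical helper from above
        win, hop = int(WINDOW_S * fs), int(HOP_S * fs)
        hits = []
        for start in range(0, len(x) - win + 1, hop):
            feats = spectral_features(x[start:start + win], fs)
            gap = laughter_hmm.score(feats) - speech_hmm.score(feats)
            # margin=0 is the plain two-way decision; a larger margin rejects
            # ambiguous windows instead of forcing them into one of the classes.
            if gap > margin:
                hits.append((start / fs, (start + win) / fs))
        return hits

    laughter_times = detect_laughter("continuous_take.wav")   # hypothetical filename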

Video Editing

The video editing application was written in Isis. It displays the whole video and allows the user to move through it with a scrollbar. The laughter-detection data is displayed directly under the video sequence, visually showing the user where the points of interest may be.
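
The application itself is written in Isis; purely as an illustration of the index-track idea, the sketch below draws the detected laughter intervals under a time axis with matplotlib, using the interval list returned by the hypothetical detect_laughter above.

    import matplotlib.pyplot as plt

    def plot_index_track(laughter_times, duration_s):
        """Draw laughter detections as shaded spans along the video timeline."""
        fig, ax = plt.subplots(figsize=(10, 1.5))
        for start, end in laughter_times:
            ax.axvspan(start, end, color="orange", alpha=0.6)
        ax.set_xlim(0, duration_s)
        ax.set_yticks([])
        ax.set_xlabel("time (s)")
        ax.set_title("laughter detections (candidate points of interest)")
        plt.tight_layout()
        plt.show()

    plot_index_track(laughter_times, duration_s=600)   # e.g. a ten-minute take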

Screenshot of the video application

Future work will incorporate other forms of affective feedback. In addition to the speech data, we collected the cameraman's skin conductivity and video of the cameraman's face for facial-expression analysis. These two channels will be used alongside laughter detection to further analyze the cameraman's interest and arousal, with the goal of an accurate automated editing system.

Paper

Detailed information is available in this paper (pdf), which was presented at CHI.
