
bachelor’s thesis

Audio-Visual Speech Activity Detector

Hana Šarbortová

January 2015

Ing. Jan Čech, PhD.

Czech Technical University in Prague

Faculty of Electrical Engineering, Department of Cybernetics


Czech Technical University in Prague, Faculty of Electrical Engineering

Department of Control Engineering

BACHELOR'S THESIS ASSIGNMENT

Student: Bc. Hana Šarbortová

Study programme: Cybernetics and Robotics    Branch of study: Systems and Control

Thesis title: Audio-Visual Speech Detector (Audio-vizuální detektor řeči)

Guidelines: A speech (or lip) activity detector [1] is an algorithm that automatically identifies whether a person in a video is speaking at a given time. This task is important in the audio-visual diarisation problem [2,3] and in audio-visual cross-modal identity recognition and learning. There are typical situations when multiple people are in the camera's field of view and an audio signal is perceived; the objective is speaker identification. The speech activity detector can be visual only (based on observing the lip motion in the image), or audio-visual (which exploits the audio-visual synchrony between these modalities [4]). In the thesis: (1) Review the state of the art in lip activity detection. (2) Modify an existing well-performing algorithm (or design a new or combined one) for speech activity detection. (3) Collect a dataset of training examples. (4) Evaluate the performance of the proposed algorithm. Code for precise localization of facial landmarks [5], [6] will be provided.

References:

[1] K. C. van Bree. Design and realisation of an audiovisual speech activity detector. Technical Report PR-TN 2006/00169, Philips Research Europe, 2006.
[2] Felicien Vallet, Slim Essid, and Jean Carrive. A Multimodal Approach to Speaker Diarization on TV Talk-Shows. IEEE Trans. on Multimedia, 15(3), 2013.
[3] Athanasios Noulas, Gwenn Englebienne, and Ben J. A. Kroese. Multimodal Speaker Diarization. IEEE Trans. on PAMI, 34(1), 2012.
[4] Einat Kidron, Yoav Y. Schechner, and Michael Elad. Pixels that Sound. In CVPR, 2005.
[5] Jan Cech, Vojtech Franc, Jiri Matas. A 3D Approach to Facial Landmarks: Detection, Refinement, and Tracking. In Proc. ICPR, 2014.
[6] M. Uricar, V. Franc and V. Hlavac. Detector of Facial Landmarks Learned by the Structured Output SVM. In VISAPP, 2012. http://cmp.felk.cvut.cz/~uricamic/flandmark/

Supervisor: Ing. Jan Čech, Ph.D.

Assignment valid until: the summer semester of 2015/2016

L.S.

Prof. Ing. Michael Šebek, DrSc., Head of Department

prof. Ing. Pavel Ripka, CSc., Dean

Prague, 23 December 2014


Acknowledgement

I would like to express the greatest appreciation to my supervisor Jan Čech for his patience, his magnificent guidance throughout this thesis, and the amount of time he spent on consultations.

Declaration

I declare that I worked out the presented thesis independently and that I quoted all used sources of information in accordance with the Methodical instructions about ethical principles for writing an academic thesis.

Date ........................................          Signature ........................................


Abstract

The aim of this work is to create an audio-visual speech detector, i.e. an algorithm that automatically identifies whether a person in a video recording is speaking at a given moment. This task is important for so-called audio-visual diarisation and for audio-visual identity recognition and learning. The proposed speech detector has two phases. In the first phase, speech is detected based on visual information only. In the second phase, the already detected parts of the video are tested for synchrony with the audio signal. Geometric video features are extracted from facial landmarks localized on the detected face. Mel-frequency cepstral coefficients are used as audio features. The synchrony of the audio and video features is tested by canonical correlation analysis with fixed projection coefficients. The algorithm is able to detect speech visually and to reliably verify the synchrony on a sufficiently long part of the video recording, at least 8 seconds according to the experiments. (Translated from the Czech original.)

Keywords

Speech detector, Audio-visual synchrony, MFCC, CCA


Abstract

The aim of this thesis is to create an audio-visual speech activity detector, i.e. an algorithm which automatically identifies whether a person in a video is speaking at a given time. This task is important in the audio-visual diarisation problem and in audio-visual cross-modal identity recognition and learning. A two-phase speech activity detector is proposed. First, speech activity is detected in a video sequence based on the visual information only. Second, the parts detected in the first phase are tested for synchrony with the audio signal. Geometrical video features are extracted from facial landmarks that are localized in a region found by a face detector. Mel-Frequency Cepstral Coefficients are used as audio features. The synchrony of the audio and video features is tested by Canonical Correlation Analysis with fixed projection coefficients. The algorithm is able to detect lip activity and to reliably confirm the synchrony on sufficiently long audio-video sequences, at least 8 seconds according to the experiments.

Keywords

Speech Activity Detection, Audio-Video Synchrony, MFCC, CCA


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem statement
  1.3 Thesis structure

2 Related work
  2.1 Speech activity detection
    2.1.1 Audio speech activity detection
    2.1.2 Video speech activity detection
  2.2 Audio-video synchrony
    2.2.1 Audio source localization
    2.2.2 Talking faces speech synchrony
  2.3 Speaker identification
    2.3.1 Diarization and video indexing

3 Methods
  3.1 Audio features
  3.2 Video features
    3.2.1 Face detection and facial landmarks localization
    3.2.2 Feature extraction and normalization
  3.3 Canonical correlation analysis

4 Experiments
  4.1 Video lip activity detection
  4.2 Introspection of synchrony detection by the CCA
    4.2.1 Coefficient estimation
    4.2.2 Synchrony analysis
  4.3 Applications of the CCA
    4.3.1 Detection of synchronized video segments
    4.3.2 Estimation of audio delay

5 Conclusion and future work

Bibliography

Appendices
  A Contents of the enclosed DVD


Abbreviations

AV    Audio-Video
MFCC  Mel-Frequency Cepstral Coefficients
SAD   Speech Activity Detection
PLP   Perceptual Linear Prediction
HMM   Hidden Markov Model
CRF   Conditional Random Field
GMM   Gaussian Mixture Model
PCA   Principal Component Analysis
CoIA  Coinertia Analysis
DBN   Dynamic Bayesian Network
fHMM  factorial Hidden Markov Model
DFT   Discrete Fourier Transform
FFT   Fast Fourier Transform
DCT   Discrete Cosine Transform
LBP   Local Binary Patterns
PR    Precision-Recall
CCA   Canonical Correlation Analysis


1 Introduction

An audio-visual speech activity detector is an algorithm which identifies whether a person visible in a video is speaking at a given time. There are typical situations that occur with respect to the captured scene and sound. First, only a single person is visible in the video; the task is then to determine whether that person is the speaker. Second, several people are captured in the video and the respective audio signal contains the speech of one of them; in that case we should determine who the speaker is. Third, multiple persons are visible but we do not know whether any of them is the speaker.

This task can be solved in two different ways, video-only or audio-video. We can say that a person is speaking when his or her lips are moving. It can be assumed that a person is not likely to perform any lip motion when not speaking, or that the lip motion is then very small and slow (i.e. showing only emotional expressions). However, this can be a false assumption when dealing with multimedia data. Typically in TV news, an anchor is speaking while a scene containing a different person is being shown on the screen. This person might have been speaking in the original scene, but the original speech has been replaced by the anchor's comments. Since a person should be classified as a speaker only if the same person is both seen and heard, it is necessary to check whether the lip motion is synchronous with the audio signal.

1.1 Motivation

In recent years, the use of media has increased rapidly and the virtual distances between places have become shorter. There are vast numbers of TV shows, series, films and documentaries produced every year, and it is common to meet other people through video calls and conferences. Sometimes we just want to sort our media in a meaningful manner; sometimes selecting a part of a whole media file might be useful. The ability to automatically identify a speaker can become a useful tool in several different applications.

First of all, it can enhance the reliability of voice detection algorithms in noisy environments. It can be assumed that a voice signal is likely to be detected if the lips of a person visible in the video are moving. Voice detection is a necessary preliminary step for speech recognition. This might be useful even when making a video call: in that case, we want to transmit only the signal relevant to the conversation in the highest quality, not the background noise. We can assume that the relevant signal to be transmitted is produced when a person visible on the camera is speaking. Therefore, even a distant voice signal of a person not visible in the video should be treated as noise.

There are several types of audio-video recordings, such as video conferences and broadcast news, where having a way to extract the contribution of each speaker might come in handy. Diarization and video indexing are exactly the algorithms needed for easy orientation within a vast number of files and video hours. A diarization algorithm takes a video as an input and gives a timeline of who was speaking when as an output. It has to learn the speaker models from the video automatically.

1


This involves a speaker recognition method based on both the audio and the visual identity, and a further assignment of these two modalities to each other. A preliminary step for this is a reliable speaker detector that can indicate monologue sections within a video sequence.

Finally, once we know how to detect synchrony automatically, it might be useful to make shifted audio-visual signals synchronous again, assuming one signal has been delayed with respect to the other. This can be particularly useful when aligning media that has been incorrectly received, corrupted by conversion, or simply captured asynchronously.

1.2 Problem statement

The goal of this thesis is to propose and test an audio-visual speech detector, an algorithm that indicates the video subsequences in which a speaker is visible by investigating both the audio and video signals jointly. A lip activity detector is employed first, as lip motion is a necessary condition for speaking. In addition, the audio and video signals are tested for synchrony. Only the subsequences selected by the video lip activity detector and classified as synchronous are considered speech subsequences.

1.3 Thesis structure

The thesis is structured as follows: Section 2 provides an overview of the algorithms that have been published in recent years. The cited publications briefly introduce the problems of speech activity detection based on both audio and visual signals, audio-visual synchrony, and speaker identification. Section 3 describes the proposed methods; the feature extraction from both the audio and video signals, as well as their synchrony measure, is explained. The functionality tests of the proposed methods and their results are presented in Section 4. The results are discussed and conclusions are drawn in Section 5.


2 Related work

Humans combine audio and visual information when deciding what has been spoken, especially in noisy environments; human speech perception is bimodal in nature. The benefit of the visual modality to speech intelligibility in noise was quantified as far back as [1]. Cross-modal fusion of audio and visual stimuli in perceiving speech has been demonstrated by the McGurk effect [2]. For example, when the spoken sound "ga" is superimposed on the video of a person uttering "ba", most people perceive the speaker as uttering the sound "da". This illusory effect clearly demonstrates the importance of optical information in the process of speech perception. There are three key reasons why vision benefits human speech perception [3]: it helps speaker (audio source) localization, it contains segmental speech information that supplements the audio, and it provides complementary information about the place of articulation.

Voice is produced by the vibration of the vocal cords and the configuration of the vocal tract, which is composed of articulatory organs including the nasal cavity, tongue and lips. Since some of these articulators are visible, there is an inherent relationship between acoustic and visible speech. The basic unit that describes how speech conveys linguistic information is the phoneme. Approximately 42 phonemes in American English [4] and 36 in Czech [5] are generated by specific positions or movements of the vocal tract articulators, but only some of them are visible. Such visible units are called visemes. The number of visually distinguishable units, visemes, is much smaller than the number of audible units, phonemes; therefore a viseme often corresponds to a set of phonemes. There is no universal agreement about the exact partitioning of phonemes into visemes, but some visemes are well defined, such as the bilabial viseme consisting of the phoneme set "p", "b", "m". A typical clustering into 13 visemes introduced in [6] is often used to conduct visual speech modeling experiments.

Speech consists of consonants and vowels. Vowels are voiced and can be considered a quasi-periodic source of excitation, while consonants are unvoiced and are similar to random noise [7]. Grouping consonants and vowels forms different phonemes and visemes. Some visually distinctive consonants are acoustically easily confusable and vice versa [3]. Information about the place of articulation can help disambiguate, for example, the pairs "p"-"k", "b"-"d" and "m"-"n". On the other hand, the consonants "d" and "n" sound different but look the same, as they are members of the same viseme class. There are only a few consonants that require the lips to close firmly, specifically "b", "p" and "m", or the lower lip to touch the upper teeth, "v" and "f". All vowels require the mouth to be open when uttered.

2.1 Speech activity detection

Speech activity detection (SAD) is an important first step of speech processing algorithms, such as speech recognition and speaker recognition. It is usually understood as the process of identifying all segments containing speech in an audio signal.


Several types of speech activity detectors can be distinguished according to the data they work with: audio, video and audio-video.

2.1.1 Audio speech activity detection

There are many audio-domain SAD algorithms available. Audio SAD can be classified into either time-domain or frequency-domain approaches. The implementation of time-domain algorithms is computationally simple, but a better quality of speech detection is usually obtained with the frequency-domain algorithms. SAD includes both unsupervised systems that threshold some energy or voicing feature [8] and supervised systems that train a classifier with features such as Mel-frequency cepstral coefficients (MFCCs) or perceptual linear prediction coefficients (PLPs). Classifiers such as support vector machines [9], Gaussian mixture models [10], and multi-layer perceptrons [10] have been used successfully. Algorithms for structured prediction have also found success, including both hidden Markov models (HMMs) [11] and conditional random fields (CRFs) [12].

However, the reliability of the above audio-domain SAD algorithms deteriorates significantly in the presence of highly non-stationary noise. The vision associated with the concurrent audio source, as already mentioned, contains complementary information about the sound that is not affected by acoustic noise, and therefore has the potential to improve audio-domain processing.

2.1.2 Video speech activity detection

The visual aspects of speech detection can be categorized into geometric and non-geometric analysis. Geometric analysis focuses on the shape of the mouth region and lip features. Non-geometric analysis makes use of transformations to other domains to represent certain aspects of the region of interest [13].

Exploiting the bimodal coherence of speech, a family of visual SAD algorithms has been developed, exhibiting advantages in adverse environments. The algorithm in [14] uses a single Gaussian kernel to model the silence periods and Gaussian mixture models (GMMs) for speech, with principal component analysis (PCA) for the visual feature representation. A dynamic lip parameter is defined in [15] to indicate lip motion, which is low-pass filtered and thresholded to classify an audio frame. HMMs with post-filtering are applied in [16] to model the dynamic changes of the motion field in the mouth region for silence detection. This HMM is, however, trained only on the visual information from the silence periods, i.e. without using that from the active periods.

2.2 Audio-video synchrony

In the previous sections, it has been shown that it is possible to detect and understand information from an audio or video signal, and that their cross-modal analysis, assuming the signals are synchronous, can help to extract information in an adverse environment. But are they really synchronous? Is the talking face in the video really the source of the audio signal?


In order to investigate synchrony, the source of the audio signal has to be located in the video first. It is necessary to pinpoint only the pixels associated with audio sources and distinguish them from other moving sources. This problem can be general (anything can be a sound source), limited to a specific class, or adapted to talking faces only.

2.2.1 Audio source localization

The problem of source localization that is not restricted to a specific class of objects is investigated in [17]. The authors presented a robust approach for audio-visual dynamic localization based on a single microphone. The solution had to overcome the fact that audio and visual data are inherently difficult to compare because of the huge dimensionality gap between these modalities. The proposed algorithm is based on canonical correlation analysis (CCA), with the inherent ill-posedness removed by exploiting the typical spatial sparsity of audio-visual events.

An example of class-specific detection is speaker localization in a large space where the speaker's face cannot be seen. In this case, the decision has to be based only on body motion or the sound source location. An approach to audio-video speaker localization in a large unconstrained environment has been proposed in [18]. This method automatically detects the dominant speaker in a conversation based on the idea that gesturing means speaking. The classification is based on CCA of MFCC features and the optical flow associated with regions of moving pixels. In order to increase precision, a triangulation of the information coming from the microphones is computed to estimate the position of the actual audio source.

Talking faces are usually not detected by investigating audio-video synchrony. A face detector is used instead and the synchrony is checked for each individual detected face.

2.2.2 Talking faces speech synchrony

Asynchrony between a talking face and an audio signal can occur in several cases. First, the audio or video signal can be delayed. Second, in a narrated scene, someone who is speaking is shown while a narrator is heard. Third, a non-talking face or a picture is visible but the audio signal contains speech. Each of these cases is likely to happen in different types of videos; therefore, methods dealing with asynchrony detection are usually problem-specific. Work [19] presents an empirical study that reviews definitions of audiovisual synchrony and examines their empirical behavior.

Audio-video latency is a common problem when broadcasting or streaming media; this delay is sometimes corrected by the end-user device or application. All the other cases are synchronous from the technical point of view, but the speaker is not visible in the video. Speech and narrated scenes usually have to be distinguished in news, sports or TV shows. Only small time windows can be considered for synchrony detection, as the scene can be very dynamic. This problem then naturally extends to video indexing and diarization, discussed in Section 2.3.1.

Liveness is a test that ensures that biometric cues are acquired from a live person who is actually present at the time of capture [20]. Liveness detection is used to prevent impostor attacks in biometric systems based on speaker identification and face recognition.


The talking-face modality provides richer opportunities for verification than any ordinary multimodal fusion does [21]. The signal contains not only voice and image but also a third source of information, the simultaneous dynamics of these features. Therefore, the liveness check is performed by measuring the degree of synchrony between the lips and the voice extracted from a video sequence. The systems are usually based on CCA [21, 22, 23] and coinertia analysis (CoIA) [21, 22, 23, 20]. HMMs were also used in [20].

2.3 Speaker identification

Speaker identification can be a very complex problem and its solution has to be almost exclusively tailored to a specific application. Usually, the existence of only one speaker at a time is assumed; however, some papers deal with multiple active speakers [24]. Sometimes we know what is being said, i.e. a video transcript is available. A high precision of automatic naming of characters in TV videos [25] was achieved by combining multiple sources of information, both visual and textual. Sometimes we do not even know whether there is a speaker in the video at all.

Among the different methods that perform speaker detection, only a few perform a fusion of both the audio and video modalities. Some of them just select the active face among all detected faces based on the distance between the peak of the audio cross-correlation and the positions of the detected faces in the azimuth domain [26, 27]. A few of the existing approaches perform the fusion directly at the feature level, which relies on explicit or implicit use of mutual information [28, 27, 29]. Most of them address the detection of the active speaker among a few face candidates, where it is assumed that all the speakers' faces can be successfully detected by the video modality.

A good example of an application where speaker detection has to be performed in real time is a video conference (distributed meeting). A boosting-based multimodal speaker detection algorithm is proposed in [30]. The authors compute audio features from the output of sound source localization, place them in the same pool as the video features, and let the logistic AdaBoost algorithm select the best features. This speaker detector has been implemented in Microsoft RoundTable. A detection based on mouth motion only is deployed in [31].

2.3.1 Diarization and video indexing

Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. More formally, this requires the unsupervised identification of each speaker within an audio stream and of the intervals during which each speaker is active [32]. The application domains, from broadcast news to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information, or overlapping speech. Clear examples of applications for speaker diarization algorithms include speech and speaker indexing, document content structuring, speaker recognition (in the presence of multiple or competing speakers), support for speech-to-text transcription, speech translation, etc.


Speaker diarization of AV recordings usually deals with two major domains, broadcast news (BN) and conference meetings. The algorithms have to be adapted according to the differences in the nature of the data. BN usually has a better signal-to-noise ratio, but the audio often contains music and applause; also, people appear less frequently and the actions are less spontaneous. On the other hand, meetings are more dynamic, with possibly overlapping speech. A multimodal approach to speaker diarization on TV talk-shows is proposed in [33]. Work [34] proposed a Dynamic Bayesian Network (DBN) framework that is an extension of the factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audio-visual space.

Indexing tries to structure TV content by person, allowing a user to navigate through the sequences of the same person [35]. This structuring has to be done without any predefined dictionary of people. Most methods propose to index people independently by the audio and visual information, and then associate the indexes to obtain the talking-face index [25, 36, 37, 38]. This approach combines the clustering errors made in each modality, which [35] tries to overcome.


3 Methods

The implementation of an audio-visual speech detector consists of several steps that have to deal with both the audio and the video stream. As not all pixels in the video frames are important for speech detection, nor are all audio frequencies, the dimensionality can be reduced by extracting only the important features. Features are extracted from the audio and video streams separately and then analyzed together in order to check their synchrony.

The number of video frames is significantly smaller than the number of audio samples. However, we want to analyze frames and samples that correspond to each other. In this work, a subsequence of samples corresponding to a frame is considered, as shown in Figure 1. Video features are extracted from each individual frame and audio features from the corresponding subsequence of audio samples; therefore there is exactly one audio and one video feature vector describing each frame. There is no overlap of audio samples between two consecutive subsequences.
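As a concrete illustration of this correspondence, the following minimal Matlab sketch computes the indices of the audio subsequence belonging to one video frame; the sampling rate and frame rate are assumed example values, not taken from the thesis code.

% Minimal sketch: audio subsequence belonging to video frame N (assumed fs and fps).
fs  = 44100;                   % audio samples per second (assumed value)
fps = 25;                      % video frames per second
spf = round(fs / fps);         % audio samples per video frame
N   = 10;                      % example frame index (1-based)
idx = (N-1)*spf + 1 : N*spf;   % non-overlapping subsequence of samples for frame N
% audioFrame = audioSignal(idx);  % audio features are computed on this chunk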

All extracted features need to be as independent of the speaker as possible. The used methods have to handle both female and male voices, and the video features have to be invariant to the face size and head pose.

Figure 1 Audio subsequence corresponding to a frame N

3.1 Audio features

An acoustic speech signal contains a variety of information: it carries the message content as well as information from which the speaker can be identified. The most commonly used audio features for both speech and speaker recognition are Mel-Frequency Cepstral Coefficients (MFCCs), so they were chosen as the audio features for this work.


Figure 2 MFCC computation steps along with the results for each step of a single processed audio subsequence

The computation of the MFCC features consists of six steps (as shown in Figure 2): framing, windowing, Fast Fourier Transform (FFT), Mel-frequency wrapping, logarithm, and Discrete Cosine Transform (DCT).

The first step of the MFCC computation is framing. A frame can be seen as the result of the speech waveform multiplied by a rectangular pulse whose width is equal to the frame length.


As we want to compute one audio feature vector for each video frame, the frame length is set equal to the time difference between two consecutive video frames (i.e. 40 ms for a frame rate of 25 fps). Windowing by a rectangular window would introduce significant high-frequency noise at the beginning and end points of each frame, because of the sudden changes from zero to signal and from signal back to zero. To reduce this edge effect, the Hamming window is used instead. The coefficients of a Hamming window w are computed from the following equation,

w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1, \qquad (1)

where N is the window length.

The next step is the conversion from the time domain to the frequency domain by computing the Fast Fourier Transform (FFT). The FFT is a fast implementation of the Discrete Fourier Transform (DFT), shown in equation (2). The DFT of a vector x of length N, the vector of the audio signal values in a particular frame multiplied by the windowing function, is another vector X of length N,

X(k) = \sum_{n=0}^{N-1} x(n)\, w(n)\, \omega_N^{kn}, \qquad k = 0, 1, \ldots, N-1, \qquad (2)

where

\omega_N^{kn} = e^{-j \frac{2\pi}{N} k n}. \qquad (3)

The procedure continues with Mel-frequency warping. A Mel is a unit based on the humanly perceived pitch difference between two frequencies. The Mel scale has approximately linear frequency spacing below 1000 Hz and logarithmic spacing above [39]. A value twice as high in Mels indicates a pitch twice as high (a pitch difference of one octave). The reference point between this scale and the usual frequency measurement is defined by assigning a perceptual pitch of 1000 Mel to a 1000 Hz tone, 40 dB above the listener's threshold [40]. The conversion formula is expressed by the following equation,

\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \qquad (4)
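To make the warping concrete, the following short sketch places the centers of the triangular filters described in the next paragraph equally spaced on the Mel scale; the 32-filter count follows the text, while the frequency range is an assumed example.

% Minimal sketch: centers of 32 triangular filters equally spaced on the Mel scale.
hz2mel = @(f) 2595*log10(1 + f/700);          % equation (4)
mel2hz = @(m) 700*(10.^(m/2595) - 1);         % inverse mapping
fLow = 0;  fHigh = 8000;  nFilt = 32;         % assumed frequency range [Hz]
melPts    = linspace(hz2mel(fLow), hz2mel(fHigh), nFilt + 2);  % filter edge points
hzCenters = mel2hz(melPts(2:end-1));          % filter centers mapped back to Hz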

After the FFT block, the spectrum of each frame is filtered by a set of filters. The filter bank consists of 32 triangular band-pass filters whose centers are equally spaced in the Mel space, see Figure 3. This gives a vector c of 32 discrete values, of which only the first thirteen, including the zeroth-order coefficient, are further considered. Their logarithm is then transformed by the Discrete Cosine Transform (DCT), as shown in the following equation,

Y(k) = a(k) \sum_{m=1}^{M} c(m) \cos\left(\frac{\pi}{2M}(2m-1)(k-1)\right), \qquad k = 1, 2, \ldots, M, \qquad (5)


Figure 3 Triangular-shaped band-pass filters in the frequency space

where

a(k) = \begin{cases} \dfrac{1}{\sqrt{M}}, & k = 1, \\[4pt] \sqrt{\dfrac{2}{M}}, & 2 \le k \le M. \end{cases} \qquad (6)

M is the number of cepstral coefficients (i.e. the length of the vector c). The resulting MFCC features, along with the original speech signal and its spectrogram, can be seen in Figure 4.

The original MFCCs are used along with their delta (first-order derivative) and delta-delta (second-order derivative) values, as they describe the change between frames. In total, 39 audio features are extracted from the subsequence of audio samples corresponding to each video frame. In all experiments, the Voicebox speech processing toolbox for Matlab [41] is used to extract the MFCCs.
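The following minimal Matlab sketch summarizes the six steps for one 40 ms frame x (a column vector). It assumes a precomputed triangular filter bank fb of size 32 x (floor(N/2)+1), for instance built from the centers above, and follows the common ordering (log of all filter outputs, then DCT, keeping 13 coefficients); it is only an illustration, not the Voicebox implementation actually used in the thesis.

% Minimal sketch of the MFCC steps for one frame x (column vector); fb is an assumed
% 32 x (floor(N/2)+1) triangular filter bank.
N  = length(x);
w  = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));    % 2) Hamming window, equation (1)
X  = fft(x .* w);                             % 3) FFT of the windowed frame
P  = abs(X(1:floor(N/2)+1));                  % one-sided magnitude spectrum
E  = fb * P;                                  % 4) Mel-frequency wrapping: 32 filter outputs
L  = numel(E);  M = 13;                       % keep 13 coefficients incl. the zeroth one
[kk, mm] = ndgrid(0:M-1, 1:L);
D  = cos(pi/(2*L) * (2*mm - 1) .* kk);        % DCT basis as in equation (5)
a  = [1/sqrt(L); sqrt(2/L)*ones(M-1,1)];      % scaling factors as in equation (6)
mfcc = a .* (D * log(E));                     % 5) logarithm and 6) DCT
% Delta and delta-delta features are obtained by differencing mfcc across frames.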


Figure 4 Speech signal and its corresponding spectrogram and MFCC features, respectively

3.2 Video features

Video features describe the lip motion of a potential speaker. In this work, geometrical lip features are used; the features therefore describe the shape of the mouth rather than the raw intensity values in the mouth region. The extraction of such features consists of several steps, namely face detection, facial landmark localization and geometrical feature extraction. Also, the features have to be normalized so that they are invariant to the face size and head pose. However, we suppose that all analyzed speakers are facing the camera; no profile views are considered, for simplicity.

3.2.1 Face detection and facial landmarks localization

The first step of extracting the video features is face detection. In this work, a reliable commercial face detector, Eyedea [43], based on WaldBoost [42], is used. This detector provides a bounding box for each face that appears in a frame, together with the detection confidence.

As the next step, facial landmarks are estimated on each of the detected faces. The Chehra detector [44] is used for the facial landmark localization in all experiments. The method estimates the landmark locations by finding the parameters of a 3D shape model by means of regression. A facial model is initialized and then the incremental update of the cascade of linear regressions is done by propagating the results from one level to the next. As learning the cascade of regressions is by nature a Monte-Carlo procedure [45], every level is trained independently, using only the statistics of the previous level.


Figure 5 A result of the face detection (the red box) and the facial landmark localization (the green stars)

Finally the landmarks are found by projecting the model to the image.

The detection and facial landmark localization are done in each frame independently; no tracking is used. This is mainly due to the code availability, as the authors of Chehra provided Matlab code only for the landmark localization in a single image, even though the tracking procedure is described in [44].

The result of the face detection is a bounding box, and the facial landmark detector provides the positions of 49 facial landmarks for each face in a video frame. An example of the detection and localization result is shown in Figure 5.

3.2.2 Feature extraction and normalization

The positions of the facial landmarks cannot be used directly as video features, since their values depend on the face position and size in a particular frame. Neither can coordinates within the face bounding box be used, as the description would not be invariant to the head pose. Also, some of the landmarks are redundant for the lip activity description.

First, the positions of the facial landmarks are normalized. The normalization is done by a homography mapping from the estimated facial landmark coordinates x_e in a frame to the corresponding initial facial model coordinates x_i used by the Chehra detector. The mapping is given by the equation x_i = H x_e [46], where the points x_i and x_e are in homogeneous coordinates and H is a 3 × 3 matrix.


The matrix H has to be estimated from the facial points, preferably those that are not in the mouth region, as the lip landmarks are the main subject of the normalization.

The equation x_i = H x_e may be expressed in terms of the vector cross product as x_i × H x_e = 0. If the j-th row of the matrix H is denoted by h^{j\top}, then we may write

H x_e = \begin{pmatrix} h^{1\top} x_e \\ h^{2\top} x_e \\ h^{3\top} x_e \end{pmatrix}. \qquad (7)

Writing x_i = (x_i^1, x_i^2, x_i^3)^\top, the cross product may be given explicitly as

x_i \times H x_e = \begin{pmatrix} x_i^2\, h^{3\top} x_e - x_i^3\, h^{2\top} x_e \\ x_i^3\, h^{1\top} x_e - x_i^1\, h^{3\top} x_e \\ x_i^1\, h^{2\top} x_e - x_i^2\, h^{1\top} x_e \end{pmatrix}. \qquad (8)

Since h^{j\top} x_e = x_e^\top h^j for j = 1, 2, 3, this gives a set of three equations in the entries of H, which may be written in the form

\begin{bmatrix} \mathbf{0}^\top & -x_i^3 x_e^\top & x_i^2 x_e^\top \\ x_i^3 x_e^\top & \mathbf{0}^\top & -x_i^1 x_e^\top \\ -x_i^2 x_e^\top & x_i^1 x_e^\top & \mathbf{0}^\top \end{bmatrix} \begin{pmatrix} h^1 \\ h^2 \\ h^3 \end{pmatrix} = 0. \qquad (9)

These equations have the form A_i h = 0, where A_i is a 3 × 9 matrix and h is a 9-vector made up of the entries of the matrix H. As only two of the equations in (9) are linearly independent, it is usual to omit the third equation [46]. Then the set of equations becomes

\begin{bmatrix} \mathbf{0}^\top & -x_i^3 x_e^\top & x_i^2 x_e^\top \\ x_i^3 x_e^\top & \mathbf{0}^\top & -x_i^1 x_e^\top \end{bmatrix} \begin{pmatrix} h^1 \\ h^2 \\ h^3 \end{pmatrix} = 0. \qquad (10)

Each point correspondence gives rise to two independent equations in the entries of H. Given a set of four such point correspondences, we obtain a set of equations Ah = 0, where A is the matrix of equation coefficients built from the rows A_i contributed by each correspondence, and h is the vector of unknown entries of H. We seek a non-zero solution h. The matrix A is an 8 × 9 matrix of rank 8, and thus has a 1-dimensional null-space which provides a solution for h.
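A minimal Matlab sketch of this estimation step is given below; it assumes xe and xi are 3 x 4 matrices whose columns are the homogeneous point correspondences, and it is only an illustration of the standard DLT procedure, not the thesis implementation.

% Minimal sketch: estimate H from four point correspondences (columns of xe, xi).
A = zeros(8, 9);
for c = 1:4
    X  = xe(:, c)';                                % 1x3 row vector x_e'
    x1 = xi(1, c);  x2 = xi(2, c);  x3 = xi(3, c); % components of x_i
    A(2*c-1, :) = [ zeros(1,3)   -x3*X    x2*X ];  % first row of equation (10)
    A(2*c,   :) = [ x3*X    zeros(1,3)   -x1*X ];  % second row of equation (10)
end
[~, ~, V] = svd(A);                 % the null space of A gives the solution h
H = reshape(V(:, end), 3, 3)';      % rows of H are h1, h2, h3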

The coefficients of the homography matrix H are computed from the four point correspondences shown in Figure 6. These points were selected for the following reasons: there is no or minimal movement of these points when speaking, no selected point is in the mouth region, and no three points lie on a single line. This gives a suitable normalization of the facial landmarks, especially of the mouth region.

As not all facial landmarks are needed for speech detection, only those from the mouth region are further considered. In fact, as the lips tend to have the shape of an ellipse when uttering vowels, only two parameters describing the minor and major axes are needed. Therefore, only the distances between the lips in the horizontal and vertical directions are measured. These two distances are computed from the normalized facial landmarks as shown in Figure 7.


Figure 6 The four points used to estimate the homography mapping (red circles)

The video descriptor for each face in a frame is a vector of these two distances and their first and second derivatives, six features in total.
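A minimal sketch of this per-frame descriptor is shown below; the landmark matrix lm and the lip landmark indices iTop, iBot, iLeft, iRight are hypothetical names, and H is a homography estimated as in the sketch above.

% Minimal sketch: normalized lip distances for one frame; lm is a 3 x 49 matrix of
% homogeneous landmark coordinates, iTop/iBot/iLeft/iRight are assumed lip indices.
p  = H * lm;
p  = p(1:2, :) ./ p([3 3], :);                 % map landmarks by H and dehomogenize
dV = norm(p(:, iTop)  - p(:, iBot));           % vertical lip distance
dH = norm(p(:, iLeft) - p(:, iRight));         % horizontal lip distance
% Stacking [dV dH] over frames and appending their first and second differences
% (e.g. via diff) yields the 6-dimensional per-frame video descriptor.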

Figure 7 The two distances used as video features (green line segments)


3.3 Canonical correlation analysis

An important tool for understanding the relationship between audio and video data is canonical correlation analysis (CCA). CCA is a statistical method for measuring the relationship between two sets of multidimensional data [47]. We can use it to find the linear combinations of the audio variables and the video variables whose correlations are mutually maximized. In other words, it finds the best direction in which to combine all the audio and image data, projecting them onto a single axis.

Let v represent the video features and a the vector of audio features corresponding to a single frame, as defined in Sections 3.2 and 3.1, respectively. As the numbers of audio and video features differ, the lengths of the vectors v and a also differ. Both signals are considered random vectors due to their temporal variation, and both vectors v and a are assumed to have zero mean. Each of these vectors is projected onto a one-dimensional subspace by the coefficient vectors w_v and w_a, respectively. The result of these projections is a pair of canonical variables v^\top w_v and a^\top w_a. The correlation coefficient of these two variables defines the canonical correlation between v and a [48],

\rho = \frac{E[w_v^\top v\, a^\top w_a]}{\sqrt{E[w_v^\top v\, v^\top w_v]\; E[w_a^\top a\, a^\top w_a]}} = \frac{w_v^\top C_{va} w_a}{\sqrt{w_v^\top C_{vv} w_v \; w_a^\top C_{aa} w_a}}, \qquad (11)

where E denotes the expectation and C a covariance matrix. Specifically, C_vv and C_aa are the covariance matrices of v and a, respectively, and C_va is the cross-covariance matrix of the two vectors.

Let N_v be the dimension of the visual features, N_a the dimension of the audio features, and N_F the number of frames. Define a matrix V \in \mathbb{R}^{N_F \times N_v}, where row t contains the video feature vector v^\top(t). In the same way, define a matrix A \in \mathbb{R}^{N_F \times N_a}, where row t contains the coefficients of the audio signal a^\top(t). The empirical canonical correlation from equation (11) becomes

\rho = \frac{w_v^\top (V^\top A)\, w_a}{\sqrt{w_v^\top (V^\top V)\, w_v \; w_a^\top (A^\top A)\, w_a}}. \qquad (12)

The CCA is defined as the maximum of \rho over the coefficients w_a and w_v,

\rho^{*} = \max_{w_v, w_a} \rho. \qquad (13)

This problem is solved by the equivalent eigenvalue problem [48]:

C_{vv}^{-1} C_{va} C_{aa}^{-1} C_{av}\, w_v = \rho^2 w_v, \qquad C_{aa}^{-1} C_{av} C_{vv}^{-1} C_{va}\, w_a = \rho^2 w_a. \qquad (14)

Maximizing the correlation is equivalent to finding the largest eigenvalue and its corresponding eigenvector. It is necessary to solve only one of these equations, as the solution of the other follows easily from the solution of the first one.
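A minimal Matlab sketch of this estimation is given below; V and A are training feature matrices with zero-mean columns, and the direct matrix inversions stand in for the numerically safer decompositions a production implementation would use.

% Minimal sketch: estimate the projection vectors from training matrices V (NF x Nv)
% and A (NF x Na) by solving equation (14).
Cvv = V' * V;   Caa = A' * A;   Cva = V' * A;   Cav = Cva';
[Wv, D]    = eig(Cvv \ (Cva * (Caa \ Cav)));   % video-side eigenvalue problem
[rho2, ix] = max(real(diag(D)));               % largest eigenvalue equals rho^2
wv = real(Wv(:, ix));
wa = Caa \ (Cav * wv);                         % audio-side vector follows from wv
wv = wv / norm(wv);   wa = wa / norm(wa);      % scale of the projections is arbitrary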


The coefficients w_a and w_v are usually estimated for each pair of matrices V and A separately, i.e. the coefficients are different for each investigated audio-video subsequence. This is computationally inefficient, as the eigenvalue problem in equation (14) has to be solved for each subsequence. Also, as the direction of the highest correlation is different every time, the discrimination between synchronous and asynchronous subsequences might not be very accurate. Depending on the length of the subsequence, the coefficients might be estimated such that the correlation is actually higher for asynchronous frames than for synchronous ones. For these reasons, it is preferable to estimate the coefficients on a longer training sequence and keep them fixed for the test subsequences.

In this work, the coefficients w_a and w_v are estimated only once, on a long audio-video sequence, and are then used for computing \rho of each tested subsequence. The idea is to classify the subsequences with a high \rho value as synchronized and vice versa. In order to do so, an optimal threshold on \rho is found.
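With the projections fixed, scoring a test subsequence reduces to evaluating equation (12); a minimal sketch follows, where Vw and Aw are the zero-mean feature matrices of the tested window and tau is an assumed threshold value.

% Minimal sketch: synchrony score of one test window with fixed projections wv, wa.
pv  = Vw * wv;   pa = Aw * wa;                            % canonical variables
rho = (pv' * pa) / sqrt((pv' * pv) * (pa' * pa));         % equation (12)
isSynchronous = rho > tau;                                % classification by thresholding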


4 Experiments

The speaker is detected in two different ways: by considering the visual information only, and by additionally testing the synchrony. First, a video-only lip activity detector, which estimates the speech parts of a video from the visual features only, was tested. Then the properties of the CCA used as a statistical score of synchrony are investigated. Finally, an audio-visual speech detector is proposed as an application of the CCA combined with the video lip activity detector.

4.1 Video lip activity detection

The lip activity detection is based on the simple fact that the lips are moving when a person is speaking. The video features described in Section 3.2 directly describe how open the mouth is at a given time. The idea is to classify a part of a frame sequence as "speaking" only when the lips are closing and opening quickly. If the mouth is open for a long time or the motion is too slow, it is probably just an expression of an emotion, such as a smile or surprise. Figure 8 shows the measured vertical distance between the lips (blue line); the horizontal distance depends on what is uttered, but it does not play a key role in the lip activity detection.

Sliding window filter As we are interested in quick changes of the distance between the lips, it is better to work with the first derivatives, which are shown in Figure 8 (green line). To detect the speaking parts, two sliding window filters and thresholding were employed. The first sliding window filter computes the standard deviation of the first derivatives in a particular window. As thresholding this result usually produces many short "speaking" subsequences rather than fewer longer ones, the result is further filtered by a mean filter. The mean filter reduces the number of short low-value segments; therefore we get fewer subsequences, but of a longer duration. An example of the signal after filtering is shown in Figure 9.
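A minimal sketch of this two-stage filtering is given below; d is the first derivative of the vertical lip distance (a column vector), and the window sizes and threshold are example values only.

% Minimal sketch: sliding std filter, sliding mean filter and thresholding of d.
wStd = 5;  wMean = 15;  tau = 0.8;              % example parameter values
s = zeros(size(d));
for t = 1:numel(d)
    lo   = max(1, t - floor(wStd/2));
    hi   = min(numel(d), t + floor(wStd/2));
    s(t) = std(d(lo:hi));                       % sliding standard deviation
end
f = conv(s, ones(wMean,1)/wMean, 'same');       % sliding mean filter
speaking = f > tau;                             % per-frame "speaking" label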

Window size dependency The dependency of the result on the sizes of the sliding windows has been tested. We thresholded the filtered signal with different thresholds; if the threshold is low, the rate of false positives is very high, and vice versa. The overall error, the rate of false positives and false negatives, decreases with the window size. The test results can be seen in Figure 10. The result is better if the standard deviation filter window is bigger, and it can be improved even more by enlarging the mean filter window. A larger window also means a higher threshold used for classification. Overall, the error does not depend on the window sizes significantly. The key aspect of using larger windows is to detect longer subsequences, rather than many shorter ones, as it is easier to determine synchrony on more frames.


Figure 8 Original video features (the vertical distance between lips and its first derivatives) along with the ground-truth and estimated labels of the "speaking" parts

Figure 9 First derivatives of the vertical lip distance filtered by the std and mean sliding window filters, along with the ground-truth and estimated labels of the "speaking" parts (the same video segment as in Figure 8)


Figure 10 Dependency of the false negative and false positive rates on the sliding window sizes and the threshold for the video lip activity detector (|w_std| and |w_mean| are the sizes of the std and mean window filters, respectively)

The lip activity detector was tested on the Cardiff Conversational Database [49]. The achieved accuracy was 78.9%, using window sizes of 5 for the standard deviation filter and 15 for the mean filter, and a threshold of 0.8; a detailed result can be seen in Table 1. The major source of inaccuracy was an incorrect localization of the facial landmarks, which could not be corrected by the lip activity detector itself. Moreover, the method is very sensitive to the threshold setting, as every person opens the mouth in a different way and the landmark localization works with a different accuracy for each person. The lip motion is even a biometric marker for person identification [50]. However, not all speech parts could be detected correctly even if all the features were perfect. There are some typical motions that look like visemes but actually do not come with any sound, such as breathing in with an open mouth before starting to speak.

4.2 Introspection of synchrony detection by the CCA

Canonical correlation analysis expresses the correlation between the audio and video signals and is used as a statistical score of synchrony in this work. Several tests have been carried out in order to demonstrate the capability of the CCA to express the level of synchrony between the audio and video signals. The dependency of the CCA result on the number of frames used has been investigated for both training and testing. Also, the possibility of using more than one eigenvector in equation (14) is discussed.

The training and testing of the CCA was done on two independent video sequences.


Both videos contain a single speaker talking without pauses; the speaker is different in each video. The CCA coefficients were estimated on a single video file, using the whole audio-video sequence. The testing was done on the second video, using different numbers of frames in the tested subsequences. The videos were captured by a conventional web camera at a resolution of 720x480 and a frame rate of 25 fps. The durations of the training and testing videos are 139 and 63 seconds, respectively.

4.2.1 Coefficient estimation

The CCA coefficients are estimated from the audio and video features of a video sequence. The result varies according to the length of the sequence used. The coefficients estimated on short sequences have dominant values, but the selection of these dominant elements varies a lot from sequence to sequence. It has been shown that the training coefficient values w_a and w_v converge to unique vectors with the increasing length of the used sequence. Figure 11 shows the convergence of the estimated coefficients.

The expressiveness of the CCA was tested by a sliding window of length |w| equal to the number of frames in a tested video subsequence. This window was shifted frame by frame throughout the whole tested video sequence and the CCA result was computed at each position. Taking a sequence of synchronous audio and video signals, the audio signal was additionally shifted by N frames with respect to the video signal in each window. This way, both synchronous and asynchronous signal samples are investigated. The results for each window and audio shift are shown in correlation diagrams.

Correlation diagram The correlation diagrams show the capability of the CCA to express the level of synchrony. A correlation diagram is a 2-D matrix where each column corresponds to a video frame number and each row to an audio shift of N frames. By frame number we mean the center frame of a tested window. An audio shift of N means that the audio signal was shifted by N frames with respect to the video signal. The CCA result for each window center frame and audio shift is expressed by the color at the corresponding position. The values were mapped to a colormap such that higher values get warmer colors and lower values colder ones.
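A minimal sketch of how such a diagram can be filled is given below; Vt and At are the test feature matrices, wv and wa are the fixed projections from Section 3.3, and the window length and shift range are example values.

% Minimal sketch: correlation diagram of CCA scores over window centers and audio shifts.
w = 200;  shifts = -100:100;                    % example window length and shift range
NF = size(Vt, 1);
centers = 1 + w/2 + max(abs(shifts)) : NF - w/2 - max(abs(shifts));
Cdiag = nan(numel(shifts), numel(centers));
for ci = 1:numel(centers)
    fr = centers(ci) - w/2 : centers(ci) + w/2 - 1;       % frames of the tested window
    for si = 1:numel(shifts)
        Vw = Vt(fr, :);   Aw = At(fr + shifts(si), :);    % audio shifted by N frames
        Vw = bsxfun(@minus, Vw, mean(Vw));                % zero mean per window
        Aw = bsxfun(@minus, Aw, mean(Aw));
        pv = Vw * wv;   pa = Aw * wa;
        Cdiag(si, ci) = (pv'*pa) / sqrt((pv'*pv)*(pa'*pa));
    end
end
% imagesc(centers, shifts, Cdiag) then visualizes the diagram.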

The CCA coefficients were estimated on the entire training video. The quality of the statistical score of synchrony computed using these estimated coefficients was first tested on the training video itself. A correlation diagram was computed for a window size of 200 frames, see Figure 12. This initial test shows that the difference between the synchronous and asynchronous tested windows is readily apparent. In all further experiments, we use this fixed projection.

4.2.2 Synchrony analysis

It has been shown that the CCA result is capable of distinguishing synchronous and asynchronous sequences. However, the test in Figure 12 was carried out on the same video from which the CCA coefficients were estimated. In order to show that the CCA has the capability to distinguish synchronous and asynchronous sequences in general, it has also been tested on an independent video. In the following tests, the same coefficients as in Section 4.2.1 were used (the coefficient vectors estimated on the entire training video), but a different video with a different speaker and a different text was used for testing.

21

Page 30: Audio-Visual Speech Activity Detector · 2015. 1. 19. · detection. (3) Collect a dataset of training examples. (4) Evaluate the performance of the proposed algorithm. A code for


Figure 11 Convergence of the CCA coefficients with a longer audio-video sequence used for the estimation (each column contains the concatenated vectors wa and wv). The beginning of the sequence used for estimation is the same for all columns, only the last frame number changes in each column; therefore a different number of frames was used for coefficient estimation in each column.


Figure 12 The dependency of the CCA result on the audio shift shown in a correlation diagram (testing and training on the same video). The CCA was computed for a sliding window of length 260 frames. The distinct red line of synchronous sequences is clearly visible.





Window size dependency The correlation diagrams were computed for different window sizes. Large windows provide the expected highly discriminative results, whereas small windows suffer from an aperture problem: locally high correlations are observed even for asynchronous windows. The dependency can be seen in Figure 13. A clearly visible difference between synchronous windows and windows with the sound track shifted by N frames can be seen for window sizes |w| longer than 200 frames. Window sizes between 100 and 200 provide a high value for synchronous sequences, but some asynchronous sequences show high correlation values as well. This phenomenon is even more noticeable for |w| smaller than 100 and is probably caused by a local similarity of the signals. A vertical cross-section through the diagram in Figure 13 at frame number 1200 is shown in Figure 14. The values for small windows oscillate, whereas an obvious peak expressing the synchrony is visible for windows larger than 100 frames. The reliability of the synchrony check by the CCA thus highly depends on the window size.

Classification The CCA provides a single value for the whole AV subsequence. As the resulting value increases with the level of synchrony, it is natural to use thresholding for classification. In order to find the best threshold, we constructed histograms of values for synchronous and asynchronous subsequences. The histograms are computed from the diagrams shown in Figure 13 and can be seen in Figure 15. The histogram of synchronous samples is computed from the diagram row of audio shift 0; all windows except those between audio shifts -5 and 5 were used to compute the histograms of the asynchronous samples. The values close to audio shift 0 were excluded because the audio shift is too small to consider them asynchronous, yet they are not entirely synchronous either. As expected, the error (histogram overlap) decreases with the window size. The mean of the asynchronous subsequences remains at zero for all window sizes, but their standard deviation decreases with larger windows. We observe the same phenomenon for synchronous subsequences. The values for a window as small as ten frames are almost indistinguishable; however, the discriminability grows with the window size. This means that it is possible to classify an AV subsequence as synchronous or asynchronous by thresholding if the window is large enough.
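A possible way to derive the two histograms and a classification threshold from a correlation diagram is sketched below; the row selection (shift 0 as synchronous, |shift| > 5 frames as asynchronous) follows the text, while the function names and the threshold sweep are illustrative assumptions.

```python
import numpy as np

def sync_async_samples(diagram, shifts, guard=5):
    """Synchronous samples come from the audio-shift-0 row, asynchronous ones
    from all rows with |shift| > guard frames."""
    sync = diagram[shifts == 0].ravel()
    async_ = diagram[np.abs(shifts) > guard].ravel()
    return sync, async_

def best_threshold(sync, async_, candidates=np.linspace(-1.0, 1.0, 401)):
    """Threshold minimising the balanced error over a grid of candidates."""
    errors = [0.5 * (np.mean(sync <= t) + np.mean(async_ > t)) for t in candidates]
    k = int(np.argmin(errors))
    return candidates[k], errors[k]

# A new window is then classified as synchronous if its CCA result > threshold.
```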

Projections to other directions The number of CCA coefficient vectors is equal to the minimum of the numbers of audio and video features. As we use 6 video features and 39 audio features, the number of CCA coefficient vectors is 6. The linear combination of the first coefficient vector, the eigenvector with the highest corresponding eigenvalue, and the feature vector gives the projection to the directions with the highest correlation. Therefore, the first coefficient vector is the most significant for the synchrony evaluation. As all the coefficient vectors, the eigenvectors in equation (14), are orthogonal, their projections provide independent results that can be used either separately or combined together. Correlation diagrams for each individual coefficient vector, computed with a sliding window of size 250 frames, can be seen in Figure 16. The result of the first coefficient vector matches the expectation. The second, however, shows that the CCA values are higher for more of the subsequences that are synchronous. The other four do not seem to carry a significant meaning when used alone; however, their combination may give a better result than the first coefficient alone.




[Figure 13: six correlation diagrams, panels |w| = 10, 60, 110, 160, 210, 260; axes: frame number vs. audio shift [frames]]

Figure 13 Correlation diagrams computed with different lengths of the sliding window |w| (the number of frames used for the computation). The result of the CCA is strongly dependent on the number of frames used for the computation.





Figure 14 Cross-section (one column) of the diagrams in Figure 13 (window centered at frame number 1200). There is a distinct mode of synchronized subsequences when the CCA was computed on a large window, and oscillations otherwise.


Averaging projections The idea is to average the results of multiple coefficient vectors in order to increase the values for synchronous frames and vice versa. Since the contributions of the coefficient vectors are not correlated, their combination might filter out random fluctuations of high correlation for small windows. The resulting diagrams of the averaging are shown in Figure 17. The first diagram is the result of the first coefficient vector only, the second is the average of the first two, the third is the average of the first three, etc. The number of high values increases for windows in the row of synchronous frames, as intended. However, the values belonging to asynchronous frames increase as well. This might not be an improvement for the final classification.
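The averaging itself is a simple cumulative mean over the per-coefficient diagrams; a short NumPy sketch (array shapes assumed) could look like this:

```python
import numpy as np

def cumulative_average(diagrams):
    """diagrams: array of shape (6, n_shifts, n_centres), one correlation
    diagram per CCA coefficient vector, ordered by decreasing eigenvalue.
    Returns the running averages of Figure 17: output k is the mean of the
    first k + 1 diagrams."""
    counts = np.arange(1, diagrams.shape[0] + 1)[:, None, None]
    return np.cumsum(diagrams, axis=0) / counts
```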

Error The histograms shown in Figure 15 are computed from the first CCA coefficient vector only. It is interesting to see whether averaging the results from multiple coefficient vectors can decrease the error. The error is estimated from the histograms by integrating their overlap. The intersection of the histogram curves defines the optimal threshold; the error is then half of the sum of incorrectly classified subsequences. A graph showing the dependency of the error on the window size for multiple averaging results is shown in Figure 18. The slopes of the curves representing the errors of the results averaged from the first three and four coefficient vectors decrease more steeply than the others; however, this changes at a window size of 150 frames. The averages of the first five and six coefficient vectors seem to give the worst results. The best result, especially for large windows, is provided by the first coefficient vector without any averaging. Its error is as low as 1% for a window of size 300 frames.
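Under the equal-prior reading of "half of the sum of incorrectly classified subsequences", the error can be approximated by integrating the overlap of the two normalised histograms, for example as in this hedged sketch (bin edges and names are assumptions):

```python
import numpy as np

def overlap_error(sync, async_, bins=np.linspace(-1.0, 1.0, 201)):
    """Balanced classification error estimated as half of the overlap of the
    normalised synchronous/asynchronous histograms (cf. Figure 18)."""
    h_sync, _ = np.histogram(sync, bins=bins, density=True)
    h_async, _ = np.histogram(async_, bins=bins, density=True)
    bin_width = np.diff(bins)
    return 0.5 * np.sum(np.minimum(h_sync, h_async) * bin_width)
```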




[Figure 15: six histograms, panels |w| = 10, 60, 110, 160, 210, 260; axes: CCA result bins vs. hit rate]

Figure 15 Histograms of the CCA results for asynchronous (red curve) and synchronous (green curve) video subsequences, computed from the diagrams shown in Figure 13. The histograms are shown for an increasing window size.




[Figure 16: six correlation diagrams, panels CCA coefficient 1–6; axes: frame number vs. audio shift [frames]]

Figure 16 Correlation diagrams computed with a fixed-size window. Each correlation diagram is a projection by a different CCA coefficient vector. The diagrams are sorted by the eigenvalue corresponding to each coefficient vector, in decreasing order.




[Figure 17: six correlation diagrams, panels CCA coefficient 1–6 (cumulative averages); axes: frame number vs. audio shift [frames]]

Figure 17 Correlation diagrams computed by averaging the results of the CCA coefficient vectors (i.e., averages of the diagrams in Figure 16). The first diagram is the result of the first coefficient vector only, the second is the average of the first and second, the third of the first three, etc.




[Figure 18: error vs. window size [frames]; curves: averages of 1–6 coefficient vectors]

Figure 18 Dependency of the error (computed from the histogram overlaps) on the window size and on the averaging of the results contributed by individual coefficient vectors.


Precision-recall curves As a last test, we investigated precision-recall (P-R) curves. Precision is the fraction of all positive classifications that are true positives; recall is the fraction of all originally positive instances that were classified as positive. Precision and recall are measured for an increasing threshold, which spans a P-R curve. The best curve is the one with the largest area under it. The dependency of the P-R curve on the window size can be seen in Figure 19. It clearly shows that a longer window improves the P-R curve. Figure 20 shows that the effect of averaging is not very significant. However, the P-R curves of the averaged results indicate a slightly better performance for certain threshold settings.
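For completeness, precision and recall over a sweep of thresholds could be computed as in the following sketch (the labelling of windows as positive/negative and the variable names are assumptions):

```python
import numpy as np

def precision_recall(scores, labels, thresholds):
    """scores: CCA results per window; labels: 1 for synchronous, 0 for asynchronous.
    A window is classified as positive (synchronous) when its score exceeds t."""
    precision, recall = [], []
    for t in thresholds:
        pred = scores > t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precision.append(tp / max(tp + fp, 1))
        recall.append(tp / max(tp + fn, 1))
    return np.array(precision), np.array(recall)
```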

4.3 Applications of the CCA

The CCA can be employed as a synchrony test. It has been used to improve the video-only lip detector by testing the synchrony. Additionally, the CCA result for shifted windows provides a tool that can estimate an unknown delay between the audio and video signals in the case of a corrupted sequence.




[Figure 19: precision vs. recall; curves for |w| = 10, 30, 50, ..., 290]

Figure 19 Precision-recall curves of the CCA results for different window sizes on which the CCA is computed (only the first coefficient vector is used).


4.3.1 Detection of synchronized video segments

Speech can be estimated from the lip motion only, supposing that the speaker is always speaking when his or her lips are moving. However, in some cases the lip motion may not indicate speech. There are situations when the lips are moving and nothing is uttered, such as breathing in or smiling, or when the uttered sound is not necessarily speech, e.g., emotional expressions. Also, the speaker might be different from the person visible in the video, although this case is rather rare.

It has been shown that the synchrony can be determined on large windows with a high accuracy. However, testing a large window in an arbitrary video might not always be reliable. Almost all videos that require speech detection have some level of dynamics. A change of speakers within a tested video sequence would introduce boundary artifacts: a part of the tested sequence would contain the voice, the face, or both of a different speaker, and the synchrony test would not be reliable.

In order to overcome these problems, we propose a two-phase audio-visual lip activity detector. In the first phase, the potential speech parts are detected by the video lip activity detector described in Section 4.1. This eliminates the boundary artifacts, i.e., it ensures that the tested parts really contain only the face of one potential speaker. The preliminary speech part detection also provides subsequences that are as long as possible, so that the window for testing the synchrony is long enough. The second phase is the synchrony test by computing the CCA. The CCA is computed on each of the subsequences selected by the lip activity detector, and each subsequence, represented by its CCA result, is then classified by thresholding. A sketch of this second phase is given below.
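The sketch assumes that the segments from the visual lip activity detector, frame-aligned feature matrices, and the fixed projection vectors are already available; the function and parameter names are illustrative, and 0.2 is the threshold later used in the Cardiff experiment.

```python
import numpy as np

def audio_visual_speech_labels(audio, video, wa, wv, lip_segments,
                               cca_threshold=0.2):
    """Label frames as speech only if their lip-activity segment also passes
    the CCA synchrony test.

    lip_segments: list of (start_frame, end_frame) from the visual detector.
    Returns a boolean per-frame speech mask.
    """
    labels = np.zeros(len(video), dtype=bool)
    for start, end in lip_segments:
        if end - start < 2:
            continue  # too short to compute a correlation
        score = np.corrcoef(audio[start:end] @ wa, video[start:end] @ wv)[0, 1]
        if score > cca_threshold:
            labels[start:end] = True
    return labels
```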




[Figure 20: precision vs. recall; curves: averages of 1–6 coefficient vectors]

Figure 20 Precision-recall curves of the averaged CCA results for window size |w| = 200. Each curve was obtained by averaging the results of individual coefficient vectors.


Cardiff Conversational Database The proposed audio-visual speech activity detector was tested on the Cardiff Conversational Database [49]. The database consists of 16 videos representing 8 conversations. Each speaker was captured separately in a single video, while the two videos together capture a single conversation. Therefore, we tested the videos one by one, but the two videos belonging to one conversation were tested with an audio signal obtained by summing the two audio signals together. This way, each video contained both synchronous and asynchronous parts.

First, the video-only lip activity detector, with threshold 0.8, standard deviation filter window size 5 and mean filter window size 15, was used to detect the speech parts. Each part was then analyzed for synchrony. The CCA was computed on each subsequence selected by the lip activity detector and classified as synchronous if the CCA result was higher than 0.2. The classification error was then computed as the number of incorrectly classified frames over the number of frames in the tested video. The CCA projection coefficients were fixed as in all previous tests. The classification results can be seen in Table 1.




video name       video only   audio-video   CCA on ground-truth segments
P1_P2_1402_C1    0.2680       0.1816        0.1150
P1_P2_1402_C2    0.2114       0.2383        0.1603
P1_P3_1502_C1    0.2806       0.2801        0.1032
P1_P3_1502_C2    0.1979       0.1920        0.0928
P3_P4_1502_C1    0.1208       0.2271        0.0518
P3_P4_1502_C2    0.1683       0.1941        0.0819
P5_P2_1003_C1    0.1355       0.1511        0.0144
P5_P2_1003_C2    0.4654       0.3797        0.0933
P5_P3_2202_C1    0.2220       0.2537        0.0732
P5_P3_2202_C2    0.2196       0.2217        0.0590
P6_P2_1602_C1    0.1644       0.2054        0.0324
P6_P2_1602_C2    0.2408       0.3088        0.0214
P6_P3_1602_C1    0.1401       0.3253        0.3234
P6_P3_1602_C2    0.2394       0.2361        0.0322
P6_P4_1602_C1    0.1432       0.2248        0.0946
P6_P4_1602_C2    0.1624       0.1948        0.0455

average          0.2079       0.1920        0.0871

Table 1 Classification error of the video-only lip activity detector (video only), of the two-phase audio-video speech detection, i.e. the result of the video-only lip activity detector tested for synchrony by the CCA (audio-video), and of the CCA classification of the ground-truth segments (CCA on ground truth). The video names are constructed as follows: ID of speaker 1, ID of speaker 2, conversation number, video number of the conversation.
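The per-video error reported in Table 1 is a frame-level disagreement rate, as described above; a minimal sketch of this measure (names assumed) is:

```python
import numpy as np

def frame_classification_error(predicted, ground_truth):
    """Fraction of frames whose speech/non-speech label disagrees with the
    ground-truth annotation."""
    predicted = np.asarray(predicted, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    return float(np.mean(predicted != ground_truth))
```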


The overall classification error is lower for the audio-visual detection than for the video-only detection. However, the results of the speech detection depend strongly on the tested videos. The classification error of the CCA evaluated on the ground-truth subsequences, i.e. the fraction of ground-truth speech subsequences classified as asynchronous, is lower still.

There are several sources of error that can significantly influence the accuracy of the results. First, the CCA coefficients are trained on a single video in a different language (trained on a female voice in Czech, tested on male voices in Welsh English). Second, due to the lighting conditions of the dataset videos, the Chehra detector is very inaccurate. Third, all the steps are very sensitive to the threshold selection. In particular, the threshold of the video-only lip detector significantly influences the result of the synchrony check: a high threshold can result in a large number of short windows rather than a smaller number of long windows, and the CCA synchrony test is very inaccurate for very small windows. The selection of the optimal parameters is highly dependent on the speaker and the capturing conditions.

4.3.2 Estimation of audio delay

The CCA synchrony test provides a tool to estimate the audio shift for a video signal with latency. Let v(t) be the video signal and a(t + τ) the audio signal with an unknown delay τ. The delay is found by




\tau^{\ast} = \arg\max_{\tau} \; \mathrm{CCA}\bigl(v(t),\, a(t + \tau)\bigr), \qquad (15)

which is implemented by shifting the audio signal back and forth. The frame difference between the original position and the maximum of the CCA results is the desired audio shift.
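A straightforward realisation of equation (15), sweeping the audio shift over a range of frames and keeping the arg max of the CCA result, might look as follows (a sketch; the shift range and names are assumptions):

```python
import numpy as np

def estimate_audio_delay(audio, video, wa, wv, max_shift=100):
    """Estimate the audio delay (in frames) as the shift maximising the CCA result."""
    n = len(video)
    best_shift, best_score = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        # overlapping part of the two signals when the audio is shifted by `shift`
        a_lo, a_hi = max(0, shift), min(n, n + shift)
        v_lo, v_hi = max(0, -shift), min(n, n - shift)
        score = np.corrcoef(audio[a_lo:a_hi] @ wa, video[v_lo:v_hi] @ wv)[0, 1]
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift
```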

The shift estimation was tested on all videos previously used in this work, on subsequences 10 seconds long. The alignment finds the exact delay in 99.1% of the subsequences. The remaining errors are caused by the maximum being estimated very close to, but not exactly at, the correct shift.



5 Conclusion and future work

An audio-visual speech activity detector has been proposed. The algorithm has two phases: video-only lip activity detection and an audio-video synchrony test. The visual lip activity is detected from the first derivative of the vertical distance between the lips by applying two sliding-window filters, the standard deviation and the mean, of different sizes, followed by thresholding. The synchrony is checked by the CCA on the video subsequences pre-segmented by the visual lip activity detector. The CCA uses fixed coefficients estimated on a single long video sequence. The CCA result is thresholded in order to classify subsequences as synchronous or asynchronous. Only the parts of the video segmented by the visual lip activity detector and confirmed as synchronous by thresholding the CCA result are labeled as speech parts of the tested videos.

The video-only lip activity detection gives good results if the optimal threshold is used. Unfortunately, the threshold varies considerably from speaker to speaker and it is hard to find a threshold that can be used in general; the way a person opens the mouth is a biometric marker. The performance of the visual lip activity detector is also limited by the quality of the video features. The Chehra landmark localization is not always reliable, and it is especially sensitive to illumination. This significantly influenced the results of testing on the Cardiff Conversational Database.

The synchrony test was done by the CCA. The CCA is a standard tool for cross-modal analysis. It projects the feature vectors into a one-dimensional space in the direction of the highest correlation. The CCA coefficients are usually estimated on each tested subsequence separately; however, we have shown that they converge to a unique vector when estimated on a single long sequence. Therefore, a fixed projection was used. The CCA is very accurate for long tested sequences. Local fluctuations of high correlation are apparent for short subsequences, and the test result is then rather unreliable. We can accurately distinguish between synchronous and asynchronous audio-video sequences if a sufficient number of frames is used; a reliable result requires a sequence about 8 seconds long. As a side effect, the audio shift can be estimated for videos in which one of the audio or video signals was delayed.

The proposed solution uses a simple classification based on a projection to a one-dimensional space (a linear combination), regardless of the speech content, and has fixed thresholds that do not adapt to a particular speaker. Speech signals are not stationary. Certain phonemes clearly map to corresponding visemes; however, the recognition of synchrony is probably different for each viseme and phoneme. For certain audio-visual speech segments, the synchrony recognition is intrinsically ambiguous. As future work, we would like to investigate a machine-learning-based solution, one that would automatically learn from data which audio features are important for particular video features. We aim for the synchrony decision to be possible from a shorter sequence of only several frames. This way, we would probably not be limited to long subsequences and could classify video parts directly, without a preliminary segmentation, which might significantly decrease the classification error.



Bibliography

[1] W.H. Sumby and I. Pollack. “Visual contribution to speech intelligibility in noise”. In: Journal of the Acoustical Society of America 26(2) (1954), pp. 212–215.

[2] H. McGurk and J. MacDonald. “Hearing lips and seeing voices”. In: Nature 264 (1976), pp. 746–748.

[3] A.Q. Summerfield. “Some preliminaries to a comprehensive account of audio-visual speech perception”. In: Dodd, B., Campbell, R. (Eds.), Hearing by Eye: The Psychology of Lip-Reading. London, United Kingdom: Lawrence Erlbaum Associates (1987), pp. 3–51.

[4] J.R. Deller, J.G. Proakis, and J.H.L. Hansen. Discrete-Time Processing of Speech Signals. Englewood Cliffs, New Jersey, USA: Macmillan Publishing Company, 1993.

[5] M. Krčmová. Fonetika a fonologie. url: http://is.muni.cz/do/rect/el/estud/ff/js08/fonetika/ucebnice/index.html (visited on 12/17/2014).

[6] C. Neti et al. “Audio-Visual Speech Recognition”. In: Workshop Final Report (2000).

[7] M. Goyani, N. Dave, and N.M. Patel. “Performance Analysis of Lip Synchronization Using LPC, MFCC and PLP Speech Parameters”. In: International Conference on Computational Intelligence and Communication Networks (2010), pp. 582–587.

[8] S. Sadjadi and J. Hansen. “Unsupervised speech activity detection using voicing measures and perceptual spectral flux”. In: IEEE Signal Processing Letters 20 (2013), pp. 197–200.

[9] N. Mesgarani, M. Slaney, and S.A. Shamma. “Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations”. In: IEEE Transactions on Audio, Speech, and Language Processing 14(3) (2006), pp. 920–930.

[10] Tim Ng et al. “Developing a speech activity detection system for the DARPA RATS program”. In: Interspeech (2012).

[11] T. Pfau, D.P. Ellis, and A. Stolcke. “Multispeaker Speech Activity Detection for the ICSI Meeting Recorder”. In: Automatic Speech Recognition and Understanding (2001), pp. 107–110.

[12] A. Saito et al. “Voice activity detection based on conditional random fields using multiple features”. In: Interspeech (2010), pp. 2086–2089.

[13] K.C. van Bree. “Design and realisation of an audiovisual speech activity detector”. Master’s thesis. Eindhoven University of Technology, 2006.

[14] Peng Liu and Zuoying Wang. “Voice activity detection using visual information”. In: International Conference on Acoustics, Speech, and Signal Processing 1 (2004), pp. 609–612.




[15] D. Sodoyer et al. “A study of lip movements during spontaneous dialog and its application to voice activity detection”. In: Journal of the Acoustical Society of America 125(2) (2009), pp. 1184–1196.

[16] A. Aubrey, Y. Hicks, and J. Chambers. “Visual voice activity detection with optical flow”. In: IET Image Process 4(6) (2010), pp. 463–472.

[17] E. Kidron, Y.Y. Schechner, and M. Elad. “Pixels that Sound”. In: IEEE Computer Vision and Pattern Recognition 1 (2005), pp. 88–95.

[18] E. D’Arca, N.M. Robertson, and J. Hopgood. “Look Who’s Talking”. In: Intelligent Signal Processing Conference (2013), pp. 1–6.

[19] H.J. Nock, G. Iyengar, and C. Neti. “Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study”. In: Lecture Notes in Computer Science 2728 (2003), pp. 565–570.

[20] E.A. Rúa et al. “Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models”. In: Pattern Analysis and Applications 12(3) (2009), pp. 271–284.

[21] H. Bredin and G. Chollet. “Audio-Visual Speech Synchrony Measure for Talking-Face Identity Verification”. In: International Conference on Acoustics, Speech, and Signal Processing 2 (2007), pp. 233–236.

[22] N. Eveno and L. Besacier. “A speaker independent "liveness" test for audio-visual biometrics”. In: Interspeech (2005), pp. 3081–3084.

[23] Zheng-Yu Zhu et al. “Liveness detection using time drift between lip movement and voice”. In: International Conference on Machine Learning and Cybernetics 2 (2013), pp. 973–978.

[24] Zhao Li et al. “Multiple active speaker localization based on audio-visual fusion in two stages”. In: IEEE Conference on Multisensor Fusion and Integration for Intelligent Systems (2012), pp. 1–7.

[25] M. Everingham, J. Sivic, and A. Zisserman. ““Hello! My name is... Buffy” – Automatic Naming of Characters in TV Video”. In: British Machine Vision Conference (2006), pp. 92.1–92.10.

[26] F. Talantzis, A. Pnevmatikakis, and A.G. Constantinides. “Audio-visual active speaker tracking in cluttered indoors environments”. In: IEEE Transactions on Systems, Man, and Cybernetics (2009), pp. 799–807.

[27] E.A. Lehmann and A.M. Johansson. “Particle filter with integrated voice activity detection for acoustic source tracking”. In: EURASIP Journal on Advances in Signal Processing (2007).

[28] P. Besson et al. “Extraction of audio features specific to speech production for multimodal speaker detection”. In: IEEE Transactions on Multimedia 10(1) (2008), pp. 63–73.

[29] M. Beal, H. Attias, and N. Jojic. “Audio-visual sensor fusion with probabilistic graphical models”. In: European Conference on Computer Vision 1 (2002), pp. 736–750.

[30] Cha Zhang et al. “Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos”. In: IEEE Transactions on Multimedia 10(8) (2008), pp. 1541–1552.




[31] Yingen Xiong, Bing Fang, and F. Quek. “Detection of Mouth Movements and its Applications to Cross-Modal Analysis of Planning Meetings”. In: International Conference on Multimedia Information Networking and Security 1 (2009), pp. 225–229.

[32] X. Anguera et al. “Speaker diarization: A review of recent research”. In: IEEE Transactions on Audio, Speech and Language Processing 20(2) (2010), pp. 356–370.

[33] F. Vallet, S. Essid, and J. Carrive. “A Multimodal Approach to Speaker Diarization on TV Talk-Shows”. In: IEEE Transactions on Multimedia 15(3) (2013), pp. 509–520.

[34] A. Noulas, G. Englebienne, and B.J.A. Krose. “Multimodal Speaker Diarization”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34(1) (2012), pp. 79–93.

[35] M. Bendris, D. Charlet, and G. Chollet. “People indexing in TV-content using lip-activity and unsupervised audio-visual identity verification”. In: International Workshop on Content-Based Multimedia Indexing (2011), pp. 139–144.

[36] E. El Khoury et al. “Association of Audio and Video Segmentations for Automatic Person Indexing”. In: International Workshop on Content-Based Multimedia Indexing (2007), pp. 287–294.

[37] M. Bendris, D. Charlet, and G. Chollet. “Talking faces indexing in TV-content”. In: International Workshop on Content-Based Multimedia Indexing (2010), pp. 1–6.

[38] Zhu Liu and Yao Wang. “Major Cast Detection in Video Using Both Speaker and Face Information”. In: IEEE Transactions on Multimedia 9(1) (2007), pp. 89–101.

[39] A. Das, M.R. Jena, and K.K. Barik. “Mel-Frequency Cepstral Coefficient (MFCC) - a Novel Method for Speaker Recognition”. In: Digital Technologies 1(1) (2014), pp. 1–3.

[40] H. Seddik, A. Rahmouni, and M. Sayadi. “Text independent speaker recognition using the Mel frequency cepstral coefficients and a neural network classifier”. In: First International Symposium on Control, Communications and Signal Processing, Proceedings of IEEE (2004), pp. 631–634.

[41] VOICEBOX: Speech Processing Toolbox for MATLAB. url: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html (visited on 11/02/2014).

[42] J. Šochman and J. Matas. “WaldBoost - Learning for Time Constrained Sequential Detection”. In: Conference on Computer Vision and Pattern Recognition (2005), pp. 150–157.

[43] Eyedea Face Detection. url: http://www.eyedea.cz/eyeface-sdk/ (visited on 10/07/2014).

[44] A. Asthana et al. “Incremental Face Alignment in the Wild”. In: Conference on Computer Vision and Pattern Recognition (2014).

[45] Xuehan Xiong and F. De la Torre. “Supervised descent method and its application to face alignment”. In: Conference on Computer Vision and Pattern Recognition (2013).

[46] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000, pp. 71–72.




[47] H. Hotelling. “Relations between two sets of variates”. In: Biometrika 28 (1936), pp. 321–377.

[48] H. Knutsson, M. Borga, and T. Landelius. “Learning canonical correlations”. In: Tech. Rep. LiTH-ISY-R-1761, Computer Vision Laboratory, S-581 83, Linkoping University, Sweden (1995).

[49] A.J. Aubrey et al. “Cardiff Conversation Database (CCDb): A Database of Natural Dyadic Conversations”. In: V & L Net Workshop on Language for Vision (2013).

[50] B. Goswami et al. “Speaker Authentication using Video based Lip Information”. In: Proc. ICASSP (2001), pp. 1908–1911.



Appendix A

Contents of the enclosed DVD

directory   content
thesis      This thesis in PDF
code        MATLAB code for the proposed method
data        Not publicly available data used in this thesis

Table 2 Content of the attached DVD
