

    Speaker Diarization

    A Thesis Submitted in Partial Fulfillment of the

    Requirements for the Degree of

Master of Technology in

    Computer Science & Engineering

    By:

    Avinash Kumar Pandey

    2006CS50213

    Under The Guidance of:

    Prof. K. K. Biswas

    Department of Computer Science

    IIT Delhi

    Email: [email protected]

    Department of Computer Science

    Indian Institute of Technology Delhi


    Certificate

    This is to certify that the thesis titled "Speaker Diarization" being submitted by

Avinash Kumar Pandey, Entry Number: 2006CS50213, in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer

    Science & Engineering, Department of Computer Science, Indian Institute of

Technology Delhi, is a bona fide record of the work carried out by him under my
supervision. The matter submitted in this dissertation has not been submitted for the
award of any other degree anywhere, except where explicitly referenced.

    Signed: ______________________

    Prof. K.K. Biswas

    Department of Computer Science

Indian Institute of Technology, Delhi

    New Delhi-110016 India

    Date: ______________________


    Abstract

In this document we describe a speaker diarization system. It is an internal-segmentation
system that performs text independent speaker identification using Gaussian mixture models
over MFCC feature vectors, aided by a Gaussian mixture model based speech activity
detection system.

A general speaker diarization (speaker detection and tracking) system consists of four
modules: speech activity detection, speaker segmentation, speaker clustering and speaker
identification. In an internal-segmentation diarization system, the functions of the three
modules of segmentation, clustering and identification are discharged by the same speaker
identification module: segmentation is done after identifying which speaker a particular
audio segment belongs to, so clustering happens simultaneously.

Speaker identification systems themselves come in various types: a system may impose
restrictions on the text a speaker must utter in order to be identified, or it may leave the text
completely unrestricted. The latter systems are called text independent speaker identification
systems. Different speaker identification systems work on different types of feature
vectors; text independent speaker identification systems work on lower level glottal
features of the speaker.

The feature vectors thus obtained can be modeled using different statistical models;
we have experimented mainly with two, vector quantization and Gaussian mixture models.


    Acknowledgements

I would like to acknowledge the guidance of Prof. K. K. Biswas (Department of
Computer Science, IIT Delhi), whose guidance was the corner-stone of this project,

    without which this project would never have been possible. Thank you for your

    wonderful support. I would also like to express my gratitude towards Prof. S. K.

    Gupta and Prof. Saroj Kaushik for their guidance throughout the development of

    the project. I would like to gratefully acknowledge my debt to other people who have

    assisted in the project in different ways.

    Signed: ______________________

Avinash Kumar Pandey

    2006CS50213

Indian Institute of Technology, Delhi

    New Delhi-110016 India

    Date: ______________________


    Contents

Certificate
Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Definition
  1.3 History
    1.3.1 Rich transcription framework
  1.4 Applications
  1.5 Outline of the Work
  1.6 Chapter Outline of the Thesis

2 Speaker Diarization
  2.1 Introduction
  2.2 Speech activity detection
  2.3 Speaker segmentation
    2.3.1 Segmentation using silence
    2.3.2 Segmentation using divergence measures
    2.3.3 Segmentation using frame level audio classification
    2.3.4 Segmentation using direction of arrival
  2.4 Speaker clustering
  2.5 Speaker identification

3 Speech Activity Detection and Implementation
  3.1 Introduction
  3.2 Gaussian Mixture Models
  3.3 Our algorithm for SAD
    3.3.1 Observation
  3.4 Experiments
    3.4.1 Fan noise
    3.4.2 Silence
    3.4.3 Fast paced speech
    3.4.4 Moderate paced speech
    3.4.5 White noise
  3.5 Advantages and Drawbacks

4 Implementation of Text Independent Speaker Identification
  4.1 Introduction
  4.2 Speech Parameterization: Feature Vectors
    4.2.1 MFCC
  4.3 Statistical Modelling
    4.3.1 Vector Quantization
    4.3.2 Gaussian Mixture Models
  4.4 Bayesian Information Criterion

5 Experimental Results

6 Conclusion

Bibliography


List of Figures

Figure 1: Segmentation of an audio clip
Figure 2: Gaussian mixture models
Figure 3: Fan noise
Figure 4: Silence
Figure 5: Fast paced speech
Figure 6: Moderately paced speech
Figure 7: White noise
Figure 8: Training phase of a speaker identification system
Figure 9: Testing phase of a speaker identification system
Figure 10: General schematic to calculate PLP or MF-PLP features
Figure 11: Different modules in computation of cepstral features
Figure 12: Vector quantization


List of Tables

Table 1: Speech activity detection experiment results
Table 2: Speaker diarization experiment results


    Chapter 1

    Introduction

In this chapter we discuss where the problem of speaker diarization originated, which
problem domains it can find application in and how much work has already been done
in this area; we also give a brief outline of the chapters of this thesis.

    1.1 Motivation

Recording of speech, for several purposes, has long been in practice. There are many
reasons to record one's voice: for educational purposes, for archival purposes, or to
conserve a memory through the vicissitudes of time. Recording is more automatic than
scribbling down the dialogue, and it is oftentimes cheaper than video as well.

Across the globe, in countless archives, there exists a huge amount of audio data. We
organize our data, as in database tables, through certain keys; audio clips, as such, give
us no key. The idea is to devise an organization for these audio databases in order to make
them easy to handle. One possible index for audio data is speaker identity; this
thought is the key idea motivating speaker diarization.

    1.2 Definition

In a given audio clip, the task of speaker diarization essentially addresses the question of
"who spoke when". The problem involves labeling the entire audio file with speaker
beginning and end times s_i and e_i for all homogeneous single speaker segments.
If there are portions that correspond to non-speech, those have to be explicitly marked
too. For example, a sample output for a one minute audio file could be: 0-3
seconds non-speech, 3-25 speaker 1, 25-37 music, 37-60 speaker 2.

    1.3 History

Historically, the problem was first formulated thus in the National Institute of Standards and
Technology (NIST) Rich Transcription framework, better known as the RT framework,
convention of 1999. Since then, up to 2007, several conventions were held on this problem.
Mainly, two types of diarization problems were undertaken:

1) Broadcast news speaker diarization
2) Meeting or conference room speaker diarization

In the broadcast news framework, the recording system is single modal. That is to say,
there is only one recording device, on which all the speakers take turns to speak.
The apparatus is simple, but the accuracy of the diarization system is
hampered on this account.

    In the meeting room domain problem, audio clips are recorded across multiple distant

    microphones, locations of which were not disclosed. The results of diarization of these

    different clips were then combined together to enhance the efficiency of the overall

    diarization engine.

There is one aspect of multimodal scenarios that leads to enhanced diarization results:
the TDOA parameter, the time delay of arrival. With different recording
devices, the distance of the speaker from each microphone is bound to change with a speaker
change, because two speakers will most probably not occupy the same physical
position. This information, the time difference across the different recording devices, is a
very strong tool for speaker segmentation, i.e. identifying speaker turn points.

The problem we have undertaken to solve is similar to that of the broadcast news
scenario, where different speakers take turns to speak into a single recording device.

The current state of the art in the speaker diarization regime has moved beyond audio. The
idea is to record not only audio, but also to take visual cues from the speakers and audiences
to determine speaker status and change. Its impact has been conclusively shown in
"Multimodal Speaker Diarization of Real World Meetings Using Compressed Domain Video
Features", a paper by Friedland and Hung, 2010.


1.3.1 NIST Rich Transcription Framework

There is a set of related problems that are undertaken under the NIST Rich Transcription
framework, namely:

1) Large vocabulary continuous speech recognition (LVCSR)
2) Speaker diarization
3) Speech activity detection (SAD)

Let us now discuss each of these problems briefly, along with the success that has been
achieved in solving them.

    Large Vocabulary Continuous Speech Recognition

This is the name for the most common and important problem in speech processing. LVCSR
is simply a technical name for the most general speech recognition system. Amateur
speech recognition engines place restrictions on the vocabulary, for
example that the speaker may only speak out of so many words already recognized by the
engine, or that the speaker should speak at a certain pace, which generally meant a pace
slower than the normal pace of speaking. LVCSR was meant to overcome all these
restrictions.

    Speaker Diarization

The problem statement of speaker diarization has already been introduced; it will be
taken up in fair detail, both theoretical and implementational, as it is the subject of
this thesis.

    Speech Activity Detection

Speech activity detection is a sub-problem in most speech processing applications,
and speaker diarization is no different. We too develop an algorithm for speech
activity detection. In chapter 3, we discuss in detail our algorithm for speech
activity detection, its experimental performance, its advantages and its drawbacks.


1.4 Applications

Speaker diarization provides for multifarious applications in diverse domains. Some of
them follow.

1) Once our audio archive has been indexed by speaker identity, a user can
quickly browse through the archive looking only at the speakers of his own
interest, rather than manually searching the entire file for the speaker he is interested in.

2) Speaker diarization also plays a vital role in automatic speech recognition. If we
do not know the speaker identity and are trying to convert the speech to text, the
phonetic models we apply are rather generic; but once we know who exactly the
speaker is, we can migrate to a speaker specific phonetic model, which performs
better. The literature reports about a 10% improvement in automatic
speech recognition when the identity of the speaker is known in advance.

3) If one decides to resort to manual transcription in order to avoid the inherent
difficulties and inaccuracies of automatic speech recognition, even then a diary of
which speaker started when will come in handy.

    1.5 Outline of the Work

We created a system which is capable of producing a diary of a given audio clip,
provided it has in its database training samples of all the speakers present in the audio
clip. Our implementation fundamentally consists of three main modules: a speech
activity detection module, a Bayesian Information Criterion module and a text
independent speaker identification module. The SAD module, given a speech segment,
decides whether it is speech or non-speech. The non-speech categories could be Gaussian
noise, background noise, music or, most commonly, silence. The SAD module
differentiates speech from these kinds of signals.

After this, we have the Bayesian Information Criterion module, which narrows down on a
given audio segment until it becomes reasonably assured that this particular segment
belongs to a single speaker.


Then, at the end, we have the core speaker identification engine based on MFCC feature
vectors. It is of the text independent type: when we say it is text independent we mean it
does not depend on the text uttered by the speaker; it can identify a speaker no matter
what he speaks. This establishes that our audio clip does not have to be a fixed set of
words; our speakers can talk about anything and we will still be able to produce a diary.

    1.6 Chapter outline of the thesis

In the remaining chapters we first develop the idea of speaker diarization and then
discuss the details associated with the different modules, chapter by chapter. In
chapter 2 we discuss, one by one, the theoretical ideas behind, the advances made in
and the techniques popular in each of the areas of speaker segmentation, speaker clustering
and speaker identification. In chapter 3 we discuss our implementation of the
speech activity detection algorithm; in chapter 4 we discuss our implementation of, and the
general background to, text independent speaker identification, together with a brief
discussion of the Bayesian Information Criterion. In chapter 5 we furnish results about the
performance of our speaker diarization system and in chapter 6 we conclude the thesis.


Chapter 2

    Speaker Diarization

    2.1 Introduction

Speaker diarization, as Anguera and Wooters contended in "Robust Speaker
Segmentation for Meetings" (ICSI-SRI Spring 2005), can, for the twin purposes of
abstraction and modularity, be divided into four stages:

1) Speech activity detection
2) Speaker segmentation
3) Speaker clustering
4) Speaker identification

2.2 Speech activity detection

Speech activity detection is the first step in almost every speech processing
application, the primary reason being the highly computationally intensive nature of the
subsequent processing methods. One would not want to waste one's resources computing on parts of
the audio clip that are not speech. Suppose there is a meeting room in our computer
science department, and a far field microphone installed in the room is always alert;
that is to say, it always keeps recording. We do not want to set things up every time we
enter the meeting room for discussions. Now after a certain period of time, say 24 hours,
we take the output of the recorder and want to filter out the portions that are non-speech.

    They could be any kind of noise, music or sounds of people passing by in the gallery.

This task is accomplished by the speech activity detection module. The problem of speech
activity detection also goes by the name of voice activity detection.

There are various possible ways to determine whether an incoming clip is speech or not.
One of those ways could be cepstral analysis, as investigated by Haigh and Mason in "A
Voice Activity Detector Based on Cepstral Analysis", 1993.

Later in this thesis we discuss in profound detail what we mean by cepstral features;
for now it is sufficient to know that these are values extracted from a stationary portion
of the audio clip, for instance a 10 second window.


The idea proposed by Haigh and Mason is to detect the speech end-points, that is, the
points where the speech portion begins and ends, by cepstral analysis. This approach is

    strictly based on static explicit modeling of speech and non-speech. We will have to train

    a binary classifier to differentiate speech from non-speech based on some feature vectors

    extracted from the clip which the authors suggested should be cepstral features. As the

    reader will see later in the thesis, these are the same feature vectors we use for our text

    independent speaker identification engine.

The model used could be anything ranging from a Gaussian mixture model to a support
vector machine.

This is one way to tell speech from non-speech. The earliest algorithms for voice activity
detection used two common speech parameters to decide whether there is voice in a
speech frame:

1) Short term energy
2) Zero crossing rate

Short term energy in an audio frame can be determined from the log energy coefficient
in the MFCC feature vector set; it is the 0th coefficient of that set.

The zero-crossing rate is the rate of sign changes along a signal, i.e., the rate at which
the signal changes from positive to negative or back. This feature has been used heavily
in both speech recognition and music information retrieval and is defined formally as

\[ \mathrm{zcr} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}\{ s_t \, s_{t-1} < 0 \}, \]

where s is a signal of length T and the indicator function \(\mathbb{1}\{A\}\) is 1 if its argument A is
true and 0 otherwise.
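As a minimal sketch of these two classical parameters, assuming a mono signal already cut into frames held in NumPy arrays (the function names and threshold values are illustrative, not taken from any particular system):

```python
import numpy as np

def short_term_energy(frame):
    # Log of the frame energy; plays the same role as the 0th (log-energy) MFCC coefficient.
    return np.log(np.sum(frame.astype(float) ** 2) + 1e-10)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ (positive -> negative or back).
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

def frame_has_voice(frame, energy_threshold=-2.0, zcr_threshold=0.25):
    # Crude classical voice decision: energetic frames with few sign changes.
    return short_term_energy(frame) > energy_threshold and zero_crossing_rate(frame) < zcr_threshold
```

As the text goes on to note, fixed thresholds of this kind are easy to fool, which is what motivates the model-based approach of chapter 3.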

But, as can be understood, the use of these two parameters can still lead us to error. There
can be cases where there is high short term energy in a frame and it is still not
speech, for example Gaussian noise with a probability distribution such that the
intensity peaks every few seconds.

The same can be said for zero crossing rates: though uncommon, there can be noises where
the change from positive to negative happens very frequently, and so they can deceive our
system. To overcome these difficulties we have come up with a new algorithm for speech
activity detection, which is described in detail in chapter 3.


2.3 Speaker segmentation

Given an audio clip, speaker segmentation is the task of finding the speaker turn points.
Our segmenter is supposed to divide the audio clip into non-overlapping segments such that
each segment contains speech from only one speaker. Non-speech segments, we assume,
have already been filtered out by our speech activity detection module. The idea is better
illustrated in the following figure.

Figure 1: Segmentation of an audio clip

The figure shown above is the amplitude graph of an audio clip in which the number of
speakers exceeds one. We begin with one speaker putting forth his point; in between he is
interrupted by another speaker, and there is a region of overlapping speech from which the
second speaker takes over. This overlapping speech region is an instance of a turn point.
There is a little abuse of the term "point" here, as we are calling a whole overlap region a
point. We have to identify the positions in the audio clip where this handing over
of speakers takes place; the portions between such positions will, from now on, be called
homogeneous speaker segments.

    There are various ways to go about solving the problem of speaker segmentation.

• Segmentation using silence.
• Segmentation using divergence measures.
• Segmentation (and clustering) by performing frame level audio classification.
• Segmentation (and clustering) using an HMM decoder.
• Segmentation (and clustering) using direction of arrival.

    The last three methods are unified segmentation and clustering methods. They are also

    called internal segmentation methods at times.

The first two methods are called external segmentation methods, because the
ascertaining of identity follows the merging of segments into clusters, as opposed to internal
segmentation methods, where for every frame you first find out who the speaker is; there
you have essentially found the cluster without bothering about the segment at all.

    Now we will discuss each of these methods briefly.

2.3.1 Segmentation Using Silence

Segmentation using silence is a common sense method based on the
assumption that whenever a speaker change happens there must be a portion of silence in
between. This, however, cannot be said to hold true in all environments; for example, in a
parliament, speaker changes almost inevitably happen by one speaker forcing his entry into
another speaker's speech. So a speaker change did happen, but there was no intervening
silence, and hence we run the risk of losing some speaker turn points. Besides this,
the method of segmentation using silence is plagued by another difficulty. If we observe
the amplitude graph of any speech file closely, we notice that when a speaker speaks, he
does not just keep speaking all the time; he stops frequently, or the tonality of his voice
falls to silence, so even a continuous speaker segment contains many
intermediate points of silence. What this means for our purpose is that while we miss
some true speaker turn points, we also generate too many unnecessary segments.
The greater the number of segments, the greater the difficulty in clustering them. So we can
see why the method of segmenting speech using silence is not such a good idea, mainly
for two reasons:

1) It misses some true speaker turn points with overlapping speech.
2) It generates a much higher number of segments, i.e. false positives.

2.3.2 Segmentation Using Divergence Measures

Delacourt and Wellekens showed in "DISTBIC: a speaker-based segmentation for audio data
indexing" that using divergence measures for speaker segmentation can be useful. They
used the Bayesian Information Criterion as the divergence measure.

Segmentation using divergence measures is the state of the art. Let us first discuss what a
divergence measure is and which divergence measures one can use.

A divergence measure is fundamentally a tool to determine how similar or dissimilar
two things are; in our case these could be two successive audio frames or two successive
windows of audio frames. Two famous divergence measures are the Kullback-Leibler
divergence and the Bayesian Information Criterion. We have used the Bayesian Information
Criterion in our implementation, so we discuss it in detail in section 4.4. Right now
we will discuss the Kullback-Leibler divergence.


In probability theory and information theory, the Kullback-Leibler divergence (also
information divergence, information gain, relative entropy, or KLIC) is a non-
symmetric measure of the difference between two probability distributions P and Q. KL
measures the expected number of extra bits required to code samples from P when using
a code based on Q, rather than using a code based on P. Typically P represents the "true"
distribution of data, observations, or a precisely calculated theoretical distribution, while
Q typically represents a theory, model, description, or approximation of P.

Although it is often intuited as a distance metric, the KL divergence is not a true metric;
for example, it is not symmetric: the KL from P to Q is generally not the same as the
KL from Q to P.

For probability distributions P and Q of a discrete random variable, their KL divergence
is defined to be

\[ D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. \]

In words, it is the average of the logarithmic difference between the probabilities P and
Q, where the average is taken using the probabilities P. The KL divergence is only
defined if P and Q both sum to 1 and if Q(i) > 0 for any i such that P(i) > 0. If the
quantity 0 log 0 appears in the formula, it is interpreted as zero.

Now that we know what a divergence measure is, we can proceed with our discussion of
segmentation using divergence measures. We consider two windows and calculate
their similarity or dissimilarity index with our chosen divergence measure. If the
dissimilarity is above a particular threshold, determined by empirical experimentation, we
call that particular point a speaker turn point, as it separates the audio into two windows
which are different from each other.
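As a minimal sketch of this windowed test, here using the closed-form KL divergence between two single Gaussians fitted to adjacent windows of feature vectors (our own implementation uses BIC instead; the symmetrization and the threshold value are illustrative choices):

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    # Closed-form KL divergence D(N0 || N1) between two multivariate Gaussians.
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def is_turn_point(left_frames, right_frames, threshold=5.0):
    # Fit one Gaussian per window of feature vectors and flag a speaker turn when the
    # (symmetrized, since KL itself is not symmetric) divergence exceeds a threshold.
    mu_l, cov_l = left_frames.mean(axis=0), np.cov(left_frames, rowvar=False)
    mu_r, cov_r = right_frames.mean(axis=0), np.cov(right_frames, rowvar=False)
    score = gaussian_kl(mu_l, cov_l, mu_r, cov_r) + gaussian_kl(mu_r, cov_r, mu_l, cov_l)
    return score > threshold
```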

2.3.3 Segmentation using Frame Level Audio Classification

This is an example of an internal segmentation strategy. Frame level audio classification
means that for every frame of the audio we determine what kind of audio data it is:
is it speech, is it music, and, if it is speech, which speaker in our database does it come from?
This is more or less the strategy we follow in our implementation,
except that instead of looking at individual frames we look at a window of frames
which has been determined to contain speech from a single speaker. During our
experimentation with various statistical models we observed that looking at individual frames
rarely leads to the right answers; we have to look at a certain length of the audio, look
at a collection of frames and see which speaker the majority of points belong to. The
criterion for belonging may be proximity to a particular codebook code vector or a log-
likelihood computation, as in Gaussian mixture models.

2.3.4 Segmentation using Direction of Arrival (DOA)

In a multimodal corpus, where we have multiple distant microphones to record speech,
two parameters assume overwhelming significance in speaker diarization and speaker
identification: the direction of arrival and the time delay of arrival. The process has been
discussed at length in "Real time monitoring of participants' interaction in a meeting
using audio-visual sensors" by Busso and Narayanan, 2008.

The technique is called acoustic source localization and has been used widely in RADAR
and SONAR. The speech is recorded in a smart room containing a microphone array,
which is used for acoustic source localization. The approach is based on
TDOA, the time delay of arrival at the various microphones. The geometric inference of the
source location is calculated from this TDOA: first, pair-wise delays are estimated
between all the microphones; these delays are subsequently projected as angles into a
single axis system.

2.4 Speaker clustering

Speaker clustering is the next step after speaker segmentation. Let us rewind a little bit.
We were first given an audio clip, from which, using the speech activity detection
module, we filtered out the non-speech parts. Next, we used our speaker segmentation tool to
generate homogeneous speaker segments from it. Now, as one can observe, the segments
belonging to a single speaker can be strewn across the clip, and we have to label them as
coming from a single source. So we proceed to speaker clustering. We build a similarity
matrix over all the segments in the segment list, using a distance
metric to calculate the distance between segments. The segments which match best
are merged, and incrementally the identities of these clusters change. If we already
know the number of speakers present, we can choose to keep as many clusters as the
number of speakers. Otherwise, we choose a stopping criterion, which has to be decided
based on experimentation. When that stopping criterion is met, we stop merging
segments, and the number of clusters present at that point is taken as the number of speakers
present in our audio clip.
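A minimal sketch of the merging procedure just described, assuming each homogeneous segment is summarized by the mean of its MFCC vectors and the number of speakers is known in advance (the segment summary and the linkage rule are illustrative choices, not the thesis settings):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(segment_mfccs, n_speakers):
    # segment_mfccs: list of (frames x dims) MFCC matrices, one per homogeneous segment.
    # Each segment is reduced to its mean vector, then segments are merged hierarchically
    # until only n_speakers clusters remain; the label says which cluster a segment joined.
    X = np.vstack([seg.mean(axis=0) for seg in segment_mfccs])
    return AgglomerativeClustering(n_clusters=n_speakers, linkage="average").fit_predict(X)
```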

Popular approaches to speaker clustering include:

• Clustering using vector quantization.
• Clustering using iterative model training and classification.
• Clustering in a hierarchical manner using divergence measures.
• Clustering and segmentation using an HMM decoder.
• Clustering and segmentation using direction of arrival.

2.5 Speaker Identification

The last step is speaker identification. First the training part will be discussed.

1) We divided every second of the training clip into approximately 173 frames and
for each of these frames we calculated MFCC features.
2) Each MFCC feature vector consisted of 13 dimensions.
3) We used the set of these feature vectors to generate a codebook for every speaker
using the concept of vector quantization.
4) The codebook of a speaker consisted of the 16 feature vectors which best modeled the
set of feature vectors obtained for the clip.

Now we move on to the testing part. In testing:

1) We again divided the test audio clip into frames, 173 per second, and computed
the feature vectors per frame.
2) We matched each feature vector thus computed with the individual codebooks.
3) The codebook with the maximum matches is declared to be the one corresponding to
the identity of our test speaker.
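A minimal sketch of this codebook scheme, assuming per-frame MFCC matrices as NumPy arrays; k-means stands in here for whichever codebook training procedure was actually used, and the 16 codewords match the description above:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(train_mfcc, n_codewords=16):
    # Learn a 16-vector codebook from one speaker's training MFCC frames.
    return KMeans(n_clusters=n_codewords, n_init=10, random_state=0).fit(train_mfcc).cluster_centers_

def identify_speaker(test_mfcc, codebooks):
    # codebooks: dict speaker_name -> (16 x dims) codebook array.
    # Each test frame votes for the codebook holding its nearest codeword (Euclidean
    # distance); the codebook with the maximum matches gives the identity.
    names = list(codebooks)
    nearest = np.stack([
        np.linalg.norm(test_mfcc[:, None, :] - codebooks[n][None, :, :], axis=2).min(axis=1)
        for n in names])                       # shape: (n_speakers, n_frames)
    votes = np.bincount(nearest.argmin(axis=0), minlength=len(names))
    return names[int(votes.argmax())]
```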


    Chapter 3

    Speech Activity Detection and

    Implementation

    3.1 Introduction

Speech activity detection involves separating speech from:

1) Silence
2) Background noise in different ambiences
3) White Gaussian noise
4) Music
5) Crowd noise

The majority of algorithms used for speech activity detection fall into two
categories:

1. The noise level is estimated after looking at the entire file, and anything over and
above a particular decibel level is called speech.
2. According to the ambience, speech and non-speech training sets are taken, models
are trained on them and then these models are used for further classification.

Both of these kinds of algorithms suffer from serious drawbacks. The algorithms in
the first category fail to discriminate speech from non-speech when the noise is variable; they
assume the same noise characteristics throughout the audio. The algorithms falling in the second
category will fail to perform in an unfamiliar environment, and they need a lot of training
data.

The approach we propose here overcomes both these hurdles, though it has drawbacks of
its own. We want a speech detection algorithm which:

• does not require any training at all;
• is, nevertheless, able to grasp the difference between speech and non-speech in a
dynamically changing ambience.

Since we use Gaussian mixture models in our approach, a brief account of what Gaussian
mixture models really are is called for.

3.2 Gaussian Mixture Models

Many times our data distribution cannot be accurately modeled by a
single multivariate distribution; the sample point set might come from two different
Gaussian distributions. In that case, rather than modeling the dataset with a single
multivariate distribution, it is better to model the data as a mixture of two Gaussians,
where each Gaussian accounts for a certain fraction of the point set, called the
mixing proportion of that Gaussian in the Gaussian mixture model. The mean and
covariance matrices of each Gaussian can be completely independent, and we can also constrain
them as the case might be; in our case we let them be completely general.

    Figure 2: Gaussian Mixture Models

The diagram above shows a case where two individual Gaussians are present with different
means and covariances.


    3.3 Our algorithm for Speech activity detection

The algorithm consists of three simple steps:

• Divide the entire audio clip into fixed sized intervals of 15 seconds.
• Extract MFCC features from each of these intervals.
• Cluster the feature vectors obtained for each interval individually, using
Gaussian mixture models with two components.

    3.3.1 Observation

In the case of silence, repeated beats, fan noise or white Gaussian noise, the preponderance of
points belongs to one of the clusters: the mixing proportion remains highly skewed.
For speech frames, on the other hand, the mixing proportion stays even. This attribute of speech
frames becomes the basis of our classification strategy.

We want to narrow down the number of segments which could be speech. It is a negative
classification strategy: whatever is clustered evenly remains a candidate for speech, while we
can filter out silence, instrument noises, repeated beats and white noise. This effectively
narrows down our search space by a significant factor.
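A minimal sketch of this decision rule, assuming the MFCC frames of one 15-second interval are available as a NumPy array (the skew threshold is an illustrative value, not a figure taken from our experiments):

```python
from sklearn.mixture import GaussianMixture

def interval_could_be_speech(interval_mfcc, skew_threshold=0.8, max_iter=100):
    # Fit a two-component GMM to the frames of one fixed-length interval.
    # Highly skewed mixing proportions point to silence or steady noise and let us
    # discard the interval; roughly even proportions keep it as a speech candidate.
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          max_iter=max_iter, random_state=0).fit(interval_mfcc)
    return gmm.weights_.max() < skew_threshold
```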

    3.4 Experiments

3.4.1 Experiment 1: Fan noise
3.4.2 Experiment 2: Silence
3.4.3 Experiment 3: Fast paced speech
3.4.4 Experiment 4: Moderately paced speech

[Figures 3-6: fan noise, silence, fast paced speech, moderately paced speech]

3.4.5 Experiment 5: White noise

Figure 7: White noise

• Software generated white noise.
• Might appear in our audio due to some malfunction in the recording device or
network channel noise.
• The BIC value of the clustering always comes out to be negative.
• The algorithm often fails to converge for two components in 100 iterations.
• When it does converge, the mixing proportion is highly skewed.

    Summary of experimental results


    3.5 Advantages and Drawbacks

    Advantages

• Does not require any training at all, so we can apply it even in environments where
training based approaches would have failed for lack of training data.
• Since it treats every segment independently, it can adjust itself dynamically to
changing ambience properties.

Drawbacks

• The algorithm is computationally intensive: we have to calculate the feature vectors,
cluster them and compute the BIC for every segment of the file.
• It works well for relatively long periods of silence, but is not as robust for shorter
segments.


    Chapter 4

Implementation of Text Independent
Speaker Identification

    4.1 Introduction

    Our approach to handling the other three modules, other than speech activity detection,

    can be called internal segmentation and clustering.

We do not first segment and then cluster the audio clip. Instead, we take a portion of the
clip and try to identify which speaker it belongs to; if there is confusion, that is, if the
statistics revealed by our testing algorithms are not conclusive, we narrow the portion down
further. In this way we determine a minimum time span within which a speaker change is
not happening, and within it we establish the identity of the speaker. We can also take this
time to be 1 second and then determine individually, for each 1 second segment, which
speaker it belongs to.

    For doing so, we have two tools at our disposal.

1) A text independent speaker identification engine, and
2) The Bayesian Information Criterion (BIC)

Speaker identification systems are primarily of two types: text independent and text
dependent. The names are fairly self-explanatory: text independent speaker identification
systems are the ones which work for any test utterance, while text dependent speaker
identification systems work only with a fixed test utterance. The two kinds of systems classify
their training data based on widely different sets of speech parameters, resulting in
different accuracies too under similar circumstances. Accordingly, they find applications
in different problem domains.

An abstract layout of a speaker identification/verification engine, consisting of various
modules and stages, is shown in the following figures:
Figure 8 shows the training phase, while Figure 9 shows the testing phase.


The training phase, assuming a most general speaker identification/verification system,
consists of two fundamental modules: a speech parameterization module and a statistical
modeling module.

    The raw speech data received is first processed to extract some useful characteristic

    information through the speech parameterization module. These information parameters

    are collected, usually, at different points in time domain and frequency domain, to mark

    the variations in the speech, characteristic of a particular speaker.

The speech parameters thus obtained are then fitted to a statistical model of choice using
the statistical modeling module, in order to calculate the defining parameters of the
model corresponding to that particular speaker, for example the mean and variance in a
single Gaussian model. The choice of statistical model can vary; we have
experimented with two models, codebooks (vector quantization) and Gaussian mixture
models. The state of the art systems in speaker identification employ Gaussian mixture
models with great success.

The testing phase is preceded by the collection of training samples of all speakers in our
universe; those training samples are converted into corresponding speaker models, and, if
the statistical modeling demands it, a universal model is formed from the training
samples. Now, when we are given a test utterance, we calculate the speech parameters
using the speech parameterization module and then use a decision scoring module to
decide which of the available speaker models these parameters best match.

The choice of decision scoring module again depends on the statistical model we
use. When we use vector quantization, it is simply the number of points matched to
each codebook: the codebook which has the maximum points closest to the parameter set obtained
gives our output identity, the distance metric being the simple Euclidean distance.
When we use Gaussian mixture models as our statistical modeling tool, log-likelihood
becomes the decision score: we calculate the log-likelihood of each feature vector
obtained from the test utterance and see which speaker model it corresponds to best.

So, our discussion so far has established that there are fundamentally three variables in
a speaker identification system:

1) Speech parameterization
2) Statistical modeling
3) Decision scoring


    We will take them all one by one now, first giving the theoretical details, and

    then the experimental results.

Figure 8: Different modules in the training phase of a speaker identification system
(input speech -> speech parameterization module -> speech parameters -> statistical modeling -> model).

Figure 9: Different modules in the testing phase of a speaker identification system
(speech data from a given speaker -> speech parameterization module -> speech parameters ->
decision scoring against the speaker models from the database -> identity).

  • 8/6/2019 Speaker Diarization

    31/47

    23

4.2 Speech Parameterization: Feature Vectors

The speech parameterization module calculates or extracts useful speech parameters from
a raw audio clip. The popular term for the parameters thus obtained is feature vectors.
The most widely used feature vectors come from a particular class called cepstral features.
We will briefly discuss what exactly we mean by cepstral features and then give
the specifics of the feature vector set we are using.

Cepstral features based on filter-banks

The entire process of calculating filter-bank based cepstral features is shown
schematically, module by module, in Figures 10 and 11.

Figure 10: General schematic to calculate PLP or MF-PLP features
(input -> pre-emphasis -> windowing -> FFT -> ...).


Figure 11: Different modules employed in the calculation of cepstral features (MFCC):
pre-emphasis -> windowing -> FFT -> filter-bank -> 20*log -> cepstral transform.

    Now we will discuss all of these modules and their relevance in our work one by one.

1) Pre-emphasis: Emphasis is laid on a special section of the speech signal, namely the
higher frequency range of the spectrum. It is believed that the nature of speech
production attenuates the higher frequencies, thereby inducing a need to pre-emphasize
the signal, making up for the loss in the production process. In the cases we studied,
hardly any benefit was accrued by using pre-emphasis, so we have done away with this module.

2) Windowing: This is a crucial phase in the calculation of feature vectors. We
make a stationarity assumption, which means that if we
consider a window of the speech signal that is small enough, there will not be any
variation in the values of the feature vectors across that small window.

So we select a window beginning at the start of the speech signal, in our
case of 20 ms, then we shift the starting position of the moving window by 10 ms
and consider the next window of length 20 ms, which means that every two
consecutive windows have a 10 ms part in common. The choice of
window is again dependent on experimental evidence; we went with a triangular
window. Other options could have been Hamming or Hanning windows.
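A minimal sketch of this 20 ms / 10 ms sliding window scheme, assuming a mono signal and its sampling rate (the chosen taper, e.g. a triangular window, would then be multiplied into each frame before the FFT):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20, hop_ms=10):
    # Cut the signal into overlapping frames: 20 ms long, shifted by 10 ms,
    # so every two consecutive frames share a 10 ms portion.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])
```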

3) FFT: The next step is simply calculating the fast Fourier transform of each windowed (and possibly pre-emphasized) frame, giving a spectral vector per frame.

4) Filter-bank: The spectrum obtained after applying the FFT still contains a lot of unnecessary
detail and fluctuation, things we are not interested in. So, in order to obtain the features we are
really interested in, we multiply the spectrum thus
obtained with a filter-bank. A filter-bank is nothing but a collection of band pass
frequency filters. In essence we filter out all the unnecessary information and
keep only the frequencies that concern us. The knowledge of these particular
frequencies comes from our knowledge of the process of speech production. The


    spectral feature set MFCC receives its name from its filter-bank which is called

    the Mel scale frequency filter-bank. This scale is an auditory scale which is

    similar to the frequency scale of the human ear.

5) Discrete cosine transform: An additional transform is applied which, in generic terms,
we have called the cepstral transform; in our case it is the discrete cosine
transform, which, when applied to the result of the filter-bank operations, yields the final
cepstral feature vectors which are of interest to us.

Two other important features are the log energy and the delta (difference) of the log energy.
MFCC is a 13 dimensional feature set whose first coefficient is the log energy; we incorporated the
difference of the log energy as well in our feature set, which resulted in significantly
improved recognition rates. In effect we have incorporated the deltas corresponding to
all 13 feature vector dimensions.

There are a number of popular feature vector sets that can be extracted from an audio
clip, and different feature vector sets capture different properties of it. The most
widely used ones are:

1) MFCC, Mel frequency cepstral coefficients
2) RASTA-PLP
3) LPC, linear predictive coding

The feature vector set which we are using is the MFCC one together with its delta set.
The delta set essentially means the differentials of these feature vectors; that is, it captures how the
values of the MFCC feature vectors vary over time. This information, as it turns out, is
also vital to characterizing a speaker.


    4.2.1 MFCC

The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum
of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel
scale of frequency.

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up
an MFC. They are derived from a type of cepstral representation of the audio clip (a
nonlinear "spectrum of a spectrum"). The difference between the cepstrum and the mel-
frequency cepstrum is that in the MFC the frequency bands are equally spaced on the
mel scale, which approximates the human auditory system's response more closely than
the linearly spaced frequency bands used in the normal cepstrum. This frequency
warping can allow for a better representation of sound.

MFCCs are derived as follows:

1. Take the Fourier transform of a windowed excerpt of the signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
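A minimal sketch of this whole pipeline using the librosa library; the 13 coefficients, the delta features and the 20 ms / 10 ms windowing mirror the choices described above, while librosa itself is simply an illustrative tool, not necessarily what was used for the thesis:

```python
import numpy as np
import librosa

def mfcc_with_deltas(path, n_mfcc=13):
    # Returns a (frames x 2*n_mfcc) matrix: 13 MFCCs (the 0th acting as the log-energy
    # style term) plus their deltas, computed on 20 ms windows with a 10 ms hop.
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.020 * sr), hop_length=int(0.010 * sr))
    deltas = librosa.feature.delta(mfcc)
    return np.vstack([mfcc, deltas]).T   # one row per frame
```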


    4.3 Statistical Modeling

Now that we have our 23 dimensional feature vectors, 173 of them for every second of the
audio clip, we need to fit this data to a statistical model of choice: we want a
concise representation in order to make sense of this data. The two most widely used statistical
models are

1) Vector quantization, or codebook quantization, and
2) Gaussian mixture models.

We will discuss the theoretical details of both, one by one, and then detail the
experimental results obtained with each.

4.3.1 Vector Quantization

The algorithm is deemed to have converged when the assignments no longer change.

Voronoi diagrams

Without going into the details, a Voronoi diagram consists of Voronoi cells. Given a
set of points, the whole feature space is decomposed into a number of sections or cells, one
corresponding to each point. The points lying in the Voronoi cell of a point are those
which are closer to this particular point than to any other point in the given set.


Figure 12: Vector quantization (image source: http://www.data-compression.com/vq-2D.gif)

    4.3.2 Gaussian Mixture Models

At times the representation provided by vector quantization (codebook quantization) is
not adequate for modeling the variations in the tonality of a
particular speaker, so it seems like a good idea to allow multiple underlying
representations, with different probabilities, to model a particular speaker. This can be
achieved handsomely by Gaussian mixture models, which are the state of the art for speaker
recognition.

In statistics, a mixture model is a probabilistic model for representing the presence of
sub-populations within an overall population, without requiring that an observed data set
identify the sub-population to which an individual observation belongs. Formally,
a mixture model corresponds to the mixture distribution that represents the probability
distribution of observations in the overall population.

However, while problems associated with "mixture distributions" relate to deriving the
properties of the overall population from those of the sub-populations, "mixture models"
are used to make statistical inferences about the properties of the sub-populations given
only observations on the pooled population, without sub-population identity information.

Some ways of implementing mixture models involve steps that do attribute postulated
sub-population identities to individual observations (or weights towards such sub-populations),
in which case these can be regarded as types of unsupervised learning or
clustering procedures. However, not all inference procedures involve such steps.

The structure of a general mixture model can be understood as follows. A typical finite-
dimensional mixture model is a hierarchical model consisting of the following
components:

• N random variables corresponding to observations, each assumed to be distributed
according to a mixture of K components, with each component belonging to the
same parametric family of distributions but with different parameters.
• N corresponding random latent variables specifying the identity of the mixture
component of each observation, each distributed according to a K-dimensional
categorical distribution.
• A set of K mixture weights, each of which is a probability (a real number between
0 and 1), all of which sum to 1.
• A set of K parameters, each specifying the parameters of the corresponding
mixture component. In many cases, each "parameter" is actually a set of
parameters. For example, observations distributed according to a mixture of one-
dimensional Gaussian distributions will have a mean and variance for each
component. Observations distributed according to a mixture of V-dimensional
categorical distributions (e.g., when each observation is a word from a vocabulary
of size V) will have a vector of V probabilities, collectively summing to 1.

The general mixture model can then easily be converted into a Gaussian mixture model
by an appropriate choice of the component distributions and their parameters.

Parameter estimation: Expectation Maximization

We have used the expectation maximization (EM) algorithm for parameter estimation, or
mixture decomposition, in Gaussian mixture models.

    The expectation step

With initial guesses for the parameters of our mixture model, the "partial membership" of
each data point in each constituent distribution is computed by calculating expectation
values for the membership variables of each data point. That is, for each data point x_j and
distribution Y_i, the membership value y_{i,j} is

\[ y_{i,j} = \frac{a_i \, f(x_j;\theta_i)}{\sum_{k} a_k \, f(x_j;\theta_k)} . \]

The maximization step

With expectation values in hand for group membership, plug-in estimates are recomputed
for the distribution parameters.

The mixing coefficients a_i are the means of the membership values over the N data points:

\[ a_i = \frac{1}{N} \sum_{j=1}^{N} y_{i,j} . \]

The component model parameters θ_i are also calculated by expectation maximization,
using the data points x_j weighted by the membership values. For example, if θ_i is a mean μ_i,

\[ \mu_i = \frac{\sum_{j} y_{i,j}\, x_j}{\sum_{j} y_{i,j}} . \]

With new estimates for the a_i and the θ_i, the expectation step is repeated to recompute new
membership values. The entire procedure is repeated until the model parameters converge.
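A minimal sketch of EM-based speaker modeling and the log-likelihood scoring described in section 4.1, using scikit-learn's EM implementation (the number of mixture components per speaker is an illustrative choice; the thesis text does not fix it here):

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(training_mfcc, n_components=16):
    # training_mfcc: dict speaker_name -> (frames x dims) MFCC matrix.
    # One GMM per speaker, fitted with the EM procedure outlined above.
    return {name: GaussianMixture(n_components=n_components, covariance_type="full",
                                  random_state=0).fit(feats)
            for name, feats in training_mfcc.items()}

def identify_speaker_gmm(test_mfcc, speaker_gmms):
    # score() returns the average per-frame log-likelihood of the test vectors under a
    # model; the best-scoring speaker model gives the output identity.
    return max(speaker_gmms, key=lambda name: speaker_gmms[name].score(test_mfcc))
```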

4.4 Bayesian Information Criterion (BIC)

When going through every window of the given audio clip, it becomes necessary to
determine whether the window belongs to one speaker or there exists a segmentation point in
between; this we can achieve using the Bayesian Information Criterion.

The metric is an indicator of the acoustic dissimilarity between two sub-windows.


The question we fundamentally ask is whether the window is better modeled by one speaker alone or whether more speakers give it a better representation; a model with a higher number of independent parameters is penalized, with the penalty weighted by an adjustable factor λ.

•  Given that a model M is described by the statistical distribution θ, the BIC for a window W can be defined as

       BIC(M) = \log p(X \mid \theta) - \frac{\lambda}{2} \, D \, \log N

   1) X = {x_1, ..., x_N} is the series of audio feature vectors captured in the window W.
   2) D is the number of independent parameters present in θ.
   3) The second term is the penalty term and penalizes a model for its complexity.
   4) λ can be adjusted.
   5) The model with the higher BIC value is to be chosen (a sketch of this computation is given after the list).
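A rough numpy-based sketch of this computation for a single full-covariance Gaussian model of a window (a hypothetical illustration; the helper names and the use of one Gaussian per window are assumptions, and the window is assumed to hold several frames of multi-dimensional feature vectors):

    import numpy as np

    def gaussian_loglik(X):
        """Log-likelihood of window X (N frames x d dims) under its ML-fitted full-covariance Gaussian."""
        n, d = X.shape
        cov = np.cov(X, rowvar=False, bias=True)        # maximum-likelihood covariance estimate
        sign, logdet = np.linalg.slogdet(cov)
        return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

    def bic_gaussian(X, lam=1.0):
        """BIC of the single-Gaussian model of X, with adjustable penalty weight lam."""
        n, d = X.shape
        num_params = d + d * (d + 1) / 2                # mean plus symmetric covariance matrix
        return gaussian_loglik(X) - 0.5 * lam * num_params * np.log(n)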

    BIC in Segmentation

    \Delta BIC = BIC(M_1) - BIC(M_0)
               = \log p(X_L \mid \theta_L) + \log p(X_R \mid \theta_R) - \log p(X_W \mid \theta_W) - \frac{\lambda}{2} \, \Delta D \, \log N

•  Two models, M0 and M1, are defined.

•  Model M0 represents the scenario where t(test) is not a turn point, so the left and right sub-windows belong to a common distribution θ(W).

•  Model M1 represents the scenario where t(test) is a turn point; the left and right sub-windows then belong to different distributions θ(L) and θ(R).

•  It is assumed that the feature vectors follow Gaussian distributions. ΔD is the number of extra parameters introduced by M1; a positive ΔBIC indicates that t(test) is a speaker turn point (a sketch of this test follows the list).
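Continuing the numpy sketch from the previous section (again hypothetical; it reuses gaussian_loglik and assumes single full-covariance Gaussians for θ(W), θ(L) and θ(R)):

    def delta_bic(X, t, lam=1.0):
        """Delta-BIC at candidate turn point t inside window X; a positive value suggests a speaker change."""
        n, d = X.shape
        extra_params = d + d * (d + 1) / 2              # parameters of the one additional Gaussian in M1
        gain = gaussian_loglik(X[:t]) + gaussian_loglik(X[t:]) - gaussian_loglik(X)
        return gain - 0.5 * lam * extra_params * np.log(n)

    # Hypothetical usage: scan the interior of a window of MFCC vectors for turn points.
    # turns = [t for t in range(10, len(window) - 10) if delta_bic(window, t) > 0]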


    Chapter 6

    Experimental Results

    6.1 Datasets and Objective of the experiments

In the previous three chapters we have discussed how we went about implementing our speaker diarization system.

The primary objective of the experiments was to see whether the performance of our speaker diarization system eroded as the number of speakers in a conversation increased; for this it must be borne in mind that all other parameters were kept the same. We designed the following procedure to achieve this. We recorded speech samples from six different speakers, who were asked to read excerpts from Friedrich Wilhelm Nietzsche's Also Sprach Zarathustra. Five training samples of 30 to 45 seconds each were taken from each speaker, and ten testing samples were also taken. From this data, conversations were then synthesized or contrived: a testing sample from one of the speakers is taken, some silence is inserted before or after it, speech samples from other speakers are appended, and the process is repeated until the conversation clip has the desired length and other attributes (a rough sketch of this procedure is given below).
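The conversation synthesis can be sketched roughly as follows (a hypothetical illustration assuming 16 kHz mono 16-bit WAV recordings; the file names in the usage comment are made up and are not our actual corpus):

    import wave

    def read_frames(path):
        """Return the raw PCM frames and parameters of a mono 16-bit WAV file."""
        with wave.open(path, "rb") as f:
            return f.readframes(f.getnframes()), f.getparams()

    def synthesize_conversation(segment_paths, out_path, silence_seconds=1.0, rate=16000):
        """Concatenate per-speaker test segments, separated by silence, into one conversation clip."""
        silence = b"\x00\x00" * int(silence_seconds * rate)   # 16-bit mono silence
        pieces, params = [], None
        for path in segment_paths:
            frames, params = read_frames(path)
            pieces.append(frames)
            pieces.append(silence)
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            out.writeframes(b"".join(pieces))

    # Hypothetical usage with assumed file names:
    # synthesize_conversation(["spk1_test01.wav", "spk3_test04.wav"], "conversation.wav")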

The results of our experiments with two to six speakers in a conversation are summarized in the following table.


#Speakers    Codebook Accuracy    GMM Accuracy

2            85.4%                91.3%
3            81.3%                86.8%
4            74.5%                78.0%
5            73.3%                76.5%
6            71.2%                74.0%


    Chapter 7

    Conclusion

The experimental results clearly demonstrate that Gaussian mixture models are much more robust than vector quantization (codebook) models when it comes to maintaining performance as the number of speakers in a conversation increases. This can be attributed to the fact that a speaker's vocal cords behave differently under different conditions, such as when uttering different categories of phonetic sounds, so we need a model that accounts for the sub-populations within the feature-point population generated by a particular speaker; this mixture-model kind of representation is exactly what Gaussian mixture models provide.

There are different patterns or sub-populations within the dataset of feature vectors generated by a speaker, but these are not many. To optimize performance, the number of components in our mixture model has to equal the number of such sub-populations within a speaker's feature-vector set.

Moreover, it can be seen that even with Gaussian mixture models, the state of the art in speaker identification, the performance of our diarization system is not very encouraging; hence it becomes imperative to incorporate two other factors into a diarization system:

1) Make the corpus multi-modal and use direction of arrival and time delay of arrival to enhance the performance.
2) Incorporate visual cues, as audio alone has not given significantly encouraging results.


