Presentation attack detection in automatic speaker verification with deep learning
Juhani Seppälä
Master’s Thesis
School of Computing
Computer Science
April 2019
UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing
Computer Science
Student, Juhani Seppälä: Replay attack detection in speaker verification with deep learning
Master's Thesis, 74 p.
Supervisor of the Master's Thesis: PhD Tomi Kinnunen
April 2019
Abstract: In the context of information security, traditional methods of user authentication are based either on knowledge or on a physical object. There are, however, various situations in which the traditional authentication methods are either insufficient on their own or cannot be applied at all. The demand for alternative, biometrics-based authentication methods has been growing continuously, and today's users expect systems to provide both security and usability. Businesses and authorities need tools to curb fraud and abuse. Automatic speaker verification (ASV) is a biometric authentication method that utilises speech. For businesses, ASV offers a means of fraud prevention, and for the authorities it provides new tools, for example, for crime scene investigation. As voice-operated intelligent systems become more common, the need for voice-based authentication also grows. The ISO/IEC 30107-1:2016 standard defines a so-called presentation attack against biometric systems. Presentation attacks are a problem for all types of biometric authentication systems. One way to carry out such an attack is to replay recorded speech of the target person to the biometric authentication system. Several independent studies have found that the performance of these systems can be degraded with replayed recordings. The most advanced current systems utilise so-called i-vectors, whereas classical ASV systems were based on Gaussian mixture models and acoustic features. In this work, we study the performance of so-called deep learning methods for presentation attack detection.
Keywords: biometrics, automatic speaker recognition, spoofing, anti-spoofing, replay attack, ASVspoof, DNN, CNN
CCS concepts (ACM Computing Classification System, 2012 version): Security and privacy → Biometrics, Computing methodologies → Neural networks, Computing methodologies → Supervised learning by classification
UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing
Computer Science
Student, Juhani Seppälä: Replay attack detection in speaker verification with deep learning
Master's Thesis, 74 p.
Supervisor of the Master's Thesis: PhD Tomi Kinnunen
April 2019
Abstract: In the context of information security, the traditional means of user authentication involves either knowledge (password) or a physical token (badge or key). There are situations, however, where these are either not applicable or insufficient alone. The demand for alternative forms of authentication based on biometrics has been increasing, and today's users demand both security and convenience. Businesses and governments demand tools to combat fraud and abuse. Automatic speaker verification (ASV) is a biometric authentication method utilising speech data. For businesses, ASV allows for early fraud detection, while for law enforcement, techniques from ASV may be of use in forensics. And, as more voice-operated, intelligent systems become mainstream in society, the need for voice-based authentication increases. The ISO/IEC 30107-1:2016 standard defines a so-called presentation attack for biometric systems. Presentation attacks present a problem for all biometric systems. One method to perform a presentation attack against an ASV system is to replay a recording of the target speaker's speech to the biometric authentication system. Multiple independent studies have identified that ASV system performance can be degraded when replay samples are introduced. Current state-of-the-art systems for ASV utilise so-called i-vectors, while the classical systems were based on Gaussian mixture modelling of acoustic speech features. In this work, we investigate so-called deep learning approaches to replay attack detection.
Keywords: biometrics, speaker verification, spoofing, anti-spoofing, replay attack, ASVspoof, DNN, CNN
CCS concepts (ACM Computing Classification System, 2012 version): Security and privacy → Biometrics, Computing methodologies → Neural networks, Computing methodologies → Supervised learning by classification
Acronyms and abbreviations
ADC Analog-to-digital conversion
ASR Automatic speaker recognition
ASV Automatic speaker verification
CNN Convolutional neural network
CQCC Constant-Q cepstral coefficients
CQT Constant-Q transform
DCT Discrete cosine transform
DNN Deep neural network
DFT Discrete Fourier transform
GMM Gaussian mixture model
EER Equal error rate
EM Expectation-maximisation
FFT Fast Fourier transform
HFCC High frequency cepstral coefficients
LCNN Light convolutional neural network
LDA Linear discriminant analysis
LFCC Linear frequency cepstral coefficients
LPC Linear prediction coefficients
LPCC Linear prediction cepstrum coefficients
LVM Latent variable model
MAP Maximum a posteriori
MFCC Mel-frequency cepstral coefficients
MFM Max-feature-map
MLE Maximum likelihood estimate/estimation
MLP Multi-layer perceptron
MSE Mean squared error
PLDA Probabilistic linear discriminant analysis
RASTA Relative spectral processing
ReLU Rectified linear unit
STFT Short-time Fourier transform
SVM Support vector machine
TVM Total variability model
UBM Universal background model
Mathematical notation
a The vector a.
M The matrix M.
p(x) Probability density on x.
N (µ,Σ) Multivariate normal density with mean µ and covariance Σ.
I The identity or unit matrix.
Contents
1 Introduction 1
2 Machine learning 4
2.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Some properties of machine learning . . . . . . . . . . . . . . . . . . 6
2.4 Latent variable models and the Gaussian mixture model . . . . . . . . 8
2.4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.1 The classic multi-layer perceptron . . . . . . . . . . . . . . . 15
2.6.2 Implications of (32) and discussion . . . . . . . . . . . . . . 21
2.6.3 Convolutional neural networks . . . . . . . . . . . . . . . . . 24
3 Speaker recognition and verification 29
3.1 Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Authentication and biometrics . . . . . . . . . . . . . . . . . 31
3.1.2 Speaker verification . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Speech processing and feature extraction . . . . . . . . . . . . . . . . 32
3.2.1 Mel-frequency cepstral coefficients (MFCCs) . . . . . . . . . 34
3.2.2 Constant Q cepstral coefficients (CQCCs) . . . . . . . . . . . 41
3.2.3 The spectrogram . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.4 Other features . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Speaker modelling and classification . . . . . . . . . . . . . . . . . . 45
3.3.1 Gaussian mixture model approaches to ASV . . . . . . . . . 46
3.3.2 Linear statistical model approaches . . . . . . . . . . . . . . 48
3.3.3 Deep learning approaches to ASV . . . . . . . . . . . . . . . 51
3.4 Vulnerabilities and countermeasures . . . . . . . . . . . . . . . . . . 52
3.5 Deep learning in replay attack detection . . . . . . . . . . . . . . . . 52
3.6 System measurement and evaluation . . . . . . . . . . . . . . . . . . 56
4 Experimental set-up and results 58
4.1 The ASVspoof 2017 v2 data . . . . . . . . . . . . . . . . . . . . . . 58
4.2 The results and discussion . . . . . . . . . . . . . . . . . . . . . . . 60
5 Conclusion 64
1 Introduction
The expectation for high usability and the demand for alternative forms of user authen-
tication have been ever increasing and today’s users expect seamless and hassle-free
access to various services. Similarly, due to the increasing demand for services requiring a high degree of security that were traditionally handled on-site, such as banking and various governmental services, service providers have long been interested in the ability to enhance and, in some cases, simplify their user authentication schemes. Automatic speaker verification (ASV), as an application of automatic speaker recognition (ASR), makes it possible to use a biometric that is relatively easy and, more importantly, non-intrusive to collect: human speech. The field of ASV is still under
rapid change, with recent developments pointing towards an interesting future from
the perspective of the so-called deep learning evolution of machine learning, where
deep neural architectures are studied for modelling and inference.
Key use-cases for ASV can be found in off-site electronic services, business customer
service, as well as in law-enforcement and forensics [1]. Untapped potential remains in biometric authentication (including speech), considering all the modern-day personal computing devices. Similarly, voice-based interfaces, such as Apple Siri, have been rising in popularity. Many modern smart devices already include biometrics, either in the form of facial recognition or in the form of fingerprint detection. ASV adds another possible layer of security to these devices.
The threat of replay attacks on ASV systems has been identified in separate studies [2, 3, 4, 5]. A replay attack against an ASV system is initiated by playing a recording of a person's speech while interfacing with the target ASV system. Figure 1 demonstrates the relative ease of obtaining high-quality samples of the speaker of interest in the age of the smartphone. Figure 2 illustrates the relative ease of launching a replay attack. In [3], it was shown that systems do not fare well under high-quality impostor attacks, where even the state-of-the-art systems were heavily impacted
by replay data. A replay attack can be thought to be of special interest due to the low
requirements of performing this kind of an attack. Other types of attacks are noted
to require either high-level technical sophistication or special skills in impersonation.
Speech synthesis and voice conversion for the purpose of fooling ASV systems have
been studied more extensively in comparison to the relatively simple replay attack, but
the recent ASVspoof 2017 challenge [2] has refocused the interest of researchers in
Figure 1: A mock scenario where a high-quality recording of the target's speech is obtained. The attacker (right) lures the victim into a (preferably) lengthy conversation, during which they collect speech data for later use during an attack. Disclaimer: the depicted situation is unrelated to the ASVspoof data collection process.
Figure 2: Two off-the-shelf devices are needed to launch a replay attack against a telephone-based ASV system. The recording and playback device (left) is used to play back pre-recorded audio from the target speaker, while the device on the right is used to make the call.
this area, underlining the criticality of the problem.
Classical ASV systems [6, 7] are based on generative models known as Gaussian mix-
ture models (GMMs), and acoustic features, the most common of which are the Mel-
frequency cepstral coefficients (MFCCs) [8]. More recently, so-called Constant-Q cepstral coefficients (CQCCs) [9] have been shown to be successful in ASV replay attack
detection. The current state-of-the-art system in ASV is the factor-analysis-based i-vector system [10], in which the idea of the GMM supervector is utilised. This work
is ultimately motivated by some results obtained from the ASVspoof 2017 challenge.
Several so-called deep learning approaches were presented for ASV replay attack de-
tection [11, 12]. In the former, a so-called light convolutional neural network (LCNN)
[13] was used to directly distinguish replay samples from genuine samples, while in
the latter, the focus was on combining a support vector machine (SVM) [14] classifier
with a CNN feature extractor.
This thesis is structured as follows. Section 2 covers some of the most important ma-
chine learning topics at the heart of ASV. Section 3 offers a contextualisation of the
ASV problem within automatic speaker recognition (ASR) and discusses some of the
popular speech processing, feature extraction and speaker modelling techniques. Sec-
tion 4 describes the system, including the parameters and the overall set-up of the
experiments. Finally, section 5 concludes.
2 Machine learning
This section serves the purpose of an introduction to the key concepts and terminology
behind the methods used in automatic speaker recognition. The reasoning here is to
steer the speaker recognition and verification discussion towards certain important in-
novations that led to the forming of particular methods and to focus on the differences,
strengths and weaknesses of these methods. This section starts with the general ideas
and terminology around machine learning. Subsection 2.4 covers the necessary back-
ground on a specific unsupervised model, known as the Gaussian mixture model, while
subsection 2.5 covers the background on factor analysis. Subsection 2.6 covers deep
learning. These three background subsections each provide the necessary background for three major approaches to speaker recognition and spoofing detection, each of which will be discussed in section 3.
Machine learning refers to a broad class of methods and principles for the purpose of
making future predictions based on observed data and for the purpose of uncovering
underlying structures within data [15]. There are several different types of machine
learning problems [15]. In supervised learning, the task is to learn a mapping function
between the input space and the output space, while in unsupervised learning the learn-
ing task is performed without any mapping (or label) information. Some methods may
combine aspects of both (semi-supervised learning). Finally, there is reinforcement
learning, in which the learning involves reward and punishment mechanisms that guide
the model towards the wanted response determined by the designer. In the context of
this thesis, the two relevant classes of machine learning are supervised and unsupervised learning, both of which we will review in the following two subsections.
Figure 3: Conceptual illustration of supervised (left) and unsupervised (right) learning (based on similar ideas in multiple sources, including [16])
2.1 Supervised learning
Supervised learning tasks can be broadly split into regression and classification tasks.
In the case of classification, given the inputs x_i ∈ X, i ∈ {1, ..., N}, drawn from a training set X, and training outputs y_i ∈ {1, ..., C}, the task is to learn a mapping between the inputs and the outputs [15]. Here, C denotes the number of classes and N the number of training observations. In the case that the training outputs are continuous, the task becomes a regression problem. An individual x_i is also referred to as the features, or as a feature vector, with the dimensionality of the training data often exceeding 1.
The dimensionality of the training data depends on the learning task and availability of
suitable data, ranging from single-digit to hundreds or even thousands. As an example
of low-dimensional training data, consider that we would record the height, weight and
the shoe size of a person. On the other hand, a high-dimensional feature vector could
contain, for example, the raw pixel values of a digital image. The desired outputs are
known as class (or label) information. The number of classes determines whether a
classification problem is a binary classification or multi-class classification problem.
Additionally, if an input may be classified into multiple classes, the task is known
as multi-label classification. As an example of a binary classification problem, con-
sider a situation where the goal is to classify a speech sample into either pre-recorded
(spoofed) speech or live speech, which is also the learning task this thesis focuses on.
An example of multi-class classification could be a situation where the goal is to predict
the labels of digital images.
Figure 4: An imaginary image labelling scenario, where the most probable labellings for the two example images are shown.
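The classification setting above can be made concrete with a small sketch: a nearest-centroid rule that learns one centroid per class from labelled training pairs (x_i, y_i) and assigns a new input to the class of the closest centroid. The data and class structure below are hypothetical toy values, not from any corpus discussed in this thesis.

```python
# Minimal sketch of a supervised binary classifier: a nearest-centroid
# rule over two-dimensional feature vectors (hypothetical toy data).

def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def nearest_centroid_classifier(X, y):
    """Learn one centroid per class; predict by smallest squared distance."""
    classes = sorted(set(y))
    centroids = {c: centroid([x for x, label in zip(X, y) if label == c])
                 for c in classes}

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def predict(x):
        return min(classes, key=lambda c: sq_dist(x, centroids[c]))
    return predict

# Toy training set: class 0 clusters near (0, 0), class 1 near (5, 5).
X_train = [[0.0, 0.2], [0.3, -0.1], [5.1, 4.9], [4.8, 5.2]]
y_train = [0, 0, 1, 1]
predict = nearest_centroid_classifier(X_train, y_train)
print(predict([0.1, 0.0]))  # a point near the class-0 cluster -> 0
print(predict([5.0, 5.0]))  # a point near the class-1 cluster -> 1
```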
Both supervised and unsupervised learning have been widely utilised in speaker recog-
nition, and subsequently, in spoofing detection, and often we will see these used to-
gether in some manner to build the actual classifier that solves the problem of interest.
The general speaker recognition problem implies a multi-class classification task, given
that the goal is to tie an incoming speech sample to a single class, i.e., the most likely
speaker. Spoofing detection, in turn, is a binary classification problem, as the task is to determine whether a given speech sample is pre-recorded speech played back from an artificial audio device, or speech given by an actual human. In this work, spoofing detection
is discussed as an important sub-problem within the wider context of speaker recogni-
tion and, as we will see, the methods for spoofing detection largely rely on the findings
from work done on the general speaker recognition task. A key distinction between
these two problems in terms of the classification task itself is the way in which the
label information is used: in the general task, labels are used to tie samples to individ-
uals, whereas in spoofing detection labels are used to differentiate samples based on
the emitting source (artificial device or human speaker).
2.2 Unsupervised learning
Unsupervised learning differs fundamentally from traditional supervised learning in that the label information may be completely omitted from the process [16]. Instead
of thinking in terms of pairs of training samples and the corresponding target values
or classes, in unsupervised learning we are interested in the underlying structure of
the data (clustering) or the hypothesised statistical process that is thought to be related
to the observed data (density estimation) [17]. Dimensionality reduction is also an
important field utilising the concept of unsupervised learning, where the idea is to
find a lower-dimensionality representation for the data that still retains the important
variance within the data [16]. Examples of unsupervised learning include the classic
clustering techniques such as K-means, factor analysis and mixture models such as the
Gaussian mixture model, which is discussed later.
2.3 Some properties of machine learning
While some aspects of machine learning can be difficult to categorise, certain attributes
can be assigned for each learning problem at hand. In [17], it is noted that each learning
problem may be described in terms of three main attributes: 1) the type of the learning
task, 2) the metric that is used to evaluate the performance of the model, and 3) the
nature of the learning process: supervised or unsupervised. We have already discussed
two of these: the task can be anything from predicting a single real number (regression) to density estimation, where we try to approximate an unknown probability density; and the nature of the learning, or the type of experience [17] used for the learning, which is typically categorised as either supervised or unsupervised.
Finally, we have the metric, which in classification is usually related to the accuracy of
the model predictions, and for real-valued predictions in regression-like tasks may be
some sort of distance metric. This is related to the empirical risk minimisation (ERM)
principle discussed below.
Most supervised learning is based on empirical risk minimisation (ERM) [18, 19], in
which the idea is to minimise the expected loss given our model predictions and the
training data [18]:
E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(f(\mathbf{x}_i, \mathbf{w}), \mathbf{y}_i), \qquad (1)
where L is a loss function, f(xi,w) the model’s prediction for a particular input train-
ing sample, given the model parameters w, and yi the training label associated with
that training sample. Under the principle, we should choose a model for which (1)
is lowest. The loss function, often also referred to as the objective function or cost function, is a mathematical expression of the wrongness we want to associate with a particular training sample and label pair; the form of the loss depends on the learning task.
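As a concrete illustration of (1), the following sketch computes the empirical risk of a simple linear model under a squared-error loss; the model form, the weight values and the data are hypothetical.

```python
# Sketch of the empirical risk in (1): the average loss of a model's
# predictions over the training set, here with a squared-error loss
# and a simple linear model f(x, w) = w * x (hypothetical data).

def squared_loss(prediction, target):
    return (prediction - target) ** 2

def empirical_risk(f, w, X, y, loss):
    """E(w) = (1/N) * sum_i loss(f(x_i, w), y_i)."""
    N = len(X)
    return sum(loss(f(x, w), t) for x, t in zip(X, y)) / N

f = lambda x, w: w * x
X, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # targets follow y = 2x exactly

print(empirical_risk(f, 2.0, X, y, squared_loss))  # perfect fit -> 0.0
print(empirical_risk(f, 1.0, X, y, squared_loss))  # higher risk; ERM prefers w = 2.0
```

Under the ERM principle, among candidate weights we would choose the one with the lowest empirical risk, here w = 2.0.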
The key issue that machine learning, and especially supervised learning, is confronted
with is the issue of generalisation [17]. Specifically, we have a set of training samples that give us a glimpse into some unknown statistical process, and we want to build a model that is able to give reasonable guesses when confronted with completely new data. The measure that gives us an idea of the goodness of our model in this setting is the generalisation error (or test error), which is measured on a disjoint test set. In most machine learning settings, the development of a model involves first using the training data to reduce the training error towards zero and then testing the model on the separate test set. We want both errors to be as small as possible, and the two errors should follow each other in a good model [17]. When the training error itself remains high, the model underfits; when the training error is low but the test error is substantially higher, the model overfits.
2.4 Latent variable models and the Gaussian mixture model
In a latent variable model (LVM), the observed data is assumed to be affected by one or more unobservable, or latent, variables or factors [20, 15]. Normally, in LVMs and the related topic of factor analysis, the goal is to explain the variance of the observed data with an arbitrary number of unobserved variables, which are found by means of factor analysis. The ideas behind LVMs were originally developed in the social sciences for behavioural modelling [20, 21], but have also found widespread use within machine learning. Factor analysis will be discussed in more detail in subsection 2.5.
The Gaussian mixture model (GMM), or mixture of Gaussians [15, 16], consists of
multiple multivariate Gaussian distributions, each with a mean µk and a covariance
matrix Σk. A GMM is formalised as a weighted sum of its mixture components:
p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad (2)
where \mathcal{N} is the probability density function (pdf) of the multivariate Gaussian (Normal) distribution (MVN):

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] \qquad (3)
In the MVN, the mean vector, µ, defines the centre point of the distribution within
the D-dimensional space that it lies, and the covariance matrix, Σ ∈ RD×D, defines
how the probability density behaves around the mean, i.e., how the probability mass
is distributed. |Σ| is the determinant of Σ. The covariance of a multivariate random
variable is given by
\mathrm{cov}[\mathbf{x}] = \mathbb{E}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^{\mathsf{T}} \right], \qquad (4)
where for the Gaussian distribution we have E[x] = µ. The covariance can be either
general, diagonal or isotropic (spherical) [16], of which the first is the least restricted in terms of the number of parameters. The general covariance matrix follows its definition in (4); however, we may restrict it to the diagonal matrix diag(σ_i²), so that we have a matrix containing only the individual variances of each dimension of x. Lastly, in the most restrictive option, we have Σ = σ²I, so that the covariance is simply the identity matrix scaled by a single variance parameter. The differences between these are
illustrated in Figure 5. The restricted covariances are useful due to the requirement of
computing the inverse of the covariance inside the exponent of (3), which is also known
as the precision matrix [15]. In (2), K denotes the number of mixture components and
Figure 5: Three multivariate normal distributions in 3D and 2D plots. The leftmost distribution is one with the general-case covariance matrix, while the central and rightmost distributions have spherical and diagonal covariances, respectively. We see that the general-case covariance may have an arbitrary angle for the direction of the highest variance, while the diagonal one is restricted to follow one of the axes for the shape of its variance. Lastly, the spherical covariance has a clear circular shape. Plots generated with GNU Octave (https://www.gnu.org/software/octave/) mesh and contour plots.
D the dimensionality of the vectors x_i. Further, the π_k are known as the mixing weights or component priors. Two probabilistic constraints, 0 ≤ π_k ≤ 1 and \sum_{k=1}^{K} π_k = 1, must be satisfied. The π_k represent the Bayesian prior knowledge we have regarding the training data. The model encapsulates a 1-of-K coded latent variable z_k, also known as a one-hot vector, where one of the vector values is one and the rest are zeroes. This one-hot vector can be thought of as tying each observation to a mixture component, with the index of the one in the vector pointing to the kth mixture component. The constraint
\sum_k z_k = 1 must hold. For the mixing weights, π_k, we have

p(z_k = 1) = \pi_k. \qquad (5)
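To make the mixture density concrete, the sketch below evaluates (2) at a single point for a hypothetical two-component mixture, using diagonal covariances so that the MVN pdf (3) factorises into one-dimensional Gaussians; all parameter values are illustrative.

```python
import math

# Sketch: evaluating the GMM density (2) at a point. Diagonal covariances
# are assumed, so the MVN pdf (3) factorises into per-dimension 1-D
# Gaussians (illustrative, hand-picked parameter values).

def mvn_diag_pdf(x, mu, var):
    """Multivariate normal pdf with diagonal covariance diag(var)."""
    log_p = -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                       for xi, m, v in zip(x, mu, var))
    return math.exp(log_p)

def gmm_pdf(x, weights, means, variances):
    """p(x | theta) = sum_k pi_k * N(x | mu_k, Sigma_k), as in eq. (2)."""
    return sum(w * mvn_diag_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Two-component mixture in 2-D; the mixing weights satisfy the
# constraints 0 <= pi_k <= 1 and sum_k pi_k = 1.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

p = gmm_pdf([0.0, 0.0], weights, means, variances)
print(p)  # dominated by the first component, centred at the origin
```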
The parameters of a GMM can be trained with an iterative algorithm known as the
expectation-maximisation (EM) algorithm [22, 15, 16]. The algorithm consists of two
steps, computed at each iteration. The E-step (expectation step) comes first in the iteration. Here, inference on the latent variables is done based on the current model parameter values: for a model such as the GMM, this means computing the expected component assignments (the responsibilities) for each observation under the current parameters. The second, maximisation, or M-step, is an optimisation of the parameters given these expected assignments. The purpose of the EM algorithm is to maximise the log-likelihood of the
Figure 6: GMM training process via the EM algorithm: here, starting from a random initial position, the mixture component parameters are shifted towards the clusters. In this toy example, we conveniently have three data clusters and three mixture components.
observed data, x_i, and the missing or hidden data, z_i, given the parameters θ. The likelihood is a function that indicates how reasonable our model is given the data we observe (our training data), where a higher likelihood is in favour of our model, and vice versa. It is given by
\ell(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \log \left[ \sum_{\mathbf{z}_i} p(\mathbf{x}_i, \mathbf{z}_i \mid \boldsymbol{\theta}) \right] \qquad (6)
As noted in [15], due to the logarithm in front of the sum, the complete data log-
likelihood is used:
\ell_c(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i, \mathbf{z}_i \mid \boldsymbol{\theta}) \qquad (7)
The complete data is the set of random variables {X, Z}, where X is the observed data and Z the unobserved data (x_i ∈ X and z_i ∈ Z). While Z is not directly observable, it
is assumed that there is exactly one Gaussian component in the mixture that generated a
particular observed sample in X . This data-generating process is encoded in Z. As the
zi are latent and therefore unknown, this log-likelihood cannot be directly evaluated.
The EM algorithm solves this problem such that, instead of evaluating (7), we evaluate
its expectation [15]:
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{t-1}) \triangleq \mathbb{E}\left[ \ell_c(\boldsymbol{\theta}) \mid \mathcal{X}, \boldsymbol{\theta}^{t-1} \right] = \mathbb{E}\left[ \sum_i \log p(\mathbf{x}_i, \mathbf{z}_i \mid \boldsymbol{\theta}) \right] = \sum_i \sum_k r_{ik} \log \pi_k + \sum_i \sum_k r_{ik} \log p(\mathbf{x}_i \mid \boldsymbol{\theta}_k) \qquad (8)
Note the introduction of the variable zi for the assumed unobserved data. Instead of
evaluating a likelihood, in the EM algorithm we evaluate the expectation of the joint
distribution of the observed data (the xis) and the unobserved data, represented by the
zis. We see that the sufficient statistics, means and covariances, appear only in the
right-hand sum. This is utilised in the steps of the algorithm. Here, t refers to the
current iteration of the algorithm and θt−1 refers to the parameters of the previous
iteration. Q is a new function known as the auxiliary function. The parameters θt in
the M step are optimised via the following:
\boldsymbol{\theta}^{t} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} \; Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{t-1}), \qquad (9)
that is, we want to find the model parameters that maximise the expected log-likelihood
in (8). In the EM-algorithm, maximum likelihood estimation (MLE) [23] is used to find
the parameters. The basic principle of MLE is that, given the log-likelihood function and the data, we obtain an estimate (the maximum likelihood estimate) for each parameter of the model. The estimator for a given parameter is obtained by setting the derivative of the log-likelihood function with respect to that parameter to zero and solving for the parameter. This estimator can then be used to compute the estimate for the parameter of interest.
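The MLE principle can be checked numerically: for a one-dimensional Gaussian with known variance, the zero-derivative condition on the log-likelihood yields the sample mean as the estimate of µ, and the log-likelihood at that value is no smaller than at nearby values. The data below are hypothetical.

```python
import math

# Sketch of maximum likelihood estimation: for a 1-D Gaussian with known
# variance, setting the derivative of the log-likelihood with respect to
# mu to zero gives the sample mean as the MLE. The check below compares
# the log-likelihood at the MLE against nearby values (hypothetical data).

def gaussian_log_likelihood(data, mu, var=1.0):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

data = [1.2, 0.8, 1.5, 0.9, 1.1]
mu_mle = sum(data) / len(data)  # closed form from the zero-derivative condition

ll_at_mle = gaussian_log_likelihood(data, mu_mle)
# The sample mean maximises the log-likelihood over mu.
assert all(gaussian_log_likelihood(data, mu_mle + d) <= ll_at_mle
           for d in (-0.5, -0.1, 0.1, 0.5))
print(round(mu_mle, 3))  # 1.1
```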
Now, the E-step in the EM iteration, based on (8), is done via [15]:
r_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\theta}_k^{t-1})}{\sum_{k'=1}^{K} \pi_{k'} \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\theta}_{k'}^{t-1})}. \qquad (10)
That is, for each observed data point i, the responsibility that the kth component takes
for this data point is obtained with the Bayes rule. The responsibility computation
can be seen as a soft clustering of the data points, in contrast to hard clustering in
the K-means algorithm [16]. Thus, if we wanted to use the EM-GMM framework for
clustering purposes, we could select the data point that has the largest responsibility
value given each cluster to obtain cluster memberships. We will discuss this similarity
briefly later. The M-step is done in multiple phases. First, the mixing weight for the
kth component is set to be the proportion of the observed data that has been assigned
to this component [15]:
\pi_k = \frac{1}{N} \sum_i r_{ik} = \frac{r_k}{N} \qquad (11)
The new mean vector for the kth component is obtained by taking the mean over all
data points weighted by the responsibility of the kth component for each data point
[15]:
\boldsymbol{\mu}_k = \frac{\sum_i r_{ik} \mathbf{x}_i}{r_k} \qquad (12)
The updated covariance matrix is obtained by following the MLE-reasoning for
the covariance parameter, in which we set the derivative of the log-likelihood,
l(xi|πk,µk,Σk), with respect to the parameter of interest (here Σk) to zero and solve
for that parameter. This results in the form [15]:
\boldsymbol{\Sigma}_k = \frac{\sum_i r_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{\mathsf{T}}}{r_k} = \frac{\sum_i r_{ik} \mathbf{x}_i \mathbf{x}_i^{\mathsf{T}}}{r_k} - \boldsymbol{\mu}_k \boldsymbol{\mu}_k^{\mathsf{T}} \qquad (13)
Assuming that the convergence threshold is set appropriately, the algorithm is guar-
anteed to end up in a local maximum in terms of the log-likelihood [15, 16]. For
the initialisation of the model in terms of the means and covariances that describe the
mixture components, it is possible to use the K-means algorithm [24, 15, 16]. Other
approaches include random initialisation and farthest point clustering [15].
Algorithm 1 The naive EM algorithm for GMMs
1: Initialise each µ_k, Σ_k, π_k, the convergence threshold t and the update variable u, and evaluate the initial log-likelihood.
2: while u ≥ t do
3:     Evaluate the r_ik.                                          ▷ E-step
4:     Using the new r_ik, update π_k (11), µ_k (12) and Σ_k (13). ▷ M-step
5:     Re-evaluate the log-likelihood for the updated model and set the increase in log-likelihood to be u.
6: end while
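The EM procedure above can be sketched for a one-dimensional GMM as follows: the E-step implements the responsibilities of (10) and the M-step the updates (11)-(13). This is an illustrative toy implementation on hypothetical data, with a fixed iteration count instead of a convergence threshold and a simple deterministic initialisation, not a production one.

```python
import math
import random

def normal_pdf(x, mu, var):
    """1-D Gaussian probability density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, K=2, iters=50):
    # Initialisation: equal weights, means spread across the data range,
    # unit variances (a simplifying assumption for the sketch).
    pis = [1.0 / K] * K
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * k / (K - 1) for k in range(K)]
    variances = [1.0] * K
    for _ in range(iters):
        # E-step: responsibilities via Bayes' rule, eq. (10).
        resp = []
        for x in data:
            num = [p * normal_pdf(x, m, v)
                   for p, m, v in zip(pis, mus, variances)]
            Z = sum(num)
            resp.append([n / Z for n in num])
        # M-step: parameter updates (11)-(13).
        for k in range(K):
            rk = sum(r[k] for r in resp)
            pis[k] = rk / len(data)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / rk
            variances[k] = sum(r[k] * (x - mus[k]) ** 2
                               for r, x in zip(resp, data)) / rk
    return pis, mus, variances

# Two well-separated 1-D clusters around 0 and 10 (synthetic toy data).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(100)] +
        [random.gauss(10.0, 1.0) for _ in range(100)])
pis, mus, variances = em_gmm_1d(data)
print(sorted(round(m) for m in mus))  # component means recovered near 0 and 10
```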
2.4.1 Discussion
The EM-GMM scheme described earlier can be seen as a more robust way of doing
clustering that is conceptually close to the K-means algorithm (in fact, [16] notes that
K-means is a special case of GMM-EM). In the case of GMMs, we get a probabilistic
alignment (soft alignment) of each data point belonging to each mixture component,
whereas in the case of K-means, each data point either belongs to a cluster or does not
(hard alignment). These are often called soft clustering and hard clustering, respec-
tively.
In K-means clustering [25, 26], we first select k points at random and then iterate two steps: in the first, we assign data points to their nearest cluster centre according to some distance metric (usually the Euclidean, or L2, metric), and in the second step we set each cluster centre to be the mean of all the points assigned to it. The
first step can be formalised as follows [16]:
r_{ik} = \begin{cases} 1 & \text{if } k = \operatorname*{arg\,min}_j \lVert \mathbf{x}_i - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise,} \end{cases} \qquad (14)
where we get a one-hot encoding for the cluster assignments, anagolously to the re-
sponsibilities in EM-GMM. The updated cluster means are obtained such that [16]
µk =
∑i rikxi∑i rik
. (15)
Both (14) and (15) result from the objective or loss function [16]:

J = \arg\min_{r_{ik},\, \mu_k} \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \| x_i - \mu_k \|^2,    (16)
with (15) being the MLE for µk. Now, suppose that in the EM-GMM computation
we set each Gaussian component's covariance matrix to be \sigma^2 I_D, use a constant
\pi_k = 1/K, and use an indicator function to set the responsibilities in the E-step such that for a
given data point, the component responsibility ends up either being one for the most
likely component, or zero otherwise [15]. The result is the same clustering that is
obtained from K-means as only the means of the components are relevant in the E-step
due to the way we defined the covariances. EM-GMM can be thus seen as a more
robust way of modelling in comparison to K-means due to the possibility of modelling
the individual component variances. Another important difference arising from the
probabilistic responsibility assignments in EM-GMM is that it captures the uncertainty
of each component responsibility, obtained via 1 − maxk(rik) [15], i.e., we can look
at the relative magnitude of the highest assigned component responsibility to obtain a
measure of the uncertainty related to the component (or cluster) assignment.
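The hard-assignment scheme of (14) and (15) can be sketched directly in code (the function name and the optional `mu0` initialisation argument are our additions):

```python
import numpy as np

def kmeans(X, K, n_iter=20, mu0=None, seed=0):
    """Hard-assignment clustering via the steps in (14) and (15) (sketch)."""
    rng = np.random.default_rng(seed)
    if mu0 is not None:
        mu = np.array(mu0, dtype=float)
    else:
        mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step (14): each point goes to its nearest centre (L2 metric).
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step (15): each centre becomes the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return labels, mu
```

Note that the assignments are hard: each row of the distance matrix yields exactly one winning cluster, in contrast to the soft responsibilities of EM-GMM.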
2.5 Factor analysis
Factor analysis is a field of statistics where the goal is to understand the effect of un-
observed (latent) variables on the observed data and to estimate these latent variables
or factors. The origins of the field are in behavioural sciences, specifically in the study
of human intelligence [21]. Factor analysis is a popular statistical tool in fields where
there are hard-to-define and obtain quantities, such as marketing and economics, but
has also seen wider adoption in terms of its core ideas due to the relation to the dimen-
sionality reduction technique known as principal component analysis. Factor analysis
models are discussed in more detail in Section 3.3.2.
2.6 Deep learning
Many classical machine learning techniques require the use of often complex prepro-
cessing of data in the form of feature extraction, before any classification or regres-
sion task becomes feasible. Deep learning is a branch of machine learning, where the
aim is to move away from complex feature extraction processes into what is known
as representation learning (with multiple layers of representations), where the learn-
ing architecture can be thought of as performing feature extraction on the data [17].
These architectures usually combine together arbitrarily many non-linear components
in a layered structure, each layer thought as corresponding to a different level of ab-
straction [17]. We note, however, that this kind of description can be seen as rather
idealised, considering the difficulty of actually interpreting the learned parameters of a
deep neural network. Ideally, this would reduce the need for in-depth domain
knowledge when adopting a deep architecture for some learning problem.
Deep learning, however, presents an entirely new set of challenges, from interpreta-
tion of the models to computational feasibility due to high parameter count. Finally,
classical machine learning techniques are still utilised in many deep learning systems.
The fundamental ideas behind deep learning can be traced back to the original ideas
of neural networks [27, 28, 29, 30] and the computational modelling of neurons. The
most widely known deep learning model is the classic multi-layer perceptron, which
utilises the simple model neuron, perceptron. While the ideas for training such models
were studied in the 1970s and 1980s, they did not gain wide interest in the machine
learning and pattern recognition communities until relatively recently, because in the
1990s it was widely thought that the training of such models would be difficult in
practice. For this thesis, two highly related deep learning models are presented – the
classical perceptron-based model (discussed next), and its training process, as well as
the so-called convolutional neural network.
2.6.1 The classic multi-layer perceptron
The multi-layer perceptron [27, 16] (deep feedforward network, feedforward neural
network), often abbreviated MLP, can be described in terms of non-linear function
approximation, where arbitrarily many layers of non-linear vector-valued functions,
each with modifiable (to be trained) parameters, are combined to a composite. A well-
known property of the MLP-networks is related to the universal approximation theo-
rem [31, 32, 17], which states that these kind of models can be used to model arbitrary
processes, given enough parameters. Of note is however the fact that this idea does not
consider the optimisation of such models [17], which is still an active area of research.
The first layer of a MLP can be written as [16]:

a_j = \sigma\left( \sum_{i=1}^{D} w^{(1)}_{ij} x_i + b^{(1)}_{j0} \right),    (17)
where aj is the jth activation of this layer. D denotes the dimensionality of the in-
put layer, while wij is the weight term for the ith input layer component connected
to the jth hidden layer component. bj0 is the jth bias term for the first (and only)
layer of this model. Finally, σ denotes the activation function. The neural network
nomenclature for this kind of model becomes apparent when the weights are consid-
ered to be edges connecting the previous layer to the second and each jth component
a neuron. Moreover, the weighted sum with the bias term and the activation is the
perceptron, with the important distinction that in the MLP we use a differentiable non-
linearity: z = \sigma\left( \sum_{i=1}^{D} w_i x_i + b \right). Thus, the basic component of this kind of model
is a linear combination, with adjustable parameters, followed by a non-linearity. (17)
can be extended to include multiple layers as follows [16]:

y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w^{(2)}_{kj}\, h\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + b^{(1)}_{j0} \right) + b^{(2)}_{k0} \right),    (18)
where M denotes the number of components on the added second layer and h is the
activation function used for the first hidden layer. The bias terms can be merged into
the input vectors such that an input x becomes x = (1, x_1, ..., x_d)^T [16]; this
simplification is used in the following discussion. We
could equivalently and more simply write our model in the matrix form:
G(x) = σ (h (xW1 + b1) W2 + b2) , (19)
where W1 is a weight-matrix containing weights for the first layer and b1 the bias
vector for that layer. The activation functions are not necessarily chosen to be the
same, as seen here.

Figure 7: A high level view of a MLP-network. Here we see the input layer, with
dimensionality D, an arbitrary number of hidden layers, and finally the output layer,
with size T.

Traditionally, the logistic sigmoid function has been presented as a
suitable activation function for the hidden layers:
\sigma(x) = \frac{1}{1 + \exp(-x)}.    (20)
However, the sigmoid function can lead to a problem known as the vanishing/exploding
gradient [33, 34] during the training process, where the error signal used for the parameter
update tends to either become too small for sensible updates or explode entirely.
This issue was noted to be especially significant for deeper networks, such as so-called
recurrent neural networks (RNNs). The hyperbolic tangent, tanh, is another com-
monly used activation function. Generally, the requirements for the hidden layer ac-
tivation functions are such that the function must be differentiable and monotonically
increasing [16]. In the context of modern deep learning, a widely accepted replacement
for the activation function is the so-called rectified linear unit (ReLU) [35, 17] and its
variants:
a(x) = max(0, x). (21)
ReLU is a piecewise function with two sections. This property, coupled with the non-
linearity of the function, makes it a desirable activation function from the perspective
of parameter optimisation [17]. In classification networks, the activation function at
the output layer is typically the softmax function [16]:
y_i(x) = \frac{e^{a_i}}{\sum_j e^{a_j}},    (22)
where ai is the ith activation at the output layer, and the sum is taken over all the activa-
tions at the output layer. The softmax activation constrains the output layer predictions
into probabilities, such that 0 ≤ y_i ≤ 1 and ∑_i y_i = 1, which is what we want to have
in a classification network.
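A forward pass through a two-layer network of the form (19), with ReLU hidden units (21) and a softmax output (22), can be sketched as follows (the function names are illustrative; subtracting the row maximum inside the softmax is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def relu(x):
    """ReLU activation, equation (21)."""
    return np.maximum(0.0, x)

def softmax(a):
    """Softmax activation, equation (22), stabilised by subtracting the max."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer forward pass in the matrix form of (19)."""
    h = relu(x @ W1 + b1)          # hidden-layer activations
    return softmax(h @ W2 + b2)    # output-layer class probabilities

# Toy dimensions: 4 inputs, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
y = mlp_forward(rng.normal(size=(2, 4)), W1, b1, W2, b2)
```

Each row of `y` sums to one, as required of the softmax outputs.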
The parameters of an MLP network can be trained with various strategies utilising
gradient descent and back-propagation. In MLP training using these two techniques,
the parameters are adjusted in small steps towards the negative gradient [36, 37, 30].
Back-propagation is used to evaluate the errors for each network parameter, which
are then used to adjust the parameters accordingly. The gradient descent parameter
optimisation can be formalised as follows [16]:
w(τ+1) = w(τ) − η∇E(w(τ)), (23)
Figure 8: The rectified linear unit (ReLU) activation function
where w(τ) is the flattened parameter vector of the current time-step, where each layer’s
weights have been concatenated into a single vector of parameters. η is known as the
learning rate and \nabla E is the vector of partial derivatives (a gradient) of the error (or ob-
jective or loss) function E with respect to all the network parameters, calculated with
back-propagation. The precise formulation of the error function depends on the acti-
vation function used at the output layer of the network as well as on the classification
task. For regression, mean-squared-error (MSE) can be used [16]:
E(w) = \frac{1}{2} \sum_{i=1}^{N} \| y_i - t_i \|^2,    (24)
where w is the vector of weight and bias parameters, yi the predicted output and ti
the corresponding label or target vector. The error or cost functions presented here for
neural networks are based on the principle of maximum likelihood [17]. In the case that
our labelling information is in the form of one-hot-coded vectors, and the task is to
perform binary or multi-class classification, binary or categorical cross-entropy can be
used. Binary cross-entropy can be written as follows [16]:
E(w) = -\sum_{i=1}^{N} \left( t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \right).    (25)
Categorical cross-entropy, as presented in [16], can be written as
E(w) = -\sum_{i=1}^{N} \sum_{c=1}^{M} t_{i,c} \log(y_{i,c}),    (26)
where M denotes the number of predicted classes, t_{i,c} the target label and y_{i,c} the predicted
probability of the ith sample belonging to the cth class. The cross-entropy losses are
motivated by information theory, and cause the network to bring the model predictions
closer to the training data, which represents the underlying true distribution we want to
generalise for.
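The losses (25) and (26) translate directly into code; as a sketch (the small `eps` guard against log(0) is our addition):

```python
import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    """Binary cross-entropy of (25); eps guards against log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def categorical_cross_entropy(y, t, eps=1e-12):
    """Categorical cross-entropy of (26); rows of t are one-hot targets and
    rows of y are predicted class probabilities."""
    return -np.sum(t * np.log(np.clip(y, eps, None)))
```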
In the back-propagation technique, the partial derivative of the error function for the
nth input vector, En(θ), with respect to a weight parameter wji, is obtained by utilising
the chain rule for partial derivatives [16]:
\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}},    (27)
where wji is the ith weight parameter connected to the jth activation, aj , of some layer
of our network. Here a_j is the weighted sum ∑_i w_{ji} z_i, where z_i is the ith output coming
from a previous layer of the network to this component. Let
\delta_j \equiv \frac{\partial E_n}{\partial a_j}.    (28)
The δjs are known as the errors [16]. In (28) the error for each network activation
is defined as the gradient of the objective function for the current (nth) sample with
respect to the jth activation at some layer of the network. The errors at the output layer
of the network are simply the differences between the activations at the output layer
and the values of the target vector, or δk = yk − tk. It is now possible to write the
partial derivative of the jth activation with respect to the ith weight connected to that
activation as follows:

\frac{\partial a_j}{\partial w_{ji}} = z_i.    (29)
Following the derivation in [16] and substituting the previous two into (27), we get
\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.    (30)
The errors for a unit of a hidden layer of the network are obtained as follows [16]:
\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}.    (31)
The sum is evaluated over all units k to which the jth hidden unit sends connections.

Figure 9: Components of MLP error calculation (based on the illustration in [16]).
The green arrow signifies the direction of activation calculation, often referred to as
the "forward pass". The red arrow shows the direction of the error back-propagation
computation, starting from the output layer.

Finally, we can follow [16], and substitute (28) into (31) and utilise the definition
of the activation, a_j = ∑_i w_{ji} z_i and z_j = h(a_j), to obtain the general form of the
back-propagation for a unit on a hidden layer of the network [16]:

\delta_j = h'(a_j) \sum_k w_{kj} \delta_k,    (32)
where h′ is the derivative of the hidden activation function of the layer this unit resides
at. From (32) we see that, after a forward pass through the network, one may iterate
backwards through the network using the immediately previously computed errors at
some layer to compute the errors for the next layer of the backwards pass through
the network. Importantly, there is no complex dependency between the errors of units
within the same layer of the network, as only the previous layer error values are needed
to be taken into account. Of note is also the fact that the computations in a MLP-network
can be represented as a computational graph [17], as there are no cycles in the
forward pass nor in the back-propagation step; most modern software tools internally
represent networks as such graphs.
The algorithm for naive gradient descent and back-propagation over the whole data is
described below. While our basic algorithm utilises the entire dataset before an update,
the batch size does not necessarily need to be set in this manner. One strategy is, for
example, to select some batch size smaller than N and either partition the data or sample
Algorithm 2 Gradient descent and back-propagation
1: while i < N do
2:     Forward-propagate an input vector xi through the network, calculating the
       activations of each hidden and output unit of the network.
3:     Evaluate the output-layer errors using δk = yk − tk.          ▷ forward pass
4:     Using the output-layer errors δk, back-propagate through the network by
       utilising (31) and (30), summing and maintaining the errors for each
       parameter.                                                    ▷ back-propagation
5:     Adjust the network parameters using (23).                     ▷ gradient descent
6: end while
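Algorithm 2 can be sketched for a one-hidden-layer softmax classifier as follows (illustrative: the function name, toy dimensions and the ReLU hidden layer are our choices, and the whole batch is used per update as in the naive algorithm):

```python
import numpy as np

def train_step(x, t, W1, b1, W2, b2, eta=0.05):
    """One gradient-descent update: forward pass, then the back-propagation
    of (30)-(32), then the parameter update of (23). Returns the loss."""
    # Forward pass.
    a1 = x @ W1 + b1
    z1 = np.maximum(0.0, a1)                  # ReLU hidden layer
    a2 = z1 @ W2 + b2
    e = np.exp(a2 - a2.max(axis=1, keepdims=True))
    y = e / e.sum(axis=1, keepdims=True)      # softmax outputs
    # Output-layer errors: delta_k = y_k - t_k.
    d2 = y - t
    # Hidden-layer errors via (32): h'(a_j) * sum_k w_kj delta_k.
    d1 = (d2 @ W2.T) * (a1 > 0)
    # Updates per (23), with dE/dw_ji = delta_j z_i as in (30).
    W2 -= eta * z1.T @ d2;  b2 -= eta * d2.sum(axis=0)
    W1 -= eta * x.T @ d1;   b1 -= eta * d1.sum(axis=0)
    return -np.sum(t * np.log(y + 1e-12))     # cross-entropy loss (26)
```

Repeated calls on the same data should drive the cross-entropy loss down, matching the intent of Algorithm 2.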
randomly from it (mini-batch gradient descent). The other extreme is an on-line [16]
version of the algorithm which calculates the gradient in terms of a single input vector,
and adjusts the network parameters each time, instead of going through the entire data
before adjustments (sequential/stochastic gradient descent). The mini-batch approach
is noted to be a popular compromise between using all of the data and a single data
point during error computation [17]. It was shown in [38] that using a single sample
is desirable in terms of generalisation performance, but, as noted in [17], this is often
problematic from the optimisation perspective. Additionally, the choice of the
mini-batch size may be affected by hardware and parallelisation considerations [17].
2.6.2 Implications of (32) and discussion
From (32), we can see that the iterative computation of the errors in these kinds of
networks can be achieved merely by knowing the derivative of the chosen activation
function(s). However, as noted in [16], due to the multiple nonlinearities found within
neural networks (the activation functions), the overall error function is non-convex.
This implies that only local optima of the error function can be found by utilising the
gradient descent parameter update in (23). For the generalisation performance of the
model, i.e., when the model is tested on unknown new data beyond the training data,
the global optimum may not be desirable [16].
It is noted in [39] that stochastic gradient descent (SGD) has been widely popular
as an optimisation technique for neural networks, even as the finer details of the
techniques used in practice have seen refinement in recent years. These include
the ideas of momentum [40], Nesterov momentum or Nesterov accelerated gradient
(NAG) [41], RMSprop [42], ADAM [43], Adamax [43], Adagrad [44], and Adadelta
[45], to name a few. For now, let us assume that we are using the mini-batch variant
of back-propagation and gradient descent. The update rule [39] using momentum
becomes

v_t = \gamma v_{t-1} + \eta \nabla E(w)
w = w - v_t,    (33)
where γ is an adjustable hyperparameter that controls the amount of the previous up-
date values used in computing the new update. Here we omitted the time-step index for
the parameters w for simplicity and conciseness. The intuition behind momentum is
that it helps the parameter optimisation process in getting over valleys and hills in the
gradient "landscape" by taking into account the history of past update values during
the update. Many of the variants of SGD discussed here include something similar in
nature.
In the Nesterov variant (34), the previous momentum values are used together with the
current parameters to obtain an estimate of the future parameter values such that the
historical direction of the parameter updates is taken into account during the gradient
computation itself.
v_t = \gamma v_{t-1} + \eta \nabla E(w - \gamma v_{t-1})
w = w - v_t.    (34)
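The updates (33) and (34) differ only in where the gradient is evaluated, as the following sketch shows (function names are ours; `grad_fn` stands in for ∇E):

```python
def momentum_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    """Classical momentum update, equation (33)."""
    v = gamma * v + eta * grad_fn(w)
    return w - v, v

def nesterov_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    """Nesterov update, equation (34): the gradient is evaluated at the
    look-ahead point w - gamma * v instead of at w."""
    v = gamma * v + eta * grad_fn(w - gamma * v)
    return w - v, v
```

On a simple quadratic objective, both variants drive the parameter towards the minimum; the Nesterov look-ahead mainly changes how quickly the velocity corrects itself near the optimum.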
Practical challenges in getting standard SGD to converge to desirable results have
further led to the development of so-called adaptive versions of SGD, where usually
each parameter is updated individually based on some statistic of said parameter. The
update rule for the adaptive method Adagrad is the following [39]:

w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i},    (35)
where w_{t,i} corresponds to parameter i at time-step t and g_{t,i} to the partial derivative of
the error with respect to that parameter. G_t ∈ R^{d×d} is a diagonal matrix of the sums of squares of the past gradients.
Finally, ε is added to the denominator for numerical stability. The intuition of (35)
is that we scale the learning rate parameter η according to the size of the past errors,
which allows for the optimisation to adapt to the size of the errors on a per-parameter
basis. One motivation for this scaling is suggested in [39] to be the problem of sparse
data, e.g., high-dimensional data where the variability is concentrated on particular
dimensions, which is common in many real-world situations. While adaptive meth-
ods have been found to be successful empirically in certain tasks such as in training
Generative Adversarial Networks (GANs) [46], in [47] it was suggested that adaptive
methods may not be well-justified as drop-in replacements for standard SGD as these
were shown to give widely differing results from standard SGD.
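The Adagrad update (35) can be sketched by keeping only the diagonal of G_t as a vector of accumulated squared gradients (the function name and defaults are ours):

```python
import numpy as np

def adagrad_step(w, G, grad_fn, eta=1.0, eps=1e-8):
    """Adagrad update of (35); G holds the per-parameter sums of squared
    past gradients, i.e. the diagonal of G_t."""
    g = grad_fn(w)
    G = G + g * g                          # accumulate squared gradients
    w = w - eta * g / (np.sqrt(G) + eps)   # per-parameter scaled step
    return w, G
```

Because each step is divided by the root of the accumulated squares, parameters with a history of large gradients take smaller steps, and vice versa.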
Beyond the choice of the update rule during optimisation, a strategy for the initial
values for the network parameters must be decided upon. Common approaches include
drawing from normal and uniform distributions, with some heuristics depending on
the choice of non-linearity in the model. One of these heuristical random initialisation
methods was presented in [48], where the initial parameter weights for a particular
layer are drawn such that
W ∼ U(−√
6√nj + nj+1
,
√6
√nj + nj+1
), (36)
where nj is the number of incoming connections ("fan in") to this layer and nj+1 the
number of outgoing connections ("fan out"). The normal variant of this method draws
the parameter values such that
W ∼ N(0, σ), (37)
where

\sigma = \sqrt{\frac{2}{n_j + n_{j+1}}}.    (38)
The above initialisation is known as Xavier or Glorot initialisation according to one of
the authors and has been found to be effective with deep networks.
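Both variants of the initialisation are one-liners; as a sketch (function names are ours):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Uniform Glorot/Xavier initialisation per (36)."""
    limit = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    """Normal variant per (37)-(38)."""
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))
```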
MLP-networks are particularly prone to a common problem in machine learning
known as overfitting, where the model’s neurons memorise some parts of the train-
ing set. The result is degraded inference performance when the model is introduced
to previously unseen data. Strategies for guarding against overfitting vary, and include
parameter weight decay [17], data augmentation [17], noise injection [17], and dropout
[49, 17]. In weight decay strategies, the model parameters are constrained via a penalty
on their norm (L1 or L2), added as a regularisation term to the error function. As
overfitting may often arise from insufficient data, data augmentation may be sometimes
used to generate new training samples by applying suitable transformations on the real
samples. In noise injection, we add random noise to the input data for the purpose of
making the model more robust to small disturbances in the input. Dropout is a method
for applying noise to the weights of the network, where we randomly "drop" connec-
tions in the network, i.e., we set some of the activations at a particular layer to zero at
random. Finally, we note that controlling for the capacity of the network, in terms of
the number of layers as well as the size of the layers themselves, is an important part
of regularisation.
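Dropout at a layer can be sketched in its so-called inverted form, where the surviving activations are scaled up during training so that no rescaling is needed at inference time (the inverted formulation and the function name are our choices, not taken from the text):

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: zero each activation with probability p_drop and
    rescale the survivors so the expected value is unchanged (sketch)."""
    if not train or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)
```

At inference time (`train=False`) the activations pass through unchanged.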
2.6.3 Convolutional neural networks
Convolutional neural networks (CNNs) [50] constitute a class of neural architectures
designed to allow for translational invariance [17] for structures in the input data.
Especially in the visual domain, such invariance is critically important, since objects
may change their location within an image but still need to be recognised by
the model. A CNN is a neural network that utilises an operation known as convolu-
tion in its discrete form at some point of the architecture [17]. The discrete convolution
operation is defined as follows:
s[t] = [x * w][t] = \sum_{a=-\infty}^{\infty} x[a] w[t - a].    (39)
For a 2D array-like input, such as a grayscale (single layer) digital image, convolution
becomes [17]
S(i, j) = (K * D)(i, j) = \sum_m \sum_n D(i - m, j - n) K(m, n),    (40)
where D is the input matrix and K a two-dimensional kernel that is convolved with
the input. Taken through an activation function, such as a ReLU, the output of this
operation is often called a feature map. [17] notes that, in practice, cross-correlation is
often utilised:
S(i, j) = (K * D)(i, j) = \sum_m \sum_n D(i + m, j + n) K(m, n).    (41)
Typically, convolutional neural architectures dealing with digital images have to account
for the three channels of an RGB image; however, for our purpose of adapting such
a model for speech data in the form of a spectrogram, the above form of convolution
applies.

Figure 10: Convolution with a 2x2 kernel (Adapted from [17])

Figure 10 is a simplification of the convolution operation.

Figure 11: Zero-padded input

In practice, two design
parameters are used for convolution in CNNs: stride and padding [17]. When the
kernel is convolved with the input, the kernel has its centre at the current input value.
In order to make the operation feasible at the borders, with varying input sizes, zeroes
are added to the input (Figure 11). The stride-parameter determines how many input
values the kernel centre moves at each step of the computation in each direction (width
and height) (Figure 12). The dimensions of the output are given by the following [51]:

\text{output width} = \frac{W - F_w + 2P}{S_w} + 1,    (42)

and

\text{output height} = \frac{H - F_h + 2P}{S_h} + 1,    (43)

where W and H are the width and height of the input, F_w and F_h the width and height
of the kernel respectively, P the amount of zero-padding used, and S_w and S_h the
horizontal and vertical strides.
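The operation of (41), together with the output dimensions of (42) and (43), can be sketched with explicit loops (illustrative; a single channel, a shared stride for both directions, and equal zero-padding on all sides are assumed):

```python
import numpy as np

def conv2d(D, K, stride=1, pad=0):
    """Cross-correlation of (41) over a zero-padded single-channel input.
    Output dimensions follow (42)-(43)."""
    Dp = np.pad(D, pad)                       # zero-padding on all sides
    Fh, Fw = K.shape
    H = (D.shape[0] - Fh + 2 * pad) // stride + 1
    W = (D.shape[1] - Fw + 2 * pad) // stride + 1
    S = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Dot product between the kernel and the current input patch.
            patch = Dp[i * stride:i * stride + Fh, j * stride:j * stride + Fw]
            S[i, j] = np.sum(patch * K)
    return S
```

Practical implementations vectorise these loops, but the indexing above mirrors the definitions directly.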
Figure 12: Height and width stride of 2 with 3x3 kernel and zero-padding
CNNs contain three desirable properties – sparse interactions (or sparse connectivity
or sparse weights), parameter sharing and equivariant representations [17], the last
of which we already hinted at earlier. In a traditional neural network, assuming that
no dropout (where some of the connections are dropped) is utilised, each hidden node
is connected to every node in the previous and next layer. In a CNN, because the
convolution kernel used at each network layer is smaller in size than the input, the
parameter count is greatly reduced. Parameter sharing is achieved, similarly, due to
the use of the kernel. Generally speaking, parameter sharing implies the use of the
same parameter for multiple functions within the model. The intuition in the case of
the CNN is that the kernel of each network level ties multiple positions of each input to
its values, while in a MLP-network, each position in the input is only tied to a specific
network node. Finally, we have the equivariance property. Function f is said to be
equivariant with respect to the function g if the following holds:
f(g(x)) = g(f(x)). (44)
The convolutional layer of a CNN has equivariance with respect to translation, where,
as the input data is shifted wholly into a certain direction, the result of the convolution
operation changes in a predictable manner.
In addition to the convolutional layers, CNNs normally contain another distinct compo-
nent – the pooling function or layer. A pooling layer can be thought of as performing
downsampling on the input, approximately representing it with a smaller number of
values. A common type of a pooling layer is the max-pool layer, which has some
similarity to our convolution operation. Unlike in the normal convolution operation,
where essentially a dot-product is computed between the kernel and a section of the
input, in max-pooling, a local maximum is simply taken around a section of the input
determined by the kernel and the stride-parameter (Figure 13). Other pooling functions
have been studied, such as taking the average, the L2-norm, or average distance from
the kernel centre position [17].

Figure 13: Max-pool with 2x2 kernel and stride 1 (Adapted from [51])

While an architectural description of a CNN may separate these into distinct network
layers, a convolutional layer of a CNN can be thought of
as consisting of three distinguishable stages [17], each of which is illustrated in
Figure 14 below. The final layers of a CNN are typically so-called fully connected (FC)
layers, which are exactly like the traditional neural network layers discussed in Section
2.6.1. A complete CNN architecture is illustrated in Figure 14 below. Both CNNs and
MLPs (both types are often mixed in modern networks) are commonly optimised with
some variant of SGD discussed earlier.
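Max-pooling can be sketched analogously to the convolution loop, taking a local maximum instead of a dot product (the function name and defaults are ours):

```python
import numpy as np

def max_pool(D, size=2, stride=2):
    """Max-pooling: the local maximum over size x size windows (sketch)."""
    H = (D.shape[0] - size) // stride + 1
    W = (D.shape[1] - size) // stride + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = D[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out
```

With the default 2x2 window and stride 2, each spatial dimension of the input is halved while the number of channels would be left untouched.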
In recent years, a plethora of different higher-level frameworks for developing and
training such models have risen, including Tensorflow (tensorflow.org), Keras (keras.io),
PyTorch (pytorch.org), Caffe
(caffe.berkeleyvision.org), and Microsoft Cognitive Toolkit (microsoft.com/en-us/cognitive-toolkit).
The recent explosion of neural network research can be largely attributed to the rise of
practical and affordable parallel computation of backpropagation and SGD via the use
of graphics processing units (GPUs), the most popular lower-level framework being
NVIDIA's CUDA platform (developer.nvidia.com/cuda-zone). Moreover, certain
higher-level frameworks, such as the historically more research-oriented PyTorch, make
the development and deployment of various kinds of networks straightforward with
standardised implementations for many layer types, loss functions, and optimisation
methods.

Figure 14: The basic building blocks of a convolutional neural network. The convolution
layers compute local activations of the input, while the pooling layers compute a
summarisation of the input at various stages. The max-pooling layer displayed here
reduces the size of the channel outputs, but retains the number of channels.
3 Speaker recognition and verification
Speaker recognition refers to the study of a number of tasks that utilise speech data with
the aim of tying the speech data to individuals in some manner [52]. These involve
speaker identification, speaker verification, speaker or event classification, speaker
segmentation, speaker tracking and speaker detection. Automatic speaker verification
(ASV), as the focus of this work, refers to a process where the identity of a user of a
system is verified using speech data, while identification refers to the process of iden-
tifying a person from an audio stream, possibly containing speech from multiple per-
sons. Speaker/event classification includes problems such as speaker age classification
or, for example in the case of event classification, classifying an audio event to relate
to music (e.g., singing) or a car (not speech). Speaker segmentation involves the prob-
lem of separating different speakers within an audio sample. Finally, we have speaker
detection and speaker tracking. The former involves detecting the presence of a par-
ticular speaker from an audio source, while the latter involves tracking a person across
multiple audio sources. A distinction should be made with automatic speech recogni-
tion (ASR), which is a broader and older problem, and refers to linguistic analysis of
speech, where the task is to predict the message that was spoken. Machine language
translation falls into the purview of speech recognition. Voice recognition has been
historically used as a synonym for both speech recognition and speaker recognition. In
this thesis, unless otherwise mentioned, we focus on ASV.
ASV can be further split into two different problems depending on what kind of in-
formation is used in the recognition task. In text-dependent speaker recognition, additional
linguistic information is used in conjunction with speech data, whereas in text-
independent speaker recognition, speech data may be to some degree used indepen-
dently of any linguistic information. [52]
3.1 Biometrics
We have seen that speaker recognition by itself is a wide area of study. This subsection
serves for the purpose of contextualising the problem of interest of this work, namely,
the detection of a spoofed audio sample within a speaker verification system. It is one
of many different forms of biometric authentication.
Biometrics is the science of identifying or verifying persons based on physiological
or behavioural characteristics, while the word biometric refers to a specific mode of
biometrics (e.g., fingerprint is a biometric) [53]. The most well-known biometric is
the fingerprint, which has until recently been somewhat of a synonym for biometrics in
general. Examples of behavioural characteristics include handwritten signatures and
indeed certain properties of voice (we will later discuss how human voice actually
contains both physiological and behavioural traits). The key difference of these two
types of characteristics is the fact that physiological characteristics are, at least to some
degree, directly measurable and less dependent on human mental functioning, while
the latter are more complex and a function of a person’s life experience, development
over time and situational setting. Consider how languages and ways of speaking are
learned, for example. Certain physiological characteristics may potentially contain
substantially more information for the purpose of biometrics in terms of a single mea-
surement, such as a fingerprint. In contrast, a behavioural characteristic, such as voice,
may contain less information in terms of a single measurement, implying the need for
measurements over a length of time (this will be a common theme in the following
sections).
Desiderata of a particular biometric can be evaluated by looking at the following crite-
ria [54, 53, 55]:
• Universality – The biometric characteristic should be measurable in the case of
every individual.
• Uniqueness – Persons should be uniquely identifiable by the characteristic.
• Permanence – The characteristic should stay similar even as time passes.
• Collectability – There should exist some way to measure the characteristic.
• Acceptability – Measuring and processing the biometric should be acceptable for
humans, i.e., not humiliating, inconvenient or dangerous.
• Performance – The biometric should provide accuracy and consistency in terms
of the measurement.
While the overall quality of a biometric is some kind of a function of all of the above
criteria, it would be impossible to cover all of them perfectly. Consider, for example, that
certain obvious trade-offs exist between the criteria. An easily collectable biometric, such
as a short speech sample, might have limited usefulness in terms of performance [53].
It should be noted that the problem of fooling or evading a system utilising biometrics is not contained in our desiderata; the inclusion of such ideas could be contested, as these criteria consider the quality of the biometric itself, independent of any system implementation considerations.
3.1.1 Authentication and biometrics
In the domain of information security, the three modes of authentication are based on
possession, knowledge and biometrics [56, 53, 55]. Possessions are any physical
or abstract (i.e., electronic) belongings useful for the purpose of uniquely identifying
a person, while knowledge refers to personal secrets, such as passwords. Biometrics were discussed above: the physiological and behavioural characteristics of a person. These can also be described in terms of the three phrases: "What you have" (possession), "What you know" (knowledge) and "What you are" (biometrics). One, two, or all three modes can be utilised in conjunction, depending on the use-case. In this thesis, we focus primarily on the third mode, in the context of ASV.
While there are numerous potential application areas for biometric authentication, one
way to classify these is into physical access control, logical access control and unique-
ness confirmation [53]. The first one refers to any use-case where the purpose is to
control access to a physical location, such as a particular room within a building, while
in the case of logical access control, authentication is used to control access to a system
or a process (e.g., a course registration system or one’s bank account). The last class of
use-cases refers to situations where biometrics is used to provide an eligibility check
upon user enrollment in a system, e.g., to check against duplicate registrations.
Two methods of biometric authentication can be identified [53]: verification and identification. Verification implies that a user provides some sort of an identifier in conjunction with the required biometrics; the task is then for the system to check that the provided biometrics match the stored biometrics for the given identifier. Biometric identification, on the other hand, implies a search through a database of stored biometrics, where, given a match, credentials are returned to the
user. There are two main categories of speaker identification: closed-set and open-set
identification [57]. In closed-set identification, the speaker model database is assumed
to contain all relevant speaker data, whereas in open-set identification no such assumption holds, allowing for the possibility of no match in terms of the model search. The
open-set problem can be considered to be the more difficult one of the two.
3.1.2 Speaker verification
Figure 15: A high-level view of the ASV process
Let us now look at the ASV task setting in more detail. The ASV process is as follows: a user provides the system with a unique identifier and a speech sample. The system
then builds a model (test speaker model) from the user-provided speech sample and re-
trieves a model (target speaker model) from a database of models corresponding to the
user-provided identifier. The test speaker model, along with the target speaker model,
is then introduced to a binary classifier where the output of the classifier decision de-
termines the outcome of the verification. Typical systems also involve a third model
known as a Universal Background Model (UBM) that acts as a competing model such
that likeness of the test speaker model to the UBM is evidence against the test speaker
model originating from the claimed speaker. Traditional ASV systems, such as those
resembling the GMM-UBM system [7], may use the UBM directly in the classifier, as
shown in Figure 15.
3.2 Speech processing and feature extraction
While deep learning techniques adapted for speaker recognition may abandon much
in terms of hand-crafted feature extraction, the fundamental concepts related to speech
processing and feature extraction remain an important topic of discussion.
Computer processing of speech begins with the conversion of an analog signal that
is continuous in both time and amplitude (energy) into a digital signal that is both
discrete-time and discrete-valued [58]. In this process known as analog-to-digital con-
version (ADC), a band-limited signal is sampled at time intervals, governed by the
sampling theorem, resulting in a signal that consists of continuous amplitude values at
discrete time-steps. The signal then goes through quantisation, in which the continuous values are approximately represented with B bits, such that each sample can be quantised into 2^B distinct levels. Finally, we have the so-called front-end of the ASV
system, in which the actual feature extraction before modelling and classification is
done. For the purposes of this thesis, we will constrain our discussion to the rightmost
section (ASV front-end) of the overall signal processing chain illustrated in Figure 16.
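As a rough illustration of the quantisation step (a hypothetical sketch of ours, not tied to any particular ADC implementation; the helper name `quantise` is our own), a B-bit uniform quantiser maps each sample to one of 2^B levels:

```python
import numpy as np

def quantise(signal, n_bits):
    """Uniformly quantise a signal in [-1, 1] into 2**n_bits levels."""
    levels = 2 ** n_bits
    # Scale to integer level indices, round, then map back to [-1, 1].
    indices = np.clip(np.round((signal + 1.0) / 2.0 * (levels - 1)), 0, levels - 1)
    return indices / (levels - 1) * 2.0 - 1.0

# "Sample" a 100 Hz sinusoid at 8 kHz and quantise it with B = 3 bits.
fs = 8000
t = np.arange(0, 0.01, 1.0 / fs)
x = np.sin(2 * np.pi * 100 * t)
x_q = quantise(x, n_bits=3)
assert len(np.unique(x_q)) <= 2 ** 3  # at most 8 distinct levels
```

The quantisation error of such a uniform quantiser is bounded by half the step size between adjacent levels.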
The properties of good features for ASR were identified by Nolan in [59], essentially
Figure 16: A high-level view of the processes between a recording microphone and the final feature vector
containing many of the attributes we discussed in subsection 3.1. Nolan, however, already identified the problem of fooling the system with mimicry or disguise. Five attributes of good features for ASR in forensics were identified:
• High between-speaker variability, low within-speaker variability
• Resistance to disguise or mimicry
• High likelihood of presence in samples
• Robustness in transmission
• Ease of extraction
These attributes were presented from the perspective of the forensic applications of
ASR, but as is noted in [1], they are relevant regardless of the application area. Fea-
tures in ASR and ASV can be split into auditory and acoustic features [1]. Auditory
features are those that can be identified by a human listener, as opposed to mathe-
matically defined low-level features, the latter of which are of interest in this thesis.
Furthermore, these two groups both contain features that can be either linguistic or non-linguistic in nature – linguistic features deal with language and contain information that is phonological, morphological or syntactic. These linguistic properties may
be present both in acoustic and auditory features. Examples of non-linguistic features
could be the rate and the lengths of pauses during speech. Finally, features in ASR can
be either short-term or long-term features, depending on the length of the speech seg-
ment used to process the features. Typical acoustic features, such as those presented in
the following subsections, are short-term in nature, but features may also be extracted
over a longer speech segment – these are called utterance-level features.
Many short-term acoustic features have been developed over the years; the most popular of these are arguably the Mel-frequency cepstral coefficients (MFCCs) and linear predictive coding (LPC) based features [1].
3.2.1 Mel-frequency cepstral coefficients (MFCCs)
Spectral analysis of a speech signal is at the heart of traditional speaker recognition,
and while numerous different low-level features have been developed over the years in
speech processing, we will approach this discussion from the perspective of the widely
used Mel-frequency cepstral coefficients [8, 57, 52] (MFCCs) and the newer constant-q
cepstral coefficients (CQCC) [9] features. In a typical speech processing setting, the
Figure 17: MFCC feature extraction
common operations are short-term framing and windowing of the signal. A spectral feature vector is normally not computed from the whole speech utterance, but from a relatively short 'snapshot' of the signal, with 20 to 30 ms of local context, known
as a frame. The justification for framing is that, due to the physical constraints of the human voice-producing vocal tract, the signal can be assumed to be approximately statistically stationary during this short time-window.
The analysis frame is moved across the whole signal in an overlapping manner such that a significant portion (usually 50 %) of any frame overlaps with the previous and the next frame.
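The framing scheme described above can be sketched as follows (a minimal illustration of ours; the helper `frame_signal` and the parameter values are not from any particular toolkit):

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

fs = 16000
x = np.random.randn(fs)       # one second of noise as a stand-in for speech
frame_len = int(0.025 * fs)   # 25 ms frame = 400 samples
hop_len = frame_len // 2      # 50 % overlap
frames = frame_signal(x, frame_len, hop_len)
print(frames.shape)           # (79, 400)
```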
The next step is windowing, in which the signal frame is multiplied point-wise by a
suitable window function. The Hamming window (Figure 18) can be regarded as the most commonly utilised windowing function in speech processing [52]. It is defined as
w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad (45)
where n is the sample index within the window and N the length of the window. A closely related
Figure 18: A Hamming window with N = 120
window function is the Hann (Hanning) window [52]:
w[n] = 0.5\left[1 - \cos\left(\frac{2\pi n}{N-1}\right)\right] \qquad (46)
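As a quick sanity check (a sketch of ours, not part of any referenced system), the definitions in (45) and (46) coincide with NumPy's built-in window functions:

```python
import numpy as np

N = 120
n = np.arange(N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # eq. (45)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))        # eq. (46)

# NumPy's built-in windows use the same closed forms.
assert np.allclose(hamming, np.hamming(N))
assert np.allclose(hann, np.hanning(N))
```

Note that the Hann window reaches zero at its endpoints, while the Hamming window stays at 0.08 there.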
Many different windows have been developed for different use-cases [60], including the Bartlett, Poisson, Kaiser, Dolph-Chebyshev and Gaussian windows, among others. The windowed signal frame then goes through the discrete Fourier transform (DFT) [61]:
\mathrm{DFT}_k \triangleq \sum_{n=0}^{N-1} x[n] \exp\left(\frac{-j2\pi nk}{N}\right), \quad k = 0, 1, 2, \ldots, N-1, \qquad (47)
where x[n] denotes the nth sample within the windowed frame, and \mathrm{DFT}_k the kth spectral sample. The discrete Fourier transform is a complex-valued function for analysing the frequency content of a signal when we have a finite number of samples taken at discrete, linearly spaced time-steps. The complex negative exponential in (47) represents the complex conjugate of a sampled complex sinusoid, which is used to analyse
the frequency information from the original signal at the kth frequency bin. We will
break down the DFT operation in more detail below.
To understand how the DFT works, one has to first look at how any signal can be represented as a sum of sinusoidal components [61]. While a continuous signal can be thought of as a sum of an infinite number of different sinusoids extending to infinity, a discrete sampled signal can be represented as a finite sum of sinusoids. Formally,
a complex sinusoid is a function with the form [61]:
s(t) \triangleq A e^{j(\omega t+\phi)} = A\cos(\omega t+\phi) + jA\sin(\omega t+\phi), \qquad (48)
where A is the peak amplitude (often just amplitude), ω = 2πf the radian frequency
(radians/sec), t the time in seconds and φ the initial phase. Additionally, the sum ωt+φ
is known as the instantaneous phase. This is illustrated in Figure 19. The magnitude
of a signal is defined as follows:
|x(t)| \triangleq \sqrt{\mathrm{Re}^2\{x(t)\} + \mathrm{Im}^2\{x(t)\}} \equiv A. \qquad (49)
The real part of the complex sinusoid is known as the in-phase component, while the imaginary part is known as the phase-quadrature component. The formulation in (48) follows from Euler's identity:
e^{j\theta} = \cos(\theta) + j\sin(\theta), \qquad (50)
which implies
Ae^{j\theta} = A\cos(\theta) + jA\sin(\theta). \qquad (51)
Euler’s identity allows for the representation of a complex number z in its polar form:
z = re^{j\theta} = r(\cos(\theta) + j\sin(\theta)), \quad \bar{z} = re^{-j\theta} = r(\cos(\theta) - j\sin(\theta)), \qquad (52)
where \bar{z} is the complex conjugate of z. A comparison with the DFT formula in (47)
shows that the multiplication of the real-valued signal at sample n with the sampled
complex sinusoid results in a complex sinusoid with the magnitude determined by the
value of x[n], given (51) and (49). From Euler's identity, it is easy to show that
Figure 19: The components and parameters of a complex sinusoid
\cos(\theta) = \frac{e^{j\theta} + e^{-j\theta}}{2}, \quad \sin(\theta) = \frac{e^{j\theta} - e^{-j\theta}}{2j}. \qquad (53)
The equations in (53) show that sines and cosines, and hence complex sinusoids, are composed equally of negative and positive frequencies. The negative frequencies have no physical interpretation and can be considered an artefact of the mathematics of the Fourier transform. They are, however, present in the output of the normal implementation of the DFT, which is why we discuss them here. The output of the DFT, given a real-valued input, is conjugate symmetric, such that
\mathrm{DFT}_{N-k} = \overline{\mathrm{DFT}_k}, \qquad (54)
which follows from (47). The DFT can be used to obtain both the magnitude and the phase of the analysed signal. Depending on the task, only the magnitude values from the DFT may be retained; one consideration for such a judgement is whether it is necessary to be able to recover the original sampled signal. In the case of the MFCCs, only the magnitudes of the complex values are retained. In addition to being important for allowing analysis of the sampled signal in terms of the present frequencies, the DFT is also important for allowing manipulation of the underlying frequency content, as done with the mel-scale
Figure 20: Frames and their corresponding FFTs
filterbank in (56).
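The conjugate symmetry in (54) is easy to verify numerically (a small sketch of ours, not from any referenced source):

```python
import numpy as np

x = np.random.randn(64)    # a real-valued signal frame
X = np.fft.fft(x)
N = len(X)
k = np.arange(1, N)

# Conjugate symmetry of the DFT of a real signal: X[N-k] == conj(X[k]).
assert np.allclose(X[N - k], np.conj(X[k]))
# The DC bin (k = 0) is the plain sum of the samples, hence real.
assert abs(X[0].imag) < 1e-9
```

This symmetry is why only about half of the DFT bins carry independent information for a real-valued input.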
The original sampled signal can be recovered with the inverse operation of the DFT,
known as the inverse discrete Fourier transform (IDFT):
x[n] \triangleq \frac{1}{N} \sum_{k=0}^{N-1} \mathrm{DFT}_k \exp\left(\frac{j2\pi nk}{N}\right), \quad n = 0, 1, 2, \ldots, N-1. \qquad (55)
A key property of the DFT is that the sampled signal is assumed to be periodic with a
period of N , which in the case of a complicated signal such as a speech sample results
in so-called spectral leakage. In spectral leakage, the frequency information of the
signal gets spread across multiple frequency bins due to discontinuity in the repeated
signal. One motivation for the use of the window functions described earlier is to
constrain the beginning and the end of the frame towards zero.
When the DFT is used in conjunction with a windowing function and signal framing,
we are actually performing what is known as the short-time Fourier transform (STFT).
The STFT and its visualisation are discussed in subsection 3.2.3. The STFT matrix will
be considered, in this thesis, as a "light-feature", as opposed to the more complicated
signal processing chain presented here for the MFCCs and in subsection 3.2.2 for the
CQCCs. Naive computation of the DFT for a signal frame of length N takes O(N^2) operations, but in practice an optimised algorithm is used, such as the fast Fourier transform (FFT) [62], which runs in O(N \log N). We will omit the details in this work.
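The equivalence of the direct O(N^2) evaluation of (47) and the FFT can be sketched as follows (the helper `naive_dft` is illustrative only and not intended for real use):

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of eq. (47) via the full twiddle-factor matrix."""
    N = len(x)
    n = np.arange(N)
    # Outer product n*k yields every exponent combination at once.
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ x

x = np.random.randn(128)
# The FFT computes exactly the same transform, only faster.
assert np.allclose(naive_dft(x), np.fft.fft(x))
```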
After the DFT computation, a Mel-scale filterbank is used to extract information about
critically placed frequencies, using:
\mathrm{MF}(r) = \frac{1}{A_r} \sum_{k=L_r}^{U_r} |V_r(k)\,\mathrm{DFT}(k)|^2, \quad r = 1, 2, \ldots, R, \qquad (56)
where Vr(k) is the weighting function and Lr and Ur are the first and last frequency-bin
indices of the DFT output, respectively. A_r is the normalising factor used for the rth filter of the filterbank:
A_r = \sum_{k=L_r}^{U_r} |V_r(k)|^2 \qquad (57)
The filterbank (Figure 21) consists of R filters, each of which measures the signal energy around a particular centre frequency in the windowed signal frame being processed.
Figure 21: A mel-scale filterbank with 12 filters between 0 Hz and the Nyquist fre-quency. The figure shows the weight of each filter around its centre frequency. Eachfunction in the figure is zero outside its scaled range.
The Mel (melody) scale is based on studies of the human auditory system, where the perception of frequency is close to linear up to 1 kHz and continues logarithmically thereafter [63, 52]. Various closed-form mel-scale conversion formulae have been presented; a popular one was given in [64]:
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \qquad (58)
with inverse mapping
f = 700\left(10^{m/2595} - 1\right). \qquad (59)
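The mappings (58) and (59) can be sketched directly (the function names below are our own):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)    # eq. (58)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # eq. (59)

f = np.array([100.0, 1000.0, 4000.0, 8000.0])
# The two mappings are exact inverses of each other.
assert np.allclose(mel_to_hz(hz_to_mel(f)), f)
print(hz_to_mel(1000.0))  # close to 1000 mel, near the 1 kHz "knee"
```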
Finally, the MFCC features are obtained by taking the discrete cosine transform (DCT)
of the log-transformed Mel-spectrum obtained earlier:
\mathrm{MFCC}(n) = \frac{1}{R} \sum_{r=1}^{R} \log\left(\mathrm{MF}(r)\right) \cos\left[\frac{2\pi}{R}\left(r + \frac{1}{2}\right)n\right] \qquad (60)
The motivation for using the DCT is two-fold. Firstly, it retains most of the energy of
the input signal in the first coefficients (compression), and secondly, the DCT values
are uncorrelated [1]. Normally, in speech processing and indeed ASR (including ASV
tasks), only the first 12 to 16 of the coefficients obtained in (60) are retained for use as features [65]. There are two reasons for this: firstly, we want to represent the data with as few dimensions as possible, and secondly, the DCT output concentrates the most significant coefficients in the first indices. This is also related to the so-called curse of dimensionality [66]: in high-dimensional datasets, the points become distributed around the edges of a d-dimensional hypercube, where d is the dimensionality of the data. This poses problems for machine learning set-ups where the data dimensionality is substantial in comparison to the amount of available data; as the dimensionality increases, we need exponentially more data for the training process, which is not always possible [67]. The curse of dimensionality is an important motivation behind dimensionality reduction, along with the increased computational and storage requirements that come with high-dimensional data.
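The final DCT-and-truncate step can be sketched with SciPy's DCT routine (a sketch of ours; the filterbank energies below are random placeholders, not real speech):

```python
import numpy as np
from scipy.fftpack import dct

# Hypothetical log mel-filterbank energies for a few frames (R = 26 filters).
rng = np.random.default_rng(0)
log_mel = np.log(rng.uniform(0.1, 10.0, size=(5, 26)))

# DCT along the filter axis; keep only the first 13 coefficients,
# mirroring the common 12-16 coefficient truncation.
mfcc = dct(log_mel, type=2, axis=1, norm='ortho')[:, :13]
print(mfcc.shape)  # (5, 13)
```

Truncating after the DCT is what gives the compact, largely decorrelated representation discussed above.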
After the basic MFCC computation described above, additional steps may be taken,
depending on the use-case. So-called dynamic features, i.e., velocity and acceleration
coefficients [68, 69, 1] have been used in ASR and ASV by appending them to the
original MFCC vector [70]:
d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{t+\theta} - c_{t-\theta})}{2\sum_{\theta=1}^{\Theta} \theta^2}, \qquad (61)
where Θ is the size of the delta window. In (61) we have the first derivative, or delta
coefficient, calculation. The second derivative, or delta-delta, is obtained by making
the calculation over the delta coefficients obtained with (61). With the delta coefficients, we get additional coefficients for each MFCC coefficient: one from the deltas and another from the delta-deltas. Finally, centring and normalisation may be applied to the features [1]. In cepstral mean normalisation (CMN), we subtract the mean feature vector from each feature vector before any classification, where the mean is computed over some sliding time-window. Other normalisation techniques include cepstral variance normalisation (CVN) [71], time-warping [72], relative spectral processing (RASTA) [73], and quantile-based cepstral normalisation [1].
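The delta computation in (61) and a simple utterance-level mean subtraction can be sketched as follows (a rough illustration of ours, with edge frames replicated; the MFCC matrix is a random placeholder, not real speech):

```python
import numpy as np

def deltas(c, theta=2):
    """Delta coefficients per eq. (61), with edge frames replicated."""
    padded = np.pad(c, ((theta, theta), (0, 0)), mode='edge')
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    num = sum(t * (padded[theta + t:len(c) + theta + t]
                   - padded[theta - t:len(c) + theta - t])
              for t in range(1, theta + 1))
    return num / denom

rng = np.random.default_rng(1)
c = rng.standard_normal((100, 13))   # hypothetical MFCC matrix (frames x coeffs)
d = deltas(c)                        # velocity
dd = deltas(d)                       # acceleration
features = np.hstack([c, d, dd])     # the familiar 39-dimensional vectors
features -= features.mean(axis=0)    # utterance-level mean subtraction
print(features.shape)                # (100, 39)
```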
3.2.2 Constant Q cepstral coefficients (CQCCs)
The constant-Q cepstral coefficients (CQCCs) are a recent development in features for
use in ASV [74, 9]. In the CQCC feature extraction process, the constant-Q transform
(CQT) is analogous to the windowed DFT (STFT) calculation described in the previous subsection, but with a varying analysis resolution on both the frequency axis and the time axis, such that the resolution is higher on the time axis at higher frequencies and higher on the frequency axis at lower frequencies. The resolution of a signal in spectral analysis is the ratio of the sample rate and the length of the window (in samples), which in the case of the DFT is constant for every analysis frame. The CQT of an input
signal x[n] is obtained by the following transform:
\mathrm{CQT}(k, n) \triangleq \sum_{j=n-\lfloor N_k/2 \rfloor}^{n+\lfloor N_k/2 \rfloor} x(j)\, a_k^*(j - n + N_k/2), \qquad (62)
where k is the frequency bin index, a_k^* the complex conjugate of a_k and N_k the varying window length. Note that the window length, 2\lfloor N_k/2 \rfloor + 1, depends on the bin index k, unlike in the DFT. In (62), the a_k(n) are the basis functions of the transform, defined as follows:
as follows:
a_k(n) \triangleq \frac{1}{C}\, w\!\left(\frac{n}{N_k}\right) \exp\left[i\left(2\pi n \frac{f_k}{f_s} + \Phi_k\right)\right], \qquad (63)
where C is the scaling factor defined in (64), f_k the frequency of the kth bin, f_s the sample rate and \Phi_k the phase offset of the transform. The scaling factor is given by
C = \sum_{l=-\lfloor N_k/2 \rfloor}^{\lfloor N_k/2 \rfloor} w\!\left(\frac{l + N_k/2}{N_k}\right), \qquad (64)
where N_k is as defined earlier and w(\cdot) the chosen window function. The bin frequencies are fundamentally different from those of the DFT. In the DFT, the frequency bins are linearly spaced, such that f_k = f_{k-1} + \Delta f. In the CQT they are defined as
f_k = f_1 2^{\frac{k-1}{B}}, \qquad (65)
where f_1 = f_{\min} is the frequency of the first bin and B specifies the number of bins per octave. A uniform resampling is done on the signal such that the geometric placement of the bins is preserved, but in linear space [9]. The so-called Q-factor of the CQT is tied to the bins
per octave parameter in the following manner:
Q = \frac{f_k}{f_{k+1} - f_k} = \frac{1}{2^{1/B} - 1}. \qquad (66)
The lengths of the windows for the different frequencies are related to the Q-factor as
N_k = \frac{f_s}{f_k} Q. \qquad (67)
From (67) we see that the analysis window size shrinks as we move upward in the
frequency bins, and vice versa. The overall computation of the CQCC features is as
follows:
\mathrm{CQCC}(n) = \sum_{l=1}^{L} \log |\mathrm{CQT}(l)|^2 \cos\left[\frac{n\left(l - \frac{1}{2}\right)\pi}{L}\right], \quad n = 0, 1, \ldots, L-1, \qquad (68)
where L is the number of the frequency bins in the resampled space. Figure 22 shows
in rough terms the way in which the time resolution is increased at the high frequencies
while the frequency resolution is increased at the lower frequencies.
Figure 22: Comparison of time and frequency resolution of DFT and CQT (based onthe illustration in [9])
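The geometric bin placement (65), the constant Q-factor (66), and the shrinking window lengths (67) can be sketched numerically (the parameter values below, such as the first-bin frequency and the number of bins, are illustrative assumptions of ours):

```python
import numpy as np

fs = 16000.0
f_min = 32.70        # hypothetical first-bin frequency
B = 12               # bins per octave
K = 60               # number of bins

k = np.arange(1, K + 1)
f_k = f_min * 2.0 ** ((k - 1) / B)       # eq. (65): geometric spacing
Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)       # eq. (66): constant Q-factor
N_k = fs / f_k * Q                       # eq. (67): per-bin window lengths

# The Q-factor is the same for every bin ...
assert np.allclose(f_k[:-1] / np.diff(f_k), Q)
# ... and the analysis windows shrink as frequency grows.
assert np.all(np.diff(N_k) < 0)
```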
3.2.3 The spectrogram
So far we have discussed two feature extraction methods used in various speech-related tasks. Unlike many traditional ASR and ASV set-ups, the deep learning approach to ASV (discussed in 3.5) has used the spectrogram of speech as an input feature representation. The spectrogram of speech we use in this thesis is a dB-scale (20 \log_{10} transform) intensity plot of the magnitude values of the sequence of windowed DFT outputs calculated for each frame of the original sampled signal. This
Figure 23: STFT and CQT power spectrogram comparison
amounts to the STFT operation with the usual frame-size, shift and overlap parameters,
discussed in 3.2.1. Figure 24 is a spectrogram of speech generated with the speech processing Python library librosa.7 In Figure 24, we used a window size of 550
Figure 24: A spectrogram of speech
samples, which, with a sample rate of 22050 Hz, corresponds to approximately 25 ms. The overlap was chosen to be half of the window size.
7http://librosa.github.io
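A dB-scale spectrogram along these lines can also be sketched with SciPy's STFT rather than librosa (a sketch of ours; the 440 Hz test tone is a stand-in for real speech, while the window size and overlap follow the text):

```python
import numpy as np
from scipy.signal import stft

fs = 22050
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t)             # a 440 Hz test tone

# 550-sample window (~25 ms at 22050 Hz) with 50 % overlap.
f, times, Z = stft(x, fs=fs, window='hamming', nperseg=550, noverlap=275)
spec_db = 20 * np.log10(np.abs(Z) + 1e-10)  # dB-scale magnitude

# The strongest frequency bin should sit near 440 Hz.
peak = f[np.argmax(spec_db.mean(axis=1))]
print(round(peak))
```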
3.2.4 Other features
In the previous subsections we presented three popular features used in various ASR tasks, but many different features have been developed over the years. The linear frequency
cepstral coefficients (LFCCs) [8] are similar to the MFCC features, but do not contain
the mel-filtering phase of the MFCCs:
\mathrm{LFCC}(n) = \sum_{k=0}^{K-1} \mathrm{DFT}_k \cos\left(\frac{\pi nk}{K}\right), \quad n = 1, \ldots, N, \qquad (69)
where n is the index of the feature vector and k the DFT coefficient index. LFCCs have been shown to offer discriminatory power complementary to MFCCs in ASR tasks [75]. The LFCC features are motivated by findings that vocal tract length differences in the human speech production system show up in the higher frequencies, where the mel-scale MFCCs offer less resolution. MFCCs, LFCCs, and their variants rely on spectral analysis of the signal, but another classic approach to speech analysis is linear prediction, where the speech signal is modelled as [57]:
x[n] = \sum_{l=1}^{L} a_l\, x(n-l) + G u(n), \quad n = 1, \ldots, N, \qquad (70)
where n is the sample index, L the number of output coefficients and the order of the predictor, G the gain, and u(n) the excitation signal. This is also known as the source-filter model of speech. Finally, the a_l are the features or coefficients. The coefficients can be estimated by means of forward linear prediction or backward linear prediction. In the former, the task is to predict the sample x(n) from the previous samples x(n-1), x(n-2), and so forth. In the latter, we predict the value of the sample x(n-L) from the future values x(n), x(n-1), \ldots, x(n-L+1). The
linear prediction cepstrum coefficients [8] are obtained as follows:
\mathrm{LPCC}(n) = \mathrm{LPC}(n) + \sum_{k=1}^{n-1} \frac{k-n}{n}\, \mathrm{LPCC}(n-k)\, \mathrm{LPC}(k). \qquad (71)
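Forward linear prediction and the cepstral recursion can be sketched as follows (a least-squares sketch of ours rather than the usual Levinson-Durbin recursion; the function names are our own, and the recursion follows the form of (71) as printed):

```python
import numpy as np

def lpc(x, order):
    """Forward linear prediction via least squares: predict x[n]
    from x[n-1] ... x[n-order]."""
    A = np.stack([x[order - l:len(x) - l] for l in range(1, order + 1)], axis=1)
    a, *_ = np.linalg.lstsq(A, x[order:], rcond=None)
    return a

def lpcc(a, n_coeffs):
    """Cepstral recursion in the spirit of eq. (71)."""
    c = np.zeros(n_coeffs)
    for n in range(1, n_coeffs + 1):
        acc = a[n - 1] if n <= len(a) else 0.0
        for k in range(1, n):
            acc += (k - n) / n * c[n - k - 1] * (a[k - 1] if k <= len(a) else 0.0)
        c[n - 1] = acc
    return c

rng = np.random.default_rng(2)
x = rng.standard_normal(400)        # placeholder signal, not real speech
a = lpc(x, order=10)
print(lpcc(a, 12).shape)            # (12,)
```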
3.3 Speaker modelling and classification
We have reviewed above some of the most commonly used feature extraction ap-
proaches in ASR and ASV. We now turn the discussion towards speaker modelling
and classification. In this subsection, we describe three key approaches: the classic
mixture-based approach (Section 3.3.1), the PLDA-i-vector approach (Section 3.3.2),
and the deep learning approach (Section 3.3.3).
Both ASV and ASV replay attack detection imply a binary classification task, as discussed in Section 2.1. From the statistical hypothesis testing point of view, these are analogous, but not equivalent, set-ups. We can formally define the statistical hypotheses for the ASV case as
H_0: the sample comes from the claimed identity,
H_1: the sample is from another identity. \qquad (72)
In the case of replay attack detection, the hypotheses can be set as
H_0: the sample is from a genuine speaker,
H_1: the sample is from an impostor. \qquad (73)
In both cases, "sample" may refer to features extracted from a single frame, or to features computed over some period of time (an utterance).
3.3.1 Gaussian mixture model approaches to ASV
A GMM-based approach to speaker verification was originally presented in [6] and further enhanced in [7], where, in the enrollment stage, the input user speech data is used to adapt a target speaker model from a speaker-independent universal background model (UBM). At the verification stage, a new test utterance is compared against both the target model and the UBM using a likelihood ratio test. The baseline evaluation system used in the ASVspoof 2017 evaluation [2] was still based on similar core ideas, which speaks for the continued relevance of this kind of speaker verification. We describe the relevant ideas in detail because a similar system is used as a baseline for comparison in this work.
Reynolds et al. [7] presented an automatic speaker verification system consisting of a front-end speech processing system connected to a binary log-likelihood classifier. The front-end used a frame length of 20 ms with a shift of 10 ms. Additionally, a speech activity detector was employed to discard non-speech data. From there, MFCC features were extracted for use in the model estimation.
A key component of this system is the universal background model (UBM) used in the classification task. The role of the UBM is to serve as the alternative hypothesis of this set-up; that is, the test utterance is classified as belonging either to the claimed identity or to the UBM. The background model can be trained either from the whole training data set or from a particular sub-population. Reynolds et al. noted that training the UBM with the whole training data may lead to bias towards certain sub-populations in the data, such as a particular gender or age group.
Figure 25: The mixture models of the canonical GMM-UBM system (picture used: Pixabay)
The GMMs for both the background model and the claimed identity are trained with the EM algorithm described in Section 2.4, which is based on the statistical principle of maximum likelihood estimation. An important detail of the system presented by Reynolds et al. is that the claimed user model is not trained from the ground up. Instead, the claimed identity model is adapted from the UBM.
The classification in the GMM-UBM system is done with a log-likelihood ratio test between the claimed identity speaker model and the background model:
\Lambda(X) = \log p(X \mid \lambda_{\mathrm{hyp}}) - \log p(X \mid \lambda_{\mathrm{ubm}}). \qquad (74)
The GMM-UBM system, as a set of basic principles, has remained popular in ASR and ASV since its conception.
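The log-likelihood ratio test (74) can be sketched with two toy diagonal-covariance GMMs standing in for the adapted target model and the UBM (all parameter values below are illustrative placeholders of ours, not a real trained system):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(X, weights, means, covs):
    """Average per-frame log-likelihood under a GMM."""
    per_comp = np.stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                         for w, m, c in zip(weights, means, covs)])
    return np.mean(np.log(per_comp.sum(axis=0)))

rng = np.random.default_rng(3)
# Hypothetical 2-component, 2-D models: a "target" GMM and a "UBM".
target = (np.array([0.5, 0.5]), [np.zeros(2), np.ones(2) * 3], [np.eye(2)] * 2)
ubm = (np.array([0.5, 0.5]), [np.ones(2) * 10, np.ones(2) * -10], [np.eye(2)] * 2)

X = rng.standard_normal((200, 2))   # test frames drawn near the target model
llr = gmm_loglik(X, *target) - gmm_loglik(X, *ubm)   # eq. (74)
print(llr > 0)  # True: the claimed identity is accepted
```

A positive ratio is evidence for the claimed identity; in practice the ratio is compared against a tuned decision threshold rather than zero.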
3.3.2 Linear statistical model approaches
Early work on so-called utterance-level features was done in [76], motivated by the need for features computed over a length of time longer than a single speech frame. The GMM supervector was originally presented in [77], where the component means of a trained speaker GMM represent the features computed for variable-length utterances. The supervector model was adapted for speaker recognition in [78].
These high-dimensional vectors have since been used as a key component in conjunction with factor analysis and support vector machines (SVMs) [14]. SVMs can
Figure 26: GMM supervector
be used to construct a binary classifier in which linearly separable data is separated by a hyperplane such that the margin between the samples of the two classes is maximal. Specifically, we define linearly separable data as follows. For a labelled dataset (y_1, \mathbf{x}_1), \ldots, (y_N, \mathbf{x}_N), y_i \in \{-1, 1\}, there exists a vector \mathbf{w} and a bias term b such that the following holds for all i = 1, \ldots, N:
\mathbf{w}^T \mathbf{x}_i + b \geq 1 \quad \text{if } y_i = 1,
\mathbf{w}^T \mathbf{x}_i + b \leq -1 \quad \text{if } y_i = -1. \qquad (75)
The optimal hyperplane is one for which
\mathbf{w}_0^T \mathbf{x} + b = 0. \qquad (76)
Support vectors are the vectors for which y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1. Another way to write the optimal hyperplane is
\mathbf{w}_0 = \sum_{i=1}^{N} y_i \alpha_i^0 \mathbf{x}_i. \qquad (77)
In the case of linearly non-separable data, a kernel function can be used. The purpose of the kernel function is to map the original input space into a new, higher-dimensional space in which the data is at least approximately linearly separable. A Gaussian mixture model-support vector machine (GMM-SVM) system for speaker verification was presented in [79], where two GMMs were trained similarly to the GMM-UBM approach. An SVM was then used as a classifier on the GMM supervectors. The kernel-based classifier presented for ASV was defined as
f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + d, \qquad (78)
where K(\cdot, \cdot) is the kernel, d a learned constant, and y_i the labels. In (78), \sum_{i=1}^{N} \alpha_i y_i = 0 and \alpha_i > 0. The \mathbf{x}_i are the support vectors. The linear kernel in [79] was based on an approximation of the Kullback-Leibler (KL) divergence between two speech utterances:
D(g_a \,\|\, g_b) = \int_{\mathbb{R}^n} g_a(\mathbf{x}) \log\left(\frac{g_a(\mathbf{x})}{g_b(\mathbf{x})}\right) d\mathbf{x}, \qquad (79)
where g_a and g_b are MAP-adapted GMMs from the two speech utterances. The following approximation was presented:
d(\mathbf{m}^a, \mathbf{m}^b) = \frac{1}{2} \sum_{i=1}^{N} \lambda_i (\mathbf{m}_i^a - \mathbf{m}_i^b)^T \Sigma_i^{-1} (\mathbf{m}_i^a - \mathbf{m}_i^b), \qquad (80)
which was then used to construct the following kernel function:
K(\mathrm{utt}_a, \mathrm{utt}_b) = \sum_{i=1}^{N} \lambda_i (\mathbf{m}_i^a)^T \Sigma_i^{-1} \mathbf{m}_i^b. \qquad (81)
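The supervector kernel (81) is linear: with diagonal covariances it reduces to a plain dot product after a per-component scaling, which a short sketch can verify (all values below are synthetic, and the names `kl_kernel` and `phi` are our own):

```python
import numpy as np

rng = np.random.default_rng(4)
C, D = 4, 3                            # mixture components and feature dimension
lam = rng.uniform(0.1, 1.0, C)         # component weights
sigma = rng.uniform(0.5, 2.0, (C, D))  # diagonal covariances

def kl_kernel(ma, mb):
    """Supervector kernel of eq. (81): sum_i lam_i * (m_i^a)^T Sigma_i^-1 m_i^b."""
    return sum(l * (a / s) @ b for l, a, s, b in zip(lam, ma, sigma, mb))

def phi(m):
    """Explicit feature map: the kernel is a dot product after scaling."""
    return np.concatenate([np.sqrt(l) * mi / np.sqrt(s)
                           for l, mi, s in zip(lam, m, sigma)])

ma = rng.standard_normal((C, D))       # component means for utterance a
mb = rng.standard_normal((C, D))       # component means for utterance b
assert np.isclose(kl_kernel(ma, mb), phi(ma) @ phi(mb))
```

This linearity is what makes the kernel cheap to evaluate even for high-dimensional supervectors.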
Factor analysis (FA) modelling for ASV utilising the GMM supervectors was originally presented in [80], and the development of FA modelling for ASR and ASV tasks has led to the so-called i-vector approach. In this kind of modelling, the GMM supervector tied to an individual speaker can be written in the form [1]
\mathbf{m}_{s,h} = \mathbf{m}_0 + \mathbf{m}_{\mathrm{spk}} + \mathbf{m}_{\mathrm{chn}} + \mathbf{m}_{\mathrm{res}}, \qquad (82)
where \mathbf{m}_0 is the speaker-, channel- and environment-independent bias term, \mathbf{m}_{\mathrm{spk}} the speaker-dependent component, \mathbf{m}_{\mathrm{chn}} the channel-dependent component, and \mathbf{m}_{\mathrm{res}} the residual. The component \mathbf{m}_0 comes directly from the trained UBM, while \mathbf{m}_{\mathrm{spk}}, \mathbf{m}_{\mathrm{chn}}, and \mathbf{m}_{\mathrm{res}} are considered random vectors in the statistical sense and capture the variance due to speaker and environment variation. The variation in environments comes from the different possible recording situations: microphones and properties of the immediate recording environment, such as the amount of echo, and so on.
Eigenvoice adaptation was an early FA-based model for ASR, where the speaker GMM supervector is modelled as
\mathbf{m}_s = \mathbf{m}_0 + \mathbf{V}\mathbf{y}_s, \qquad (83)
where \mathbf{m}_0 is the UBM supervector and \mathbf{V}\mathbf{y}_s = \mathbf{m}_{\mathrm{spk}}. The \mathbf{y}_s are the latent factors of the model.
In the identity vector (i-vector) approach [10], the model is
\mathbf{m}_{s,h} = \mathbf{m}_0 + \mathbf{T}\mathbf{w}_{s,h}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad (84)
where the \mathbf{w}_{s,h} are the latent factors, also known as the i-vectors. The estimated vector of latent variables is used as features for a classifier. The model in (84) is known as a total variability model (TVM), as it models the variance from speakers and channels at the same time. The parameter matrix \mathbf{T}, also known informally as an i-vector extractor, can be trained with an EM-like process described in [81] for the eigenvoice matrix.
Given a speech utterance u, \mathbf{w} is computed as follows:
E[\mathbf{w} \mid u] = \mathbf{w} = (\mathbf{I} + \mathbf{T}^t \Sigma^{-1} \mathbf{N}(u) \mathbf{T})^{-1}\, \mathbf{T}^t \Sigma^{-1} \mathbf{F}(u), \qquad (85)
where \mathbf{N}(u) contains the zeroth-order Baum-Welch statistics for the UBM's Gaussian components given the utterance u:
N_c = \sum_{t=1}^{L} P(c \mid \mathbf{y}_t, \Omega), \qquad (86)
where c denotes the Gaussian component, L the number of feature frames in the utterance, \mathbf{y}_t the features and \Omega the UBM. In (85), \mathbf{F}(u) contains the centralised first-order Baum-Welch statistics for the utterance u, given the UBM \Omega:
\mathbf{F}(u) = \sum_{t=1}^{L} P(c \mid \mathbf{y}_t, \Omega)(\mathbf{y}_t - \mathbf{m}_c). \qquad (87)
This approach can be considered a dimensionality reduction of the GMM supervectors described earlier.
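The posterior computation (85) can be sketched in a few lines of linear algebra (all dimensions and statistics below are synthetic placeholders of ours, not values from a trained system):

```python
import numpy as np

rng = np.random.default_rng(5)
C, D, R = 8, 5, 10                  # components, feature dim, i-vector dim
CD = C * D

T = rng.standard_normal((CD, R)) * 0.1   # total variability matrix
Sigma_inv = np.eye(CD)                   # UBM covariances (identity for brevity)

# Zeroth- and first-order Baum-Welch statistics for one utterance.
N_c = rng.uniform(1.0, 20.0, C)          # per-component occupancies, eq. (86)
N_u = np.kron(np.diag(N_c), np.eye(D))   # expanded to a CD x CD block diagonal
F_u = rng.standard_normal(CD)            # centralised first-order stats, eq. (87)

# Posterior mean of the latent factor, eq. (85).
w = np.linalg.solve(np.eye(R) + T.T @ Sigma_inv @ N_u @ T,
                    T.T @ Sigma_inv @ F_u)
print(w.shape)   # (10,)
```

Solving the linear system directly avoids forming the explicit matrix inverse in (85).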
Figure 27: i-vector extraction
3.3.3 Deep learning approaches to ASV
Deep neural architecture-based approaches have shown success in speech recognition, but using such models directly for ASR or ASV tasks is relatively new [1]. According to [1], i-vectors have been used in conjunction with a DNN for ASR [82]. DNNs have also been used to build a phonetically-aware model for ASR [83], as well as to perform feature extraction [84].
3.4 Vulnerabilities and countermeasures
An ASV system as part of an authentication system, or as an independent biometric
authentication system, is vulnerable to a number of possible attack types in various
subsystems. These various attack points are illustrated in Figure 28.

Figure 28: ASV vulnerabilities (adapted for ASV from [85])

The ISO/IEC 30107-1:2016 standard offers the following definition for the presentation attack [85]:
”Presentation to the biometric data capture subsystem with the goal of
interfering with the operation of the biometric system.”
The standard notes the existence of multiple different ways to perform this type of
attack. Our focus is the replay attack, which is achieved by introducing a pre-recorded
sample of a legitimate user to the biometric system. The threat of replay attacks has
been identified in a number of independent studies [2, 3, 4, 5]. The ease and low cost of this type of attack make it particularly interesting, and it has been shown that, without any spoofing detection in use, the performance of all existing ASV systems can be degraded.
3.5 Deep learning in replay attack detection
Deep neural architectures are a less-studied area of ASV anti-spoofing. Recent work on utilising DNNs for anti-spoofing [11] shows promise, but it remains inconclusive

Figure 29: An illustration of the triviality of a replay attack. The phone on the left plays a recorded utterance of the target speaker, while the phone on the right can be used to call the attacked system. Alternatively, the phone itself may be the attacked system, if voice authentication is in use.

whether the feature learning approach presented in [12] allows for better discrimination.
We will now present both approaches here for comparison.
A light CNN approach to anti-spoofing was presented in [11], where direct classification with a reduced, or light, CNN utilising the STFT spectrogram as input achieved performance surpassing the state-of-the-art i-vector system. The CQT spectrogram showed comparable performance, and a combination of a CNN and a recurrent neural network was also investigated with STFT spectrograms, although with reduced performance. The CNN used here is not the standard one we presented in Section 2.6.3, where we showed the standard max-pooling operation. Here, in addition to standard max-pooling, a max-feature-map (MFM) [13] pooling function is utilised, defined as

y^k_{ij} = max(x^k_{ij}, x^{k+N/2}_{ij}),    i = 1, ..., H,  j = 1, ..., W,  k = 1, ..., N/2.    (88)
In (88), k denotes the channel in the convolution (a channel in a CNN corresponds to the output of the convolution between the layer input and a particular kernel used at that layer). In the MFM, the N channels of convolution outputs are split into two groups, for which element-wise maxima are taken. Whereas the conventional max-pooling operation reduces the spatial size of each channel, the MFM halves the number of channels of the previous layer. The motivation for this addition is that the MFM introduces feature selection into the network through the selection between the convolution channels. A peculiarity of the LCNN is the absence of classical non-linear activation layers: the MFM operation replaces these completely. The operation also reduces the number of optimised parameters, hence the light CNN (LCNN) nomenclature for this type of network.

Figure 30: The max-feature-map layer.

In the proposed architectures, the MFM operation is always used
after a convolution, and before max-pooling. The proposed architecture in [11] for the
pure LCNN approach used kernels of varying sizes, but all with stride of 1, and max-pooling with 2 × 2 kernels with stride of 2. Recall that the stride is the number of indices the convolution kernel is moved at each step (here 1 for both horizontal and vertical movement).
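The MFM of Eq. (88) is simple to state in code. A numpy sketch on a single (channels, height, width) activation tensor, with purely illustrative shapes:

```python
import numpy as np

def max_feature_map(x):
    """Split the N channels into two halves and take the element-wise max."""
    n = x.shape[0]
    assert n % 2 == 0, "MFM needs an even number of channels"
    return np.maximum(x[: n // 2], x[n // 2 :])

x = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)  # N=2, H=2, W=3
y = max_feature_map(x)
print(y.shape)  # (1, 2, 3): channel count halved, spatial size unchanged
```

Contrast this with spatial max-pooling, which keeps the channel count and shrinks H and W instead.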
Figure 31 summarises the architecture presented in [11]. A problem with the front-end
processing in this type of system comes from the constant input dimension limitation
of the CNN. The most successful approach for spectrogram input has been to set the CNN input layer to some size, e.g., according to the average sample length, and then to repeat shorter signals and truncate longer ones to match the desired dimensions [11].
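The repeat-or-truncate front-end can be sketched in a few lines; the target length here is arbitrary, whereas our experiments fixed it to 2 seconds of audio:

```python
import numpy as np

def fix_length(signal, target_len):
    """Tile the signal until it covers target_len samples, then cut."""
    reps = int(np.ceil(target_len / len(signal)))
    return np.tile(signal, reps)[:target_len]

short = np.array([1.0, 2.0, 3.0])
print(fix_length(short, 7))              # [1. 2. 3. 1. 2. 3. 1.]
print(fix_length(np.arange(10.0), 4))    # [0. 1. 2. 3.]
```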
Another limitation of this network is related to the classification process itself. In [11], it is noted that, because the network overfits the relatively small ASVspoof 2017 dataset [86], a separate GMM-based classifier was proposed.
Figure 31: The LCNN architecture from [11]. Conv is short for convolution, MFM stands for max-feature-map, and FC for fully-connected. Conv (5 × 5 / 1 × 1) implies a convolutional layer with filter height and width of 5 and stride of 1 in both directions. The output dimensions of the various layers are shown on the right.
A fundamentally different approach to utilising DNNs in replay attack detection was presented in [12], where CQCCs and high-frequency cepstral coefficients (HFCCs) were used together as input to a traditional CNN. HFCCs are obtained from speech as follows. First, the signal is filtered with a high-pass filter, with the cut-off frequency at the upper limit of typical human speech (in [12], 3.5 kHz was used). After filtering, conventional MFCC processing follows, with the exception of mel filtering, which is left out because all of the higher frequencies are of interest. The HFCCs follow the reasoning that artefacts from recording and playback appear in the higher frequencies, outside the typical speech frequency range [12]. Thirty coefficients were retained from the final DCT output, for which delta and delta-delta coefficients
were calculated. For this system, two classification strategies can be used: classifica-
tion via direct binary classification with the final layer of the network, or via a separate
classifier, such that the final layer of the network acts as a multi-class classifier with
the classes corresponding to the different recording environments within the data. In
the latter set-up, the multi-class information is fed into a SVM-classifier for the final
decision.
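A rough sketch of the HFCC front-end described above. The frame length, hop size, window choice, and the brick-wall FFT high-pass standing in for the Butterworth filter of [12] are all assumptions made to keep the example self-contained:

```python
import numpy as np

def highpass(x, fs, cutoff):
    """Ideal FFT high-pass (a stand-in for the Butterworth filter of [12])."""
    X = np.fft.rfft(x)
    X[np.fft.rfftfreq(len(x), 1 / fs) < cutoff] = 0
    return np.fft.irfft(X, n=len(x))

def dct2(x):
    """Orthonormal type-II DCT."""
    N = len(x)
    n = np.arange(N)[:, None]
    k = np.arange(N)[None, :]
    c = (x @ np.cos(np.pi * (n + 0.5) * k / N)) * np.sqrt(2.0 / N)
    c[0] /= np.sqrt(2.0)
    return c

def hfcc(signal, fs=16000, n_coef=30, frame_len=400, hop=160):
    x = highpass(signal, fs, 3500.0)          # cut-off at 3.5 kHz
    out = []
    for i in range(0, len(x) - frame_len + 1, hop):
        frame = x[i : i + frame_len] * np.hamming(frame_len)
        logspec = np.log(np.abs(np.fft.rfft(frame)) ** 2 + 1e-10)
        out.append(dct2(logspec)[:n_coef])    # no mel filterbank applied
    return np.array(out)

x = np.random.default_rng(0).normal(size=16000)   # 1 s of noise at 16 kHz
print(hfcc(x).shape)   # (98, 30): frames x coefficients
```

Delta and delta-delta coefficients would then be appended to each 30-dimensional frame vector, as described above.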
3.6 System measurement and evaluation
While different performance measures have been used in biometrics over the years,
the standard for measuring ASV and ASV-antispoofing performance is the equal error
rate (EER) [1]. The two types of errors within a biometric system are the false accept (FA) and the false reject (FR). From the perspective of replay detection, a false accept corresponds to a situation where a replay sample successfully fools the detector, whereas a false reject happens when a legitimate sample is classified as a replay sample. Two measures are related to these two errors:

Spoofing false acceptance rate (SFAR) = FA errors / replay attempts,
Spoofing false rejection rate (SFRR) = FR errors / legitimate samples.    (89)
In a binary classification set-up, such as in ASV, the system ultimately makes the distinction between a replay sample and a legitimate sample via the use of a decision threshold, such as the log-likelihood threshold described in Section 3.3.1. Such thresholds have to be based on empirical tuning of the system as well as on the desired characteristics of the system. When higher security is desired, the threshold should be adjusted such that the FAR score is reduced at the cost of a higher FRR score, and vice versa for the opposite case. The measure that summarises the behaviour of the classifier as the decision threshold is adjusted is the EER, which is the common value of the FAR and FRR scores at the operating point where the two are equal. The behaviour of the system as the threshold is adjusted is often visualised by what is known as the detection error trade-off (DET) curve [87].
In practice, the EER can be estimated with the ROCCH-EER algorithm, described in
[88].
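A naive empirical estimate illustrates the definition: sweep the decision threshold over the pooled scores and take the operating point where FAR and FRR are closest. The ROCCH-EER of [88] is the more principled convex-hull-based estimate; this sketch is only meant to show the idea:

```python
import numpy as np

def eer(genuine, spoof):
    """Genuine/spoof score arrays; a higher score means more genuine-like."""
    best = (np.inf, 1.0)                 # (|FAR - FRR|, mean error rate)
    for t in np.sort(np.concatenate([genuine, spoof])):
        far = np.mean(spoof >= t)        # spoofed samples accepted
        frr = np.mean(genuine < t)       # genuine samples rejected
        best = min(best, (abs(far - frr), (far + frr) / 2))
    return best[1]

gen = np.array([0.9, 0.8, 0.7, 0.6])
spf = np.array([0.4, 0.3, 0.2, 0.1])
print(eer(gen, spf))   # 0.0: the two score sets are perfectly separable
```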
4 Experimental set-up and results
For this thesis, initial experiments consisted of a comparison between the baseline
GMM system with CQCC features and a reduced and modified variant of the LCNN
system with spectrogram features. The baseline system consisted of a GMM-UBM
system with 90-dimensional CQCC features, which included static coefficients as well
as delta and delta-delta coefficients. The bins per octave parameter was set at 96. The
number of Gaussian components for the spoofed and genuine models was 512. The
first tested LCNN system, adapted from the spectrogram LCNN system in [11], used 129 × 236-dimensional, second-order Butterworth high-pass-filtered spectrograms as network input. The first dimension corresponds to the number of frequency bins and
the second to the number of frames. Unified input shape for the network was achieved
by processing 2 seconds of audio from each file either by repeating or truncating the
signal. The best-performing network utilised a total of 9 convolutional layers, 4 of
which had filter size of 1× 1. These 1× 1 layers are effectively so-called network-in-
network layers [89, 11]. Two fully-connected layers were used at the output-end of the
network: one with 64 units, for which a dropout of 0.7 was used, and one with 2 units
serving as the final softmax classifier layer of the network. The original network had a significantly higher parameter count, at over 370K parameters. In our variant, the number of parameters dropped due to the smaller CNN input, and it was reduced further by lowering the number of channels in the convolutional layers. In terms of training parameters, a learning rate of 10−4 and Xavier initialisation were used.
The architecture is detailed in Figure 32.
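The baseline's decision rule can be sketched as an average per-frame log-likelihood ratio between the genuine and spoofed GMMs. The toy two-component, two-dimensional models below merely stand in for the actual 512-component GMMs trained on 90-dimensional CQCCs:

```python
import numpy as np

def gmm_loglik(Y, w, mu, var):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    ll = -0.5 * (((Y[:, None, :] - mu) ** 2) / var
                 + np.log(2 * np.pi * var)).sum(-1)       # (frames, comps)
    ll += np.log(w)
    m = ll.max(axis=1, keepdims=True)                     # log-sum-exp
    return (m + np.log(np.exp(ll - m).sum(axis=1, keepdims=True))).ravel()

def llr_score(Y, genuine, spoof):
    """Positive score: utterance looks more genuine than spoofed."""
    return gmm_loglik(Y, *genuine).mean() - gmm_loglik(Y, *spoof).mean()

rng = np.random.default_rng(1)
genuine = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]),
           np.ones((2, 2)))
spoof = (np.array([0.5, 0.5]), np.array([[4.0, 4.0], [5.0, 5.0]]),
         np.ones((2, 2)))
Y = rng.normal(size=(50, 2))             # frames near the genuine model
print(llr_score(Y, genuine, spoof) > 0)  # True: classified as genuine
```

Thresholding this score then gives the binary replay/genuine decision discussed in Section 3.6.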
4.1 The ASVspoof 2017 v2 data
The 2nd Automatic Speaker Verification Spoofing and Countermeasures Challenge
(ASVspoof 2017) database v2 [90] is the improved version of the original 2017
ASVspoof dataset [91], which was developed for the purpose of development and test-
ing of replay detection in ASV. The dataset contains speech from 42 speakers across 179 sessions and 61 different replay configurations. The dataset is provided with a predetermined
split into three standard portions: train, development and evaluation, although the split
between training samples and development samples can be adjusted. The speech sam-
ples in the train set are used for model training and this set contains 3014 samples with
equal split into labelled spoofed and genuine samples.

Figure 32: The reduced LCNN architecture, based on the architecture summarised in Figure 31.

The development set is used to
develop and tune the replay detection system; it contains 1710 samples, with 760 labelled genuine and 950 labelled spoofed samples. The evaluation set contains 13306 samples, with 12008 spoofed and 1298 genuine samples.
4.2 The results and discussion
The results for the development data obtained with the architecture described earlier are displayed in Table 1. The baseline results are shown without any additional optimisation of the CQCC features.
Table 1: Results comparison

System                   | EER % (dev) | EER % (eval)
CQCC-GMM Baseline        | 11.28       | 24.77 [92]
LCNN_FFT                 | 3.95 [11]   | 6.73 [11]
GD-ResNet-18 + Attention | 0.0 [92]    | 0.0 [92]
LCNN reduced             | 7.33        | -
LCNN full (STFT)         | 26.69       | 41.71
LCNN full (GD-gram)      | 42.89       | 47.19
The network reliably achieved over 90 percent accuracy on the development set after
45 epochs with batch-size of 64 (the original relatively low learning rate of 10−4 was
kept from [11]). However, for the much larger evaluation set the results were inconclu-
sive and no reliable discrimination between replay and genuine samples was achieved.
The DET-curves of the tested systems are shown in Figure 33. The discrepancy of the results between the two portions of the ASVspoof 2017 v2 dataset may be explained by the relative size as well as the nature of the evaluation set, which consists of completely new attacks. For these experiments, an additional hypothesis was included: it was assumed that replay-related noise would be prevalent in the higher frequencies. This led to the adoption of the high-pass filter described earlier. The idea originates from the work in [12] with the so-called high-frequency cepstral coefficients (HFCCs). To see whether the lopsided spectrograms (zeroes due to filtering) had an effect, we also tested the same model with larger spectrograms cut at the same 3.5 kHz point, retaining only the frequency bins from there up towards the Nyquist frequency. However,
no marked improvement was achieved by this change with respect to the discrepancy
between the development and evaluation results. In another attempt to improve the results on the larger dataset, we considered again a much larger spectrogram as the input.

Figure 33: DET-curves of the tested systems
Experiments were done with non-filtered 512× 400 sized spectrograms, where 512 is
the number of frequency bins and 400 the number of frames. With this input size, the
original number of filters and the filter sizes were taken from [11], but working with
this larger model proved difficult due to the space and time requirements. This larger model was trained for up to 100 epochs, but no reasonable results for the evaluation set were obtained.
The original LCNN architecture described earlier was also tested with two types of
inputs: the STFT spectrogram as well as the group delay gram [92] extracted from the
STFT spectrogram. Recent work in [92] on deep learning approaches to replay attack
detection suggested that the group delay information computed from STFT offered
better discriminatory information for ASV anti-spoofing. In [92], the popular image
classification network, ResNet [93], was adopted for replay attack detection by utilising
the group delay gram in place of the STFT spectrogram. The results are in Table 1.
Group delay is the negative derivative of the phase portion of the STFT:

τ(ω, t) = −dθ(ω, t)/dω.    (90)
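In practice, the group delay is not computed by differentiating unwrapped phase; Eq. (91) below gives a stable form using the STFTs of x[n] and n·x[n]. A single-frame numpy sketch (the small floor added to the denominator is our own guard against division by zero):

```python
import numpy as np

def group_delay_frame(frame):
    """Group delay of one analysis frame via the two-FFT identity."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)           # STFT frame of x[n]
    Y = np.fft.rfft(n * frame)       # STFT frame of n*x[n]
    denom = np.abs(X) ** 2 + 1e-10   # small floor for numerical stability
    return (X.real * Y.real + X.imag * Y.imag) / denom

frame = np.sin(2 * np.pi * 0.1 * np.arange(256))
tau = group_delay_frame(frame)
print(tau.shape)                     # (129,): one value per rfft bin
```

Stacking such frame-wise vectors over time yields the group delay gram used in [92].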
Group delay may also be computed via:
τ(ω, t) = (X_R(ω, t) Y_R(ω, t) + X_I(ω, t) Y_I(ω, t)) / |X(ω, t)|²,    (91)
where X(ω, t) is the STFT of x[n] and Y(ω, t) the STFT of n·x[n]. The subscripts R and I denote the real and imaginary parts of the STFT. The group delay gram as a representation of speech was identified already in [94]. A comparison of the STFT spectrogram and the GD-gram is shown in Figure 34. Both STFT spectrograms and GD-grams were investigated with the
original LCNN architecture.

Figure 34: STFT spectrogram and group delay

The EER results for the large LCNN systems were obtained as follows. First, the network was trained on the ASVspoof v2 train data only,
after which the trained network was used to extract features drawn from the fully con-
nected layer before the final softmax layer. This was suggested in [11], but the details
of the procedure were omitted. For the feature extraction, 200 passes over the datasets were performed, during which the spectrogram/GD-gram-based feature was extracted from a random 2-second chunk of each audio file. This results in extracted datasets from the train, development and evaluation portions of the ASVspoof v2 data that are heavily overlapping, due to the overall short lengths of the files. The extracted sets are then used to evaluate the EER of the development and evaluation sets via GMMs. While
features extracted with the STFT version of the large network offered some discrim-
ination capability on the development set, the performance on the evaluation set re-
mained little better than a random guess. Both large networks utilising either STFT
features or GD-grams suffered heavily from overfitting on the training data. The following regularisation techniques were tested with both features: batch normalisation
for both convolutional and fully-connected layers, spatial dropout as well as normal
1D dropout, and random cropping. Random cropping and resizing of images are both
popular regularisation and data augmentation techniques used in computer vision and
image classification tasks. The employed regularisation techniques proved insufficient to prevent wide-scale memorisation by the network and served only to delay the eventual memorisation.
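The random-cropping augmentation mentioned above can be sketched in a few lines; cropping is along the time axis, and the crop width is a free parameter:

```python
import numpy as np

def random_time_crop(spec, width, rng):
    """Crop `width` consecutive frames at a random offset (axis 1 = time)."""
    if spec.shape[1] <= width:
        return spec
    start = rng.integers(0, spec.shape[1] - width + 1)
    return spec[:, start : start + width]

rng = np.random.default_rng(0)
spec = np.arange(4 * 10).reshape(4, 10)    # 4 frequency bins, 10 frames
crop = random_time_crop(spec, 6, rng)
print(crop.shape)   # (4, 6)
```

Each training pass then sees a slightly different window of every spectrogram, which is what delays, but here did not prevent, memorisation.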
Our findings possibly point towards two interesting avenues in terms of the replay at-
tack detection task: data-augmentation and the use of pre-trained models. Given the
current data-restricted situation, the results in [92] are interesting because they point to
the usefulness of pre-trained image classification networks for replay detection. The
other avenue is open to exploration: generative models such as WaveNet [95] and generative adversarial networks such as SEGAN [96] might allow for new ways of augmenting the spoofing detection corpora that exist today.
5 Conclusion
Automatic speaker verification is a less-adopted, yet interesting, option for biometric authentication. The threat of both synthetic spoofing attacks and replay attacks
has been recognised in recent years. The cost-effective nature of the replay attacks
makes the detection of such attacks an important problem to be solved. In this work,
we looked at some relatively recent deep learning approaches for replay detection and
attempted to reproduce some of the results.
In Section 2, we discussed the machine learning background and showed how Gaussian mixture models can be trained with the expectation-maximisation algorithm. In Section 3, we offered an overview of automatic speaker recognition and verification, along with a background on biometrics. Finally, in Section 4, we described the ASVspoof 2017 v2 data as well as our experiments.
In Section 4, two potentially interesting avenues to pursue in replay attack detection
were identified, both of which would largely try to solve the same issue (of limited
data): data-augmentation and the use of pre-trained models. The use of state-of-the-art
generative models for speech might allow for new types of approaches to be pursued
for replay detection, due to larger datasets. Pre-trained models, on the other hand,
potentially allow for good classification performance on restricted datasets. We would
consider both of these two avenues to be interesting to pursue in the future.
Looking at the wider context of biometrics, where ASV and ASV replay detection both sit, and considering how large a digital footprint each individual leaves when interacting with different kinds of systems, an interesting topic is emerging: biometric de-identification [97]. For the field of speaker recognition, this topic offers both new challenges and opportunities, as future speaker recognition systems may need to account not only for possible spoofing and replay attempts, but also for highly sophisticated attempts to evade recognition.
References
[1] J. H. L. Hansen and T. Hasan, “Speaker Recognition by Machines and Humans:
A tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99,
Nov. 2015.
[2] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech, 2017, pp. 2–6.
[3] F. Alegre, A. Janicki, and N. Evans, “Re-assessing the threat of replay spoofing
attacks against automatic speaker verification,” in 2014 International Conference
of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, Sep.
2014, pp. 1–6.
[4] Z. Wu, S. Gao, E. S. Cling, and H. Li, “A study on replay attack and anti-spoofing
for text-dependent speaker verification,” in Signal and Information Processing
Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, Dec.
2014, pp. 1–5.
[5] J. Gałka, M. Grzywacz, and R. Samborski, “Playback attack detection
for text-dependent speaker verification over telephone channels,” Speech
Communication, vol. 67, pp. 143–153, Mar. 2015. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167639314000880
[6] D. A. Reynolds, “Speaker identification and verification using Gaussian
mixture speaker models,” Speech Communication, vol. 17, no. 1, pp. 91–108,
Aug. 1995. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/016763939500009D
[7] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker Verification
Using Adapted Gaussian Mixture Models,” Digital Signal Processing, vol. 10,
no. 1, pp. 19–41, Jan. 2000. [Online]. Available: http://www.sciencedirect.com/
science/article/pii/S1051200499903615
[8] S. Davis and P. Mermelstein, “Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences,” IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366,
Aug. 1980.
[9] M. Todisco, H. Delgado, and N. Evans, “Constant Q cepstral coefficients:
A spoofing countermeasure for automatic speaker verification,” Computer
Speech & Language, vol. 45, pp. 516–535, Sep. 2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0885230816303114
[10] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Fac-
tor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[11] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. Interspeech, 2017, pp. 82–86.
[12] P. Nagarsheth, E. Khoury, K. Patil, and M. Garland, “Replay attack detection using DNN for channel discrimination,” in Proc. Interspeech, 2017, pp. 97–101.
[13] X. Wu, R. He, Z. Sun, and T. Tan, “A Light CNN for Deep Face Representation
with Noisy Labels,” arXiv:1511.02683 [cs], Nov. 2015, arXiv: 1511.02683.
[Online]. Available: http://arxiv.org/abs/1511.02683
[14] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning,
vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: https:
//link.springer.com/article/10.1023/A:1022627411411
[15] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge,
USA: MIT Press, 2014. [Online]. Available: http://ebookcentral.proquest.com/
lib/uef-ebooks/detail.action?docID=3339490
[16] C. M. Bishop, Pattern Recognition and Machine Learning. New York, USA: Springer, 2006.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, Mas-
sachusetts: The MIT Press, Nov. 2016.
[18] V. Vapnik, “Principles of Risk Minimization for Learning Theory,”
in Advances in Neural Information Processing Systems 4, J. E.
Moody, S. J. Hanson, and R. P. Lippmann, Eds. Morgan-Kaufmann,
1992, pp. 831–838. [Online]. Available: http://papers.nips.cc/paper/
506-principles-of-risk-minimization-for-learning-theory.pdf
[19] V. N. Vapnik, Statistical learning theory. Wiley, 1998.
[20] D. J. Bartholomew, M. Knott, and I. Moustaki, Latent Variable Models and Factor Analysis: A Unified Approach. London, UK: Wiley, 2011.
[21] I. Koch, Analysis of multivariate and high-dimensional data, ser. Cambridge se-
ries in statistical and probabilistic mathematics. Cambridge University Press,
2014.
[22] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from
Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society.
Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977. [Online]. Available:
http://www.jstor.org/stable/2984875
[23] R. A. Fisher, “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 222, no. 594–604, pp. 309–368, Jan. 1922. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsta.1922.0009
[24] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Informa-
tion Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
[25] H. Steinhaus, “Sur la division des corps matériels en parties,” Bulletin de l’Académie Polonaise des Sciences, vol. IV, no. 12, pp. 801–804, 1956.
[26] J. MacQueen, “Some methods for classification and analysis of multivariate
observations.” The Regents of the University of California, 1967. [Online].
Available: https://projecteuclid.org/euclid.bsmsp/1200512992
[27] C. M. Bishop, Neural networks for pattern recognition. Clarendon, 1995.
[28] B. Widrow and M. E. Hoff, “Neurocomputing: Foundations of Research,” J. A.
Anderson and E. Rosenfeld, Eds. Cambridge, MA, USA: MIT Press, 1988, pp.
123–134. [Online]. Available: http://dl.acm.org/citation.cfm?id=65669.104390
[29] C. von der Malsburg, “Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.” Springer, Berlin, Heidelberg, 1986, pp. 245–248. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-642-70911-1_20
[30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by
back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[Online]. Available: https://www.nature.com/articles/323533a0
[31] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”
Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec.
1989. [Online]. Available: https://doi.org/10.1007/BF02551274
[32] K. Hornik, “Approximation capabilities of multilayer feedforward networks.”
Neural Networks, vol. 4, no. 2, pp. 251–257, 1991. [Online]. Available:
http://search.proquest.com/docview/25365047
[33] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen Netzen,” Diploma thesis, Institut für Informatik, Technische Universität München, München, Germany, 1991.
[34] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, Gradient Flow in Re-
current Nets: the Difficulty of Learning Long-Term Dependencies, 2001.
[35] V. Nair and G. Hinton, “Rectified linear units improve Restricted Boltzmann ma-
chines,” 2010, pp. 807–814.
[36] P. J. Werbos, “Beyond regression: New tools for prediction and analysis in the behavioral sciences,” Ph.D. dissertation, Harvard University, Cambridge, MA, USA, 1974.
[37] D. B. Parker, “Learning-logic.” Tech. Rep., 1985.
[38] D. R. Wilson and T. R. Martinez, “The general inefficiency of batch training for
gradient descent learning,” Neural Networks, vol. 16, no. 10, pp. 1429–1451,
Dec. 2003. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0893608003001382
[39] S. Ruder, “An overview of gradient descent optimization algorithms,”
arXiv:1609.04747 [cs], Sep. 2016, arXiv: 1609.04747. [Online]. Available:
http://arxiv.org/abs/1609.04747
[40] B. T. Polyak, “Some methods of speeding up the convergence of iteration
methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4,
no. 5, pp. 1–17, Jan. 1964. [Online]. Available: http://www.sciencedirect.com/
science/article/pii/0041555364901375
[41] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the Importance of
Initialization and Momentum in Deep Learning,” in Proceedings of the 30th
International Conference on International Conference on Machine Learning
- Volume 28, ser. ICML’13. Atlanta, GA, USA: JMLR.org, 2013, pp. III–
1139–III–1147. [Online]. Available: http://dl.acm.org/citation.cfm?id=3042817.
3043064
[42] G. E. Hinton, N. Srivastava, and K. Swersky, “Lecture 6a: Overview of mini-batch gradient descent,” Toronto, Canada, 2012. [Online]. Available: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
[43] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”
arXiv:1412.6980 [cs], Dec. 2014, arXiv: 1412.6980. [Online]. Available:
http://arxiv.org/abs/1412.6980
[44] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for
Online Learning and Stochastic Optimization,” Journal of Machine Learning
Research, vol. 12, no. Jul, pp. 2121–2159, 2011. [Online]. Available:
http://jmlr.org/papers/v12/duchi11a.html
[45] M. D. Zeiler, “ADADELTA: An Adaptive Learning Rate Method,”
arXiv:1212.5701 [cs], Dec. 2012, arXiv: 1212.5701. [Online]. Available:
http://arxiv.org/abs/1212.5701
[46] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in
Advances in Neural Information Processing Systems 27, Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.
Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http:
//papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[47] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The Marginal
Value of Adaptive Gradient Methods in Machine Learning,” in Advances in
Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates,
Inc., 2017, pp. 4148–4158. [Online]. Available: http://papers.nips.cc/paper/
7003-the-marginal-value-of-adaptive-gradient-methods-in-machine-learning.
[48] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep
feedforward neural networks,” in Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, Mar. 2010, pp. 249–256.
[Online]. Available: http://proceedings.mlr.press/v9/glorot10a.html
[49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: A Simple Way to Prevent Neural Networks from Overfitting,”
Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online].
Available: http://jmlr.org/papers/v15/srivastava14a.html
[50] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied
to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–
2324, Nov. 1998.
[51] A. Karpathy, “Course notes for Stanford CS231: Convolutional Neural
Networks for Visual Recognition,” accessed April 1, 2018. [Online]. Available:
http://cs231n.github.io/convolutional-networks/
[52] H. Beigi, Fundamentals of speaker recognition. New York: Springer US, 2011.
[53] R. Bolle, Ed., Guide to biometrics, ser. Springer professional computing. New
York: Springer, 2004.
[54] R. Clarke, “Human Identification in Information Systems: Management Chal-
lenges and Public Policy Issues,” Information Technology & People, vol. 7, no. 4,
pp. 6–37, Dec. 1994.
[55] S. K. Modi, Biometrics in identity management : concepts to applications, ser.
Artech House information security and privacy series. Artech House, 2011.
[56] B. Miller, “Vital signs of identity [biometrics],” IEEE Spectrum, vol. 31, no. 2,
pp. 22–30, Feb. 1994.
[57] J. Benesty, M. M. Sondhi, and Y. Huang, Eds., Springer handbook of speech
processing. Berlin; London: Springer, 2008.
[58] E. C. Ifeachor, Digital signal processing : a practical approach, 2nd ed. Prentice
Hall, 2002.
[59] F. Nolan, The phonetic bases of speaker recognition. Cambridge: Cambridge
University Press, 1983.
[60] J. O. Smith, Spectral audio signal processing. Center for Computer Research in
Music and Acoustics, Department of Music : W3K, 2011.
[61] ——, Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications, 2nd ed., 2007. [Online]. Available: https://ccrma.stanford.edu/~jos/st/
[62] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
[63] S. S. Stevens, J. Volkmann, and E. B. Newman, “A Scale for the Measurement
of the Psychological Magnitude Pitch,” The Journal of the Acoustical Society
of America, vol. 8, no. 3, pp. 185–190, Jan. 1937. [Online]. Available:
https://asa-scitation-org.ezproxy.uef.fi:2443/doi/abs/10.1121/1.1915893
[64] D. O’Shaughnessy, Speech communication: human and machine. Addison-
Wesley Pub. Co., 1987.
[65] F. Camastra, Machine learning for audio, image and video analysis : theory and
applications, second edition ed., ser. Advanced information and knowledge pro-
cessing. Springer-Verlag London, 2015.
[66] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[67] S. Theodoridis, Pattern Recognition, 4th ed. Elsevier Science, 2008.
[68] S. Furui, “Comparison of speaker recognition methods using statistical features
and dynamic features,” IEEE Transactions on Acoustics, Speech, and Signal Pro-
cessing, vol. 29, no. 3, pp. 342–350, Jun. 1981.
[69] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau,
S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A.
Reynolds, “A Tutorial on Text-Independent Speaker Verification,” EURASIP
Journal on Advances in Signal Processing, vol. 2004, no. 4, p. 101962, Dec.
2004. [Online]. Available: https://asp-eurasipjournals.springeropen.com/articles/
10.1155/S1110865704310024
[70] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore,
J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK book,
Jan. 2002.
[71] H. Boril and J. H. L. Hansen, “Unsupervised Equalization of Lombard Effect
for Speech Recognition in Noisy Adverse Environments,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1379–1393, Aug.
2010.
[72] J. W. Pelecanos and S. Sridharan, “Feature warping for robust speaker verifica-
tion,” in Odyssey, 2001.
[73] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transac-
tions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, Oct. 1994.
[74] J. Brown, “Calculation of a Constant-Q Spectral Transform,” Journal of the
Acoustical Society of America, vol. 89, no. 1, pp. 425–434, Jan. 1991.
[75] X. Zhou, D. Garcia-Romero, R. Duraiswami, C. Espy-Wilson, and S. Shamma,
“Linear versus mel frequency cepstral coefficients for speaker recognition,” in
2011 IEEE Workshop on Automatic Speech Recognition Understanding, Dec.
2011, pp. 559–564.
[76] J. Markel, B. Oshika, and A. Gray, “Long-term feature averaging for speaker
recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. 25, no. 4, pp. 330–337, Aug. 1977.
[77] R. Kuhn, P. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Fincke,
and M. Contolini, “Eigenvoices for speaker adaptation,” pp. 1771–1774, 1998.
[Online]. Available: http://www.eurecom.fr/publication/198
[78] P. Kenny, M. Mihoubi, and P. Dumouchel, “New MAP estimators for speaker
recognition.” Centre de Recherche Informatique de Montréal (CRIM), Canada:
International Speech Communication Association, 2003, pp. 2961–2964.
[79] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines
using GMM supervectors for speaker verification,” IEEE Signal Processing Let-
ters, vol. 13, no. 5, pp. 308–311, May 2006.
[80] P. Kenny and P. Dumouchel, “Disentangling speaker and channel effects in
speaker verification,” vol. 1, 2004, pp. I37–I40.
[81] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling with sparse
training data,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3,
pp. 345–354, May 2005.
[82] O. Ghahabi and J. Hernando, “i-Vector Modeling with Deep Belief Networks for
Multi-Session Speaker Recognition,” 2014.
[83] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker
recognition using a phonetically-aware deep neural network,” in ICASSP, IEEE
International Conference on Acoustics, Speech and Signal Processing - Proceed-
ings, May 2014, pp. 1695–1699.
[84] T. Yamada, L. Wang, and A. Kai, “Improvement of distant-talking speaker iden-
tification using bottleneck features of DNN,” 2013, pp. 3661–3664.
[85] “ISO/IEC 30107-1:2016 - Information technology – Biometric presentation
attack detection – Part 1: Framework,” Jan. 2016. [Online]. Available:
https://www.iso.org/standard/53227.html
[86] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans,
M. Todisco, and H. Delgado, “ASVspoof: The Automatic Speaker Verification
Spoofing and Countermeasures Challenge,” IEEE Journal of Selected Topics in
Signal Processing, vol. 11, no. 4, pp. 588–604, Jun. 2017. [Online]. Available:
http://ieeexplore.ieee.org/document/7858696/
[87] A. F. Martin, G. R. Doddington, T. Kamm, M. Ordowski, and M. A. Przy-
bocki, “The DET curve in assessment of detection task performance,” in EU-
ROSPEECH, 1997.
[88] N. Brümmer and E. de Villiers, “The BOSARIS Toolkit: Theory, Algorithms
and Code for Surviving the New DCF,” arXiv:1304.2865 [cs, stat], Apr. 2013,
arXiv: 1304.2865. [Online]. Available: http://arxiv.org/abs/1304.2865
[89] M. Lin, Q. Chen, and S. Yan, “Network In Network,” arXiv:1312.4400 [cs], Dec.
2013, arXiv: 1312.4400. [Online]. Available: http://arxiv.org/abs/1312.4400
[90] N. Evans, M. Sahidullah, J. Yamagishi, M. Todisco, K. A. Lee, H. Delgado,
and T. Kinnunen, “The 2nd Automatic Speaker Verification Spoofing and
Countermeasures Challenge (ASVspoof 2017) Database, Version 2,” Apr. 2018.
[Online]. Available: https://datashare.is.ed.ac.uk/handle/10283/3055
[91] T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamäki,
D. Thomsen, A. Sarkar, Z. H. Tan, H. Delgado, M. Todisco, N. Evans, V. Hau-
tamäki, and K. A. Lee, “RedDots replayed: A new replay spoofing attack cor-
pus for text-dependent speaker verification research,” in 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017,
pp. 5395–5399.
[92] F. Tom, M. Jain, and P. Dey, “End-To-End Audio Replay Attack Detec-
tion Using Deep Convolutional Networks with Attention,” in Interspeech, Hyder-
abad, India, Sep. 2018.
[93] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image
Recognition,” arXiv:1512.03385 [cs], Dec. 2015, arXiv: 1512.03385. [Online].
Available: http://arxiv.org/abs/1512.03385
[94] H. A. Murthy, R. M. Hegde, and V. R. R. Gadde, “The modified group delay
feature: a new spectral representation of speech,” in INTERSPEECH, 2004.
[95] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves,
N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative
Model for Raw Audio,” arXiv:1609.03499 [cs], Sep. 2016, arXiv: 1609.03499.
[Online]. Available: http://arxiv.org/abs/1609.03499
[96] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech Enhancement
Generative Adversarial Network,” arXiv:1703.09452 [cs], Mar. 2017, arXiv:
1703.09452. [Online]. Available: http://arxiv.org/abs/1703.09452
[97] S. Ribaric, A. Ariyaeeinia, and N. Pavesic, “De-identification for privacy
protection in multimedia content: A survey,” Signal Processing: Image
Communication, vol. 47, pp. 131–151, Sep. 2016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0923596516300856