Presentation attack detection in automatic speaker verification with deep learning
Juhani Seppälä
Master’s Thesis
School of Computing
Computer Science
April 2019
UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing
Computer Science
Student, Juhani Seppälä: Replay attack detection in speaker verification with deep learning
Master's Thesis, 74 p.
Supervisor of the Master's Thesis: PhD Tomi Kinnunen
April 2019
Abstract: In the context of information security, traditional methods of user authentication are based either on knowledge or on a physical object. There are, however, various situations in which the traditional authentication methods are either insufficient on their own or cannot be applied at all. The demand for alternative, biometrics-based authentication methods has been growing continuously, and today's users expect systems to provide both security and usability. Businesses and authorities need tools to curb fraud and abuse. Automatic speaker verification (ASV) is a biometric authentication method that utilises speech. For businesses, ASV offers a means of fraud prevention, and for the authorities it provides new tools, for example, for crime scene investigation. As voice-operated intelligent systems become more common, the need for voice-based authentication also grows. The ISO/IEC 30107-1:2016 standard defines a so-called presentation attack against biometric systems. Presentation attacks are a problem for all types of biometric authentication systems. One way to carry out such an attack is to replay recorded speech of the target person to the biometric authentication system. Several independent studies have found that the performance of these systems can be degraded with replayed recordings. The most advanced current systems utilise so-called i-vectors, whereas classical ASV systems were based on Gaussian mixture models and acoustic features. In this work, we study the performance of so-called deep learning methods for presentation attack detection.
Keywords: biometrics, automatic speaker recognition, spoofing, anti-spoofing, replay attack, ASVspoof, DNN, CNN
CCS concepts (ACM Computing Classification System, 2012 version): Security and privacy → Biometrics, Computing methodologies → Neural networks, Computing methodologies → Supervised learning by classification
UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing
Computer Science
Student, Juhani Seppälä: Replay attack detection in speaker verification with deep learning
Master's Thesis, 74 p.
Supervisor of the Master's Thesis: PhD Tomi Kinnunen
April 2019
Abstract: In the context of information security, the traditional means of user authentication involves either knowledge (password) or a physical token (badge or key). There are situations, however, where these are either not applicable or insufficient alone. The demand for alternative forms of authentication based on biometrics has been increasing, and today's users demand both security and convenience. Businesses and governments demand tools to combat fraud and abuse. Automatic speaker verification (ASV) is a biometric authentication method utilising speech data. For businesses, ASV allows for early fraud detection, while for law enforcement, techniques from ASV may be of use in forensics. And, as more voice-operated, intelligent systems become mainstream in society, the need for voice-based authentication increases. The ISO/IEC 30107-1:2016 standard defines a so-called presentation attack for biometric systems. Presentation attacks present a problem for all biometric systems. One method to perform a presentation attack against an ASV system is to replay a recording of the target speaker's speech to the biometric authentication system. Multiple independent studies have identified that ASV system performance can be degraded when replay samples are introduced. Current state-of-the-art systems for ASV utilise so-called i-vectors, while the classical systems were based on Gaussian mixture modelling of acoustic speech features. In this work, we investigate so-called deep learning approaches to replay attack detection.
Keywords: biometrics, speaker verification, spoofing, anti-spoofing, replay attack, ASVspoof, DNN, CNN
CCS concepts (ACM Computing Classification System, 2012 version): Security and privacy → Biometrics, Computing methodologies → Neural networks, Computing methodologies → Supervised learning by classification
Acronyms and abbreviations
ADC Analog-to-digital conversion
ASR Automatic speaker recognition
ASV Automatic speaker verification
CNN Convolutional neural network
CQCC Constant-Q cepstral coefficients
CQT Constant-Q transform
DCT Discrete cosine transform
DNN Deep neural network
DFT Discrete Fourier transform
GMM Gaussian mixture model
EER Equal error rate
EM Expectation-maximisation
FFT Fast Fourier transform
HFCC High frequency cepstral coefficients
LCNN Light convolutional neural network
LDA Linear discriminant analysis
LFCC Linear frequency cepstral coefficients
LPC Linear prediction coefficients
LPCC Linear prediction cepstrum coefficients
LVM Latent variable model
MAP Maximum a posteriori
MFCC Mel-frequency cepstral coefficients
MFM Max-feature-map
MLE Maximum likelihood estimate/estimation
MLP Multi-layer perceptron
MSE Mean squared error
PLDA Probabilistic linear discriminant analysis
RASTA Relative spectral processing
ReLU Rectified linear unit
STFT Short-time Fourier transform
SVM Support vector machine
TVM Total variability model
UBM Universal background model
Mathematical notation
a The vector a.
M The matrix M.
p(x) Probability density on x.
N (µ,Σ) Multivariate normal density with mean µ and covariance Σ.
I The identity or unit matrix.
Contents
1 Introduction 1
2 Machine learning 4
2.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Some properties of machine learning . . . . . . . . . . . . . . . . . . 6
2.4 Latent variable models and the Gaussian mixture model . . . . . . . . 8
2.4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.1 The classic multi-layer perceptron . . . . . . . . . . . . . . . 15
2.6.2 Implications of (32) and discussion . . . . . . . . . . . . . . 21
2.6.3 Convolutional neural networks . . . . . . . . . . . . . . . . . 24
3 Speaker recognition and verification 29
3.1 Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Authentication and biometrics . . . . . . . . . . . . . . . . . 31
3.1.2 Speaker verification . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Speech processing and feature extraction . . . . . . . . . . . . . . . . 32
3.2.1 Mel-frequency cepstral coefficients (MFCCs) . . . . . . . . . 34
3.2.2 Constant Q cepstral coefficients (CQCCs) . . . . . . . . . . . 41
3.2.3 The spectrogram . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.4 Other features . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Speaker modelling and classification . . . . . . . . . . . . . . . . . . 45
3.3.1 Gaussian mixture model approaches to ASV . . . . . . . . . 46
3.3.2 Linear statistical model approaches . . . . . . . . . . . . . . 48
3.3.3 Deep learning approaches to ASV . . . . . . . . . . . . . . . 51
3.4 Vulnerabilities and countermeasures . . . . . . . . . . . . . . . . . . 52
3.5 Deep learning in replay attack detection . . . . . . . . . . . . . . . . 52
3.6 System measurement and evaluation . . . . . . . . . . . . . . . . . . 56
4 Experimental set-up and results 58
4.1 The ASVspoof 2017 v2 data . . . . . . . . . . . . . . . . . . . . . . 58
4.2 The results and discussion . . . . . . . . . . . . . . . . . . . . . . . 60
5 Conclusion 64
1 Introduction
The expectation for high usability and the demand for alternative forms of user authen-
tication have been ever increasing and today’s users expect seamless and hassle-free
access to various services. Similarly, due to the increasing demand for services requiring a high degree of security that were traditionally handled on-site, such as banking and various governmental services, service providers have long been interested in the ability to enhance and, in some cases, simplify their user authentication schemes. Automatic speaker verification (ASV), as an application of automatic speaker recognition (ASR), makes it possible to use a biometric that is relatively easy and, more importantly, non-intrusive to collect: human speech. The field of ASV is still under
rapid change, with recent developments pointing towards an interesting future from
the perspective of the so-called deep learning evolution of machine learning, where
deep neural architectures are studied for modelling and inference.
Key use-cases for ASV can be found in off-site electronic services, business customer
service, as well as in law-enforcement and forensics [1]. Untapped potential remains in biometric authentication (including speech), considering all the modern-day personal computing devices. Similarly, voice-based interfaces, such as Apple Siri, have been rising in popularity. Many modern smart devices already include biometrics, either in the form of facial recognition or in the form of fingerprint detection. ASV adds another possible layer of security to these devices.
The threat of replay attacks on ASV systems has been identified in separate studies [2, 3, 4, 5]. A replay attack against an ASV system is initiated by playing a recording of a person's speech while interfacing with the target ASV system. Figure 1 demonstrates the relative ease of obtaining high-quality samples of the speaker of interest in the age of the smartphone. Figure 2 illustrates the relative ease of launching a replay attack. In [3], it was shown that systems do not fare well under high-quality impostor attacks, where even the state-of-the-art systems were heavily impacted
by replay data. A replay attack can be thought to be of special interest due to the low
requirements of performing this kind of an attack. Other types of attacks are noted
to require either high-level technical sophistication or special skills in impersonation.
Speech synthesis and voice conversion for the purpose of fooling ASV systems have
been studied more extensively in comparison to the relatively simple replay attack, but
the recent ASVspoof 2017 challenge [2] has refocused the interest of researchers in
Figure 1: A mock scenario where a high-quality recording of the target's speech is obtained. The attacker (right) lures the victim into a (preferably) lengthy conversation, during which they collect speech data for later use during an attack. Disclaimer: the depicted situation is unrelated to the ASVspoof data collection process.
Figure 2: Two off-the-shelf devices are needed to launch a replay attack against a telephone-based ASV system. The recording and playback device (left) is used to play back pre-recorded audio from the target speaker, while the device on the right is used to make the call.
this area, underlining the criticality of the problem.
Classical ASV systems [6, 7] are based on generative models known as Gaussian mix-
ture models (GMMs), and acoustic features, the most common of which are the Mel-
frequency cepstral coefficients (MFCCs) [8]. More recently, so-called Constant-Q cepstral coefficients (CQCCs) [9] have been shown to be successful in ASV replay attack
detection. The current state-of-the-art system in ASV is the factor-analysis-based i-vector system [10], in which the idea of the GMM supervector is utilised. This work
is ultimately motivated by some results obtained from the ASVspoof 2017 challenge.
Several so-called deep learning approaches were presented for ASV replay attack de-
tection [11, 12]. In the former, a so-called light convolutional neural network (LCNN)
[13] was used to directly distinguish replay samples from genuine samples, while in
the latter, the focus was on combining a support vector machine (SVM) [14] classifier
with a CNN feature extractor.
This thesis is structured as follows. Section 2 covers some of the most important ma-
chine learning topics at the heart of ASV. Section 3 offers a contextualisation of the
ASV problem within automatic speaker recognition (ASR) and discusses some of the
popular speech processing, feature extraction and speaker modelling techniques. Sec-
tion 4 describes the system, including the parameters and the overall set-up of the
experiments. Finally, section 5 concludes.
2 Machine learning
This section serves the purpose of an introduction to the key concepts and terminology
behind the methods used in automatic speaker recognition. The reasoning here is to
steer the speaker recognition and verification discussion towards certain important in-
novations that led to the forming of particular methods and to focus on the differences,
strengths and weaknesses of these methods. This section starts with the general ideas
and terminology around machine learning. Subsection 2.4 covers the necessary back-
ground on a specific unsupervised model, known as the Gaussian mixture model, while
subsection 2.5 covers the background on factor analysis. Subsection 2.6 covers deep
learning. These three background subsections each provide the necessary background for three major approaches to speaker recognition and spoofing detection, each of which will be discussed in section 3.
Machine learning refers to a broad class of methods and principles for the purpose of
making future predictions based on observed data and for the purpose of uncovering
underlying structures within data [15]. There are several different types of machine
learning problems [15]. In supervised learning, the task is to learn a mapping function
between the input space and the output space, while in unsupervised learning the learn-
ing task is performed without any mapping (or label) information. Some methods may
combine aspects of both (semi-supervised learning). Finally, there is reinforcement
learning, in which the learning involves reward and punishment mechanisms that guide
the model towards the wanted response determined by the designer. In the context of
this thesis, the two relevant classes of machine learning are supervised and unsupervised learning, both of which we will review in the following two subsections.
Figure 3: Conceptual illustration of supervised (left) and unsupervised (right) learning (based on similar ideas in multiple sources, including [16])
2.1 Supervised learning
Supervised learning tasks can be broadly split into regression and classification tasks.
In the case of classification, given the inputs x_i ∈ X, i ∈ {1, ..., N}, drawn from a training set X, and training outputs y_i ∈ {1, ..., C}, the task is to learn a mapping between the inputs and the outputs [15]. Here, C denotes the number of classes and N the number of training observations. In the case that the training outputs are continuous, the task becomes a regression problem. An individual x_i is also referred to as the features, or as a feature vector, with the dimensionality of the training data often exceeding 1.
The dimensionality of the training data depends on the learning task and availability of
suitable data, ranging from single-digit to hundreds or even thousands. As an example
of low-dimensional training data, consider that we would record the height, weight and
the shoe size of a person. On the other hand, a high-dimensional feature vector could
contain, for example, the raw pixel values of a digital image. The desired outputs are
known as class (or label) information. The number of classes determines whether a
classification problem is a binary classification or multi-class classification problem.
Additionally, if an input may be classified into multiple classes, the task is known
as multi-label classification. As an example of a binary classification problem, con-
sider a situation where the goal is to classify a speech sample into either pre-recorded
(spoofed) speech or live speech, which is also the learning task this thesis focuses on.
An example of multi-class classification could be a situation where the goal is to predict
the labels of digital images.
Figure 4: An imaginary image labelling scenario, where the most probable labellings for the two example images are shown.
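The classification setting above can be made concrete with a small sketch: a nearest-centroid rule that learns one centroid per class from labelled training pairs (x_i, y_i) and assigns a new input to the class of the closest centroid. The data and class structure below are hypothetical toy values, not from any corpus discussed in this thesis.

```python
# Minimal sketch of a supervised binary classifier: a nearest-centroid
# rule over two-dimensional feature vectors (hypothetical toy data).

def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def nearest_centroid_classifier(X, y):
    """Learn one centroid per class; predict by smallest squared distance."""
    classes = sorted(set(y))
    centroids = {c: centroid([x for x, label in zip(X, y) if label == c])
                 for c in classes}

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def predict(x):
        return min(classes, key=lambda c: sq_dist(x, centroids[c]))
    return predict

# Toy training set: class 0 clusters near (0, 0), class 1 near (5, 5).
X_train = [[0.0, 0.2], [0.3, -0.1], [5.1, 4.9], [4.8, 5.2]]
y_train = [0, 0, 1, 1]
predict = nearest_centroid_classifier(X_train, y_train)
print(predict([0.1, 0.0]))  # a point near the class-0 cluster -> 0
print(predict([5.0, 5.0]))  # a point near the class-1 cluster -> 1
```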
Both supervised and unsupervised learning have been widely utilised in speaker recog-
nition, and subsequently, in spoofing detection, and often we will see these used to-
gether in some manner to build the actual classifier that solves the problem of interest.
The general speaker recognition problem implies a multi-class classification task, given
that the goal is to tie an incoming speech sample to a single class, i.e., the most likely
speaker. Spoofing detection, in turn, is a binary classification problem, as the task is to determine whether a given speech sample is pre-recorded speech played back from an artificial audio device, or speech given by an actual human. In this work, spoofing detection
is discussed as an important sub-problem within the wider context of speaker recogni-
tion and, as we will see, the methods for spoofing detection largely rely on the findings
from work done on the general speaker recognition task. A key distinction between
these two problems in terms of the classification task itself is the way in which the
label information is used: in the general task, labels are used to tie samples to individ-
uals, whereas in spoofing detection labels are used to differentiate samples based on
the emitting source (artificial device or human speaker).
2.2 Unsupervised learning
Unsupervised learning differs fundamentally from traditional supervised learning in that the label information may be completely omitted from the process [16]. Instead
of thinking in terms of pairs of training samples and the corresponding target values
or classes, in unsupervised learning we are interested in the underlying structure of
the data (clustering) or the hypothesised statistical process that is thought to be related
to the observed data (density estimation) [17]. Dimensionality reduction is also an
important field utilising the concept of unsupervised learning, where the idea is to
find a lower-dimensionality representation for the data that still retains the important
variance within the data [16]. Examples of unsupervised learning include the classic
clustering techniques such as K-means, factor analysis and mixture models such as the
Gaussian mixture model, which is discussed later.
2.3 Some properties of machine learning
While some aspects of machine learning can be difficult to categorise, certain attributes
can be assigned for each learning problem at hand. In [17], it is noted that each learning
problem may be described in terms of three main attributes: 1) the type of the learning
task, 2) the metric that is used to evaluate the performance of the model, and 3) the
nature of the learning process: supervised or unsupervised. We have already discussed
two of these: the task can be anything from predicting a single real number (regression) to density estimation, where we try to approximate an unknown probability density; and the nature of the learning, or the type of experience [17] used for the learning, which is typically categorised as either supervised or unsupervised.
Finally, we have the metric, which in classification is usually related to the accuracy of
the model predictions, and for real-valued predictions in regression-like tasks may be
some sort of distance metric. This is related to the empirical risk minimisation (ERM)
principle discussed below.
Most supervised learning is based on empirical risk minimisation (ERM) [18, 19], in
which the idea is to minimise the expected loss given our model predictions and the
training data [18]:
E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(f(\mathbf{x}_i, \mathbf{w}), \mathbf{y}_i), \qquad (1)
where L is a loss function, f(xi,w) the model’s prediction for a particular input train-
ing sample, given the model parameters w, and yi the training label associated with
that training sample. Under the principle, we should choose a model for which (1)
is lowest. The loss function, often also referred to as the objective function or cost function, is a mathematical expression of the wrongness we want to associate with a particular training sample and label pair; the form of the loss depends on the learning task.
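As a concrete illustration of (1), the following sketch computes the empirical risk of a simple linear model under a squared-error loss; the model form, the weight values and the data are hypothetical.

```python
# Sketch of the empirical risk in (1): the average loss of a model's
# predictions over the training set, here with a squared-error loss
# and a simple linear model f(x, w) = w * x (hypothetical data).

def squared_loss(prediction, target):
    return (prediction - target) ** 2

def empirical_risk(f, w, X, y, loss):
    """E(w) = (1/N) * sum_i loss(f(x_i, w), y_i)."""
    N = len(X)
    return sum(loss(f(x, w), t) for x, t in zip(X, y)) / N

f = lambda x, w: w * x
X, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # targets follow y = 2x exactly

print(empirical_risk(f, 2.0, X, y, squared_loss))  # perfect fit -> 0.0
print(empirical_risk(f, 1.0, X, y, squared_loss))  # higher risk; ERM prefers w = 2.0
```

Under the ERM principle, among candidate weights we would choose the one with the lowest empirical risk, here w = 2.0.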
The key issue that machine learning, and especially supervised learning, is confronted
with is the issue of generalisation [17]. Specifically, we have a set of training samples that give us a glimpse into some unknown statistical process, and we want to build a model that is able to give reasonable guesses when confronted with completely new data. The measure that gives us an idea of the goodness of our model in this setting is the generalisation error (or test error), which is measured on a disjoint test set. In most machine learning settings, the development of a model involves first using the training data to reduce the training error towards zero and then testing the model on the separate test set. We want both errors to be as small as possible, and the two errors should follow each other in a good model [17]. When the training error itself remains high, the model underfits; when the training error is low but the test error is substantially higher, the model overfits.
2.4 Latent variable models and the Gaussian mixture model
In a latent variable model (LVM), the observed data is assumed to be affected by one or more unobservable, or latent, variables or factors [20, 15]. Normally, in LVMs and the related topic of factor analysis, the goal is to explain the variance of the observed data with an arbitrary number of unobserved variables, which are found by means of factor analysis. The ideas behind LVMs were originally developed in the social sciences for behavioural modelling [20, 21], but have also found widespread use within machine learning. Factor analysis will be discussed in more detail in subsection 2.5.
The Gaussian mixture model (GMM), or mixture of Gaussians [15, 16], consists of
multiple multivariate Gaussian distributions, each with a mean µk and a covariance
matrix Σk. A GMM is formalised as a weighted sum of its mixture components:
p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad (2)
where \mathcal{N} is the probability density function (pdf) of the multivariate Gaussian (Normal) distribution (MVN):

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] \qquad (3)
In the MVN, the mean vector, µ, defines the centre point of the distribution within
the D-dimensional space that it lies, and the covariance matrix, Σ ∈ RD×D, defines
how the probability density behaves around the mean, i.e., how the probability mass
is distributed. |Σ| is the determinant of Σ. The covariance of a multivariate random
variable is given by
\mathrm{cov}[\mathbf{x}] = \mathbb{E}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^{\mathsf{T}} \right], \qquad (4)
where for the Gaussian distribution we have E[x] = µ. The covariance can be either
general, diagonal or isotropic (spherical) [16], of which the first is the least restricted in terms of the number of parameters. The general covariance matrix follows its definition in (4); however, we may restrict it to the diagonal matrix diag(σ_i²), so that we have a matrix containing only the individual variances of each dimension of x. Lastly, in the most restrictive option, we have Σ = σ²I, so that the covariance is simply the identity matrix scaled by a single variance parameter. The differences between these are
illustrated in Figure 5. The restricted covariances are useful due to the requirement of
computing the inverse of the covariance inside the exponent of (3), which is also known
as the precision matrix [15]. In (2), K denotes the number of mixture components and
Figure 5: Three multivariate normal distributions in 3D and 2D plots. The leftmost distribution is one with the general-case covariance matrix, while the central and rightmost distributions have spherical and diagonal covariances, respectively. We see that the general-case covariance may have an arbitrary angle for the direction of the highest variance, while the diagonal one is restricted to follow one of the axes for the shape of its variance. Lastly, the spherical covariance has a clear circular shape. Plots generated with GNU Octave (https://www.gnu.org/software/octave/) mesh and contour plots.
D the dimensionality of the vectors x_i. Further, the π_k are known as the mixing weights or component priors. Two probabilistic constraints, 0 ≤ π_k ≤ 1 and \sum_{k=1}^{K} π_k = 1, must be satisfied. The π_k represent the Bayesian prior knowledge we have regarding the training data. The model encapsulates a 1-of-K coded latent variable z_k, also known as a one-hot vector, where one of the vector values is one and the rest are zeroes. This one-hot vector can be thought of as tying each observation to a mixture component, with the index of the one in the vector pointing to the kth mixture component. The constraint
\sum_k z_k = 1 must hold. For the mixing weights, π_k, we have

p(z_k = 1) = \pi_k. \qquad (5)
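To make the mixture density concrete, the sketch below evaluates (2) at a single point for a hypothetical two-component mixture, using diagonal covariances so that the MVN pdf (3) factorises into one-dimensional Gaussians; all parameter values are illustrative.

```python
import math

# Sketch: evaluating the GMM density (2) at a point. Diagonal covariances
# are assumed, so the MVN pdf (3) factorises into per-dimension 1-D
# Gaussians (illustrative, hand-picked parameter values).

def mvn_diag_pdf(x, mu, var):
    """Multivariate normal pdf with diagonal covariance diag(var)."""
    log_p = -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                       for xi, m, v in zip(x, mu, var))
    return math.exp(log_p)

def gmm_pdf(x, weights, means, variances):
    """p(x | theta) = sum_k pi_k * N(x | mu_k, Sigma_k), as in eq. (2)."""
    return sum(w * mvn_diag_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Two-component mixture in 2-D; the mixing weights satisfy the
# constraints 0 <= pi_k <= 1 and sum_k pi_k = 1.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

p = gmm_pdf([0.0, 0.0], weights, means, variances)
print(p)  # dominated by the first component, centred at the origin
```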
The parameters of a GMM can be trained with an iterative algorithm known as the
expectation-maximisation (EM) algorithm [22, 15, 16]. The algorithm consists of two
steps, computed at each iteration. The E-step (expectation step) comes first in the iteration. Here, inference on the latent variables is done based on the current model parameter values: for a model such as the GMM, this means computing the expected component assignments (the responsibilities) for each observation under the current parameters. The second, maximisation, or M-step, is an optimisation of the parameters given these expected assignments. The purpose of the EM algorithm is to maximise the log-likelihood of the
Figure 6: GMM training process via the EM algorithm: here, starting from a random initial position, the mixture component parameters are shifted towards the clusters. In this toy example, we conveniently have three data clusters and three mixture components.
observed data, x_i, and the missing or hidden data, z_i, given the parameters θ. The likelihood is a function that indicates how reasonable our model is given the data we observe (our training data), where a higher likelihood is in favour of our model, and vice versa. It is given by
\ell(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \log \left[ \sum_{\mathbf{z}_i} p(\mathbf{x}_i, \mathbf{z}_i \mid \boldsymbol{\theta}) \right] \qquad (6)
As noted in [15], due to the logarithm in front of the sum, the complete data log-
likelihood is used:
\ell_c(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i, \mathbf{z}_i \mid \boldsymbol{\theta}) \qquad (7)
The complete data is the set of random variables {X, Z}, where X is the observed data and Z the unobserved data (x_i ∈ X and z_i ∈ Z). While Z is not directly observable, it
is assumed that there is exactly one Gaussian component in the mixture that generated a
particular observed sample in X . This data-generating process is encoded in Z. As the
zi are latent and therefore unknown, this log-likelihood cannot be directly evaluated.
The EM algorithm solves this problem such that, instead of evaluating (7), we evaluate
its expectation [15]:
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{t-1}) \triangleq \mathbb{E}\left[ \ell_c(\boldsymbol{\theta}) \mid \mathcal{X}, \boldsymbol{\theta}^{t-1} \right] = \mathbb{E}\left[ \sum_i \log p(\mathbf{x}_i, \mathbf{z}_i \mid \boldsymbol{\theta}) \right] = \sum_i \sum_k r_{ik} \log \pi_k + \sum_i \sum_k r_{ik} \log p(\mathbf{x}_i \mid \boldsymbol{\theta}_k) \qquad (8)
Note the introduction of the variable zi for the assumed unobserved data. Instead of
evaluating a likelihood, in the EM algorithm we evaluate the expectation of the joint
distribution of the observed data (the xis) and the unobserved data, represented by the
zis. We see that the sufficient statistics, means and covariances, appear only in the
right-hand sum. This is utilised in the steps of the algorithm. Here, t refers to the
current iteration of the algorithm and θt−1 refers to the parameters of the previous
iteration. Q is a new function known as the auxiliary function. The parameters θt in
the M step are optimised via the following:
\boldsymbol{\theta}^{t} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} \; Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{t-1}), \qquad (9)
that is, we want to find the model parameters that maximise the expected log-likelihood
in (8). In the EM-algorithm, maximum likelihood estimation (MLE) [23] is used to find
the parameters. The basic principle of MLE is that, given the log-likelihood function and the data, we obtain an estimate (the maximum likelihood estimate) for each parameter of the model. The estimator for a given parameter is obtained by setting the derivative of the log-likelihood function with respect to that parameter to zero and solving for the parameter. This estimator can then be used to compute the estimate for the parameter of interest.
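The MLE principle can be checked numerically: for a one-dimensional Gaussian with known variance, the zero-derivative condition on the log-likelihood yields the sample mean as the estimate of µ, and the log-likelihood at that value is no smaller than at nearby values. The data below are hypothetical.

```python
import math

# Sketch of maximum likelihood estimation: for a 1-D Gaussian with known
# variance, setting the derivative of the log-likelihood with respect to
# mu to zero gives the sample mean as the MLE. The check below compares
# the log-likelihood at the MLE against nearby values (hypothetical data).

def gaussian_log_likelihood(data, mu, var=1.0):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

data = [1.2, 0.8, 1.5, 0.9, 1.1]
mu_mle = sum(data) / len(data)  # closed form from the zero-derivative condition

ll_at_mle = gaussian_log_likelihood(data, mu_mle)
# The sample mean maximises the log-likelihood over mu.
assert all(gaussian_log_likelihood(data, mu_mle + d) <= ll_at_mle
           for d in (-0.5, -0.1, 0.1, 0.5))
print(round(mu_mle, 3))  # 1.1
```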
Now, the E-step in the EM iteration, based on (8), is done via [15]:
r_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\theta}_k^{t-1})}{\sum_{k'=1}^{K} \pi_{k'} \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\theta}_{k'}^{t-1})}. \qquad (10)
That is, for each observed data point i, the responsibility that the kth component takes
for this data point is obtained with the Bayes rule. The responsibility computation
can be seen as a soft clustering of the data points, in contrast to hard clustering in
the K-means algorithm [16]. Thus, if we wanted to use the EM-GMM framework for
clustering purposes, we could select the data point that has the largest responsibility
value given each cluster to obtain cluster memberships. We will discuss this similarity
briefly later. The M-step is done in multiple phases. First, the mixing weight for the
kth component is set to be the proportion of the observed data that has been assigned
to this component [15]:
\pi_k = \frac{1}{N} \sum_i r_{ik} = \frac{r_k}{N} \qquad (11)
The new mean vector for the kth component is obtained by taking the mean over all
data points weighted by the responsibility of the kth component for each data point
[15]:
\boldsymbol{\mu}_k = \frac{\sum_i r_{ik} \mathbf{x}_i}{r_k} \qquad (12)
The updated covariance matrix is obtained by following the MLE-reasoning for
the covariance parameter, in which we set the derivative of the log-likelihood,
l(xi|πk,µk,Σk), with respect to the parameter of interest (here Σk) to zero and solve
for that parameter. This results in the form [15]:
\boldsymbol{\Sigma}_k = \frac{\sum_i r_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{\mathsf{T}}}{r_k} = \frac{\sum_i r_{ik} \mathbf{x}_i \mathbf{x}_i^{\mathsf{T}}}{r_k} - \boldsymbol{\mu}_k \boldsymbol{\mu}_k^{\mathsf{T}} \qquad (13)
Assuming that the convergence threshold is set appropriately, the algorithm is guar-
anteed to end up in a local maximum in terms of the log-likelihood [15, 16]. For
the initialisation of the model in terms of the means and covariances that describe the
mixture components, it is possible to use the K-means algorithm [24, 15, 16]. Other
approaches include random initialisation and farthest point clustering [15].
Algorithm 1 The naive EM algorithm for GMMs
1: Initialise each µ_k, Σ_k, π_k, the convergence threshold t and the update variable u, and evaluate the initial log-likelihood.
2: while u ≥ t do
3:     Evaluate the r_ik.                                          ▷ E-step
4:     Using the new r_ik, update π_k (11), µ_k (12) and Σ_k (13). ▷ M-step
5:     Re-evaluate the log-likelihood for the updated model and set the increase in log-likelihood to be u.
6: end while
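The EM procedure above can be sketched for a one-dimensional GMM as follows: the E-step implements the responsibilities of (10) and the M-step the updates (11)-(13). This is an illustrative toy implementation on hypothetical data, with a fixed iteration count instead of a convergence threshold and a simple deterministic initialisation, not a production one.

```python
import math
import random

def normal_pdf(x, mu, var):
    """1-D Gaussian probability density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, K=2, iters=50):
    # Initialisation: equal weights, means spread across the data range,
    # unit variances (a simplifying assumption for the sketch).
    pis = [1.0 / K] * K
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * k / (K - 1) for k in range(K)]
    variances = [1.0] * K
    for _ in range(iters):
        # E-step: responsibilities via Bayes' rule, eq. (10).
        resp = []
        for x in data:
            num = [p * normal_pdf(x, m, v)
                   for p, m, v in zip(pis, mus, variances)]
            Z = sum(num)
            resp.append([n / Z for n in num])
        # M-step: parameter updates (11)-(13).
        for k in range(K):
            rk = sum(r[k] for r in resp)
            pis[k] = rk / len(data)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / rk
            variances[k] = sum(r[k] * (x - mus[k]) ** 2
                               for r, x in zip(resp, data)) / rk
    return pis, mus, variances

# Two well-separated 1-D clusters around 0 and 10 (synthetic toy data).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(100)] +
        [random.gauss(10.0, 1.0) for _ in range(100)])
pis, mus, variances = em_gmm_1d(data)
print(sorted(round(m) for m in mus))  # component means recovered near 0 and 10
```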
2.4.1 Discussion
The EM-GMM scheme described earlier can be seen as a more robust way of doing
clustering that is conceptually close to the K-means algorithm (in fact, [16] notes that
K-means is a special case of GMM-EM). In the case of GMMs, we get a probabilistic
alignment (soft alignment) of each data point belonging to each mixture component,
whereas in the case of K-means, each data point either belongs to a cluster or does not
(hard alignment). These are often called soft clustering and hard clustering, respec-
tively.
In K-means clustering [25, 26], we first select k points at random and then iterate two steps: in the first, we assign data points to their nearest cluster centre according to some distance metric (usually the Euclidean, or L2, metric), and in the second step we set each cluster centre to be the mean of all the points assigned to it. The
first step can be formalised as follows [16]:
r_{ik} = \begin{cases} 1 & \text{if } k = \operatorname*{arg\,min}_j \lVert \mathbf{x}_i - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise,} \end{cases} \qquad (14)
where we get a one-hot encoding for the cluster assignments, anagolously to the re-
sponsibilities in EM-GMM. The updated cluster means are obtained such that [16]
µk =
∑i rikxi∑i rik
. (15)
Both (14) and (15) result from the objective or loss function [16]:

J = \arg\min_{r_{ik},\, \mu_k} \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \| x_i - \mu_k \|^2,    (16)
with (15) being the MLE for µk. Now, suppose that in the EM-GMM computation
we set each Gaussian component's covariance matrix to be \sigma^2 I_D, use a constant
\pi_k = 1/K, and use an indicator function to set the responsibilities in the E-step such that for a
given data point, the component responsibility ends up either being one for the most
likely component, or zero otherwise [15]. The result is the same clustering that is
obtained from K-means as only the means of the components are relevant in the E-step
due to the way we defined the covariances. EM-GMM can be thus seen as a more
robust way of modelling in comparison to K-means due to the possibility of modelling
the individual component variances. Another important difference arising from the
probabilistic responsibility assignments in EM-GMM is that it captures the uncertainty
of each component responsibility, obtained via 1 − maxk(rik) [15], i.e., we can look
at the relative magnitude of the highest assigned component responsibility to obtain a
measure of the uncertainty related to the component (or cluster) assignment.
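The hard-assignment scheme of (14) and (15) can be sketched directly in code (the function name and the optional `mu0` initialisation argument are our additions):

```python
import numpy as np

def kmeans(X, K, n_iter=20, mu0=None, seed=0):
    """Hard-assignment clustering via the steps in (14) and (15) (sketch)."""
    rng = np.random.default_rng(seed)
    if mu0 is not None:
        mu = np.array(mu0, dtype=float)
    else:
        mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step (14): each point goes to its nearest centre (L2 metric).
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step (15): each centre becomes the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return labels, mu
```

Note that the assignments are hard: each row of the distance matrix yields exactly one winning cluster, in contrast to the soft responsibilities of EM-GMM.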
2.5 Factor analysis
Factor analysis is a field of statistics where the goal is to understand the effect of un-
observed (latent) variables on the observed data and to estimate these latent variables
or factors. The origins of the field are in behavioural sciences, specifically in the study
of human intelligence [21]. Factor analysis is a popular statistical tool in fields where
there are hard-to-define and obtain quantities, such as marketing and economics, but
has also seen wider adoption in terms of its core ideas due to the relation to the dimen-
sionality reduction technique known as principal component analysis. Factor analysis
models are discussed in more detail in Section 3.3.2.
2.6 Deep learning
Many classical machine learning techniques require the use of often complex prepro-
cessing of data in the form of feature extraction, before any classification or regres-
sion task becomes feasible. Deep learning is a branch of machine learning, where the
aim is to move away from complex feature extraction processes into what is known
as representation learning (with multiple layers of representations), where the learn-
ing architecture can be thought of as performing feature extraction on the data [17].
These architectures usually combine together arbitrarily many non-linear components
in a layered structure, each layer thought as corresponding to a different level of ab-
straction [17]. We note, however, that this kind of description can be seen as rather
idealised, considering the difficulty of actually interpreting the learned parameters of a
deep neural network. Ideally, this would reduce the need for in-depth domain
knowledge when adopting a deep architecture for some learning problem.
Deep learning, however, presents an entirely new set of challenges, from interpreta-
tion of the models to computational feasibility due to high parameter count. Finally,
classical machine learning techniques are still utilised in many deep learning systems.
The fundamental ideas behind deep learning can be traced back to the original ideas
of neural networks [27, 28, 29, 30] and the computational modelling of neurons. The
most widely known deep learning model is the classic multi-layer perceptron, which
utilises the simple model neuron, perceptron. While the ideas for training such models
were studied in the 1970s and 1980s, they did not gain wide interest in the machine
learning and pattern recognition communities until relatively recently, because in the
1990s it was widely thought that the training of such models would be difficult in
practice. For this thesis, two highly related deep learning models are presented – the
classical perceptron-based model (discussed next), and its training process, as well as
the so-called convolutional neural network.
2.6.1 The classic multi-layer perceptron
The multi-layer perceptron [27, 16] (deep feedforward network, feedforward neural
network), often abbreviated MLP, can be described in terms of non-linear function
approximation, where arbitrarily many layers of non-linear vector-valued functions,
each with modifiable (to be trained) parameters, are combined to a composite. A well-
known property of the MLP-networks is related to the universal approximation theo-
rem [31, 32, 17], which states that these kind of models can be used to model arbitrary
processes, given enough parameters. Of note is however the fact that this idea does not
consider the optimisation of such models [17], which is still an active area of research.
The first layer of a MLP can be written as [16]:

a_j = \sigma\left( \sum_{i=1}^{D} w^{(1)}_{ij} x_i + b^{(1)}_{j0} \right),    (17)
where aj is the jth activation of this layer. D denotes the dimensionality of the in-
put layer, while wij is the weight term for the ith input layer component connected
to the jth hidden layer component. bj0 is the jth bias term for the first (and only)
layer of this model. Finally, σ denotes the activation function. The neural network
nomenclature for this kind of model becomes apparent when the weights are consid-
ered to be edges connecting the previous layer to the second and each jth component
a neuron. Moreover, the weighted sum with the bias term and the activation is the
perceptron, with the important distinction that in the MLP we use a differentiable non-
linearity: z = \sigma\left( \sum_{i=1}^{D} w_i x_i + b \right). Thus, the basic component of this kind of model
is a linear combination, with adjustable parameters, followed by a non-linearity. (17)
can be extended to include multiple layers as follows [16]:

y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w^{(2)}_{kj}\, h\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + b^{(1)}_{j0} \right) + b^{(2)}_{k0} \right),    (18)
where M denotes the number of components on the added second layer and h is the
activation function used for the first hidden layer. The bias terms can be merged into
the input vectors such that an input x becomes x = (1, x_1, ..., x_d)^T [16]; this
simplification is used in the following discussion. We
could equivalently and more simply write our model in the matrix form:
G(x) = σ (h (xW1 + b1) W2 + b2) , (19)
where W1 is a weight-matrix containing weights for the first layer and b1 the bias
vector for that layer. The activation functions are not necessarily chosen to be the
same, as seen here.

Figure 7: A high level view of a MLP-network. Here we see the input layer, with
dimensionality D, an arbitrary number of hidden layers, and finally the output layer,
with size T.

Traditionally, the logistic sigmoid function has been presented as a
suitable activation function for the hidden layers:
\sigma(x) = \frac{1}{1 + \exp(-x)}.    (20)
However, the sigmoid function can lead to a problem known as the vanishing/exploding
gradient [33, 34] during the training process, where the error signal used for the parameter
update tends to either become too small for sensible updates or explode entirely.
This issue was noted to be especially significant for deeper networks, such as so-called
recurrent neural networks (RNNs). The hyperbolic tangent, tanh, is another com-
monly used activation function. Generally, the requirements for the hidden layer ac-
tivation functions are such that the function must be differentiable and monotonically
increasing [16]. In the context of modern deep learning, a widely accepted replacement
for the activation function is the so-called rectified linear unit (ReLU) [35, 17] and its
variants:
a(x) = max(0, x). (21)
ReLU is a piecewise function with two sections. This property, coupled with the non-
linearity of the function, makes it a desirable activation function from the perspective
of parameter optimisation [17]. In classification networks, the activation function at
the output layer is typically the softmax function [16]:
y_i(x) = \frac{e^{a_i}}{\sum_j e^{a_j}},    (22)
where ai is the ith activation at the output layer, and the sum is taken over all the activa-
tions at the output layer. The softmax activation constrains the output layer predictions
into probabilities, such that 0 ≤ y_i ≤ 1 and ∑_i y_i = 1, which is what we want to have
in a classification network.
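A forward pass through a two-layer network of the form (19), with ReLU hidden units (21) and a softmax output (22), can be sketched as follows (the function names are illustrative; subtracting the row maximum inside the softmax is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def relu(x):
    """ReLU activation, equation (21)."""
    return np.maximum(0.0, x)

def softmax(a):
    """Softmax activation, equation (22), stabilised by subtracting the max."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer forward pass in the matrix form of (19)."""
    h = relu(x @ W1 + b1)          # hidden-layer activations
    return softmax(h @ W2 + b2)    # output-layer class probabilities

# Toy dimensions: 4 inputs, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
y = mlp_forward(rng.normal(size=(2, 4)), W1, b1, W2, b2)
```

Each row of `y` sums to one, as required of the softmax outputs.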
The parameters of an MLP network can be trained with various strategies utilising
gradient descent and back-propagation. In MLP training using these two techniques,
the parameters are adjusted in small steps towards the negative gradient [36, 37, 30].
Back-propagation is used to evaluate the errors for each network parameter, which
are then used to adjust the parameters accordingly. The gradient descent parameter
optimisation can be formalised as follows [16]:
w(τ+1) = w(τ) − η∇E(w(τ)), (23)
Figure 8: The rectified linear unit (ReLU) activation function
where w(τ) is the flattened parameter vector of the current time-step, where each layer’s
weights have been concatenated into a single vector of parameters. η is known as the
learning rate and \nabla E is the vector of partial derivatives (a gradient) of the error (or ob-
jective or loss) function E with respect to all the network parameters, calculated with
back-propagation. The precise formulation of the error function depends on the acti-
vation function used at the output layer of the network as well as on the classification
task. For regression, mean-squared-error (MSE) can be used [16]:
E(w) = \frac{1}{2} \sum_{i=1}^{N} \| y_i - t_i \|^2,    (24)
where w is the vector of weight and bias parameters, yi the predicted output and ti
the corresponding label or target vector. The error or cost functions presented here for
neural networks are based on the principle of maximum likelihood [17]. In the case that
our labelling information is in the form of one-hot-coded vectors, and the task is to
perform binary or multi-class classification, binary or categorical cross-entropy can be
used. Binary cross-entropy can be written as follows [16]:
E(w) = -\sum_{i=1}^{N} \left( t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \right).    (25)
Categorical cross-entropy, as presented in [16], can be written as
E(w) = -\sum_{i=1}^{N} \sum_{c=1}^{M} t_{i,c} \log(y_{i,c}),    (26)
where M denotes the number of predicted classes, t_{i,c} the target label and y_{i,c} the predicted
probability of the ith sample belonging to the cth class. The cross-entropy losses are
motivated by information theory, and cause the network to bring the model predictions
closer to the training data, which represents the underlying true distribution we want to
generalise for.
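The losses (25) and (26) translate directly into code; as a sketch (the small `eps` guard against log(0) is our addition):

```python
import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    """Binary cross-entropy of (25); eps guards against log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def categorical_cross_entropy(y, t, eps=1e-12):
    """Categorical cross-entropy of (26); rows of t are one-hot targets and
    rows of y are predicted class probabilities."""
    return -np.sum(t * np.log(np.clip(y, eps, None)))
```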
In the back-propagation technique, the partial derivative of the error function for the
nth input vector, En(θ), with respect to a weight parameter wji, is obtained by utilising
the chain rule for partial derivatives [16]:
\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}},    (27)
where wji is the ith weight parameter connected to the jth activation, aj , of some layer
of our network. Here a_j is the weighted sum ∑_i w_{ji} z_i, where z_i is the ith output coming
from a previous layer of the network to this component. Let
\delta_j \equiv \frac{\partial E_n}{\partial a_j}.    (28)
The δjs are known as the errors [16]. In (28) the error for each network activation
is defined as the gradient of the objective function for the current (nth) sample with
respect to the jth activation at some layer of the network. The errors at the output layer
of the network are simply the differences between the activations at the output layer
and the values of the target vector, or δk = yk − tk. It is now possible to write the
partial derivative of the jth activation with respect to the ith weight connected to that
activation as follows:

\frac{\partial a_j}{\partial w_{ji}} = z_i.    (29)
Following the derivation in [16] and substituting the previous two into (27), we get
\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.    (30)
The errors for a unit of a hidden layer of the network are obtained as follows [16]:
\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}.    (31)
The sum is evaluated over all units k to which the jth hidden unit sends connections.

Figure 9: Components of MLP error calculation (based on the illustration in [16]).
The green arrow signifies the direction of activation calculation, often referred to as
the "forward pass". The red arrow shows the direction of the error back-propagation
computation, starting from the output layer.

Finally, we can follow [16], and substitute (28) into (31) and utilise the definition
of the activation, a_j = ∑_i w_{ji} z_i and z_j = h(a_j), to obtain the general form of the
back-propagation for a unit on a hidden layer of the network [16]:

\delta_j = h'(a_j) \sum_k w_{kj} \delta_k,    (32)
where h′ is the derivative of the hidden activation function of the layer this unit resides
at. From (32) we see that, after a forward pass through the network, one may iterate
backwards through the network using the immediately previously computed errors at
some layer to compute the errors for the next layer of the backwards pass through
the network. Importantly, there is no complex dependency between the errors of units
within the same layer of the network, as only the previous layer error values are needed
to be taken into account. Of note is also the fact that the computations in a MLP-network
can be represented as a computational graph [17], as there are no cycles in the
forward pass nor in the back-propagation step; most modern software tools internally
represent networks as such graphs.
The algorithm for naive gradient descent and back-propagation over the whole data is
described below. While our basic algorithm utilises the entire dataset before an update,
the batch size does not necessarily need to be set in this manner. One strategy is, for
example, to select some batch size smaller than N and either partition the data or sample
Algorithm 2 Gradient descent and back-propagation
1: while i < N do
2:     Forward-propagate an input vector xi through the network, calculating the
       activations of each hidden and output unit of the network.
3:     Evaluate the output-layer errors using δk = yk − tk.          ▷ forward pass
4:     Using the output-layer errors δk, back-propagate through the network by
       utilising (31) and (30), summing and maintaining the errors for each
       parameter.                                                    ▷ back-propagation
5:     Adjust the network parameters using (23).                     ▷ gradient descent
6: end while
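Algorithm 2 can be sketched for a one-hidden-layer softmax classifier as follows (illustrative: the function name, toy dimensions and the ReLU hidden layer are our choices, and the whole batch is used per update as in the naive algorithm):

```python
import numpy as np

def train_step(x, t, W1, b1, W2, b2, eta=0.05):
    """One gradient-descent update: forward pass, then the back-propagation
    of (30)-(32), then the parameter update of (23). Returns the loss."""
    # Forward pass.
    a1 = x @ W1 + b1
    z1 = np.maximum(0.0, a1)                  # ReLU hidden layer
    a2 = z1 @ W2 + b2
    e = np.exp(a2 - a2.max(axis=1, keepdims=True))
    y = e / e.sum(axis=1, keepdims=True)      # softmax outputs
    # Output-layer errors: delta_k = y_k - t_k.
    d2 = y - t
    # Hidden-layer errors via (32): h'(a_j) * sum_k w_kj delta_k.
    d1 = (d2 @ W2.T) * (a1 > 0)
    # Updates per (23), with dE/dw_ji = delta_j z_i as in (30).
    W2 -= eta * z1.T @ d2;  b2 -= eta * d2.sum(axis=0)
    W1 -= eta * x.T @ d1;   b1 -= eta * d1.sum(axis=0)
    return -np.sum(t * np.log(y + 1e-12))     # cross-entropy loss (26)
```

Repeated calls on the same data should drive the cross-entropy loss down, matching the intent of Algorithm 2.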
randomly from it (mini-batch gradient descent). The other extreme is an on-line [16]
version of the algorithm which calculates the gradient in terms of a single input vector,
and adjusts the network parameters each time, instead of going through the entire data
before adjustments (sequential/stochastic gradient descent). The mini-batch approach
is noted to be a popular compromise between using all of the data and a single data
point during error computation [17]. It was shown in [38] that using a single sample
is desirable in terms of generalisation performance, but, as noted in [17], this is often
problematic from the optimisation perspective. Additionally, the choice of the
mini-batch size may be affected by hardware and parallelisation considerations [17].
2.6.2 Implications of (32) and discussion
From (32), we can see that the iterative computation of the errors in these kinds of
networks can be achieved merely by knowing the derivative of the chosen activation
function(s). However, as noted in [16], due to the multiple nonlinearities found within
neural networks (the activation functions), the overall error function is non-convex.
This implies that only local optima of the error function can be found by utilising the
gradient descent parameter update in (23). For the generalisation performance of the
model, i.e., when the model is tested on unknown new data beyond the training data,
the global optimum may not be desirable [16].
It is noted in [39] that stochastic gradient descent (SGD) has been widely popular
as an optimisation technique for neural networks, even as the finer details of the
techniques used in practice have seen refinement in recent years. These include
the ideas of momentum [40], Nesterov momentum or Nesterov accelerated gradient
(NAG) [41], RMSprop [42], ADAM [43], Adamax [43], Adagrad [44], and Adadelta
[45], to name a few. For now, let us assume that we are using the mini-batch variant
of back-propagation and gradient descent. The update rule [39] using momentum
becomes

v_t = \gamma v_{t-1} + \eta \nabla E(w)
w = w - v_t,    (33)
where γ is an adjustable hyperparameter that controls the amount of the previous up-
date values used in computing the new update. Here we omitted the time-step index for
the parameters w for simplicity and conciseness. The intuition behind momentum is
that it helps the parameter optimisation process in getting over valleys and hills in the
gradient "landscape" by taking into account the history of past update values during
the update. Many of the variants of SGD discussed here include something similar in
nature.
In the Nesterov variant (34), the previous momentum values are used together with the
current parameters to obtain an estimate of the future parameter values such that the
historical direction of the parameter updates is taken into account during the gradient
computation itself.
v_t = \gamma v_{t-1} + \eta \nabla E(w - \gamma v_{t-1})
w = w - v_t.    (34)
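The updates (33) and (34) differ only in where the gradient is evaluated, as the following sketch shows (function names are ours; `grad_fn` stands in for ∇E):

```python
def momentum_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    """Classical momentum update, equation (33)."""
    v = gamma * v + eta * grad_fn(w)
    return w - v, v

def nesterov_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    """Nesterov update, equation (34): the gradient is evaluated at the
    look-ahead point w - gamma * v instead of at w."""
    v = gamma * v + eta * grad_fn(w - gamma * v)
    return w - v, v
```

On a simple quadratic objective, both variants drive the parameter towards the minimum; the Nesterov look-ahead mainly changes how quickly the velocity corrects itself near the optimum.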
Practical challenges in getting standard SGD to converge to desirable results have
further led to the development of so-called adaptive versions of SGD, where usually
each parameter is updated individually based on some statistic of said parameter. The
update rule for the adaptive method Adagrad is the following [39]:

w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i},    (35)
where w_{t,i} corresponds to parameter i at time-step t and g_{t,i} to the partial derivative of
the error with respect to that parameter. G_t ∈ R^{d×d} is a diagonal matrix of the sums of squares of the past gradients.
Finally, ε is added to the denominator for numerical stability. The intuition of (35)
is that we scale the learning rate parameter η according to the size of the past errors,
which allows for the optimisation to adapt to the size of the errors on a per-parameter
basis. One motivation for this scaling is suggested in [39] to be the problem of sparse
data, e.g., high-dimensional data where the variability is concentrated on particular
dimensions, which is common in many real-world situations. While adaptive meth-
ods have been found to be successful empirically in certain tasks such as in training
Generative Adversarial Networks (GANs) [46], in [47] it was suggested that adaptive
methods may not be well-justified as drop-in replacements for standard SGD as these
were shown to give widely differing results from standard SGD.
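The Adagrad update (35) can be sketched by keeping only the diagonal of G_t as a vector of accumulated squared gradients (the function name and defaults are ours):

```python
import numpy as np

def adagrad_step(w, G, grad_fn, eta=1.0, eps=1e-8):
    """Adagrad update of (35); G holds the per-parameter sums of squared
    past gradients, i.e. the diagonal of G_t."""
    g = grad_fn(w)
    G = G + g * g                          # accumulate squared gradients
    w = w - eta * g / (np.sqrt(G) + eps)   # per-parameter scaled step
    return w, G
```

Because each step is divided by the root of the accumulated squares, parameters with a history of large gradients take smaller steps, and vice versa.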
Beyond the choice of the update rule during optimisation, a strategy for the initial
values for the network parameters must be decided upon. Common approaches include
drawing from normal and uniform distributions, with some heuristics depending on
the choice of non-linearity in the model. One of these heuristical random initialisation
methods was presented in [48], where the initial parameter weights for a particular
layer are drawn such that
W ∼ U(−√
6√nj + nj+1
,
√6
√nj + nj+1
), (36)
where nj is the number of incoming connections ("fan in") to this layer and nj+1 the
number of outgoing connections ("fan out"). The normal variant of this method draws
the parameter values such that
W ∼ N(0, σ), (37)
where

\sigma = \sqrt{\frac{2}{n_j + n_{j+1}}}.    (38)
The above initialisation is known as Xavier or Glorot initialisation according to one of
the authors and has been found to be effective with deep networks.
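Both variants of the initialisation are one-liners; as a sketch (function names are ours):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Uniform Glorot/Xavier initialisation per (36)."""
    limit = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    """Normal variant per (37)-(38)."""
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))
```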
MLP-networks are particularly prone to a common problem in machine learning
known as overfitting, where the model’s neurons memorise some parts of the train-
ing set. The result is degraded inference performance when the model is introduced
to previously unseen data. Strategies for guarding against overfitting vary, and include
parameter weight decay [17], data augmentation [17], noise injection [17], and dropout
[49, 17]. In weight decay strategies, the model parameters are constrained via a penalty
on their norm (L1 or L2), added as a regularisation term to the error function. As
overfitting may often arise from insufficient data, data augmentation may be sometimes
used to generate new training samples by applying suitable transformations on the real
samples. In noise injection, we add random noise to the input data for the purpose of
making the model more robust to small disturbances in the input. Dropout is a method
for applying noise to the weights of the network, where we randomly "drop" connec-
tions in the network, i.e., we set some of the activations at a particular layer to zero at
random. Finally, we note that controlling for the capacity of the network, in terms of
the number of layers as well as the size of the layers themselves, is an important part
of regularisation.
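Dropout at a layer can be sketched in its so-called inverted form, where the surviving activations are scaled up during training so that no rescaling is needed at inference time (the inverted formulation and the function name are our choices, not taken from the text):

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: zero each activation with probability p_drop and
    rescale the survivors so the expected value is unchanged (sketch)."""
    if not train or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)
```

At inference time (`train=False`) the activations pass through unchanged.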
2.6.3 Convolutional neural networks
Convolutional neural networks (CNNs) [50] constitute a class of neural architectures
designed to allow for translational invariance [17] for structures in the input data.
Especially in the visual domain, such invariance is critically important, since objects
may change their location within an image but still need to be recognised by
the model. A CNN is a neural network that utilises an operation known as convolu-
tion in its discrete form at some point of the architecture [17]. The discrete convolution
operation is defined as follows:
s[t] = [x * w][t] = \sum_{a=-\infty}^{\infty} x[a] w[t - a].    (39)
For a 2D array-like input, such as a grayscale (single layer) digital image, convolution
becomes [17]
S(i, j) = (K * D)(i, j) = \sum_m \sum_n D(i - m, j - n) K(m, n),    (40)
where D is the input matrix and K a two-dimensional kernel that is convolved with
the input. Taken through an activation function, such as a ReLU, the output of this
operation is often called a feature map. [17] notes that, in practice, cross-correlation is
often utilised:
S(i, j) = (K * D)(i, j) = \sum_m \sum_n D(i + m, j + n) K(m, n).    (41)
Typically, convolutional neural architectures dealing with digital images have to account
for the three channels of an RGB image; however, for our purpose of adapting such
a model for speech data in the form of a spectrogram, the above form of convolution
applies.

Figure 10: Convolution with a 2x2 kernel (Adapted from [17])

Figure 10 is a simplification of the convolution operation.

Figure 11: Zero-padded input

In practice, two design
parameters are used for convolution in CNNs: stride and padding [17]. When the
kernel is convolved with the input, the kernel has its centre at the current input value.
In order to make the operation feasible at the borders, with varying input sizes, zeroes
are added to the input (Figure 11). The stride-parameter determines how many input
values the kernel centre moves at each step of the computation in each direction (width
and height) (Figure 12). The dimensions of the output are given by the following [51]:

\text{output width} = \frac{W - F_w + 2P}{S_w} + 1,    (42)

and

\text{output height} = \frac{H - F_h + 2P}{S_h} + 1,    (43)

where W and H are the width and height of the input, F_w and F_h the width and height
of the kernel respectively, P the amount of zero-padding used, and S_w and S_h the
horizontal and vertical strides.
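The operation of (41), together with the output dimensions of (42) and (43), can be sketched with explicit loops (illustrative; a single channel, a shared stride for both directions, and equal zero-padding on all sides are assumed):

```python
import numpy as np

def conv2d(D, K, stride=1, pad=0):
    """Cross-correlation of (41) over a zero-padded single-channel input.
    Output dimensions follow (42)-(43)."""
    Dp = np.pad(D, pad)                       # zero-padding on all sides
    Fh, Fw = K.shape
    H = (D.shape[0] - Fh + 2 * pad) // stride + 1
    W = (D.shape[1] - Fw + 2 * pad) // stride + 1
    S = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Dot product between the kernel and the current input patch.
            patch = Dp[i * stride:i * stride + Fh, j * stride:j * stride + Fw]
            S[i, j] = np.sum(patch * K)
    return S
```

Practical implementations vectorise these loops, but the indexing above mirrors the definitions directly.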
Figure 12: Height and width stride of 2 with 3x3 kernel and zero-padding
CNNs contain three desirable properties – sparse interactions (or sparse connectivity
or sparse weights), parameter sharing and equivariant representations [17], the last
of which we already hinted at earlier. In a traditional neural network, assuming that
no dropout (where some of the connections are dropped) is utilised, each hidden node
is connected to every node in the previous and next layer. In a CNN, because the
convolution kernel used at each network layer is smaller in size than the input, the
parameter count is greatly reduced. Parameter sharing is achieved, similarly, due to
the use of the kernel. Generally speaking, parameter sharing implies the use of the
same parameter for multiple functions within the model. The intuition in the case of
the CNN is that the kernel of each network level ties multiple positions of each input to
its values, while in a MLP-network, each position in the input is only tied to a specific
network node. Finally, we have the equivariance property. Function f is said to be
equivariant with respect to the function g if the following holds:
f(g(x)) = g(f(x)). (44)
The convolutional layer of a CNN has equivariance with respect to translation, where,
as the input data is shifted wholly into a certain direction, the result of the convolution
operation changes in a predictable manner.
In addition to the convolutional layers, CNNs normally contain another distinct compo-
nent – the pooling function or layer. A pooling layer can be thought of as performing
downsampling on the input, approximately representing it with a smaller number of
values. A common type of a pooling layer is the max-pool layer, which has some
similarity to our convolution operation. Unlike in the normal convolution operation,
where essentially a dot-product is computed between the kernel and a section of the
input, in max-pooling, a local maximum is simply taken around a section of the input
determined by the kernel and the stride-parameter (Figure 13). Other pooling functions
have been studied, such as taking the average, the L2-norm, or average distance from
the kernel centre position [17].

Figure 13: Max-pool with 2x2 kernel and stride 1 (Adapted from [51])

While an architectural description of a CNN may separate these into distinct network
layers, a convolutional layer of a CNN can be thought of
as consisting of three distinguishable stages [17], each of which is illustrated in
Figure 14 below. The final layers of a CNN are typically so-called fully connected (FC)
layers, which are exactly like the traditional neural network layers discussed in Section
2.6.1. A complete CNN architecture is illustrated in Figure 14 below. Both CNNs and
MLPs (both types are often mixed in modern networks) are commonly optimised with
some variant of SGD discussed earlier.
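Max-pooling can be sketched analogously to the convolution loop, taking a local maximum instead of a dot product (the function name and defaults are ours):

```python
import numpy as np

def max_pool(D, size=2, stride=2):
    """Max-pooling: the local maximum over size x size windows (sketch)."""
    H = (D.shape[0] - size) // stride + 1
    W = (D.shape[1] - size) // stride + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = D[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out
```

With the default 2x2 window and stride 2, each spatial dimension of the input is halved while the number of channels would be left untouched.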
In recent years, a plethora of different higher-level frameworks for developing and
training such models have risen, including Tensorflow (tensorflow.org), Keras (keras.io),
PyTorch (pytorch.org), Caffe
(caffe.berkeleyvision.org), and Microsoft Cognitive Toolkit (microsoft.com/en-us/cognitive-toolkit).
The recent explosion of neural network research can be largely attributed to the rise of
practical and affordable parallel computation of backpropagation and SGD via the use
of graphics processing units (GPUs), the most popular lower-level framework being
NVIDIA's CUDA platform (developer.nvidia.com/cuda-zone). Moreover, certain
higher-level frameworks, such as the historically more research-oriented PyTorch, make
the development and deployment of various kinds of networks straightforward with
standardised implementations for many layer types, loss functions, and optimisation
methods.

Figure 14: The basic building blocks of a convolutional neural network. The convolution
layers compute local activations of the input, while the pooling layers compute a
summarisation of the input at various stages. The max-pooling layer displayed here
reduces the size of the channel outputs, but retains the number of channels.
3 Speaker recognition and verification
Speaker recognition refers to the study of a number of tasks that utilise speech data with
the aim of tying the speech data to individuals in some manner [52]. These involve
speaker identification, speaker verification, speaker or event classification, speaker
segmentation, speaker tracking and speaker detection. Automatic speaker verification
(ASV), as the focus of this work, refers to a process where the identity of a user of a
system is verified using speech data, while identification refers to the process of iden-
tifying a person from an audio stream, possibly containing speech from multiple per-
sons. Speaker/event classification includes problems such as speaker age classification
or, for example in the case of event classification, classifying an audio event to relate
to music (e.g., singing) or a car (not speech). Speaker segmentation involves the prob-
lem of separating different speakers within an audio sample. Finally, we have speaker
detection and speaker tracking. The former involves detecting the presence of a par-
ticular speaker from an audio source, while the latter involves tracking a person across
multiple audio sources. A distinction should be made with automatic speech recogni-
tion (ASR), which is a broader and older problem, and refers to linguistic analysis of
speech, where the task is to predict the message that was spoken. Machine language
translation falls into the purview of speech recognition. Voice recognition has been
historically used as a synonym for both speech recognition and speaker recognition. In
this thesis, unless otherwise mentioned, we focus on ASV.
ASV can be further split into two different problems depending on what kind of in-
formation is used in the recognition task. In text-dependent speaker recognition, additional
linguistic information is used in conjunction with speech data, whereas in text-
independent speaker recognition, speech data may be to some degree used indepen-
dently of any linguistic information. [52]
3.1 Biometrics
We have seen that speaker recognition by itself is a wide area of study. This subsection
serves for the purpose of contextualising the problem of interest of this work, namely,
the detection of a spoofed audio sample within a speaker verification system. It is one
of many different forms of biometric authentication.
Biometrics is the science of identifying or verifying persons based on physiological
or behavioural characteristics, while the word biometric refers to a specific mode of
biometrics (e.g., fingerprint is a biometric) [53]. The most well-known biometric is
the fingerprint, which has until recently been somewhat of a synonym for biometrics in
general. Examples of behavioural characteristics include handwritten signatures and
indeed certain properties of voice (we will later discuss how human voice actually
contains both physiological and behavioural traits). The key difference of these two
types of characteristics is the fact that physiological characteristics are, at least to some
degree, directly measurable and less dependent on human mental functioning, while
the latter are more complex and a function of a person’s life experience, development
over time and situational setting. Consider how languages and ways of speaking are
learned, for example. Certain physiological characteristics may potentially contain
substantially more information for the purpose of biometrics in terms of a single mea-
surement, such as a fingerprint. In contrast, a behavioural characteristic, such as voice,
may contain less information in terms of a single measurement, implying the need for
measurements over a length of time (this will be a common theme in the following
sections).
Desiderata of a particular biometric can be evaluated by looking at the following crite-
ria [54, 53, 55]:
• Universality – The biometric characteristic should be measurable in the case of
every individual.
• Uniqueness – Persons should be uniquely identifiable by the characteristic.
• Permanence – The characteristic should stay similar even as time passes.
• Collectability – There should exist some way to measure the characteristic.
• Acceptability – Measuring and processing the biometric should be acceptable for
humans, i.e., not humiliating, inconvenient or dangerous.
• Performance – The biometric should provide accuracy and consistency in terms
of the measurement.
While the overall quality of a biometric is some kind of a function of all of the above
criteria, it would be impossible to cover all of them perfectly. Consider, for example, that
certain obvious trade-offs exist between the criteria. An easily collectable biometric, such
as a short speech sample, might have limited usefulness in terms of performance [53].
It should be noted that the problem of fooling or evading a system utilising biometrics is not contained in our desiderata; the inclusion of such ideas could be contested, as these criteria consider the quality of the biometric itself, independent of any system implementation considerations.
3.1.1 Authentication and biometrics
In the domain of information security, the three modes of authentication are based on
possession, knowledge and biometrics [56, 53, 55]. Possessions are any physical
or abstract (i.e., electronic) belongings useful for the purpose of uniquely identifying
a person, while knowledge refers to personal secrets, such as passwords. Biometrics were discussed above: the physiological and behavioural characteristics of a person. These can also be described in terms of the three phrases: "What you have" (possession), "What you know" (knowledge) and "What you are" (biometrics). One, two, or all three modes can be utilised in conjunction, depending on the use-case. In this thesis, we focus primarily on the third mode, in the context of ASV.
While there are numerous potential application areas for biometric authentication, one
way to classify these is into physical access control, logical access control and unique-
ness confirmation [53]. The first one refers to any use-case where the purpose is to
control access to a physical location, such as a particular room within a building, while
in the case of logical access control, authentication is used to control access to a system
or a process (e.g., a course registration system or one’s bank account). The last class of
use-cases refers to situations where biometrics is used to provide an eligibility check
upon user enrollment in a system, e.g., to check against duplicate registrations.
Two methods of biometric authentication can be identified [53]: verification and identification. Verification implies that a user provides some sort of an identifier in conjunction with the required biometrics; the task is then for the system to check that the provided biometrics match the stored biometrics for the given identifier. Biometric identification, on the other hand, implies a search through a database of stored biometrics, where, given a match, credentials are returned to the
user. There are two main categories of speaker identification: closed-set and open-set
identification [57]. In closed-set identification, the speaker model database is assumed
to contain all relevant speaker data, whereas in open-set identification no such assumption holds, allowing for the possibility of no match in terms of the model search. The
open-set problem can be considered to be the more difficult one of the two.
3.1.2 Speaker verification
Figure 15: A high-level view of the ASV process
Let us now look at the ASV task setting in more detail. The ASV process is as follows: a user provides the system with a unique identifier and a speech sample. The system
then builds a model (test speaker model) from the user-provided speech sample and re-
trieves a model (target speaker model) from a database of models corresponding to the
user-provided identifier. The test speaker model, along with the target speaker model,
is then introduced to a binary classifier where the output of the classifier decision de-
termines the outcome of the verification. Typical systems also involve a third model
known as a Universal Background Model (UBM) that acts as a competing model such
that likeness of the test speaker model to the UBM is evidence against the test speaker
model originating from the claimed speaker. Traditional ASV systems, such as those
resembling the GMM-UBM system [7], may use the UBM directly in the classifier, as
shown in Figure 15.
3.2 Speech processing and feature extraction
While deep learning techniques adapted for speaker recognition may abandon much
in terms of hand-crafted feature extraction, the fundamental concepts related to speech
processing and feature extraction remain an important topic of discussion.
Computer processing of speech begins with the conversion of an analog signal that
is continuous in both time and amplitude (energy) into a digital signal that is both
discrete-time and discrete-valued [58]. In this process known as analog-to-digital con-
version (ADC), a band-limited signal is sampled at time intervals, governed by the
sampling theorem, resulting in a signal that consists of continuous amplitude values at
discrete time-steps. The signal then goes through quantisation, in which the continuous values are approximately represented with B bits, such that each sample can be quantised into 2^B distinct levels. Finally, we have the so-called front-end of the ASV
system, in which the actual feature extraction before modelling and classification is
done. For the purposes of this thesis, we will constrain our discussion to the rightmost
section (ASV front-end) of the overall signal processing chain illustrated in Figure 16.
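As a rough illustration of the quantisation step (a hypothetical sketch of ours, not tied to any particular ADC implementation; the helper name `quantise` is our own), a B-bit uniform quantiser maps each sample to one of 2^B levels:

```python
import numpy as np

def quantise(signal, n_bits):
    """Uniformly quantise a signal in [-1, 1] into 2**n_bits levels."""
    levels = 2 ** n_bits
    # Scale to integer level indices, round, then map back to [-1, 1].
    indices = np.clip(np.round((signal + 1.0) / 2.0 * (levels - 1)), 0, levels - 1)
    return indices / (levels - 1) * 2.0 - 1.0

# "Sample" a 100 Hz sinusoid at 8 kHz and quantise it with B = 3 bits.
fs = 8000
t = np.arange(0, 0.01, 1.0 / fs)
x = np.sin(2 * np.pi * 100 * t)
x_q = quantise(x, n_bits=3)
assert len(np.unique(x_q)) <= 2 ** 3  # at most 8 distinct levels
```

The quantisation error of such a uniform quantiser is bounded by half the step size between adjacent levels.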
The properties of good features for ASR were identified by Nolan in [59], essentially
Figure 16: A high-level view of the processes between a recording microphone and the final feature vector
containing many of the attributes we discussed in subsection 3.1. Nolan, however, already identified the problem of fooling the system with mimicry or disguise. Five attributes of good features for ASR in forensics were identified:
• High between-speaker variability, low within-speaker variability
• Resistance to disguise or mimicry
• High likelihood of presence in samples
• Robustness in transmission
• Ease of extraction
These attributes were presented from the perspective of the forensic applications of
ASR, but as is noted in [1], they are relevant regardless of the application area. Fea-
tures in ASR and ASV can be split into auditory and acoustic features [1]. Auditory
features are those that can be identified by a human listener, as opposed to mathe-
matically defined low-level features, the latter of which are of interest in this thesis.
Furthermore, these two groups both contain features that can be either linguistic or non-linguistic in nature – linguistic features deal with language and contain information that is phonological, morphological or syntactic. These linguistic properties may
be present both in acoustic and auditory features. Examples of non-linguistic features
could be the rate and the lengths of pauses during speech. Finally, features in ASR can
be either short-term or long-term features, depending on the length of the speech seg-
ment used to process the features. Typical acoustic features, such as those presented in
the following subsections, are short-term in nature, but features may also be extracted
over a longer speech segment – these are called utterance-level features.
Many short-term acoustic features have been developed over the years; the most popular of these are arguably the Mel-frequency cepstral coefficients (MFCCs) and linear predictive coding (LPC) based features [1].
3.2.1 Mel-frequency cepstral coefficients (MFCCs)
Spectral analysis of a speech signal is at the heart of traditional speaker recognition,
and while numerous different low-level features have been developed over the years in
speech processing, we will approach this discussion from the perspective of the widely
used Mel-frequency cepstral coefficients [8, 57, 52] (MFCCs) and the newer constant-q
cepstral coefficients (CQCC) [9] features. In a typical speech processing setting, the
Figure 17: MFCC feature extraction
common operations are short-term framing and windowing of the signal. A spectral feature vector is normally not computed from the whole speech utterance, but from a relatively short 'snapshot' of the signal, with 20 to 30 ms of local context, known
as a frame. The justification for framing is that, due to the physical constraints of the human voice-producing vocal tract, the signal can be assumed to be approximately statistically stationary during this short time-window.
The analysis frame is moved across the whole signal in an overlapping manner such that a significant portion (usually 50 %) of any frame overlaps with the previous and the next frame.
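The framing scheme described above can be sketched as follows (a minimal illustration of ours; the helper `frame_signal` and the parameter values are not from any particular toolkit):

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

fs = 16000
x = np.random.randn(fs)       # one second of noise as a stand-in for speech
frame_len = int(0.025 * fs)   # 25 ms frame = 400 samples
hop_len = frame_len // 2      # 50 % overlap
frames = frame_signal(x, frame_len, hop_len)
print(frames.shape)           # (79, 400)
```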
The next step is windowing, in which the signal frame is multiplied point-wise by a
suitable window function. The Hamming window (Figure 18) can be regarded as the most commonly utilised windowing function in speech processing [52]. It is defined as
w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad (45)
where n is the sample index within the window and N the length of the window. A closely related
Figure 18: A Hamming window with N = 120
window function is the Hann (Hanning) window [52]:
w[n] = 0.5\left[1 - \cos\left(\frac{2\pi n}{N-1}\right)\right] \qquad (46)
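As a quick sanity check (a sketch of ours, not part of any referenced system), the definitions in (45) and (46) coincide with NumPy's built-in window functions:

```python
import numpy as np

N = 120
n = np.arange(N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # eq. (45)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))        # eq. (46)

# NumPy's built-in windows use the same closed forms.
assert np.allclose(hamming, np.hamming(N))
assert np.allclose(hann, np.hanning(N))
```

Note that the Hann window reaches zero at its endpoints, while the Hamming window stays at 0.08 there.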
Many different windows have been developed for different use-cases [60], including the Bartlett, Poisson, Kaiser, Dolph-Chebyshev and Gaussian windows, among others. The windowed signal frame then goes through the discrete Fourier transform (DFT) [61]:
\mathrm{DFT}_k \triangleq \sum_{n=0}^{N-1} x[n] \exp\left(\frac{-j2\pi nk}{N}\right), \quad k = 0, 1, 2, \ldots, N-1, \qquad (47)
where x[n] denotes the nth sample within the windowed frame, and \mathrm{DFT}_k the kth spectral sample. The discrete Fourier transform is a complex-valued function for analysing the frequency content of a signal when we have a finite number of samples taken at discrete, linearly spaced time-steps. The complex negative exponential in (47) represents the complex conjugate of a sampled complex sinusoid, which is used to analyse
the frequency information from the original signal at the kth frequency bin. We will
break down the DFT operation in more detail below.
To understand how the DFT works, one has to first look at how any signal can be represented as a sum of sinusoidal components [61]. While a continuous signal can be thought of as a sum of an infinite number of different sinusoids extending to infinity, a discrete sampled signal can be represented as a finite sum of sinusoids. Formally,
a complex sinusoid is a function with the form [61]:
s(t) \triangleq A e^{j(\omega t+\phi)} = A\cos(\omega t+\phi) + jA\sin(\omega t+\phi), \qquad (48)
where A is the peak amplitude (often just amplitude), ω = 2πf the radian frequency
(radians/sec), t the time in seconds and φ the initial phase. Additionally, the sum ωt+φ
is known as the instantaneous phase. This is illustrated in Figure 19. The magnitude
of a signal is defined as follows:
|x(t)| \triangleq \sqrt{\mathrm{Re}^2\{x(t)\} + \mathrm{Im}^2\{x(t)\}} \equiv A. \qquad (49)
The real part of the complex sinusoid is known as the in-phase component, while the imaginary part is known as the phase-quadrature component. The formulation in (48) follows from Euler's identity:
e^{j\theta} = \cos(\theta) + j\sin(\theta), \qquad (50)
which implies
Ae^{j\theta} = A\cos(\theta) + jA\sin(\theta). \qquad (51)
Euler’s identity allows for the representation of a complex number z in its polar form:
z = re^{j\theta} = r(\cos(\theta) + j\sin(\theta)), \quad \bar{z} = re^{-j\theta} = r(\cos(\theta) - j\sin(\theta)), \qquad (52)
where \bar{z} is the complex conjugate of z. A comparison with the DFT formula in (47)
shows that the multiplication of the real-valued signal at sample n with the sampled
complex sinusoid results in a complex sinusoid with the magnitude determined by the
value of x[n], given (51) and (49). From Euler's identity, it is easy to show that
Figure 19: The components and parameters of a complex sinusoid
\cos(\theta) = \frac{e^{j\theta} + e^{-j\theta}}{2}, \quad \sin(\theta) = \frac{e^{j\theta} - e^{-j\theta}}{2j}. \qquad (53)
The equations in (53) show that sines and cosines, and hence complex sinusoids, are composed equally of negative and positive frequencies. The negative frequencies have no physical interpretation and can be considered an artefact of the mathematics of the Fourier transform. They are, however, present in the output of the normal implementation of the DFT, which is why we discuss them here. The output of the DFT, given a real-valued input, is conjugate symmetric, such that
\mathrm{DFT}_{N-k} = \overline{\mathrm{DFT}_k}, \qquad (54)
which follows from (47). The DFT can be used to obtain both the magnitude and the phase of the analysed signal. Depending on the task, only the magnitude values from the DFT may be retained; one consideration for such a judgement is whether it is necessary to be able to recover the original sampled signal. In the case of the MFCCs, only the magnitudes of the complex values are retained. In addition to being important for allowing analysis of the sampled signal in terms of the present frequencies, the DFT is also important for allowing manipulation of the underlying frequency content, as done with the mel-scale
Figure 20: Frames and their corresponding FFTs
filterbank in (56).
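The conjugate symmetry in (54) is easy to verify numerically (a small sketch of ours, not from any referenced source):

```python
import numpy as np

x = np.random.randn(64)    # a real-valued signal frame
X = np.fft.fft(x)
N = len(X)
k = np.arange(1, N)

# Conjugate symmetry of the DFT of a real signal: X[N-k] == conj(X[k]).
assert np.allclose(X[N - k], np.conj(X[k]))
# The DC bin (k = 0) is the plain sum of the samples, hence real.
assert abs(X[0].imag) < 1e-9
```

This symmetry is why only about half of the DFT bins carry independent information for a real-valued input.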
The original sampled signal can be recovered with the inverse operation of the DFT,
known as the inverse discrete Fourier transform (IDFT):
x[n] \triangleq \frac{1}{N} \sum_{k=0}^{N-1} \mathrm{DFT}_k \exp\left(\frac{j2\pi nk}{N}\right), \quad n = 0, 1, 2, \ldots, N-1. \qquad (55)
A key property of the DFT is that the sampled signal is assumed to be periodic with a
period of N , which in the case of a complicated signal such as a speech sample results
in so-called spectral leakage. In spectral leakage, the frequency information of the
signal gets spread across multiple frequency bins due to discontinuity in the repeated
signal. One motivation for the use of the window functions described earlier is to
constrain the beginning and the end of the frame towards zero.
When the DFT is used in conjunction with a windowing function and signal framing,
we are actually performing what is known as the short-time Fourier transform (STFT).
The STFT and its visualisation are discussed in subsection 3.2.3. The STFT matrix will
be considered, in this thesis, as a "light-feature", as opposed to the more complicated
signal processing chain presented here for the MFCCs and in subsection 3.2.2 for the
CQCCs. Naive computation of the DFT for a signal frame of length N takes O(N^2) operations, but in practice an optimised algorithm is used, such as the fast Fourier transform (FFT) [62], which runs in O(N \log N). We will omit the details in this work.
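The equivalence of the direct O(N^2) evaluation of (47) and the FFT can be sketched as follows (the helper `naive_dft` is illustrative only and not intended for real use):

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of eq. (47) via the full twiddle-factor matrix."""
    N = len(x)
    n = np.arange(N)
    # Outer product n*k yields every exponent combination at once.
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ x

x = np.random.randn(128)
# The FFT computes exactly the same transform, only faster.
assert np.allclose(naive_dft(x), np.fft.fft(x))
```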
After the DFT computation, a Mel-scale filterbank is used to extract information about
critically placed frequencies, using:
\mathrm{MF}(r) = \frac{1}{A_r} \sum_{k=L_r}^{U_r} |V_r(k)\,\mathrm{DFT}(k)|^2, \quad r = 1, 2, \ldots, R, \qquad (56)
where Vr(k) is the weighting function and Lr and Ur are the first and last frequency-bin
indices of the DFT output, respectively. A_r is the normalising factor used for the rth filter of the filterbank:
A_r = \sum_{k=L_r}^{U_r} |V_r(k)|^2 \qquad (57)
The filterbank (Figure 21) consists of R filters, each of which measures the signal energy around a particular centre frequency in the windowed signal frame being processed.
Figure 21: A mel-scale filterbank with 12 filters between 0 Hz and the Nyquist fre-quency. The figure shows the weight of each filter around its centre frequency. Eachfunction in the figure is zero outside its scaled range.
The Mel (melody) scale is based on studies of the human auditory system, where the perception of frequency is close to linear up to 1 kHz and continues logarithmically thereafter [63, 52]. Various closed-form mel-scale conversion formulae have been presented; a popular one was given in [64]:
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \qquad (58)
with inverse mapping
f = 700\left(10^{m/2595} - 1\right). \qquad (59)
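The mappings (58) and (59) can be sketched directly (the function names below are our own):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)    # eq. (58)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # eq. (59)

f = np.array([100.0, 1000.0, 4000.0, 8000.0])
# The two mappings are exact inverses of each other.
assert np.allclose(mel_to_hz(hz_to_mel(f)), f)
print(hz_to_mel(1000.0))  # close to 1000 mel, near the 1 kHz "knee"
```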
Finally, the MFCC features are obtained by taking the discrete cosine transform (DCT)
of the log-transformed Mel-spectrum obtained earlier:
\mathrm{MFCC}(n) = \frac{1}{R} \sum_{r=1}^{R} \log\left(\mathrm{MF}(r)\right) \cos\left[\frac{2\pi}{R}\left(r + \frac{1}{2}\right)n\right] \qquad (60)
The motivation for using the DCT is two-fold. Firstly, it retains most of the energy of
the input signal in the first coefficients (compression), and secondly, the DCT values
are uncorrelated [1]. Normally, in speech processing and indeed ASR (including ASV
tasks), only the first 12 to 16 of the coefficients obtained in (60) are retained for use as features [65]. There are two reasons for this: firstly, we want to represent the data with as few dimensions as possible, and secondly, the DCT output concentrates the most significant coefficients in the first indices. This is also related to the so-called curse of dimensionality [66]: in high-dimensional datasets, the points become distributed around the edges of a d-dimensional hypercube, where d is the dimensionality of the data. This poses problems for machine learning set-ups where the data dimensionality is substantial in comparison to the amount of available data; as the dimensionality increases, we need exponentially more data for the training process, which is not always possible [67]. The curse of dimensionality is an important motivation behind dimensionality reduction, along with the increased computational and storage requirements that come with high-dimensional data.
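The final DCT-and-truncate step can be sketched with SciPy's DCT routine (a sketch of ours; the filterbank energies below are random placeholders, not real speech):

```python
import numpy as np
from scipy.fftpack import dct

# Hypothetical log mel-filterbank energies for a few frames (R = 26 filters).
rng = np.random.default_rng(0)
log_mel = np.log(rng.uniform(0.1, 10.0, size=(5, 26)))

# DCT along the filter axis; keep only the first 13 coefficients,
# mirroring the common 12-16 coefficient truncation.
mfcc = dct(log_mel, type=2, axis=1, norm='ortho')[:, :13]
print(mfcc.shape)  # (5, 13)
```

Truncating after the DCT is what gives the compact, largely decorrelated representation discussed above.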
After the basic MFCC computation described above, additional steps may be taken,
depending on the use-case. So-called dynamic features, i.e., velocity and acceleration
coefficients [68, 69, 1] have been used in ASR and ASV by appending them to the
original MFCC vector [70]:
d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{t+\theta} - c_{t-\theta})}{2\sum_{\theta=1}^{\Theta} \theta^2}, \qquad (61)
where Θ is the size of the delta window. In (61) we have the first derivative, or delta
coefficient, calculation. The second derivative, or delta-delta, is obtained by making
the calculation over the delta coefficients obtained with (61). With the delta coefficients, we get additional coefficients for each MFCC coefficient: one from the deltas and another from the delta-deltas. Finally, centring and normalisation may be applied to the features [1]. In cepstral mean normalisation (CMN), we subtract the mean feature vector from each feature vector before any classification, where the mean is computed over some sliding time-window. Other normalisation techniques include cepstral variance normalisation (CVN) [71], time-warping [72], relative spectral processing (RASTA) [73], and quantile-based cepstral normalisation [1].
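The delta computation in (61) and a simple utterance-level mean subtraction can be sketched as follows (a rough illustration of ours, with edge frames replicated; the MFCC matrix is a random placeholder, not real speech):

```python
import numpy as np

def deltas(c, theta=2):
    """Delta coefficients per eq. (61), with edge frames replicated."""
    padded = np.pad(c, ((theta, theta), (0, 0)), mode='edge')
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    num = sum(t * (padded[theta + t:len(c) + theta + t]
                   - padded[theta - t:len(c) + theta - t])
              for t in range(1, theta + 1))
    return num / denom

rng = np.random.default_rng(1)
c = rng.standard_normal((100, 13))   # hypothetical MFCC matrix (frames x coeffs)
d = deltas(c)                        # velocity
dd = deltas(d)                       # acceleration
features = np.hstack([c, d, dd])     # the familiar 39-dimensional vectors
features -= features.mean(axis=0)    # utterance-level mean subtraction
print(features.shape)                # (100, 39)
```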
3.2.2 Constant Q cepstral coefficients (CQCCs)
The constant-Q cepstral coefficients (CQCCs) are a recent development in features for
use in ASV [74, 9]. In the CQCC feature extraction process, the constant-Q transform
(CQT) is analogous to the windowed DFT (STFT) calculation described in the previous subsection, but with a varying analysis resolution on both the frequency axis and the time axis, such that the resolution is higher on the time axis at higher frequencies and higher on the frequency axis at lower frequencies. The resolution of a signal in spectral analysis is the ratio of the sample rate and the length of the window (in samples), which in the case of the DFT is constant for every analysis frame. The CQT of an input
signal x[n] is obtained by the following transform:
\mathrm{CQT}(k, n) \triangleq \sum_{j=n-\lfloor N_k/2 \rfloor}^{n+\lfloor N_k/2 \rfloor} x(j)\, a_k^*(j - n + N_k/2), \qquad (62)
where k is the frequency bin index, a_k^* the complex conjugate of a_k and N_k the varying window length. Note that the window length, 2\lfloor N_k/2 \rfloor + 1, depends on the bin index k, unlike in the DFT. In (62), the a_k(n) are the basis functions of the transform, defined as follows:
as follows:
a_k(n) \triangleq \frac{1}{C}\, w\!\left(\frac{n}{N_k}\right) \exp\left[i\left(2\pi n \frac{f_k}{f_s} + \Phi_k\right)\right], \qquad (63)
where C is the scaling factor defined in (64), f_k the frequency of the kth bin, f_s the sample rate and \Phi_k the phase offset of the transform. The scaling factor is given by
C = \sum_{l=-\lfloor N_k/2 \rfloor}^{\lfloor N_k/2 \rfloor} w\!\left(\frac{l + N_k/2}{N_k}\right), \qquad (64)
where N_k is as defined earlier and w(\cdot) the chosen window function. The bin frequencies are fundamentally different from those of the DFT. In the DFT, the frequency bins are linearly spaced, such that f_k = f_{k-1} + \Delta f. In the CQT they are defined as
f_k = f_1 2^{\frac{k-1}{B}}, \qquad (65)
where f_1 = f_{\min} is the frequency of the first bin and B specifies the number of bins per octave. A uniform resampling is done on the signal such that the geometric placement of the bins is preserved, but in linear space [9]. The so-called Q-factor of the CQT is tied to the bins
per octave parameter in the following manner:
Q = \frac{f_k}{f_{k+1} - f_k} = \frac{1}{2^{1/B} - 1}. \qquad (66)
The lengths of the windows for the different frequencies are related to the Q-factor as
N_k = \frac{f_s}{f_k} Q. \qquad (67)
From (67) we see that the analysis window size shrinks as we move upward in the
frequency bins, and vice versa. The overall computation of the CQCC features is as
follows:
\mathrm{CQCC}(n) = \sum_{l=1}^{L} \log |\mathrm{CQT}(l)|^2 \cos\left[\frac{n\left(l - \frac{1}{2}\right)\pi}{L}\right], \quad n = 0, 1, \ldots, L-1, \qquad (68)
where L is the number of the frequency bins in the resampled space. Figure 22 shows
in rough terms the way in which the time resolution is increased at the high frequencies
while the frequency resolution is increased at the lower frequencies.
Figure 22: Comparison of time and frequency resolution of DFT and CQT (based onthe illustration in [9])
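The geometric bin placement (65), the constant Q-factor (66), and the shrinking window lengths (67) can be sketched numerically (the parameter values below, such as the first-bin frequency and the number of bins, are illustrative assumptions of ours):

```python
import numpy as np

fs = 16000.0
f_min = 32.70        # hypothetical first-bin frequency
B = 12               # bins per octave
K = 60               # number of bins

k = np.arange(1, K + 1)
f_k = f_min * 2.0 ** ((k - 1) / B)       # eq. (65): geometric spacing
Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)       # eq. (66): constant Q-factor
N_k = fs / f_k * Q                       # eq. (67): per-bin window lengths

# The Q-factor is the same for every bin ...
assert np.allclose(f_k[:-1] / np.diff(f_k), Q)
# ... and the analysis windows shrink as frequency grows.
assert np.all(np.diff(N_k) < 0)
```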
3.2.3 The spectrogram
So far we have discussed two feature extraction methods used in various speech-related tasks. Unlike many traditional ASR and ASV set-ups, the deep learning approach to ASV (discussed in 3.5) has used the spectrogram of speech as an input feature representation. The spectrogram of speech we use in this thesis is a dB-scale (20 \log_{10} transform) intensity plot of the magnitude values of the sequence of windowed DFT outputs calculated for each frame of the original sampled signal. This
Figure 23: STFT and CQT power spectrogram comparison
amounts to the STFT operation with the usual frame-size, shift and overlap parameters,
discussed in 3.2.1. Figure 24 is a spectrogram of speech generated with the speech processing Python library librosa.7 In Figure 24, we used a window size of 550
Figure 24: A spectrogram of speech
samples, which, with a sample rate of 22050 Hz, corresponds to approximately 25 ms. The overlap was chosen to be half of the window size.
7http://librosa.github.io
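A dB-scale spectrogram along these lines can also be sketched with SciPy's STFT rather than librosa (a sketch of ours; the 440 Hz test tone is a stand-in for real speech, while the window size and overlap follow the text):

```python
import numpy as np
from scipy.signal import stft

fs = 22050
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t)             # a 440 Hz test tone

# 550-sample window (~25 ms at 22050 Hz) with 50 % overlap.
f, times, Z = stft(x, fs=fs, window='hamming', nperseg=550, noverlap=275)
spec_db = 20 * np.log10(np.abs(Z) + 1e-10)  # dB-scale magnitude

# The strongest frequency bin should sit near 440 Hz.
peak = f[np.argmax(spec_db.mean(axis=1))]
print(round(peak))
```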
3.2.4 Other features
In the previous subsections we presented three popular features used in various ASR tasks, but many different features have been developed over the years. The linear frequency
cepstral coefficients (LFCCs) [8] are similar to the MFCC features, but do not contain
the mel-filtering phase of the MFCCs:
\mathrm{LFCC}(n) = \sum_{k=0}^{K-1} \mathrm{DFT}_k \cos\left(\frac{\pi nk}{K}\right), \quad n = 1, \ldots, N, \qquad (69)
where n is the index of the feature vector and k the DFT coefficient index. LFCCs have been shown to offer discriminatory power complementary to MFCCs in ASR tasks [75]. The LFCC features are motivated by findings that vocal tract length differences in the human speech production system show up in the higher frequencies, where the mel-scale MFCCs offer less resolution. MFCCs, LFCCs, and their variants rely on spectral analysis of the signal, but another classic approach to speech analysis is linear prediction, where the speech signal is modelled as [57]:
x[n] = \sum_{l=1}^{L} a_l\, x(n-l) + G u(n), \quad n = 1, \ldots, N, \qquad (70)
where n is the sample index, L the number of output coefficients and the order of the predictor, G the gain, and u(n) the excitation signal. This is also known as the source-filter model of speech. Finally, the a_l are the features or coefficients. The coefficients can be estimated by means of forward linear prediction or backward linear prediction. In the former, the task is to predict the sample x(n) from the previous samples x(n-1), x(n-2), and so forth. In the latter, we predict the value of the sample x(n-L) from the future values x(n), x(n-1), \ldots, x(n-L+1). The
linear prediction cepstrum coefficients [8] are obtained as follows:
\mathrm{LPCC}(n) = \mathrm{LPC}(n) + \sum_{k=1}^{n-1} \frac{k-n}{n}\, \mathrm{LPCC}(n-k)\, \mathrm{LPC}(k). \qquad (71)
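Forward linear prediction and the cepstral recursion can be sketched as follows (a least-squares sketch of ours rather than the usual Levinson-Durbin recursion; the function names are our own, and the recursion follows the form of (71) as printed):

```python
import numpy as np

def lpc(x, order):
    """Forward linear prediction via least squares: predict x[n]
    from x[n-1] ... x[n-order]."""
    A = np.stack([x[order - l:len(x) - l] for l in range(1, order + 1)], axis=1)
    a, *_ = np.linalg.lstsq(A, x[order:], rcond=None)
    return a

def lpcc(a, n_coeffs):
    """Cepstral recursion in the spirit of eq. (71)."""
    c = np.zeros(n_coeffs)
    for n in range(1, n_coeffs + 1):
        acc = a[n - 1] if n <= len(a) else 0.0
        for k in range(1, n):
            acc += (k - n) / n * c[n - k - 1] * (a[k - 1] if k <= len(a) else 0.0)
        c[n - 1] = acc
    return c

rng = np.random.default_rng(2)
x = rng.standard_normal(400)        # placeholder signal, not real speech
a = lpc(x, order=10)
print(lpcc(a, 12).shape)            # (12,)
```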
3.3 Speaker modelling and classification
We have reviewed above some of the most commonly used feature extraction ap-
proaches in ASR and ASV. We now turn the discussion towards speaker modelling
and classification. In this subsection, we describe three key approaches: the classic
mixture-based approach (Section 3.3.1), the PLDA-i-vector approach (Section 3.3.2),
and the deep learning approach (Section 3.3.3).
Both ASV and ASV replay attack detection imply a binary classification task, as discussed in Section 2.1. From the statistical hypothesis testing point of view, these are analogous, but not equivalent, set-ups. We can formally define the statistical hypotheses for the ASV case as
H_0: the sample comes from the claimed identity,
H_1: the sample is from another identity. \qquad (72)
In the case of replay attack detection, the hypotheses can be set as
H_0: the sample is from a genuine speaker,
H_1: the sample is from an impostor. \qquad (73)
In both cases, "sample" may refer to features extracted from a single frame, or to features computed over some period of time (an utterance).
3.3.1 Gaussian mixture model approaches to ASV
A GMM-based approach to speaker verification was originally presented in [6] and further enhanced in [7], where, in the enrollment stage, the input user speech data is used to adapt a target speaker model from a speaker-independent universal background model (UBM). At the verification stage, a new test utterance is compared against both the target model and the UBM using a likelihood ratio test. The baseline evaluation system used in the ASVspoof 2017 evaluation [2] was still based on similar core ideas, which speaks for the continued relevance of this kind of speaker verification. We describe the relevant ideas in detail because a similar system is used as a baseline for comparison in this work.
Reynolds et al. [7] presented an automatic speaker verification system consisting of a front-end speech processing system connected to a binary log-likelihood classifier. The front-end used a frame length of 20 ms with a shift of 10 ms. Additionally, a speech activity detector was employed to discard non-speech data. From there, MFCC features were extracted for use in the model estimation.
A key component of this system is the universal background model (UBM) used in the classification task. The role of the UBM is to serve as the alternative hypothesis of this set-up; that is, the test utterance is classified as belonging either to the claimed identity or to the UBM. The background model can be trained either from the whole training data set or from a particular sub-population. Reynolds et al. noted that training the UBM with the whole training data may lead to bias towards certain sub-populations in the data, such as a particular gender or age group.
Figure 25: The mixture models of the canonical GMM-UBM system (picture used: Pixabay)
The GMMs for both the background model and the claimed identity are trained with the EM algorithm described in Section 2.4, which is based on the statistical principle of maximum likelihood estimation. An important detail of the system presented by Reynolds et al. is that the claimed user model is not trained from the ground up. Instead, the claimed identity model is adapted from the UBM.
The classification in the GMM-UBM system is done with a log-likelihood ratio test between the claimed identity speaker model and the background model:
\Lambda(X) = \log p(X \mid \lambda_{\mathrm{hyp}}) - \log p(X \mid \lambda_{\mathrm{ubm}}). \qquad (74)
The GMM-UBM system, as a set of basic principles, has remained popular in ASR and ASV since its conception.
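The log-likelihood ratio test (74) can be sketched with two toy diagonal-covariance GMMs standing in for the adapted target model and the UBM (all parameter values below are illustrative placeholders of ours, not a real trained system):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(X, weights, means, covs):
    """Average per-frame log-likelihood under a GMM."""
    per_comp = np.stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                         for w, m, c in zip(weights, means, covs)])
    return np.mean(np.log(per_comp.sum(axis=0)))

rng = np.random.default_rng(3)
# Hypothetical 2-component, 2-D models: a "target" GMM and a "UBM".
target = (np.array([0.5, 0.5]), [np.zeros(2), np.ones(2) * 3], [np.eye(2)] * 2)
ubm = (np.array([0.5, 0.5]), [np.ones(2) * 10, np.ones(2) * -10], [np.eye(2)] * 2)

X = rng.standard_normal((200, 2))   # test frames drawn near the target model
llr = gmm_loglik(X, *target) - gmm_loglik(X, *ubm)   # eq. (74)
print(llr > 0)  # True: the claimed identity is accepted
```

A positive ratio is evidence for the claimed identity; in practice the ratio is compared against a tuned decision threshold rather than zero.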
3.3.2 Linear statistical model approaches
Early work on so-called utterance-level features was done in [76], motivated by the need for features computed over a length of time longer than a single speech frame. The GMM supervector was originally presented in [77], where the component means of a trained speaker GMM represent the features computed for variable-length utterances. The supervector model was adapted for speaker recognition in [78].
These high-dimensional vectors have since been used as a key component in conjunction with factor analysis and support vector machines (SVMs) [14]. SVMs can
Figure 26: GMM supervector
be used to construct a binary classifier in which linearly separable data is separated by a hyperplane such that the margin between the samples of the two classes is maximal. Specifically, we define linearly separable data as follows. For a labelled dataset (y_1, \mathbf{x}_1), \ldots, (y_N, \mathbf{x}_N), y_i \in \{-1, 1\}, there exists a vector \mathbf{w} and a bias term b such that the following holds for all i = 1, \ldots, N:
\mathbf{w}^T \mathbf{x}_i + b \geq 1 \quad \text{if } y_i = 1,
\mathbf{w}^T \mathbf{x}_i + b \leq -1 \quad \text{if } y_i = -1. \qquad (75)
The optimal hyperplane is one for which
\mathbf{w}_0^T \mathbf{x} + b = 0. \qquad (76)
Support vectors are the vectors for which y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1. Another way to write the optimal hyperplane is
\mathbf{w}_0 = \sum_{i=1}^{N} y_i \alpha_i^0 \mathbf{x}_i. \qquad (77)
In the case of linearly non-separable data, a kernel function can be used. The purpose of the kernel function is to map the original input space into a new, higher-dimensional space in which the data is at least approximately linearly separable. A Gaussian mixture model-support vector machine (GMM-SVM) system for speaker verification was presented in [79], where two GMMs were trained similarly to the GMM-UBM approach. An SVM was then used as a classifier on the GMM supervectors. The kernel-based classifier presented for ASV was defined as
f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + d, \qquad (78)
where K(\cdot, \cdot) is the kernel, d a learned constant, and y_i the labels. In (78), \sum_{i=1}^{N} \alpha_i y_i = 0 and \alpha_i > 0. The \mathbf{x}_i are the support vectors. The linear kernel in [79] was based on an approximation of the Kullback-Leibler (KL) divergence between two speech utterances:
D(g_a \,\|\, g_b) = \int_{\mathbb{R}^n} g_a(\mathbf{x}) \log\left(\frac{g_a(\mathbf{x})}{g_b(\mathbf{x})}\right) d\mathbf{x}, \qquad (79)
where g_a and g_b are MAP-adapted GMMs from the two speech utterances. The following approximation was presented:
d(\mathbf{m}^a, \mathbf{m}^b) = \frac{1}{2} \sum_{i=1}^{N} \lambda_i (\mathbf{m}_i^a - \mathbf{m}_i^b)^T \Sigma_i^{-1} (\mathbf{m}_i^a - \mathbf{m}_i^b), \qquad (80)
which was then used to construct the following kernel function:
K(\mathrm{utt}_a, \mathrm{utt}_b) = \sum_{i=1}^{N} \lambda_i (\mathbf{m}_i^a)^T \Sigma_i^{-1} \mathbf{m}_i^b. \qquad (81)
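The supervector kernel (81) is linear: with diagonal covariances it reduces to a plain dot product after a per-component scaling, which a short sketch can verify (all values below are synthetic, and the names `kl_kernel` and `phi` are our own):

```python
import numpy as np

rng = np.random.default_rng(4)
C, D = 4, 3                            # mixture components and feature dimension
lam = rng.uniform(0.1, 1.0, C)         # component weights
sigma = rng.uniform(0.5, 2.0, (C, D))  # diagonal covariances

def kl_kernel(ma, mb):
    """Supervector kernel of eq. (81): sum_i lam_i * (m_i^a)^T Sigma_i^-1 m_i^b."""
    return sum(l * (a / s) @ b for l, a, s, b in zip(lam, ma, sigma, mb))

def phi(m):
    """Explicit feature map: the kernel is a dot product after scaling."""
    return np.concatenate([np.sqrt(l) * mi / np.sqrt(s)
                           for l, mi, s in zip(lam, m, sigma)])

ma = rng.standard_normal((C, D))       # component means for utterance a
mb = rng.standard_normal((C, D))       # component means for utterance b
assert np.isclose(kl_kernel(ma, mb), phi(ma) @ phi(mb))
```

This linearity is what makes the kernel cheap to evaluate even for high-dimensional supervectors.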
Factor analysis (FA) modelling for ASV utilising the GMM supervectors was originally presented in [80], and the development of FA modelling for ASR and ASV tasks has led to the so-called i-vector approach. In this kind of modelling, the GMM supervector tied to an individual speaker can be written in the form [1]
\mathbf{m}_{s,h} = \mathbf{m}_0 + \mathbf{m}_{\mathrm{spk}} + \mathbf{m}_{\mathrm{chn}} + \mathbf{m}_{\mathrm{res}}, \qquad (82)
where \mathbf{m}_0 is the speaker-, channel- and environment-independent bias term, \mathbf{m}_{\mathrm{spk}} the speaker-dependent component, \mathbf{m}_{\mathrm{chn}} the channel-dependent component, and \mathbf{m}_{\mathrm{res}} the residual. The component \mathbf{m}_0 comes directly from the trained UBM, while \mathbf{m}_{\mathrm{spk}}, \mathbf{m}_{\mathrm{chn}}, and \mathbf{m}_{\mathrm{res}} are considered random vectors in the statistical sense and capture the variance due to speaker and environment variation. The variation in environments comes from the different possible recording situations: microphones and properties of the immediate recording environment, such as the amount of echo, and so on.
Eigenvoice adaptation was an early FA-based model for ASR, where the speaker GMM supervector is modelled as
\mathbf{m}_s = \mathbf{m}_0 + \mathbf{V}\mathbf{y}_s, \qquad (83)
where \mathbf{m}_0 is the UBM supervector and \mathbf{V}\mathbf{y}_s = \mathbf{m}_{\mathrm{spk}}. The \mathbf{y}_s are the latent factors of the model.
In the identity vector (i-vector) approach [10], the model is
\mathbf{m}_{s,h} = \mathbf{m}_0 + \mathbf{T}\mathbf{w}_{s,h}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad (84)
where the \mathbf{w}_{s,h} are the latent factors, also known as the i-vectors. The estimated vector of latent variables is used as features for a classifier. The model in (84) is known as a total variability model (TVM), as it models the variance from speakers and channels at the same time. The parameter matrix \mathbf{T}, also known informally as an i-vector extractor, can be trained with an EM-like process described in [81] for the eigenvoice matrix.
Given a speech utterance u, \mathbf{w} is computed as follows:
E[\mathbf{w} \mid u] = \mathbf{w} = (\mathbf{I} + \mathbf{T}^t \Sigma^{-1} \mathbf{N}(u) \mathbf{T})^{-1}\, \mathbf{T}^t \Sigma^{-1} \mathbf{F}(u), \qquad (85)
where \mathbf{N}(u) contains the zeroth-order Baum-Welch statistics for the UBM's Gaussian components given the utterance u:
N_c = \sum_{t=1}^{L} P(c \mid \mathbf{y}_t, \Omega), \qquad (86)
where c denotes the Gaussian component, L the number of feature frames in the utterance, \mathbf{y}_t the features and \Omega the UBM. In (85), \mathbf{F}(u) contains the centralised first-order Baum-Welch statistics for the utterance u, given the UBM \Omega:
\mathbf{F}(u) = \sum_{t=1}^{L} P(c \mid \mathbf{y}_t, \Omega)(\mathbf{y}_t - \mathbf{m}_c). \qquad (87)
This approach can be considered a dimensionality reduction of the GMM supervectors described earlier.
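The posterior computation (85) can be sketched in a few lines of linear algebra (all dimensions and statistics below are synthetic placeholders of ours, not values from a trained system):

```python
import numpy as np

rng = np.random.default_rng(5)
C, D, R = 8, 5, 10                  # components, feature dim, i-vector dim
CD = C * D

T = rng.standard_normal((CD, R)) * 0.1   # total variability matrix
Sigma_inv = np.eye(CD)                   # UBM covariances (identity for brevity)

# Zeroth- and first-order Baum-Welch statistics for one utterance.
N_c = rng.uniform(1.0, 20.0, C)          # per-component occupancies, eq. (86)
N_u = np.kron(np.diag(N_c), np.eye(D))   # expanded to a CD x CD block diagonal
F_u = rng.standard_normal(CD)            # centralised first-order stats, eq. (87)

# Posterior mean of the latent factor, eq. (85).
w = np.linalg.solve(np.eye(R) + T.T @ Sigma_inv @ N_u @ T,
                    T.T @ Sigma_inv @ F_u)
print(w.shape)   # (10,)
```

Solving the linear system directly avoids forming the explicit matrix inverse in (85).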
Figure 27: i-vector extraction
3.3.3 Deep learning approaches to ASV
Deep neural architecture-based approaches have shown success in speech recognition, but using such models directly for ASR or ASV tasks is relatively new [1]. According to [1], i-vectors have been used in conjunction with a DNN for ASR [82]. DNNs have also been used to build a phonetically-aware model for ASR [83], as well as to perform feature extraction [84].
3.4 Vulnerabilities and countermeasures
An ASV system as part of an authentication system, or as an independent biometric
authentication system, is vulnerable to a number of possible attack types in various
subsystems. These various attack points are illustrated in Figure 28.

Figure 28: ASV vulnerabilities (adapted for ASV from [85])

The ISO/IEC 30107-1:2016 standard offers the following definition for the presentation attack [85]:
”Presentation to the biometric data capture subsystem with the goal of
interfering with the operation of the biometric system.”
The standard notes the existence of multiple different ways to perform this type of
attack. Our focus is the replay attack, which is achieved by introducing a pre-recorded
sample of a legitimate user to the biometric system. The threat of replay attacks has
been identified in a number of independent studies [2, 3, 4, 5]. The ease and low cost of this type of attack make it particularly interesting, and it has been shown that, without any spoofing detection in use, the performance of all existing ASV systems can be degraded.
3.5 Deep learning in replay attack detection
Deep neural architectures are a less-studied area of ASV anti-spoofing. Recent work on utilising DNNs for anti-spoofing [11] shows promise, but it remains inconclusive

Figure 29: An illustration of the triviality of a replay attack. The phone on the left plays a recorded utterance of the target speaker, while the phone on the right can be used to call the attacked system. Alternatively, the phone itself may be the attacked system, if voice authentication is in use.

whether the feature learning approach presented in [12] allows for better discrimination.
We will now present both approaches here for comparison.
A light CNN approach to anti-spoofing was presented in [11], where direct classification with a reduced, or light, CNN utilising the STFT spectrogram as input achieved performance surpassing the state-of-the-art i-vector system. The CQT spectrogram showed comparable performance, and a combination of a CNN and a recurrent neural network was also investigated with STFT spectrograms, although with reduced performance. The CNN used here is not the standard one we presented in Section 2.6.3, where we showed the standard max-pooling operation. Here, in addition to standard max-pooling, a max-feature-map (MFM) [13] pooling function is utilised, defined as

y^k_{ij} = max(x^k_{ij}, x^{k+N/2}_{ij}),    i = 1, ..., H,  j = 1, ..., W,  k = 1, ..., N/2.    (88)
In (88), k denotes the channel in the convolution (a channel in a CNN corresponds to the output of the convolution between the layer input and a particular kernel used at that layer). In the MFM, the N channels of convolution outputs are split into two groups, for which element-wise maxima are taken. Whereas the conventional max-pooling operation reduces the spatial size of each channel, the MFM halves the number of channels of the previous layer. The motivation for this addition is that the MFM introduces feature selection into the network through the selection between the convolution channels. A peculiarity of the LCNN is the absence of classical non-linear activation layers: the MFM operation replaces these completely. The operation also reduces the number of optimised parameters, hence the light CNN (LCNN) nomenclature for this type of network.

Figure 30: The max-feature-map layer.

In the proposed architectures, the MFM operation is always used
after a convolution, and before max-pooling. The proposed architecture in [11] for the
pure LCNN approach used kernels of varying sizes, but all with stride of 1, and max-pooling with 2 × 2 kernels with stride of 2. Recall that the stride is the number of indices the convolution kernel is moved at each step (here 1 for both horizontal and vertical movement).
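The MFM of Eq. (88) is simple to state in code. A numpy sketch on a single (channels, height, width) activation tensor, with purely illustrative shapes:

```python
import numpy as np

def max_feature_map(x):
    """Split the N channels into two halves and take the element-wise max."""
    n = x.shape[0]
    assert n % 2 == 0, "MFM needs an even number of channels"
    return np.maximum(x[: n // 2], x[n // 2 :])

x = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)  # N=2, H=2, W=3
y = max_feature_map(x)
print(y.shape)  # (1, 2, 3): channel count halved, spatial size unchanged
```

Contrast this with spatial max-pooling, which keeps the channel count and shrinks H and W instead.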
Figure 31 summarises the architecture presented in [11]. A problem with the front-end
processing in this type of system comes from the constant input dimension limitation
of the CNN. The most successful approach for spectrogram input has been to set the CNN input layer to some size, e.g., according to the average sample length, and then to repeat shorter signals and truncate longer ones to match the desired dimensions [11].
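The repeat-or-truncate front-end can be sketched in a few lines; the target length here is arbitrary, whereas our experiments fixed it to 2 seconds of audio:

```python
import numpy as np

def fix_length(signal, target_len):
    """Tile the signal until it covers target_len samples, then cut."""
    reps = int(np.ceil(target_len / len(signal)))
    return np.tile(signal, reps)[:target_len]

short = np.array([1.0, 2.0, 3.0])
print(fix_length(short, 7))              # [1. 2. 3. 1. 2. 3. 1.]
print(fix_length(np.arange(10.0), 4))    # [0. 1. 2. 3.]
```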
Another limitation of this network is related to the classification process itself. In [11], it is noted that, because the network overfits the relatively small ASVspoof 2017 dataset [86], a separate GMM-based classifier was proposed.
Figure 31: The LCNN architecture from [11]. Conv is short for convolution, MFM stands for max-feature-map, and FC for fully-connected. Conv (5 × 5 / 1 × 1) implies a convolutional layer with filter height and width of 5 and stride of 1 in both directions. The output dimensions of the various layers are shown on the right.
A fundamentally different approach to utilising DNNs in replay attack detection was presented in [12], where CQCCs and high-frequency cepstral coefficients (HFCCs) were used together as input to a traditional CNN. HFCCs are obtained from speech as follows. First, the signal is filtered with a high-pass filter, with the cut-off frequency at the upper limit of typical human speech (in [12], 3.5 kHz was used). After filtering, conventional MFCC processing follows, with the exception of mel filtering, which is left out because all of the higher frequencies are of interest. The HFCCs follow the reasoning that artefacts from recording and playback appear in the higher frequencies, outside the typical speech frequency range [12]. Thirty coefficients were retained from the final DCT output, for which delta and delta-delta coefficients
were calculated. For this system, two classification strategies can be used: classifica-
tion via direct binary classification with the final layer of the network, or via a separate
classifier, such that the final layer of the network acts as a multi-class classifier with
the classes corresponding to the different recording environments within the data. In
the latter set-up, the multi-class information is fed into a SVM-classifier for the final
decision.
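A rough sketch of the HFCC front-end described above. The frame length, hop size, window choice, and the brick-wall FFT high-pass standing in for the Butterworth filter of [12] are all assumptions made to keep the example self-contained:

```python
import numpy as np

def highpass(x, fs, cutoff):
    """Ideal FFT high-pass (a stand-in for the Butterworth filter of [12])."""
    X = np.fft.rfft(x)
    X[np.fft.rfftfreq(len(x), 1 / fs) < cutoff] = 0
    return np.fft.irfft(X, n=len(x))

def dct2(x):
    """Orthonormal type-II DCT."""
    N = len(x)
    n = np.arange(N)[:, None]
    k = np.arange(N)[None, :]
    c = (x @ np.cos(np.pi * (n + 0.5) * k / N)) * np.sqrt(2.0 / N)
    c[0] /= np.sqrt(2.0)
    return c

def hfcc(signal, fs=16000, n_coef=30, frame_len=400, hop=160):
    x = highpass(signal, fs, 3500.0)          # cut-off at 3.5 kHz
    out = []
    for i in range(0, len(x) - frame_len + 1, hop):
        frame = x[i : i + frame_len] * np.hamming(frame_len)
        logspec = np.log(np.abs(np.fft.rfft(frame)) ** 2 + 1e-10)
        out.append(dct2(logspec)[:n_coef])    # no mel filterbank applied
    return np.array(out)

x = np.random.default_rng(0).normal(size=16000)   # 1 s of noise at 16 kHz
print(hfcc(x).shape)   # (98, 30): frames x coefficients
```

Delta and delta-delta coefficients would then be appended to each 30-dimensional frame vector, as described above.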
3.6 System measurement and evaluation
While different performance measures have been used in biometrics over the years,
the standard for measuring ASV and ASV-antispoofing performance is the equal error
rate (EER) [1]. The two types of errors within a biometric system are the false accept (FA) and the false reject (FR). From the perspective of replay detection, a false accept corresponds to a situation where a replay sample successfully fools the detector, whereas a false reject happens when a legitimate sample is classified as a replay sample. Two measures are related to these two errors:

Spoofing false acceptance rate (SFAR) = FA errors / replay attempts,
Spoofing false rejection rate (SFRR) = FR errors / legitimate samples.    (89)
In a binary classification set-up, such as in ASV, the system ultimately makes the distinction between a replay sample and a legitimate sample via the use of a decision threshold, such as the log-likelihood threshold described in Section 3.3.1. Such thresholds have to be based on empirical tuning of the system as well as on the desired characteristics of the system. When higher security is desired, the threshold should be adjusted such that the FAR score is reduced at the cost of a higher FRR score, and vice versa for the opposite case. The measure that summarises the behaviour of the classifier as the decision threshold is adjusted is the EER, which is the common value of the FAR and FRR scores at the operating point where the two are equal. The behaviour of the system as the threshold is adjusted is often visualised by what is known as the detection error trade-off (DET) curve [87].
In practice, the EER can be estimated with the ROCCH-EER algorithm, described in
[88].
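A naive empirical estimate illustrates the definition: sweep the decision threshold over the pooled scores and take the operating point where FAR and FRR are closest. The ROCCH-EER of [88] is the more principled convex-hull-based estimate; this sketch is only meant to show the idea:

```python
import numpy as np

def eer(genuine, spoof):
    """Genuine/spoof score arrays; a higher score means more genuine-like."""
    best = (np.inf, 1.0)                 # (|FAR - FRR|, mean error rate)
    for t in np.sort(np.concatenate([genuine, spoof])):
        far = np.mean(spoof >= t)        # spoofed samples accepted
        frr = np.mean(genuine < t)       # genuine samples rejected
        best = min(best, (abs(far - frr), (far + frr) / 2))
    return best[1]

gen = np.array([0.9, 0.8, 0.7, 0.6])
spf = np.array([0.4, 0.3, 0.2, 0.1])
print(eer(gen, spf))   # 0.0: the two score sets are perfectly separable
```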
4 Experimental set-up and results
For this thesis, initial experiments consisted of a comparison between the baseline
GMM system with CQCC features and a reduced and modified variant of the LCNN
system with spectrogram features. The baseline system consisted of a GMM-UBM
system with 90-dimensional CQCC features, which included static coefficients as well
as delta and delta-delta coefficients. The bins per octave parameter was set at 96. The
number of Gaussian components for the spoofed and genuine models was 512. The
first tested LCNN system, adapted from the spectrogram LCNN system in [11], used 129 × 236-dimensional, second-order Butterworth high-pass-filtered spectrograms as network input. The first dimension corresponds to the number of frequency bins and
the second to the number of frames. Unified input shape for the network was achieved
by processing 2 seconds of audio from each file either by repeating or truncating the
signal. The best-performing network utilised a total of 9 convolutional layers, 4 of
which had filter size of 1× 1. These 1× 1 layers are effectively so-called network-in-
network layers [89, 11]. Two fully-connected layers were used at the output-end of the
network: one with 64 units, for which a dropout of 0.7 was used, and one with 2 units
serving as the final softmax classifier layer of the network. The original network had a significantly higher parameter count, at over 370K parameters. In our variant, the number of parameters dropped due to the smaller CNN input, and it was reduced further by lowering the number of channels in the convolutional layers. In terms of training parameters, a learning rate of 10−4 and Xavier initialisation were used.
The architecture is detailed in Figure 32.
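The baseline's decision rule can be sketched as an average per-frame log-likelihood ratio between the genuine and spoofed GMMs. The toy two-component, two-dimensional models below merely stand in for the actual 512-component GMMs trained on 90-dimensional CQCCs:

```python
import numpy as np

def gmm_loglik(Y, w, mu, var):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    ll = -0.5 * (((Y[:, None, :] - mu) ** 2) / var
                 + np.log(2 * np.pi * var)).sum(-1)       # (frames, comps)
    ll += np.log(w)
    m = ll.max(axis=1, keepdims=True)                     # log-sum-exp
    return (m + np.log(np.exp(ll - m).sum(axis=1, keepdims=True))).ravel()

def llr_score(Y, genuine, spoof):
    """Positive score: utterance looks more genuine than spoofed."""
    return gmm_loglik(Y, *genuine).mean() - gmm_loglik(Y, *spoof).mean()

rng = np.random.default_rng(1)
genuine = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]),
           np.ones((2, 2)))
spoof = (np.array([0.5, 0.5]), np.array([[4.0, 4.0], [5.0, 5.0]]),
         np.ones((2, 2)))
Y = rng.normal(size=(50, 2))             # frames near the genuine model
print(llr_score(Y, genuine, spoof) > 0)  # True: classified as genuine
```

Thresholding this score then gives the binary replay/genuine decision discussed in Section 3.6.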
4.1 The ASVspoof 2017 v2 data
The 2nd Automatic Speaker Verification Spoofing and Countermeasures Challenge
(ASVspoof 2017) database v2 [90] is the improved version of the original 2017
ASVspoof dataset [91], which was developed for the purpose of development and test-
ing of replay detection in ASV. The dataset contains speech from 42 speakers across 179 sessions and 61 different replay configurations. The dataset is provided with a predetermined
split into three standard portions: train, development and evaluation, although the split
between training samples and development samples can be adjusted. The speech sam-
ples in the train set are used for model training and this set contains 3014 samples with
equal split into labelled spoofed and genuine samples.

Figure 32: The reduced LCNN architecture, based on the architecture summarised in Figure 31.

The development set is used to
develop and tune the replay detection system; it contains 1710 samples, with 760 labelled genuine and 950 labelled spoofed samples. The evaluation set contains 13306 samples, with 12008 spoofed and 1298 genuine samples.
4.2 The results and discussion
The results for the development data obtained with the architecture described earlier are displayed in Table 1. The baseline results are shown without any additional optimisation of the CQCC features.
Table 1: Results comparison

System                   | EER % (dev) | EER % (eval)
CQCC-GMM Baseline        | 11.28       | 24.77 [92]
LCNN_FFT                 | 3.95 [11]   | 6.73 [11]
GD-ResNet-18 + Attention | 0.0 [92]    | 0.0 [92]
LCNN reduced             | 7.33        | -
LCNN full (STFT)         | 26.69       | 41.71
LCNN full (GD-gram)      | 42.89       | 47.19
The network reliably achieved over 90 percent accuracy on the development set after
45 epochs with batch-size of 64 (the original relatively low learning rate of 10−4 was
kept from [11]). However, for the much larger evaluation set the results were inconclu-
sive and no reliable discrimination between replay and genuine samples was achieved.
The DET-curves of the tested systems are shown in Figure 33. The discrepancy of the results between the two portions of the ASVspoof 2017 v2 dataset may be explained by the relative size as well as the nature of the evaluation set, which consists of completely new attacks. For these experiments, an additional hypothesis was included: it was assumed that replay-related noise would be prevalent in the higher frequencies. This led to the adoption of the high-pass filter described earlier. The idea originates from the work in [12] with the so-called high-frequency cepstral coefficients (HFCCs). To see whether the lopsided spectrograms (zeroes due to filtering) had an effect, we also tested the same model with larger spectrograms cut at the same 3.5 kHz point, retaining only the frequency bins from there up towards the Nyquist frequency. However,
no marked improvement was achieved by this change with respect to the discrepancy
between the development and evaluation results. In another attempt to improve the results on the larger dataset, we considered again a much larger spectrogram as the input.

Figure 33: DET-curves of the tested systems
Experiments were done with non-filtered 512× 400 sized spectrograms, where 512 is
the number of frequency bins and 400 the number of frames. With this input size, the
original number of filters and the filter sizes were taken from [11], but working with
this larger model proved difficult due to the space and time requirements. This larger model was trained for up to 100 epochs, but no reasonable results for the evaluation set were obtained.
The original LCNN architecture described earlier was also tested with two types of
inputs: the STFT spectrogram as well as the group delay gram [92] extracted from the
STFT spectrogram. Recent work in [92] on deep learning approaches to replay attack
detection suggested that the group delay information computed from STFT offered
better discriminatory information for ASV anti-spoofing. In [92], the popular image
classification network, ResNet [93], was adopted for replay attack detection by utilising
the group delay gram in place of the STFT spectrogram. The results are in Table 1.
Group delay is the negative derivative of the phase portion of the STFT:

τ(ω, t) = −dθ(ω, t)/dω.    (90)
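In practice, the group delay is not computed by differentiating unwrapped phase; Eq. (91) below gives a stable form using the STFTs of x[n] and n·x[n]. A single-frame numpy sketch (the small floor added to the denominator is our own guard against division by zero):

```python
import numpy as np

def group_delay_frame(frame):
    """Group delay of one analysis frame via the two-FFT identity."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)           # STFT frame of x[n]
    Y = np.fft.rfft(n * frame)       # STFT frame of n*x[n]
    denom = np.abs(X) ** 2 + 1e-10   # small floor for numerical stability
    return (X.real * Y.real + X.imag * Y.imag) / denom

frame = np.sin(2 * np.pi * 0.1 * np.arange(256))
tau = group_delay_frame(frame)
print(tau.shape)                     # (129,): one value per rfft bin
```

Stacking such frame-wise vectors over time yields the group delay gram used in [92].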
Group delay may also be computed via:
τ(ω, t) = (X_R(ω, t) Y_R(ω, t) + X_I(ω, t) Y_I(ω, t)) / |X(ω, t)|²,    (91)
where X(ω, t) is the STFT of x[n] and Y(ω, t) the STFT of n·x[n]. The subscripts R and I denote the real and imaginary parts of the STFT. The group delay gram as a representation of speech was identified already in [94]. A comparison of the STFT spectrogram and the GD-gram is shown in Figure 34. Both STFT spectrograms and GD-grams were investigated with the
original LCNN architecture.

Figure 34: STFT spectrogram and group delay

The EER results for the large LCNN systems were obtained as follows. First, the network was trained on the ASVspoof v2 train data only,
after which the trained network was used to extract features drawn from the fully con-
nected layer before the final softmax layer. This was suggested in [11], but the details
of the procedure were omitted. For the feature extraction, 200 passes over the datasets were performed, during which the spectrogram/GD-gram-based feature was extracted from a random 2-second chunk of each audio file. This results in extracted datasets from the train, development and evaluation portions of the ASVspoof v2 data that are heavily overlapping, due to the overall short lengths of the files. The extracted sets are then used to evaluate the EER of the development and evaluation sets via GMMs. While
features extracted with the STFT version of the large network offered some discrim-
ination capability on the development set, the performance on the evaluation set re-
mained little better than a random guess. Both large networks utilising either STFT
features or GD-grams suffered heavily from overfitting on the training data. The following regularisation techniques were tested with both features: batch normalisation
for both convolutional and fully-connected layers, spatial dropout as well as normal
1D dropout, and random cropping. Random cropping and resizing of images are both
popular regularisation and data augmentation techniques used in computer vision and
image classification tasks. The employed regularisation techniques proved insufficient to prevent wide-scale memorisation by the network and served only to delay the eventual memorisation.
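The random-cropping augmentation mentioned above can be sketched in a few lines; cropping is along the time axis, and the crop width is a free parameter:

```python
import numpy as np

def random_time_crop(spec, width, rng):
    """Crop `width` consecutive frames at a random offset (axis 1 = time)."""
    if spec.shape[1] <= width:
        return spec
    start = rng.integers(0, spec.shape[1] - width + 1)
    return spec[:, start : start + width]

rng = np.random.default_rng(0)
spec = np.arange(4 * 10).reshape(4, 10)    # 4 frequency bins, 10 frames
crop = random_time_crop(spec, 6, rng)
print(crop.shape)   # (4, 6)
```

Each training pass then sees a slightly different window of every spectrogram, which is what delays, but here did not prevent, memorisation.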
Our findings possibly point towards two interesting avenues in terms of the replay at-
tack detection task: data-augmentation and the use of pre-trained models. Given the
current data-restricted situation, the results in [92] are interesting because they point to
the usefulness of pre-trained image classification networks for replay detection. The
other avenue is open to exploration: generative models such as WaveNet [95] and generative adversarial networks such as SEGAN [96] might allow for new ways of augmenting the spoofing detection corpora that exist today.
5 Conclusion
Automatic speaker verification is a less-adopted, yet interesting, option for biometric authentication. The threat of both synthetic spoofing attacks and replay attacks
has been recognised in recent years. The cost-effective nature of the replay attacks
makes the detection of such attacks an important problem to be solved. In this work,
we looked at some relatively recent deep learning approaches for replay detection and
attempted to reproduce some of the results.
In Section 2, we discussed the machine learning background and showed how Gaussian mixture models can be trained with the expectation-maximisation algorithm. In Section 3, we offered an overview of automatic speaker recognition and verification, along with a background on biometrics. Finally, in Section 4, we described the ASVspoof 2017 v2 data as well as our experiments.
In Section 4, two potentially interesting avenues to pursue in replay attack detection
were identified, both of which would largely try to solve the same issue (of limited
data): data-augmentation and the use of pre-trained models. The use of state-of-the-art
generative models for speech might allow for new types of approaches to be pursued
for replay detection, due to larger datasets. Pre-trained models, on the other hand,
potentially allow for good classification performance on restricted datasets. We would
consider both of these two avenues to be interesting to pursue in the future.
Looking at the wider context of biometrics, where ASV and ASV replay detection both sit, and considering how large a digital footprint each individual leaves when interacting with different kinds of systems, an interesting topic is emerging: biometric de-identification [97]. For the field of speaker recognition, this topic offers both new challenges and opportunities, as future speaker recognition systems may need to account not only for possible spoofing and replay attempts, but also for highly sophisticated attempts to evade recognition.
References
[1] J. H. L. Hansen and T. Hasan, “Speaker Recognition by Machines and Humans:
A tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99,
Nov. 2015.
[2] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech, 2017, pp. 2–6.
[3] F. Alegre, A. Janicki, and N. Evans, “Re-assessing the threat of replay spoofing
attacks against automatic speaker verification,” in 2014 International Conference
of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, Sep.
2014, pp. 1–6.
[4] Z. Wu, S. Gao, E. S. Cling, and H. Li, “A study on replay attack and anti-spoofing
for text-dependent speaker verification,” in Signal and Information Processing
Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, Dec.
2014, pp. 1–5.
[5] J. Gałka, M. Grzywacz, and R. Samborski, “Playback attack detection
for text-dependent speaker verification over telephone channels,” Speech
Communication, vol. 67, pp. 143–153, Mar. 2015. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167639314000880
[6] D. A. Reynolds, “Speaker identification and verification using Gaussian
mixture speaker models,” Speech Communication, vol. 17, no. 1, pp. 91–108,
Aug. 1995. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/016763939500009D
[7] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker Verification
Using Adapted Gaussian Mixture Models,” Digital Signal Processing, vol. 10,
no. 1, pp. 19–41, Jan. 2000. [Online]. Available: http://www.sciencedirect.com/
science/article/pii/S1051200499903615
[8] S. Davis and P. Mermelstein, “Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences,” IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366,
Aug. 1980.
[9] M. Todisco, H. Delgado, and N. Evans, “Constant Q cepstral coefficients:
A spoofing countermeasure for automatic speaker verification,” Computer
Speech & Language, vol. 45, pp. 516–535, Sep. 2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0885230816303114
[10] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Fac-
tor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[11] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. Interspeech, 2017, pp. 82–86.
[12] P. Nagarsheth, E. Khoury, K. Patil, and M. Garland, “Replay attack detection using DNN for channel discrimination,” in Proc. Interspeech, 2017, pp. 97–101.
[13] X. Wu, R. He, Z. Sun, and T. Tan, “A Light CNN for Deep Face Representation
with Noisy Labels,” arXiv:1511.02683 [cs], Nov. 2015, arXiv: 1511.02683.
[Online]. Available: http://arxiv.org/abs/1511.02683
[14] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning,
vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: https:
//link.springer.com/article/10.1023/A:1022627411411
[15] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge,
USA: MIT Press, 2014. [Online]. Available: http://ebookcentral.proquest.com/
lib/uef-ebooks/detail.action?docID=3339490
[16] C. M. Bishop, Pattern Recognition and Machine Learning. New York, USA: Springer, 2006.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, Mas-
sachusetts: The MIT Press, Nov. 2016.
[18] V. Vapnik, “Principles of Risk Minimization for Learning Theory,”
in Advances in Neural Information Processing Systems 4, J. E.
Moody, S. J. Hanson, and R. P. Lippmann, Eds. Morgan-Kaufmann,
1992, pp. 831–838. [Online]. Available: http://papers.nips.cc/paper/
506-principles-of-risk-minimization-for-learning-theory.pdf
[19] V. N. Vapnik, Statistical learning theory. Wiley, 1998.
[20] D. J. Bartholomew, M. Knott, and I. Moustaki, Latent Variable Models and Factor Analysis: A Unified Approach. London, UK: Wiley, 2011.
[21] I. Koch, Analysis of multivariate and high-dimensional data, ser. Cambridge se-
ries in statistical and probabilistic mathematics. Cambridge University Press,
2014.
[22] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from
Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society.
Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977. [Online]. Available:
http://www.jstor.org/stable/2984875
[23] R. A. Fisher, “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 222, no. 594–604, pp. 309–368, Jan. 1922. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsta.1922.0009
[24] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Informa-
tion Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
[25] H. Steinhaus, “Sur la division des corps matériels en parties,” Bulletin de l’Académie Polonaise des Sciences, vol. IV, no. 12, pp. 801–804, 1956.
[26] J. MacQueen, “Some methods for classification and analysis of multivariate
observations.” The Regents of the University of California, 1967. [Online].
Available: https://projecteuclid.org/euclid.bsmsp/1200512992
[27] C. M. Bishop, Neural networks for pattern recognition. Clarendon, 1995.
[28] B. Widrow and M. E. Hoff, “Neurocomputing: Foundations of Research,” J. A.
Anderson and E. Rosenfeld, Eds. Cambridge, MA, USA: MIT Press, 1988, pp.
123–134. [Online]. Available: http://dl.acm.org/citation.cfm?id=65669.104390
[29] C. von der Malsburg, “Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.” Springer, Berlin, Heidelberg, 1986, pp. 245–248. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-642-70911-1_20
[30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by
back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[Online]. Available: https://www.nature.com/articles/323533a0
[31] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”
Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec.
1989. [Online]. Available: https://doi.org/10.1007/BF02551274
[32] K. Hornik, “Approximation capabilities of multilayer feedforward networks.”
Neural Networks, vol. 4, no. 2, pp. 251–257, 1991. [Online]. Available:
http://search.proquest.com/docview/25365047
[33] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen Netzen,” Diploma thesis, Institut für Informatik, Technische Universität München, München, Germany, 1991.
[34] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, Gradient Flow in Re-
current Nets: the Difficulty of Learning Long-Term Dependencies, 2001.
[35] V. Nair and G. Hinton, “Rectified linear units improve Restricted Boltzmann ma-
chines,” 2010, pp. 807–814.
[36] P. J. Werbos, “Beyond regression: New tools for prediction and analysis in the behavioral sciences,” Ph.D. dissertation, Harvard University, Cambridge, MA, USA, 1974.
[37] D. B. Parker, “Learning-logic.” Tech. Rep., 1985.
[38] D. R. Wilson and T. R. Martinez, “The general inefficiency of batch training for
gradient descent learning,” Neural Networks, vol. 16, no. 10, pp. 1429–1451,
Dec. 2003. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0893608003001382
[39] S. Ruder, “An overview of gradient descent optimization algorithms,”
arXiv:1609.04747 [cs], Sep. 2016, arXiv: 1609.04747. [Online]. Available:
http://arxiv.org/abs/1609.04747
[40] B. T. Polyak, “Some methods of speeding up the convergence of iteration
methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4,
no. 5, pp. 1–17, Jan. 1964. [Online]. Available: http://www.sciencedirect.com/
science/article/pii/0041555364901375
[41] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the Importance of
Initialization and Momentum in Deep Learning,” in Proceedings of the 30th
International Conference on International Conference on Machine Learning
- Volume 28, ser. ICML’13. Atlanta, GA, USA: JMLR.org, 2013, pp. III–
1139–III–1147. [Online]. Available: http://dl.acm.org/citation.cfm?id=3042817.
3043064
[42] G. E. Hinton, N. Srivastava, and K. Swersky, “Lecture 6a: Overview of mini-batch gradient descent,” Toronto, Canada, 2012. [Online]. Available: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
[43] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”
arXiv:1412.6980 [cs], Dec. 2014, arXiv: 1412.6980. [Online]. Available:
http://arxiv.org/abs/1412.6980
[44] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for
Online Learning and Stochastic Optimization,” Journal of Machine Learning
Research, vol. 12, no. Jul, pp. 2121–2159, 2011. [Online]. Available:
http://jmlr.org/papers/v12/duchi11a.html
[45] M. D. Zeiler, “ADADELTA: An Adaptive Learning Rate Method,”
arXiv:1212.5701 [cs], Dec. 2012, arXiv: 1212.5701. [Online]. Available:
http://arxiv.org/abs/1212.5701
[46] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in
Advances in Neural Information Processing Systems 27, Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.
Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http:
//papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[47] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The Marginal
Value of Adaptive Gradient Methods in Machine Learning,” in Advances in
Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates,
Inc., 2017, pp. 4148–4158. [Online]. Available: http://papers.nips.cc/paper/
7003-the-marginal-value-of-adaptive-gradient-methods-in-machine-learning.
[48] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep
feedforward neural networks,” in Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, Mar. 2010, pp. 249–256.
[Online]. Available: http://proceedings.mlr.press/v9/glorot10a.html
[49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: A Simple Way to Prevent Neural Networks from Overfitting,”
Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online].
Available: http://jmlr.org/papers/v15/srivastava14a.html
[50] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied
to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–
2324, Nov. 1998.
[51] A. Karpathy, “Course notes for Stanford CS231: Convolutional Neural
Networks for Visual Recognition,” accessed April 1, 2018. [Online]. Available:
http://cs231n.github.io/convolutional-networks/
[52] H. Beigi, Fundamentals of speaker recognition. New York: Springer US, 2011.
[53] R. Bolle, Ed., Guide to biometrics, ser. Springer professional computing. New
York: Springer, 2004.
[54] R. Clarke, “Human Identification in Information Systems: Management Chal-
lenges and Public Policy Issues,” Information Technology & People, vol. 7, no. 4,
pp. 6–37, Dec. 1994.
[55] S. K. Modi, Biometrics in identity management : concepts to applications, ser.
Artech House information security and privacy series. Artech House, 2011.
[56] B. Miller, “Vital signs of identity [biometrics],” IEEE Spectrum, vol. 31, no. 2,
pp. 22–30, Feb. 1994.
[57] J. Benesty, M. M. Sondhi, and Y. Huang, Eds., Springer handbook of speech
processing. Berlin; London: Springer, 2008.
[58] E. C. Ifeachor, Digital signal processing : a practical approach, 2nd ed. Prentice
Hall, 2002.
[59] F. Nolan, The phonetic bases of speaker recognition. Cambridge: Cambridge
University Press, 1983.
[60] J. O. Smith, Spectral audio signal processing. Center for Computer Research in
Music and Acoustics, Department of Music : W3K, 2011.
[61] ——, Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications, 2nd ed., 2007. [Online]. Available: https://ccrma.stanford.edu/~jos/st/
[62] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
[63] S. S. Stevens, J. Volkmann, and E. B. Newman, “A Scale for the Measurement
of the Psychological Magnitude Pitch,” The Journal of the Acoustical Society
of America, vol. 8, no. 3, pp. 185–190, Jan. 1937. [Online]. Available:
https://asa-scitation-org.ezproxy.uef.fi:2443/doi/abs/10.1121/1.1915893
[64] D. O’Shaughnessy, Speech communication: human and machine. Addison-
Wesley Pub. Co., 1987.
[65] F. Camastra, Machine learning for audio, image and video analysis : theory and
applications, second edition ed., ser. Advanced information and knowledge pro-
cessing. Springer-Verlag London, 2015.
[66] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[67] S. Theodoridis, Pattern Recognition, 4th ed. Elsevier Science, 2008.
[68] S. Furui, “Comparison of speaker recognition methods using statistical features
and dynamic features,” IEEE Transactions on Acoustics, Speech, and Signal Pro-
cessing, vol. 29, no. 3, pp. 342–350, Jun. 1981.
[69] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau,
S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A.
Reynolds, “A Tutorial on Text-Independent Speaker Verification,” EURASIP
Journal on Advances in Signal Processing, vol. 2004, no. 4, p. 101962, Dec.
2004. [Online]. Available: https://asp-eurasipjournals.springeropen.com/articles/
10.1155/S1110865704310024
[70] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore,
J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK book,
Jan. 2002.
[71] H. Boril and J. H. L. Hansen, “Unsupervised Equalization of Lombard Effect
for Speech Recognition in Noisy Adverse Environments,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1379–1393, Aug.
2010.
[72] J. W. Pelecanos and S. Sridharan, “Feature warping for robust speaker verifica-
tion,” in Odyssey, 2001.
[73] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transac-
tions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, Oct. 1994.
[74] J. Brown, “Calculation of a Constant-Q Spectral Transform,” Journal of the
Acoustical Society of America, vol. 89, no. 1, pp. 425–434, Jan. 1991.
[75] X. Zhou, D. Garcia-Romero, R. Duraiswami, C. Espy-Wilson, and S. Shamma,
“Linear versus mel frequency cepstral coefficients for speaker recognition,” in
2011 IEEE Workshop on Automatic Speech Recognition Understanding, Dec.
2011, pp. 559–564.
[76] J. Markel, B. Oshika, and A. Gray, “Long-term feature averaging for speaker
recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. 25, no. 4, pp. 330–337, Aug. 1977.
[77] R. Kuhn, P. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Fincke,
and M. Contolini, “Eigenvoices for speaker adaptation,” pp. 1771–1774, 1998.
[Online]. Available: http://www.eurecom.fr/publication/198
[78] P. Kenny, M. Mihoubi, and P. Dumouchel, “New MAP estimators for speaker
recognition.” Centre de Recherche Informatique de Montréal (CRIM), Canada:
International Speech Communication Association, 2003, pp. 2961–2964.
[79] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines
using GMM supervectors for speaker verification,” IEEE Signal Processing Let-
ters, vol. 13, no. 5, pp. 308–311, May 2006.
[80] P. Kenny and P. Dumouchel, “Disentangling speaker and channel effects in
speaker verification,” vol. 1, 2004, pp. I37–I40.
[81] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling with sparse
training data,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3,
pp. 345–354, May 2005.
[82] O. Ghahabi and J. Hernando, “i-Vector Modeling with Deep Belief Networks for
Multi-Session Speaker Recognition,” 2014.
[83] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker
recognition using a phonetically-aware deep neural network,” in ICASSP, IEEE
International Conference on Acoustics, Speech and Signal Processing - Proceed-
ings, May 2014, pp. 1695–1699.
[84] T. Yamada, L. Wang, and A. Kai, “Improvement of distant-talking speaker iden-
tification using bottleneck features of DNN,” 2013, pp. 3661–3664.
[85] “ISO/IEC 30107-1:2016 - Information technology – Biometric presentation
attack detection – Part 1: Framework,” Jan. 2016. [Online]. Available:
https://www.iso.org/standard/53227.html
[86] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans,
M. Todisco, and H. Delgado, “ASVspoof: The Automatic Speaker Verification
Spoofing and Countermeasures Challenge,” IEEE Journal of Selected Topics in
Signal Processing, vol. 11, no. 4, pp. 588–604, Jun. 2017. [Online]. Available:
http://ieeexplore.ieee.org/document/7858696/
[87] A. F. Martin, G. R. Doddington, T. Kamm, M. Ordowski, and M. A. Przy-
bocki, “The DET curve in assessment of detection task performance,” in EU-
ROSPEECH, 1997.
[88] N. Brümmer and E. de Villiers, “The BOSARIS Toolkit: Theory, Algorithms
and Code for Surviving the New DCF,” arXiv:1304.2865 [cs, stat], Apr. 2013,
arXiv: 1304.2865. [Online]. Available: http://arxiv.org/abs/1304.2865
[89] M. Lin, Q. Chen, and S. Yan, “Network In Network,” arXiv:1312.4400 [cs], Dec.
2013, arXiv: 1312.4400. [Online]. Available: http://arxiv.org/abs/1312.4400
[90] N. Evans, M. Sahidullah, J. Yamagishi, M. Todisco, K. A. Lee, H. Delgado,
and T. Kinnunen, “The 2nd Automatic Speaker Verification Spoofing and
Countermeasures Challenge (ASVspoof 2017) Database, Version 2,” Apr. 2018.
[Online]. Available: https://datashare.is.ed.ac.uk/handle/10283/3055
[91] T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamäki,
D. Thomsen, A. Sarkar, Z. H. Tan, H. Delgado, M. Todisco, N. Evans, V. Hau-
tamäki, and K. A. Lee, “RedDots replayed: A new replay spoofing attack cor-
pus for text-dependent speaker verification research,” in 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017,
pp. 5395–5399.
[92] F. Tom, M. Jain, and P. Dey, “End-To-End Audio Replay Attack Detec-
tion Using Deep Convolutional Networks with Attention,” in Interspeech, Hyder-
abad, India, Sep. 2018.
[93] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image
Recognition,” arXiv:1512.03385 [cs], Dec. 2015, arXiv: 1512.03385. [Online].
Available: http://arxiv.org/abs/1512.03385
[94] H. A. Murthy, R. M. Hegde, and V. R. R. Gadde, “The modified group delay
feature: a new spectral representation of speech,” in INTERSPEECH, 2004.
[95] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves,
N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative
Model for Raw Audio,” arXiv:1609.03499 [cs], Sep. 2016, arXiv: 1609.03499.
[Online]. Available: http://arxiv.org/abs/1609.03499
[96] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech Enhancement
Generative Adversarial Network,” arXiv:1703.09452 [cs], Mar. 2017, arXiv:
1703.09452. [Online]. Available: http://arxiv.org/abs/1703.09452
[97] S. Ribaric, A. Ariyaeeinia, and N. Pavesic, “De-identification for privacy
protection in multimedia content: A survey,” Signal Processing: Image
Communication, vol. 47, pp. 131–151, Sep. 2016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0923596516300856