disclaimers-space.snu.ac.kr/bitstream/10371/168036/1/000000160062.pdf · 2020-05-19 · deep...
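Attribution-NonCommercial-NoDerivs 2.0 Korea
Users are free, provided they follow the conditions below, to:
• copy, distribute, transmit, display, perform, and broadcast this work.
Under the following conditions:
• For any reuse or distribution, you must make clear the license terms applied to this work.
• These conditions do not apply if you obtain separate permission from the copyright holder.
Users' rights under copyright law are not affected by the above.
This is a human-readable summary of the Legal Code.
Disclaimer
Attribution. You must attribute the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.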
Ph.D. Dissertation (Engineering)
Environmental Sound Classification
and Disentangled Factor Learning
for Speech Enhancement
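(Korean title: 음성 향상을 위한 환경음 분류 및 팩터 분리 학습)
February 2020
Graduate School, Seoul National University
Department of Electrical and Computer Engineering
Soo Hyun Bae (배수현)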
Abstract
Sounds carry a large amount of information about our everyday environment,
especially human speech. However, environmental sound can also be an important
factor in understanding the surrounding environment for user-customized services.
Environmental sound is noise to be removed in applications that extract
speech information, and an object to be recognized in applications that extract
environmental information. From this perspective, we propose deep learning-based
acoustic environment classification and speech enhancement techniques.
The goal of acoustic scene classification is to classify a test recording into one
of the predefined acoustic scene classes. In the last few years, deep neural networks
(DNNs) have achieved great success in various learning tasks and have also been
used for the classification of environmental sounds. While DNNs are showing their
potential in the classification task, they cannot fully utilize the temporal informa-
tion. In this thesis, we propose a neural network architecture for the purpose of
using sequential information. The long short-term memory (LSTM) layers extract
the sequential information from consecutive audio features. The convolutional neural
network (CNN) layers learn the spectro-temporal locality from spectrogram images,
and the fully connected layers summarize the outputs of two networks to take ad-
vantage of the complementary features of the LSTM and CNN by combining them.
By using the proposed combination structure, we achieved higher performance com-
pared to the conventional DNN, CNN, and LSTM architectures.
Overlapping acoustic event classification is the task of estimating multiple acous-
tic events in a mixed source. In the case of non-overlapping event classification, many
approaches have achieved great success using various feature extraction methods and
deep learning models. However, in most real-life situations, acoustic events are over-
lapped, and different events may share similar properties. Simultaneously detecting
mixed sources is a challenging problem. In this thesis, we propose a classification
method for overlapping acoustic events that incorporates joint training with the
source separation framework. Since overlapping acoustic events are mixtures of multiple
sources, we train a source separation model and a multi-label classification model
to estimate the types of the overlapping acoustic events. The source separation model
is trained to reconstruct the target sources by minimizing the interference of over-
lapping events. Joint training can be conducted to achieve end-to-end optimization
between the acoustic event source separation and multi-label estimation.
Speech enhancement techniques aim to improve the quality and intelligibility of
speech degraded by additive background noise. Most of the
recently proposed deep learning-based speech enhancement techniques have focused
on designing the neural network architectures as a black box. However, it is often
beneficial to understand what kinds of hidden representations the model has learned.
Since real-world speech data are drawn from a generative process involving
multiple entangled factors, disentangling the speech factor can help the trained
model achieve better speech enhancement performance. With the recent
success in learning disentangled representation using neural networks, we explore
a framework for disentangling speech and noise, which has not been exploited in
conventional speech enhancement algorithms. In this thesis, we propose a novel
noise-invariant speech enhancement method that manipulates the latent features to
distinguish between the speech and noise features in the intermediate layers using an
adversarial training scheme. Experimental results show that our model successfully
disentangles the speech and noise latent features. Consequently, the proposed model
not only achieves better enhancement performance but also offers a more robust
noise-invariant property than conventional speech enhancement techniques.
Keywords: Speech enhancement, acoustic scene classification, overlapping acoustic
event classification, source separation, disentangled factor learning, deep learning
Student number: 2012-20781
Contents
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Environmental Sound Classification
1.2 Speech Enhancement
1.3 Disentangled Factor Learning
1.4 Outline of the thesis
2 Deep Learning Models for Acoustic Scene Classification
2.1 Introduction
2.2 Long Short-Term Memory
2.3 Parallel Combination of LSTM and CNN
2.3.1 Feature Extraction
2.3.2 LSTM Layers
2.3.3 CNN layers
2.3.4 Connected Layer of LSTM and CNN
2.4 Experiments
2.4.1 Dataset and Measurement
2.4.2 Neural Networks Setup
2.4.3 Results and Discussion
2.5 Summary
3 Overlapping Acoustic Event Classification Based on Joint Training with Source Separation
3.1 Introduction
3.2 Source Separation of Overlapping Acoustic Event
3.3 Proposed Method Using Joint Training
3.3.1 Source Separation Model
3.3.2 Multi-Label Classification Model
3.3.3 Joint Training Method
3.4 Experiments
3.4.1 Dataset and Data Augmentation
3.4.2 Experimental Setup
3.4.3 Evaluation of Source Separation
3.4.4 Acoustic Event Classification Results
3.5 Summary
4 Disentangled Feature Learning for Noise-Invariant Speech Enhancement
4.1 Introduction
4.2 Masking-Based Speech Enhancement
4.3 Concept of Domain Adversarial Training
4.4 Disentangling Speech and Noise factors
4.4.1 Neural Network Architecture
4.4.2 Training Objectives
4.4.3 Adversarial Training for Disentangled Features
4.5 Experiments and Results
4.5.1 Dataset and Feature Extraction
4.5.2 Network Setup
4.5.3 Objective Measures
4.5.4 Performance Evaluation
4.5.5 Analysis of Noise-Invariant Speech Enhancement
4.5.6 Disentangled Feature Representations
4.6 Summary
5 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgments
List of Figures
2.1 Scheme of the proposed method.
2.2 Neural network structure for the proposed technique.
3.1 Scheme of the proposed method.
3.2 Joint training structure for the proposed technique.
3.3 Comparison between separated and integrated source separation models.
3.4 The source separation performance (SDR [dB])
3.5 The source separation performance (SIR [dB])
3.6 Results of source separation in the time domain.
3.7 Multi-task learning for overlapping acoustic event classification.
4.1 Scheme of DNN-based speech enhancement method.
4.2 The architecture of the proposed model for disentangled feature learning.
4.3 Plot of losses on training the proposed model.
4.4 The architectures of the baseline models.
4.5 (From top to bottom) The spectrograms of noisy speech degraded by metro noise with −3 dB SNR, enhanced speech by the snT model, enhanced speech by the snDT model, and the corresponding clean speech, respectively.
4.6 Results of subjective preference test (%) comparing the speech quality for the snT and snDT models with various SNR values.
4.7 Variances of PESQ scores for the 14 different noise types in various SNR environments.
4.8 Visualization of speech latent feature (zs) using t-SNE in the matched noise condition
4.9 Visualization of speech latent feature (zs) using t-SNE in the mismatched noise condition
List of Tables
2.1 Frame-based classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset.
2.2 Segment-based (30s) classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset. Asterisk(*) CNN-LSTM represents the accuracy on Evaluation Dataset.
3.1 Precision performance of overlapping acoustic event classification.
3.2 Recall performance of overlapping acoustic event classification.
3.3 F-score of overlapping acoustic event classification.
4.1 Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the matched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.
4.2 Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the mismatched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.
Chapter 1
Introduction
1.1 Environmental Sound Classification
Environmental sound classification, which attempts to classify or detect audio
signals into predetermined classes, constitutes one of the main tasks of the emerg-
ing research field named “machine hearing.” Environmental sound is classified into
two categories. An acoustic scene is a complex environmental sound from multiple
sources, and an acoustic event is a single sound from a specific source.
The goal of acoustic scene classification is to classify a sound into one of the
predefined classes that characterize the environment in which it was recorded. To
deal with the acoustic scene classification, many approaches have been proposed,
including feature representation and classification models. A variety of acoustic fea-
tures have been used to represent the acoustic scenes and events. Examples include
single- or multi-dimensional log-mel spectrogram, wavelet spectrogram, and a kind
of i-vector extraction from traditional features such as mel-frequency cepstral
coefficients (MFCC) [1]. Moreover, many methods for combining multiple acoustic
features have been proposed, such as MFCC, gammatone filter, and log-mel
energy [2], or even a wide range of features. Before deep learning was actively studied,
the support vector machine (SVM) was one of the most successful learning models
in a number of scene classification tasks [3], [4]. Recently, many deep learning-based
scene classification techniques have been proposed and have shown outstanding
performance in classifying acoustic scenes.
An acoustic event is a segment of environmental audio that commonly occurs in everyday
life, such as coughing, a phone ringing, or a clashing sound. Acoustic event
classification (AEC) and detection (AED) aim to recognize the audio elements inside
an audio clip. Recognizing acoustic events in audio can be utilized in various appli-
cations, including indoor environment recognition [5], surveillance systems [6] and
automatic audio indexing [7]. Recently, as interest in this area has grown, large
datasets [8] have been released and challenges such as the detection and classification of
acoustic scenes and events (DCASE) challenge have been held. Research on AED
can be separated into two main scenarios, overlapping and non-overlapping. Over-
lapping AED is a much more challenging problem due to the mixture of acoustic
sources and is considered to be more important because acoustic events often overlap
with each other in real life recordings.
1.2 Speech Enhancement
Speech enhancement techniques aim to improve the quality and intelligibility of
speech degraded by additive background noise. In a variety
of applications, speech enhancement is considered an essential pre-processing
step. This technique can be directly employed to improve the quality of mobile
communications in noisy environments or to enhance speech signals for hearing aid
devices [9], [10] before amplification. Speech enhancement has also been widely used
as a pre-processing technique in automatic speech recognition (ASR) [11], [12] and
speaker recognition systems [13] for more robust performances.
Over the past several decades, a myriad of approaches have been developed in
the speech research community for better speech enhancement. The spectral subtraction
method [14] suppresses stationary noise in the noisy input speech by subtracting
the spectral noise bias computed during non-speech activity periods. The
minimum mean-square error (MMSE) based spectral amplitude estimator [15], [16]
showed promising results in terms of reducing residual noise as compared to the
spectral subtraction method or the Wiener filtering-based algorithm [17]. Least mean
square adaptive filtering (LMSAF) based speech enhancement approaches can approach
the optimal filtering performance of the Wiener filter. Moreover, they do not need a priori
knowledge and can adapt to the external environment by self-learning. However,
these approaches have some disadvantages, including slow convergence, strong
sensitivity to non-stationary noise, and a trade-off between convergence and stability
[18], [19]. The minima controlled recursive averaging (MCRA) noise estimation
was also introduced in [20], and its performance is known to be reasonably competitive
in environments with relatively high signal-to-noise ratios (SNRs).
However, since these statistical models are constructed based on a stationarity as-
sumption, their performances generally tend to deteriorate in low SNR or highly
non-stationary noise conditions. Non-negative matrix factorization (NMF) is one
of the most common template-based approaches to speech enhancement [21], [22],
which models noisy observations as a weighted sum of non-negative source bases.
NMF-based speech enhancement methods are more robust to non-stationary noise
conditions as compared to the statistical model-based methods. These approaches,
however, often result in signal distortion in the enhanced speech since they are based
on an unrealistic assumption that speech spectrograms are linear combinations of the
basis spectra. Due to the complex nature of the noise corruption process, non-linear
models such as deep neural networks (DNNs) have been suggested as an alternative
choice for modeling the relationship between the noisy and the corresponding clean
speech utterances.
1.3 Disentangled Factor Learning
Real-world speech data are drawn from a generative process involving multiple
entangled factors. A challenge in understanding speech data is learning to
disentangle the underlying factors of variation that give rise to the observations. The
factors of variation involved in generating a speech recording include the speaker’s
attributes as well as noise and channel information. The difficulty of disentangling
these hidden factors is that, in most real-world situations, each can influence the
observation in a different and unpredictable way. By separating out the desired factors,
disentangled factor learning can help improve performance on the target
task. In this thesis, we propose a method to disentangle a factor carrying speech
components and a factor carrying noise properties from the noisy speech input.
1.4 Outline of the thesis
In this thesis, motivated by the success of DNNs in speech processing, we
apply deep learning approaches to environmental sound classification and
speech enhancement.
In Chapter 2, we propose a neural network architecture that exploits
sequential information. The proposed structure is composed of two separate lower
networks and one upper network. We refer to these as long short-term memory
(LSTM) layers, convolutional neural network (CNN) layers and connected layers,
respectively. The LSTM layers extract the sequential information from consecutive
audio features. The CNN layers learn the spectro-temporal locality from spectrogram
images. Finally, the connected layers summarize the outputs of two networks to
take advantage of the complementary features of the LSTM and CNN by combining
them. To compare the proposed method with other neural networks, we conducted
a number of experiments on the TUT acoustic scenes 2016 dataset which consists
of recordings from various acoustic scenes.
In Chapter 3, we propose a classification method for overlapping acoustic events
which incorporates joint training with a source separation framework. Since overlapping
acoustic events are mixtures of multiple sources, we train a source separation
model and a multi-label classification model to estimate the types of the overlapping
acoustic events. The source separation model is trained to reconstruct the target
sources by minimizing the interference of overlapping events. Joint training can be
conducted to achieve end-to-end optimization between the acoustic event source sep-
aration and multi-label estimation. To evaluate the proposed method, we conducted
a number of experiments using artificially mixed data.
In Chapter 4, we propose a novel noise-invariant speech enhancement method
which manipulates the latent features to distinguish between the speech and noise
features in the intermediate layers using an adversarial training scheme. To compare
the performance of the proposed method with other conventional algorithms, we
conducted experiments in both the matched and mismatched noise conditions using
TIMIT and TSPspeech datasets.
The rest of the thesis is organized as follows. Chapter 2 introduces the
proposed acoustic scene classification method using a parallel combination of LSTM
and CNN. In Chapter 3, joint training with source separation is proposed for
overlapping acoustic event classification. A novel speech enhancement method
using disentangled feature learning is proposed in Chapter 4. Finally, the conclusions are
drawn in Chapter 5.
Chapter 2
Deep Learning Models for
Acoustic Scene Classification
2.1 Introduction
Acoustic scene classification aims to recognize the environmental sounds that
occur over a period of time. Many approaches have been proposed for acoustic scene
classification, covering feature representation, classification models, and post-processing.
The support vector machine (SVM) was one of the most successful learning models in
a number of scene classification tasks. As the SVM is a binary classifier, additional
methods must be combined to apply it to multi-class problems, such as the
use of tree or clustering schemes [3], [4]. Furthermore, many machine learning-based
scene classification techniques were proposed in the detection and classification of
acoustic scenes and events (DCASE) challenge 2013 [23]–[25].
However, as deep learning techniques have been widely used in various learning
tasks, researchers have started to apply them to acoustic scene classification as well
[26], [27]. In [28], a DNN-based sound event classification algorithm was evaluated
with several image features.
Deep neural networks (DNNs) are powerful pattern classifiers that can
learn highly nonlinear relationships between the input features and
output targets. Although DNNs work well in classification tasks, they cannot
be used to map sequences to sequences because of their structural limitations. To
overcome this shortcoming, recurrent neural networks (RNNs) and long short-term
memory (LSTM), which is a special type of RNN, have been applied to sequence
learning [29].
DNNs can only map the present input vector to an output vector, whereas an LSTM
can map an input sequence to an output sequence or vector. Therefore, an LSTM can learn
temporal information from consecutive input vectors. The authors of [30]
and [31] proposed sound event detection techniques based on bi-directional LSTM,
which yielded higher performance than DNNs. Unlike sound events, which
occur within a short time frame, acoustic scenes persist over a relatively longer
duration. Thus, applying RNNs to acoustic scene classification is expected to improve
performance.
Other approaches were proposed to use convolutional neural networks (CNNs)
with spectrogram image features (SIF) [32]. In [33], the authors addressed the im-
portance of spectro-temporal locality and proposed a CNN-based acoustic event
detection algorithm.
In this chapter, we propose to combine the LSTM and CNN in parallel as
lower networks in order to exploit sequential correlation and local spectro-temporal
information. In the LSTM layers, sequences of mel-frequency cepstral coefficient
(MFCC) features are used as input in order to extract the sequential information.
The CNN layers learn the spectro-temporal locality from the SIF, and SIF clips are
set to have the same length as the number of timesteps of the LSTM inputs. The outputs of
the two separate networks are combined by the connected layers, which are able to
learn complementary features of the LSTM and CNN. To compare the performance
of the proposed method with various neural networks, we conducted a number of
experiments on the TUT acoustic scenes 2016 dataset [34]. The results revealed that
the combination of LSTM and CNN outperforms the conventional DNN, CNN, and
LSTM architectures with respect to classification accuracy.
2.2 Long Short-Term Memory
The key idea of the RNN is that the recurrent connections between the hidden layers
allow the network to retain a memory of previous inputs in its internal state, which
can affect the outputs. However, RNNs have two main issues in the training phase: the
vanishing gradient and exploding gradient problems [35]. When computing the derivatives
of the activation function in the back-propagation process, long-term components may
go exponentially fast to zero. This makes it hard for the model to learn the correlation
between temporally distant inputs. Conversely, when the gradient grows exponentially
during training, the exploding gradient problem occurs. To address these
difficulties, the LSTM architecture was proposed [36]. LSTM layers are composed of
recurrently connected memory blocks, each containing a memory cell and three
multiplicative gates. The gates perform continuous analogues of write, read, and reset
operations, which enable the network to use temporal information over a
period of time.
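As a rough illustration of the gating described above, the following NumPy sketch implements one step of a standard LSTM cell with input (write), forget (reset), and output (read) gates. The function name `lstm_step`, the stacked weight layout, the hidden size, and the random initialization are assumptions made for illustration only; the thesis does not list its exact cell equations here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell (hypothetical helper).

    W: (4*H, D) input weights, U: (4*H, H) recurrent weights, b: (4*H,) biases,
    stacked in the order [input gate, forget gate, output gate, candidate].
    """
    H = h_prev.shape[0]
    a = W @ x + U @ h_prev + b
    i = sigmoid(a[0:H])          # input gate: continuous "write"
    f = sigmoid(a[H:2 * H])      # forget gate: continuous "reset"
    o = sigmoid(a[2 * H:3 * H])  # output gate: continuous "read"
    g = np.tanh(a[3 * H:4 * H])  # candidate cell content
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
D, H, tau = 60, 8, 40            # input size and timesteps chosen for illustration
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(tau, D)):   # unroll over tau consecutive feature vectors
    h, c = lstm_step(x, h, c, W, U, b)
```

Because the cell state is updated additively (`f * c_prev + i * g`), information can persist across many steps when the forget gate stays close to one.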
Figure 2.1: Scheme of the proposed method.
2.3 Parallel Combination of LSTM and CNN
In this section, we describe our approach to improving the classification accuracy
of acoustic scenes. The schematic of the proposed training and test procedure is
illustrated in Figure 2.1, and the neural network structure is shown in Figure
2.2.
2.3.1 Feature Extraction
In the proposed system, different types of neural networks are combined in parallel.
Thus, each network accepts a different form of input feature. The LSTM layers
use a sequence of acoustic features, whereas the CNN layers use spectrogram images. As
inputs for the CNN layers, the SIF are extracted from the sound spectrogram [28],
[32], [37]. First, a spectrogram is generated by the short-time Fourier transform. Given
an audio frame $s(n)$ of length $N$ and a Hamming window $w(n)$, the short-time
spectral column $F(f, t)$ at time $t$ is computed as

$$F(f, t) = \left| \sum_{n=0}^{N-1} s(n)\, w(n)\, e^{-j 2 \pi n f / N} \right| \tag{2.1}$$

for $f = 0, \ldots, N/2$. In order to generate a spectrogram image with $K$-bin
frequency resolution, down-sampling is performed using a window of length
$W = N/(2K)$ as follows:

$$F_{\mathrm{down}}(f, t) = \sum_{i=0}^{W-1} F(f + i, t) / W, \tag{2.2}$$

for $f = 0, \ldots, K-1$. Finally, a simple de-noising method is applied by subtracting,
from each frequency bin, its minimum value over the frames as follows:

$$F_{\mathrm{dn}}(f, t) = F_{\mathrm{down}}(f, t) - \min_{t}\{F_{\mathrm{down}}(f, t)\} \tag{2.3}$$

for $f = 0, \ldots, K-1$. In the proposed system, the extracted SIF has size $K \times \tau$,
where $\tau$ represents the time resolution, which is identical to the number of timesteps in the
LSTM layers.
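The three steps of Eqs. (2.1) to (2.3) can be sketched in NumPy as follows. This is a minimal illustration, not the thesis implementation: the function name `spectrogram_image_feature` is hypothetical, the window length is rounded down with integer division when N is not a multiple of 2K, and the down-sampling is assumed to average non-overlapping blocks of W adjacent bins, which is one reading of Eq. (2.2).

```python
import numpy as np

def spectrogram_image_feature(frames, K):
    """frames: (tau, N) array of audio frames; returns a (K, tau) SIF."""
    tau, N = frames.shape
    W = N // (2 * K)                 # down-sampling window, W = N / (2K), rounded down
    window = np.hamming(N)
    # Eq. (2.1): magnitude of the windowed DFT for f = 0, ..., N/2
    F = np.abs(np.fft.rfft(frames * window, axis=1)).T        # (N/2 + 1, tau)
    # Eq. (2.2): average W adjacent bins (assumed non-overlapping blocks)
    F_down = F[:K * W].reshape(K, W, tau).mean(axis=1)         # (K, tau)
    # Eq. (2.3): subtract each frequency bin's minimum over time
    return F_down - F_down.min(axis=1, keepdims=True)

rng = np.random.default_rng(0)
toy_frames = rng.normal(size=(40, 1764))    # 40 frames of 1764 samples (40 ms at 44.1 kHz)
sif = spectrogram_image_feature(toy_frames, K=40)
```

After the de-noising step every frequency row has a minimum of zero, so the image is non-negative by construction.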
[Figure 2.2 diagram. LSTM branch: the feature sequence x_t^LSTM = [x_{t−τ+1}, ..., x_t] is unrolled over τ timesteps through the 1st and 2nd LSTM layers (hidden states h_{·,1} and h_{·,2}) to give z_t^LSTM. CNN branch: down-sampling and denoising give x_t^CNN, followed by the 1st convolutional, 1st pooling, 2nd convolutional, and 2nd pooling layers and flattening to give z_t^CNN. The concatenation z_t^concat passes through the 1st and 2nd fully connected layers and the softmax layer to give the class probability y_t.]
Figure 2.2: Neural network structure for the proposed technique.
2.3.2 LSTM Layers
The hidden layers of the LSTM have self-recurrent weights, which enable the cell
in the memory block to retain previous information. In the proposed system, $\tau$
vectors are used for sequential learning. The lower part of Figure 2.2 depicts how
the sequences are processed by the LSTM layers. The previous $\tau - 1$ vectors and the
present vector are forwarded to the recurrent layer sequentially. If the MFCC vectors
from $x_{t-\tau+1}$ to $x_t$ are used as the present input sequence, the vectors from
$x_{t-\tau+2}$ to $x_{t+1}$ are used as the next input sequence. The output vector
$z_t^{\mathrm{LSTM}}$ is extracted from the input MFCC sequence $x_t^{\mathrm{LSTM}}$
through the LSTM layers, where $x_t^{\mathrm{LSTM}} = [x_{t-\tau+1}, \ldots, x_t]$.
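The sliding-window construction of the input sequences can be illustrated with a few lines of NumPy. The helper name `make_sequences` and the toy feature matrix are assumptions for illustration; only the windowing rule (each sequence holds the previous τ − 1 vectors plus the present one, and consecutive sequences are shifted by one frame) comes from the text.

```python
import numpy as np

def make_sequences(features, tau):
    """features: (T, D) frame-level vectors; returns (T - tau + 1, tau, D).

    Sequence number k holds frames k, ..., k + tau - 1, i.e. x_{t-tau+1}, ..., x_t
    for t = k + tau - 1.
    """
    T = features.shape[0]
    return np.stack([features[t - tau + 1:t + 1] for t in range(tau - 1, T)])

mfcc = np.arange(6 * 2).reshape(6, 2)   # toy stand-in for 6 MFCC vectors of dimension 2
seqs = make_sequences(mfcc, tau=3)
```

Consecutive sequences overlap in all but one vector, which is why the last τ − 1 rows of one sequence equal the first τ − 1 rows of the next.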
2.3.3 CNN layers
As described in Section 2.3.1, the SIF $x_t^{\mathrm{CNN}}$, which is an $F \times \tau$ matrix
(where $F$ is the number of frequency bins, denoted $K$ in Section 2.3.1), is extracted. The
convolutional layer performs 2-dimensional convolution between the spectrogram image
and the pre-defined linear filters. To enable the network to extract complementary
features and learn the characteristics of the input SIF, a number of filters with different
functions are used. Thus, if we apply $K$ different filters to the spectrogram image,
$K$ different filtered images are generated in the convolutional layer. The filtered
spectrogram images are forwarded to the pooling layer, which performs down-sampling.
In particular, max pooling divides the input image into a set of non-overlapping
sub-regions and selects the maximum value in each. By reducing the spatial size of the
representation via pooling, the most dominant feature in each sub-region is extracted. The
pooling layer operates independently on every filtered image and resizes it
spatially. After the last pooling layer, the resized outputs are rearranged in order to fully
connect with the upper layer. The flattened output vector $z_t^{\mathrm{CNN}}$ is extracted from
$x_t^{\mathrm{CNN}}$ through the CNN layers.
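The max pooling operation described above can be sketched as follows. The helper `max_pool2d` and the 4×4 toy image are illustrative assumptions, not taken from the thesis; the sketch only shows how non-overlapping sub-regions are reduced to their maximum value.

```python
import numpy as np

def max_pool2d(image, size=2):
    """Non-overlapping max pooling over size x size sub-regions."""
    H, W = image.shape
    H2, W2 = H // size, W // size
    # Split rows and columns into blocks, then take the maximum inside each block
    blocks = image[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return blocks.max(axis=(1, 3))

filtered = np.array([[1, 2, 0, 1],
                     [3, 4, 1, 0],
                     [0, 1, 5, 6],
                     [2, 0, 7, 8]])
pooled = max_pool2d(filtered)       # [[4, 1], [2, 8]]
flattened = pooled.reshape(-1)      # rearranged for a fully connected layer
```

Each 2×2 block contributes a single value, so a 4×4 filtered image shrinks to 2×2 before flattening.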
2.3.4 Connected Layer of LSTM and CNN
In [38], the long-term recurrent convolutional network (LRCN) model was proposed
for visual recognition. LRCN is a consecutive structure of CNN and LSTM: it
processes the variable-length input with a CNN, whose outputs are fed into an LSTM
network, which finally predicts the class of the input. In [39], a cascade structure
was used for voice search. In contrast to these methods, the proposed
network forms a parallel structure in which the LSTM and CNN accept different inputs
separately. The concatenated vector $z_t^{\mathrm{concat}}$ is forwarded to the fully connected layers,
where $z_t^{\mathrm{concat}} = [z_t^{\mathrm{LSTM}}, z_t^{\mathrm{CNN}}]$. The connected layers can learn the
complementary information of the LSTM and CNN. This enables the proposed model to learn the
sequential information and the spectro-temporal information simultaneously. Finally,
the class probability $y_t$ is predicted through the softmax layer.
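A minimal forward-pass sketch of this fusion step is given below. The branch output sizes (256 and 784), the random weights, and the helper name `fuse_and_classify` are assumptions for illustration; the two 512-unit ReLU layers and the 15-way softmax follow the setup reported in Section 2.4.2.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))       # shift for numerical stability
    return e / e.sum()

def fuse_and_classify(z_lstm, z_cnn, layers):
    """Concatenate both branch outputs, apply FC ReLU layers, then softmax.

    layers: list of (weights, bias) pairs; the last pair feeds the softmax.
    """
    z = np.concatenate([z_lstm, z_cnn])        # z_concat = [z_lstm, z_cnn]
    for W, b in layers[:-1]:
        z = relu(W @ z + b)                    # fully connected ReLU layers
    W_out, b_out = layers[-1]
    return softmax(W_out @ z + b_out)          # class probabilities y_t

rng = np.random.default_rng(0)
dims = [256 + 784, 512, 512, 15]               # branch sizes 256 and 784 are assumed
layers = [(rng.normal(scale=0.05, size=(o, i)), np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]
y = fuse_and_classify(rng.normal(size=256), rng.normal(size=784), layers)
```

Because the two branches are only joined at the concatenation, each can receive the input representation that suits it (a feature sequence for the LSTM, an image for the CNN).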
2.4 Experiments
2.4.1 Dataset and Measurement
To assess the performance of the proposed method, we conducted a number of
experiments on the TUT acoustic scenes 2016 dataset, which consists of recordings
from various acoustic scenes. The dataset contains 1170 recordings totaling 9.75 hours
across 15 different classes. Audio signals sampled at 44.1 kHz were
divided into 40 ms frames with a 50% hop size. Experiments were conducted using 4-fold
cross validation. The final results were obtained by averaging over all evaluation
folds.
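The framing arithmetic implied by these figures can be checked with a short calculation. It assumes that each recording is a 30 s segment, as the segment-based (30 s) measure suggests, and that frames are taken without padding at the segment edges; both are assumptions rather than statements from the thesis.

```python
sample_rate = 44100                           # Hz
frame_len = sample_rate * 40 // 1000          # 40 ms frame -> 1764 samples
hop = frame_len // 2                          # 50% hop -> 882 samples
segment_samples = 30 * sample_rate            # assumed 30 s recording
n_frames = 1 + (segment_samples - frame_len) // hop   # frames per segment, no padding
total_hours = 1170 * 30 / 3600                # 1170 recordings of 30 s each

print(frame_len, hop, n_frames, total_hours)
```

Under these assumptions a segment yields 1499 frames, and 1170 segments of 30 s add up to the 9.75 hours quoted above.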
We evaluated the classification accuracy using two measures: frame-based accuracy
and segment-based (30 s) accuracy. Owing to the softmax output layer of our
networks, a probability distribution over the $J$ class labels was obtained for each
frame. Given $z_t^{\mathrm{concat}}$, the predicted class label at frame $t$ was computed by

$$C_{\mathrm{frame}} = \arg\max_{j} P(y_t = j \mid z_t^{\mathrm{concat}}) \tag{2.4}$$

where $j$ denotes the class index. To obtain the class label of the entire audio segment,
the log-likelihood was accumulated over the frames as follows:

$$C_{\mathrm{segment}} = \arg\max_{j} \sum_{t=1}^{T} \log P(y_t = j \mid z_t^{\mathrm{concat}}), \tag{2.5}$$

where $T$ represents the number of frames in one audio segment.
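The two decision rules in Eqs. (2.4) and (2.5) can be sketched as follows. The function names, the toy probabilities, and the small constant `eps` added before the logarithm are illustrative assumptions and are not part of the thesis.

```python
import numpy as np

def frame_decisions(probs):
    """Eq. (2.4): most probable class for every frame. probs: (T, J)."""
    return probs.argmax(axis=1)

def segment_decision(probs, eps=1e-12):
    """Eq. (2.5): class with the largest summed log-probability over T frames.

    eps guards against log(0) and is an added assumption.
    """
    return int(np.log(probs + eps).sum(axis=0).argmax())

# Toy softmax outputs for T = 3 frames and J = 2 classes
probs = np.array([[0.60, 0.40],
                  [0.60, 0.40],
                  [0.01, 0.99]])
per_frame = frame_decisions(probs)     # frame-level labels
per_segment = segment_decision(probs)  # segment-level label
```

In this toy case two of the three frames favor class 0, yet the summed log-probability favors class 1, because a single very confident frame dominates the sum. The segment rule is therefore not the same as a majority vote over frames.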
2.4.2 Neural Networks Setup
All networks in our experiments were trained using the mean squared error as the
loss function, supervised by one-hot encoded class vectors. The mini-batch size was
set to 256, and the mini-batches were randomly ordered in each epoch. After each
mini-batch was processed, the weights were updated using Adadelta [40]. In order to
mitigate the over-fitting problem in the training phase, we used the dropout technique,
which has already proved its regularization capability [41]. The output layer contained
15 softmax nodes, equal to the number of scenes.
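For concreteness, the loss described above can be written out for a single frame. The helper names and the example label are hypothetical; the sketch only shows the mean squared error between a 15-way softmax output and a one-hot target.

```python
import numpy as np

def one_hot(label, n_classes=15):
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v

def mse_loss(y_pred, y_true):
    """Mean squared error averaged over the output nodes."""
    return float(np.mean((y_pred - y_true) ** 2))

target = one_hot(3)                        # hypothetical scene index
uniform = np.full(15, 1.0 / 15)            # an uninformative softmax output
loss_uniform = mse_loss(uniform, target)   # equals 14/225
loss_perfect = mse_loss(target, target)    # equals 0
```

The loss shrinks toward zero as the softmax output concentrates on the correct class.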
As a baseline system, we built a DNN with three hidden layers of 512
hidden units each and used the ReLU activation in the hidden layers. The input
features were 60-dimensional MFCC features including both delta and acceleration
coefficients. The input layer was composed of a concatenation of 9 input frames
(the current frame, the four previous frames, and the four next frames), resulting in 540 input
units. To regularize the network, we used dropout with a probability of 40% for all
hidden layers.
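The context-window input of the DNN baseline can be sketched as below. The helper name `splice` and the edge handling (repeating the first and last frames) are assumptions, since the text does not say how segment boundaries are treated.

```python
import numpy as np

def splice(features, context=4):
    """Stack each frame with `context` previous and next frames.

    features: (T, D). Edge frames are repeated at the boundaries (an assumption).
    Returns (T, (2 * context + 1) * D).
    """
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

mfcc = np.random.default_rng(0).normal(size=(100, 60))   # toy 60-dimensional MFCC frames
dnn_input = splice(mfcc)                                  # 9 frames x 60 = 540 inputs
```

The current frame sits in the middle of each spliced vector, between the four preceding and four following frames.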
The CNN architecture for the baseline system comprised two convolutional layers,
two pooling layers, and one fully connected layer with a softmax layer on top.
The input features were SIF of size F × τ, where F = 40 and τ = 40. In the first
convolutional layer, the input SIF is convolved with 32 filters of fixed size 5×5. The
first pooling layer then reduces the size of the filtered SIF. We used max-pooling with
kernel size 2×2 for all pooling layers. ReLU was applied as the activation function.
The second convolutional layer performs convolution between the output of the first
pooling layer and 16 filters of fixed size 5×5. After the second pooling is performed,
the flattened output is fed to a fully connected layer with 512 units. Dropout
was only used after the second pooling layer and the fully connected layer, with
probabilities of 30% and 40%, respectively.
The LSTM baseline had two hidden layers with 256 LSTM units each and one
feed-forward layer with 512 ReLU units. The structure of the two LSTM layers is identical
to the lower part of Figure 2.2. The input sequence consisted of 40 frames of
60-dimensional MFCC features. Dropout was applied with a probability of 40% for all
layers. The output layer was identical to that described above.
As the proposed system, we built a combined structure of LSTM and CNN in
parallel. The network setup and structure of the LSTM part and the CNN part were
identical to those of the aforementioned networks. To combine and further train the two
separate networks, we used fully connected layers, which consisted of
two hidden layers with 512 ReLU units each.
Table 2.1: Frame-based classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset.

Scene              DNN     CNN     LSTM    CNN-LSTM
beach              76.56   65.29   79.86   81.26
bus                44.69   62.61   56.21   60.99
cafe/restaurant    47.79   61.89   57.72   57.12
car                75.49   71.11   85.51   80.57
city center        80.41   79.13   89.26   91.25
forest path        87.24   72.15   91.69   92.22
grocery store      77.19   57.39   83.07   84.71
home               66.28   72.71   52.70   55.39
library            64.07   71.27   69.29   72.55
metro station      85.71   85.76   82.52   82.47
office             83.40   78.93   82.97   89.09
park               38.24   36.11   48.89   43.88
residential area   61.87   51.71   52.54   57.74
train              22.46   38.87   24.42   38.21
tram               73.57   56.82   72.99   76.46
Overall acc        65.66   64.12   68.64   70.92
2.4.3 Results and Discussion
We compared the average accuracies over all scenes for the conventional DNN,
CNN, LSTM, and the proposed network. The frame-based classification results are
given in Table 2.1. Table 2.2 shows the segment-based classification accuracy, where
‘correct’ denotes the number of correctly classified segments out of the total of
1170 segments. The proposed method achieved higher accuracy than the other networks
in both frame-based and segment-based classification.
Though the combined neural network achieved higher performance on average,
it did not give the best classification results across all scenes. In the bus case, the CNN
outperformed the other networks. In the park case, the LSTM gave the best result. In the
residential area case, the DNN achieved the highest performance. This can be interpreted
as indicating that the proposed network does not fully learn some acoustic scenes and that
these scenes may not contain enough temporal information. Future research will deal with a
more robust network architecture to extract distinct features of acoustic scenes.

Table 2.2: Segment-based (30 s) classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset. The asterisk (*) CNN-LSTM column represents the accuracy on the Evaluation Dataset.

Scene            Base.   DNN     CNN     LSTM    CNN-LSTM   *CNN-LSTM
beach            69.3    84.62   73.08   88.46   88.46      84.6
bus              79.6    51.28   88.46   67.95   65.38      100
cafe/rest.       83.2    58.97   73.08   67.95   60.26      61.5
car              87.2    78.21   73.08   88.46   89.74      88.5
city center      85.5    92.31   91.03   93.59   97.44      92.3
forest path      81.0    93.59   82.05   98.72   97.44      100
grocery store    65.0    83.33   71.79   85.90   91.03      96.2
home             82.1    80.77   89.74   64.10   70.51      88.5
library          50.4    75.64   83.33   76.92   76.92      46.2
metro station    94.7    94.87   100.0   92.31   94.87      88.5
office           98.6    93.59   96.15   87.18   96.15      100
park             13.9    41.03   43.59   57.69   52.56      96.2
resident. area   77.7    87.18   75.64   73.08   74.36      65.4
train            34.9    25.64   46.15   29.49   43.59      53.8
tram             85.4    88.46   82.05   88.46   88.46      100
correct          -       881     912     905     926        -
Overall acc      72.6    75.30   77.95   77.35   79.15      84.1
The proposed method was found to improve classification performance and achieved
an average accuracy of 79.15%. The baseline accuracy of the audio scene classification
task in the DCASE 2016 challenge [34], which was based on MFCCs and GMMs, was
72.6%. Our method thus improved the performance by 6.55 percentage points in absolute
terms (about 9.0% relative). Finally, the accuracy on the evaluation dataset was 84.1%.
2.5 Summary
In this chapter, in order to enhance the classification accuracy of acoustic scenes,
we proposed a novel neural network structure which achieved higher performance
than the conventional DNN, CNN and LSTM architectures in terms of
both frame-based and segment-based accuracy. In the segment-based classification
results, the proposed technique obtained absolute improvements of 3.85, 1.20 and 1.80
percentage points over the DNN, CNN and LSTM architectures, respectively. By combining
different networks in parallel, the proposed method was able to learn the complementary
information of the LSTM and CNN.
Chapter 3
Overlapping Acoustic Event Classification Based on Joint Training with Source Separation
3.1 Introduction
Over the past decade, many studies have addressed the problem of detecting
overlapping events in audio. In [42], the author proposed context-dependent
hidden Markov models (HMMs) with multiple path decoding. A non-negative matrix
factorization (NMF) approach has also been used to separate overlapping
events via dictionary learning [43]. Other approaches include
connectionist temporal classification (CTC) [44], linear dynamical systems for
overlapping sound event tracking [45] and feature representations for AED [46]. More
recently, various neural network models have been quite successful in the AED area.
In [47], multi-label deep neural networks (DNNs) were proposed for detecting
temporally overlapping sound events, and the author of [31] used bi-directional long
short-term memory (BLSTM).
With regard to AED, although neural networks are able to learn the non-linear
relationship between the input and output, they cannot fully exploit the information
of each individual source in the mixture. The additive property of sound sources makes it
difficult to find robust features for recognizing them in overlapping audio. Thus, we
propose a neural network for overlapping AEC which is optimized by joint training
of a source separation model and a multi-label classification model. The source
separation model is trained to reconstruct the target sources from an unknown
overlapping event, which helps the model to decompose the mixture. The classification
model learns the properties of overlapping events from the reference sources. After
that, the two models are combined and jointly trained, so that the model can be
optimized to minimize the interference of overlapping events and estimate the labels of
mixed events directly.
The remainder of this chapter is organized as follows: Section 3.2 presents the
problem formulation of source separation for overlapping AEC. The proposed ap-
proach of using joint training for AEC is described in Section 3.3. Section 3.4 presents
the experimental results, and Section 3.5 provides conclusions and future work.
3.2 Source Separation of Overlapping Acoustic Event
The main objective of source separation is to estimate one or more sources from
a given mixed source signal. This can serve as an intermediate step for other tasks.
Since overlapping acoustic events are also mixtures of multiple signals, a source
separation framework can be applied to AEC. In [48], unsupervised source separation
was used as a pre-processor for overlapping AED. Unlike this approach, the pro-
posed system is trained as a single model including source separation and event
classification.
In this section, we focus on source separation of overlapping acoustic events.
Given target sources $s_1(t)$ and $s_2(t)$, we define $S_1(t,f)$, $S_2(t,f)$ and $Y(t,f)$ as the
short-time Fourier transform (STFT) coefficients of $s_1(t)$, $s_2(t)$ and the mixed signal $y(t)$,
respectively, where $t$ represents the frame index and $f$ is the frequency-bin index. Due to
the linearity of the STFT, the source separation problem can be defined as follows:

$$\begin{aligned}
y(t) &= s_1(t) + s_2(t), \\
Y(t,f) &= S_1(t,f) + S_2(t,f).
\end{aligned} \tag{3.1}$$
In the source separation framework, the magnitude spectrogram of the mixture
signal can be approximated as the sum of the magnitude spectra of each source as
follows:

$$|Y(t,f)| \approx |S_1(t,f)| + |S_2(t,f)|. \tag{3.2}$$

For a specific time frame $t$, the magnitude spectrogram can be written in vector
form as follows:

$$\mathbf{y}_t \approx \mathbf{s}_{1t} + \mathbf{s}_{2t}, \tag{3.3}$$

where $\mathbf{y}_t \in \mathbb{R}^F$, $\mathbf{s}_{1t} \in \mathbb{R}^F$ and $\mathbf{s}_{2t} \in \mathbb{R}^F$ denote the magnitude spectra of the
mixture and the two target acoustic events at time frame $t$, respectively, and $F$ is the
spectral magnitude dimension. Hence, the goal of event separation is to find $\mathbf{s}_1$ and
$\mathbf{s}_2$ using the mixture training data and reference event data.
Figure 3.1: Scheme of the proposed method.
3.3 Proposed Method Using Joint Training
In this section, we describe the proposed neural network training scheme for
improving the AEC performance. The proposed training and test procedure is
illustrated in Figure 3.1, and the neural network structure is shown
in Figure 3.2.
3.3.1 Source Separation Model
Figure 3.2: Joint training structure for the proposed technique. (Diagram labels: overlapping acoustic event, source estimate, multi-label estimate; training stages: source separation model training, event classification model training, joint training.)

Various DNN-based approaches have been proposed to address the monaural
source separation problem [49]–[51]. In order to obtain estimates of the individual events
from overlapping acoustic events, we exploit the DNN framework for source separation.
Given the input features $\mathbf{y}_t$ from the mixture, we obtain the output
estimates $\hat{\mathbf{y}}_{1t}$ and $\hat{\mathbf{y}}_{2t}$ from the network. In the training process, the discriminative
objective function defined in [49] is used in order to regularize the reconstruction error:

$$L(t) = \|\hat{\mathbf{y}}_{1t} - \mathbf{s}_{1t}\|^2 + \|\hat{\mathbf{y}}_{2t} - \mathbf{s}_{2t}\|^2 - \gamma\|\hat{\mathbf{y}}_{1t} - \mathbf{s}_{2t}\|^2 - \gamma\|\hat{\mathbf{y}}_{2t} - \mathbf{s}_{1t}\|^2, \tag{3.4}$$

where $\|\cdot\|$ indicates the $\ell_2$-norm and $\gamma$ denotes the regularization parameter which
adjusts the trade-off between the reconstruction error and the discrimination
information. In order to estimate each source, the soft time-frequency mask $\mathbf{m}_t \in \mathbb{R}^F$ is
calculated as follows:

$$\mathbf{m}_t = \frac{|\hat{\mathbf{y}}_{1t}|}{|\hat{\mathbf{y}}_{1t}| + |\hat{\mathbf{y}}_{2t}|}. \tag{3.5}$$

Then Wiener filtering can be used to reconstruct the magnitude spectra of each
acoustic event source as follows:

$$\begin{aligned}
\tilde{\mathbf{s}}_{1t} &= \mathbf{m}_t \otimes \mathbf{y}_t, \\
\tilde{\mathbf{s}}_{2t} &= (1 - \mathbf{m}_t) \otimes \mathbf{y}_t,
\end{aligned} \tag{3.6}$$

where the division in Equation (3.5) is performed element-wise and $\otimes$ indicates
element-wise multiplication. The source separation model is trained with the mixture
source $\mathbf{y}_t$ as the input and the reference sources $[\mathbf{s}_{1t}, \mathbf{s}_{2t}]$ as the target. This process
is indicated in Figure 3.2 by the solid blue box.
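Equations (3.4) to (3.6) can be sketched for a single frame as follows. The value of `gamma`, the constant `eps` added to the denominator, and the toy spectra are assumptions for illustration only; the text does not report the value of γ.

```python
import numpy as np

def discriminative_loss(y1, y2, s1, s2, gamma=0.05):
    """Eq. (3.4) for one frame; gamma = 0.05 is an illustrative value."""
    sq = lambda a, b: float(np.sum((a - b) ** 2))
    return sq(y1, s1) + sq(y2, s2) - gamma * (sq(y1, s2) + sq(y2, s1))

def soft_mask_reconstruct(y1, y2, mix, eps=1e-8):
    """Eqs. (3.5)-(3.6): soft mask from the network outputs, applied to the mixture."""
    # eps is an assumed guard against division by zero.
    m = np.abs(y1) / (np.abs(y1) + np.abs(y2) + eps)
    return m * mix, (1.0 - m) * mix

# Toy magnitude spectra with F = 2 bins (made-up numbers).
s1 = np.array([1.0, 3.0])
s2 = np.array([3.0, 1.0])
mix = s1 + s2
loss = discriminative_loss(s1, s2, s1, s2)          # perfect estimates
rec1, rec2 = soft_mask_reconstruct(s1, s2, mix)
print(loss, rec1, rec2)
```

Because the two masks sum to one, the reconstructed sources always add up to the mixture spectrum. Note also that with perfect estimates the loss in Equation (3.4) is negative rather than zero, since the discriminative terms are subtracted.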
3.3.2 Multi-Label Classification Model
Multi-label neural networks are utilized for detection of temporally overlapping
acoustic events [47]. In the training stage of multi-label classification, the network
learns the mapping between the reference sources $[\mathbf{s}_{1t}, \mathbf{s}_{2t}]$ as an input and the
corresponding target output $\mathbf{a}_t$, where $\mathbf{a}_t \in \mathbb{R}^I$ indicates the true multi-label vector of
overlapping acoustic events and $I$ is the number of acoustic events. This process is shown
in Figure 3.2 by the red dashed box.
3.3.3 Joint Training Method
Jointly trained models have achieved improvements in various learning tasks,
especially in the speech recognition area. Motivated by the good performance of the
joint training scheme shown in [52]–[54], we use this technique in order to improve
AEC performance. AEC is well suited to joint training because
source separation and event classification are trained with different objectives.

After the two networks are trained, they are combined to form a single network and
further trained jointly. In the training phase, the network is trained with the mixture
source $\mathbf{y}_t$ as input and the true label $\mathbf{a}_t$ as output. As shown in Figure 3.2, the weights
of the unified network are adjusted using back-propagation. As a result, the network
is trained to use the information of the separated sources implicitly. This helps the
network to estimate acoustic events from the mixture source.
3.4 Experiments
3.4.1 Dataset and Data Augmentation
In order to evaluate the performance of the proposed method, we conducted a
set of acoustic event source separation experiments using the IEEE DCASE 2016
Challenge Task 2 training dataset [55]. The training dataset consists of 20 isolated
sound events per event class. We selected four acoustic events (coughing, keyboard
typing, page turning and speech) and used them to construct a mixed-source dataset.
Six different types of dataset were generated in the source mixing process
($\binom{4}{2} = 6$). Unlike most speech datasets, which usually consist of hours of data or more,
conventional sound event datasets are not long enough to train a robust
DNN model. To tackle this lack of data, a data augmentation approach was used:
to construct diverse source mixtures from a small dataset, additional acoustic events
were artificially generated by time stretching. Finally, various mixture combinations of
two acoustic events were produced at an SNR of 0 dB.
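Mixing two events at a target SNR can be sketched as follows. The function name and the convention that the first event is treated as the reference (signal) and the second is rescaled are assumptions; at 0 dB the two events simply end up with equal average power.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db=0.0):
    """Rescale s2 so that power(s1) / power(scaled s2) matches snr_db, then add."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    gain = np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    s2_scaled = gain * s2
    return s1 + s2_scaled, s2_scaled

rng = np.random.default_rng(0)
event_a = rng.standard_normal(16000)        # stand-ins for two isolated events
event_b = 0.1 * rng.standard_normal(16000)
mixture, event_b_scaled = mix_at_snr(event_a, event_b, snr_db=0.0)
snr = 10.0 * np.log10(np.mean(event_a ** 2) / np.mean(event_b_scaled ** 2))
print(round(snr, 6))
```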
3.4.2 Experimental Setup
The data were sampled at 16 kHz, and the magnitude spectrograms were
calculated using the STFT. A Hamming window of 512 points with 75% overlap
was applied, and the FFT was taken at 512 points. Only the first 257 FFT bins
were used, since the remaining 255 bins are complex conjugates of bins in
the first half.
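The analysis settings above correspond to a hop size of 128 samples and 257 non-redundant frequency bins per frame. A minimal NumPy sketch, assuming no padding at the signal boundaries and a helper name chosen here for illustration:

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, overlap=0.75):
    """Hamming-windowed STFT magnitude; returns shape (n_frames, n_fft // 2 + 1)."""
    hop = int(n_fft * (1.0 - overlap))            # 512 * 0.25 = 128 samples
    window = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop        # no boundary padding (assumption)
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps only bins 0..n_fft/2, i.e. 257 bins for n_fft = 512.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                    # one second at 16 kHz
spec = magnitude_spectrogram(x)
print(spec.shape)
```

For a real-valued frame, bin k of the full 512-point FFT with k between 257 and 511 is the complex conjugate of bin 512 − k, so the discarded bins carry no additional magnitude information.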
As a regression model for source separation training, we built a DNN with two
hidden layers with 1000 rectified linear units (ReLUs). The input features were
257×7-dimensional (the current frame and the three previous and three next frames of
the mixture source), and the output was 257×2-dimensional (regression of the two target
sources) with sigmoid units. Equation (3.4) was used as the loss function, and the number of
randomly ordered mini-batches in each epoch was set to 100. After processing
each mini-batch, the weights were updated using Adam [56]. In order to mitigate the
over-fitting problem in the training phase, dropout was applied with a probability
of 30% for all hidden layers.
In order to predict the labels of overlapping acoustic events, we also trained a
DNN consisting of two hidden layers with 1000 ReLU nodes. The input features were
257×2-dimensional (the two separated sources), and the output was a 4-dimensional
softmax layer (one node per acoustic event label). Mean squared error was used as the
loss function, and the rest of the setup was the same as for the source separation model.

After training the source separation model and the multi-label classification
model, the two networks were cascaded to form a single larger network, and the weights
of the unified network were adjusted using back-propagation.

Figure 3.3: Comparison between separated and integrated source separation models. (Separated training: one network per mixture pair, e.g., the C-K network estimates C and K from the C-K mixture, the C-P network estimates C and P, and the P-S network estimates P and S. Integrated training: a single network estimates the sources from all six mixtures C-K, C-P, C-S, K-P, K-S and P-S.)
3.4.3 Evaluation of Source Separation
In many two-source separation tasks, a single network is trained to estimate one
source pair. In the proposed source separation network, however, a single network
estimates six source pairs ($\binom{4}{2} = 6$). This means that if the source separation network
does not estimate the target sources well, the jointly trained network may perform
similarly to the baseline network, which has an identical structure including
model size and hyper-parameters but does not apply the joint training scheme.
In order to verify this point, we compared the source separation performance of the
proposed method with that of networks trained on a mixture dataset in which
each recording consists of only two target sources, as shown in Figure 3.3. The letters
denote the acoustic event names: C: coughing, K: keyboard typing, P: page turning
and S: speech. ‘C-K’ means that the mixture source includes a coughing sound
and a keyboard typing sound. ‘Separated training’ indicates that a single network
was trained using only one mixture dataset; thus, six networks were produced in total.
‘Integrated training’ means that a single network was trained using all six
pair datasets.

The performance of source separation was evaluated in terms of the signal-to-distortion
ratio (SDR) and signal-to-interference ratio (SIR) [57]. Figure 3.4 and
Figure 3.5 show the source separation performance. As shown in the figures, although
the performance is degraded compared with separated training, the proposed source
separation network is sufficient to provide the information of each source to the
multi-label classification network. Finally, an example of source separation in the
time domain can be seen in Figure 3.6.
Figure 3.4: The source separation performance (SDR [dB]). Average values shown in the plot: 7.08 and 5.81.

Figure 3.5: The source separation performance (SIR [dB]). Average values shown in the plot: 12.64 and 9.88.

Figure 3.6: Results of source separation in the time domain. (Panels: Overlapping, Source1, Source2, Reconstruction1, Reconstruction2.)
3.4.4 Acoustic Event Classification Results
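To evaluate the performance of the proposed method, we counted the numbers of
correct, missed and false alarm events. The precision, recall and F-score are
calculated as follows:

$$\text{precision} = \frac{\text{correct}}{\text{correct} + \text{false alarm}}, \tag{3.7}$$

$$\text{recall} = \frac{\text{correct}}{\text{correct} + \text{missed}}, \tag{3.8}$$

$$\text{F-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. \tag{3.9}$$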
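Equations (3.7) to (3.9) translate directly into code. The function name and the event counts below are made up for illustration and are not taken from the experiments; the sketch also assumes that at least one event is correct, so no denominator is zero.

```python
def precision_recall_fscore(correct, missed, false_alarm):
    """Eqs. (3.7)-(3.9) computed from event counts."""
    precision = correct / (correct + false_alarm)
    recall = correct / (correct + missed)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts: 90 correct, 10 missed and 30 false alarm events.
p, r, f = precision_recall_fscore(correct=90, missed=10, false_alarm=30)
print(p, r, f)
```

With these counts the precision is 0.75 and the recall is 0.9, and the F-score (about 0.818) lies between them but closer to the smaller value, as expected of a harmonic mean.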
Tables 3.1, 3.2 and 3.3 show the overlapping AEC performance. ‘2L-DNN’ and
‘5L-DNN’ denote DNN structures with two and five hidden layers, respectively. In
addition, we compared the performance with a model using multi-task learning: ‘MTL’
denotes a DNN structure that adopts multi-task learning, as shown in Figure 3.7. These
baseline networks did not apply joint training with source separation. The
proposed method was found to improve the classification performance and achieve an
average F-score of 0.9503. In terms of F-score, joint training with source separation
achieved the highest performance for every event class except page turning, where MTL
was slightly higher (0.9420 vs. 0.9389).

Table 3.1: Precision performance of overlapping acoustic event classification.

Event class   2L-DNN   5L-DNN   MTL      Proposed
C             0.8677   0.9220   0.9645   0.9785
K             0.9085   0.8987   0.9116   0.8898
P             0.9477   0.9341   0.9723   0.9852
S             0.9526   0.9377   0.9552   0.9685
average       0.9191   0.9231   0.9509   0.9555

Table 3.2: Recall performance of overlapping acoustic event classification.

Event class   2L-DNN   5L-DNN   MTL      Proposed
C             0.9294   0.9067   0.9246   0.9333
K             0.9308   0.9467   0.9540   0.9841
P             0.8767   0.8821   0.9135   0.8967
S             0.8667   0.9067   0.9378   0.9765
average       0.9009   0.9106   0.9325   0.9477

Table 3.3: F-score of overlapping acoustic event classification.

Event class   2L-DNN   5L-DNN   MTL      Proposed
C             0.8975   0.9143   0.9441   0.9554
K             0.9195   0.9221   0.9323   0.9346
P             0.9108   0.9074   0.9420   0.9389
S             0.9076   0.9219   0.9464   0.9725
average       0.9089   0.9164   0.9412   0.9503

Figure 3.7: Multi-task learning for overlapping acoustic event classification. (Input: $\mathbf{y}_{t-\tau:t+\tau}$; main task: multi-label estimate $\mathbf{a}_t$; sub task: source estimates $\mathbf{y}_{1t}$ and $\mathbf{y}_{2t}$.)
3.5 Summary
In this chapter, we have proposed a neural network for overlapping AEC based on
joint training of a source separation model and a multi-label classification model.
By adopting the source separation framework in the overlapping AEC task, the
jointly trained network can minimize the interference of overlapping events. The
experimental results show that the proposed technique outperforms
the baseline networks that do not apply joint training with source separation.
Chapter 4
Disentangled Feature Learning for Noise-Invariant Speech Enhancement
4.1 Introduction
DNNs have been successful in solving speech enhancement tasks under various
noise environments since their introduction. Early studies using DNNs as a
nonlinear mapping function for estimating clean speech reported better enhancement
results [58]–[60] than the NMF-based algorithms. Various neural network
structures have been employed for speech enhancement, such as multi-context
stacking networks for ensemble learning [61], recurrent neural networks (RNNs) [49],
[62], and convolutional neural networks (CNNs) [63], [64].
More recently, the generative adversarial network (GAN) [65] has become popular
in the area of deep learning, and it has also been applied to speech enhancement.
Pascual et al. proposed end-to-end speech enhancement GAN (SEGAN) in which the
generator learns to model the mapping from the noisy speech samples to their clean
counterparts, while the discriminator learns to distinguish between the enhanced and
the target clean samples within the context of a mini-max game [66]. The underlying
idea of GAN has been adopted in many GAN-based speech enhancement algorithms
including the time-frequency mask estimation using the minimum mean square error
GAN (MMSE-GAN) [67] and the conditional GAN (cGAN) [68].
Though deep learning-based speech enhancement models have achieved consider-
able improvements, the performance is usually degraded in the case of mismatched
conditions caused by different types of noises or SNR levels between the training
and test set samples. Moreover, the performance varies depending on the types of
noises. In order to address such issues, disentangled feature learning can be consid-
ered as a possible solution. Most of the previous studies, which have focused mainly
on the mapping between the noisy and the clean speech, rarely consider how input
features are learned in the hidden layers. The model based on disentangled feature
learning, on the other hand, manipulates the latent features to distinguish between
the speech and noise in the intermediate layers, hence resulting in better enhance-
ment performance even in the mismatched noise conditions. Moreover, the quality
of noise-invariant attribute can also be improved.
In this chapter, we propose a novel deep learning-based noise-invariant speech
enhancement algorithm which employs an adversarial training framework designed
to disentangle the latent features of speech and noise, under the concept of domain
adversarial training (DAT) [69]. Although DAT was originally introduced for the
domain adaptation task, the proposed algorithm exploits the DAT framework for
use in the regression task, i.e., speech enhancement. Experimental results show that
the proposed method successfully disentangles the speech and noise latent features.
Moreover, the results also reveal that our model outperforms the conventional
DNN-based algorithms. The main contributions of this chapter are summarized as follows:
• We modify the DAT framework in order to solve the speech enhancement
task in a supervised manner. The proposed model achieves better performance
in speech enhancement as compared to the baseline models under both the
matched and mismatched noise conditions.
• By reducing the performance gap among different noise types, we show that
our method is more robust to noise variability.
• By visualizing feature representations, we demonstrate that our model suc-
cessfully disentangles speech and noise latent features.
4.2 Masking-Based Speech Enhancement
When training neural networks in a supervised manner with a regression approach,
as shown in Figure 4.1, it is essential to define a proper training target in order to
ensure effective learning. The training targets for speech enhancement can be mainly
categorized into two groups: (i) mapping-based, and (ii) masking-based approaches.
The mapping-based methods learn a regression function relating a noisy speech to
the corresponding clean speech directly while the masking-based methods estimate
time-frequency masks given a noisy speech. A variety of training targets have been
studied. Wang et al. evaluated and compared the performance of various mapping-based
and masking-based targets [70]. It may be controversial to argue which method
is better, yet many cases have shown that the masking-based methods (e.g. ideal
ratio masks) tend to perform better than the mapping-based methods [61], [70], [71]
in terms of enhancement results. In this work, we design the proposed model within
a masking-based framework. We use the time-frequency masking functions as an
extra layer in the neural network [49]. This way, the model implicitly incorporates
the masking functions when optimizing the network, as detailed in Section
4.4.1.

Figure 4.1: Scheme of DNN-based speech enhancement method.
4.3 Concept of Domain Adversarial Training
Domain adaptation [72] addresses the problem of mismatch between the training
and test datasets by transferring the knowledge learned from the source domain to
a robust model in the target domain. DAT is one of the approaches that attempts
to match the data distributions across different domains. In [69], DAT exploits an
adversarial training method in order to learn intermediate features which are invari-
ant to the shifts in data from different domains. Here, the neural network learns
two different classifiers: (i) a classifier for the main classification task, and (ii) the
domain classifier. The training objective of the domain classifier, in particular, is
to learn whether the input sample is from the source or target domain, given fea-
tures extracted using labeled data from the source domain and unlabeled data from
the target domain. The feature extractor is shared by both the main task and the
domain classifiers. In implementation, a gradient reversal layer (GRL) is employed
to act as an identity transform in forward propagation and to reverse the
gradient by multiplying it by a negative scalar during back-propagation [69].
Consequently, the GRL encourages the latent features to act discriminatively when solving
the main classification task, yet act indiscriminately towards the shifts across dif-
ferent domains. In other words, the feature extractor is trained so that the model
maps data from different domains to the latent features with similar distributions
via adversarial learning.
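The behavior of the GRL described above can be sketched with explicit forward and backward methods. This is a framework-free illustration under assumed names (`GradientReversal`, `lam`); in practice the same behavior would be expressed through the automatic differentiation mechanism of whichever deep learning library is used.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam   # positive scalar controlling the amount of reversal

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0])
upstream_grad = np.array([2.0, 4.0])     # gradient arriving from the domain classifier
print(grl.forward(features))             # unchanged features
print(grl.backward(upstream_grad))       # sign-flipped, scaled gradient
```

Because the gradient reaching the feature extractor has its sign flipped, a step that lowers the domain classifier's loss with respect to its own weights raises that loss with respect to the shared features, which is what drives the features toward domain invariance.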
Many speech processing frameworks have adopted the idea of DAT in order to ex-
tract domain-invariant features. Under the noise robust speech recognition scheme,
the clean speech was regarded as the source domain data and was used to train the
senone label classifier, while the noisy speech played the role of the target domain
data to be adjusted by the feature extractor [73], [74]. DAT was also used to learn
speaker-invariant senone labels, as shown in [75] where the adversarial training suc-
cessfully aligned the feature representation of different speakers. In [76], the authors
demonstrated that accent-invariant features could be learned for the ASR system.
For speaker recognition tasks, DAT was adopted to tackle the channel mismatch
problem. In particular, the latent features were extracted in order to learn
channel-invariant, yet speaker-discriminative representations [77]. In [78], the authors showed
that DAT was able to adapt to multiple forms of mismatch (e.g. speaker, acoustic
conditions, and emotional content) when solving the acoustic emotion recognition
task. As for the speech enhancement problems, a noise adaptive method exploiting
DAT was proposed in [79]. In their work, however, DAT was only used to classify
stationary and non-stationary noises, and the authors did not make use of various
noise components for domain-invariant regression.

Figure 4.2: The architecture of the proposed model for disentangled feature learning. (Blocks: the encoder E maps the noisy speech x to the speech latent feature z_s and the noise latent feature z_n; the speech decoder D_s and the noise decoder D_n output the predicted speech power m_s and the predicted noise power m_n, from which the predicted speech and noise are obtained by a deterministic operation; the noise disentangler DE_n and the speech disentangler DE_s, each preceded by a gradient reversal layer, output a predicted noise and a predicted speech. The legend distinguishes learnable weights from deterministic operations.)
4.4 Disentangling Speech and Noise Factors
4.4.1 Neural Network Architecture
Our neural network consists of five sub-networks: (i) an encoder (E), (ii) a speech
decoder (Ds), (iii) a noise decoder (Dn), (iv) a noise disentangler (DEn), and (v)
a speech disentangler (DEs). The overall architecture of the proposed model is il-
lustrated in Figure 4.2. We extract the magnitude spectra as the raw features of
all signal components. Only the magnitude spectra are estimated while the phase
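parts of the noisy speech are kept intact. Let us denote the magnitude spectra of the
noisy speech, clean speech, and noise as $\mathbf{x} \in \mathbb{R}^{F \times (2\tau+1)}$, $\mathbf{s} \in \mathbb{R}^{F \times 1}$, and $\mathbf{n} \in \mathbb{R}^{F \times 1}$,
respectively, where $F$ denotes the number of frequency bins and $\tau$ represents an
input context expansion parameter (i.e., one current frame, $\tau$ previous and $\tau$ next
frames). The encoder $E$ learns a function that maps $\mathbf{x}$ into speech and noise latent
features, defined by the neural network parameter $\theta_E$ as follows:

$$(\mathbf{z}_s, \mathbf{z}_n) = E(\mathbf{x}; \theta_E), \tag{4.1}$$

where $\mathbf{z}_s \in \mathbb{R}^{M \times 1}$ and $\mathbf{z}_n \in \mathbb{R}^{M \times 1}$ indicate $M$-dimensional speech and noise latent
features, respectively. Similarly, $D_s$ and $D_n$ learn mappings parameterized by $\theta_{D_s}$
and $\theta_{D_n}$, respectively, as follows:

$$\begin{aligned}
\mathbf{m}_s &= D_s(\mathbf{z}_s; \theta_{D_s}), \\
\mathbf{m}_n &= D_n(\mathbf{z}_n; \theta_{D_n}),
\end{aligned} \tag{4.2}$$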
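where $\mathbf{m}_s \in \mathbb{R}^{F \times 1}$ and $\mathbf{m}_n \in \mathbb{R}^{F \times 1}$ denote the estimated speech and noise masks,
respectively. The time-frequency masks are constrained such that the predicted speech
and noise sum to the input noisy speech. Given the masks from
both decoders, we can obtain the predicted speech and noise through a
deterministic layer [49]. Given $\mathbf{m}_s$ and $\mathbf{m}_n$, the predicted magnitude spectra of speech
$\hat{\mathbf{s}} \in \mathbb{R}^{F \times 1}$ and noise $\hat{\mathbf{n}} \in \mathbb{R}^{F \times 1}$ can be calculated as

$$\begin{aligned}
\hat{\mathbf{s}} &= \frac{\mathbf{m}_s}{\mathbf{m}_s + \mathbf{m}_n} \otimes \mathbf{x}, \\
\hat{\mathbf{n}} &= \frac{\mathbf{m}_n}{\mathbf{m}_s + \mathbf{m}_n} \otimes \mathbf{x},
\end{aligned} \tag{4.3}$$

where the addition, division, and product ($\otimes$) operators are applied element-wise.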
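Finally, $DE_n$ and $DE_s$ are trained to separate the noise attributes from the
speech latent features, and vice versa. $DE_n$ and $DE_s$ are respectively parameterized
by $\theta_{DE_n}$ and $\theta_{DE_s}$ as follows:

$$\begin{aligned}
\tilde{\mathbf{n}} &= DE_n(\mathbf{z}_s; \theta_{DE_n}), \\
\tilde{\mathbf{s}} &= DE_s(\mathbf{z}_n; \theta_{DE_s}),
\end{aligned} \tag{4.4}$$

where $\tilde{\mathbf{s}} \in \mathbb{R}^{F \times 1}$ and $\tilde{\mathbf{n}} \in \mathbb{R}^{F \times 1}$ represent the speech and noise components,
respectively, estimated from the latent features. Note that $\tilde{\mathbf{s}}$ and $\tilde{\mathbf{n}}$ differ from the
decoder outputs $\hat{\mathbf{s}}$ and $\hat{\mathbf{n}}$ in Equation (4.3). $\tilde{\mathbf{s}}$ and $\tilde{\mathbf{n}}$ are generated by the
disentanglers, and the encoder is trained to make it difficult for them to predict the
speech and noise. The GRLs are inserted
between the encoder and the disentanglers to establish this adversarial setting. On
the other hand, $\hat{\mathbf{s}}$ and $\hat{\mathbf{n}}$ are well estimated by the corresponding decoders.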
In the final speech enhancement stage, after obtaining $\hat{\mathbf{s}}$ from the decoders, the
estimated clean speech spectrum $\hat{\mathbf{S}}$ is reconstructed by

$$\hat{\mathbf{S}} = \hat{\mathbf{s}} \otimes \exp(j \angle \mathbf{x}), \tag{4.5}$$

where $\angle \mathbf{x}$ denotes the phase of the corresponding input noisy speech. $\hat{\mathbf{S}}$ is then
transformed to the time-domain signal through the inverse discrete Fourier transform
(IDFT). Finally, an overlap-add method as in [80] is used to synthesize the waveform
of the enhanced speech.
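The deterministic masking layer of Equation (4.3) and the phase reuse of Equation (4.5) can be sketched for a single frame as follows. The function name, the small constant `eps`, and the toy inputs are assumptions; the mask values stand in for decoder outputs, which are not modeled here.

```python
import numpy as np

def enhance_frame(m_s, m_n, noisy_spec, eps=1e-8):
    """Apply Eq. (4.3) to the noisy magnitude, then attach the noisy phase (Eq. (4.5)).

    m_s, m_n: non-negative decoder outputs of shape (F,).
    noisy_spec: complex STFT coefficients of the current noisy frame, shape (F,).
    """
    magnitude = np.abs(noisy_spec)
    # eps is an assumed guard against division by zero.
    s_hat = m_s / (m_s + m_n + eps) * magnitude
    return s_hat * np.exp(1j * np.angle(noisy_spec))

noisy = np.array([1.0 + 1.0j, -2.0j, 3.0 + 0.0j])       # toy frame with F = 3 bins
all_speech = enhance_frame(np.ones(3), np.zeros(3), noisy)
half_speech = enhance_frame(np.ones(3), np.ones(3), noisy)
print(np.abs(half_speech) / np.abs(noisy))               # about 0.5 in every bin
```

When the noise mask is zero the frame passes through essentially unchanged, and when both masks are equal the magnitude is halved while the phase is kept. The enhanced frames would then be converted to the time domain with an inverse FFT and combined by overlap-add.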
4.4.2 Training Objectives
Given the estimates s and n of the clean speech s and noise n, we optimize the
neural network parameters of the encoder and decoders by minimizing the mean
squared error defined as follows:
LDs(θE , θDs) =1
K
K∑k=1
‖sk − sk‖2,
LDn(θE , θDn) =1
K
K∑k=1
‖nk − nk‖2(4.6)
where ‖ · ‖ indicates the l2-norm, K is the number of mini-batch size, and sk (nk) is
the estimate of the k-th speech (noise) sample sk (nk) in the mini-batch. Similarly,
we also train the encoder and the disentanglers by using the following objective
functions:
LDEn(θE , θDEn) =1
K
K∑k=1
‖nk − nk‖2,
LDEs(θE , θDEs) =1
K
K∑k=1
‖sk − sk‖2(4.7)
where $\tilde{n}_k$ and $\tilde{s}_k$ are obtained through Equation (4.4). To obtain disentangled features, we minimize $L_{DE_n}$ and $L_{DE_s}$ defined in Equation (4.7) with respect to $\theta_{DE_n}$ and $\theta_{DE_s}$, while simultaneously maximizing them with respect to $\theta_E$. Combining Equations (4.6) and (4.7), the total loss of the proposed network is formulated as
$$L_T(\theta_E, \theta_{D_s}, \theta_{D_n}, \theta_{DE_n}, \theta_{DE_s}) = [L_{D_s}(\theta_E, \theta_{D_s}) - \lambda_1 L_{DE_n}(\theta_E, \theta_{DE_n})] + \alpha [L_{D_n}(\theta_E, \theta_{D_n}) - \lambda_2 L_{DE_s}(\theta_E, \theta_{DE_s})] \tag{4.8}$$
where $\lambda_1$ and $\lambda_2$ are positive hyper-parameters that control the amount of gradient reversal in the back-propagation step, and $\alpha$ denotes the weight controlling the contribution of the noise estimate.
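As a rough illustration of how the four mean squared errors combine into the total loss of Equation (4.8), the following NumPy sketch evaluates it for arrays of shape (K, F). The function names `mse` and `total_loss` and the argument names are hypothetical; the sketch only evaluates the scalar loss and does not perform any training.

```python
import numpy as np

def mse(target, estimate):
    """Mini-batch mean of the squared l2-norm, as in Eqs. (4.6) and (4.7)."""
    return np.mean(np.sum((target - estimate) ** 2, axis=1))

def total_loss(s, n, s_hat, n_hat, s_tilde, n_tilde, lam1, lam2, alpha):
    """Eq. (4.8). Decoder outputs: s_hat, n_hat. Disentangler outputs:
    s_tilde, n_tilde. All arrays have shape (K, F)."""
    l_ds = mse(s, s_hat)       # speech decoder loss
    l_dn = mse(n, n_hat)       # noise decoder loss
    l_den = mse(n, n_tilde)    # noise disentangler loss
    l_des = mse(s, s_tilde)    # speech disentangler loss
    return (l_ds - lam1 * l_den) + alpha * (l_dn - lam2 * l_des)
```

With `lam1 = lam2 = 0` the expression reduces to the weighted sum of the two decoder losses, which corresponds to training without the disentanglers.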
In recent studies, GRL has only been used for domain predictions under narrowly
restricted settings (with only two possible domains, e.g., the source and the target)
or for classifications of channels, speakers, and noise types. The proposed model
distinguishes itself from the past approaches by using two GRLs to disentangle the
speech and noise latent features in a regression manner.
4.4.3 Adversarial Training for Disentangled Features
Neural network parameters are optimized by using the objective function given in Equation (4.8) via adversarial learning. $D_s$ and $D_n$ are trained to minimize $L_{D_s}$ and $L_{D_n}$, and $DE_n$ and $DE_s$ are likewise trained to minimize $L_{DE_n}$ and $L_{DE_s}$. As for the optimization of $E$, it is essential that the encoder produce disentangled features. This idea is implemented by minimizing $L_{D_s}$ and $L_{D_n}$ while maximizing $L_{DE_n}$ and $L_{DE_s}$ in an adversarial manner with respect to the encoder parameter $\theta_E$. Such a mini-max competition ideally converges to the point where the encoder network generates the noise-confusing latent feature $z_s$ and the speech-confusing latent feature $z_n$, disentangled in the latent feature space. $D_s$ and $D_n$ then take $z_s$ and $z_n$ as inputs, respectively, and produce the noise-invariant speech estimate $\hat{s}$. In summary, the optimizations of the network parameters are given by
$$
\begin{aligned}
(\hat{\theta}_E, \hat{\theta}_{D_s}, \hat{\theta}_{D_n}) &= \arg\min_{\theta_E, \theta_{D_s}, \theta_{D_n}} L_T(\theta_E, \theta_{D_s}, \theta_{D_n}, \hat{\theta}_{DE_n}, \hat{\theta}_{DE_s}), \\
(\hat{\theta}_{DE_n}, \hat{\theta}_{DE_s}) &= \arg\max_{\theta_{DE_n}, \theta_{DE_s}} L_T(\hat{\theta}_E, \hat{\theta}_{D_s}, \hat{\theta}_{D_n}, \theta_{DE_n}, \theta_{DE_s})
\end{aligned}
\tag{4.9}
$$
where $\hat{\theta}_{(\cdot)}$ denotes the optimal parameters for each given network $(\cdot)$.
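The network parameters defined by Equation (4.9) can be found as a stationary point of the following gradient updates:
$$
\begin{aligned}
\theta_E &\leftarrow \theta_E - \mu \left( \frac{\partial L_{D_s}}{\partial \theta_E} + \alpha \frac{\partial L_{D_n}}{\partial \theta_E} - \lambda_1 \frac{\partial L_{DE_n}}{\partial \theta_E} - \alpha \lambda_2 \frac{\partial L_{DE_s}}{\partial \theta_E} \right), \\
\theta_{D_s} &\leftarrow \theta_{D_s} - \mu \frac{\partial L_{D_s}}{\partial \theta_{D_s}}, \\
\theta_{D_n} &\leftarrow \theta_{D_n} - \mu \alpha \frac{\partial L_{D_n}}{\partial \theta_{D_n}}, \\
\theta_{DE_n} &\leftarrow \theta_{DE_n} - \mu \frac{\partial L_{DE_n}}{\partial \theta_{DE_n}}, \\
\theta_{DE_s} &\leftarrow \theta_{DE_s} - \mu \alpha \frac{\partial L_{DE_s}}{\partial \theta_{DE_s}}
\end{aligned}
\tag{4.10}
$$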
where µ indicates the learning rate. The updates of Equation (4.10) are very similar to stochastic gradient descent (SGD) updates for a feed-forward deep model that comprises the encoder feeding into the decoders and into the disentanglers. The difference is that, in the encoder update, the gradients from the disentanglers are weighted by λ1 and αλ2 and subtracted, instead of being summed with the gradients from the decoders. The negative coefficients −λ1 and −λ2 make the encoder maximize LDEn and LDEs by reversing the gradients during back-propagation. If both λ1 and λ2 are set to zero, the neural network structure presented in Figure 4.2 becomes equivalent to the conventional DNN structure. The optimized networks E, Ds and Dn are then used during the test stage to generate the clean speech estimates from the noisy test speech samples.
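The sign pattern of the updates in Equation (4.10) can be illustrated with a deliberately tiny scalar model. The sketch below is an assumption-laden toy, not the thesis network: it uses a single decoder and a single disentangler, each a one-parameter linear map, with hand-written gradients; `update_step` and all variable names are hypothetical.

```python
def update_step(w_e, w_d, w_de, x, s, n, mu, lam):
    """One update in the pattern of Eq. (4.10) for a scalar toy model:
    z = w_e * x, decoder s_hat = w_d * z, disentangler n_tilde = w_de * z."""
    z = w_e * x
    err_d = w_d * z - s      # decoder residual (s_hat - s)
    err_de = w_de * z - n    # disentangler residual (n_tilde - n)

    # Gradients of the squared errors L_D = err_d**2 and L_DE = err_de**2.
    g_d_we = 2.0 * err_d * w_d * x
    g_d_wd = 2.0 * err_d * z
    g_de_we = 2.0 * err_de * w_de * x
    g_de_wde = 2.0 * err_de * z

    # Encoder: descend on L_D, ascend on L_DE (reversed gradient scaled by lam).
    new_w_e = w_e - mu * (g_d_we - lam * g_de_we)
    # Decoder and disentangler: plain gradient descent on their own losses.
    new_w_d = w_d - mu * g_d_wd
    new_w_de = w_de - mu * g_de_wde
    return new_w_e, new_w_d, new_w_de
```

Setting `lam` to zero leaves the disentangler learning from the encoder output without influencing it, which mirrors the remark above that λ1 = λ2 = 0 recovers the conventional structure.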
4.5 Experiments and Results
4.5.1 Dataset and Feature Extraction
We used 6,300 utterances of clean speech data from the TIMIT database [81] to train the neural networks. The TIMIT database consists of 10 sentences spoken by each of 630 English speakers. To ensure that various noisy utterances were considered during simulations, we selected 10 different noise types: car, construction, office, railway, cafeteria, street, incar, train, and bus from the ITU-T Recommendation P.501 database, and white noise from the NOISEX-92 database [82]. In the case of matched noise conditions, two-thirds of each noise clip was used for training and the rest for testing. For each pair of a clean speech utterance and a noise waveform, a noisy speech utterance was artificially generated with an SNR value randomly chosen from −3 to 6 dB in 1 dB steps. As a result, a total of 63,000 utterances (about 54 hours) were used, so that the entire database was mixed with each noise type.
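Generating a noisy utterance at a prescribed SNR can be sketched as below. This is only an illustrative sketch: `mix_at_snr` is a hypothetical helper, the signal powers are measured over the whole utterance (the thesis does not state how speech level was measured), and the noise clip is assumed to be at least as long as the speech.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db,
    then add it to `speech`. Returns the mixture and the applied noise gain."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
# Integer SNR drawn from {-3, -2, ..., 6} dB, matching the 1 dB steps above.
snr_db = int(rng.integers(-3, 7))
```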
The test set consisted of 1,400 utterances of clean speech data from TSPspeech [83], spoken by 12 male and 12 female English speakers. For the experiments in the matched noise conditions, we used the same noise types as used for training. For the experiments in the mismatched noise conditions, noises including kids, traffic, metro, and restaurant from the ITU-T Recommendation P.501 database were applied. Noisy speech utterances were generated with SNR values ranging from −6 to 9 dB in 3 dB steps, in which the −6 and 9 dB cases represented the unseen SNR conditions.
Figure 4.3: Plot of losses on training the proposed model.
The input and target features of the networks were extracted in the following way. First, we extracted the magnitude spectra from the noisy speech, the corresponding clean speech, and the noise. A 512-point Hamming window with 50% overlap was applied to the audio signals, sampled at 16 kHz, and then the short-time Fourier transform (STFT) was applied. The 512-point STFT magnitudes were reduced to 257 points by removing the symmetric half. F and τ were fixed to 257 and 5, respectively. Thus, the input feature vectors were formed by concatenating 11 (= 2τ + 1) consecutive frames in a similar manner as in [59].
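The frame-stacking step can be sketched as follows. The helper name `stack_context` is hypothetical, and repeating the first and last frames at the utterance boundaries is an assumption, since the boundary handling is not described here.

```python
import numpy as np

def stack_context(mag, tau=5):
    """Concatenate each frame with its tau left and tau right neighbours.
    mag: (T, F) magnitude spectra -> (T, (2 * tau + 1) * F)."""
    # Assumed boundary handling: repeat the edge frames.
    padded = np.pad(mag, ((tau, tau), (0, 0)), mode="edge")
    n_frames = mag.shape[0]
    return np.concatenate(
        [padded[i:i + n_frames] for i in range(2 * tau + 1)], axis=1
    )
```

For F = 257 and τ = 5 each row has 11 × 257 = 2,827 entries, with the current frame occupying the middle block of 257 values.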
4.5.2 Network Setup
The network architecture of the proposed model, which we refer to as the speech-noise disentangled training (snDT) model, is presented in Figure 4.2. The encoder E was constructed by stacking two hidden layers with 2,048 leaky rectified linear units (ReLUs) [84] in each layer. The number of input nodes of E was 257 × 11 = 2,827. The output layer generated two separate outputs of 512 nodes each (i.e., the dimension M of zs and zn) with leaky ReLUs.
The decoders $D_s$ and $D_n$ also had two hidden layers with 2,048 leaky ReLUs in each layer. The numbers of input and output nodes in each network were 512 and 257, respectively. For the output activations, the sigmoid function was used to restrict the output mask ($m_s$ and $m_n$) values to [0, 1], while $\hat{s}$ and $\hat{n}$ were determined implicitly by Equation (4.3). The structures of $DE_n$ and $DE_s$ were identical to that of $D_s$ except for the output activation functions: ReLUs were used for the output magnitudes ($\tilde{s}$ and $\tilde{n}$).
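The layer sizes described above can be checked with a small NumPy forward pass. This is a shape-only sketch under several assumptions: the weights are random, batch normalization is omitted, the leaky-ReLU slope of 0.01 is a guess, and the two 512-node encoder outputs are modelled as one 1,024-wide layer split in half. All function and variable names are hypothetical.

```python
import numpy as np

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dense(rng, n_in, n_out):
    """Random weights and zero bias for one fully connected layer."""
    w = rng.standard_normal((n_in, n_out), dtype=np.float32) * 0.01
    return w, np.zeros(n_out)

def mlp(x, layers, out_act):
    """Leaky-ReLU hidden layers followed by the given output activation."""
    for w, b in layers[:-1]:
        x = leaky_relu(x @ w + b)
    w, b = layers[-1]
    return out_act(x @ w + b)

rng = np.random.default_rng(0)
encoder = [dense(rng, 2827, 2048), dense(rng, 2048, 2048), dense(rng, 2048, 1024)]
speech_decoder = [dense(rng, 512, 2048), dense(rng, 2048, 2048), dense(rng, 2048, 257)]
noise_disentangler = [dense(rng, 512, 2048), dense(rng, 2048, 2048), dense(rng, 2048, 257)]

x = rng.standard_normal((4, 2827))        # mini-batch of 4 stacked input frames
z = mlp(x, encoder, leaky_relu)           # (4, 1024)
z_s, z_n = z[:, :512], z[:, 512:]         # speech and noise latent features
m_s = mlp(z_s, speech_decoder, sigmoid)   # speech mask, values in [0, 1]
n_tilde = mlp(z_s, noise_disentangler, lambda a: np.maximum(a, 0.0))  # ReLU output
```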
The snDT model was trained with the Adam optimizer [56], with a learning rate of 1e-3 and a mini-batch size of 10 utterances. Batch normalization [85] was applied to all of the hidden and output layers for regularization and stable training. As for the hyper-parameters λ1 and λ2 of Equation (4.8), we took an approach similar to [69]: λ1 and λ2 were held at 0 for the first 50K training iterations, and then their values were gradually increased until reaching λ1 = λ2 = 0.3 by the end of the training. α in Equation (4.8) was fixed at 0.4. Figure 4.3 shows the training losses obtained from the snDT model, from which it can be seen that the model was trained properly. Through the adversarial training defined by Equation (4.9), the speech and noise estimation losses decreased, and the disentangling losses gradually increased until convergence.
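One way to realize such a schedule is sketched below. The warm-up length (50K iterations) and the final value (0.3) come from the text; the linear ramp and the `total_steps` argument are assumptions, since the exact shape of the increase and the total number of iterations are not given here. The function name `grl_weight` is hypothetical.

```python
def grl_weight(step, total_steps, warmup_steps=50_000, final_value=0.3):
    """Zero during warm-up, then an (assumed) linear ramp that reaches
    final_value at total_steps and stays there afterwards."""
    if step <= warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_value * min(progress, 1.0)
```

For example, with a hypothetical budget of 200,000 iterations, the weight is 0 up to iteration 50,000 and reaches half of its final value at iteration 125,000.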
To evaluate the performance of the disentangled feature learning technique, we
implemented three baseline models for comparison. These baseline systems are as
follows:
• Speech training (sT) model, as shown in Figure 4.4a, is a deep denoising autoencoder [58], and it takes a regression approach closely resembling [59].
• Speech-noise training (snT) model, as shown in Figure 4.4b, utilizes noise components to construct the time-frequency masks. This approach is similar to the method suggested in [49]. Unlike the snDT model, however, the snT model does not exploit disentangled feature learning.
• Noise disentangled training (nDT) model, as shown in Figure 4.4c, is trained so that the noise components are disentangled from the speech latent features without using noise latent features.
Figure 4.4: The architectures of the baseline models: (a) sT model, (b) snT model, (c) nDT model.
The baseline models were configured similarly in terms of hyper-parameters and the number of layers and nodes in each module, to ensure a fair comparison with the snDT model. We implemented all the networks using TensorFlow [86].
4.5.3 Objective Measures
For the evaluation of the models’ performances, we considered four different aspects: speech quality, noise reduction, speech intelligibility, and speech distortion. The objective measures used are summarized as follows:
• PESQ: Perceptual evaluation of speech quality defined in the ITU-T P.862
standard [87]
• segSNR: Segmental SNR, which is the average of the frame-level SNRs computed between the clean and enhanced speech signals
• eSTOI: Extended short-time objective intelligibility [88]
• SDR: Signal-to-distortion ratio [57]
All metric values for the enhanced speech were compared with the corresponding
clean reference of the test set.
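Of the four measures, segSNR is simple enough to sketch directly. The following is a minimal version under stated assumptions: the frame length and hop reuse the 512/256 settings of Section 4.5.1, and the per-frame clipping to [−10, 35] dB is a common convention rather than something specified in this chapter. The function name `seg_snr` is hypothetical.

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=512, hop=256, lo=-10.0, hi=35.0):
    """Average of per-frame SNRs (dB) between a clean reference and an
    enhanced signal, with each frame value clipped to [lo, hi]."""
    eps = 1e-10  # guards against division by zero and log of zero
    n_frames = 1 + (len(clean) - frame_len) // hop
    values = []
    for t in range(n_frames):
        c = clean[t * hop:t * hop + frame_len]
        e = enhanced[t * hop:t * hop + frame_len]
        snr = 10.0 * np.log10((np.sum(c ** 2) + eps) / (np.sum((c - e) ** 2) + eps))
        values.append(np.clip(snr, lo, hi))
    return float(np.mean(values))
```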
4.5.4 Performance Evaluation
In the case of the matched noise conditions, we measured the objective metrics and averaged them over each SNR environment to evaluate performance for the ten different noise types. Table 4.1 presents the PESQ scores, segSNR, eSTOI, and SDR values obtained in the matched noise conditions, where the column “noisy” refers to the results obtained from the clean and the unprocessed noisy speech. The cases with SNR equal to −6 and 9 dB indicate the unseen SNR conditions that were not included during the training phase. First, we investigated whether the use of noise information improves speech enhancement performance. The results show that the snT model, which constructed the masks using both speech and noise information, performed better than the sT model, whose prediction was based only on speech components. Similarly, the snDT model with noise estimates reported better performance in terms of all the metrics compared to the nDT model.

Table 4.1: Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the matched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.

(a) PESQ
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         1.53     2.00    2.12    2.06    2.22
-3         1.71     2.23    2.35    2.30    2.45
0          1.90     2.44    2.57    2.52    2.66
3          2.11     2.64    2.76    2.72    2.85
6          2.33     2.83    2.95    2.90    3.02
9          2.54     2.99    3.10    3.05    3.17
Aver.      2.02     2.52    2.64    2.59    2.73

(b) segSNR
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -6.87    1.49    3.18    2.85    3.53
-3         -5.39    3.06    4.31    3.93    4.92
0          -3.65    4.57    5.58    5.27    6.29
3          -1.80    6.08    7.03    6.79    7.86
6          0.32     7.41    8.33    8.14    9.20
9          2.57     8.64    9.56    9.38    10.43
Aver.      -2.47    5.21    6.33    6.06    7.04

(c) eSTOI
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         0.44     0.56    0.59    0.57    0.61
-3         0.52     0.64    0.67    0.65    0.69
0          0.59     0.71    0.74    0.73    0.76
3          0.67     0.77    0.80    0.79    0.82
6          0.74     0.82    0.84    0.84    0.86
9          0.80     0.86    0.88    0.87    0.89
Aver.      0.63     0.73    0.75    0.74    0.77

(d) SDR
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -5.97    7.07    7.96    7.22    8.75
-3         -3.11    9.63    10.42   9.85    11.10
0          -0.17    11.92   12.67   12.16   13.21
3          2.80     14.06   14.71   14.27   15.14
6          5.78     15.81   16.42   16.03   16.81
9          8.78     17.34   17.94   17.56   18.24
Aver.      1.35     12.64   13.35   12.85   13.88

Table 4.2: Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the mismatched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.

(a) PESQ
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         1.33     1.68    1.77    1.79    1.90
-3         1.55     1.93    2.02    2.02    2.13
0          1.77     2.16    2.25    2.27    2.35
3          1.98     2.38    2.46    2.44    2.55
6          2.20     2.59    2.67    2.65    2.75
9          2.41     2.78    2.86    2.83    2.93
Aver.      1.88     2.25    2.34    2.33    2.43

(b) segSNR
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -6.59    -0.86   1.78    1.70    1.90
-3         -5.08    0.81    2.85    2.72    2.81
0          -3.35    2.58    3.50    3.47    4.04
3          -1.48    4.16    4.97    4.89    5.69
6          0.64     5.82    6.64    6.60    7.44
9          2.91     7.29    8.11    8.08    8.97
Aver.      -2.16    3.30    4.64    4.58    5.14

(c) eSTOI
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         0.39     0.46    0.48    0.48    0.51
-3         0.47     0.55    0.58    0.57    0.60
0          0.55     0.63    0.66    0.66    0.68
3          0.63     0.71    0.74    0.73    0.75
6          0.71     0.77    0.80    0.80    0.81
9          0.78     0.82    0.84    0.84    0.86
Aver.      0.59     0.66    0.68    0.68    0.70

(d) SDR
SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -6.00    1.96    2.44    2.20    2.59
-3         -3.11    4.89    5.37    5.21    5.57
0          -0.17    7.89    8.37    8.26    8.61
3          2.79     10.50   10.92   10.78   11.17
6          5.78     13.01   13.41   13.24   13.66
9          8.78     15.11   15.52   15.37   15.82
Aver.      1.34     8.89    9.34    9.18    9.57
The nDT model, which disentangles the noise components in the latent feature space, yielded smaller performance improvements than the snT model. This suggests that, even though the nDT model incorporated disentangled feature learning, it was not able to exploit the noise information to construct the masks during the speech enhancement process. On the other hand, to examine the sole effect of disentangled feature learning, the nDT model should be compared to the sT model, whose structure is identical to that of the nDT model except for the noise disentangler. As can be seen in the results, the nDT model outperformed the sT model in terms of all the metrics. Furthermore, the snDT model outperformed the snT model; both models adopted the masks in the same way, and the snDT model differed only in additionally applying disentangled feature learning. In summary, the proposed model showed better performance than all the baseline models in terms of speech quality, intelligibility, noise reduction, and speech distortion, indicating that the disentanglement between speech and noise features in the latent feature space was effective for the prediction of the clean speech.
In the case of the mismatched noise conditions, we evaluated performance on four different noise types and averaged the results over each SNR environment. Table 4.2 presents the PESQ scores, segSNR, eSTOI, and SDR values obtained under the mismatched noise conditions. The results show that the snDT model outperformed the baseline methods, implying that it was more robust to the unseen noise types. Since the snDT model learned how to disentangle speech components from the latent features, the disentangled features could be obtained even in the mismatched noise conditions. From the perspective of noise reduction, in particular, it is noteworthy that the models using disentangled feature learning showed competitive performance improvements in the mismatched noise conditions compared to the matched conditions. In the case of the matched noise conditions, the relative improvement of segSNR was 16.31% for the nDT model when compared against the sT model, and 11.21% for the snDT model against the snT model. In the case of the mismatched noise conditions, however, the relative improvements of segSNR of the nDT and snDT models were 38.79% and 15.95%, respectively. It can be seen that the proposed approach is particularly effective in terms of noise reduction.
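The relative improvements quoted above can be recomputed from the average segSNR rows of Tables 4.1(b) and 4.2(b). The short sketch below does so; the helper `relative_improvement` is hypothetical, and the inputs are the rounded table averages. Three of the four results agree with the quoted figures to within rounding, whereas the rounded averages give roughly 10.8% for snDT against snT in the mismatched conditions rather than the quoted 15.95%. The quoted value may have been derived from scores not shown in the table, which this sketch cannot verify.

```python
def relative_improvement(baseline, proposed):
    """Relative gain of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline

# Average segSNR values (dB) copied from Tables 4.1(b) and 4.2(b).
matched = {"sT": 5.21, "snT": 6.33, "nDT": 6.06, "snDT": 7.04}
mismatched = {"sT": 3.30, "snT": 4.64, "nDT": 4.58, "snDT": 5.14}

gains = {
    "matched, nDT vs sT": relative_improvement(matched["sT"], matched["nDT"]),
    "matched, snDT vs snT": relative_improvement(matched["snT"], matched["snDT"]),
    "mismatched, nDT vs sT": relative_improvement(mismatched["sT"], mismatched["nDT"]),
    "mismatched, snDT vs snT": relative_improvement(mismatched["snT"], mismatched["snDT"]),
}

for name, value in gains.items():
    print(f"{name}: {value:.2f}%")
```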
Additionally, Figure 4.5 shows the spectrograms of an utterance enhanced by
the snT and snDT models in the mismatched noise conditions. From this figure, it
is shown that the proposed algorithm effectively reduced the noise from the original
noisy speech while the speech distortion was minimized.
We also conducted a listening test to compare the subjective performance of the proposed algorithm with the conventional scheme. Eighteen listeners participated and were presented with 42 randomly selected sentences corrupted by the 14 different noises at SNR values of −3, 0, and 3 dB. In the test, each listener was provided with speech samples enhanced by the snT model and the snDT model. Listeners could listen to each enhanced sample as many times as they wanted, and were asked to choose the preferred one from each pair of speech samples in terms of perceptual speech quality. If the quality of the two samples was indistinguishable, listeners could select no preference. The two samples in each pair were given in arbitrary order.
Figure 4.5: (From top to bottom) The spectrograms of noisy speech degraded by metro noise with −3 dB SNR, enhanced speech by the snT model, enhanced speech by the snDT model, and the corresponding clean speech, respectively.
The results are shown in Figure 4.6. It can be seen that the quality of the speech enhanced by the proposed model was preferred over that of the conventional model at all SNR values. With respect to the averaged results, the snDT model was preferred to the snT model in 52.78% of the cases, while the opposite preference was expressed in 8.20% of the cases (no preference in 39.02% of the cases). These results imply that the proposed algorithm improves not only the objective measures but also the perceived quality.
Figure 4.6: Results of subjective preference test (%) comparing the speech quality for the snT and snDT models with various SNR values.

SNR (dB)   snT      Neutral   snDT
-3         5.16     35.71     59.13
0          11.51    36.11     52.38
3          7.94     45.24     46.82
Average    8.20     39.02     52.78
4.5.5 Analysis of Noise-Invariant Speech Enhancement
As the network is trained with different types of noise, it can be anticipated that the performance may vary depending on the noise type even at the same SNR value. This could be problematic, especially under various real-world noise environments, because smaller performance improvements for certain noise types could lower the overall performance of the entire system. Figure 4.7 shows the variances of the PESQ scores obtained from different noise types. We measured the PESQ scores separately for each noise type and computed the variance across the 14 different noise types used in the matched and mismatched noise conditions. The results show that the proposed algorithm yielded the smallest performance gap among the noise types in all of the SNR environments. It is noted that the snDT model produced much smaller variances at the low SNR levels compared to the baseline models. This suggests that the proposed model was less sensitive to different noise types during the enhancement process because it disentangled the speech attributes well from the noisy speech in the latent feature space. The experimental results, therefore, suggest that the proposed model is a speech enhancement system with an improved noise-invariant property.
Figure 4.7: Variances of PESQ scores for the 14 different noise types in various SNR environments.
4.5.6 Disentangled Feature Representations
We further explored the effect of disentangled feature learning by visualizing the speech latent feature (zs) using t-SNE [89]. t-SNE is a popular data visualization method that embeds high-dimensional data in a low-dimensional space. The embedding serves as a useful tool for visually inspecting the feature representations learned by the model. We extracted speech latent features from a subset of the test samples through the trained models and projected the 512-dimensional zs into a 2-dimensional space using t-SNE. Figure 4.8 visualizes the speech latent feature representations obtained in the matched noise conditions. Figure 4.8d, in particular, shows that by using two disentanglers for adversarial learning, the distributions of zs for different noise types became almost indistinguishable. This implies that the noise attributes were highly likely to be disentangled from zs. In contrast, without disentangled feature learning, as shown in Figures 4.8a and 4.8b, each type of noise formed a cluster that could be separated easily in the latent feature space. This indicates that the noise attributes remained intact in zs. Figure 4.8c shows that the nDT model disentangled the noise components more clearly than the sT and snT models, yet not as much as the snDT model. Finally, Figure 4.9 shows the speech latent feature representations in the mismatched noise conditions. Even though the noise types were not included in the training data, the proposed model disentangled noise components more clearly in the latent feature space than the conventional DNN-based models.
Figure 4.8: Visualization of speech latent feature (zs) using t-SNE in the matched noise condition: (a) sT model, (b) snT model, (c) nDT model, (d) snDT model.
Figure 4.9: Visualization of speech latent feature (zs) using t-SNE in the mismatched noise condition: (a) snT model, (b) snDT model.
4.6 Summary
In this chapter, we proposed a novel speech enhancement method in which speech and noise latent features are disentangled via adversarial learning. To explore disentangled representations, which have not been exploited in conventional speech enhancement algorithms, we designed a model using GRLs. The proposed architecture is composed of five sub-networks, in which the decoders and the disentanglers are trained in an adversarial manner to encourage the encoder to produce noise-invariant features. The speech latent features generated by the encoder reduced the variability among different noise types while keeping the speech information intact. Experimental results showed that the proposed model outperformed the conventional DNN-based speech enhancement algorithms in terms of various measures in both the matched and mismatched noise conditions. Moreover, the proposed model achieved an improved noise-invariant property through disentangled feature learning. Visualization of the speech latent features demonstrated that the proposed method was able to disentangle speech attributes from the noisy speech in the latent feature space.
Chapter 5
Conclusions
In this thesis, we proposed a variety of deep learning techniques to improve the performance of acoustic environment recognition and speech enhancement. To enhance the classification accuracy of acoustic scenes, we proposed a novel neural network structure which achieved higher performance than the conventional DNN, CNN and LSTM architectures in terms of both frame-based and segment-based accuracy. By combining different networks in parallel, the proposed method was able to learn complementary information from the LSTM and CNN.
We also proposed a neural network for overlapping AEC based on joint training between a source separation model and a multi-label classification model. By adopting the source separation framework into the overlapping AEC task, the jointly trained network can minimize the interference of overlapping events. The experimental results show that the proposed technique outperforms the baseline networks which do not apply the joint training with source separation.
Finally, we proposed a novel speech enhancement method in which speech and
noise latent features were disentangled via adversarial learning. In order to explore
the disentangled representation which has not been exploited in the conventional
speech enhancement algorithms, we designed a model using GRLs. The proposed
architecture is composed of five sub-networks where the decoders and the disentan-
glers were trained in an adversarial manner to encourage the encoder to produce
noise-invariant features. The speech latent features generated by the encoder re-
duced the variability among different noise types while retaining the speech infor-
mation intact. Experimental results showed that the proposed model outperformed
the conventional DNN-based speech enhancement algorithms in terms of various
measurements in both the matched and mismatched noise conditions. Moreover, the
proposed model achieved an improved noise-invariant property through disentangled
feature learning. Visualization of the speech latent features demonstrated
that the proposed method was able to disentangle speech attributes from the noisy
speech in the latent feature space.
Bibliography
[1] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “CP-JKU submissions for DCASE-2016: a hybrid approach using binaural i-vectors and deep convolutional neural networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[2] H. Phan, L. Hertel, M. Maass, P. Koch, R. Mazur, and A. Mertins, “Improved
audio scene classification based on label-tree embeddings and convolutional neu-
ral networks,” IEEE/ACM Transactions on Audio, Speech, and Language Pro-
cessing, vol. 25, no. 6, pp. 1278–1290, 2017.
[3] L. Lu, H.-J. Zhang, and S. Z. Li, “Content-based audio classification and seg-
mentation by using support vector machines,” Multimedia systems, vol. 8, no. 6,
pp. 482–492, 2003.
[4] A. Temko and C. Nadeu, “Classification of acoustic events using SVM-based
clustering schemes,” Pattern Recognition, vol. 39, no. 4, pp. 682–694, 2006.
[5] A. Temko and C. Nadeu, “Acoustic event detection in meeting-room environ-
ments,” Pattern Recognition Letters, vol. 30, no. 14, pp. 1281–1288, 2009.
[6] A. Härmä, M. F. McKinney, and J. Skowronek, “Automatic surveillance of the
acoustic activity in our living environment,” in IEEE International Conference
on Multimedia and Expo (ICME), 2005.
[7] M. Xu, C. Xu, L. Duan, J. S. Jin, and S. Luo, “Audio keywords generation for
sports video analysis,” ACM Transactions on Multimedia Computing, Commu-
nications, and Applications (TOMM), vol. 4, no. 2, p. 11, 2008.
[8] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore,
M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset
for audio events,” in IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2017.
[9] J. Chen, Y. Wang, S. E. Yoho, D. Wang, and E. W. Healy, “Large-scale training
to increase speech intelligibility for hearing-impaired listeners in novel noises,”
The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604–2612,
2016.
[10] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, “A deep denois-
ing autoencoder approach to improving the intelligibility of vocoded speech in
cochlear implant simulation,” IEEE Transactions on Biomedical Engineering,
vol. 64, no. 7, pp. 1568–1578, 2016.
[11] A. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, “Recurrent neural networks for noise reduction in robust ASR,” 2012.
[12] C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement
with generative adversarial networks for robust speech recognition,” in 2018
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2018, pp. 5024–5028.
[13] J. Ortega-García and J. González-Rodríguez, “Overview of speech enhancement techniques for automatic speaker recognition,” in Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP’96, vol. 2. IEEE, 1996, pp. 929–932.
[14] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,”
IEEE Transactions on acoustics, speech, and signal processing, vol. 27, no. 2,
pp. 113–120, 1979.
[15] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square
error short-time spectral amplitude estimator,” IEEE Transactions on acous-
tics, speech, and signal processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[16] N. S. Kim and J.-H. Chang, “Spectral enhancement based on global soft deci-
sion,” IEEE Signal processing letters, vol. 7, no. 5, pp. 108–110, 2000.
[17] J. Lim and A. Oppenheim, “All-pole modeling of degraded speech,” IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp.
197–210, 1978.
[18] P. Gupta, M. Patidar, and P. Nema, “Performance analysis of speech enhancement using LMS, NLMS and UNANR algorithms,” in 2015 International Conference on Computer, Communication and Control (IC4). IEEE, 2015, pp. 1–5.
[19] R. Li, Y. Liu, Y. Shi, L. Dong, and W. Cui, “ILMSAF based speech enhancement
with DNN and noise classification,” Speech Communication, vol. 85, pp. 53–70,
2016.
[20] I. Cohen and B. Berdugo, “Speech enhancement for non-stationary noise envi-
ronments,” Signal processing, vol. 81, no. 11, pp. 2403–2418, 2001.
[21] K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based speech enhancement using
bases update,” IEEE Signal Processing Letters, vol. 22, no. 4, pp. 450–454,
2014.
[22] N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised
speech enhancement using nonnegative matrix factorization,” IEEE Transac-
tions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151,
2013.
[23] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D.
Plumbley, “Detection and classification of acoustic scenes and events: An
IEEE AASP challenge,” in IEEE Workshop on Applications of Signal Process-
ing to Audio and Acoustics, 2013, pp. 1–4.
[24] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, “Acoustic scene
classification: Classifying environments from the sounds they produce,” IEEE
Signal Processing Magazine, vol. 32, no. 3, pp. 16–34, 2015.
[25] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “De-
tection and classification of acoustic scenes and events,” IEEE Transactions on
Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
[26] Z. Kons and O. Toledo-Ronen, “Audio event classification using deep neural
networks,” in INTERSPEECH, 2013, pp. 1482–1486.
[27] O. Gencoglu, T. Virtanen, and H. Huttunen, “Recognition of acoustic events
using deep neural networks,” in European Signal Processing Conference (EU-
SIPCO), 2014, pp. 506–510.
[28] I. McLoughlin, H. Zhang, Z. Xie, Y. Song, and W. Xiao, “Robust sound event
classification using deep neural networks,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 23, no. 3, pp. 540–552, 2015.
[29] A. Graves, “Supervised sequence labelling,” in Supervised Sequence Labelling
with Recurrent Neural Networks. Springer, 2012, pp. 5–13.
[30] Y. Wang, L. Neves, and F. Metze, “Audio-based multimedia event detection
using deep recurrent neural networks,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2742–2746.
[31] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neural networks for
polyphonic sound event detection in real life recordings,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.
6440–6444.
[32] H. Zhang, I. McLoughlin, and Y. Song, “Robust sound event recognition using
convolutional neural networks,” in IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2015, pp. 559–563.
[33] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, “Exploiting spectro-
temporal locality in deep learning based acoustic event detection,” EURASIP
Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, pp. 1–12,
2015.
[34] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene
classification and sound event detection,” in European Signal Processing Con-
ference (EUSIPCO), 2016.
[35] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent
neural networks,” arXiv preprint arXiv:1211.5063, 2012.
[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Compu-
tation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] H. Phan, L. Hertel, M. Maass, and A. Mertins, “Robust audio event recog-
nition with 1-max pooling convolutional neural networks,” arXiv preprint
arXiv:1604.06338, 2016.
[38] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for
visual recognition and description,” in IEEE Conference on Computer Vision
and Pattern Recognition, 2015, pp. 2625–2634.
[39] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-
term memory, fully connected deep neural networks,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2015, pp. 4580–4584.
[40] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint
arXiv:1212.5701, 2012.
[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: A simple way to prevent neural networks from overfitting,” The
Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
70
[42] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, “Context-dependent
sound event detection,” EURASIP Journal on Audio, Speech, and Music Pro-
cessing, vol. 2013, no. 1, pp. 1–13, 2013.
[43] A. Dessein, A. Cont, and G. Lemaitre, “Real-time detection of overlapping
sound events with non-negative matrix factorization,” in Matrix Information
Geometry. Springer, 2013, pp. 341–371.
[44] Y. Wang and F. Metze, “A first attempt at polyphonic sound event detection
using connectionist temporal classification,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[45] E. Benetos, G. Lafay, M. Lagrange, and M. D. Plumbley, “Polyphonic sound
event tracking using linear dynamical systems,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1266–1277, 2017.
[46] J. Dennis, H. D. Tran, and E. S. Chng, “Overlapping sound event recognition
using local spectrogram features and the generalised hough transform,” Pattern
Recognition Letters, vol. 34, no. 9, pp. 1085–1093, 2013.
[47] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic sound event
detection using multi label deep neural networks,” in IEEE International Joint
Conference on Neural Networks (IJCNN), 2015, pp. 1–7.
[48] T. Heittola, A. Mesaros, T. Virtanen, and M. Gabbouj, “Supervised model
training for overlapping sound events based on unsupervised source separation.”
in IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2013, pp. 8677–8681.
71
[49] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint opti-
mization of masks and deep recurrent neural networks for monaural source sep-
aration,” IEEE/ACM Transactions on Audio, Speech and Language Processing
(TASLP), vol. 23, no. 12, pp. 2136–2147, 2015.
[50] E. M. Grais, G. Roma, A. J. Simpson, and M. D. Plumbley, “Discriminative
enhancement for single channel audio source separation using deep neural net-
works,” in International Conference on Latent Variable Analysis and Signal
Separation. Springer, 2017, pp. 236–246.
[51] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based target source
separation using deep neural network,” IEEE Signal Processing Letters, vol. 22,
no. 2, pp. 229–233, 2015.
[52] A. Narayanan and D. Wang, “Improving robustness of deep neural network
acoustic models via speech separation and joint adaptive training,” IEEE/ACM
transactions on audio, speech, and language processing, vol. 23, no. 1, pp. 92–
101, 2015.
[53] K. H. Lee, S. J. Kang, W. H. Kang, and N. S. Kim, “Two-stage noise aware
training using asymmetric deep denoising autoencoder,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.
5765–5769.
[54] Z.-Q. Wang and D. Wang, “A joint training framework for robust automatic
speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 24, no. 4, pp. 796–806, 2016.
72
[55] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene
classification and sound event detection,” in European Signal Processing Con-
ference (EUSIPCO), 2016, pp. 1128–1132.
[56] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[57] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind
audio source separation,” IEEE transactions on audio, speech, and language
processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[58] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep
denoising autoencoder.” in Interspeech, 2013, pp. 436–440.
[59] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech
enhancement based on deep neural networks,” IEEE/ACM Transactions on
Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19,
2015.
[60] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, “Nmf-based target source
separation using deep neural network,” IEEE Signal Processing Letters, vol. 22,
no. 2, pp. 229–233, 2014.
[61] X.-L. Zhang and D. Wang, “A deep ensemble learning method for monaural
speech separation,” IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP), vol. 24, no. 5, pp. 967–977, 2016.
[62] J. Chen and D. Wang, “Long short-term memory for speaker generalization in
supervised speech separation,” The Journal of the Acoustical Society of Amer-
ica, vol. 141, no. 6, pp. 4705–4714, 2017.
73
[63] H. Zhao, S. Zarar, I. Tashev, and C.-H. Lee, “Convolutional-recurrent neural
networks for speech enhancement,” in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2401–
2405.
[64] P. Chandna, M. Miron, J. Janer, and E. Gomez, “Monoaural audio source sep-
aration using deep convolutional neural networks,” in International conference
on latent variable analysis and signal separation. Springer, 2017, pp. 258–266.
[65] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in
neural information processing systems, 2014, pp. 2672–2680.
[66] S. Pascual, A. Bonafonte, and J. Serra, “Segan: Speech enhancement generative
adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
[67] M. H. Soni, N. Shah, and H. A. Patil, “Time-frequency masking-based speech
enhancement using generative adversarial network,” in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018,
pp. 5039–5043.
[68] A. Pandey and D. Wang, “On adversarial training and loss functions for speech
enhancement,” in 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2018, pp. 5414–5418.
[69] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette,
M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural net-
works,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–
2030, 2016.
74
[70] Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised
speech separation,” IEEE/ACM transactions on audio, speech, and language
processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[71] M. Delfarah and D. Wang, “Features for masking-based monaural speech sepa-
ration in reverberant conditions,” IEEE/ACM Transactions on Audio, Speech,
and Language Processing, vol. 25, no. 5, pp. 1085–1094, 2017.
[72] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on
knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[73] Y. Shinohara, “Adversarial multi-task learning of deep neural networks for ro-
bust speech recognition.” in INTERSPEECH. San Francisco, CA, USA, 2016,
pp. 2369–2372.
[74] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domain adap-
tation approach for robust speech recognition,” Neurocomputing, vol. 257, pp.
79–87, 2017.
[75] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gang, and B.-H. Juang,
“Speaker-invariant training via adversarial learning,” in 2018 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 5969–5973.
[76] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, “Domain adversarial
training for accented speech recognition,” in 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp.
4854–4858.
75
[77] Q. Wang, W. Rao, S. Sun, L. Xie, E. S. Chng, and H. Li, “Unsupervised do-
main adaptation via domain adversarial training for speaker recognition,” in
2018 IEEE International Conference on Acoustics, Speech and Signal Process-
ing (ICASSP). IEEE, 2018, pp. 4889–4893.
[78] M. Abdelwahab and C. Busso, “Domain adversarial for acoustic emotion recog-
nition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol. 26, no. 12, pp. 2423–2435, 2018.
[79] C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang, “Noise adaptive
speech enhancement using domain adversarial training,” arXiv preprint
arXiv:1807.07501, 2018.
[80] L. R. Rabiner and B. Gold, “Theory and application of digital signal process-
ing,” Englewood Cliffs, NJ, Prentice-Hall, Inc., 1975. 777 p., 1975.
[81] V. Zue, S. Seneff, and J. Glass, “Speech database development at mit: Timit
and beyond,” Speech communication, vol. 9, no. 4, pp. 351–356, 1990.
[82] A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition:
Ii. noisex-92: A database and an experiment to study the effect of additive
noise on speech recognition systems,” Speech communication, vol. 12, no. 3, pp.
247–251, 1993.
[83] P. Kabal, “Tsp speech database,” McGill University, Database Version, vol. 1,
no. 0, pp. 09–02, 2002.
[84] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve
neural network acoustic models,” in Proc. icml, vol. 30, no. 1, 2013, p. 3.
76
[85] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[86] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghe-
mawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine
learning,” in 12th Symposium on Operating Systems Design and Implementa-
tion, 2016, pp. 265–283.
[87] I.-T. Recommendation, “Perceptual evaluation of speech quality (pesq): An ob-
jective method for end-to-end speech quality assessment of narrow-band tele-
phone networks and speech codecs,” Rec. ITU-T P. 862, 2001.
[88] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of
speech masked by modulated noise maskers,” IEEE/ACM Transactions on Au-
dio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[89] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of
machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
77
Abstract
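The sounds that occur around us carry a great deal of information, and human
speech is the most representative example. However, environmental sound other
than speech can also be an important factor in understanding the surrounding
environment for user-customized services. For applications that extract speech
information, environmental sound acts as noise to be removed; conversely, for
applications that aim to understand the surrounding environment, it is the
object to be recognized. From this perspective, this thesis proposes deep
learning-based acoustic environment classification and speech enhancement
techniques.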
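First, for acoustic environment classification, we propose a classification
model trained by combining a CNN (convolutional neural network) with an LSTM
(long short-term memory) network. Previously used DNN (deep neural network)
based models had the drawback of being unable to exploit the temporal
information in acoustic signals. To overcome this, we use an LSTM structure to
exploit temporal information and combine it with a CNN structure to exploit the
local correlations between frequency and time in the acoustic signal. Because
the two different models are trained to use complementary information, we
confirmed that acoustic environment classification performance improves over
conventional techniques.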
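Second, we propose a technique that applies source separation to the
classification of overlapping sound events. In real life, different sound
sources often overlap, which makes classification more difficult. To address
this, we train a model that separates the overlapping sound events into
individual sources, separately train a model that classifies each separated
event, and finally combine the two models and train them again (joint
training). The model trained in this way separates overlapping sounds
effectively, which improves the classification of each event.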
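Finally, we propose a speech enhancement technique that applies disentangled
factor learning. The techniques proposed above are applications that recognize
environmental sound, whereas in speech enhancement the goal is to remove all
environmental sound other than speech. The proposed technique treats speech and
noise as separate factors, disentangles the two factors in the latent space,
and estimates clean speech from the speech factor with the noise factor
removed. The speech enhancement technique based on disentangled factor learning
outperformed existing deep learning-based speech enhancement techniques on
several performance measures. We also examined how environmental sound aware
training, which uses environmental sound classification information as prior
information, affects speech enhancement performance.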
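Keywords: speech enhancement, acoustic environment classification, overlapping
sound event classification, source separation, disentangled factor learning,
deep learning
Student Number: 2012-20781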
Acknowledgments
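It has been nearly eight years since I joined the Human Interface Laboratory
with the goal of earning a doctoral degree. As I write this, I find myself
looking back on my time in graduate school and thinking a great deal about how
much I have grown. The master's and doctoral programs were a precious time in
which I learned a great deal, both in research and beyond it. I would like to
thank the people who were with me and helped me so much along the way.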
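First of all, I would like to thank my advisor, Professor Nam Soo Kim. He gave
me many research ideas and much inspiration, and he was my greatest source of
strength until I, with all my shortcomings, could graduate. Beyond research,
seeing how he treats his students taught me the true meaning of a teacher.
Taking his teaching to heart, I will become someone who keeps growing after
going out into society. I also thank Professor Seong-Cheol Kim, Professor
Byonghyo Shim, Professor Joon-Hyuk Chang, and Professor Jong Won Shin, who gave
much advice and help on the weaker parts of my doctoral dissertation. I wish
all the professors who guided me good health and happiness in their lives.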
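I also thank the seniors who graduated before me, who gave me much advice from
the time I was a new student and who are now out in the world bringing honor to
the lab. Changwoo, Junsik, Kiho, Yukwang, Doohwa, Shinjae, Chulmin, Seokjae,
Taegyun, Kisoo, to whom I am most grateful, and Hyunwoo, Sukanya, Seyoung,
Jihwan, and Seokwan, who graduated with master's degrees: I wish all of you a
bright future. Kanghyun, my classmate who graduated before me, drank and
exercised with me and brought energy to my graduate school life. I sometimes
miss the time we went to the conference in New Orleans together. I also thank
the juniors who are still devoting themselves to research. Thanks to them I
graduate with many memories. Congratulations on graduating to Inkyu, who shared
everything with me, from papers and conferences to projects, and I hope he
thrives at his company. I am cheering for Junyeop, who served responsibly as
lab head for a long time and was always easy to talk to, and for Junghoon, who
is likewise strongly responsible in his own work and good company over drinks,
to finish their degrees safely as the next in line to graduate. I hope Sungjun
pushes on a little further so that graduation leads naturally to marriage with
his girlfriend. I hope Hyungyong, the lab's hot issue, popular guy,
three-pointed star, and dieter, and Woohyun, who is showing the essence of how
graduate research should be done, both finish well. Wonik, who came through a
transitional period of hobbies and has blossomed in research, brings back
memories of Kuala Lumpur. Hyunseung, who is smart, reasonably athletic, and
affable as well, would be perfect with a romance added. I envy Joohyun, who
finished a master's degree while working at a company and is now pursuing a
doctorate, for that newly built apartment in Seoul. Byungjin, with a good
physique and a good mindset, works hard and is now married, and is truly
dependable. As for Sunghwan, the quiet champion of diligence, I hope someone
takes him out for a taste of mischief. It is a shame to leave when Minhyun, who
has only just started working out with me, still has plenty of muscles left to
train. I hope Hyungrae, who comes into the lab twice a day, cuts back on canned
coffee and finishes the master's degree safely. Jiwon, who I think could become
a good friend, will surely do well in the doctoral program with the same drive
as now. I am sorry to Seokmin for pestering him after drinking at the year-end
party. I remember everything. To Minchan, with whom I wish I had started
exercising long ago, I entrust Minhyun's hard training. Someone who makes you
wonder how a person can be this good at research: that is Hyungjoo. I hope
Beomjun, the cutie of the downstairs room, is selected as technical research
personnel, and although Byungchan still has a long way to go, I hope he keeps
the orchid growing well until graduation and hands it on to a junior. Gilho
must be busy, but I hope he wraps up his degree well. I hope all of you stay
healthy, achieve the goals you want, and graduate.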