
Ph.D. Dissertation in Engineering

Environmental Sound Classification
and Disentangled Factor Learning
for Speech Enhancement

February 2020

Graduate School of Seoul National University
Department of Electrical and Computer Engineering

배 수 현

Abstract

Sounds, especially human speech, carry a large amount of information about our everyday environment. However, environmental sound can also be an important cue for understanding the surroundings in user-customized services. To an application that extracts speech information, environmental sound is noise to be removed; to an application that extracts environmental information, it is an object to be recognized. From this perspective, we propose deep learning-based acoustic environment classification and speech enhancement techniques.

The goal of acoustic scene classification is to classify a test recording into one

of the predefined acoustic scene classes. In the last few years, deep neural networks

(DNNs) have achieved great success in various learning tasks and have also been

used for the classification of environmental sounds. While DNNs are showing their

potential in the classification task, they cannot fully utilize the temporal informa-

tion. In this thesis, we propose a neural network architecture for the purpose of

using sequential information. The long short-term memory (LSTM) layers extract

the sequential information from consecutive audio features. The convolutional neural

network (CNN) layers learn the spectro-temporal locality from spectrogram images,

and the fully connected layers summarize the outputs of two networks to take ad-

vantage of the complementary features of the LSTM and CNN by combining them.


By using the proposed combination structure, we achieved higher performance com-

pared to the conventional DNN, CNN, and LSTM architectures.

Overlapping acoustic event classification is the task of estimating multiple acous-

tic events in a mixed source. In the case of non-overlapping event classification, many

approaches have achieved great success using various feature extraction methods and

deep learning models. However, in most real-life situations, acoustic events are over-

lapped, and different events may share similar properties. Simultaneously detecting

mixed sources is a challenging problem. In this thesis, we propose a classification

method for overlapping acoustic events that incorporates joint training with the

source separation framework. Since overlapping acoustic events are mixed in multi-

ple sources, we train the source separation model and multi-label classification model

for estimating the type of overlapping acoustic events. The source separation model

is trained to reconstruct the target sources by minimizing the interference of over-

lapping events. Joint training can be conducted to achieve end-to-end optimization

between the acoustic event source separation and multi-label estimation.

Speech enhancement techniques aim to improve the quality and intelligibility of

a given speech degraded by certain additive noise in the background. Most of the

recently proposed deep learning-based speech enhancement techniques have focused

on designing the neural network architectures as a black box. However, it is often

beneficial to understand what kinds of hidden representations the model has learned.

Since the real-world speech data are drawn from a generative process involving

multiple entangled factors, disentangling the speech factor can encourage the trained

model to result in better performance for speech enhancement. With the recent

success in learning disentangled representation using neural networks, we explore

a framework for disentangling speech and noise, which has not been exploited in


conventional speech enhancement algorithms. In this thesis, we propose a novel

noise-invariant speech enhancement method that manipulates the latent features to

distinguish between the speech and noise features in the intermediate layers using an

adversarial training scheme. Experimental results show that our model successfully

disentangles the speech and noise latent features. Consequently, the proposed model

not only achieves better enhancement performance but also offers a more robust noise-invariant property than conventional speech enhancement techniques.

Keywords: Speech enhancement, acoustic scene classification, overlapping acous-

tic event classification, source separation, disentangled factor learning, deep

learning

Student number: 2012-20781


Contents

Abstract
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Environmental Sound Classification
  1.2 Speech Enhancement
  1.3 Disentangled Factor Learning
  1.4 Outline of the thesis

2 Deep Learning Models for Acoustic Scene Classification
  2.1 Introduction
  2.2 Long Short-Term Memory
  2.3 Parallel Combination of LSTM and CNN
    2.3.1 Feature Extraction
    2.3.2 LSTM Layers
    2.3.3 CNN layers
    2.3.4 Connected Layer of LSTM and CNN
  2.4 Experiments
    2.4.1 Dataset and Measurement
    2.4.2 Neural Networks Setup
    2.4.3 Results and Discussion
  2.5 Summary

3 Overlapping Acoustic Event Classification Based on Joint Training with Source Separation
  3.1 Introduction
  3.2 Source Separation of Overlapping Acoustic Event
  3.3 Proposed Method Using Joint Training
    3.3.1 Source Separation Model
    3.3.2 Multi-Label Classification Model
    3.3.3 Joint Training Method
  3.4 Experiments
    3.4.1 Dataset and Data Augmentation
    3.4.2 Experimental Setup
    3.4.3 Evaluation of Source Separation
    3.4.4 Acoustic Event Classification Results
  3.5 Summary

4 Disentangled Feature Learning for Noise-Invariant Speech Enhancement
  4.1 Introduction
  4.2 Masking-Based Speech Enhancement
  4.3 Concept of Domain Adversarial Training
  4.4 Disentangling Speech and Noise Factors
    4.4.1 Neural Network Architecture
    4.4.2 Training Objectives
    4.4.3 Adversarial Training for Disentangled Features
  4.5 Experiments and Results
    4.5.1 Dataset and Feature Extraction
    4.5.2 Network Setup
    4.5.3 Objective Measures
    4.5.4 Performance Evaluation
    4.5.5 Analysis of Noise-Invariant Speech Enhancement
    4.5.6 Disentangled Feature Representations
  4.6 Summary

5 Conclusions

Bibliography
Abstract (in Korean)
Acknowledgments (in Korean)


List of Figures

2.1 Scheme of the proposed method.
2.2 Neural network structure for the proposed technique.
3.1 Scheme of the proposed method.
3.2 Joint training structure for the proposed technique.
3.3 Comparison between separated and integrated source separation models.
3.4 The source separation performance (SDR [dB]).
3.5 The source separation performance (SIR [dB]).
3.6 Results of source separation in the time domain.
3.7 Multi-task learning for overlapping acoustic event classification.
4.1 Scheme of DNN-based speech enhancement method.
4.2 The architecture of the proposed model for disentangled feature learning.
4.3 Plot of losses on training the proposed model.
4.4 The architectures of the baseline models.
4.5 (From top to bottom) The spectrograms of noisy speech degraded by metro noise with −3 dB SNR, enhanced speech by the snT model, enhanced speech by the snDT model, and the corresponding clean speech, respectively.
4.6 Results of subjective preference test (%) comparing the speech quality for the snT and snDT models with various SNR values.
4.7 Variances of PESQ scores for the 14 different noise types in various SNR environments.
4.8 Visualization of speech latent feature (z_s) using t-SNE in the matched noise condition.
4.9 Visualization of speech latent feature (z_s) using t-SNE in the mismatched noise condition.


List of Tables

2.1 Frame-based classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset.
2.2 Segment-based (30 s) classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset. Asterisk (*) CNN-LSTM represents the accuracy on the Evaluation Dataset.
3.1 Precision performance of overlapping acoustic event classification.
3.2 Recall performance of overlapping acoustic event classification.
3.3 F-score of overlapping acoustic event classification.
4.1 Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the matched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.
4.2 Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the mismatched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.


Chapter 1

Introduction

1.1 Environmental Sound Classification

Environmental sound classification, which attempts to classify or detect audio

signals into predetermined classes, constitutes one of the main tasks of the emerg-

ing research field named “machine hearing.” Environmental sound is classified into

two categories. An acoustic scene is a complex environmental sound from multiple

sources, and an acoustic event is a single sound from a specific source.

The goal of acoustic scene classification is to classify a sound into one of the

predefined classes that characterize the environment in which it was recorded. To deal with acoustic scene classification, many approaches have been proposed, including feature representation and classification models. A variety of acoustic features have been used to represent the acoustic scenes and events. Examples include single- or multi-dimensional log-mel spectrograms, wavelet spectrograms, and i-vectors extracted from traditional features such as mel-frequency cepstral coefficients (MFCC) [1]. Moreover, many methods for combining multiple acoustic features have been proposed, such as MFCC, gammatone filter, and log-mel energy [2], or even a wider range of features. Before deep learning was actively studied, the support vector machine (SVM) was one of the most successful learning models in a number of scene classification tasks [3], [4]. Recently, many deep learning-based

in a number of scene classification tasks. [3], [4]. Recently, many deep learning-based

scene classification techniques have been proposed and have shown outstanding per-

formance in classifying acoustic scenes.

An acoustic event is a segment of environmental audio that commonly occurs in everyday life, such as coughing, a phone ringing, or a clashing sound. Acoustic event classification (AEC) and detection (AED) aim to recognize the audio elements inside

an audio clip. Recognizing acoustic events in audio can be utilized in various appli-

cations, including indoor environment recognition [5], surveillance systems [6] and

automatic audio indexing [7]. Recently, as the interest in this area increases, large

datasets [8] were released and challenges such as the detection and classification of

acoustic scenes and events (DCASE) challenge have been held. Research on AED

can be separated into two main scenarios: overlapping and non-overlapping. Overlapping AED is a much more challenging problem due to the mixture of acoustic sources, and it is considered more important because acoustic events often overlap with each other in real-life recordings.

1.2 Speech Enhancement

Speech enhancement techniques aim to improve the quality and intelligibility of

a given speech degraded by certain additive noise in the background. In a variety

of applications, speech enhancement is considered as an essential pre-processing

step. This technique can be directly employed to improve the quality of mobile


communications in noisy environments or to enhance speech signals for hearing aid

devices [9], [10] before amplification. Speech enhancement has also been widely used

as a pre-processing technique in automatic speech recognition (ASR) [11], [12] and

speaker recognition systems [13] for more robust performances.

Over the past several decades, myriads of approaches have been developed in

the speech research community for better speech enhancement. Spectral subtraction

method [14] suppresses stationary noise from the input noisy speech by subtract-

ing the spectral noise bias computed during the non-speech activity periods. The

minimum mean-square error (MMSE) based spectral amplitude estimator [15], [16]

showed promising results in terms of reducing residual noise as compared to the

spectral subtraction method or Wiener filtering-based algorithm [17]. Least mean

square adaptive filtering (LMSAF) based speech enhancement approaches can approach the optimal filtering performance of the Wiener filter. Moreover, they do not need a priori knowledge and can adapt to the external environment by self-learning. However, these approaches have some disadvantages, including slow convergence, strong sensitivity to non-stationary noise, and a trade-off between convergence and stability [18], [19]. The minima controlled recursive averaging (MCRA) noise estimation was also introduced in [20], whose performance is known to be reasonably competitive in environments with relatively high signal-to-noise ratios (SNR).

However, since these statistical models are constructed based on a stationarity as-

sumption, their performances generally tend to deteriorate in low SNR or highly

non-stationary noise conditions. Non-negative matrix factorization (NMF) is one

of the most common template-based approaches to speech enhancement [21], [22],

which models noisy observations as a weighted sum of non-negative source bases.

NMF-based speech enhancement methods are more robust to non-stationary noise


conditions as compared to the statistical model-based methods. These approaches,

however, often result in signal distortion in the enhanced speech since they are based

on an unrealistic assumption that speech spectrograms are linear combinations of the

basis spectra. Due to the complex nature of the noise corruption process, non-linear

models such as deep neural networks (DNNs) have been suggested as an alternative

choice for modeling the relationship between the noisy and the corresponding clean

speech utterances.

1.3 Disentangled Factor Learning

The real-world speech data are drawn from a generative process involving mul-

tiple entangled factors. A challenge in understanding speech data is learning to dis-

entangle the underlying factors of variation that give rise to the observations. The

factors of variation involved in generating a speech recording include the speaker’s

attributes as well as noise and channel information. The difficulty of disentangling

these hidden factors is that, in most real-world situations, each can influence the

observation in a different and unpredictable way. By separating the desired factors,

disentangled factor learning can be helpful to improve the performance of the task

to be solved. In this thesis, we propose a method to disentangle a factor with speech

components and a factor with noise properties from the noisy speech input.

1.4 Outline of the thesis

In this thesis, motivated by the success of DNNs in the speech processing area, we adopt deep learning approaches for environmental sound classification and speech enhancement.


In Chapter 2, we propose a neural network architecture for the purpose of using

sequential information. The proposed structure is composed of two separated lower

networks and one upper network. We refer to these as long short-term memory

(LSTM) layers, convolutional neural network (CNN) layers and connected layers,

respectively. The LSTM layers extract the sequential information from consecutive

audio features. The CNN layers learn the spectro-temporal locality from spectrogram

images. Finally, the connected layers summarize the outputs of two networks to

take advantage of the complementary features of the LSTM and CNN by combining

them. To compare the proposed method with other neural networks, we conducted

a number of experiments on the TUT acoustic scenes 2016 dataset which consists

of recordings from various acoustic scenes.

In Chapter 3, we propose a classification method for overlapping acoustic events

which incorporates joint training with source separation framework. Since overlap-

ping acoustic events are mixed in multiple sources, we train the source separation

model and multi-label classification model for estimating the type of overlapping

acoustic events. The source separation model is trained to reconstruct the target

sources by minimizing the interference of overlapping events. Joint training can be

conducted to achieve end-to-end optimization between the acoustic event source sep-

aration and multi-label estimation. To evaluate the proposed method, we conducted

a number of experiments using artificially mixed data.

In Chapter 4, we propose a novel noise-invariant speech enhancement method

which manipulates the latent features to distinguish between the speech and noise

features in the intermediate layers using an adversarial training scheme. To compare

the performance of the proposed method with other conventional algorithms, we

conducted experiments in both the matched and mismatched noise conditions using


TIMIT and TSPspeech datasets.

The rest of the thesis is organized as follows: The next Chapter introduces the

proposed acoustic scene classification method using parallel combination of LSTM

and CNN. In Chapter 3, a joint training with source separation is proposed for over-

lapping acoustic event classification. Finally, a novel speech enhancement method

using disentangled feature learning is proposed in Chapter 4. The conclusions are

drawn in Chapter 5.


Chapter 2

Deep Learning Models for

Acoustic Scene Classification

2.1 Introduction

Acoustic scene classification aims to recognize the environmental sounds that oc-

cur for a period of time. Many approaches have been proposed for acoustic scene clas-

sification including feature representation, classification models, and post-processing.

The support vector machine (SVM) was one of the most successful learning models in a number of scene classification tasks. As the SVM is a binary classifier, some additional methods must be combined to apply it to multi-class problems, such as the use of tree or clustering schemes [3], [4]. Furthermore, many machine learning-based

scene classification techniques were proposed in the detection and classification of

acoustic scenes and events (DCASE) challenge 2013 [23]–[25].

However, as deep learning techniques have been widely used on various learning

tasks, researchers have started to apply them to acoustic scene classification as well


[26], [27]. In [28], a DNN-based sound event classification algorithm was performed

with several image features.

Deep neural networks (DNNs) are powerful pattern classifiers that can learn highly nonlinear relationships between the input features and output targets. Though DNNs work well in the classification task, they cannot be used to map sequences to sequences because of their structural limitations. To

overcome this shortcoming, recurrent neural networks (RNNs) and long short-term

memory (LSTM), which is a special type of RNN, have been applied to sequence

learning [29].

DNNs can only map from the present input vector to an output vector, whereas LSTM can map from an input sequence to an output sequence or vector. Therefore, LSTM can learn temporal information through consecutive input vectors. The authors in [30] and [31] proposed sound event detection techniques based on bi-directional LSTM, which yielded higher performance compared to DNNs. Unlike sound events, which occur in a short time frame, acoustic scenes are maintained over a relatively longer range. Thus, applying RNNs to acoustic scene classification is expected to improve the performance.

Other approaches were proposed to use convolutional neural networks (CNNs)

with spectrogram image features (SIF) [32]. In [33], the authors addressed the im-

portance of spectro-temporal locality and proposed a CNN-based acoustic event

detection algorithm.

In this chapter, we propose to combine the LSTM and CNNs in parallel as

lower networks in order to exploit sequential correlation and local spectro-temporal

information. In the LSTM layers, sequences of Mel-frequency cepstral coefficients

(MFCCs) features are utilized as input in order to extract the sequential information.


The CNN layers learn the spectro-temporal locality from SIF, and SIF clips are

set to have the same length as the timesteps of the LSTM inputs. The outputs of

the two separated layers are combined by the connected layers which are able to

learn complementary features of LSTM and CNN. To compare the performance

of the proposed method with various neural networks, we conducted a number of

experiments on the TUT acoustic scenes 2016 dataset [34]. The results revealed that

the combination of LSTM and CNN outperforms the conventional DNN, CNN and

LSTM architecture with respect to classification accuracy.

2.2 Long Short-Term Memory

The key idea of RNNs is that recurrent connections between the hidden layers allow the memory of previous inputs to be retained as an internal state, which can affect the outputs. However, RNNs mainly have two issues to solve in the training phase: the vanishing gradient and exploding gradient problems [35]. When computing the derivatives of the activation function in the back-propagation process, long-term components may go to zero exponentially fast. This makes it hard for the model to learn correlations between temporally distant inputs. Meanwhile, when the gradient grows exponentially during training, the exploding gradient problem occurs. In order to solve these problems, the LSTM architecture was proposed [36]. LSTM layers are composed of recurrently connected memory blocks in which each memory cell contains three multiplicative gates. The gates perform continuous analogues of write, read, and reset operations, which enable the network to utilize temporal information over a period of time.


Figure 2.1: Scheme of the proposed method.

2.3 Parallel Combination of LSTM and CNN

In this section, we describe our approach to improving the classification accuracy of acoustic scenes. The schematic of the proposed training and test procedure is illustrated in Figure 2.1, and the neural network structure can be seen in Figure 2.2.

2.3.1 Feature Extraction

In the proposed system, different types of neural networks are combined in par-

allel. Thus, each network accepts a different form of input feature. The LSTM layers utilize a sequence of acoustic features, while the CNN layers use spectrogram images. As inputs for the CNN layers, the SIF are extracted from the sound spectrogram [28], [32], [37]. First, a spectrogram is generated by the short-time Fourier transform. Given an audio frame s(n) of length N and a Hamming window w(n), the short-time spectral column F(f, t) at time t is computed as

F(f, t) = \left| \sum_{n=0}^{N-1} s(n)\, w(n)\, e^{-j 2\pi n f / N} \right|,   (2.1)

for f = 0, ..., N/2. In order to generate a spectrogram image with K-bin frequency resolution, down-sampling is performed using a window of length W = N/2K as follows:

F_{down}(f, t) = \frac{1}{W} \sum_{i=0}^{W-1} F(fW + i, t),   (2.2)

for f = 0, ..., K − 1. Finally, a simple de-noising step is performed by subtracting the minimum value of each frequency bin as follows:

F_{dn}(f, t) = F_{down}(f, t) − \min_{t} \{ F_{down}(f, t) \},   (2.3)

for f = 0, ..., K − 1. In the proposed system, the extracted SIF has size K × τ,

where τ represents the time resolution which is also identical to the timesteps in the

LSTM layers.
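For concreteness, the SIF extraction of (2.1)–(2.3) can be sketched in Python/NumPy as follows; the frame length N, hop size, and bin count K below are illustrative placeholders rather than the exact experimental settings, and the minimum in (2.3) is taken over time for each frequency bin, following the equation above.

import numpy as np

def extract_sif(signal, N=1024, hop=512, K=40):
    # Spectrogram image feature: STFT magnitude (2.1), frequency
    # down-sampling to K bins (2.2), and simple de-noising (2.3).
    window = np.hamming(N)
    frames = np.stack([signal[i:i + N] * window
                       for i in range(0, len(signal) - N + 1, hop)])
    F = np.abs(np.fft.rfft(frames, n=N, axis=1)).T          # (N/2 + 1, T), Eq. (2.1)
    W = N // (2 * K)                                        # down-sampling window length
    F_down = F[:K * W].reshape(K, W, -1).mean(axis=1)       # (K, T), Eq. (2.2)
    F_dn = F_down - F_down.min(axis=1, keepdims=True)       # Eq. (2.3)
    return F_dn                                             # clipped to K x tau before use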


Figure 2.2: Neural network structure for the proposed technique.


2.3.2 LSTM Layers

The hidden layers of LSTM have self-recurrent weights. These enable the cell

in the memory block to retain previous information. In the proposed system, τ

vectors are used for sequential learning. The lower part in Figure 2.2 depicts how

the sequences are trained through the LSTM layers. Previous τ − 1 vectors and one

present vector are forwarded to the recurrent layer sequentially. If the MFCC vectors from x_{t−τ+1} to x_t are used as the present input sequence, the vectors from x_{t−τ+2} to x_{t+1} will be used as the next input sequence. The output vector z_t^{LSTM} is extracted from the input MFCC sequence x_t^{LSTM} through the LSTM layers, where x_t^{LSTM} = [x_{t−τ+1}, ..., x_t].

2.3.3 CNN layers

As described in Section 2.3.1, the SIF x_t^{CNN}, which is an F × τ matrix, is extracted. The convolutional layer performs 2-dimensional convolution between the spectrogram image

and the pre-defined linear filters. To enable the network to extract complementary

features and learn the characteristics of input SIF, a number of filters with different

functions are used. Thus, if we apply K different filters to the spectrogram image,

K different filtered images are generated in the convolutional layer. The filtered

spectrogram images are forwarded to the pooling layer which conducts down sam-

pling. Especially, max pooling divides the input image into a set of non-overlapping

sub-regions and selects the maximum value. By reducing the spatial size of repre-

sentation via pooling, the most dominant feature in the sub-region is extracted. The

pooling layer operates independently on every filtered image and resizes them spa-

tially. In the last pooling layer, the resized outputs are rearranged in order to fully

connect with the upper layer. The flattened output vector z_t^{CNN} is extracted from x_t^{CNN} through the CNN layers.

2.3.4 Connected Layer of LSTM and CNN

In [38], the long-term recurrent convolutional network (LRCN) model was proposed for visual recognition. LRCN is a consecutive structure of CNN and LSTM: it processes the variable-length input with a CNN, whose outputs are fed into an LSTM network, which finally predicts the class of the input. In [39], a cascade structure was used for voice search. Compared to the methods mentioned above, the proposed network forms a parallel structure in which the LSTM and CNN accept different inputs separately. The concatenated vector z_t^{concat} is forwarded to the fully connected layers, where z_t^{concat} = [z_t^{LSTM}, z_t^{CNN}]. The connected layers can learn the complementary information of the LSTM and CNN. This enables the proposed model to learn the sequential information and the spectro-temporal information simultaneously. Finally, the class probability y_t is predicted through the softmax layer.
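The parallel combination can be written compactly; the following PyTorch sketch follows the layer sizes of Section 2.4.2 (two 256-unit LSTM layers, two convolution/pooling stages, and 512-unit connected layers), but details such as the padding and the use of the last LSTM time step are assumptions made for illustration.

import torch
import torch.nn as nn

class ParallelLSTMCNN(nn.Module):
    # Parallel LSTM and CNN branches whose outputs are concatenated (z_concat)
    # and summarized by fully connected layers, as in Figure 2.2.
    def __init__(self, n_mfcc=60, n_freq=40, timesteps=40, n_classes=15):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, 256, num_layers=2, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2))
        cnn_dim = 16 * (n_freq // 4) * (timesteps // 4)
        self.connected = nn.Sequential(
            nn.Linear(256 + cnn_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_classes))

    def forward(self, x_lstm, x_cnn):
        # x_lstm: (batch, timesteps, n_mfcc) MFCC sequence
        # x_cnn:  (batch, 1, n_freq, timesteps) spectrogram image
        z_lstm = self.lstm(x_lstm)[0][:, -1]       # z_t^LSTM from the last time step
        z_cnn = self.cnn(x_cnn).flatten(1)         # z_t^CNN, flattened
        z = torch.cat([z_lstm, z_cnn], dim=1)      # z_t^concat
        return self.connected(z)                   # class logits; softmax gives y_t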

2.4 Experiments

2.4.1 Dataset and Measurement

To assess the performance of the proposed method, we conducted a number of

experiments on the TUT acoustic scenes 2016 dataset which consists of recordings

from various acoustic scenes. The dataset contains 1170 recordings of total 9.75 hours

with 15 different classes. Audio signals sampled at 44.1 kHz sampling frequency were

divided into 40 ms frames with 50% hop size. Experiments were conducted using 4-

fold cross validation. The final results were obtained by averaging over all evaluation

folds.


We evaluated the classification accuracy using two measures: frame-based ac-

curacy and segment (30s)-based accuracy. Due to the softmax output layer of our

networks, probability distributions among the J class labels were obtained individ-

ually. Given z_t^{concat}, the predicted class label at frame t was computed as

C_{frame} = \arg\max_{j} P(y_t = j \mid z_t^{concat}),   (2.4)

where j denotes the class index. To obtain the class label of an entire audio segment, the log-likelihood was accumulated over all frames as follows:

C_{segment} = \arg\max_{j} \sum_{t=1}^{T} \log P(y_t = j \mid z_t^{concat}),   (2.5)

where T represents the number of frames in one audio segment.
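These two decision rules reduce to an arg-max over per-frame posteriors and over their summed logs; a minimal NumPy sketch (the small epsilon is only a numerical guard added here):

import numpy as np

def frame_and_segment_decisions(probs):
    # probs: (T, J) matrix of per-frame posteriors P(y_t = j | z_t^concat)
    frame_labels = probs.argmax(axis=1)                          # Eq. (2.4)
    segment_label = np.log(probs + 1e-12).sum(axis=0).argmax()   # Eq. (2.5)
    return frame_labels, segment_label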

2.4.2 Neural Networks Setup

All networks in our experiments were trained using the mean squared error as the loss function, supervised by one-hot encoded class vectors. The number of randomly ordered mini-batches in each epoch was set to 256. After a mini-batch was processed, the weights were updated using Adadelta [40]. In order to mitigate the over-fitting problem in the training phase, we used the dropout technique, which has already proved its regularization capability [41]. The output layer contained 15 softmax nodes, identical to the number of scenes.

As a baseline system, we built a DNN which has three hidden layers with 512

hidden units each and used the ReLU activation in the hidden layers. The input

features were 60-dimensional MFCC features including both delta and acceleration


MFCC coefficients. The input layer was composed of a concatenation of 9 input frames

(the current frame and the four previous and four next frames) resulting in 540 input

units. To regularize the network, we used dropout with a probability of 40% for all

hidden layers.
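The 9-frame splicing that yields the 540-dimensional DNN input can be illustrated as below; edge padding at the utterance boundaries is an assumption, as the exact handling of boundary frames is not specified here.

import numpy as np

def splice_frames(mfcc, context=4):
    # Stack each frame with its `context` previous and next frames:
    # (T, 60) MFCC + delta + acceleration features -> (T, 9 * 60) = (T, 540).
    T, _ = mfcc.shape
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])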

The CNN architecture for the baseline system comprised two convolutional lay-

ers, two pooling layers, and one fully connected layer with a softmax layer on top. The input features were SIF of size F × τ, where F = 40 and τ = 40. In the first convolutional layer, the input SIF is convolved with 32 filters of fixed size 5×5. The first pooling layer then reduces the size of the filtered SIF. We utilized max-pooling with kernel size 2×2 for all pooling layers. ReLU was applied as the activation function. The second convolutional layer performs convolution between the output of the pooling layer and 16 filters of fixed size 5×5. After the second pooling is performed, the flattened output is connected to a fully connected layer with 512 units. Dropout

was only used after the second pooling layer and the fully connected layer with

probabilities 30% and 40%, respectively.

The LSTM baseline had two hidden layers with 256 LSTM units each and one feed-forward layer with 512 ReLU units. The structure of the two LSTM layers is identical to the lower part of Figure 2.2. The input sequence consisted of 40 frames of 60-dimensional MFCC features. Dropout was applied with a probability of 40% for all layers. The output layer was identical to that described above.

As the proposed system, we built a combined structure of LSTM and CNN in parallel. The network setup and structure of the LSTM part and CNN part were identical to the aforementioned networks. To combine and further train the two separated networks, we used fully connected layers. The connected layers consisted of two hidden layers with 512 ReLU units each.


Table 2.1: Frame-based classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset.

Scene              DNN     CNN     LSTM    CNN-LSTM
beach              76.56   65.29   79.86   81.26
bus                44.69   62.61   56.21   60.99
cafe/restaurant    47.79   61.89   57.72   57.12
car                75.49   71.11   85.51   80.57
city center        80.41   79.13   89.26   91.25
forest path        87.24   72.15   91.69   92.22
grocery store      77.19   57.39   83.07   84.71
home               66.28   72.71   52.70   55.39
library            64.07   71.27   69.29   72.55
metro station      85.71   85.76   82.52   82.47
office             83.40   78.93   82.97   89.09
park               38.24   36.11   48.89   43.88
residential area   61.87   51.71   52.54   57.74
train              22.46   38.87   24.42   38.21
tram               73.57   56.82   72.99   76.46
Overall acc        65.66   64.12   68.64   70.92

2.4.3 Results and Discussion

We compared the average accuracies over all scenes for the conventional DNN,

CNN, LSTM, and the proposed network. The frame-based classification results are

given in Table 2.1. Table 2.2 shows the segment-based classification accuracy, where the 'correct' row represents the number of correctly classified segments among the total of 1170 segments. The proposed method achieved higher accuracy than the other networks in both frame-based and segment-based classification.

Though the combined neural network achieved higher performance on average, it did not give the best classification results across all scenes. In the bus case, CNN outperformed the other networks. In the park case, LSTM had a better result. In the


Table 2.2: Segment-based (30 s) classification accuracy (%) on IEEE DCASE 2016 Challenge Task 1 Development Dataset. The asterisked (*) CNN-LSTM column represents the accuracy on the Evaluation Dataset.

Scene              Base.   DNN     CNN     LSTM    CNN-LSTM   *CNN-LSTM
beach              69.3    84.62   73.08   88.46   88.46      84.6
bus                79.6    51.28   88.46   67.95   65.38      100
cafe/rest.         83.2    58.97   73.08   67.95   60.26      61.5
car                87.2    78.21   73.08   88.46   89.74      88.5
city center        85.5    92.31   91.03   93.59   97.44      92.3
forest path        81.0    93.59   82.05   98.72   97.44      100
grocery store      65.0    83.33   71.79   85.90   91.03      96.2
home               82.1    80.77   89.74   64.10   70.51      88.5
library            50.4    75.64   83.33   76.92   76.92      46.2
metro station      94.7    94.87   100.0   92.31   94.87      88.5
office             98.6    93.59   96.15   87.18   96.15      100
park               13.9    41.03   43.59   57.69   52.56      96.2
resident. area     77.7    87.18   75.64   73.08   74.36      65.4
train              34.9    25.64   46.15   29.49   43.59      53.8
tram               85.4    88.46   82.05   88.46   88.46      100
correct            -       881     912     905     926        -
Overall acc        72.6    75.30   77.95   77.35   79.15      84.1

residential area case, DNN achieved higher performance. This can be interpreted as an indication that the proposed network cannot fully learn some acoustic scenes, and that these scenes may not contain enough temporal information. Future research will deal with a more

robust network architecture to extract distinct features of acoustic scenes.

The proposed method was found to improve the classification performance and achieved an average accuracy of 79.15%. The baseline accuracy of the audio scene classification task in the DCASE 2016 challenge [34], which was based on MFCCs and GMMs, was 72.6%; our method improved the accuracy by 6.6 percentage points. Finally, the accuracy on the evaluation dataset was 84.1%.


2.5 Summary

In this chapter, in order to enhance the classification accuracy of acoustic scenes,

we proposed a novel neural network structure which achieved higher performance compared with the conventional DNN, CNN, and LSTM architectures in terms of both frame-based and segment-based accuracy. In the segment-based classification results, the proposed technique obtained improvements of 3.85, 1.2, and 1.8 percentage points in comparison with the DNN, CNN, and LSTM architectures, respectively. By combining

different networks in parallel, the proposed method was able to learn complementary

information of LSTM and CNN.


Chapter 3

Overlapping Acoustic Event

Classification Based on Joint

Training with Source Separation

3.1 Introduction

For a decade, there have been many studies to address the problem of detect-

ing overlapping events from audio. In [42], the author proposed context-dependent

hidden Markov models (HMMs) with multiple-path decoding. A non-negative matrix factorization (NMF) approach has also been utilized to separate overlapping

events via dictionary learning [43]. Other approaches were proposed, such as using

connectionist temporal classification (CTC) [44], linear dynamical systems for over-

lapping sound event tracking [45] and feature representation for AED [46]. More

recently, various neural network models have been quite successful in the AED area. In [47], multi-label deep neural networks (DNNs) were proposed for detecting temporally overlapping sound events, and the author in [31] used bi-directional long short-term memory (BLSTM).

With regard to AED, although neural networks are able to learn the non-linear relationship between the input and output, they cannot fully utilize the information of each source in the mixture. The additive property of sound sources makes it difficult to find robust features for recognizing them in overlapping audio. Thus, we propose a neural network for overlapping AEC which is optimized by joint training of a source separation model and a multi-label classification model. The source separation model is trained to reconstruct the target sources from an unknown overlapping event, which helps the model decompose the mixture. The classification model learns the properties of overlapping events from the reference sources. After that, the two models are combined and jointly trained, so that the model can be optimized to minimize the interference of overlapping events and estimate the labels of mixed events directly.

The remainder of this chapter is organized as follows: Section 3.2 presents the

problem formulation of source separation for overlapping AEC. The proposed ap-

proach of using joint training for AEC is described in Section 3.3. Section 3.4 presents

the experimental results, and Section 3.5 provides conclusions and future work.

3.2 Source Separation of Overlapping Acoustic Event

The main objective of source separation is to estimate one or more sources from

a given mixed source signal. This can serve as an intermediate step for other tasks.

Since overlapping acoustic events are also mixture of multiple signals, source sep-

aration framework can be applied to AEC. In [48], unsupervised source separation


was used as a pre-processor for overlapping AED. Unlike this approach, the pro-

posed system is trained as a single model including source separation and event

classification.

In this section, we focus on source separation of overlapping acoustic events.

Given target sources s_1(t) and s_2(t), we define S_1(t, f), S_2(t, f), and Y(t, f) as the short-time Fourier transform (STFT) coefficients of s_1(t), s_2(t), and the mixed signal y(t), respectively, where t represents the frame index and f the frequency bin. Due to the linearity of the STFT, the source separation problem can be defined as follows:

y(t) = s_1(t) + s_2(t),
Y(t, f) = S_1(t, f) + S_2(t, f).   (3.1)

In the source separation framework, the magnitude spectrogram of the mixture signal can be approximated as the sum of the magnitude spectra of each source as follows:

|Y(t, f)| \approx |S_1(t, f)| + |S_2(t, f)|.   (3.2)

For a specific time frame t, the magnitude spectra can be written in vector form as follows:

y_t \approx s_{1t} + s_{2t},   (3.3)

where y_t \in R^F, s_{1t} \in R^F, and s_{2t} \in R^F denote the magnitude spectra of the mixture and of the two target acoustic events at time frame t, respectively, and F is the spectral magnitude dimension. Hence, the goal of event separation is to find s_1 and s_2 using the mixture training data and reference event data.
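The approximation in (3.2) ignores the phase interaction between the two sources; a short SciPy sketch (the signal variables and STFT settings here are placeholders) shows how the deviation can be measured on any pair of event waveforms.

import numpy as np
from scipy.signal import stft

def magnitude_additivity_gap(s1, s2, fs=16000, nperseg=512):
    # Compare |Y(t, f)| with |S1(t, f)| + |S2(t, f)| for the mixture y = s1 + s2.
    _, _, S1 = stft(s1, fs=fs, nperseg=nperseg)
    _, _, S2 = stft(s2, fs=fs, nperseg=nperseg)
    _, _, Y = stft(s1 + s2, fs=fs, nperseg=nperseg)
    return np.mean(np.abs(np.abs(Y) - (np.abs(S1) + np.abs(S2))))  # deviation from Eq. (3.2)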


Figure 3.1: Scheme of the proposed method.

3.3 Proposed Method Using Joint Training

In this section, we describe the proposed neural network training scheme for

improving the AEC performance. The schematic of the proposed training and test

procedure is illustrated in Figure 3.1, and the neural network structure can be seen in Figure 3.2.

3.3.1 Source Separation Model

Various DNN-based approaches have been proposed to address the monaural source separation problem [49]–[51]. In order to obtain estimates of the individual events from overlapping acoustic events, we exploit the DNN framework for source separation. Given the input mixture features y_t, we obtain the output estimates y_{1t} and y_{2t} from the network. In the training process, a discriminative objective function is used in order to regularize the reconstruction error.


Figure 3.2: Joint training structure for the proposed technique.


As defined in [49], the objective function is

L(t) = \|y_{1t} - s_{1t}\|^2 + \|y_{2t} - s_{2t}\|^2 - \gamma \|y_{1t} - s_{2t}\|^2 - \gamma \|y_{2t} - s_{1t}\|^2,   (3.4)

where ‖ · ‖ indicates the l2-norm and γ denotes the regularization parameter which

adjusts the trade-off between the reconstruction error and the discrimination infor-

mation. In order to estimate each source, the soft time-frequency mask m_t \in R^F is calculated as follows:

m_t = \frac{|y_{1t}|}{|y_{1t}| + |y_{2t}|}.   (3.5)

Then Wiener filtering can be used to reconstruct the magnitude spectra of each acoustic event source as follows:

\hat{s}_{1t} = m_t \otimes y_t,
\hat{s}_{2t} = (1 − m_t) \otimes y_t,   (3.6)

where the division in (3.5) is performed element-wise and \otimes indicates element-wise multiplication. The source separation model is trained with the mixture source y_t as the input and the concatenated reference sources [s_{1t}, s_{2t}] as the target. This process is indicated in Figure 3.2 by the solid blue line box.
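The separation objective (3.4) and the masking and Wiener-filtering steps (3.5)–(3.6) translate into a few lines of PyTorch; the value of the regularization parameter gamma and the small epsilon are assumptions made for illustration, not values taken from this thesis.

import torch

def discriminative_loss(y1_hat, y2_hat, s1, s2, gamma=0.05):
    # Eq. (3.4): reconstruction errors minus gamma-weighted cross terms.
    sq = lambda a, b: ((a - b) ** 2).sum(dim=1).mean()
    return (sq(y1_hat, s1) + sq(y2_hat, s2)
            - gamma * sq(y1_hat, s2) - gamma * sq(y2_hat, s1))

def wiener_masking(y1_hat, y2_hat, y_mix, eps=1e-8):
    # Eq. (3.5): soft mask from the two network outputs;
    # Eq. (3.6): element-wise application of the mask to the mixture spectrum.
    m = y1_hat.abs() / (y1_hat.abs() + y2_hat.abs() + eps)
    return m * y_mix, (1.0 - m) * y_mix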

3.3.2 Multi-Label Classification Model

Multi-label neural networks are utilized for the detection of temporally overlapping acoustic events [47]. In the training stage of multi-label classification, the network learns the mapping between the reference sources [s_{1t}, s_{2t}] as the input and the corresponding target output a_t, where a_t \in R^I indicates the true multi-label vector of the overlapping acoustic events and I is the number of acoustic event classes. This process is shown in Figure 3.2 by the red dashed line box.

3.3.3 Joint Training Method

Jointly trained models have achieved improvement in various learning tasks, es-

pecially in the speech recognition area. Motivated by the good performance of the

joint training scheme shown in [52]–[54], we use this technique in order to improve

AEC performance. AEC is also well suited to joint training because the source separation and event classification models are trained with different objectives.

After the two networks are trained, they are combined to form a single network and further trained jointly. In the training phase, the network is trained with the mixture source y_t as the input and the true label a_t as the output. As shown in Figure 3.2, the weights of the unified network are adjusted using back-propagation. As a result, the network is trained to utilize the information of the separated sources implicitly. This helps the network to estimate acoustic events from the mixture source.
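Conceptually, joint training amounts to cascading the two pretrained networks and fine-tuning the whole cascade with only the multi-label targets; the PyTorch-style sketch below is a simplified rendering in which the context framing of the separator input is omitted and the module names are placeholders.

import torch
import torch.nn as nn

class JointAECModel(nn.Module):
    # Cascade of a pretrained separation network and a pretrained multi-label
    # classifier, fine-tuned end-to-end as described in Section 3.3.3.
    def __init__(self, separator, classifier):
        super().__init__()
        self.separator = separator      # y_t -> (y1_hat, y2_hat)
        self.classifier = classifier    # [s1_hat, s2_hat] -> label estimate a_t

    def forward(self, y_mix):
        y1_hat, y2_hat = self.separator(y_mix)
        m = y1_hat.abs() / (y1_hat.abs() + y2_hat.abs() + 1e-8)    # Eq. (3.5)
        s1_hat, s2_hat = m * y_mix, (1.0 - m) * y_mix              # Eq. (3.6)
        return self.classifier(torch.cat([s1_hat, s2_hat], dim=1))

# Fine-tuning (sketch): only the true labels a_t supervise the unified network.
# for y_mix, a_true in loader:
#     loss = nn.functional.mse_loss(model(y_mix), a_true)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()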

3.4 Experiments

3.4.1 Dataset and Data Augmentation

In order to evaluate the performance of the proposed method, we conducted a

set of acoustic event source separation experiments using the IEEE DCASE 2016

Challenge Task 2 Train Datasets [55]. The training dataset consists of 20 isolated

sound events per event class. We selected four acoustic events: coughing, keyboard

typing, page turning and speech, and used them to construct a mixed source dataset.


Six different types of mixture datasets were generated in the source mixing process (4C2 = 6). Unlike most speech datasets, which usually consist of hours of data or more, conventional sound event datasets are not large enough to train a robust DNN model. In order to tackle this insufficient-data problem, a data augmentation approach was used for training the DNN. To construct diverse source mixtures from the small dataset, acoustic events were artificially generated by time stretching. Finally, various mixture combinations of two acoustic events were produced at 0 dB SNR.
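A sketch of this augmentation and 0 dB mixing is given below; the stretch rates and the use of librosa are illustrative assumptions, not the exact recipe used here.

import numpy as np
import librosa

def augment_and_mix(event_a, event_b, rates=(0.9, 1.0, 1.1)):
    # Time-stretch two isolated events and mix every stretched pair at 0 dB SNR.
    stretched_a = [librosa.effects.time_stretch(event_a, rate=r) for r in rates]
    stretched_b = [librosa.effects.time_stretch(event_b, rate=r) for r in rates]
    mixtures = []
    for a in stretched_a:
        for b in stretched_b:
            n = min(len(a), len(b))
            x, d = a[:n], b[:n]
            d = d * np.sqrt(np.sum(x ** 2) / (np.sum(d ** 2) + 1e-12))  # equalize energies (0 dB SNR)
            mixtures.append(x + d)
    return mixtures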

3.4.2 Experimental Setup

The dataset was sampled at 16 kHz, and the magnitude spectrograms were calculated using the STFT. A Hamming window of 512-point length with 75% overlap was applied, and the FFT was taken at 512 points. Only the first 257 FFT points were used since the remaining 255 points are the complex conjugates of the first half.

As a regression model for source separation training, we built a DNN with two hidden layers of 1000 rectified linear units (ReLU) each. The input features were 257×7-dimensional (the current frame and the three previous and three next frames of the mixture source), and the output was 257×2-dimensional (regression of the two target sources) with sigmoid units. Equation (3.4) was used as the loss function, and the number of randomly ordered mini-batches in each epoch was set to 100. After processing each mini-batch, the weights were updated using Adam [56]. In order to mitigate the over-fitting problem in the training phase, dropout was applied with a probability of 30% for all hidden layers.

In order to predict the labels of overlapping acoustic events, we also trained a


Figure 3.3: Comparison between separated and integrated source separation models.

DNN consisting of two hidden layers with 1000 ReLU nodes each. The input features were 257×2-dimensional (the two separated sources), and the output was a 4-dimensional softmax layer (one node per acoustic event label). The mean squared error was used as the loss function, and the other settings were equivalent to those of the source separation model.

After training the source separation model and the multi-label classification

model, the two networks were cascaded to form a single larger network, and the weights

of the unified network were adjusted using back-propagation.


3.4.3 Evaluation of Source Separation

In many two source separation tasks, a single network is trained to estimate a

source pair. However, in the proposed source separation network, a single network es-

timates six source pairs (4C2 = 6). This means that if the source separation network

do not estimate the target sources well, the jointly trained network may show simi-

lar performance to the baseline network which has an identical structure including

model size and hyper-parameters, but without applying the joint training scheme.

In order to verify this point, we compared the source separation performance of the

proposed method and networks which were trained using a mixture dataset, where

each recordings consist of only two target sources as shown in Figure 3.3. Alphabets

denote the acoustic event name, C: Coughing, K: Keyboard typing, P: Page turning

and S: Speech. The ‘C-K’ means that the mixture source includes coughing sound

and keyboard typing sound. The ‘separated training’ indicates that a single network

was trained using only a mixture dataset. Thus, total six networks were produced.

The ‘integrated training’ means that a single network was trained using whole six

pair datasets.

The performance of source separation was evaluated in terms of the signal to

distortion ratio (SDR) and signal to interference ratio (SIR) [57]. Figure 3.4 and

Figure 3.5 show the source separation performance. As shown in the figures, although

the performance is degraded, the proposed source separation network is enough to

provide each source information to multi-label classification network. Finally, An

example of source separation in the time domain can be seen in Figure 3.6.


Figure 3.4: The source separation performance (SDR [dB]); the averages of the two training schemes are 7.08 dB and 5.81 dB.

Figure 3.5: The source separation performance (SIR [dB]); the averages of the two training schemes are 12.64 dB and 9.88 dB.


Figure 3.6: Results of source separation in the time domain (the overlapping mixture, the two reconstructed sources, and the two reference sources).

3.4.4 Acoustic Event Classification Results

To evaluate the performance of the proposed method, we calculated the numbers of correct, missed, and false alarm events. The precision, recall, and F-score are calculated as follows:

precision = \frac{correct}{correct + false\ alarm},   (3.7)

recall = \frac{correct}{correct + missed},   (3.8)

F\text{-}score = \frac{2 \times precision \times recall}{precision + recall}.   (3.9)
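In code, these event-based metrics are a direct transcription of (3.7)–(3.9):

def event_metrics(correct, missed, false_alarm):
    # Precision, recall, and F-score from event counts, Eqs. (3.7)-(3.9).
    precision = correct / (correct + false_alarm)
    recall = correct / (correct + missed)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score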

Tables 3.1, 3.2, and 3.3 show the overlapping AEC performance. '2L-DNN' and

‘5L-DNN’ denote DNN structures which have two and five hidden layers. In addi-


Table 3.1: Precision performance of overlapping acoustic event classification.

Event class    2L-DNN    5L-DNN    MTL       Proposed

C 0.8677 0.9220 0.9645 0.9785

K 0.9085 0.8987 0.9116 0.8898

P 0.9477 0.9341 0.9723 0.9852

S 0.9526 0.9377 0.9552 0.9685

average 0.9191 0.9231 0.9509 0.9555

Table 3.2: Recall performance of overlapping acoustic event classification.

Event class    2L-DNN    5L-DNN    MTL       Proposed

C 0.9294 0.9067 0.9246 0.9333

K 0.9308 0.9467 0.9540 0.9841

P 0.8767 0.8821 0.9135 0.8967

S 0.8667 0.9067 0.9378 0.9765

average 0.9009 0.9106 0.9325 0.9477

Table 3.3: F-score of overlapping acoustic event classification.

Event class    2L-DNN    5L-DNN    MTL       Proposed

C 0.8975 0.9143 0.9441 0.9554

K 0.9195 0.9221 0.9323 0.9346

P 0.9108 0.9074 0.9420 0.9389

S 0.9076 0.9219 0.9464 0.9725

average 0.9089 0.9164 0.9412 0.9503


Figure 3.7: Multi-task learning for overlapping acoustic event classification.

tion, we compared the performance with a model using multi-task learning: 'MTL' denotes a DNN structure which adopts multi-task learning, as seen in Figure 3.7. These baseline networks did not apply the joint training with source separation. The proposed method was found to improve the classification performance and achieved an average F-score of 0.9503. For each acoustic event class, the joint training with source separation achieved higher performance.


3.5 Summary

In this chapter, we have proposed a neural network for overlapping AEC based on

joint training of a source separation model and a multi-label classification model.

By adopting the source separation framework into the overlapping AEC task, the

jointly trained network can minimize the interference of overlapping events. From

the experimental results, it has been found that the proposed technique outperforms

the baseline networks which do not apply the joint training with source separation.


Chapter 4

Disentangled Feature Learning

for Noise-Invariant Speech

Enhancement

4.1 Introduction

DNNs have been successful in solving speech enhancement tasks under various noise environments since their introduction. Early literature using DNNs as a nonlinear mapping function for estimating clean speech reported better enhancement results [58]–[60] compared to the NMF-based algorithms. Various neural network structures have been employed for speech enhancement, such as multi-context stacking networks for ensemble learning [61], recurrent neural networks (RNNs) [49], [62], and convolutional neural networks (CNNs) [63], [64].

More recently, generative adversarial network (GAN) [65] has become popular

in the area of deep learning, and it has been also applied to speech enhancement.


Pascual et al. proposed end-to-end speech enhancement GAN (SEGAN) in which the

generator learns to model the mapping from the noisy speech samples to their clean

counterparts, while the discriminator learns to distinguish between the enhanced and

the target clean samples within the context of a mini-max game [66]. The underlying

idea of GAN has been adopted in many GAN-based speech enhancement algorithms

including the time-frequency mask estimation using the minimum mean square error

GAN (MMSE-GAN) [67] and the conditional GAN (cGAN) [68].

Though deep learning-based speech enhancement models have achieved consider-

able improvements, the performance is usually degraded in the case of mismatched

conditions caused by different types of noises or SNR levels between the training

and test set samples. Moreover, the performance varies depending on the types of

noises. In order to address such issues, disentangled feature learning can be consid-

ered as a possible solution. Most of the previous studies, which have focused mainly

on the mapping between the noisy and the clean speech, rarely consider how input

features are learned in the hidden layers. The model based on disentangled feature

learning, on the other hand, manipulates the latent features to distinguish between

the speech and noise in the intermediate layers, hence resulting in better enhance-

ment performance even in the mismatched noise conditions. Moreover, the noise-invariant property of the model can also be improved.

In this chapter, we propose a novel deep learning-based noise-invariant speech

enhancement algorithm which employs an adversarial training framework designed

to disentangle the latent features of speech and noise, under the concept of domain

adversarial training (DAT) [69]. Although DAT was originally introduced for the

domain adaptation task, the proposed algorithm exploits the DAT framework for

use in the regression task, i.e., speech enhancement. Experimental results show that


the proposed method successfully disentangles the speech and noise latent features.

Moreover, the results also reveal that our model outperforms the conventional DNN-

based algorithms. The main contributions of this chapter are summarized as follows:

• We modify the DAT framework in order to solve the speech enhancement

task in a supervised manner. The proposed model achieves better performance

in speech enhancement as compared to the baseline models under both the

matched and mismatched noise conditions.

• By reducing the performance gap among different noise types, we show that

our method is more robust to noise variability.

• By visualizing feature representations, we demonstrate that our model suc-

cessfully disentangles speech and noise latent features.

4.2 Masking-Based Speech Enhancement

When training neural networks in a supervised regression framework, as seen in Figure 4.1, it is essential to define a proper training target in order to

ensure effective learning. The training targets for speech enhancement can be mainly

categorized into two groups: (i) mapping-based, and (ii) masking-based approaches.

The mapping-based methods learn a regression function relating a noisy speech to

the corresponding clean speech directly while the masking-based methods estimate

time-frequency masks given a noisy speech. A variety of training targets have been

studied. Wang et al. evaluated and compared the performance of various mapping-

based and masking-based targets [70]. It may be controversial to argue which method

is better, yet many cases have shown that the masking-based methods (e.g. ideal


Figure 4.1: Scheme of DNN-based speech enhancement method.

ratio masks) tend to perform better than the mapping-based methods [61], [70], [71]

in terms of enhancement results. In this work, we design the proposed model within

a masking-based framework. We use the time-frequency masking functions as an

extra layer in the neural network [49]. This way, the model implicitly incorporates

the masking functions when optimizing the network which will be detailed in Section

4.4.1.
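To make the distinction concrete, the following is a minimal sketch of one widely used masking-based target of the kind evaluated in [70], an ideal-ratio-mask-style target; the exponent value and the small constant for numerical stability are assumptions of this sketch, and the target actually used in this work is the implicit mask of Section 4.4.1 rather than this one.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-8):
    """IRM-style masking target computed from clean-speech and noise magnitudes.

    beta = 0.5 is a commonly used exponent; eps avoids division by zero.
    """
    num = clean_mag ** 2
    return (num / (num + noise_mag ** 2 + eps)) ** beta

def apply_mask(mask, noisy_mag):
    """Masking-based enhancement: element-wise product of the mask and the noisy magnitude."""
    return mask * noisy_mag
```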

4.3 Concept of Domain Adversarial Training

Domain adaptation [72] addresses the problem of mismatch between the training

and test datasets by transferring the knowledge learned from the source domain to

a robust model in the target domain. DAT is one of the approaches that attempts

to match the data distributions across different domains. In [69], DAT exploits an

adversarial training method in order to learn intermediate features which are invari-


ant to the shifts in data from different domains. Here, the neural network learns

two different classifiers: (i) a classifier for the main classification task, and (ii) the

domain classifier. The training objective of the domain classifier, in particular, is

to learn whether the input sample is from the source or target domain, given fea-

tures extracted using labeled data from the source domain and unlabeled data from

the target domain. The feature extractor is shared by both the main task and the

domain classifiers. In implementation, a gradient reversal layer (GRL) is employed

to act as an identity transformer in the forward-propagation and to reverse the

gradient by multiplying a negative scalar during the back-propagation [69]. Conse-

quently, the GRL encourages the latent features to act discriminatively when solving

the main classification task, yet act indiscriminately towards the shifts across dif-

ferent domains. In other words, the feature extractor is trained so that the model

maps data from different domains to the latent features with similar distributions

via adversarial learning.
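As an illustration of how such a layer can be realized, the snippet below is a minimal TensorFlow sketch of a gradient reversal layer: it acts as the identity in the forward pass and multiplies the incoming gradient by a negative scalar during back-propagation. The class name and the fixed scaling factor are implementation choices of this sketch, not details taken from [69].

```python
import tensorflow as tf

class GradientReversal(tf.keras.layers.Layer):
    """Identity in the forward pass; scales the gradient by -lam in the backward pass."""

    def __init__(self, lam=1.0, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam  # reversal strength (a plain float in this sketch)

    def call(self, inputs):
        lam = tf.constant(self.lam, dtype=inputs.dtype)

        @tf.custom_gradient
        def _reverse(x):
            def grad(dy):
                return -lam * dy  # reversed (and scaled) gradient
            return tf.identity(x), grad

        return _reverse(inputs)
```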

Many speech processing frameworks have adopted the idea of DAT in order to ex-

tract domain-invariant features. Under the noise robust speech recognition scheme,

the clean speech was regarded as the source domain data and was used to train the

senone label classifier, while the noisy speech played the role of the target domain

data to be adjusted by the feature extractor [73], [74]. DAT was also used to learn

speaker-invariant senone labels, as shown in [75] where the adversarial training suc-

cessfully aligned the feature representation of different speakers. In [76], the authors

demonstrated that accent-invariant features could be learned for the ASR system.

For speaker recognition tasks, DAT was adopted to tackle the channel mismatch

problem. In particular, the latent features were extracted in order to learn channel-

invariant, yet speaker-discriminative representations [77]. In [78], the authors showed



Figure 4.2: The architecture of the proposed model for disentangled feature learning.

that DAT was able to handle multiple forms of mismatch (e.g., speaker, acoustic

conditions, and emotional content) when solving the acoustic emotion recognition

task. As for the speech enhancement problems, a noise adaptive method exploiting

DAT was proposed in [79]. In their work, however, DAT was only used to classify

stationary and non-stationary noises, and the authors did not make use of various

noise components for domain-invariant regression.

4.4 Disentangling Speech and Noise factors

4.4.1 Neural Network Architecture

Our neural network consists of five sub-networks: (i) an encoder (E), (ii) a speech

decoder (Ds), (iii) a noise decoder (Dn), (iv) a noise disentangler (DEn), and (v)

a speech disentangler (DEs). The overall architecture of the proposed model is il-

lustrated in Figure 4.2. We extract the magnitude spectra as the raw features of

all signal components. Only the magnitude spectra are estimated while the phase


parts of the noisy speech are kept intact. Let us denote the magnitude spectra of the

noisy speech, clean speech, and noise as x ∈ R^{F×(2τ+1)}, s ∈ R^{F×1}, and n ∈ R^{F×1},

respectively, where F denotes the number of frequency bins and τ represents an

input context expansion parameter (i.e., one current frame, τ previous and τ next

frames). The encoder E learns a function that maps x into speech and noise latent

features, defined by neural network parameter θE as follows:

(zs, zn) = E(x; θE) (4.1)

where zs ∈ R^{M×1} and zn ∈ R^{M×1} indicate M-dimensional speech and noise latent

features, respectively. Similarly, Ds and Dn learn mappings parameterized by θDs

and θDn , respectively, as follows:

ms = Ds(zs; θDs),  mn = Dn(zn; θDn)    (4.2)

where ms ∈ R^{F×1} and mn ∈ R^{F×1} denote the estimated speech and noise masks,

respectively. The time-frequency masks are constrained such that the sum of the estimated speech and noise magnitudes equals the input noisy speech. Given the masks from

both of the decoders, we can obtain the predicted speech and noise through a de-

terministic layer [49]. Given ms and mn, the predicted magnitude spectra of speech

ŝ ∈ R^{F×1} and noise n̂ ∈ R^{F×1} can be calculated as

ŝ = (ms / (ms + mn)) ⊗ x,
n̂ = (mn / (ms + mn)) ⊗ x    (4.3)


where the addition, division, and product (⊗) operators are applied element-wise.
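A minimal sketch of this deterministic masking layer is given below; the small constant added to the denominator is an assumption of the sketch for numerical stability and is not part of Equation (4.3).

```python
import numpy as np

def deterministic_mask_layer(m_s, m_n, x, eps=1e-8):
    """Eq. (4.3): ratio masks applied element-wise to the noisy magnitude spectrum x.

    m_s, m_n, and x are arrays of shape (F,); the two outputs sum to x (up to eps).
    """
    denom = m_s + m_n + eps
    s_hat = (m_s / denom) * x
    n_hat = (m_n / denom) * x
    return s_hat, n_hat
```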

Finally, DEn and DEs are trained to separate the noise attributes from the

speech latent features, and vice versa. DEn and DEs are respectively parameterized

by θDEn and θDEs as follows:

ñ = DEn(zs; θDEn),  s̃ = DEs(zn; θDEs)    (4.4)

where s̃ ∈ R^{F×1} and ñ ∈ R^{F×1} represent the speech and noise components estimated from the latent features. Note that s̃ and ñ differ from ŝ and n̂ in Equation (4.3). s̃ and ñ are generated by the disentanglers, which are trained adversarially against the encoder so that, eventually, the speech and noise cannot be recovered from the opposite latent features. The GRLs are inserted between the encoder and the disentanglers to establish this adversarial setting. On the other hand, ŝ and n̂ are well estimated by the corresponding decoders.

In the final speech enhancement stage, after obtaining ŝ from the decoders, the estimated clean speech spectrum Ŝ is reconstructed by

Ŝ = ŝ ⊗ exp(j∠x)    (4.5)

where ∠x denotes the phase of the corresponding input noisy speech. Ŝ is then transformed to the time-domain signal through the inverse discrete Fourier transform

(IDFT). Finally, an overlap-add method as in [80] is used to synthesize the waveform

of the enhanced speech.
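The reconstruction step can be sketched as follows, assuming the STFT configuration of Section 4.5.1 (512-point Hamming window, 50% overlap); scipy's istft is used here as a convenient stand-in for the IDFT plus overlap-add synthesis of [80].

```python
import numpy as np
from scipy.signal import istft

def synthesize_waveform(s_hat_mag, noisy_stft, fs=16000, n_fft=512):
    """Combine the estimated magnitude with the noisy phase (Eq. 4.5) and overlap-add."""
    phase = np.angle(noisy_stft)              # the phase of the noisy speech is kept intact
    S_hat = s_hat_mag * np.exp(1j * phase)    # Eq. (4.5)
    _, waveform = istft(S_hat, fs=fs, window='hamming',
                        nperseg=n_fft, noverlap=n_fft // 2)
    return waveform
```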


4.4.2 Training Objectives

Given the estimates ŝ and n̂ of the clean speech s and noise n, we optimize the

neural network parameters of the encoder and decoders by minimizing the mean

squared error defined as follows:

LDs(θE, θDs) = (1/K) Σ_{k=1}^{K} ‖ŝk − sk‖²,
LDn(θE, θDn) = (1/K) Σ_{k=1}^{K} ‖n̂k − nk‖²    (4.6)

where ‖·‖ indicates the l2-norm, K is the mini-batch size, and ŝk (n̂k) is the estimate of the k-th speech (noise) sample sk (nk) in the mini-batch. Similarly,

we also train the encoder and the disentanglers by using the following objective

functions:

LDEn(θE, θDEn) = (1/K) Σ_{k=1}^{K} ‖ñk − nk‖²,
LDEs(θE, θDEs) = (1/K) Σ_{k=1}^{K} ‖s̃k − sk‖²    (4.7)

where ñk and s̃k are obtained through Equation (4.4). To obtain disentangled fea-

tures, we minimize LDEn and LDEs defined in Equation (4.7) with respect to θDEn

and θDEs , while maximizing them with respect to θE simultaneously. Combining

Equations (4.6) and (4.7), the total loss of the proposed network is formulated as

LT(θE, θDs, θDn, θDEn, θDEs) = [LDs(θE, θDs) − λ1 LDEn(θE, θDEn)] + α [LDn(θE, θDn) − λ2 LDEs(θE, θDEs)]    (4.8)


where λ1 and λ2 are positive hyper-parameters which control the amount of gradient

reversal in the back-propagation step, and α denotes the weight controlling the

contribution of the noise estimate.
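For reference, the sketch below evaluates Equation (4.8) directly from the network outputs; in the actual implementation the sign flips are realized by the GRLs rather than by negating the losses explicitly, and the mean-squared-error reduction over the mini-batch follows Equations (4.6) and (4.7).

```python
import tensorflow as tf

def mse(pred, target):
    """Mean over the mini-batch of the squared l2-norm, as in Eqs. (4.6)-(4.7)."""
    return tf.reduce_mean(tf.reduce_sum(tf.square(pred - target), axis=-1))

def total_loss(s, n, s_hat, n_hat, s_tilde, n_tilde, lam1, lam2, alpha):
    """Analytic form of the total objective in Eq. (4.8)."""
    L_Ds  = mse(s_hat, s)      # speech decoder loss
    L_Dn  = mse(n_hat, n)      # noise decoder loss
    L_DEn = mse(n_tilde, n)    # noise disentangler loss
    L_DEs = mse(s_tilde, s)    # speech disentangler loss
    return (L_Ds - lam1 * L_DEn) + alpha * (L_Dn - lam2 * L_DEs)
```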

In recent studies, GRL has only been used for domain predictions under narrowly

restricted settings (with only two possible domains, e.g., the source and the target)

or for classifications of channels, speakers, and noise types. The proposed model

distinguishes itself from the past approaches by using two GRLs to disentangle the

speech and noise latent features in a regression manner.

4.4.3 Adversarial Training for Disentangled Features

Neural network parameters are optimized by using the objective function given

in Equation (4.8) via adversarial learning. Ds and Dn are trained to minimize LDs

and LDn , and DEn and DEs are also trained to minimize LDEn and LDEs . As for

the optimization of E, it is essential to ensure that it produces disentangled

features. This idea is implemented by minimizing LDs and LDn while maximizing

LDEn and LDEs in an adversarial manner with respect to the encoder parameter θE .

Such a mini-max competition eventually converges to the point where the encoder

network generates the noise-confusing latent feature zs and the speech-confusing

latent feature zn, disentangled in the latent feature space. Ds and Dn then use

zs and zn as inputs, respectively, and produce the noise-invariant speech estimate ŝ. In summary,

optimizations of the network parameters are given by

(θ̂E, θ̂Ds, θ̂Dn) = arg min_{θE, θDs, θDn} LT(θE, θDs, θDn, θDEn, θDEs),
(θ̂DEn, θ̂DEs) = arg max_{θDEn, θDEs} LT(θE, θDs, θDn, θDEn, θDEs)    (4.9)


where θ̂(·) denotes the optimal parameters for each given network (·).

The network parameters defined by Equation (4.9) can be found as a stationary

point of the following gradient updates:

θE ← θE − µ (∂LDs/∂θE + α ∂LDn/∂θE − λ1 ∂LDEn/∂θE − α λ2 ∂LDEs/∂θE),
θDs ← θDs − µ ∂LDs/∂θDs,
θDn ← θDn − µ α ∂LDn/∂θDn,
θDEn ← θDEn − µ ∂LDEn/∂θDEn,
θDEs ← θDEs − µ α ∂LDEs/∂θDEs    (4.10)

where µ indicates the learning rate. The updates of Equation (4.10) are very similar

to stochastic gradient descent (SGD) updates for the feed-forward deep model that

comprises the encoder fed into the decoders and into the disentanglers. The difference

is that the gradients flowing into the encoder from the disentanglers are weighted by λ1 and αλ2 and subtracted, instead of being summed. The negative coefficients −λ1

and −λ2 enable the encoder to induce the maximization of LDEn and LDEs by

reversing the gradients during the back-propagation. If both λ1 and λ2 are set to

zero, the neural network structure presented in Figure 4.2 becomes equivalent to the

conventional DNN structure. The optimized networks E, Ds and Dn are then used

during the test stage for generating the clean speech estimates given the noisy test

speech samples.


4.5 Experiments and Results

4.5.1 Dataset and Feature Extraction

We used 6,300 utterances of clean speech data from the TIMIT database [81] to train the neural networks. The TIMIT database consists of 10 sentences spoken by each of 630 English speakers. In order to make sure that various noisy utterances are

considered during simulations, we selected 10 different noise types including: car,

construction, office, railway, cafeteria, street, incar, train, bus from ITU-T recom-

mendation P.501 database and white noise from NOISEX-92 database [82]. In the

case of matched noise conditions, two-thirds of each noise clip was used for training

and the rest for testing. For each pair of the clean speech utterance and the noise

waveform, a noisy speech utterance was artificially generated with an SNR value

randomly chosen from −3 to 6 dB in 1 dB steps. As a result, a total of 63,000 utterances (about 54 hours) were used so that the entire database was mixed with each

noise type.

The test set consisted of 1,400 utterances of clean speech data from the TSP speech database

[83], spoken by 12 male and 12 female English speakers. For the experiments in the

matched noise conditions, we used the same noise types as used for training. For

the experiments in the mismatched noise conditions, noises including kids, traffic,

metro, and restaurant from ITU-T recommendation P.501 database were applied.

Noisy speech utterances were generated with the SNR value ranging from −6 to 9 dB

in 3 dB steps, in which the −6 and 9 dB cases represented the unseen SNR conditions.

The input and target features of the networks were extracted in the following way.

First, we extracted the magnitude spectra from the noisy speech, the corresponding

clean speech, and noise. A 512-point Hamming window with 50% overlap was ap-


Figure 4.3: Plot of losses on training the proposed model.

plied to the audio signals, sampled at 16 kHz, and then the short-time Fourier transform (STFT) was applied. The 512-point STFT magnitudes were reduced to 257 points by

removing the symmetric half. F and τ were fixed to 257 and 5, respectively. Thus,

input feature vectors, extracted from 11 consecutive frames, were concatenated in a

similar manner as in [59].
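A minimal sketch of this feature extraction pipeline is given below; the use of scipy's STFT and the edge padding at utterance boundaries are assumptions of the sketch rather than details stated in the text.

```python
import numpy as np
from scipy.signal import stft

def extract_input_features(waveform, fs=16000, n_fft=512, tau=5):
    """257-bin magnitude spectra with context expansion (tau previous and tau next frames)."""
    _, _, X = stft(waveform, fs=fs, window='hamming',
                   nperseg=n_fft, noverlap=n_fft // 2)
    mag = np.abs(X)                                        # shape: (257, T)
    padded = np.pad(mag, ((0, 0), (tau, tau)), mode='edge')
    frames = [padded[:, t:t + 2 * tau + 1].reshape(-1)     # 257 x 11 = 2,827-dim input
              for t in range(mag.shape[1])]
    return np.stack(frames), mag                           # (T, 2827) inputs and the 257 x T noisy magnitude
```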

4.5.2 Network Setup

The network architecture of the proposed model is presented in Figure 4.2, which we refer to as the speech-noise disentangled training (snDT) model. The encoder E

was constructed by stacking two hidden layers with 2,048 leaky rectified linear units (ReLUs) [84] in each layer. The number of input nodes of E was 257×11 = 2,827.

The output layer generated two separate outputs of 512 nodes (i.e., the dimension

M of zs and zn) with leaky ReLUs.

The decoders Ds and Dn also had two hidden layers with 2,048 leaky ReLUs


in each layer. The numbers of the input and output nodes in each network were

512 and 257, respectively. For the output activations, the sigmoid function was used to restrict the output mask (ms and mn) values to be in [0,1], while ŝ and n̂ were determined implicitly by Equation (4.3). The structures of DEn and DEs were identical to that of Ds except for the output activation functions; ReLUs were used for the output magnitudes (s̃ and ñ).
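The five sub-networks can be sketched with Keras layers as below; this is a simplified sketch (e.g., batch normalization is shown only on the hidden layers and weight initialization follows Keras defaults), not the exact training graph used in the thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def feedforward(in_dim, out_dim, out_activation, hidden=2048, n_hidden=2):
    """Two hidden layers of 2,048 leaky-ReLU units, as used for E, Ds, Dn, DEn, and DEs."""
    inp = tf.keras.Input(shape=(in_dim,))
    h = inp
    for _ in range(n_hidden):
        h = layers.Dense(hidden)(h)
        h = layers.BatchNormalization()(h)
        h = layers.LeakyReLU()(h)
    out = layers.Dense(out_dim, activation=out_activation)(h)
    return tf.keras.Model(inp, out)

encoder             = feedforward(257 * 11, 2 * 512, tf.nn.leaky_relu)  # split into zs and zn
speech_decoder      = feedforward(512, 257, 'sigmoid')                  # mask ms in [0, 1]
noise_decoder       = feedforward(512, 257, 'sigmoid')                  # mask mn in [0, 1]
noise_disentangler  = feedforward(512, 257, 'relu')                     # magnitude estimate n_tilde
speech_disentangler = feedforward(512, 257, 'relu')                     # magnitude estimate s_tilde
```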

The snDT model was trained with the Adam optimizer [56], with a learning rate of 1e−3, using a mini-batch size of 10 utterances. Batch normalization [85] was

applied to all of the hidden and output layers for regularization and stable training.

As for the hyper-parameters λ1 and λ2 of Equation (4.8), we took an approach

similar to [69]. λ1 and λ2 were initialized to 0 for the first 50K training iterations, and then their values were gradually increased until both reached 0.3 by

the end of the training. α in Equation (4.8) was fixed at 0.4. Figure 4.3 shows the

training losses obtained from the snDT model, and it is seen that the model was

trained properly. Through the adversarial training as defined by Equation (4.9), the

speech and noise estimation losses decreased, and the disentangling losses increased

gradually to convergence.
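The exact shape of the ramp is not specified, so the following is a hypothetical linear schedule consistent with the description above (zero for the first 50K iterations, then a gradual increase to 0.3); the ramp length is an assumed parameter of this sketch.

```python
def lambda_schedule(step, warmup_steps=50_000, ramp_steps=200_000, lam_max=0.3):
    """Hypothetical linear ramp for lambda_1 and lambda_2 over training iterations."""
    if step < warmup_steps:
        return 0.0
    return min(lam_max, lam_max * (step - warmup_steps) / ramp_steps)
```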

To evaluate the performance of the disentangled feature learning technique, we

implemented three baseline models for comparison. These baseline systems are as

follows:

• Speech training (sT ) model, as shown in Figure 4.4a, is a deep denoising

autoencoder [58], and it takes a regression approach closely resembling [59].

• Speech-noise training (snT ) model, as shown in Figure 4.4b, utilizes noise

components to construct the time-frequency masks. This approach is similar to the method suggested in [49]. Unlike the snDT model, however, the snT model does not exploit disentangled feature learning.

Figure 4.4: The architectures of the baseline models: (a) sT model, (b) snT model, and (c) nDT model.

• Noise disentangled training (nDT ) model, as shown in Figure 4.4c, is trained

so that the noise components are disentangled from the speech latent features

without using noise latent features.

The baseline models were configured similarly to the snDT model in terms of hyper-parameters and the number of layers and nodes in each module, to ensure a fair comparison. We implemented all the networks using TensorFlow [86].


4.5.3 Objective Measures

For the evaluation of the models’ performance, we considered four different aspects: speech quality, noise reduction, speech intelligibility, and speech distortion. The tested objective measures are summarized in the following:

• PESQ: Perceptual evaluation of speech quality defined in the ITU-T P.862

standard [87]

• segSNR: Segmental SNR, which is the average of the SNR per frame for the

two speech signals

• eSTOI: Extended short-time objective intelligibility [88]

• SDR: Signal-to-distortion ratio [57]

All metric values for the enhanced speech were compared with the corresponding

clean reference of the test set.
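As an illustration of the second measure listed above, the sketch below computes a segmental SNR in the time domain; the frame length, hop size, and the clamping of per-frame values to [−10, 35] dB are common conventions assumed here rather than settings reported in the thesis.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, hop=256, eps=1e-10):
    """Average of per-frame SNR between the clean reference and the enhanced signal."""
    length = min(len(clean), len(enhanced))
    snrs = []
    for start in range(0, length - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        snr = 10.0 * np.log10(np.sum(c ** 2) / (np.sum((c - e) ** 2) + eps) + eps)
        snrs.append(np.clip(snr, -10.0, 35.0))   # common clamping convention
    return float(np.mean(snrs))
```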

4.5.4 Performance Evaluation

In the case of the matched noise conditions, we measured the objective metrics and

averaged them over each SNR environment to evaluate performance for ten different

noise types. Table 4.1 presents the PESQ scores, segSNR, eSTOI, and SDR values

obtained in the matched noise conditions where the column “noisy” refers to the

results obtained from the clean and the unprocessed noisy speech. The cases with

SNR equal to −6 and 9 dB indicate the unseen SNR conditions that were not in-

cluded during the training phase. Firstly, we investigated whether the use of noise

information improves performance for speech enhancement. The results show that

Table 4.1: Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the matched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.

(a) PESQ

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         1.53     2.00    2.12    2.06    2.22
-3         1.71     2.23    2.35    2.30    2.45
 0         1.90     2.44    2.57    2.52    2.66
 3         2.11     2.64    2.76    2.72    2.85
 6         2.33     2.83    2.95    2.90    3.02
 9         2.54     2.99    3.10    3.05    3.17
Aver.      2.02     2.52    2.64    2.59    2.73

(b) segSNR

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -6.87    1.49    3.18    2.85    3.53
-3         -5.39    3.06    4.31    3.93    4.92
 0         -3.65    4.57    5.58    5.27    6.29
 3         -1.80    6.08    7.03    6.79    7.86
 6          0.32    7.41    8.33    8.14    9.20
 9          2.57    8.64    9.56    9.38    10.43
Aver.      -2.47    5.21    6.33    6.06    7.04

(c) eSTOI

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         0.44     0.56    0.59    0.57    0.61
-3         0.52     0.64    0.67    0.65    0.69
 0         0.59     0.71    0.74    0.73    0.76
 3         0.67     0.77    0.80    0.79    0.82
 6         0.74     0.82    0.84    0.84    0.86
 9         0.80     0.86    0.88    0.87    0.89
Aver.      0.63     0.73    0.75    0.74    0.77

(d) SDR

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -5.97    7.07    7.96    7.22    8.75
-3         -3.11    9.63    10.42   9.85    11.10
 0         -0.17    11.92   12.67   12.16   13.21
 3          2.80    14.06   14.71   14.27   15.14
 6          5.78    15.81   16.42   16.03   16.81
 9          8.78    17.34   17.94   17.56   18.24
Aver.       1.35    12.64   13.35   12.85   13.88

Table 4.2: Results of PESQ, segSNR, eSTOI, and SDR values of the proposed and baseline networks in the mismatched noise type conditions, where −6 and 9 dB cases are unseen SNR conditions.

(a) PESQ

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         1.33     1.68    1.77    1.79    1.90
-3         1.55     1.93    2.02    2.02    2.13
 0         1.77     2.16    2.25    2.27    2.35
 3         1.98     2.38    2.46    2.44    2.55
 6         2.20     2.59    2.67    2.65    2.75
 9         2.41     2.78    2.86    2.83    2.93
Aver.      1.88     2.25    2.34    2.33    2.43

(b) segSNR

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -6.59    -0.86   1.78    1.70    1.90
-3         -5.08    0.81    2.85    2.72    2.81
 0         -3.35    2.58    3.50    3.47    4.04
 3         -1.48    4.16    4.97    4.89    5.69
 6          0.64    5.82    6.64    6.60    7.44
 9          2.91    7.29    8.11    8.08    8.97
Aver.      -2.16    3.30    4.64    4.58    5.14

(c) eSTOI

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         0.39     0.46    0.48    0.48    0.51
-3         0.47     0.55    0.58    0.57    0.60
 0         0.55     0.63    0.66    0.66    0.68
 3         0.63     0.71    0.74    0.73    0.75
 6         0.71     0.77    0.80    0.80    0.81
 9         0.78     0.82    0.84    0.84    0.86
Aver.      0.59     0.66    0.68    0.68    0.70

(d) SDR

SNR (dB)   noisy    sT      snT     nDT     snDT
-6         -6.00    1.96    2.44    2.20    2.59
-3         -3.11    4.89    5.37    5.21    5.57
 0         -0.17    7.89    8.37    8.26    8.61
 3          2.79    10.50   10.92   10.78   11.17
 6          5.78    13.01   13.41   13.24   13.66
 9          8.78    15.11   15.52   15.37   15.82
Aver.       1.34    8.89    9.34    9.18    9.57


the snT model, which constructed the masks using both speech and noise infor-

mation, performed better than the sT model whose prediction was based only on

speech components. Similarly, the snDT model with noise estimates reported better

performance in terms of all the metrics compared to the nDT model.

The nDT model, which disentangles the noise components in the latent feature

space, resulted in lower performance improvements in comparison with the snT

model. This confirms that even though the nDT model incorporated disentangled

feature learning, it was not able to exploit the noise information to construct the

masks during the speech enhancement process. On the other hand, in order to ex-

amine the sole effect of the disentangled feature learning, the nDT model should be

compared to the sT model whose structure is identical to the nDT model except

for the noise disentangler. As can be seen in the results, the nDT model outper-

formed the sT model in terms of all the metrics. Furthermore, the comparison of the

snDT model to the snT model, both of which similarly adopted the masks except

that the snDT model additionally applied disentangled feature learning, reported

better performance improvements for the snDT model. In summary, the proposed

model showed better performance than all the other baseline models in terms of

speech quality, intelligibility, noise reduction, and speech distortion, indicating that

the disentanglement between speech and noise features in the latent feature space

was more effective for the prediction of the clean speech.

In the case of the mismatched noise conditions, we evaluated performance on four different noise types and averaged the results over each SNR environment. Ta-

ble 4.2 presents the PESQ scores, segSNR, eSTOI, and SDR values obtained under

the mismatched noise conditions. The results show that the snDT model outper-

formed the baseline methods, implying that it was more robust to the unseen noise


types. Since the snDT model learned how to disentangle speech components from

the latent features, the disentangled features could be obtained even in the mis-

matched noise conditions. From the perspective of noise reduction, in particular, it

is quite noteworthy that the models using disentangled feature learning showed rel-

atively competitive performance improvements in the mismatched noise conditions

compared to the matched conditions. In the case of the matched noise conditions, the

relative improvement of segSNR was 16.31% for the nDT model when compared

against the sT model, and 11.21% for the snDT model against the snT model. In

the case of the mismatched noise conditions, however, the relative improvements of

segSNR of the nDT and snDT models were 38.79% and 15.95%, respectively. It can

be seen that the proposed approach is particularly effective in the aspect of noise

reduction.

Additionally, Figure 4.5 shows the spectrograms of an utterance enhanced by

the snT and snDT models in the mismatched noise conditions. From this figure, it

is shown that the proposed algorithm effectively reduced the noise from the original

noisy speech while the speech distortion was minimized.

We also conducted a listening test to compare the subjective performance of the

proposed algorithm with the conventional scheme. For that, 18 listeners participated

and were presented with 42 randomly selected sentences corrupted by the 14 different

noises at SNR values of −3, 0, and 3 dB. In the test, each listener was provided

with speech samples enhanced by the snT model and snDT model. Listeners could

listen to each enhanced speech as many times as they wanted, and were asked to

choose the preferred one from each pair of speech samples in terms of perceptual

speech quality. If the quality of the two samples was indistinguishable, listeners could

select no preference. Two samples in each pair were given in arbitrary order.


Figure 4.5: (From top to bottom) The spectrograms of noisy speech degraded by metro noise with −3 dB SNR, enhanced speech by the snT model, enhanced speech by the snDT model, and the corresponding clean speech, respectively.

The results are shown in Figure 4.6. It can be seen that the quality of the speech

enhanced by the proposed model was better than that of the conventional model at all SNR

values. With respect to the averaged results, the snDT model was preferred to the

snT model in 52.78% of the cases, while the opposite preference was 8.20% (no

preference in 39.02% of the cases). These results imply that the proposed algorithm

enhances not only the objective measures but also the perceived quality.

Figure 4.6: Results of subjective preference test (%) comparing the speech quality for the snT and snDT models with various SNR values.

SNR (dB)   snT       Neutral   snDT
-3          5.16%    35.71%    59.13%
 0         11.51%    36.11%    52.38%
 3          7.94%    45.24%    46.82%
Average     8.20%    39.02%    52.78%

4.5.5 Analysis of Noise-Invariant Speech Enhancement

As the network is trained with different types of noise, it is easily anticipated

that the performance may vary depending on the noise types even when given the

same SNR value. This could be problematic, especially under various real-world

noise environments, because lower performance improvements for certain noise types

could result in lower overall performance for the entire system. Figure 4.7

describes the variances of the PESQ scores obtained from different noise types. We

separately measured the PESQ scores for each noise type and computed the variances

over the 14 different noise types used in the matched and mismatched noise conditions.

The results show that the proposed algorithm yielded the smallest performance

gap among the noise types in all of the SNR environments. It is noted that the

snDT model produced much smaller variances at the low SNR level compared to

the baseline models. This demonstrates that the proposed model was less sensitive

to different noise types during the enhancement process because it disentangled the

speech attributes well from the noisy speech in the latent feature space. Experimental

results, therefore, suggest that the proposed model is a speech enhancement system

with an improved noise-invariant property.


Figure 4.7: Variances of PESQ scores for the 14 different noise types in various SNR environments.

4.5.6 Disentangled Feature Representations

We further explored the effect of disentangled feature learning by visualizing

the speech latent feature (zs) using t-SNE [89]. t-SNE is a popular data visualiza-

tion method which projects high-dimensional data into a lower-dimensional subspace. The projection serves as a useful tool to visually inspect feature represen-

tations learned by the model. We extracted speech latent features from a subset of

the test samples through trained models and projected the 512-dimensional zs into

the 2-dimensional space using t-SNE. Figure 4.8 visualizes the speech latent feature

representations obtained in the matched noise conditions. Figure 4.8d, in particular,

shows that by using two disentanglers for adversarial learning, the distributions of zs for different noise types became almost indistinguishable. This implies that the noise attributes were highly

likely to be disentangled in zs. In contrast, without disentangled feature learning, as

shown in Figures 4.8a and 4.8b, we were able to separate each type of noise cluster


(a) sT model (b) snT model

(c) nDT model (d) snDT model

Figure 4.8: Visualization of speech latent feature (zs) using t-SNE in the matched noise condition.

(a) snT model (b) snDT model

Figure 4.9: Visualization of speech latent feature (zs) using t-SNE in the mismatched noise condition.


easily in the latent feature space. This indicates that the noise attributes remain

intact in zs. Figure 4.8c shows that the nDT model disentangled the noise compo-

nents more clearly as compared to the sT and snT models, yet not as much as the

snDT model. Finally, Figure 4.9 shows the speech latent feature representations in

the mismatched noise conditions. Even though the noise types were not included in

the training data, the proposed model disentangled noise components more clearly

in the latent feature space compared to the conventional DNN-based models.
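A minimal sketch of how such a projection can be produced with scikit-learn is shown below; the perplexity value and the PCA initialization are assumptions of the sketch rather than settings reported in the thesis.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latent_features(z_s, noise_labels, perplexity=30):
    """Project 512-dimensional speech latent features to 2-D and colour points by noise type."""
    z_2d = TSNE(n_components=2, perplexity=perplexity, init='pca').fit_transform(z_s)
    for label in np.unique(noise_labels):
        idx = noise_labels == label
        plt.scatter(z_2d[idx, 0], z_2d[idx, 1], s=4, label=str(label))
    plt.legend(markerscale=3)
    plt.show()
```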

4.6 Summary

In this chapter, we proposed a novel speech enhancement method in which speech

and noise latent features were disentangled via adversarial learning. In order to

explore the disentangled representation which has not been exploited in the con-

ventional speech enhancement algorithms, we designed a model using GRLs. The

proposed architecture is composed of five sub-networks where the decoders and the

disentanglers were trained in an adversarial manner to encourage the encoder to pro-

duce noise-invariant features. The speech latent features generated by the encoder

reduced the variability among different noise types while retaining the speech infor-

mation intact. Experimental results showed that the proposed model outperformed

the conventional DNN-based speech enhancement algorithms in terms of various

measurements in both the matched and mismatched noise conditions. Moreover, the

proposed model achieved a more competitive noise-invariant property through disen-

tangled feature learning. Visualization of the speech latent features demonstrated

that the proposed method was able to disentangle speech attributes from the noisy

speech in the latent feature space.


Chapter 5

Conclusions

In this thesis, we proposed a variety of deep learning techniques to improve the performance of acoustic environment recognition and speech enhancement. In or-

der to enhance the classification accuracy of acoustic scenes, we proposed a novel

neural network structure which achieved higher performance compared with the

conventional DNN, CNN, and LSTM architectures in terms of both frame-based and

segment-based accuracy. By combining different networks in parallel, the proposed

method was able to learn complementary information of LSTM and CNN.

Also, we proposed a neural network for overlapping AEC based on joint training

between a source separation model and a multi-label classification model. By adopting

the source separation framework into the overlapping AEC task, the jointly trained

network can minimize the interference of overlapping events. From the experimental

results, it has been found that the proposed technique outperforms the baseline

networks which do not apply the joint training with source separation.

Finally, we proposed a novel speech enhancement method in which speech and

noise latent features were disentangled via adversarial learning. In order to explore


the disentangled representation which has not been exploited in the conventional

speech enhancement algorithms, we designed a model using GRLs. The proposed

architecture is composed of five sub-networks where the decoders and the disentan-

glers were trained in an adversarial manner to encourage the encoder to produce

noise-invariant features. The speech latent features generated by the encoder re-

duced the variability among different noise types while retaining the speech infor-

mation intact. Experimental results showed that the proposed model outperformed

the conventional DNN-based speech enhancement algorithms in terms of various

measurements in both the matched and mismatched noise conditions. Moreover, the

proposed model achieved a more competitive noise-invariant property through disen-

tangled feature learning. Visualization of the speech latent features demonstrated

that the proposed method was able to disentangle speech attributes from the noisy

speech in the latent feature space.


Bibliography

[1] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolu-

tional neural networks,” IEEE AASP Challenge on Detection and Classification

of Acoustic Scenes and Events (DCASE), 2016.

[2] H. Phan, L. Hertel, M. Maass, P. Koch, R. Mazur, and A. Mertins, “Improved

audio scene classification based on label-tree embeddings and convolutional neu-

ral networks,” IEEE/ACM Transactions on Audio, Speech, and Language Pro-

cessing, vol. 25, no. 6, pp. 1278–1290, 2017.

[3] L. Lu, H.-J. Zhang, and S. Z. Li, “Content-based audio classification and seg-

mentation by using support vector machines,” Multimedia systems, vol. 8, no. 6,

pp. 482–492, 2003.

[4] A. Temko and C. Nadeu, “Classification of acoustic events using SVM-based

clustering schemes,” Pattern Recognition, vol. 39, no. 4, pp. 682–694, 2006.

[5] A. Temko and C. Nadeu, “Acoustic event detection in meeting-room environ-

ments,” Pattern Recognition Letters, vol. 30, no. 14, pp. 1281–1288, 2009.


[6] A. Harma, M. F. McKinney, and J. Skowronek, “Automatic surveillance of the

acoustic activity in our living environment,” in IEEE International Conference

on Multimedia and Expo (ICME), 2005.

[7] M. Xu, C. Xu, L. Duan, J. S. Jin, and S. Luo, “Audio keywords generation for

sports video analysis,” ACM Transactions on Multimedia Computing, Commu-

nications, and Applications (TOMM), vol. 4, no. 2, p. 11, 2008.

[8] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore,

M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset

for audio events,” in IEEE International Conference on Acoustics, Speech and

Signal Processing (ICASSP), 2017.

[9] J. Chen, Y. Wang, S. E. Yoho, D. Wang, and E. W. Healy, “Large-scale training

to increase speech intelligibility for hearing-impaired listeners in novel noises,”

The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604–2612,

2016.

[10] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, “A deep denois-

ing autoencoder approach to improving the intelligibility of vocoded speech in

cochlear implant simulation,” IEEE Transactions on Biomedical Engineering,

vol. 64, no. 7, pp. 1568–1578, 2016.

[11] A. Maas, Q. V. Le, T. M. O’neil, O. Vinyals, P. Nguyen, and A. Y. Ng, “Re-

current neural networks for noise reduction in robust asr,” 2012.

[12] C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement

with generative adversarial networks for robust speech recognition,” in 2018


IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP). IEEE, 2018, pp. 5024–5028.

[13] J. Ortega-García and J. González-Rodríguez, “Overview of speech enhancement

techniques for automatic speaker recognition,” in Proceeding of Fourth Inter-

national Conference on Spoken Language Processing. ICSLP’96, vol. 2. IEEE,

1996, pp. 929–932.

[14] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,”

IEEE Transactions on acoustics, speech, and signal processing, vol. 27, no. 2,

pp. 113–120, 1979.

[15] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square

error short-time spectral amplitude estimator,” IEEE Transactions on acous-

tics, speech, and signal processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[16] N. S. Kim and J.-H. Chang, “Spectral enhancement based on global soft deci-

sion,” IEEE Signal processing letters, vol. 7, no. 5, pp. 108–110, 2000.

[17] J. Lim and A. Oppenheim, “All-pole modeling of degraded speech,” IEEE

Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp.

197–210, 1978.

[18] P. Gupta, M. Patidar, and P. Nema, “Performance analysis of speech enhance-

ment using lms, nlms and unanr algorithms,” in 2015 International Conference

on Computer, Communication and Control (IC4). IEEE, 2015, pp. 1–5.

[19] R. Li, Y. Liu, Y. Shi, L. Dong, and W. Cui, “ILMSAF based speech enhancement with DNN and noise classification,” Speech Communication, vol. 85, pp. 53–70,

2016.


[20] I. Cohen and B. Berdugo, “Speech enhancement for non-stationary noise envi-

ronments,” Signal processing, vol. 81, no. 11, pp. 2403–2418, 2001.

[21] K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based speech enhancement using

bases update,” IEEE Signal Processing Letters, vol. 22, no. 4, pp. 450–454,

2014.

[22] N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised

speech enhancement using nonnegative matrix factorization,” IEEE Transac-

tions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151,

2013.

[23] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D.

Plumbley, “Detection and classification of acoustic scenes and events: An

IEEE AASP challenge,” in IEEE Workshop on Applications of Signal Process-

ing to Audio and Acoustics, 2013, pp. 1–4.

[24] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, “Acoustic scene

classification: Classifying environments from the sounds they produce,” IEEE

Signal Processing Magazine, vol. 32, no. 3, pp. 16–34, 2015.

[25] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “De-

tection and classification of acoustic scenes and events,” IEEE Transactions on

Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.

[26] Z. Kons and O. Toledo-Ronen, “Audio event classification using deep neural

networks,” in INTERSPEECH, 2013, pp. 1482–1486.


[27] O. Gencoglu, T. Virtanen, and H. Huttunen, “Recognition of acoustic events

using deep neural networks,” in European Signal Processing Conference (EU-

SIPCO), 2014, pp. 506–510.

[28] I. McLoughlin, H. Zhang, Z. Xie, Y. Song, and W. Xiao, “Robust sound event

classification using deep neural networks,” IEEE/ACM Transactions on Audio,

Speech, and Language Processing,, vol. 23, no. 3, pp. 540–552, 2015.

[29] A. Graves, “Supervised sequence labelling,” in Supervised Sequence Labelling

with Recurrent Neural Networks. Springer, 2012, pp. 5–13.

[30] Y. Wang, L. Neves, and F. Metze, “Audio-based multimedia event detection

using deep recurrent neural networks,” in IEEE International Conference on

Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2742–2746.

[31] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neural networks for

polyphonic sound event detection in real life recordings,” in IEEE International

Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.

6440–6444.

[32] H. Zhang, I. McLoughlin, and Y. Song, “Robust sound event recognition using

convolutional neural networks,” in IEEE International Conference on Acoustics,

Speech and Signal Processing (ICASSP), 2015, pp. 559–563.

[33] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, “Exploiting spectro-

temporal locality in deep learning based acoustic event detection,” EURASIP

Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, pp. 1–12,

2015.


[34] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene

classification and sound event detection,” in European Signal Processing Con-

ference (EUSIPCO), 2016.

[35] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent

neural networks,” arXiv preprint arXiv:1211.5063, 2012.

[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Compu-

tation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] H. Phan, L. Hertel, M. Maass, and A. Mertins, “Robust audio event recog-

nition with 1-max pooling convolutional neural networks,” arXiv preprint

arXiv:1604.06338, 2016.

[38] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,

K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for

visual recognition and description,” in IEEE Conference on Computer Vision

and Pattern Recognition, 2015, pp. 2625–2634.

[39] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-

term memory, fully connected deep neural networks,” in IEEE International

Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,

2015, pp. 4580–4584.

[40] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint

arXiv:1212.5701, 2012.

[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,

“Dropout: A simple way to prevent neural networks from overfitting,” The

Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.


[42] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, “Context-dependent

sound event detection,” EURASIP Journal on Audio, Speech, and Music Pro-

cessing, vol. 2013, no. 1, pp. 1–13, 2013.

[43] A. Dessein, A. Cont, and G. Lemaitre, “Real-time detection of overlapping

sound events with non-negative matrix factorization,” in Matrix Information

Geometry. Springer, 2013, pp. 341–371.

[44] Y. Wang and F. Metze, “A first attempt at polyphonic sound event detection

using connectionist temporal classification,” in IEEE International Conference

on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[45] E. Benetos, G. Lafay, M. Lagrange, and M. D. Plumbley, “Polyphonic sound

event tracking using linear dynamical systems,” IEEE/ACM Transactions on

Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1266–1277, 2017.

[46] J. Dennis, H. D. Tran, and E. S. Chng, “Overlapping sound event recognition

using local spectrogram features and the generalised hough transform,” Pattern

Recognition Letters, vol. 34, no. 9, pp. 1085–1093, 2013.

[47] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic sound event

detection using multi label deep neural networks,” in IEEE International Joint

Conference on Neural Networks (IJCNN), 2015, pp. 1–7.

[48] T. Heittola, A. Mesaros, T. Virtanen, and M. Gabbouj, “Supervised model

training for overlapping sound events based on unsupervised source separation.”

in IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP), 2013, pp. 8677–8681.


[49] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint opti-

mization of masks and deep recurrent neural networks for monaural source sep-

aration,” IEEE/ACM Transactions on Audio, Speech and Language Processing

(TASLP), vol. 23, no. 12, pp. 2136–2147, 2015.

[50] E. M. Grais, G. Roma, A. J. Simpson, and M. D. Plumbley, “Discriminative

enhancement for single channel audio source separation using deep neural net-

works,” in International Conference on Latent Variable Analysis and Signal

Separation. Springer, 2017, pp. 236–246.

[51] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based target source

separation using deep neural network,” IEEE Signal Processing Letters, vol. 22,

no. 2, pp. 229–233, 2015.

[52] A. Narayanan and D. Wang, “Improving robustness of deep neural network

acoustic models via speech separation and joint adaptive training,” IEEE/ACM

transactions on audio, speech, and language processing, vol. 23, no. 1, pp. 92–

101, 2015.

[53] K. H. Lee, S. J. Kang, W. H. Kang, and N. S. Kim, “Two-stage noise aware

training using asymmetric deep denoising autoencoder,” in IEEE International

Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.

5765–5769.

[54] Z.-Q. Wang and D. Wang, “A joint training framework for robust automatic

speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language

Processing, vol. 24, no. 4, pp. 796–806, 2016.


[55] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene

classification and sound event detection,” in European Signal Processing Con-

ference (EUSIPCO), 2016, pp. 1128–1132.

[56] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv

preprint arXiv:1412.6980, 2014.

[57] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind

audio source separation,” IEEE transactions on audio, speech, and language

processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[58] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep

denoising autoencoder.” in Interspeech, 2013, pp. 436–440.

[59] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech

enhancement based on deep neural networks,” IEEE/ACM Transactions on

Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19,

2015.

[60] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based target source

separation using deep neural network,” IEEE Signal Processing Letters, vol. 22,

no. 2, pp. 229–233, 2014.

[61] X.-L. Zhang and D. Wang, “A deep ensemble learning method for monaural

speech separation,” IEEE/ACM Transactions on Audio, Speech and Language

Processing (TASLP), vol. 24, no. 5, pp. 967–977, 2016.

[62] J. Chen and D. Wang, “Long short-term memory for speaker generalization in

supervised speech separation,” The Journal of the Acoustical Society of Amer-

ica, vol. 141, no. 6, pp. 4705–4714, 2017.


[63] H. Zhao, S. Zarar, I. Tashev, and C.-H. Lee, “Convolutional-recurrent neural

networks for speech enhancement,” in 2018 IEEE International Conference on

Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2401–

2405.

[64] P. Chandna, M. Miron, J. Janer, and E. Gomez, “Monoaural audio source sep-

aration using deep convolutional neural networks,” in International conference

on latent variable analysis and signal separation. Springer, 2017, pp. 258–266.

[65] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,

A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in

neural information processing systems, 2014, pp. 2672–2680.

[66] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative

adversarial network,” arXiv preprint arXiv:1703.09452, 2017.

[67] M. H. Soni, N. Shah, and H. A. Patil, “Time-frequency masking-based speech

enhancement using generative adversarial network,” in 2018 IEEE International

Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018,

pp. 5039–5043.

[68] A. Pandey and D. Wang, “On adversarial training and loss functions for speech

enhancement,” in 2018 IEEE International Conference on Acoustics, Speech

and Signal Processing (ICASSP). IEEE, 2018, pp. 5414–5418.

[69] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette,

M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural net-

works,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–

2030, 2016.


[70] Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised

speech separation,” IEEE/ACM transactions on audio, speech, and language

processing, vol. 22, no. 12, pp. 1849–1858, 2014.

[71] M. Delfarah and D. Wang, “Features for masking-based monaural speech sepa-

ration in reverberant conditions,” IEEE/ACM Transactions on Audio, Speech,

and Language Processing, vol. 25, no. 5, pp. 1085–1094, 2017.

[72] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on

knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.

[73] Y. Shinohara, “Adversarial multi-task learning of deep neural networks for ro-

bust speech recognition.” in INTERSPEECH. San Francisco, CA, USA, 2016,

pp. 2369–2372.

[74] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domain adap-

tation approach for robust speech recognition,” Neurocomputing, vol. 257, pp.

79–87, 2017.

[75] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gang, and B.-H. Juang,

“Speaker-invariant training via adversarial learning,” in 2018 IEEE Inter-

national Conference on Acoustics, Speech and Signal Processing (ICASSP).

IEEE, 2018, pp. 5969–5973.

[76] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, “Domain adversarial

training for accented speech recognition,” in 2018 IEEE International Confer-

ence on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp.

4854–4858.


[77] Q. Wang, W. Rao, S. Sun, L. Xie, E. S. Chng, and H. Li, “Unsupervised do-

main adaptation via domain adversarial training for speaker recognition,” in

2018 IEEE International Conference on Acoustics, Speech and Signal Process-

ing (ICASSP). IEEE, 2018, pp. 4889–4893.

[78] M. Abdelwahab and C. Busso, “Domain adversarial for acoustic emotion recog-

nition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing,

vol. 26, no. 12, pp. 2423–2435, 2018.

[79] C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang, “Noise adaptive

speech enhancement using domain adversarial training,” arXiv preprint

arXiv:1807.07501, 2018.

[80] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.

[81] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT

and beyond,” Speech communication, vol. 9, no. 4, pp. 351–356, 1990.

[82] A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition:

II. NOISEX-92: A database and an experiment to study the effect of additive

noise on speech recognition systems,” Speech communication, vol. 12, no. 3, pp.

247–251, 1993.

[83] P. Kabal, “TSP speech database,” McGill University, Database Version, vol. 1,

no. 0, pp. 09–02, 2002.

[84] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve

neural network acoustic models,” in Proc. ICML, vol. 30, no. 1, 2013, p. 3.


[85] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train-

ing by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

[86] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghe-

mawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine

learning,” in 12th Symposium on Operating Systems Design and Implementa-

tion, 2016, pp. 265–283.

[87] ITU-T Recommendation, “Perceptual evaluation of speech quality (PESQ): An ob-

jective method for end-to-end speech quality assessment of narrow-band tele-

phone networks and speech codecs,” Rec. ITU-T P. 862, 2001.

[88] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of

speech masked by modulated noise maskers,” IEEE/ACM Transactions on Au-

dio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.

[89] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.


Summary

Sounds occurring around us carry a large amount of information, and human speech is the most representative example. However, environmental sounds other than speech can also be an important factor for understanding the surrounding environment from the perspective of user-customized services. Such environmental sounds act as noise to be removed in applications that extract speech information, whereas they are objects to be recognized in applications that analyze the surrounding environment. From this perspective, this thesis proposes deep learning-based acoustic environment classification and speech enhancement techniques.

First, for acoustic scene classification, we propose a classification model that combines a CNN (convolutional neural network) and an LSTM (long short-term memory). Conventional DNN (deep neural network)-based models have the drawback that they cannot exploit the temporal information of acoustic signals. To overcome this, temporal information is utilized through the LSTM structure, and the CNN structure is combined with it to exploit the local spectro-temporal correlation of the acoustic signal. By allowing the two different models to learn complementary information, the proposed method improves acoustic scene classification performance compared with conventional techniques.

Second, for the classification of overlapping acoustic events, we propose a technique that applies source separation. In real life, different sound sources frequently overlap, which makes classification more difficult. To address this, a source separation model is trained on the overlapping acoustic events, a separate model is trained to classify each separated event, and finally the two models are combined and trained again (joint training). The jointly trained model effectively separates the overlapping sounds and thereby improves the performance of classifying each event.

Finally, we propose a speech enhancement technique that applies disentangled factor learning. While the techniques proposed above are applications that recognize environmental sounds, speech enhancement aims to remove environmental sounds other than speech. The proposed technique treats speech and noise as separate factors, disentangles the two factors in the latent space, and estimates clean speech from the speech factor from which the noise factor has been removed. The speech enhancement technique based on disentangled factor learning outperformed conventional deep learning-based speech enhancement techniques in terms of various performance measures. In addition, the effect of environmental sound aware training, which exploits acoustic environment classification information in advance, on speech enhancement performance was examined.

Keywords: speech enhancement, acoustic scene classification, overlapping acoustic event classification, source separation, disentangled factor learning, deep learning

Student Number: 2012-20781

Acknowledgments

It has already been nearly eight years since I joined the Human Interface Laboratory with the goal of a doctoral degree. As I write this, I look back on my graduate school years and think a great deal about how much I have grown. The master's and doctoral course was a precious time in which I was able to learn many things, both in research and beyond it. I would like to offer my thanks to the precious people who have been with me and helped me along the way.

First of all, I thank my advisor, Professor 김남수. He gave me many ideas and much inspiration for my research, and was my greatest support until I, with all my shortcomings, reached graduation. Beyond research, seeing how he treats his students has taught me the true meaning of a teacher. Carrying the lessons he has given me, I will strive to keep growing after entering society. I am also grateful to Professors 김성철, 심병효, 장준혁, and 신종원 for their many comments and help on the weaknesses of this dissertation. I wish all of the professors who guided me lasting health and happiness.

I also thank the seniors who advised me from my first year and who, having graduated, now bring honor to the laboratory out in society: 창우, 준식, 기호, 유광, 두화, 신재, 철민, 석재, 태균, and 기수, to whom I am most grateful, as well as the master's graduates 현우, 수카냐, 세영, 지환, and 석완. I wish all of you bright days ahead. To my classmate 강현, who graduated before me: drinking and working out together gave my graduate life its energy, and I sometimes miss the conference trip we took to New Orleans. I am grateful as well to the junior members who are still devoting themselves to research; thanks to you I graduate with many memories. Congratulations to 인규, who shared everything with me from papers and conferences to projects; I hope he goes from strength to strength at his company. To 준엽, who served responsibly as room leader for a long time and was always an easy conversation partner, and to 정훈, who is equally responsible in his work and a fine drinking companion, I cheer for you as the next in line to finish your degrees safely. I hope 성준 pushes on a little more so that graduation leads naturally to marriage with his girlfriend. I hope 형용, the lab's hot topic, popular guy, three-pointed star, and dieter, and 우현, who shows the essence of how graduate research should be done, both finish well. 원익, whose research potential burst out after a transitional period of hobbies, reminds me of our memories in Kuala Lumpur. 현승, who is smart, reasonably athletic, and good-natured, would be perfect if only he started dating. I envy 주현, who completed his master's while working and is now pursuing a doctorate, for his new apartment in Seoul. 병진, with a good physique and a good mindset, works hard at his job and is now married, which makes him very dependable. I hope someone takes 성환, the quiet captain of diligence, out for a taste of deviation. It is a pity to leave 민현, who has only just started working out with me and still has plenty of muscles left to train. I hope 형래, who comes to the lab twice a day, cuts down on canned coffee and finishes his master's safely. I believe 지원, who could become a good friend, will do excellently in the doctoral course with his current drive. To 석민, I am sorry for teasing you while drunk at the year-end party; I remember all of it. To 민찬, whom I wish I had started working out with long ago, I entrust 민현's hard training. The person who makes me wonder how anyone can do research this well is 형주. I hope 범준, the darling of the lower room, is accepted as a technical research personnel, and although 병찬 still has a long way to go, I hope he takes good care of the orchid until graduation and hands it down to a junior. I hope 길호, busy as he is, also finishes his degree well. I wish everyone good health, the achievement of their goals, and graduation.

Outside of research, I am truly grateful to my old friends from middle school, high school, and university, who bring great energy to my life. I am always thankful to my beloved parents, my younger sibling, my older brother, and my older sister, who made me who I am today, and to my uncle's family and my aunt's family, who have given me many happy memories since childhood. I will live gratefully, remembering the love they have always shown me. Finally, I thank my grandmother, who is in heaven but whom I can never forget.