ry pyconjp2015 karaoke


PyCon JP 2015

Renyuan Lyu

呂仁園

Chun-Han Lai

賴俊翰

Karaoke-style Read-aloud System

Chang Gung Univ.

Taiwan

Saturday, October 10, 2 p.m.–2:30 p.m., Conference Room 1 (会議室1)

https://pycon.jp/2015/en/speaker/profile/57/



https://pycon.jp/2015/en/schedule/presentation/62/





CguTextKaraoke: a Karaoke-style Read-aloud System

Using Speech Alignment and Text-to-Speech Technology

Chun-Han Lai (賴俊翰)

Renyuan Lyu (呂仁園)

Chang Gung University (長庚大學) Taiwan (台灣)








Abstract

• A procedure for creating a speech-to-text synchronization file from a text-only file

– the file can be used to highlight text in step with the audio, just like a karaoke machine

– very useful for language learning

• Cloud-based TTS (text-to-speech) technology, such as Google TTS

• Speech-recognition technology, such as HTK, for temporal alignment


Introduction

• Starting from a text-only file, we use a cloud-based text-to-speech (TTS) service, such as Google Translate/TTS, together with a speech-recognition toolkit, such as the Hidden Markov Model Toolkit (HTK), to generate a timed-text file that aligns the text with the speech waveform along the time axis.

• Python serves both as the glue that links these different software resources, such as Google Translate and HTK, and as a powerful tool for all the text-processing tasks in this project.

• From such a timed-text file, a JavaScript-based web app and a Python GUI program display time-aligned, highlighted text at the word level, like a karaoke machine, which we consider very useful for language learning.


A Karaoke-style Text Read-aloud System

• Karaoke (カラオケ) is a form of interactive entertainment in which an amateur singer sings along with recorded music. [https://en.wikipedia.org/wiki/Karaoke]

• Lyrics are usually displayed on a video screen, along with a moving symbol, changing color, or music video images, to guide the singer.

• Here is one of my favorite examples: https://www.youtube-nocookie.com/embed/9a5KoXNCagM?start=180

Speech Shadowing Technique for Language Learning

• The motivation of this project: speech shadowing [https://en.wikipedia.org/wiki/Speech_shadowing]

– Speech shadowing is a language-learning technique in which subjects repeat speech immediately after hearing it.

– A demonstration can be viewed at the following YouTube link:

• “English Speaking Practice: How to improve your English Speaking and Fluency: SHADOWING”

• https://www.youtube.com/watch?v=GVWFGIyNswI

Text-to-Speech Synthesis

Given: a piece of text, e.g.,

Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links designed to guide the user to related pages with additional information.

The goal is to obtain its speech.

Google TTS API in a Python module

• pip install gTTS [https://github.com/pndurette/gTTS]

from gtts import gTTS
aText= 'Wikipedia is a multilingual, ...'
aLang= 'en'
tts= gTTS(text= aText, lang= aLang)
tts.save("aSpeech.mp3")

aText → aSpeech.mp3

FFmpeg

• About FFmpeg [https://en.wikipedia.org/wiki/FFmpeg]

– FFmpeg is a free software project that produces libraries and programs for handling multimedia data.

– It is one of the leading multimedia frameworks, able to do many DSP tasks, including:

• decode, encode, transcode, mux, demux, stream, filter, and play

ffmpeg -i aSpeech.mp3 -y -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav aSpeech.wav

aSpeech.mp3 → aSpeech.wav

– -acodec pcm_s16le: PCM, 16 bits/sample, little-endian

– -ac 1: 1 (mono) channel

– -ar 16000: 16000 samples/sec
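The same conversion can be driven from Python, which is the kind of glue work the talk describes. The following is a minimal sketch rather than the project's actual code: the function names `ffmpeg_to_wav_cmd` and `mp3_to_wav` are hypothetical, and it assumes the `ffmpeg` executable is on the PATH.

```python
import subprocess


def ffmpeg_to_wav_cmd(src, dst, rate=16000):
    """Build the FFmpeg argument list for mono, 16-bit PCM WAV output."""
    return [
        "ffmpeg",
        "-i", src,               # input file (e.g. the mp3 from TTS)
        "-y",                    # overwrite the output without asking
        "-vn",                   # ignore any video stream
        "-acodec", "pcm_s16le",  # PCM, 16 bits/sample, little-endian
        "-ac", "1",              # 1 (mono) channel
        "-ar", str(rate),        # sampling rate in samples/sec
        "-f", "wav",             # WAV container
        dst,
    ]


def mp3_to_wav(src, dst, rate=16000):
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst, rate), check=True)


# Example (needs ffmpeg installed and aSpeech.mp3 present):
# mp3_to_wav("aSpeech.mp3", "aSpeech.wav")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.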

Verifying by seeing and hearing

ffplay aSpeech.wav

Or use an interactive audio tool, such as Audacity.

• Audacity is a powerful, free, open-source digital audio editor

– Its features include:

• Recording and playing back sounds

• Importing and exporting WAV, MP3, ...

• Viewing and editing via cut, copy, paste, ...

Text-to-Speech Alignment

Given: a piece of text and its speech, e.g.,

Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links designed to guide the user to related pages with additional information.

The goal is to obtain a ‘Timed-Text’:

start (s)  end (s)  label
0.000   0.080   sil
0.080   0.870   wikipedia
0.870   0.990   is
0.990   1.080   a
1.080   2.010   multilingual
2.010   2.140   sil
2.160   2.240   sil
2.240   3.020   webbased
3.020   3.180   sil
3.204   3.354   sil
3.354   4.284   freecontent
4.284   5.374   encyclopedia
5.374   5.774   project
5.774   6.454   supported
6.454   6.754   by
6.754   6.904   the
6.904   7.574   wikimedia
7.574   8.414   foundation
8.414   8.514   sil
8.532   8.622   sil
8.622   8.852   and
8.852   9.242   based
9.242   9.382   on
9.382   9.432   a
9.432   9.982   model
9.982   10.032  of
10.032  10.592  openly
10.592  11.212  editable
11.212  11.802  content
11.802  11.932  sil
:       :       :

Wav splitting

• At the sentence level, splitting is straightforward: the sentence boundaries follow from the time information of the TTS mp3 files, which are received sentence by sentence.
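One way to obtain sentence boundaries is to accumulate the durations of the per-sentence audio files. The sketch below uses only the standard `wave` module and assumes each sentence has already been converted to its own WAV file; the function names and the file naming are illustrative, not the project's own.

```python
import wave


def wav_duration(path):
    """Duration of a WAV file in seconds (frames / sampling rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def sentence_offsets(paths):
    """Return (path, start_s, end_s) for sentence files played back to back."""
    offsets = []
    t = 0.0
    for path in paths:
        d = wav_duration(path)
        offsets.append((path, t, t + d))
        t += d
    return offsets


# Example (hypothetical file names):
# for path, start, end in sentence_offsets(["SN0.wav", "SN1.wav", "SN2.wav"]):
#     print(path, start, end)
```

The start time of each sentence is the offset that has to be added to word times measured inside that sentence.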

Phonetic Transcription

• Speech recognition technology needs the text transcribed into phonetic symbols in order to build phone models. [http://upodn.com/phon.asp]

Original English text (ASCII only, perhaps!):

“Wikipedia is a multilingual, web-based, free-content encyclopedia project.”

Transcription in IPA (needs Unicode):

“wikipedia ɪz ə məltilɪŋwəl, wɛb- best, fri- kɑntɛnt ənsɑjkləpidiə prɑdʒɛkt.”

Transcription in SAMPA (ASCII only, including non-alphabetic symbols):

“wikipedia Iz @ m@ltilINw@l, wEb- best, fri- kAntEnt @nsAykl@pidi@ prAdZEkt.”

• Post-processing of the phonetic transcription

– Map or simply remove all undesired symbols (usually Unicode or other non-alphabetic symbols) from the various output styles

• For plain English (en),

– Approximate the phone sequence with the original text itself.

– Although this seems too simple, it has worked well so far.

• For Traditional Chinese (zh-tw),

– Google Translate was used to get phonetic symbols in Pinyin (拼音, pīnyīn), which are then reduced to plain romanization (tone marks removed)

• For Japanese (ja),

– MeCab has been used recently to get the katakana (片仮名, カタカナ).

– Romkan has been used to convert katakana to romaji (Kunrei-shiki)

• Thanks to Python, which does most of the work at this stage of processing!

• Phonetic transcription for English

– Using the regular expression module

enText= '''Wikipedia is a multilingual, web-based, free-content encyclopedia project.'''
phn= text2phn_en(enText)
# phn == 'wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project'

Implementation excerpt:

import re
pats= r'\'|\"|\-|^_|_$|,|\.|$|$'
phn= re.sub(pats, '', phn)
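To make the English case concrete, here is a minimal sketch of what such a function could look like. It is a guess that reproduces the input/output pair shown on the slide, not the talk's actual `text2phn_en`; the name `text2phn_en_sketch` and the exact cleaning rules are assumptions.

```python
import re


def text2phn_en_sketch(text):
    """Approximate a 'phone' string by lowercasing and cleaning the text itself."""
    phn = text.strip().lower()
    phn = re.sub(r"\s+", "_", phn)      # word boundaries become underscores
    phn = re.sub(r"[^a-z_]", "", phn)   # drop quotes, hyphens, commas, periods, digits
    phn = re.sub(r"_+", "_", phn)       # collapse runs of underscores
    return phn.strip("_")               # no leading or trailing underscore


enText = "Wikipedia is a multilingual, web-based, free-content encyclopedia project."
print(text2phn_en_sketch(enText))
```

Keeping only `a`–`z` and `_` guarantees the ASCII-only labels that HTK needs, at the cost of discarding digits and accented letters.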

• Phonetic transcription for Traditional Chinese

– Using the Google Translate/TTS API

tcText= '維基百科是一個自由內容'
phn= text2phn_tc(tcText)
# phn == 'weiji_baike_shi_yige_ziyou_neirong'

Implementation excerpt:

import urllib.request
GOOGLE_TTS_URL= 'https://translate.google.com.tw/translate_a/single?dt=bd&dt=ex&dt=at&'
req= urllib.request.Request(GOOGLE_TTS_URL + data)
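Once a Pinyin string with tone marks is available, removing the marks needs only the standard library. The sketch below is an illustration under assumptions: the function names are hypothetical, the sample Pinyin string was typed by hand rather than fetched from Google Translate, and the method also flattens `ü` to `u`, which may or may not be acceptable.

```python
import unicodedata


def strip_tone_marks(pinyin):
    """Remove combining diacritics, e.g. 'pīnyīn' -> 'pinyin'."""
    decomposed = unicodedata.normalize("NFD", pinyin)  # split letter + mark
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pinyin2phn(pinyin):
    """Lowercase, drop tone marks, and join words with underscores."""
    return "_".join(strip_tone_marks(pinyin).lower().split())


print(pinyin2phn("Wéijī bǎikē shì yīgè zìyóu nèiróng"))
```

The result is plain ASCII, which satisfies HTK's requirement for label names.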

• Phonetic transcription for Japanese

– Using MeCab and Romkan

jpText= '''ウィキペディアは、信頼されるフリーなオンライン百科事典、'''
phn= text2phn_jp(jpText)
# phn == 'wikipedyia_wa_sil_sinrai_sa_reru_furi-_na_onrain_hyakka_ziten'

Implementation excerpt:

import MeCab
import romkan
y= MeCab.Tagger().parse(text)
...
kun= romkan.to_kunrei(phn)

At the Halfway Point

• We now have a bundle of wav/lab files: one speech waveform and one label file per sentence

Speech recognition technology

• HMM Toolkit (HTK) [http://htk.eng.cam.ac.uk/]

– Given a speech utterance and its phone sequence, the speech can be aligned with the phones by the ‘forced alignment’ technique of the HMM approach.

– HTK, a set of HMM tools, provides a convenient way to apply the HMM approach.

• The HTK overview

HTK processing (abstract)

• #[00] setting the working dir

• #[01] creating the (hmm) model prototype

• #[02] label processing

• #[03] feature extraction

• #[04] model initialization

• #[05] model training

• #[06] forced alignment

• #[07] post file moving operation

HTK processing (detail)

#[00] setting the working dir

dirName= ./_wav/

#[01] creating the (hmm) model prototype

CreateHProto....

myHmmPro

N = 3 M = 6

#[02] label processing

000, 0,----> .\_htk\hled -A -i spLab00.mlf -n spLab00.lst -S spLab.scp hLed00

001, 0,----> .\_htk\hled -A -i spLab.mlf -n spLab.lst -S spLab.scp hLed.led

002, 0,----> .\_htk\hled -A -i spLab_p.mlf -n spLab_p.lst -S spLab.scp -I spLab

#[03] feature extraction

003, 0,----> .\_htk\HCopy -A -C hCopy.conf -S spWav2Mfc.scp 1>> 1.htk.out 2>> 2.htk.out

#[04] model initialization

004, 1,----> mkdir hmms_p

005, 0,----> .\_htk\HCompV -A -m -C hInit.conf -S spMfc.scp -I spLab_p.mlf -M hmms_p

#[05] model training

006, 0,----> .\_htk\HERest -A -C hErest.conf -S spMfc.scp -p 1 -t 2000.0 -w 3 -

007, 0,----> .\_htk\HERest -A -C hErest.conf -p 0 -t 2000.0 -w 3 -v 0.05 -I spLab_p

: (repeating several times...)

:

#[06] forced alignment

016, 0,----> .\_htk\HVite -A -a -C hVite.conf -S spMfc.scp -d hmms_p/ -i spLab_aligned

#[07] post file moving operation

017, 1,----> mkdir outDir

018, 1,----> copy spLab_aligned.mlf outDir\./_wav_aligned.mlf
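A log like the one above can be produced by a small Python driver that runs each external command in turn. This is a sketch under assumptions: the second field of each log line is taken to be the command's return code, the function name `run_steps` is hypothetical, and the commented-out HTK command is copied from the log only as an example.

```python
import subprocess
import sys


def run_steps(commands):
    """Run commands in order; print 'index, returncode,----> command' for each."""
    return_codes = []
    for i, cmd in enumerate(commands):
        rc = subprocess.call(cmd)  # cmd is a list of arguments
        print("%03d, %d,----> %s" % (i, rc, " ".join(cmd)))
        return_codes.append(rc)
    return return_codes


# Harmless demonstration with the Python interpreter itself:
run_steps([[sys.executable, "-c", "pass"]])

# With HTK installed, a step could look like this:
# run_steps([["HCopy", "-A", "-C", "hCopy.conf", "-S", "spWav2Mfc.scp"]])
```

Collecting the return codes makes it easy to stop the pipeline as soon as one tool fails.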

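HLed (label processing)

• Each HLed run applies an edit script (.led) to the label files listed in spLab.scp and writes a master label file (.mlf) and a label list (.lst):

– hLed00.led: spLab00.mlf, spLab00.lst

– hLed.led: spLab.mlf, spLab.lst

– hLed.led, with spLab_p.dic: spLab_p.mlf, spLab_p.lst

HCopy (feature extraction)

• *.wav → *.mfc, using hCopy.conf and spWav2Mfc.scp

HCompV (model initialization)

• *.mfc → hmms_p/*, using HCompV.conf, spMfc.scp, spLab_p.mlf, and the prototype myHmmPro

HERest (model training)

• *.mfc and hmms_p/* → re-estimated hmms_p/*, using hErest.conf, spMfc.scp, spLab_p.mlf, spLab_p.lst, and hmms_p/HER1.acc

• Repeated for N iterations (N=5)

HVite (forced alignment)

• *.mfc → spLab_aligned.mlf, using hVite.conf, spMfc.scp, spLab_p.lst, spLab.mlf, spLab_p.dic, and hmms_p/

HTK summary

• HLed → HCopy → HCompV → HERest → HVite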

wavDir/ → HTK Tools → spLab_aligned.mlf

#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
8700000 9900000 is -855.988770
9900000 10800000 a -693.554871
10800000 20100000 multilingual -7268.197266
20100000 21400000 sil -791.746216
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
8600000 10200000 sil -1048.225220
.
"./_wav/SN2.rec"
0 1500000 sil -1100.892822
1500000 10800000 freecontent -7094.197266
10800000 21700000 encyclopedia -8148.633789
21700000 25700000 project -3247.493896
25700000 32500000 supported -5594.979492
32500000 35500000 by -2412.487305
35500000 37000000 the -1176.310547
37000000 43700000 wikimedia -5128.852051
43700000 52100000 foundation -5995.618164
52100000 53100000 sil -695.872864
.
.
.
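HTK writes label times in units of 100 ns, so dividing by 10^7 gives seconds, and adding each sentence's start offset gives times on the axis of the whole recording. The parser below is a minimal sketch: the function names are hypothetical, and the 2.16 s offset used for the second sentence is read off the timed-text listing shown earlier in the talk.

```python
def parse_mlf(text):
    """Parse an HTK master label file into {utterance: [(start_s, end_s, label)]}."""
    utterances = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):          # a quoted name starts a new utterance
            current = line.strip('"')
            utterances[current] = []
        elif line == ".":                 # a lone period ends the utterance
            current = None
        elif current is not None:
            start, end, label = line.split()[:3]
            utterances[current].append((int(start) / 1e7, int(end) / 1e7, label))
    return utterances


def to_timeline(utterances, offsets):
    """Shift each utterance by its offset (seconds) and merge into one sorted list."""
    timeline = []
    for name, segments in utterances.items():
        shift = offsets[name]
        for start, end, label in segments:
            timeline.append((round(shift + start, 3), round(shift + end, 3), label))
    return sorted(timeline)


sample = '''#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
.
'''
offsets = {"./_wav/SN0.rec": 0.0, "./_wav/SN1.rec": 2.16}
for segment in to_timeline(parse_mlf(sample), offsets):
    print(segment)
```

The fourth column of each MLF line, the log-likelihood score, is ignored here because the karaoke display needs only the times and labels.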

The major algorithm in HTK

• Forced Alignment in HTK

– 1. Given a speech signal

– 2. Transcribe its pronunciation

• Pronunciation symbols must be ASCII-only!!

• e.g., ‘Holiday Shopping’ = ‘h’+‘o’+‘l’+‘i’+‘d’+‘ay’+‘sil’+‘sh’+‘o’+‘p’+‘I’+‘ng’

– 3. Train the HMM models, one per phone (‘h’, ‘o’, ..., ‘ng’)

– 4. Run the Viterbi search for the optimal path (alignment):


#!MLF!#

"wavDir/SN0001.rec"

0 800000 sil -567.865356

800000 8700000 wikipedia -5670.471680

8700000 10000000 is -951.059692

10000000 10600000 a -489.843994

10600000 20000000 multilingual -7398.754395

20000000 20700000 sil -416.119415

.

"wavDir/SN0002.rec"

0 900000 sil -632.964050

900000 8600000 webbased -6000.767578

8600000 9900000 sil -914.236206

.

"wavDir/SN0003.rec"

0 2100000 sil -1373.137817

2100000 9000000 freecontent -5306.260742

9000000 18500000 encyclopedia -6654.958984

18500000 25600000 project -5698.730469

25600000 32700000 supported -5713.494141

32700000 33200000 by -429.306763

33200000 34800000 the -1205.477539

34800000 41500000 wikimedia -5115.318359

41500000 50000000 foundation -6074.208496

50000000 52000000 and -1746.236938

52000000 56200000 based -3267.695801

56200000 57000000 on -585.264404

57000000 57700000 a -577.346130

57700000 63200000 model -3769.413574

63200000 63800000 of -524.015503

63800000 65300000 sil -1129.348633

.

wavDir.align
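The Viterbi search behind such an alignment can be illustrated with a toy dynamic program. This is a simplified sketch, not HVite's implementation: it assumes one state per label, no transition probabilities, and a made-up table `loglik[t][s]` holding the log-likelihood of frame `t` under label `s`; the function name `forced_align` is hypothetical.

```python
def forced_align(loglik):
    """Best monotonic assignment of frames to a fixed left-to-right label sequence.

    The path starts in label 0, ends in the last label, and at each frame
    either stays in the current label or advances to the next one.
    """
    n_frames, n_labels = len(loglik), len(loglik[0])
    if n_frames < n_labels:
        raise ValueError("need at least one frame per label")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_labels for _ in range(n_frames)]
    back = [[0] * n_labels for _ in range(n_frames)]
    best[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for s in range(n_labels):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best[t][s], back[t][s] = move + loglik[t][s], s - 1
            else:
                best[t][s], back[t][s] = stay + loglik[t][s], s
    # Trace back from the last label at the last frame.
    path = [n_labels - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Five frames, three labels; each frame clearly favors one label.
scores = [
    [0, -5, -5],
    [0, -5, -5],
    [-5, 0, -5],
    [-5, -5, 0],
    [-5, -5, 0],
]
print(forced_align(scores))
```

The frame index at which the path changes label is the label boundary; multiplying it by the frame period gives the start and end times that appear in the aligned label file.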

Now it’s time to KaraOke!

A Browser in JavaScript and HTML for Text-KaraOke

• https://youtu.be/11-ltx0yv_o



A Browser in Python using Tkinter for Text-KaraOke
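Whatever toolkit draws the text, the core of a karaoke display is a lookup from the current playback time to the word to highlight. The following is a minimal sketch of that lookup, independent of Tkinter; the function name `word_at` and the rule of not highlighting `sil` segments are assumptions, and the sample times come from the timed-text example in this talk.

```python
import bisect


def word_at(timeline, t):
    """Return the label whose [start, end) interval contains time t, else None."""
    starts = [segment[0] for segment in timeline]
    i = bisect.bisect_right(starts, t) - 1   # last segment starting at or before t
    if i < 0:
        return None
    start, end, label = timeline[i]
    if t >= end or label == "sil":           # in a gap, or in silence
        return None
    return label


timeline = [
    (0.00, 0.08, "sil"),
    (0.08, 0.87, "wikipedia"),
    (0.87, 0.99, "is"),
    (0.99, 1.08, "a"),
    (1.08, 2.01, "multilingual"),
    (2.01, 2.14, "sil"),
]
print(word_at(timeline, 0.5))
```

A GUI would call such a function on a timer during playback and restyle the returned word.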


Conclusion & Future Work

• Make the process more automatic.

• Make the user interface friendlier.

• Make the program more robust.

• We welcome your help in improving it.

• Thank you for listening!


Thank you for Listening. ご聴取有り難う御座いました。

感謝您的收聽。








