Synthesis Unit and Question Set Definition for Mandarin

Synthesis Unit and Question Set Definition for Mandarin HMM-based Singing Voice Synthesis
  Student: Ju-Yun Cheng
  Advisor: Prof. Chung-Hsien Wu
  Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan

Outline
  • Introduction
  • Background
  • Motivation
  • Related work
  • Singing voice synthesis system
  • Evaluation
  • Discussion
  • Conclusion
  • Future work

Introduction - Background
  • Speech and singing are both important ways to communicate and express emotion.
  • Speech synthesizers can already generate fluent, natural speech, even with personal characteristics.
  • Singing voice synthesis has recently become an emerging and popular research topic.
    • It enables computers to sing any song without an actual human singer.

Introduction - Background
  • There are two main methods in the corpus-based singing synthesis area.
  • Sample-based approach: unit selection
    • Appropriate sub-word units are selected from large speech databases.
    • Pros: high-quality speech at the waveform level
    • Cons: requires a huge amount of recorded data; discontinuities; unstable quality; fixed voice characteristics
  • [Diagram blocks: lyrics, note, score editor, synthesis score, singer library, sample selection, concatenation, synthesis output]

Introduction - Background
  • Sample-based approach: unit selection
    • Units are chosen from a singing voice corpus according to the lyrics of the song and the corresponding MIDI file [Zhou, 2008].
  • Vocaloid
    • A singing synthesizer developed by Yamaha Corporation, initially released in January 2004
    • Uses pitch conversion and timbre manipulation to concatenate samples smoothly

Introduction - Background
  • There are two main methods in the corpus-based singing synthesis area.
  • Statistical approach: HMM-based
    • Parameters are modeled with context-dependent HMMs, and waveforms are generated from the HMMs.
    • Pros: relatively little training data; smooth and stable quality; flexible control of voice characteristics
    • Cons: vocoder sound; over-smoothing
  • [Diagram blocks: singing waveform, labels, parameter extraction, acoustic model training, acoustic model parameters, parameter generation, singing parameters, waveform generation, synthesis output]

Introduction - Background
  • Statistical approach: HMM-based
  • Sinsy
    • A free online singing voice synthesis service, available in Japanese and English versions
    • Users obtain synthesized singing voices by uploading musical scores represented in MusicXML.

Introduction - Background
  • Other methods for singing voice synthesis systems
  • HNM (Harmonic plus Noise Model)
    • The HNM parameters of a source syllable are used to synthesize singing syllables of diverse pitches and durations [Gu, 2008].
  • Speech-to-singing
    • Singing voice is synthesized by a parameter control model from the lyrics of a song and its musical score [Akagi, 2007].
    • Lyrics are converted into speech by TTS; a melody control model then converts the speech signal into singing voice by modifying the acoustic parameters [Cai, 2011].

Introduction - Motivation
  • To synthesize smooth and continuous singing voice, we chose the HMM-based method to build our singing voice synthesis system.
  • HMMs can model the temporal sequence of the singing voice.
  • Parameters are generated from an HMM composed by concatenating phoneme HMMs.
    • HMM state sequence
    • State duration
    • Spectral and lf0 (log F0) parameters

Introduction - Improvement in Sinsy
  • A series of papers written by the team that produced Sinsy:
  • [An HMM-based Singing Voice Synthesis System, 2006]
    • The first paper on an HMM-based singing voice synthesis system
  • [HMM-based Singing Voice Synthesis System using Pitch-shifted Pseudo Training Data, 2010]
    • To increase the amount of F0 training data, pitch-shifted pseudo data are prepared by shifting F0 up or down in halftone (semitone) steps.
  • [Recent Development of the HMM-based Singing Voice Synthesis System – Sinsy, 2010]
    • Introduces the free online singing voice synthesis service
  • [Pitch Adaptive Training for HMM-based Singing Voice Synthesis, 2012]
    • Model-level normalization of pitch

Singing voice synthesis system - feature extraction
  • STRAIGHT [H. Kawahara 1997]
    • A high-quality analysis-synthesis method that offers high flexibility in parameter manipulation with no further degradation
    • Extracts parameters with relatively good performance even in a non-professional recording environment
  • Features: pitch (F0), smoothed spectrum, aperiodic factors
  • [Diagram blocks: waveform; analysis (F0 extraction, fixed-point analysis, F0, smoothed spectrum, aperiodic factors); synthesis (mixed excitation with phase manipulation); synthetic waveform]

Singing voice synthesis system - Proposed method for Mandarin singing
  • Speech vs. singing
    • Pitch contour
    • Database, model definition, question set

Singing voice synthesis system - Proposed method for Mandarin singing
  • Speech vs. singing
  • Music score: pitch, duration, key, tempo, beat

Singing voice synthesis system - Proposed method for Mandarin singing
  • Different from Sinsy
    • Language: from Japanese to Mandarin
    • Database, model definition, question sets
    • Refinement
  • Japanese syllabary (hiragana)
    • Japanese syllables basically have the form "consonant + vowel"
    • Only five vowels
  • Bopomofo
    • 37 existing symbols (21 initials, 16 finals)

Singing voice synthesis system - Proposed method for Mandarin singing
  • [Diagram blocks: acoustic parameters; model (linguistic info, note info, cue info); question sets; singing database]
  • Different from Sinsy: only for Mandarin
  • Different from TTS: specially for singing

Singing voice synthesis system - system structure
  • Training phase
    • Singing voice database: excitation, aperiodicity, and spectral parameter extraction
    • Musical score: conversion to labels
    • Training of HMMs; CART-based state tying with the question set
  • Context-dependent HMMs & duration models
  • Synthesis phase
    • Musical score: conversion to labels
    • State selection by CART
    • Parameter generation from HMM: excitation, spectral, and aperiodicity generation
    • Synthesis filter
    • Synthesized singing voice

Singing voice synthesis system - Proposed method for Mandarin singing
  • Singing voice database construction
    • Building a singing voice database for training and synthesis
    • MHMC Singing Voice Database
  • Mandarin singing model definition
    • Initial and final modification
    • Medial modification
    • Long duration models
  • Question set definition for decision trees
    • Modification for Mandarin
  • Refinements
    • Pitch coverage by pitch-shifted pseudo data
    • Vibrato

Singing voice synthesis system - singing voice database construction
  • Singing corpus design process
  • [Diagram blocks: music score corpus, song selection, selected scores, phonetic transcription, singing signal, segmentation by phoneme, singing database]

Singing voice synthesis system - singing voice database construction
  • Song selection
  • Selecting scores
    • Music books and internet versions
  • Selection criteria and specialization
    • Simple songs that do not require many singing skills
    • Phone coverage
  • Digitizing
    • Data format: MusicXML
    • Transposition to an appropriate pitch range

Singing voice synthesis system - Model definition
  • MusicXML file
  • [Diagram: the sheet music score is keyed in and converted to MusicXML format]
  • MusicXML is an XML-based file format for representing Western musical notation. The format is proprietary, but fully and openly documented.
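
A minimal sketch of how note information might be read from a MusicXML file, using only the Python standard library. The embedded score is a hypothetical, simplified fragment (not a complete valid score, and not taken from the database), and the function name read_notes is an assumption made for illustration. The element names (note, pitch, step, octave, duration, type, lyric/text) are standard MusicXML.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-note fragment for illustration only.
SCORE = """<score-partwise version="3.0">
  <part id="P1">
    <measure number="1">
      <attributes><divisions>8</divisions></attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>8</duration><type>quarter</type>
        <lyric><text>兩</text></lyric>
      </note>
      <note>
        <pitch><step>D</step><octave>4</octave></pitch>
        <duration>16</duration><type>half</type>
        <lyric><text>隻</text></lyric>
      </note>
    </measure>
  </part>
</score-partwise>"""


def read_notes(xml_text):
    """Return (step, octave, duration, note_type, lyric) for each pitched note."""
    root = ET.fromstring(xml_text)
    notes = []
    for note in root.iter("note"):
        pitch = note.find("pitch")
        if pitch is None:  # rests carry no <pitch> element
            continue
        notes.append((
            pitch.findtext("step"),
            int(pitch.findtext("octave")),
            int(note.findtext("duration")),  # in <divisions> per quarter note
            note.findtext("type"),
            note.findtext("lyric/text"),
        ))
    return notes


for item in read_notes(SCORE):
    print(item)
```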

Singing voice synthesis system - singing voice database construction
  • Singer selection and data processing
  • Finding candidates to record demos
    • 4 candidates
  • Choosing the singer
    • Accuracy of pitch
    • Timbre
  • Checking recorded data
    • Noise is not allowed to exceed the recording criterion
  • Segmentation and normalization
    • Segmentation by phoneme
    • The energy of the singing voice data is reduced so that the singing voice does not suddenly become loud
    • A pitch scale that is too large leads to poor synthesis

Singing voice synthesis system - singing voice database
  • NCKU Singing Voice Database
  • We chose 74 songs whose lyrics together cover all Mandarin phonemes.
    Songs          Nursery rhymes / children's songs
    Total          148 songs
    Singer         One female
    Pitch range    C4~B4
    Version        1, 2
    Total time     About 102 minutes
    Sample rate    48 kHz
    Resolution     16 bits
    Channels       Mono
  • Example file names: 小蜜蜂 (Little Bee), 兩隻老虎 (Two Tigers), 火車快飛 (The Train Runs Fast)

Singing voice synthesis system - Model definition
  • [Flowchart of label generation]
  • Inputs: text, MusicXML, wav
  • Processing blocks: Word to Phone, Initial and final processing, Long duration processing, Riffs and runs processing, Extract score information, Note calculation, Pause processing, transcription, User-defined phrase units
  • Score information: Note absolute pitch, Note type, Measure, Song settings
  • Information types: linguistic information, note information (Note pitch, Note duration, Song structure), cue information
  • Output: Label

Singing voice synthesis system - Model definition
  • Initial and final processing
  • Tone
    • In singing, the main pitch of the note is more significant than the original tone of the word.
    • e.g. 不: speech -> bu wu.H wu.L; singing -> bu wu
  • Vowel
    • We define the phonemes by phonology.
    • The medial goes with the rime rather than with the initial.
    • When yi (ㄧ), wu (ㄨ), or yu (ㄩ) is a medial, the medial and rime together are treated as one kind of final.
  • [Figure: speech vs. singing]

Singing voice synthesis system - Model definition
  • Initial and final processing
  • Single initial
    • A syllable that has only an initial and no final is followed by an empty rime "帀" so that it can be pronounced.
    • Retroflex sounds (捲舌音): ㄓㄔㄕㄖ + zr
    • Flat-tongue sounds (平舌音): ㄗㄘㄙ + sr
  • Total phonemes: 59 (speech: 66)
  Initials (with the empty rimes 帀1 and 帀2)
    ㄅ b    ㄆ p    ㄇ m    ㄈ f    ㄉ d    ㄊ t    ㄋ n    ㄌ l
    ㄍ g    ㄎ k    ㄏ h    ㄐ j    ㄑ ch   ㄒ sh
    ㄓ jr   ㄔ chr  ㄕ shr  ㄖ r    帀1 zr
    ㄗ tz   ㄘ tsz  ㄙ sz   帀2 sr
  Finals
    ㄧ yi   ㄨ wu   ㄩ yu   ㄚ a    ㄛ o    ㄜ e    ㄞ ai   ㄟ ei
    ㄠ au   ㄡ ou   ㄢ an   ㄣ en   ㄤ ang  ㄥ ng   ㄦ er   ㄝ eh
  Finals with medial
    ㄧㄚ ia    ㄧㄝ ieh   ㄧㄠ iau   ㄧㄡ iou   ㄧㄢ ian   ㄧㄣ ien   ㄧㄤ iang  ㄧㄥ ing
    ㄨㄚ ua    ㄨㄛ uo    ㄨㄞ uai   ㄨㄟ uei   ㄨㄢ uan   ㄨㄣ uen   ㄨㄤ uang  ㄨㄥ ung
    ㄩㄝ iueh  ㄩㄢ iuan  ㄩㄣ iuen  ㄩㄥ iung

Singing voice synthesis system - singing voice database
  • Phonetic coverage
  • [Chart categories: initial, final, final containing medial]
  • Phones: 59; total phones: 15300; total words: 8448; songs: 148

Singing voice synthesis system - Model definition
  • Long duration model
    • Long-duration notes are important for expressive singing.
    • Shorter notes are over quickly, with no special effects.
    • Long tones are different: they provide more room for expression.
    • Lengthening a short-duration note cannot fully reproduce a long-duration note.
    • Half or whole note -> final + "L"
  • Examples: 一起飛 ("fly together"), 飛就飛 ("just fly"), 叫就叫 ("just call")
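
The long duration rule above (half or whole note -> final + "L") can be sketched as follows. This is only an illustration: the function name final_unit, the note-type strings (borrowed from MusicXML's type element), and the exact label spelling produced by appending "L" are assumptions, not the authors' implementation.

```python
# Note types that the slide treats as long: half and whole notes.
LONG_NOTE_TYPES = {"half", "whole"}


def final_unit(final, note_type):
    """Return the model name for a final, using the long variant on long notes."""
    if note_type in LONG_NOTE_TYPES:
        return final + "L"  # assumed spelling of the long duration model
    return final


print(final_unit("ei", "quarter"))  # short note keeps the plain final
print(final_unit("ei", "whole"))    # long note selects the long duration model
```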

Singing voice synthesis system - Model definition
  • Riffs and runs processing
    • One syllable corresponds to multiple notes
    • Repeat the last tonal
  • Pause processing
    • Represents the breathing pause or segmenting pause in human singing
    • When the singer pauses for longer than a threshold (> 0.3 seconds), the pause is treated as a rest
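
A minimal sketch of the pause rule, assuming the recording has already been segmented into (label, start, end) tuples in seconds. The segment format, the function name insert_rests, and the "rest" label are hypothetical; only the 0.3-second threshold comes from the slide.

```python
PAUSE_THRESHOLD_S = 0.3  # threshold stated on the slide


def insert_rests(segments, threshold=PAUSE_THRESHOLD_S):
    """Insert a ("rest", start, end) entry wherever the silent gap exceeds threshold."""
    result = []
    previous_end = None
    for label, start, end in segments:
        if previous_end is not None and start - previous_end > threshold:
            result.append(("rest", previous_end, start))
        result.append((label, start, end))
        previous_end = end
    return result


# Hypothetical segmentation: a 0.5 s gap after "a", a 0.1 s gap after "b".
segments = [("a", 0.0, 1.0), ("b", 1.5, 2.0), ("c", 2.1, 2.5)]
for entry in insert_rests(segments):
    print(entry)
```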

Singing voice synthesis system - Model definition
  • Linguistic information
    • Phoneme: current phoneme and the {preceding, succeeding} two phonemes
    • Syllable: # of phonemes in the {preceding, current, succeeding} syllable
    • Phrase: # of phonemes/syllables in the {preceding, current, succeeding} phrase
    • Song: average # of phonemes/syllables per measure in this song; # of phrases in this song
    • Riffs and runs
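
The phoneme-level context (the current phoneme plus the two preceding and two succeeding phonemes) can be sketched as a sliding window. The padding symbol "x" and the function name phoneme_contexts are assumptions for illustration; the actual label format of the system is not given in the slides.

```python
def phoneme_contexts(phonemes, pad="x"):
    """Return (prev2, prev1, current, next1, next2) for every phoneme."""
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    return [tuple(padded[i:i + 5]) for i in range(len(phonemes))]


# Phoneme names follow the initial/final table; the sequence is hypothetical.
for context in phoneme_contexts(["l", "iang", "jr", "zr"]):
    print(context)
```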

Singing voice synthesis system - Model definition
  • Singing is the act of producing musical sounds with the voice; it augments regular speech by the use of both tonality and rhythm.
  • Note pitch
    • Pitches are compared as "higher" and "lower" in the sense associated with musical melodies.
  • Note duration
    • An amount of time or a particular time interval. It is the length of a note and one of the bases of rhythm.
  • Song structure
    • The overall musical form or structure the song adopts
    • The order of a music score

Singing voice synthesis system - Model definition
  • User-defined phrase units
    • Phrasing may be necessary for the singer to take catch breaths or to achieve a certain style.
    • In music, a phrase is defined as "a short passage or segment, often consisting of four measures or forming part of a smaller/larger unit".
    • We define the phrase unit depending on the song structure.
    • Used in the outside label to represent breathing pauses
  • Examples: 4 measures / phrase, 2 measures / phrase

Singing voice synthesis system - Model definition
  • Note calculation
    • The basic information is not enough to describe a note completely.
  • Relative pitch
    • The difference between the key note and the current note
    • The key note depends on the number of sharps or flats
  • Note position
    • Different note positions in the measure or phrase may have different expression due to breathing
    • Units: note, 0.1 second, thirty-second note, %
  • Note length
    • 0.1 second (absolute length), thirty-second note (relative length)
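
A hedged sketch of the note calculation. It assumes the key note is the tonic of the major key implied by the key signature (the slides only say it depends on the number of sharps or flats), represents pitches as MIDI note numbers, and takes note length in quarter notes. All function names are hypothetical.

```python
def major_tonic_pitch_class(fifths):
    """Pitch class (0 = C) of the major-key tonic.

    fifths is the number of sharps (positive) or flats (negative).
    """
    return (7 * fifths) % 12


def relative_pitch(midi_note, fifths):
    """Semitone distance (0-11) from the key note up to the current note."""
    return (midi_note - major_tonic_pitch_class(fifths)) % 12


def note_length(quarter_notes, tempo_bpm):
    """Return note length as (thirty-second notes, units of 0.1 second)."""
    thirty_seconds = round(quarter_notes * 8)
    seconds = quarter_notes * 60.0 / tempo_bpm
    return thirty_seconds, round(seconds / 0.1)


# E4 (MIDI 64) in a key with two sharps (D major) lies 2 semitones above the key note.
print(relative_pitch(64, 2))
# A half note at tempo 100 lasts 16 thirty-second notes, about 1.2 seconds.
print(note_length(2, 100))
```

The relative length depends only on the score, while the absolute length also depends on the tempo, which is presumably why both units appear in the label.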

Singing voice synthesis system - Model definition
  • Note information
  • Note pitch
    • Absolute pitch (C0-G9), relative pitch (0-11), pitch difference between previous & current / current & next notes
  • Note duration
    • Length of note by syllable, thirty-second note, 0.1 second
  • Song structure
    • Beat: 2/4, 3/4, 4/4
    • Tempo: 90, 100, 120
    • Key
  • Position
    • Counted by note, 0.1 second, thirty-second note, percentage in the measure/phrase
  • Number of phrases
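
As an illustration of the note pitch features, the sketch below maps an absolute pitch name to a MIDI note number (using the common convention C4 = 60, under which G9 = 127) and computes the pitch difference to the previous and next note. The function names and the use of MIDI numbers are assumptions; the slides do not specify the internal representation.

```python
STEP_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def midi_number(step, octave, alter=0):
    """MIDI note number for a pitch name, with C4 = 60; alter is +1 sharp, -1 flat."""
    return 12 * (octave + 1) + STEP_TO_SEMITONE[step] + alter


def pitch_differences(midi_notes):
    """For each note return (current - previous, next - current); None at the ends."""
    diffs = []
    for i, note in enumerate(midi_notes):
        to_prev = note - midi_notes[i - 1] if i > 0 else None
        to_next = midi_notes[i + 1] - note if i + 1 < len(midi_notes) else None
        diffs.append((to_prev, to_next))
    return diffs


melody = [midi_number("C", 4), midi_number("D", 4), midi_number("C", 4)]
print(melody)
print(pitch_differences(melody))
```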

Singing voice synthesis system - Question set definition for singing model clustering
  (1) Phoneme (current and {preceding, succeeding} two phonemes)
    • Final with or without medial
    • Initial pronunciation category
    • Final pronunciation category
  (2) Note
    • Pitch
    • Tempo
    • Beat
    • Duration
    • Position
  (3) Phrase
    • # of phonemes/syllables in the preceding, current, succeeding phrase
  (4) Song
    • # of phonemes/syllables
    • # of phrases

Singing voice synthesis system - Refinement
  • Pitch-shifted pseudo data
    • Pitch coverage is extended by taking nearby notes from other songs and shifting them to the corresponding frequency (Hz).
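
A minimal sketch of the frequency arithmetic behind pitch shifting, assuming equal temperament with A4 = 440 Hz and an F0 contour stored as a list of Hz values where 0 marks unvoiced frames. Both function names and the data layout are assumptions; the slides do not describe how the shift is applied to the recordings.

```python
def note_frequency(midi_note, a4_hz=440.0):
    """Equal-tempered frequency in Hz of a MIDI note number (A4 = 69)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


def shift_f0(f0_hz, semitones):
    """Shift voiced F0 values by a number of semitones; unvoiced frames (0) stay 0."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# Hypothetical contour around C4, shifted up two semitones toward D4.
contour = [0.0, 261.0, 262.5, 261.8, 0.0]
print([round(f, 1) for f in shift_f0(contour, 2)])
print(round(note_frequency(62), 2))  # target frequency of D4
```

In the log F0 (lf0) domain the same shift is an additive offset of semitones * ln(2) / 12.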

Singing voice synthesis system - Refinement

Evaluation
  • Experimental conditions
  Database condition
    Number of songs       148
    Number of phonemes    15300
    Number of words       8443
    Number of notes       9054
    Total time            About 100 minutes
  Mel-cepstral analysis condition
    Frame shift           5 ms
    Window length         25 ms
    Window function       Blackman window
    MGC order             49-dimensional MFCCs
    Sampling rate         48 kHz
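
For reference, the analysis conditions above translate into sample counts as follows. The helper names are hypothetical, and the frame count assumes no padding at the signal edges, which may differ from the actual analysis tool.

```python
SAMPLING_RATE_HZ = 48000
FRAME_SHIFT_MS = 5
WINDOW_LENGTH_MS = 25


def ms_to_samples(milliseconds, rate=SAMPLING_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(rate * milliseconds / 1000))


def num_frames(n_samples, window, shift):
    """Number of full analysis windows that fit into n_samples (no padding)."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // shift


shift = ms_to_samples(FRAME_SHIFT_MS)     # 240 samples
window = ms_to_samples(WINDOW_LENGTH_MS)  # 1200 samples
print(shift, window, num_frames(SAMPLING_RATE_HZ, window, shift))
```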

Evaluation
  • Experiment settings
    • Baseline
    • RQ: reduced question sets (duplicate questions, indirect questions, relative questions)
    • PS: pitch-shifted pseudo data
    • VP: vibrato post-processing

Evaluation - Subjective evaluation
  • Pitch contour
    • Synthesized (baseline) vs. music score
    • Synthesized (baseline) vs. original singing

Evaluation - Subjective evaluation
  • Mean Opinion Score (MOS)
    • 10 synthesized songs
    • 12 subjects
    • Quality and intelligibility evaluation
    • Quality scale: Excellent 5, Good 4, Fair 3, Poor 2, Bad 1
  • ABX test
    • A subject is presented with two known samples (A, the reference, and B, the alternative). X is randomly selected from A and B, and the subject identifies X as being either A or B.
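
A small sketch of how a MOS mean and variance could be computed from listener ratings on the 1-5 scale. The ratings below are invented for illustration and are not the experiment's data; the use of population variance is also an assumption, since the slides do not say which variance was reported.

```python
from statistics import mean, pvariance


def mos_summary(ratings):
    """Return (mean, population variance) of ratings on the 1-5 MOS scale."""
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings), pvariance(ratings)


# Hypothetical ratings from four listeners for one synthesized song.
print(mos_summary([3, 4, 3, 2]))
```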

Evaluation - Subjective
  • Quality evaluation, Intelligibility evaluation (table values listed in slide order)
  • mean, variance: baseline 2.79, 0.166, 0.132; RQ 2.75, 0.187, 0.141; RQ+PS 3.11, 0.008
  • mean, variance: baseline 2.76, 0.173; RQ 2.49; RQ+PS 3.04

Demo
  • Outside test: baseline+RQ+PS
  • Lyrics: 娃娃哭了 叫媽媽 推你摔下 你又站起來 ("The doll cries, calling for mama; pushed down, you stand back up")

Evaluation - Subjective
  • With the reduced question set (RQ), the quality and intelligibility scores are lower than the baseline.
    • The questions we removed included information that is important for classification.
    • Too few questions remained: 5364 -> 1257
  • A better version of the reduced question sets needs to be found.

Preference test
  • Naturalness: testing vibrato
    • Different pitches and situations correspond to different settings
    • Vibrato is not essential in children's songs
  • [Samples: original, vibrato]

Discussion
  • Singing corpus quality
    • Recording in a professional environment
    • Singer's timbre
  • Context factor coverage
    • Too blurred
    • Not enough training corpus
    • Modeled with priority given to singing characteristics

Conclusion
  • A Mandarin corpus-based singing voice synthesis system based on hidden Markov models (HMMs) was implemented.
  • We defined the Mandarin model definition for singing and the question sets for model clustering.
  • We used three methods to refine our system: question set reduction, pitch-shifted pseudo data, and vibrato post-processing.

Demo
  • Inside test (original vs. our system): 火車快飛 (The Train Runs Fast), 三輪車 (Tricycle), 蝴蝶 (Butterfly)
  • Outside test (original vs. our system): 小星星 (Little Star), 妹妹揹著洋娃娃 (Little Sister Carrying a Doll), 康定情歌 (Kangding Love Song)

Thanks for listening & comments