Basic Speech Features
Jyh-Shing Roger Jang (張智星)
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan Univ., Taiwan
Characteristics of Speech
- Long-term (sentence level, several seconds)
  - Drastic/irregular changes
- Short-term (frame level, 20 ms or so)
  - Regular periodic changes for voiced sounds
  - Noise-like for unvoiced sounds
- Hard to recognize without context information
Voice Generation & Reception
- Steps in voice generation & reception
  - Vibration of the voice source
  - Resonance in the surrounding organs
  - Traveling through air (or other media)
  - Reception by membranes and neurons in the inner ear
  - Recognition by the brain
Frame-level Waveform
Three Basic Speech Features
- Three basic speech features
  - Volume/Energy/Intensity: Vibration amplitude
  - Pitch: Fundamental frequency (equal to the reciprocal of the fundamental period)
  - Timbre: The waveform within a fundamental period
- These features are perceived subjectively by humans. However, we can use some mathematics to "emulate" human perception and capture these features.
Waveform in Time-domain
- Three basic characteristics in a waveform:
  - Fundamental period (FP)
  - Intensity
  - Timbre: Waveform within an FP
Demo: Waveform Display
- Waveforms of tuning forks and human voices
  - MATLAB
    - toolbox/audio/waveFileRecord.m
  - CoolEdit
Spectrum in Frequency-Domain
- Three basic characteristics in a spectrum:
  - Timbre: Spectrum after smoothing
  - Pitch: Distance between harmonics
  - Intensity: Magnitude of spectrum
[Figure labels: first formant F1, second formant F2, intensity, pitch frequency]
Demo: Real-time Spectrogram
- Try "dspstfft_audio" under MATLAB
[Figure panels: Spectrum, Spectrogram]
Audio Feature Extraction & Recognition
- Frame blocking
  - Frame duration of 20 ms
- Feature extraction
  - Volume, pitch, MFCC, LPC, etc.
- Endpoint detection
  - Based on volume & ZCR
- Recognition
  - DTW, HMM
Example: Audio Feature Extraction
- 256 points per frame, with 84 points of overlap between adjacent frames
- At a sample rate of 11025 Hz: 11025/(256 - 84) ≈ 64 feature vectors per second
[Figure labels: overlap, frame, zoom in]
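The frame arithmetic above can be checked with a minimal sketch. The function name `frame_blocking` and its parameters are illustrative assumptions, not part of the original toolbox; incomplete trailing frames are simply dropped here, which is one of several possible choices.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a sequence of samples into overlapping frames.

    Hypothetical helper: trailing samples that do not fill a whole
    frame are discarded.
    """
    hop = frame_size - overlap  # samples between consecutive frame starts
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    frames = []
    start = 0
    while start + frame_size <= len(signal):
        frames.append(signal[start:start + frame_size])
        start += hop
    return frames


sample_rate = 11025                      # Hz, as in the slide's example
one_second = list(range(sample_rate))    # stand-in for one second of audio
frames = frame_blocking(one_second)

# Nominal frame rate used on the slide: sample_rate / hop
frame_rate = sample_rate / (256 - 84)
print(len(frames), round(frame_rate, 2))
```

The nominal rate is about 64 frames per second; the number of complete frames that fit in exactly one second is slightly lower because the last frame needs a full 256 samples.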
Acoustic Feature: Energy
- Energy is the square sum of a frame, also known as intensity or volume.
- Characteristics:
  - Noise and fricatives usually have low energy.
  - Energy is strongly influenced by the microphone setup.
  - Taking 10 times the base-10 logarithm of the square sum gives energy in decibels (dB).
  - Energy is commonly used in endpoint detection.
  - In embedded system implementations, volume can be computed as the sum of absolute values of a frame to reduce computation.
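A minimal sketch of the three volume measures described above. The function names and the small `floor` value (used only to avoid taking the logarithm of zero on a silent frame) are assumptions for illustration.

```python
import math


def frame_energy(frame):
    """Square sum of the samples in a frame."""
    return sum(x * x for x in frame)


def frame_energy_db(frame, floor=1e-12):
    """Energy in decibels: 10 * log10(square sum).

    `floor` is an assumed guard so that an all-zero frame does not
    raise a math domain error.
    """
    return 10.0 * math.log10(max(frame_energy(frame), floor))


def frame_abs_sum(frame):
    """Cheaper volume measure: sum of absolute sample values."""
    return sum(abs(x) for x in frame)


frame = [0.5, -0.5, 0.5, -0.5]
louder = [10 * x for x in frame]  # same frame with 10x the amplitude

print(frame_energy(frame), frame_energy_db(frame), frame_abs_sum(frame))
print(frame_energy_db(louder) - frame_energy_db(frame))
```

Scaling the amplitude by 10 raises the energy by a factor of 100, that is, by 20 dB, which illustrates why the measure depends so much on microphone gain.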
Acoustic Feature: Zero Crossing Rate
- Zero crossing rate (ZCR)
  - The number of zero crossings in a frame.
- Characteristics:
  - Noise and unvoiced sounds have high ZCR.
  - ZCR is commonly used in endpoint detection, especially in detecting the start and end of unvoiced sounds.
  - To distinguish noise/silence from unvoiced sounds, we usually add a bias to the frame before computing ZCR.
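A minimal sketch of ZCR with an optional bias, assuming a crossing is counted whenever two adjacent samples differ in sign (with zero treated as non-negative). The function name, the bias value, and the sample frames are illustrative assumptions.

```python
def zero_crossing_rate(frame, bias=0.0):
    """Count sign changes between adjacent samples after adding `bias`."""
    shifted = [x + bias for x in frame]
    count = 0
    for prev, curr in zip(shifted, shifted[1:]):
        # A crossing occurs when the two samples lie on opposite sides of zero.
        if (prev >= 0) != (curr >= 0):
            count += 1
    return count


# Low-amplitude background noise and a stronger unvoiced-like frame (made-up values).
silence = [0.01, -0.01, 0.01, -0.01, 0.01]
unvoiced = [0.3, -0.3, 0.3, -0.3, 0.3]

print(zero_crossing_rate(silence), zero_crossing_rate(unvoiced))
print(zero_crossing_rate(silence, bias=0.05), zero_crossing_rate(unvoiced, bias=0.05))
```

Without a bias both frames cross zero equally often. With a bias larger than the noise amplitude, the low-level noise no longer crosses zero, while the stronger unvoiced frame still does.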
Pitch
- Computation
  - Pitch frequency is the reciprocal of the fundamental period.
  - Pitch in terms of semitones:
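One widely used way to express pitch in semitones is the MIDI note scale, where A4 (440 Hz) maps to 69 and each doubling of frequency adds 12 semitones. The sketch below assumes that convention, which may differ from the one intended on this slide; the function names are illustrative.

```python
import math


def period_to_freq(period_samples, sample_rate):
    """Pitch frequency (Hz) as the reciprocal of the fundamental period.

    The period is given in samples, so the period in seconds is
    period_samples / sample_rate.
    """
    return sample_rate / period_samples


def freq_to_semitone(freq_hz):
    """Semitone number under the assumed MIDI convention (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


freq = period_to_freq(100, 11025)  # a 100-sample period at 11025 Hz
print(freq, freq_to_semitone(freq))
print(freq_to_semitone(440.0), freq_to_semitone(880.0))
```

Because the scale is logarithmic, equal frequency ratios correspond to equal semitone differences, which matches how pitch intervals are perceived.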