A new framework for Language Model Training
David Huggins-Daines
January 19, 2006

Overview
• Current tools
• Requirements for new framework
• User Interface Examples
• Design and API

Current status of LM training
• The CMU SLM toolkit
  • Efficient implementation of basic algorithms
  • Doesn’t handle all tasks of building an LM
    • Text normalization
    • Vocabulary selection
    • Interpolation/adaptation
• Requires an expert to “put the pieces together”
  • Lots of scripts
  • SimpleLM, Communicator, CALO, etc.
• Other LM toolkits
  • SRILM, Lemur, others?

Requirements
• LM training should be
  • Repeatable
    • An “end-to-end” rebuild should produce the same result
  • Configurable
    • It should be easy to change parameters and rebuild the entire model to see their effect
  • Flexible
    • Should support many types of source texts and methods of training
  • Extensible
    • Modular structure to allow new methods and data sources to be easily implemented

Tasks of building an LM
• Normalize source texts
  • They come in many different formats!
  • LM toolkit expects a stream of words
  • What is a “word”?
    • Compound words, acronyms
    • Non-lexemes (filler words, pauses, disfluencies)
  • What is a “sentence”?
    • Segmentation of input data
  • Annotate source texts with class tags
• Select a vocabulary
  • Determine optimal vocabulary size
  • Collect words from training texts
  • Define vocabulary classes
  • Vocabulary closure
• Build a dictionary (pronunciation modeling)

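One way to look at "determine optimal vocabulary size" is to measure the out-of-vocabulary (OOV) rate on held-out text for several candidate sizes. The following is a minimal sketch under stated assumptions, not part of the framework: the helper names (count_words, top_words, oov_rate) and the toy data are hypothetical, and sentences are assumed to be already normalized into word lists.

```perl
use strict;
use warnings;

# Count word frequencies over a list of sentences (array refs of words).
sub count_words {
    my (@sentences) = @_;
    my %count;
    for my $sent (@sentences) {
        $count{$_}++ for @$sent;
    }
    return \%count;
}

# Keep the $size most frequent words (ties broken alphabetically).
sub top_words {
    my ($count, $size) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    $size = @sorted if $size > @sorted;
    my %vocab = map { $_ => 1 } @sorted[0 .. $size - 1];
    return \%vocab;
}

# Fraction of held-out tokens that are not in the vocabulary.
sub oov_rate {
    my ($vocab, @sentences) = @_;
    my ($oov, $total) = (0, 0);
    for my $sent (@sentences) {
        for my $w (@$sent) {
            $total++;
            $oov++ unless $vocab->{$w};
        }
    }
    return $total ? $oov / $total : 0;
}

# Toy data, invented for illustration only.
my @train   = ([qw(the cat sat)], [qw(the dog sat)], [qw(the cat ran)]);
my @heldout = ([qw(the cat sat down)]);

my $count = count_words(@train);
for my $size (1 .. 3) {
    my $vocab = top_words($count, $size);
    printf "size=%d oov=%.2f\n", $size, oov_rate($vocab, @heldout);
}
```

A larger vocabulary lowers the OOV rate, so in practice this curve would be weighed against perplexity and recognizer cost when choosing a size.
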
Tasks, continued
• Estimate N-Gram model(s)
  • Choose the appropriate smoothing parameters
  • Find the appropriate divisions of the training set
• Interpolate N-Gram models
  • Use a held-out set representative of the test set
  • Find weights for different models which maximize likelihood (minimize perplexity) on this domain
• Evaluate language model
  • Jointly minimize perplexity and OOV rate
  • (they tend to move in opposite directions)

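The weight-finding step for interpolation is commonly done with expectation-maximization (EM) over the held-out set. The sketch below assumes that technique; the slides do not say which method the framework uses. The function names and the probability table are hypothetical, and every model is assumed to assign a non-zero (smoothed) probability to every held-out token.

```perl
use strict;
use warnings;

# Estimate linear interpolation weights by EM.
# $probs is a reference to a list of rows; row t holds the probability
# that each component model assigns to held-out token t.
sub em_weights {
    my ($probs, $iters) = @_;
    my $k = @{ $probs->[0] };        # number of component models
    my $n = @$probs;                 # number of held-out tokens
    my @w = (1 / $k) x $k;           # start from uniform weights
    for my $iter (1 .. $iters) {
        my @post_sum = (0) x $k;
        for my $row (@$probs) {
            my $mix = 0;
            $mix += $w[$_] * $row->[$_] for 0 .. $k - 1;
            # E-step: responsibility of each model for this token
            $post_sum[$_] += $w[$_] * $row->[$_] / $mix for 0 .. $k - 1;
        }
        # M-step: each new weight is the average responsibility
        @w = map { $_ / $n } @post_sum;
    }
    return @w;
}

# Perplexity of the interpolated model on the held-out tokens.
sub perplexity {
    my ($probs, @w) = @_;
    my $n = @$probs;
    my $logsum = 0;
    for my $row (@$probs) {
        my $mix = 0;
        $mix += $w[$_] * $row->[$_] for 0 .. $#w;
        $logsum += log($mix);
    }
    return exp(-$logsum / $n);
}

# Invented per-token probabilities from two models on four held-out tokens.
my @heldout = ([0.5, 0.1], [0.4, 0.1], [0.1, 0.3], [0.6, 0.2]);
my @weights = em_weights(\@heldout, 50);
printf "weights: %.3f %.3f\n", @weights;
printf "perplexity: %.3f\n", perplexity(\@heldout, @weights);
```

Because each EM iteration cannot decrease held-out likelihood, the resulting perplexity is never worse than that of the uniform starting weights.
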
A Simple Switchboard Example

    <!-- Top level tag - must be only one -->
    <NGramModel>
      <!-- A set of transcripts -->
      <Transcripts name="swb.files">
        <!-- The input filter to use -->
        <InputFilter::SWB>
          <!-- A list of files -->
          <Transcripts list="swb.files"/>
        </InputFilter::SWB>
      </Transcripts>
      <!-- Exclude singletons -->
      <Vocabulary cutoff="1">
        <!-- Backreference to named object -->
        <Transcripts name="swb.files"/>
      </Vocabulary>
    </NGramModel>

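The effect of cutoff="1" (exclude singletons) can be sketched in a few lines of Perl. This is an illustration only: build_vocabulary is a hypothetical helper, not the framework's Vocabulary class, and it assumes that a word is kept only when its count is strictly greater than the cutoff.

```perl
use strict;
use warnings;

# Build a vocabulary from whitespace-separated sentences, dropping words
# whose count is at or below $cutoff (assumed: cutoff=1 removes singletons).
sub build_vocabulary {
    my ($cutoff, @sentences) = @_;
    my %count;
    for my $sent (@sentences) {
        $count{$_}++ for split ' ', $sent;
    }
    my @kept = grep { $count{$_} > $cutoff } keys %count;
    return sort @kept;
}

# Invented toy transcripts.
my @vocab = build_vocabulary(1, "uh i think so", "i think not", "so i guess");
print join(" ", @vocab), "\n";
```
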
A More Complicated Example (Interpolation of ICSI and Switchboard)

    <NGramModel name="interp.test">
      <!-- Files can be listed directly in element contents -->
      <Transcripts name="swb.test">
        swb.test.lsn
      </Transcripts>
      <Transcripts name="icsi.test">
        <InputFilter::ICSI>
          icsi.test.mrt
        </InputFilter::ICSI>
      </Transcripts>
      <!-- Vocabularies can be nested (merged) -->
      <Vocabulary name="icsi.swb1">
        <Vocabulary cutoff="1">
          <Transcripts name="swb.test"/>
        </Vocabulary>
        <Vocabulary>
          <Transcripts name="icsi.test"/>
        </Vocabulary>
        <!-- Words can be listed directly in element contents -->
        BRAZIL
      </Vocabulary>
      <NGramModel name="swb.test">
        <Transcripts name="swb.test"/>
        <Vocabulary name="icsi.swb1"/>
      </NGramModel>
      <NGramModel name="icsi.test">
        <Transcripts name="icsi.test"/>
        <Vocabulary name="icsi.swb1"/>
      </NGramModel>
      <Interpolation>
        <!-- Held-out set for interpolation -->
        <InputFilter::CMU>
          cmu.test.trs
        </InputFilter::CMU>
        <!-- Interpolate previously named LMs -->
        <NGramModel name="swb.test"/>
        <NGramModel name="icsi.test"/>
      </Interpolation>
    </NGramModel>

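Merging nested vocabularies, plus words listed directly in the element, amounts to taking a set union. The sketch below shows that idea with a hypothetical merge_vocabularies helper and invented word lists; how the real Vocabulary class merges (for example, whether it also combines counts) is not stated on the slides.

```perl
use strict;
use warnings;

# Merge any number of vocabularies (array refs of words) into one
# sorted list with duplicates removed.
sub merge_vocabularies {
    my (@vocabs) = @_;
    my %seen;
    for my $vocab (@vocabs) {
        $seen{$_} = 1 for @$vocab;
    }
    my @merged = sort keys %seen;
    return @merged;
}

# Invented word lists standing in for the two nested vocabularies
# and the directly listed word.
my @swb   = qw(okay right yeah);
my @icsi  = qw(meeting okay right);
my @extra = qw(BRAZIL);
print join(" ", merge_vocabularies(\@swb, \@icsi, \@extra)), "\n";
```
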
Command-line Interface
• lm_train
  • “Runs” an XML configuration file
• build_vocab
  • Build vocabularies, normalize transcripts
• ngram_train
  • Train individual N-Gram models
• ngram_test
  • Evaluate N-Gram models
• ngram_interpolate
  • Interpolate and combine N-Gram models
• ngram_pronounce
  • Build a pronunciation lexicon from a language model or vocabulary

Programming Interface
• NGramFactory
  • Builds an NGramModel from an XML specification (as seen previously)
• NGramModel
  • Trains a single N-Gram LM from some transcripts
• Vocabulary
  • Builds a vocabulary from transcripts or other vocabularies
• InputFilter
  • Subclassed into InputFilter::CMU, InputFilter::ICSI, InputFilter::HUB5, InputFilter::ISL, etc.
  • Reads transcripts in some format and outputs a word stream

Design in Plain English
• NGramFactory builds an NGramModel
• NGramModel has a Vocabulary
• NGramModel and Vocabulary can have Transcripts
• NGramModel and Vocabulary use an InputFilter (or maybe they don’t)
• NGramModel can merge two other NGramModels using a set of Transcripts
• Vocabulary can merge another Vocabulary

A very simple InputFilter

    use strict;                       # please!!!
    package InputFilter::Simple;      # (InputFilter/Simple.pm)
    require InputFilter;
    use base 'InputFilter';           # Subclass of InputFilter

    sub process_transcript {
        my ($self, $file) = @_;
        local ($_, *FILE);            # (This is just good practice)
        open FILE, "<$file" or die "Failed to open $file: $!";
        while (<FILE>) {              # Read the input file
            chomp;
            my @words = split;        # Tokenize, normalize, etc
            $self->output_sentence(@words);   # Pass each sentence to this method
        }
    }

    1;

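To show how such a subclass might be driven end to end, here is a self-contained sketch. Everything in it is an assumption made for illustration: the InputFilter package below is a hypothetical stand-in that only collects sentences (the real base class interface is not shown on the slides), InputFilter::Upper is an invented subclass, and "++UH++" is a made-up filler token.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real InputFilter base class.
# It simply collects the sentences passed to output_sentence.
package InputFilter;

sub new {
    my ($class) = @_;
    return bless { sentences => [] }, $class;
}

# Called by subclasses once per sentence.
sub output_sentence {
    my ($self, @words) = @_;
    push @{ $self->{sentences} }, join(' ', @words);
}

# Invented subclass: upper-cases words and drops a made-up filler token.
package InputFilter::Upper;
use parent -norequire, 'InputFilter';

sub process_transcript {
    my ($self, $file) = @_;
    open my $fh, '<', $file or die "Failed to open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my @words = grep { $_ ne '++UH++' } map { uc } split ' ', $line;
        $self->output_sentence(@words) if @words;   # skip empty lines
    }
    close $fh;
}

package main;
use File::Temp qw(tempfile);

# Write a tiny transcript to a temporary file and run the filter over it.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "++uh++ hello world\n\ngood morning\n";
close $out;

my $filter = InputFilter::Upper->new;
$filter->process_transcript($path);
print "$_\n" for @{ $filter->{sentences} };
```

The lexical filehandle and three-argument open used here are a modern alternative to the `local *FILE` idiom on the slide; both avoid leaking a global filehandle.
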
Where to get it
• Currently in CVS on fife.speech
  • :ext:fife.speech.cs.cmu.edu:/home/CVS
  • module LMTraining
• Future: CPAN and cmusphinx.org
• Possibly integrated with the CMU SLM toolkit in the future

Stuff TODO
• Class LM support
  • Communicator-style class tags are recognized and supported
  • NGramModel will build .lmctl and .probdef files
  • However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger
  • Automatic tagging would be nice…
• Support for languages other than English
  • Text normalization conventions
  • Word segmentation (for Asian languages)
  • Character set support (case conversions etc.)
  • Unicode (also a CMU-SLM problem)

Questions?