Conversion of Penn Treebank Data to Text
Penn Treebank Project
“A Bank of Linguistic Trees” (as of 11/1992)
• University of Pennsylvania, LINC Laboratory
• 4.5 million words of American English
• Annotation of naturally occurring text for linguistic structure
Linguistic Components
• Tokenization
– Treatment of punctuation, words, etc. as separate tokens
– Example: Children’s → Children ’s
• Part-of-speech (POS) tagging
– Text first assigned POS tags automatically
– Human annotators correct first-pass POS tags
• Bracketing
– Fidditch, a deterministic parser (Hindle 1983, 1989)
– Two-stage parsing process made explicit with brackets
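The tokenization step can be illustrated with a minimal Python sketch. It is an assumption for illustration only, not the Treebank project's tokenizer: the name `tokenize` and the regular expression are hypothetical, and the rule only separates the possessive ’s and single punctuation marks. Numbers such as 67,000 and abbreviations such as S.P.C.A. would need extra rules.

```python
import re

# Hypothetical pattern (not the Treebank tokenizer): a possessive clitic,
# a run of word characters, or a single punctuation character.
TOKEN = re.compile(r"[\u2019']s\b|\w+|[^\w\s]")

def tokenize(text):
    """Split text into tokens, separating the possessive 's and punctuation."""
    return TOKEN.findall(text)
```

Calling `tokenize("Children’s toys.")` yields the possessive marker and the final period as tokens of their own, which matches the Children’s example on the slide.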
Penn Treebank: Brown Corpus (as of 11/1992)
• POS Tags (Tokens): 1,172,041
• Skeletal Parsing (Tokens): 1,172,041
You know you’re in trouble when …
“0. You will always have a certain amount of error. Sometimes there is just no way to find the head of a phrase, because it is tagged or parsed completely incorrectly. (no big surprise, that)”
Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project
robertm@unagi.cis.upenn.edu
ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2
Tree Conversion: Clean Case
( END_OF_TEXT_UNIT )
( (`` ``) (S (S (NP (PRP I)) (VP (VBP leave) (NP (DT this) (NN church))
(PP (IN with) (NP (DT a) (NN feeling) (SBAR (IN that)
(S (NP (DT a) (JJ great) (NN weight)) (AUX (VBZ has)) (VP (VBN been) (VP (VBN lifted) (PP (IN off) (NP (PRP$ my) (NN heart))))))
(, ,)
(S (NP (PRP I)) (AUX (VBP have)) (VP (VBN left) (NP (PRP$ my) (NN grudge)) (PP (IN at) (NP (DT the) (NN altar)))) (CC and) (VP (VBN forgiven) (NP (PRP$ my) (NN neighbor))))))
('' '') (. .) )
( END_OF_TEXT_UNIT )
cb08_42: ``I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.
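The conversion from a bracketed tree to running text can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the script used in the project: the names `LEAF`, `leaves`, and `to_text` are hypothetical, and every leaf is assumed to have the form `(TAG word)` with no spaces or parentheses inside the tag or the word.

```python
import re

# Assumption: every leaf looks like "(TAG word)" with no spaces or
# parentheses inside the tag or the word.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\s*\)")

def leaves(tree):
    """Return the (tag, word) pairs of a bracketed tree, left to right."""
    return LEAF.findall(tree)

def to_text(tree):
    """Join the leaf words and re-attach punctuation and the possessive 's."""
    text = " ".join(word for _tag, word in leaves(tree))
    # Remove the space before , . ; : and before the possessive marker.
    return re.sub(r"\s+([,.;:]|'s\b)", r"\1", text)
```

Markers such as `( END_OF_TEXT_UNIT )` are skipped because they do not match the leaf pattern. Quotation tokens (`` and '') stay space-separated in this sketch; re-attaching them would need additional rules.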
Tree Conversion: Problematic Case
( (S (NP (PRP He)) (VP (VBD reported) (SBAR (IN that)
(S (NP (DT the) (NN city)) (POS 's) (NNS contributions) (PP (IN for) (NP (NN animal) (NN care))))
(VP (VBD included) (NP ($ $) (CD 67,000) (PP (TO to) (NP (DT the) (NNS Women)) (POS 's) (NN S.P.C.A.))))
(: ;)
(NP ($ $) (CD 15,000)) (S (NP (-NONE- T)) (AUX (TO to)) (VP (VB pay) (NP (CD six) (NNS policemen)) (VP (VBN assigned) (PP (IN as) (NP (NN dog) (NNS catchers))))))))
(CC and)
(NP ($ $) (CD 15,000)) (S (NP (-NONE- T)) (AUX (TO to)) (VP (VB investigate) (NP (NN dog) (NNS bites))))))
(. .) )
( END_OF_TEXT_UNIT )
Problematic region: (NP (DT the) (NNS Women)) (POS 's) (NN S.P.C.A.)))) (: ;) (NP ($ $) (CD 15,000)) (S (NP (-NONE- T)) (AUX (TO to)) (VP (VB pay) (NP (CD six) (NNS policemen))
ca09_46: He reported that the city's contributions for animal care included $67,000 to the Women's S.P.C.A. ; ; $15,000 to pay six policemen assigned as dog catchers and $15,000 to investigate dog bites.
Summary of Problems Encountered
• Typing errors
– Punctuation duplication in data
• Special notation for delimiter characters
– RRB, LRB, RSB, LSB, RCB, LCB
• Special null elements
– ( -NONE- ) * 0 T NIL
** Conventions for final output need to consider these lessons
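The three problems above can each be handled by a small cleanup pass over the leaf tokens. The Python sketch below is illustrative only: the names `BRACKET_TOKENS`, `clean_leaves`, and `drop_repeated_punctuation` are hypothetical, the mapping of RRB, LRB, RSB, LSB, RCB, LCB to round, square, and curly brackets is an assumption, and the input is assumed to be a list of (tag, word) pairs. Collapsing repeated punctuation is a heuristic and could remove punctuation that was repeated on purpose.

```python
# Assumed meaning of the delimiter codes; they may appear bare (LRB)
# or wrapped in hyphens (-LRB-), so hyphens are stripped before lookup.
BRACKET_TOKENS = {"LRB": "(", "RRB": ")", "LSB": "[", "RSB": "]",
                  "LCB": "{", "RCB": "}"}
NULL_TAG = "-NONE-"
PUNCTUATION = {";", ":", ",", "."}

def clean_leaves(pairs):
    """Drop null elements and restore bracket characters in (tag, word) pairs."""
    words = []
    for tag, word in pairs:
        if tag == NULL_TAG:
            continue  # null elements (*, 0, T, NIL) have no surface text
        words.append(BRACKET_TOKENS.get(word.strip("-"), word))
    return words

def drop_repeated_punctuation(words):
    """Collapse an immediately repeated punctuation token into one."""
    result = []
    for word in words:
        if result and word in PUNCTUATION and word == result[-1]:
            continue
        result.append(word)
    return result
```

Keeping the two passes separate makes it possible to log what each one removed, which helps when checking whether a repeated semicolon was a typing error or part of the source text.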
Future Recommendations
• Put POS tree data into a proper database
– Increases confidence in correctness of data
– Minimizes error
• Spend more effort upfront *once* to clean data
• SQL queries are more reusable than (write-only) Perl scripts
– Due to random graduate student ability
• If DB option not available
– Avoid duplication of data in final output
– Avoid text delimiters that exist as data tokens (“ ‘ , s )
– Use thoughtful labeling conventions
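The database recommendation can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: the `token` table, its columns, and the helper names `build_db`, `sentence_text`, and `tag_counts` are hypothetical, and the slides do not name a particular database system.

```python
import sqlite3

def build_db(sentences):
    """Load {sentence_id: [(tag, word), ...]} into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per token, keyed by sentence and position.
    conn.execute(
        "CREATE TABLE token ("
        " sentence_id TEXT NOT NULL,"
        " position INTEGER NOT NULL,"
        " tag TEXT NOT NULL,"
        " word TEXT NOT NULL,"
        " PRIMARY KEY (sentence_id, position))"
    )
    for sentence_id, pairs in sentences.items():
        rows = [(sentence_id, i, tag, word) for i, (tag, word) in enumerate(pairs)]
        conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def sentence_text(conn, sentence_id):
    """Rebuild a sentence from its tokens, skipping null elements."""
    rows = conn.execute(
        "SELECT word FROM token"
        " WHERE sentence_id = ? AND tag <> '-NONE-'"
        " ORDER BY position",
        (sentence_id,),
    ).fetchall()
    return " ".join(word for (word,) in rows)

def tag_counts(conn):
    """Count tokens per POS tag across the whole database."""
    return dict(conn.execute("SELECT tag, COUNT(*) FROM token GROUP BY tag"))
```

With the tokens stored once, the primary key rules out duplicated rows, and questions such as "how many tokens carry each tag" become one reusable query instead of a new script.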