Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)
Corpus-Based Work
• Text corpora are usually big.
  – Their size has been a major limitation on the use of corpora.
  – The development of high-capacity computers has overcome this limitation.
• Corpus-based work involves collecting a large number of counts from corpora, and these counts need to be accessed quickly.
• Some software exists for processing corpora.
Corpora
• Linguistically marked up or not
• Representative sample of the population of interest
  – American English vs. British English
  – Written vs. spoken
  – Areas
• The performance of a system depends heavily on
  – the entropy
  – Text categorization
• Balanced corpus vs. all text available
Software
• Software
  – Text editor: shows the text literally, exactly as it is.
  – Regular expressions: let you find exact patterns.
  – Programming language
    • C/C++, Perl, awk, Python, Prolog, Java
  – Programming techniques
Looking at Text
• Text comes in a raw format or marked up.
• Markup
  – The term is used for putting codes of some sort into a computer file
  – Commercial word processing: WYSIWYG
• Features of text in human languages
  – These are what make natural language processing difficult
Low-Level Formatting Issues
• Junk formatting/content
  – Document headers and separators, typesetter codes, tables and diagrams, garbled data in the computer file.
  – OCR: if your program is meant to deal only with connected English text, such material is junk that has to be filtered out.
• Uppercase and lowercase
  – Should we keep the case or not? “The”, “the” and “THE” should all be treated the same, but “brown” in “George Brown” and in “brown dog” should be treated separately.
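One way to act on the uppercase/lowercase point is a simple heuristic, sketched here under stated assumptions: lowercase a token only when it starts a sentence or is written entirely in capitals, and leave other capitalized tokens alone on the guess that they are names. The function name `fold_case` and the pre-tokenized input are hypothetical, chosen for illustration.

```python
def fold_case(tokens):
    """Lowercase sentence-initial and all-caps tokens; keep other capitalized tokens.

    Assumes `tokens` is a list of strings in which sentence-final punctuation
    (".", "?", "!") appears as separate tokens.
    """
    folded = []
    sentence_start = True
    for tok in tokens:
        # Multi-letter all-caps tokens (e.g. "THE") and sentence-initial
        # tokens (e.g. "The") are folded; single letters like "I" are kept.
        if sentence_start or (tok.isupper() and len(tok) > 1):
            folded.append(tok.lower())
        else:
            folded.append(tok)
        sentence_start = tok in {".", "?", "!"}
    return folded


tokens = ["The", "brown", "dog", "met", "George", "Brown", ".", "THE", "END"]
print(fold_case(tokens))
```

The heuristic has an obvious weakness: a name that happens to start a sentence is lowercased too, so “Brown” at the start of a sentence would be merged with “brown”.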
Tokenization: What is a Word? (1)
• Tokenization
  – Divides the input text into units called tokens
  – What is a word?
• Graphic word (Kucera and Francis, 1967)
  “a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks”
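The graphic-word definition can be approximated with a regular expression. This is a minimal sketch, not the original authors' procedure: it finds maximal runs of ASCII letters and digits joined by internal hyphens or apostrophes, and it does not enforce the "space on either side" condition. The name `graphic_words` is hypothetical.

```python
import re

# Alphanumeric runs, optionally joined by a single hyphen or apostrophe
# (straight or curly). Restricting to ASCII letters is a simplification.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")


def graphic_words(text):
    """Return the approximate graphic words found in `text`, in order."""
    return GRAPHIC_WORD.findall(text)


print(graphic_words("It isn't a take-it-or-leave-it offer."))
print(graphic_words("e.g. $3.50"))
```

The second call shows where the definition breaks down: abbreviations and amounts such as “e.g.” and “$3.50” are cut into fragments because periods and currency signs are excluded.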
Tokenization: What is a Word? (2)
• Period
  – Marks the end of a sentence.
  – Marks an abbreviation: as in etc. or Wash.
• Single apostrophes
  – isn’t, I’ll: 2 words or 1 word?
  – English contractions: I’ll or isn’t
• Hyphenation
  – Typically marks, in print, a single word that continues onto the next line
  – text-based, co-operation, e-mail, A-1-plus paper, “take-it-or-leave-it”, the 90-cent-an-hour raise, mark up / mark-up / mark(ed) up
Tokenization: What is a Word? (3)
• Word segmentation in other languages: no whitespace ==> word segmentation is hard
• Whitespace not indicating a word break
  – New York, data base
  – the New York-New Haven railroad
• Information with a single, definite meaning appears in many different forms.
  – +45 43 48 60 60, (202) 522-2230, 33 1 34 43 32 26, (44.171) 830 1007
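For whitespace that does not mark a word break, one common workaround is to re-join whitespace tokens that match a list of known multiword units. The sketch below assumes such a list exists; `LEXICON` and `merge_multiword` are hypothetical names and the entries are only the examples from this slide.

```python
# Hypothetical lexicon of multiword units, stored as tuples of tokens.
LEXICON = {("New", "York"), ("New", "Haven"), ("data", "base")}


def merge_multiword(tokens, lexicon):
    """Greedily merge the longest token sequence found in `lexicon`."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lexicon:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword unit starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_multiword("I love New York".split(), LEXICON))
print(merge_multiword("the New York-New Haven railroad".split(), LEXICON))
```

The second call leaves the tokens unchanged, because the hyphen binds “York-New” into one whitespace token. That is exactly the difficulty the railroad example is meant to show.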
Tokenization: What is a Word? (4)

Phone number          Country
0171 378 0647         UK
+45 43 60 60          Denmark
(44.171) 830 1007     UK
95-51-279648          Pakistan
+44 (0) 1225 753678   UK
+411/284 3797         Switzerland
01256 468551          UK
(94-1) 866854         Sri Lanka
(202) 522-2330        USA
+49 69 136-2 98 05    Germany
1-925-225-3000        USA
33 1 34 43 32 26      France
212.995.5402          USA
++31-20-5200161       The Netherlands

Table 4.2 Different formats for telephone numbers appearing in an issue of The Economist
Morphology
• Stemming: strips off affixes.
  – sit, sits, sat
• Lemmatization: transforms into the base form (lemma, lexeme)
  – Disambiguation
• Not always helpful in English (from an IR point of view), which has very little morphology.
• The IR community has shown that stemming does not help performance.
• Multiple words a morpheme???
• The benefit is small compared with the additional cost of implementing morphological analysis.
Porter Stemming Algorithm
Porter Stemming Algorithm
• Error #1: Words ending in “yed” and “ying” that have different meanings may end up with the same stem.
  – Dying -> dy (passing away)
  – Dyed -> dy (colored with dye)
• Error #2: The removal of “ic” or “ical” from words having m=2 (m is the measure of the stem) and ending with a consonant, vowel series, such as generic, politic…, merges words with different meanings:
  – Political -> polit
  – Politic -> polit
  – Polite -> polit
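The measure m used in Error #2 can be computed directly. The sketch below follows Porter's published definition as I understand it: a word has the form [C](VC)^m[V], where C and V are runs of consonants and vowels, and “y” counts as a vowel when it follows a consonant. It computes only the measure, not the full stemmer, and the function names are hypothetical.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant (as in "by").
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC sequences in [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        # Collapse runs so that "trouble" becomes C V C V.
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("VC")


for word in ["tree", "trouble", "troubles", "polit", "gener"]:
    print(word, measure(word))
```

Both “polit” and “gener” have m=2, which is why the rules that strip “ic” and “ical” apply to them.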
Sentences
• What is a sentence?
  – Something ending with a ‘.’, ‘?’ or ‘!’. True in 90% of the cases.
  – Text delimited by a colon, semicolon, or dash can also be regarded as a sentence.
• Sometimes, however, sentences are split up by other punctuation marks or quotes.
• Often, solutions involve heuristic methods. However, these solutions are hand-coded. Some efforts to automate sentence-boundary detection have also been made.
• Korean is even harder!!!
  – Sometimes there is no period; is the boundary after a sentence-final ending?
  – Endings that are both connective and sentence-final
  – Quotation marks
End-of-Sentence Detection (I)
• Place EOS after all . ? ! (maybe ; : -)
• Move EOS after quotation marks, if any
• Disqualify a period boundary if:
  – Preceded by a known abbreviation that is followed by an upper-case letter and is not normally sentence-final: e.g., Prof. vs. Mr.
End-of-Sentence Detection (II)
• Disqualify a period boundary if (continued):
  – Preceded by a known abbreviation and not followed by upper case: e.g., Jr., etc. (abbreviations that are sentence-final or medial)
• Disqualify a sentence boundary with ? or ! if followed by a lower-case letter (or a known name)
• Keep all the rest as EOS
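The heuristic on these two slides can be sketched over whitespace-separated tokens. This is a minimal illustration under stated assumptions: the two abbreviation sets are hypothetical stand-ins for real lists, the optional ; : - candidates and the "known name" check are left out, and abbreviations of the Prof. type are treated as never ending a sentence.

```python
# Hypothetical abbreviation lists, for illustration only.
TITLE_ABBREVS = {"Prof.", "Mr.", "Dr.", "vs."}   # not normally sentence-final
OTHER_ABBREVS = {"Jr.", "etc.", "e.g.", "i.e."}  # sentence-final or medial

CLOSING_QUOTES = "\"'”’"


def split_sentences(text):
    """Split `text` into sentences using the hand-coded EOS rules."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Look past closing quotes, so EOS lands after the quotation mark.
        core = tok.rstrip(CLOSING_QUOTES)
        if not core or core[-1] not in ".?!":
            continue  # not a candidate boundary
        if core[-1] == ".":
            if core in TITLE_ABBREVS:
                continue  # e.g. "Prof." before a capitalized name
            if core in OTHER_ABBREVS and not nxt[:1].isupper():
                continue  # e.g. "Jr." followed by lower case
        elif nxt[:1].islower():
            continue      # "?" or "!" followed by lower case
        # Keep all the rest as EOS.
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = 'Prof. Smith met Mr. Jones Jr. in Boston. "Is it late?" he asked. They left.'
for sentence in split_sentences(text):
    print(sentence)
```

Because the lists are hand-coded, any abbreviation missing from them produces a spurious boundary, which is the limitation that motivates automating sentence-boundary detection.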
Marked-Up Data I: Mark-up Schemes
• Early markup schemes
  – Simply inserted content information into a header (giving author, date, title, etc.)
• SGML
  – A grammar language that standardizes the structure and grammar of documents
• XML
  – A reduced version of SGML, created to apply SGML to the web
Examples of Tagset (Korean)
Examples of Tagset (English)
• Penn Treebank tagset
• Brown corpus tagset