KKAP: KAIST Korean Analysis Platform
Morphological Analyzer, POS Tagger, Parser
Sangwon Park
January 12, 2011

Research Goal
• The goal of the research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis.
• KKAP will be flexible and easy to use so that it can be widely adopted in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, and other components.

Contents
• 1. Introduction to Korean Morphological Analysis
• 2. HanNanum Korean Morphological Analyzer & POS Tagger
• 3. Extension to KKAP (KAIST Korean Analysis Platform)

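Difficulties of Korean morphological analysis
• Features of Korean morphological analysis
  – Ambiguity of part-of-speech
  – Ambiguity of morpheme segmentation
• Example: 가시는
  – 가시/noun + 는/josa (thorn, prickle)
  – 가시/verb + 는/eomi (leave, disappear)
  – 가/verb + 시/eomi + 는/eomi (go)
  – 갈/verb + 시/eomi + 는/eomi (grind, sharpen)
• Example sentences
  – 그 선인장의 가시는 참 따가웠다. (The thorns of that cactus were really prickly.)
  – 물을 마셨더니 갈증이 가시는 기분이다. (After drinking water, I feel my thirst going away.)
  – 할머니께서는 집에 가시는 길이었다. (Grandmother was on her way home.)
  – 아저씨의 칼을 가시는 모습은 인상적이다. (The sight of the man sharpening his knife is impressive.)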
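The four readings of 가시는 can be reproduced with a very small enumerator. The following is only a sketch under stated assumptions: the `LEXICON` and `CONNECTABLE` tables are hypothetical toy data, not HanNanum's dictionaries or rules, and the entry mapping the surface 가 to the stem 갈 merely stands in for phoneme restoration of the dropped ㄹ.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]  # (restored form, tag)

# Hypothetical toy dictionary: surface string -> possible (form, tag) pairs.
# The surface "가" also maps to the stem "갈" to mimic phoneme restoration.
LEXICON: Dict[str, List[Morpheme]] = {
    "가시": [("가시", "noun"), ("가시", "verb")],
    "가": [("가", "verb"), ("갈", "verb")],
    "시": [("시", "eomi")],
    "는": [("는", "josa"), ("는", "eomi")],
}

# Hypothetical connection rules: which tag may directly follow which.
CONNECTABLE = {("noun", "josa"), ("verb", "eomi"), ("eomi", "eomi")}


def analyze(word: str) -> List[List[Morpheme]]:
    """Return every segmentation of `word` that passes the connection check."""
    results: List[List[Morpheme]] = []

    def extend(pos: int, path: List[Morpheme]) -> None:
        if pos == len(word):
            results.append(list(path))
            return
        # Try every dictionary entry that starts at `pos`.
        for end in range(pos + 1, len(word) + 1):
            for form, tag in LEXICON.get(word[pos:end], []):
                if path and (path[-1][1], tag) not in CONNECTABLE:
                    continue  # rejected by the connection check
                path.append((form, tag))
                extend(end, path)
                path.pop()

    extend(0, [])
    return results


for analysis in analyze("가시는"):
    print(" + ".join(f"{form}/{tag}" for form, tag in analysis))
```

With this toy data the enumerator yields four candidates, matching the four analyses listed on the slide. Choosing among them requires sentence context, which is the job of the POS tagger.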

HanNanum Korean Morphological Analyzer
• HanNanum has been developed since the 1990s.
• Written in the C programming language
• Module-based architecture
• Based on the KAIST morphologically analyzed corpus
• HMM-based and Maximum Entropy-based POS taggers

HanNanum Architecture
[Figure: HanNanum architecture diagram, from INPUT to OUTPUT]
• Main components: Sentence Divisor, Morphological Analyzer, Morpheme Chart (lattice form), Tagger, Tag Mapper
• Analyzer functions: Segment Position, Inverse Segment Position, Code Conversion, Dictionary Search, Phoneme Restoration, Connection Check
• Dictionaries: System Dictionary (Trie), User Dictionary (Trie), Number Dictionary, Frequency Dictionary
• Tables and statistics: Tag Set Table, Connection Info. Table, Bigram Info., Computation
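The architecture names trie-based system and user dictionaries. As an illustration of why a trie suits dictionary search, the sketch below finds, in one left-to-right pass, every entry that starts at a given position of the input. The class and method names are hypothetical, and the sketch indexes whole syllables for brevity; it does not reflect how HanNanum encodes its keys.

```python
from typing import Dict, Iterator, List, Tuple


class Trie:
    """Minimal dictionary trie keyed on characters (hypothetical sketch)."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.tags: List[str] = []  # tags of the entry ending at this node

    def insert(self, word: str, tag: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.tags.append(tag)

    def prefixes(self, text: str, start: int = 0) -> Iterator[Tuple[int, List[str]]]:
        """Yield (end, tags) for every entry that begins at text[start]."""
        node = self
        for i in range(start, len(text)):
            child = node.children.get(text[i])
            if child is None:
                return  # no longer entry can match
            node = child
            if node.tags:
                yield i + 1, list(node.tags)


system_dictionary = Trie()
system_dictionary.insert("가", "verb")
system_dictionary.insert("가시", "noun")
system_dictionary.insert("가시", "verb")

# Every dictionary entry starting at position 0 of the input
print(list(system_dictionary.prefixes("가시는")))
```

Each hit gives the end position of a candidate morpheme, which is the information needed to fill a lattice-form morpheme chart.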

HMM-based POS Tagger
• Shin Jung-ho, Han Young-seok, Park Young-chan, Choi Key-Sun, “An HMM Part-of-Speech Tagger for Korean Based on Wordphrase”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 389-394, 1994.
• Transition probability between word-phrase tags
• Transition probability between morpheme tags within a word phrase
• Probability of occurrence of a morpheme with its POS
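To make the role of these probabilities concrete, here is a simplified Viterbi search over a lattice of candidate analyses. It is a sketch under assumptions, not the model from the cited paper: the word-phrase tags (`NP_GEN`, `NP_TOP`, `VP_MOD`) and all probabilities are invented for illustration, and the morpheme-level transition and occurrence probabilities are folded into a single `local` log-score per candidate.

```python
import math
from typing import Dict, List, Tuple

# (analysis, word-phrase tag, local log-probability of the analysis)
Candidate = Tuple[str, str, float]


def viterbi(lattice: List[List[Candidate]],
            trans: Dict[Tuple[str, str], float],
            floor: float = math.log(1e-6)) -> List[str]:
    """Choose one candidate per word phrase, maximising the total log-probability.

    Assumes a non-empty lattice with at least one candidate per word phrase.
    Unseen tag transitions fall back to the `floor` log-probability.
    """
    # best[i][j] = (score of the best path ending in candidate j, backpointer)
    best = [[(local + trans.get(("<s>", tag), floor), -1)
             for _, tag, local in lattice[0]]]
    for i in range(1, len(lattice)):
        row = []
        for _, tag, local in lattice[i]:
            score, back = max(
                (best[i - 1][k][0] + trans.get((prev[1], tag), floor), k)
                for k, prev in enumerate(lattice[i - 1])
            )
            row.append((score + local, back))
        best.append(row)

    # Follow the backpointers from the best final candidate.
    j = max(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][j][0])
        j = best[i][j][1]
    return path[::-1]


# Invented numbers: locally the verb reading is more likely, but a noun
# phrase is far more likely right after a genitive noun phrase.
lattice = [
    [("선인장/noun+의/josa", "NP_GEN", math.log(1.0))],
    [("가시/noun+는/josa", "NP_TOP", math.log(0.4)),
     ("가/verb+시/eomi+는/eomi", "VP_MOD", math.log(0.6))],
]
trans = {
    ("<s>", "NP_GEN"): math.log(0.5),
    ("NP_GEN", "NP_TOP"): math.log(0.8),
    ("NP_GEN", "VP_MOD"): math.log(0.2),
}
print(viterbi(lattice, trans))
```

In this toy setup the noun reading of 가시는 wins (0.8 × 0.4 against 0.2 × 0.6), even though the verb reading has the higher local score. This is the kind of contextual disambiguation the transition probabilities are meant to provide.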

Analysis Example - HMM-based Tagger
Find the most suitable result among the candidates
- POS-tagged Dictionary
- Check Connection rule
- Phoneme Restoration

Plug-In Component-based System
– Each function of Korean morphological analysis is implemented as a plug-in.
– Users can set up a workflow from existing plug-ins to suit their own goals.
Plug-In Pool: Chart-base Morph Analyzer, HMM POS Tagger, Corpus-base Morph Analyzer, Unknown Noun Proc., Tag Mapper (Tag Mapping), Noun Extractor (Noun Extracting), CRF POS Tagger, Auto Spacing, Input Filter, Sentence Splitter, Transliteration, …
Workflow slots: Phase 1 Supplement Plugin, Phase 2 Morphological Analyzer, Phase 2 Supplement Plugin, Phase 3 POS Tagger, Phase 3 Supplement Plugin
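The slot layout above can be sketched as a small pipeline in which every plug-in is a callable and the workflow fixes only the order of the slots. This is a hypothetical illustration, not the jhannanum API: the `Workflow` class and the three stand-in plug-ins are invented, and the stand-ins do trivial string handling rather than real analysis.

```python
from typing import Callable, List, Sequence, Tuple

Plugin = Callable[[list], list]


def sentence_splitter(texts: List[str]) -> List[str]:
    """Phase 1 supplement stand-in: split on full stops."""
    return [s.strip() for t in texts for s in t.split(".") if s.strip()]


def toy_morph_analyzer(sentences: List[str]) -> List[List[str]]:
    """Phase 2 major stand-in: whitespace split instead of real analysis."""
    return [s.split() for s in sentences]


def toy_pos_tagger(sentences: List[List[str]]) -> List[List[Tuple[str, str]]]:
    """Phase 3 major stand-in: tag every unit as 'unk'."""
    return [[(unit, "unk") for unit in units] for units in sentences]


class Workflow:
    """One major plug-in for phases 2 and 3, plus optional supplements."""

    def __init__(self, morph_analyzer: Plugin, pos_tagger: Plugin,
                 phase1: Sequence[Plugin] = (),
                 phase2: Sequence[Plugin] = (),
                 phase3: Sequence[Plugin] = ()) -> None:
        # Slot order: phase 1 supplements, phase 2 major, phase 2
        # supplements, phase 3 major, phase 3 supplements.
        self.stages: List[Plugin] = [
            *phase1, morph_analyzer, *phase2, pos_tagger, *phase3
        ]

    def run(self, data: list) -> list:
        for plugin in self.stages:
            data = plugin(data)
        return data


workflow = Workflow(toy_morph_analyzer, toy_pos_tagger,
                    phase1=[sentence_splitter])
print(workflow.run(["서울 코엑스. 3층"]))
```

Swapping a plug-in, for example replacing the tagger stand-in with a different callable, changes the behaviour without touching the rest of the workflow, which is the flexibility the plug-in design aims for.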

Flexible Workflow
- Analysis of Announcement on Web
  Input: $$$$$장소$$$$$ 서울코엑스 3층 ("$$$$$Venue$$$$$ Seoul COEX, 3rd floor")
  Plain Text Processor: Informal Input Filter, Auto Spacing, Sentence Splitter
  Morphological Analyzer: Chart-based Morphological Analyzer
  Morpheme Processor: Unknown Processor
  POS Tagger: HMM-based POS Tagger
  Output:
    $$$$$  $/su+$/su+$/su …
    장소  장소/ncn
    서울  서울/nq
    코엑스  코엑스/ncn
    3층  3/nnc+층/nbu
- Indexing of News Articles
  Input: 지난 9월 거제도에서 열린 축제 … ("The festival held on Geoje Island last September …")
  Plain Text Processor: Sentence Splitter
  Morphological Analyzer: Chart-based Morphological Analyzer
  Morpheme Processor: Noun Extractor
  Output: 9월/n 거제도/ncn 축제/ncn

HanNanum Korean Morphological Analyzer
Workflow for Morphological Analysis
• Phase 1. Text Preprocessing: Supplement Plugin
• Phase 2. Morphological Analysis: Major Plugin, Supplement Plugin
• Phase 3. POS Tagging: Major Plugin, Supplement Plugin
Plugin Pool (Phase 1, Phase 2, and Phase 3 plugins): Sentence Segmentation, Unknown Term Processing, Auto Spacing, Input Filter, Noun Extraction, CRF-based POS Tagging, HMM-based POS Tagging, Tag Mapper, Chart-base Morph Analyzer
Input (Korean Document Analysis):
7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …
(English: The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, to be announced on the evening of the 7th. The AP reported that, among Nobel Prize watchers in Sweden, the Korean poet Ko Un and the Syrian poet Adonis were mentioned most often as the candidates most likely to win this year's prize. …)
Output:
7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef ./sf
통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca ….
Extract the Part Of Speech Information from Korean Text
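The tagged output uses a regular `form/tag+form/tag` notation, so downstream steps such as noun extraction can be sketched with simple string handling. The helper names below are hypothetical, and the filter rests on an assumption drawn from the sample output: that tags beginning with `nc` or `nq` mark common and proper nouns. The parser also assumes that no morpheme itself contains `+` and that every part contains a `/`.

```python
from typing import List, Sequence, Tuple

Morpheme = Tuple[str, str]  # (form, tag)


def parse_eojeol(tagged: str) -> List[Morpheme]:
    """Split 'form/tag+form/tag' into (form, tag) pairs.

    Assumes no morpheme contains '+' and every part contains '/'.
    """
    pairs = []
    for part in tagged.split("+"):
        form, tag = part.rsplit("/", 1)
        pairs.append((form, tag))
    return pairs


def extract_nouns(line: str,
                  prefixes: Sequence[str] = ("nc", "nq")) -> List[str]:
    """Collect forms whose tag starts with one of `prefixes` (assumed noun tags)."""
    nouns = []
    for eojeol in line.split():
        for form, tag in parse_eojeol(eojeol):
            if tag.startswith(tuple(prefixes)):
                nouns.append(form)
    return nouns


sample = "7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm"
print(extract_nouns(sample))  # ['저녁', '발표예정', '노벨문학상']
```

Passing a different `prefixes` value, for example `("n",)`, would also keep numerals and bound nouns such as 7/nnc and 일/nbu, so the choice of prefixes determines what counts as an index term.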

Open Source Project
• http://kldp.net/projects/hannanum/
• 2011.01.10: jhannanum 0.8.2 was released

GUI Demo
[Figure: GUI screenshot with labeled areas: Workflow, Information of a plug-in, Pool, Workflow control, Input & Output]

KKAP: KAIST Korean Analysis Platform
Workflow for Korean Analysis
• Phase 1. Text Preprocessing: Supplement Plugin
• Phase 2. Morphological Analysis: Major Plugin, Supplement Plugin
• Phase 3. POS Tagging
• Phase 4. Parsing: Major Plugin, Supplement Plugin
Plugin Pool (Phase 1 to Phase 4 plugins): Sentence Segmentation, Auto Spacing, Noun Extraction, HMM-based POS Tagging, Unknown Term Processing, Input Filter, Tag Mapper, Noun Phrase Extractor, Chart Parser, Chart-base Morph Analyzer, Verb Phrase Extractor
Input (Korean Document Analysis):
7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …
(English: The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, to be announced on the evening of the 7th. The AP reported that, among Nobel Prize watchers in Sweden, the Korean poet Ko Un and the Syrian poet Adonis were mentioned most often as the candidates most likely to win this year's prize. …)
Output (Analyzed Korean Document):
7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef ./sf
통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca ….

Korean Syntactic Tree Tagged Corpus
• Registered at BoRA (Bank of Resource for Language and Annotation)
  – http://bora.or.kr
  – Corpus 5. Manual sentence analysis corpus
  – 31,091 sentences from 97 different sources
  – Length: 1 to 33 Eojeols (average 11.35 Eojeols)
• Related documents
  – Kong Joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR-97-112, 1997 (in Korean)
  – Byung Gyu Chang, Kong Joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421-429, 1997 (in Korean)

Questions & Comments