Text Clustering: A Case Study
A Multilingual Text Mining Approach Based On Self-Organizing Maps

Introduction
  • Related Concepts of Text Mining (文件探勘)
    -- Data mining, Information Retrieval (IR)
    -- Machine learning, Automatically organize
    -- Text Categorization
    -- Unstructured / semi-structured data
  • Why Multilingual Text Mining?
    -- Monolingual vs. multilingual
    -- Parallel corpora
    -- Language-independent algorithm

System Architecture
[Figure: system architecture diagram. Blocks: Corpora Selection, Feature Selection, SOM, Words Cluster Map, Documents Cluster Map, Discovery Algorithm, Semantic Analysis, Translation. Stage labels: preprocessing, training, analysis.]

Self-Organizing Maps (SOM)
  • Unsupervised learning
  • Automatic cluster generation
  • High-dimensionality to two-dimensionality
  • Intuitive neighborhood relations
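To make these properties concrete, here is a minimal SOM training sketch in plain Python. It is not the implementation used in the slides: the grid size, the linear decay schedules, the Gaussian neighborhood function, and the names `train_som` and `best_matching_unit` are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weights are closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )


def train_som(samples, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map on a list of equal-length feature vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # One weight vector per neuron, keyed by its (row, col) grid position.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood radius shrinks
        for x in rng.sample(samples, len(samples)):  # shuffled pass, no labels
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Pull the winner and its grid neighbors toward the sample.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights
```

Calling `train_som(vectors, 3, 3)` on a list of feature vectors returns a dictionary from grid positions to weight vectors; `best_matching_unit` then projects any high-dimensional vector onto a two-dimensional grid cell, and vectors that are close in feature space tend to land on the same or neighboring cells.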

SOM Abstraction Illustration
[Figure: M neurons, N samples, C clusters]
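One possible reading of the relation between the M neurons, N samples, and C clusters is that each sample is assigned to its closest (winning) neuron, and every neuron that wins at least one sample forms a cluster. The sketch below follows that assumption; the function name `cluster_by_winner` and the hand-set weights are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def cluster_by_winner(weights, samples):
    """Group sample indices by the grid position of their closest neuron."""
    clusters = defaultdict(list)
    for idx, x in enumerate(samples):
        winner = min(
            weights,
            key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
        )
        clusters[winner].append(idx)
    return dict(clusters)


# Hypothetical hand-set weights for M = 4 neurons on a 2 x 2 grid.
weights = {
    (0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0],
}
samples = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0]]  # N = 3 samples
clusters = cluster_by_winner(weights, samples)
print(clusters)       # {(0, 0): [0, 1], (1, 1): [2]}
print(len(clusters))  # C = 2 non-empty neurons
```

Under this reading C can never exceed either M or N, since a cluster needs both a neuron and at least one sample mapped to it.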

Measure of similarity for clustered items

Similarity between two words / documents p and q:

$$D(p, q) = \left(1 + \frac{\lVert G(N_p) - G(N_q) \rVert}{2}\right)^{-1}$$
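As a worked example, the measure can be read as D(p, q) = (1 + ||G(N_p) - G(N_q)|| / 2)^(-1). The sketch below assumes that G(N_p) is the two-dimensional grid coordinate of the neuron N_p to which item p is mapped and that the norm is Euclidean; neither is stated on the slide, and the function name `similarity` is illustrative only.

```python
import math


def similarity(grid_p, grid_q):
    """D(p, q) = (1 + ||G(N_p) - G(N_q)|| / 2) ** -1 for two grid positions."""
    return 1.0 / (1.0 + math.dist(grid_p, grid_q) / 2.0)


print(similarity((0, 0), (0, 0)))  # 1.0: both items sit on the same neuron
print(similarity((0, 0), (0, 2)))  # 0.5: neurons two grid steps apart
```

Under these assumptions the value is 1 for items on the same neuron and decreases toward 0 as their neurons move apart on the map.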

Experimental Discussion
  • Corpora Selection: Sinorama Magazine

003e.txt:
If you could flip through the first issue, from January of 1976, you would discover that the early Sinorama Pictorial was a slim collection of photos of national development, scenic spots, and traditional customs. It had a heavily propagandistic feel, and was only for overseas distribution. Nevertheless, it rapidly began to change. Sometimes the changes were gradual, as the contents became richer and more realistic. Sometimes there were major change of format. Ultimately, it has become a unique publication which reflects current society, explores the wisdom of our ancestors, and introduces East-West cultural interchange.

003c.txt:
翻開民國六十五年元月的光華創刊號,發現最早期的「光華畫報雜誌」,的確只是重大建設、觀光勝地、風土民情的「圖片集錦」簿冊,文宜味十足,並且只對海外發行。然而很快地,它開始有了改變,有時是漸進式地日見豐實,有時則是大幅度的改頭換面,終於成為第一本能反映社會現況,探詢先人智慧寶藏、介紹東西文化交流的獨特刊物。

Word cluster map

sinorama 作 光華

bad bridg caught childlik chingju chrissi commun comprehens contribut countryw cultiv curios drove easili endlessli eventu fulfil goal greatest highest impart inexhaust inferior jiafong joi magazin mission model modesti plant potenti problem profession pursu record repres respons scholar sens serv spark specialist transmit wai wang wonder

人中 人生 不只 不亞於 不斷 之間 充當 本國 生態 目的 丟人 她們 成就 自然 似乎 我們 赤忱 使命感 其他 委員 孩子 後來 後進 既然 根基 留下 追求 做好 執著 培養 專家 帶動 啟發 深厚 現在 責任 這些 提到 傳遞 敬業樂群 楷模 態度 榮譽 撰述 潛力 稿子 學者 擔任 樹 立 橋樑 環保 總是 總編輯 謙遜 職守 灌輸

An example of resulting word clusters from the trained word cluster map.

Multilingual Text Mining from Parallel Chinese-English Corpora
  • The document cluster map for the tested English articles:
    E005_002.txt E002_001.txt E006_001.txt E003_002.txt E004_002.txt E001_001.txt E001_002.txt E009_001.txt E005_001.txt E006_002.txt E003_001.txt E004_001.txt E007_002.txt E002_002.txt E008_001.txt E008_002.txt E009_002.txt
  • The document cluster map for the tested Chinese articles:
    C008_001.txt C009_002.txt C007_001.txt C007_002.txt C008_002.txt C004_001.txt C004_002.txt C001_001.txt C001_002.txt C002-001.txt C002_002.txt C006_001.txt C003_001.txt C005_002.txt C005-001.txt C003_002.txt

Multilingual Text Mining from Hybrid Chinese-English Corpora
  • The document cluster map for the hybrid corpus that contains tested English and Chinese articles:
    E15 E42 E45 E29 C49 E49 C02 E20 C20 E33 E47 E34 E51 E55 C54 E01 E43 E48 E54 C08 E26 E40 C19 E19 C48 E32 C07 E07 C37 E05 C56 E04 E16 E56 C00 C01 C03 C04 C05 C06 C09 C10 C11 C12 C13 C14 C15 C16 C17 C18 C21 C22 C23 C24 C25 C26 C27 C28 C29 C30 C31 C32 C33 C34 C35 C36 C38 C40 C41 C42 C43 C45 C46 C47 C50 C51 C52 C53 C57 E00 E03 E06 E09 E10 E13 E17 E18 E25 E28 E30 E31 E35 E36 E38 E41 E46 E50 E52 E57 C44 E21 E44 E22 E11 E12 E14 E53 E27 C39 E39