COMP 323 Foundations of Chinese Computing COMP 323

  • Slides: 37
COMP 323 Foundations of Chinese Computing COMP 323 Lecture 1

Course Introduction
  • Lecturer
    • Qin LU
    • csluqin@comp.polyu.edu.hk
    • Room PQ 814, Tel. 2766 7247
  • Teaching Assistant (responsible for some labs and project assignments)
    • Chen Yirong
    • csyrchen@comp.polyu.edu.hk
    • Room QT 416, Tel. 2766 7326

Course Introduction
  • Reference Books
    • CJKV Information Processing: Chinese, Japanese, Korean and Vietnamese Computing (PL1074.5.L86)
    • An Introduction to Chinese, Japanese and Korean Computing (QA76.H7795)
    • 計算機中文信息處理 (PL1074.5.C42) and others
  • Tutorials and labs: PQ 604A
    • Tuesday Group: 9:30 – 10:30 Tuesdays
    • Thursday Group: 9:30 – 10:30 Thursdays
    • Try to finish the labs and the online assignment/QA during lab hours

Course Introduction
  • Website
    • WebCT
    • Lecture notes available Wed. by 5 pm
    • Print as NotePage
  • Method of Assessment
    • Course Work 55%
      • 2 programming assignments 20%
      • 2 online quizzes 20%
      • 1 online homework 5%
      • 4 online QA (labs) 8%
      • Class attendance (punctuality) 2%
    • Final Examination 45%

Course Introduction
  • Introduction to Chinese Computing
    • Chinese computing is the computer processing of data related to Chinese, covering any human-computer interaction in which communication takes place in the Chinese language.
    • About one-fifth of the people in the world speak some form of Chinese as their native language, making it the language with the most native speakers.

Course Introduction
  • Fundamental Problems with Chinese Computing
    • At the Chinese character level
      • Large and not closed character set
      • Computer representation, input and output
    • At the Chinese language level
      • Lack of morphological variation
      • Lack of grammar
        • Very arbitrary and flexible
        • Superimposed grammar
      • Text runs together (no spaces between words)

Course Introduction
  • Fundamental Problems with Chinese Language
    • Bilingual, trilingual and multilingual computing
      • Question: Is Hong Kong a multilingual society?
      • How can a system be designed so that it can be adapted to different languages with minimal changes?
      • How can a system be designed so that it can handle multiple languages at once?
      • Distinguishing Chinese and English characters
        • Is a given text Chinese, English, or Chinese mixed with English?

Course Introduction
  • Fundamental Problems with Chinese Language
    • Bilingual, trilingual and multilingual computing
      • Example: count the number of (Chinese and/or English) characters or words
      • Sample text: Multilingual Computing ? 多語言文字處理技術 (multilingual text processing technology)
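The counting example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a "Chinese character" is taken to be any code point in the CJK Unified Ideographs block or its Extension A, an "English word" is any run of ASCII letters, and the function name `count_mixed` is hypothetical.

```python
import re

# Assumption: "Chinese character" means a code point in CJK Unified
# Ideographs (U+4E00..U+9FFF) or Extension A (U+3400..U+4DBF).
CJK_CHAR = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")
# Assumption: an "English word" is a run of ASCII letters.
ENGLISH_WORD = re.compile(r"[A-Za-z]+")

def count_mixed(text):
    """Return (number of Chinese characters, number of English words)."""
    return len(CJK_CHAR.findall(text)), len(ENGLISH_WORD.findall(text))

sample = "Multilingual Computing ? 多語言文字處理技術"
chinese_chars, english_words = count_mixed(sample)
print(chinese_chars, english_words)
```

Counting Chinese words rather than characters is a harder problem, because the text contains no spaces between words and must first be segmented.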

Tentative Teaching Content
  • Characteristics of Chinese Language
    • Reading System (Pronunciation)
    • Writing System (Look)
  • Computer Representation of Chinese Characters
    • Character Set Standards (GB, Big5, Unicode, ...)
    • Encoding Schemes (ISO, UTF, ...)
  • Chinese Character Input
    • Chinese Input Processing by (Pen, Image, Speech and) Keystroke
      • Shape-based Keystroke Input Method
      • Phonetic-based Keystroke Input Method
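As a small preview of the computer representation topic, the sketch below shows that the same character is stored as different bytes under different standards. It assumes Python's built-in `gb2312`, `big5` and `utf-8` codecs; the helper name `encode_all` is hypothetical.

```python
# Codec names are Python's built-in ones for GB 2312, Big5 and UTF-8.
CODECS = ("gb2312", "big5", "utf-8")

def encode_all(ch):
    """Return the hex byte sequence of `ch` under each codec."""
    return {codec: ch.encode(codec).hex() for codec in CODECS}

for codec, hex_bytes in encode_all("中").items():
    print(codec, hex_bytes)
```

A program that reads Chinese text therefore has to know which encoding was used, since the bytes alone do not say.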

Tentative Teaching Content
  • Chinese Character Output
    • Bitmap and Outline Font Representation
    • Compression
    • Scaling Problem
  • Software Development for Chinese
    • Text Processing, such as Character Searching, Editing, and Deletion
    • Software Localization and Internationalization

Tentative Teaching Content
  • Chinese Language Processing
    • Word Segmentation
    • Part-of-Speech (POS) Tagging
    • Syntactic Analysis (Grammatical Analysis)
  • Chinese Information (Document) Retrieval
    • Document Retrieval Models
    • Language-Related Issues
  • Advanced Topics (possibly)
    • Information Extraction
    • Text Summarization

Lecture 1: Characteristics of Chinese

The Chinese Language
  • General Characteristics
    • The official language in China is Mandarin (普通話), but there are many spoken dialects (50+).
      • Pronunciation differs across dialects
      • The writing system is relatively unified
      • Dialect-specific characters and variant character forms exist, e.g. 叻吓吔呃咁咗咩哂哋唔唥唧啱啲喐喥喺嗰嘅嘜嘞嘢
      • Different words express the same meaning, e.g. 係 and 是 (to be)
      • Word order reversal, e.g. 找尋 and 尋找 (look for)

The Chinese Language
  • Characteristics of Chinese Characters
    • Each Chinese character is associated with three features: graphemics (the look), phonetics (the sound) and semantics (the meaning).

Chinese Writing System
  • Radicals
    • Remark: Several radicals can stand alone as single, meaningful Chinese characters.

    Radical   Standalone   Examples
    木        Yes          本未术札朽朴朳杀杂机朵权
    火        Yes          炜炬炅炖炒炝炙炘炊炆炕炉
    心        Yes          伈芯志忐吣忘忍态忠念忿忽
    石        Yes          岩矾矿宕砀码研砆砌砂泵砍

Chinese Writing System
  • Strokes (筆劃)
    • Radicals in turn are composed of smaller units, called strokes.
      • The 30+ stroke types are the most basic elements of a character.
      • The 5 basic strokes are “一” (横, a horizontal stroke), “丨” (竖, a vertical stroke), “丶” (点, a dot), “丿” (撇, a stroke curved to the left) and “乙” (折, a bent stroke).

Chinese Writing System
  • Strokes
    • Stroke Order (筆順)
      • The strokes of each Chinese character are drawn in a defined order.
      • Basic principles: left to right, top to bottom, outside to inside, horizontal before vertical, left slant before right slant, center before two sides, etc.
      • See animations at http://www.chinawestexchange.com/Chinese/characters.htm

Chinese Writing System
  • Tree Structure of Chinese Characters

Chinese Writing System
  • Character Classifications and Formation
    • Type 1: Pictographs (Picture Characters) (象形)
      • They look like the things they represent. Does the character 月 really look like a moon to you? Centuries ago, its written form was much closer to a picture of the moon.
      • Other examples are 日 (sun), 山 (mountain), 水 (water), 鸟 (bird), 火 (fire), 木 (tree), 車 (car, cart), and 口 (mouth, opening), etc.

Chinese Writing System
  • Evolution of Chinese Characters

Chinese Writing System
  • Character Classifications and Formation
    • Type 2: (Simple) Ideographs (指事 or 表意)
      • They represent abstract concepts or ideas, such as numbers and directions, e.g. 一 (one), 二 (two), 三 (three), 中 (center, middle), 上 (above), 下 (below), etc.

Chinese Writing System
  • Character Classifications and Formation
    • Type 3: Compound Ideographs (會意)
      • Pictographs and ideographs can be combined to form more complex characters, whose meaning usually reflects the combined meanings of the components.
      • Examples:
          sun 日 + moon 月 = bright 明
          person 人 + person 人 = agree/follow 从
          sun 日 + tree 木 = east 東 (sun rising above the trees in the east)
          tree 木 + tree 木 = forest 林
          forest 林 + one more tree 木 = full of trees 森
      • More interesting animations from the Internet: http://www.language.berkeley.edu/fanjian/compound_ideographs.html

Chinese Writing System
  • Character Classifications and Formation
    • Type 4: Phonetic Ideographs (形聲)
      • They usually have at least two component characters: one influences the sound and the other influences the meaning.
      • For example, in the character “跳” (jump), the left part “足” means “foot”. The meanings of characters that contain “足” are related to “foot” in some way. The right part “兆” indicates the sound: 跳 (tiào) and 兆 (zhào) share the same vowel.
      • They account for more than 90% of all Chinese characters in use today.

Chinese Writing System
Thought to be the oldest types of characters, pictographs were originally pictures of things. During the past 5,000 years or so they have become simplified and stylised. Ideographs are graphical representations of abstract ideas. Compound pictographs and ideographs combine two or more pictographs or ideographs to form new characters. All component parts contribute to the meaning of the compound character.

Chinese Writing System
Semantic-phonetic compounds represent around 90% of all existing characters and consist of two parts: a semantic component or radical, which hints at the meaning of the character, and a phonetic component, which gives a clue to the pronunciation of the character. Characters containing the same phonetic component may have the same sound and the same tone, the same sound but a different tone, the same initial or final sound, or a different sound and a different tone. Phonetic components are generally a more reliable indication of pronunciation than semantic components are of meaning.
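The relationships between a character and its phonetic component can be made concrete with a small sketch. Everything here is an illustrative assumption: the readings are hand-entered (initial, final, tone) triples for a few characters built on the phonetic component 青 (qīng), and the function name `relation` is hypothetical.

```python
# Hand-entered (initial, final, tone) readings; this tiny table is an
# illustrative assumption, not a dictionary.
PHONETIC = ("q", "ing", 1)        # 青 qīng, the shared phonetic component
READINGS = {
    "清": ("q", "ing", 1),        # qīng
    "情": ("q", "ing", 2),        # qíng
    "精": ("j", "ing", 1),        # jīng
    "猜": ("c", "ai", 1),         # cāi
}

def relation(phonetic, reading):
    """Classify how a character's reading relates to its phonetic component."""
    p_initial, p_final, p_tone = phonetic
    r_initial, r_final, r_tone = reading
    if (p_initial, p_final) == (r_initial, r_final):
        if p_tone == r_tone:
            return "same sound, same tone"
        return "same sound, different tone"
    if p_initial == r_initial or p_final == r_final:
        return "same initial or final"
    return "different sound"

for char, reading in READINGS.items():
    print(char, relation(PHONETIC, reading))
```

The last case shows why the phonetic component is only a clue: 猜 contains 青 but is pronounced quite differently.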

Chinese Writing System
  • Traditional and Simplified Characters
    • Over time, frequently used and complex Chinese characters tend to be simplified, for example by retaining only one part of the traditional character.
    • More about the pitfalls and complexities of Chinese-to-Chinese conversion: http://www.cjk.org/cjk/c2cbasis.htm
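One well-known pitfall is that a single simplified character can correspond to several traditional characters, so character-by-character conversion can pick the wrong one. The sketch below uses a tiny hand-made table; the table contents and the names `S2T_CANDIDATES`, `naive_s2t` and `ambiguous_chars` are assumptions for illustration, not a real conversion library.

```python
# Hypothetical, hand-made mapping for illustration only; a real converter
# needs a full table plus word-level context.
S2T_CANDIDATES = {
    "头": ["頭"],
    "发": ["發", "髮"],   # 發 (to emit, to develop) vs 髮 (hair)
    "后": ["後", "后"],   # 後 (after, behind) vs 后 (queen)
}

def naive_s2t(text):
    """Convert character by character, always taking the first candidate."""
    return "".join(S2T_CANDIDATES.get(ch, [ch])[0] for ch in text)

def ambiguous_chars(text):
    """List the characters that have more than one traditional candidate."""
    return [ch for ch in text if len(S2T_CANDIDATES.get(ch, [ch])) > 1]

word = "头发"  # "hair"; the correct traditional form is 頭髮
print(naive_s2t(word))
print(ambiguous_chars(word))
```

Here the naive conversion produces 頭發 instead of 頭髮, which is why practical converters look at whole words rather than single characters.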

Chinese Writing System
  • Chinese Language (Chinese Text)
    • Chinese characters are in turn combined with other Chinese characters into words to express more complex ideas and concepts.
    • Question: How many Chinese characters are there?
      The Chinese writing system is open-ended, meaning that there is no upper limit to the number of characters. The largest Chinese dictionaries include about 56,000 characters, but most of them are archaic, obscure or rare variant forms. Knowledge of about 3,000 characters enables you to read about 99% of the characters in Chinese newspapers and magazines. To read Chinese literature, technical writings or classical Chinese, though, you need to be familiar with about 6,000 characters.
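A coverage figure such as "3,000 characters cover about 99%" is a statement about character frequencies in a corpus. A minimal sketch of how such a figure could be computed is shown below; the toy corpus and the function name `coverage` are assumptions, and a meaningful estimate would need a large newspaper corpus.

```python
from collections import Counter

def coverage(text, top_n):
    """Fraction of character occurrences covered by the top_n most frequent characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total if total else 0.0

# Toy corpus (an assumption for illustration only).
corpus = "我的书 你的书 他的书"
print(coverage(corpus, 2))
```

In the toy corpus, the two most frequent characters (的 and 书) account for 6 of the 9 character occurrences.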

Chinese Reading System
  • Pronunciation
    • The phonetic information is not explicit.
      • Sometimes you can guess the pronunciation from the component characters.
      • Sometimes the pronunciation has no relation to the components at all.
      • This makes learning Chinese difficult without a phonetic transcription system.
    • Phonetic transcription: a notation for writing down pronunciations
      • Sufficiency: there are symbols to indicate all sounds in the language
      • Uniqueness: each sound is denoted by only one symbol

Chinese Reading System
  • Pronunciation
    • Pinyin: transcribing Mandarin Chinese
      • Initials (consonants, 輔音) and finals (vowel-based, 元音)
        For example, consider Beijing:
          bei: b is an initial, and ei is a final
          jing: j is an initial, and ing is a final
        In speech, Chinese words are created using just 21 beginning sounds called initials and 37 ending sounds called finals. Initials and finals combine to create the basic sounds of Chinese.
      • More about pronunciation: http://www.chineseoutpost.com/language/pronunciation/mandarinchinese-initials-and-finals-table-1.asp
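The Beijing example can be reproduced with a short sketch that splits a toneless pinyin syllable into its initial and final. The function name `split_syllable` is hypothetical, and the sketch assumes that syllables written with "y" or "w" are treated as having no initial.

```python
# The 21 Mandarin initials; two-letter initials come first so that
# "zh", "ch" and "sh" are matched before "z", "c" and "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    # No initial matched, e.g. "an" or "er" (and, by assumption, "y"/"w" syllables).
    return "", syllable

for syllable in ("bei", "jing"):
    print(syllable, split_syllable(syllable))
```

Checking two-letter initials first matters: otherwise "zhong" would be split as "z" + "hong".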

Chinese Reading System
  • Pronunciation
    • Tones of Chinese
      • Chinese is a tonal language.
      • Mandarin has 4 tones (5 counting the neutral tone) and Cantonese has 6 tones (9 counting the entering tones), which makes Cantonese much harder to learn than Mandarin.

Chinese Reading System
  • Pronunciation
    • Tones differentiate meanings.
      • Everyone seems to know this one: just by saying “ma” in different tones, you can ask, “Did mother scold the horse?” 妈骂马吗? (mā mà mǎ ma?)
      • 鞏俐 (Gong Li, with third and fourth tones) is the name of the star of “Raise the Red Lantern” and other contemporary Chinese films. However, 公里 (gong li, with first and third tones) means kilometer.
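Because tone is part of the syllable, software often stores it as a number next to the toneless spelling. The sketch below converts tone-marked pinyin to that form with Python's standard `unicodedata` module. The function name `tone_number` and the use of 5 for the neutral tone are assumptions made for illustration.

```python
import unicodedata

# Combining marks used by pinyin tone diacritics, mapped to tone numbers:
# macron = 1, acute = 2, caron = 3, grave = 4.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

def tone_number(syllable):
    """Return (toneless syllable, tone); tone 5 stands for the neutral tone."""
    tone = 5
    letters = []
    # NFD separates each letter from its combining tone mark.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # NFC recombines anything left over, such as the two dots of "ü".
    return unicodedata.normalize("NFC", "".join(letters)), tone

print([tone_number(s) for s in "mā mà mǎ ma".split()])
```

All four syllables of the "ma" sentence share the spelling "ma" and differ only in the tone number, which is exactly the information a toneless romanization loses.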