A Proposed Tag Set for Exchanging Word-Segmented Text Corpora

Jing-Shin Chang
shin@bdc.com.tw
http://www.bdc.com.tw/~shin/
Behavior Design Corporation

Technical Issues in Designing a Tag Set for a Stratified WS Standard

  • Why a stratified WS (word segmentation) standard?
      - "Words" are generated by various mechanisms.
      - There is no common agreement on all WS criteria, because researchers and institutions use different processing models of those mechanisms.
      - Stratification helps the exchange of corpora and their conversion to the appropriate private word units.
  • What tags and attributes, and why?

Text Generation Mechanisms Behind Word Stratification

  • Lexicon Selection
      - Basic Lexicon ("Standard Dictionary")
      - Derivational Processes (non-enumerable)
          · simple variants (color/colour; 呆子/獃子 "fool"; 兇手/凶手 "murderer")
          · regular expressions (numbers, word patterns)
          · regular derivational processes (proper nouns, abbreviations, compounding, …)
  • Text Planning
      - Writing Variants (symbols, punctuation)
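The two simplest derivational mechanisms above, spelling variants and regular-expression patterns, can be sketched in a few lines of Python. This is only an illustration: the variant table, the choice of which spelling counts as canonical, the number pattern, and the function names are all assumptions, not part of the proposal.

```python
import re

# Hypothetical variant table; which spelling is "canonical" is an arbitrary
# choice made for this sketch, not something the slides specify.
VARIANTS = {"colour": "color", "獃子": "呆子", "兇手": "凶手"}

# Illustrative number pattern: Arabic digits (optionally with a decimal part)
# or a run of Chinese numeral characters.
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?|[零一二三四五六七八九十百千萬億]+")


def normalize(token: str) -> str:
    """Map a variant spelling to its canonical form; leave other tokens alone."""
    return VARIANTS.get(token, token)


def classify(token: str) -> str:
    """Label a token by the mechanism that could have produced it."""
    if token in VARIANTS:
        return "variant"
    if NUMBER.fullmatch(token):
        return "number"
    return "other"
```

A table handles the enumerable variants, while the pattern covers an open-ended class such as numbers, which is why such words cannot simply be listed in a basic lexicon.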

What Tags/Attributes, and Why?

  • Tags for carrying linguistic information
      - word boundary
      - level of stratification (in terms of a standard)
      - misc. (e.g., symbols and punctuation in text)
  • Tags for carrying conformance information
      - standard/substandard of conformance
          · to convert to and from private systems easily
          · to allow user extension of the (sub)standard(s) and to cope with changes over time

Tags for carrying linguistic information

  • Tags:
      - <w0> (~信級詞, roughly "faithful-level words"): words in the standard dictionary
      - <…> (~達級詞, roughly "expressive-level words"): morphologically derived words
      - <…> (~雅級詞, roughly "elegant-level words"): words derived through compounding regularity
  • Attributes:
      - POS (part of speech)
      - tt (token type, derived word type)
      - hwds (embedded head words)
      - rel (relationship among embedded head words)
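As a rough illustration of how a word element with these attributes might be produced, the sketch below uses Python's standard `xml.etree.ElementTree`. The tag name `w0` and the attribute names come from the slide; the helper `make_word`, the validation step, and the POS value `"N"` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Attribute names listed on the slide; everything else here is illustrative.
KNOWN_ATTRS = {"POS", "tt", "hwds", "rel"}


def make_word(tag: str, text: str, **attrs: str) -> ET.Element:
    """Build one word element, rejecting attributes the slide does not list."""
    unknown = set(attrs) - KNOWN_ATTRS
    if unknown:
        raise ValueError(f"unknown attribute(s): {sorted(unknown)}")
    elem = ET.Element(tag, attrs)
    elem.text = text
    return elem


# "N" is a hypothetical part-of-speech label, not taken from the proposal.
word = make_word("w0", "中文", POS="N")
print(ET.tostring(word, encoding="unicode"))
```

Passing the tag name as a parameter keeps the sketch usable for the other stratification levels without guessing their tag names.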

Tags for carrying conformance information

  • Tags: <wstxt>, <ws0p>, <ws1p>, <…>, <…> (un-segmented para.)
      - paragraphs of various stratification levels conforming to the specified standard/substandards
  • Attributes: WS, Dict, MR, NUM, NAM, CMPR, DR, GR, specifying:
      - the "standard resources" conformed to
      - user extensions
          · e.g., Dict="CNS-WS-Dict-1998-1, X-BDC-WS-Dict-1998-2"
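A consumer of such a corpus would need to separate official resources from user extensions in an attribute value like the `Dict` example. The sketch below assumes, based only on that one example, that names are comma-separated and that a leading `X-` marks a user extension; the function name is hypothetical.

```python
def split_resources(value: str) -> tuple[list[str], list[str]]:
    """Split an attribute value into (official names, user-extension names).

    Assumption for this sketch: names are comma-separated and user
    extensions start with "X-", as in the Dict example on the slide.
    """
    names = [name.strip() for name in value.split(",") if name.strip()]
    official = [name for name in names if not name.startswith("X-")]
    extensions = [name for name in names if name.startswith("X-")]
    return official, extensions


official, extensions = split_resources("CNS-WS-Dict-1998-1, X-BDC-WS-Dict-1998-2")
```

With the value above, `official` holds the CNS dictionary name and `extensions` holds the BDC one, so a receiving system can decide whether it knows every resource the text depends on.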

Attributes on Standardized Resources: Why?

  • The official standard (and thus the tags) should be defined in terms of explicitly specified and unambiguously testable resources
      - with an (optional) mechanism for user extension
      - e.g., charset registry, RFCs (Internet standards)
  • Every resource is assigned a symbolic name (referenced in an attribute) for conformance testing
      - for conversion to, and evaluation in, private systems
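One way to picture conformance testing against named resources is a lookup from symbolic names to word lists. The sketch below is hypothetical throughout: the registry contents are made-up samples, and `nonconforming` is an illustrative helper rather than anything defined by the proposal.

```python
# Hypothetical registry: symbolic resource name -> set of qualified words.
# The entries are made-up samples, not the contents of any real dictionary.
REGISTRY = {
    "CNS-WS-Dict-1998-1": {"中文", "分詞", "標準"},
    "X-BDC-WS-Dict-1998-2": {"必須"},
}


def nonconforming(words: list[str], dict_names: list[str]) -> list[str]:
    """Return the words not found in any of the named dictionaries.

    Raises KeyError if a symbolic name is not registered.
    """
    allowed = set().union(*(REGISTRY[name] for name in dict_names))
    return [word for word in words if word not in allowed]


sample = ["中文", "分詞", "必須"]
missing_official = nonconforming(sample, ["CNS-WS-Dict-1998-1"])
missing_extended = nonconforming(sample, ["CNS-WS-Dict-1998-1", "X-BDC-WS-Dict-1998-2"])
```

Because each resource has a name, the same text can be checked against the official dictionary alone or against the official dictionary plus a user extension, and the result is unambiguous either way.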

Attributes on Standardized Resources (Cont.)

  • WS: the WS standard, i.e., the collection of a set of substandards (such as Dict, MR, ...)
      - e.g., CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1
  • Dict: standard dictionary (basic lexicon)
      - qualified basic words
      - POS: optional (referred to by other substandards)
  • MR: morphological rules/standard
      - qualified affixes/prefixes/suffixes
      - qualified combination patterns
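The shorthand CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1 suggests that a WS standard name expands mechanically into one name per substandard. The sketch below assumes exactly that naming pattern (substandard inserted after `-WS-`); the function name and the choice of substandards passed in are illustrative.

```python
def expand_ws(ws_name: str, substandards: list[str]) -> dict[str, str]:
    """Expand a WS standard name into per-substandard resource names.

    Assumes the pattern <org>-WS-<version> -> <org>-WS-<sub>-<version>,
    inferred from the single example on the slide.
    """
    org, sep, version = ws_name.partition("-WS-")
    if not sep or not org or not version:
        raise ValueError(f"not a WS standard name: {ws_name!r}")
    return {sub: f"{org}-WS-{sub}-{version}" for sub in substandards}


resources = expand_ws("CNS-WS-1998-1", ["Dict", "MR"])
```

Here `resources["Dict"]` is `"CNS-WS-Dict-1998-1"`, the same name that appears in the `Dict` attribute example, so a single `WS` attribute could stand in for the full list of substandard attributes.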

Attributes on Standardized Resources (Cont.) [& Arguable]

  • NUM: numbering rules/patterns
  • NAM: naming rules/patterns
      - qualified family names
      - length constraints, abbreviations, standard translations of foreign names, ...

Attributes on Standardized Resources (Cont.) [& Arguable]

  • CMPR: compound formation rules/patterns
  • DR: other derivational rules not in the above substandards
  • GR: private rules/patterns/description

Example: Simplest Encoding

  <!-- The whole segmented text is enclosed by …
  中文 分詞 標準 必須 一 步 小心 地 制定.
  中文 分詞 標準 必須 一步一步 小心地 制定.
  中文分詞標準 必須 一步一步 小心地 制定.

  (Gloss: "The Chinese word segmentation standard must be formulated carefully, step by step.")
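The last two segmentations in this example differ only in that the coarser one merges adjacent units of the finer one. That relationship can be checked mechanically, as in the sketch below. It assumes units are separated by whitespace, and the function names are illustrative.

```python
def boundaries(segmented: str) -> set[int]:
    """Return the character offsets at which a unit ends."""
    offsets: set[int] = set()
    position = 0
    for unit in segmented.split():
        position += len(unit)
        offsets.add(position)
    return offsets


def is_coarsening(fine: str, coarse: str) -> bool:
    """True if `coarse` has the same text as `fine` and only merges its units."""
    same_text = "".join(fine.split()) == "".join(coarse.split())
    return same_text and boundaries(coarse) <= boundaries(fine)


fine = "中文 分詞 標準 必須 一步一步 小心地 制定."
coarse = "中文分詞標準 必須 一步一步 小心地 制定."
merges_only = is_coarsening(fine, coarse)
```

When this property holds between levels, converting a finely segmented corpus to a coarser private word unit only requires removing boundaries, never inserting new ones.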

Example: Using Word Tags

  <!-- use word tags to identify …
  中文分詞標準必須 中文 分詞 標準 必須
  小心. 制定 制定.
  中文分詞標準 中文 分詞 標準
  必須 必須
  小心 小心
  制定. 制定.

Example: Derived Words and Token Type (TT) Attribute

  <ws1p …
  孩子 ("child")
  一萬 ("ten thousand")

Example: Application of (Hwrds, Rel) Attributes for Punctuation

  <!-- …
  高中()
  中山
  中山() 意義相同 ("same meaning") ...

Future Issues

  • Specification of the official WS standard
      - standard resources and substandards to be defined in the first official version
  • Construction of the basic lexicon
      - basic vs. derivational words
      - standardization of the derivational parts
  • Registration of user extensions and evolution of the official standard