Processing of Large Document Collections 1 Helena Ahonen-Myka University of Helsinki

Organization of the course
- Classes: 17.9., 22.10., 23.10., 26.11.
  - lectures (Helena Ahonen-Myka): 10-12, 13-15
  - exercise sessions (Lili Aunimo): 15-17
  - required presence: 75%
- Exercises are given (and returned) each week
  - required: 75%
- Exam: 4.12. at 16-20, Auditorio
- Points: exam 30 pts, exercises 30 pts

Schedule
- 17.9. Character sets, preprocessing of text, text categorization
- 22.10. Text summarization
- 23.10. Text compression
- 26.11. ... to be announced ...
- self-study: basic transformations for text data, using linguistic tools, etc.

In this part...
- character sets
- preprocessing of text
- text categorization

1. Character sets
- abstract character vs. its graphical representation
- abstract characters are grouped into alphabets
  - each alphabet forms the basis of the written form of a certain language or a set of languages

Character sets
- for instance
  - for English:
    - uppercase letters A-Z
    - lowercase letters a-z
    - punctuation marks
    - digits 0-9
    - common symbols: +, =
  - ideographic symbols of Chinese and Japanese
  - phonetic letters of Western languages

Character sets
- to represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)
- this mapping is a character set
- the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)

Character sets
- for each character in the character repertoire, the character set defines a code value in the set of code points
- in English:
  - 26 letters in both lower- and uppercase
  - ten digits + some punctuation marks
- in Russian: Cyrillic letters
- both could use the same set of code points (if not a bilingual document)
- in Japanese: could be over 6,000 characters

Character sets
- the mere existence of a character set supports operations like editing and searching of text
- usually character sets have some structure
  - e.g. integers within a small range
  - all lower-case (resp. upper-case) letters have code values that are consecutive integers (simplifies sorting etc.)
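For example (Python; ASCII/Unicode-style code values are assumed), consecutive code values make alphabetical comparison and sorting simple integer operations:

```python
# Consecutive code values for letters make range tests and sorting simple.
for ch in "abc":
    print(ch, ord(ch))        # 'a' 97, 'b' 98, 'c' 99 -> consecutive integers

print(ord("b") - ord("a"))    # 1: alphabetical distance falls out of the code values
print(sorted("banana"))       # sorting by code value gives alphabetical order here
```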

Character sets: standards
- character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs, ...)
- early standards were designed for English only, or for a small group of languages at a time

Character sets: standards
- ASCII
- ISO 8859 (e.g. ISO Latin 1)
- Unicode
- UTF-8, UTF-16

ASCII
- American Standard Code for Information Interchange
- a seven-bit code -> 128 code points
- actually only 95 printable characters
  - code points 0-31 and 127 are assigned to control characters (mostly outdated)
- the ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)

ASCII
- with 7 bits, the set of code points is too small for anything other than American English
- solution:
  - 8 bits bring more code points (256)
  - the ASCII character repertoire is mapped to the values 0-127
  - additional symbols are mapped to the other values

Extended ASCII
- problem:
  - different manufacturers each developed their own 8-bit extensions to ASCII
    - different character repertoires -> translation between them is not always possible
  - also, 256 code values are not enough to represent all the alphabets -> different variants for different languages

ISO 8859
- standardization of 8-bit character sets
- in the 80's the multipart standard ISO 8859 was produced
- defines a collection of 8-bit character sets, each designed for a group of languages
- the first part: ISO 8859-1 (ISO Latin 1)
  - covers most Western European languages
  - 0-127: identical to ASCII; 128-159: (mostly) unused; 96 code values for accented letters and symbols

Unicode
- 256 code points are not enough
  - for ideographically represented languages (Chinese, Japanese, ...)
  - for simultaneous use of several languages
- solution: more than one byte for each code value
- a 16-bit character set has 65,536 code points

Unicode
- a 16-bit character set, i.e. 65,536 code points
- not sufficient to give all the characters required for the Chinese, Japanese, and Korean scripts distinct positions
  - CJK consolidation: characters of these scripts are given the same value if they look the same

Unicode
- code values for all the characters used to write contemporary 'major' languages
  - also the classical forms of some languages
  - Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
  - Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts

Unicode
- punctuation marks
- technical and mathematical symbols
- arrows
- dingbats (pointing hands, stars, ...)
- both accented letters and separate diacritical marks (accents, tildes, ...) are included, with a mechanism for building composite characters
  - this can also create problems: two characters that look the same may have different code values
  - -> normalization may be necessary
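As an illustration of the composite-character problem (Python's standard unicodedata module; the example characters are just an assumption):

```python
import unicodedata

composed = "\u00e9"        # 'é' as a single (precomposed) code point
decomposed = "e\u0301"     # 'e' followed by a combining acute accent
print(composed == decomposed)                                # False: different code values
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalization
```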

Unicode
- code values for nearly 39,000 symbols are provided
- some part of the code space is reserved for an expansion method (see later)
- 6,400 code points are reserved for private use
  - they will never be assigned to any character by the standard, so they will not conflict with the standard

Unicode: encodings
- an encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission
- identity mapping for an 8-bit code?
  - it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters
  - e.g. Quoted-Printable (QP)
    - code values 128-255 as a sequence of 3 bytes
    - byte 1: the ASCII code for '='; bytes 2 & 3: the hexadecimal digits of the value
    - 233 -> E9 -> =E9
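A minimal sketch of that QP rule (illustrative only, not the full Quoted-Printable specification, which also escapes '=' itself and handles line breaks):

```python
def qp_encode_byte(value: int) -> str:
    """Quoted-Printable style encoding of one 8-bit code value (sketch)."""
    if value < 128:
        return chr(value)              # 7-bit ASCII values pass through unchanged
    return "={:02X}".format(value)     # e.g. 233 -> "=E9"

print(qp_encode_byte(233))   # =E9
print(qp_encode_byte(65))    # A
```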

Unicode: encodings
- UTF-8
  - ASCII code values are likely to be more common in most text than any other values
    - in UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0)
    - other (two-byte) code values are encoded using up to six bytes (the high-order bit of each byte is set to 1)
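A small illustration of this in Python:

```python
# ASCII stays one byte per character in UTF-8, other characters expand
# to multi-byte sequences whose bytes have the high-order bit set.
for text in ("A", "é", "あ"):
    encoded = text.encode("utf-8")
    print(text, [hex(b) for b in encoded], len(encoded), "byte(s)")
# "A"  -> ['0x41']                   1 byte  (high-order bit 0)
# "é"  -> ['0xc3', '0xa9']           2 bytes
# "あ" -> ['0xe3', '0x81', '0x82']   3 bytes
```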

Unicode: encodings
- UTF-16: the expansion method
  - two 16-bit values are combined into a 32-bit value -> about a million additional characters become available
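A short illustration (Python; the example character is an assumption):

```python
# A character outside the 16-bit range (the musical symbol U+1D11E) is encoded
# in UTF-16 as a pair of 16-bit "surrogate" values.
ch = "\U0001D11E"
encoded = ch.encode("utf-16-be")
units = [encoded[i:i + 2].hex() for i in range(0, len(encoded), 2)]
print(units)   # ['d834', 'dd1e'] -> two 16-bit values for one character
```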

2. Preprocessing of text
- text cannot be directly interpreted by many document processing applications
- an indexing procedure is needed
  - mapping of a text into a compact representation of its content
- which are the meaningful units of text?
- how should these units be combined?
  - usually not considered "important"

Vector model
- a document is usually represented as a vector of term weights
- the vector has as many dimensions as there are terms (or features) in the whole collection of documents
- the weight represents how much the term contributes to the semantics of the document

Vector model
- different approaches:
  - different ways to understand what a term is
  - different ways to compute term weights

Terms
- words
  - the typical choice
  - set of words, bag of words
- phrases
  - syntactical phrases
  - statistical phrases
  - usefulness not yet known?

Terms
- part of the text is not used as terms
  - very common words (function words):
    - articles, prepositions, conjunctions
  - numerals
- these words are pruned
  - stopword list
- other preprocessing is possible
  - stemming, reduction to base forms
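A minimal preprocessing sketch (the tokenization rule and the tiny stopword list are illustrative assumptions, not the course's reference implementation):

```python
import re
from collections import Counter

# Illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is"}

def preprocess(text: str) -> Counter:
    """Tokenize, lowercase, drop stopwords and numerals, count the rest."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    terms = [t for t in tokens if t not in STOPWORDS and not t.isdigit()]
    return Counter(terms)

print(preprocess("The bank of England raised the rate to 5 percent."))
# Counter({'bank': 1, 'england': 1, 'raised': 1, 'rate': 1, 'percent': 1})
```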

Weights of terms
- weights usually range between 0 and 1
- binary weights may be used
  - 1 denotes presence, 0 absence of the term in the document
- often the tfidf function is used
  - higher weight if the term occurs often in the document
  - lower weight if the term occurs in many documents
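One common form of the tfidf weight (the slides do not fix a particular variant, so this textbook version is an assumption) for term t_k in document d_j of a collection Tr:

```latex
% #(t_k, d_j) = occurrences of t_k in d_j, |Tr| = number of documents,
% #Tr(t_k)    = number of documents containing t_k.
\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j)\cdot\log\frac{|Tr|}{\#Tr(t_k)}
```

The resulting weights are often cosine-normalized afterwards so that they fall between 0 and 1.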

Structure
- either the full text of the document or selected parts of it are indexed
- e.g. in a patent categorization application:
  - title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention
- some parts may be considered more important
  - e.g. higher weight for the terms in the title

Dimensionality reduction
- many algorithms cannot handle a high-dimensional term space (= a large number of terms)
- usually dimensionality reduction is applied
- dimensionality reduction also reduces overfitting
  - a classifier that overfits the training data is good at re-classifying the training data but worse at classifying previously unseen data

Dimensionality reduction
- local dimensionality reduction
  - for each category, a reduced set of terms is chosen for classification under that category
  - hence, different subsets are used when working with different categories
- global dimensionality reduction
  - a reduced set of terms is chosen for classification under all categories

Dimensionality reduction
- dimensionality reduction by term selection
  - the terms of the reduced term set are a subset of the original term set
- dimensionality reduction by term extraction
  - the terms are not of the same type as the terms in the original term set, but are obtained by combinations and transformations of the original ones

Dimensionality reduction by term selection
- goal: select the terms that, when used for document indexing, yield the highest effectiveness in the given application
- wrapper approach
  - the reduced set of terms is found iteratively and tested with the application
- filtering approach
  - keep the terms that receive the highest score according to a function that measures the "importance" of the term for the task

Dimensionality reduction by term selection
- many functions are available
  - document frequency: keep the high-frequency terms
    - stopwords have already been removed
    - 50% of the words occur only once in the document collection
    - e.g. remove all terms occurring in at most 3 documents (see the sketch below)
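A small sketch of document-frequency-based selection (the function name and the min_df parameter are illustrative):

```python
from collections import Counter

def select_by_document_frequency(docs, min_df=4):
    """Keep terms that occur in at least `min_df` documents.

    `docs` is a list of documents, each an iterable of terms; min_df=4
    corresponds to "remove terms occurring in at most 3 documents".
    """
    df = Counter()
    for terms in docs:
        df.update(set(terms))           # count each term once per document
    return {t for t, n in df.items() if n >= min_df}
```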

Dimensionality reduction by term selection
- information-theoretic term selection functions, e.g.
  - chi-square
  - information gain
  - mutual information
  - odds ratio
  - relevancy score
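As one example, the chi-square score is commonly written as follows (this particular form comes from the general text-categorization literature, not from these slides):

```latex
% Contingency counts between term t_k and category c_i over N training documents:
% A = docs with t_k in c_i, B = docs with t_k not in c_i,
% C = docs without t_k in c_i, D = docs without t_k not in c_i, N = A+B+C+D.
\chi^2(t_k, c_i) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}
```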

Dimensionality reduction by term extraction
- term extraction attempts to generate, from the original term set, a set of "synthetic" terms that maximize effectiveness
- due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation

Dimensionality reduction by term extraction
- term clustering
  - tries to group words with a high degree of pairwise semantic relatedness
  - the groups (or their centroids) may be used as dimensions
- latent semantic indexing
  - compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence
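A compact sketch of the latent semantic indexing idea via a truncated singular value decomposition (the use of numpy and this toy matrix are illustrative assumptions):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (e.g. tfidf weights).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
])

k = 2                                        # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each document as a k-dimensional vector
print(doc_vectors.shape)                     # (4, 2): 4 documents, 2 latent dimensions
```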

3. Text categorization
- also called text classification, topic classification/spotting/detection
- problem setting:
  - assume: a predefined set of categories and a set of documents
  - label each document with one (or more) of the categories

Text categorization
- two major approaches:
  - knowledge engineering (until the end of the 80's)
    - a manually defined set of rules encoding expert knowledge on how to classify documents under the given categories
  - machine learning (from the 90's onwards)
    - an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories

Text categorization
- let
  - D: a domain of documents
  - C = {c1, ..., c|C|}: a set of predefined categories
  - T = true, F = false
- the task is to approximate the unknown target function Φ': D x C -> {T, F} by means of a function Φ: D x C -> {T, F}, such that the two functions "coincide as much as possible"
- function Φ': how documents should be classified
- function Φ: the classifier (hypothesis, model, ...)

We assume...
- categories are just symbolic labels
  - no additional knowledge of their meaning is available
- no knowledge outside of the documents is available
  - all decisions have to be made on the basis of the knowledge extracted from the documents
  - metadata, e.g. publication date, document type, source etc., is not used

-> general methods
- the methods do not depend on any application-dependent knowledge
  - in operational applications all kinds of knowledge can be used
- content-based decisions are necessarily subjective
  - it is often difficult to measure the effectiveness of the classifiers
  - even human classifiers do not always agree

Single-label vs. multi-label
- single-label text categorization
  - exactly 1 category must be assigned to each dj ∈ D
- multi-label text categorization
  - any number of categories may be assigned to the same dj ∈ D
- special case of single-label: binary
  - each dj must be assigned either to category ci or to its complement ¬ci

Single-label, multi-label
- the binary case (and, hence, the single-label case) is more general than the multi-label case
  - an algorithm for binary classification can also be used for multi-label classification (see the sketch below)
  - the converse is not true
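A sketch of that reduction (the classifier interface here is hypothetical): train one binary classifier per category and assign every category whose classifier answers true.

```python
from typing import Callable, Dict, Set

# Hypothetical interface: a trained binary classifier maps a document to True/False.
BinaryClassifier = Callable[[str], bool]

def multilabel_classify(doc: str,
                        binary_classifiers: Dict[str, BinaryClassifier]) -> Set[str]:
    """Assign every category whose binary classifier answers True."""
    return {cat for cat, clf in binary_classifiers.items() if clf(doc)}

# Usage with toy keyword-based "classifiers":
classifiers = {
    "wheat": lambda d: "wheat" in d.lower(),
    "sport": lambda d: "match" in d.lower(),
}
print(multilabel_classify("Wheat prices rose after the match.", classifiers))
# {'wheat', 'sport'}
```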

Category-pivoted vs. document-pivoted
- two different ways of using a text classifier
- given a document, we want to find all the categories under which it should be filed -> document-pivoted categorization (DPC)
- given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)

Category-pivoted vs. document-pivoted
- the distinction is important, since the sets C and D might not be available in their entirety right from the start
- DPC: suitable when documents become available at different moments in time, e.g. filtering email
- CPC: suitable when new categories are added after some documents have already been classified (and have to be reclassified)

Category-pivoted vs. document-pivoted
- some algorithms may apply to one style and not the other, but most techniques are capable of working in either mode

Hard categorization vs. ranking categorization
- hard categorization
  - the classifier answers T or F
- ranking categorization
  - given a document, the classifier might rank the categories according to their estimated appropriateness for the document
  - respectively, given a category, the classifier might rank the documents

Applications of text categorization
- automatic indexing for Boolean information retrieval systems
- document organization
- text filtering
- word sense disambiguation
- hierarchical categorization of Web pages

Automatic indexing for Boolean IR systems
- in an information retrieval system, each document is assigned one or more keywords or keyphrases describing its content
  - the keywords belong to a finite set called a controlled dictionary
- TC problem: the entries in the controlled dictionary are viewed as categories
  - between k1 and k2 keywords are assigned to each document
  - document-pivoted TC

Document organization
- indexing with a controlled vocabulary is an instance of the general problem of document base organization
- e.g. a newspaper office has to classify the incoming "classified" ads under categories such as Personals, Cars for Sale, Real Estate, etc.
- organization of patents, filing of newspaper articles...

Text filtering
- classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
- e.g. a newsfeed
  - producer: news agency; consumer: newspaper
  - the filtering system should block the delivery of documents the consumer is likely not interested in

Word sense disambiguation
- given the occurrence of an ambiguous word in a text, find the sense of this particular word occurrence
- e.g.
  - Bank of England
  - the bank of the river Thames
  - "Last week I borrowed some money from the bank."

Word sense disambiguation
- indexing by word senses rather than by words
- as a text categorization problem:
  - documents: word occurrence contexts
  - categories: word senses
- also resolving other natural language ambiguities
  - context-sensitive spelling correction, part-of-speech tagging, prepositional phrase attachment, word choice selection in machine translation

Hierarchical categorization of Web pages
- e.g. Yahoo-like hierarchical web catalogues
- typically, each category should be populated by "a few" documents
- new categories are added, obsolete ones removed
- use of the link structure in classification
- use of the hierarchical structure

Knowledge engineering approach
- in the 80's: knowledge engineering techniques
  - manually building expert systems capable of taking text categorization decisions
  - an expert system consists of a set of rules, e.g.
    - wheat & farm -> wheat
    - wheat & commodity -> wheat
    - bushels & export -> wheat
    - wheat & winter & ~soft -> wheat
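Such rules could be written directly in code; a toy sketch (the encoding is an illustrative assumption, not an actual expert system):

```python
def matches_wheat(text: str) -> bool:
    """Toy rule-based classifier for the 'wheat' category (illustrative only)."""
    words = set(text.lower().split())
    rules = [
        ({"wheat", "farm"}, set()),          # wheat & farm
        ({"wheat", "commodity"}, set()),     # wheat & commodity
        ({"bushels", "export"}, set()),      # bushels & export
        ({"wheat", "winter"}, {"soft"}),     # wheat & winter & ~soft
    ]
    return any(required <= words and not (forbidden & words)
               for required, forbidden in rules)

print(matches_wheat("winter wheat harvest was strong"))   # True
print(matches_wheat("soft winter wheat prices fell"))     # False (no rule fires)
```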

Knowledge engineering approach
- drawback: the rules must be manually defined by a knowledge engineer with the aid of a domain expert
  - any update again necessitates human intervention
  - totally domain dependent
  - -> an expensive and slow process

Machine learning approach
- a general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert
- from these characteristics the learner gleans the characteristics that a new unseen document should have in order to be classified under ci
- supervised learning (= supervised by the knowledge of the training documents)

Machine learning approach
- the learner is domain independent
  - usually available 'off-the-shelf'
- the inductive process is easily repeated if the set of categories changes
- manually classified documents are often already available
  - a manual process may exist
- if not, it is still easier to manually classify a set of documents than to build and tune a set of rules

Training set, test set, validation set
- an initial corpus of manually classified documents
  - let dj belong to the initial corpus
  - for each pair <dj, ci> it is known whether dj should be filed under ci
- positive examples and negative examples of a category

Training set, test set, validation set
- the initial corpus is divided into two sets
  - a training (and validation) set
  - a test set
- the training set is used to build the classifier
- the test set is used for testing the effectiveness of the classifiers
  - each document is fed to the classifier and the decision is compared to the manual category

Training set, test set, validation set
- the documents in the test set are not used in the construction of the classifier
- alternative: k-fold cross-validation
  - k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs, where k-1 sets form the training set and 1 set is used as the test set
  - the individual results are then averaged
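A minimal sketch of k-fold cross-validation (the train and evaluate functions are hypothetical placeholders):

```python
def cross_validate(corpus, k, train, evaluate):
    """Split `corpus` into k disjoint folds; train on k-1 folds, test on the rest.

    `train(docs)` returns a classifier and `evaluate(classifier, docs)` returns a
    score; both are placeholders for whatever learner and measure are used.
    """
    folds = [corpus[i::k] for i in range(k)]      # k disjoint subsets
    scores = []
    for i in range(k):
        test_set = folds[i]
        training_set = [d for j, fold in enumerate(folds) if j != i for d in fold]
        classifier = train(training_set)
        scores.append(evaluate(classifier, test_set))
    return sum(scores) / k                        # average the individual results
```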

Training set, test set, validation set
- the training set can be split into two parts
- one part (the validation set) is used for optimizing parameters
  - test which values of the parameters yield the best effectiveness
- the test set and the validation set must be kept separate

Inductive construction of classifiers
- a ranking classifier for a category ci
  - definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1
  - documents are ranked according to their categorization status value

Inductive construction of classifiers
- a hard classifier for a category
  - definition of a function that returns true or false, or
  - definition of a function that returns a value between 0 and 1, followed by a definition of a threshold
    - if the value is higher than the threshold -> true
    - otherwise -> false
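Turning such a 0-1 valued function into a hard classifier is just a threshold comparison; a minimal sketch (the status-value function and the threshold are illustrative):

```python
def make_hard_classifier(csv_function, threshold=0.5):
    """Wrap a categorization-status-value function (document -> value in [0, 1])
    into a hard classifier (document -> True/False)."""
    return lambda document: csv_function(document) > threshold

# Usage with a toy status-value function:
csv = lambda doc: min(1.0, doc.lower().count("wheat") / 3)
hard = make_hard_classifier(csv, threshold=0.3)
print(hard("wheat and more wheat"))   # True  (csv = 2/3 > 0.3)
print(hard("no relevant terms"))      # False (csv = 0.0)
```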

Learners
- probabilistic classifiers (Naïve Bayes)
- decision tree classifiers
- decision rule classifiers
- regression methods
- on-line methods
- neural networks
- example-based classifiers (k-NN)
- support vector machines

Rocchio method
- a linear classifier method
- for each category, an explicit profile (or prototypical document) is constructed
  - benefit: the profile is understandable even for humans

Rocchio method
- the classifier is a vector of the same dimension as the document vectors
- weights: see the formula sketched below
- classifying: cosine similarity of the category vector and the document vector
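The weight formula itself did not survive in these notes; the standard Rocchio form from the text-categorization literature (given here as an assumption, with control parameters beta and gamma) is:

```latex
% w_{kj} = weight of term t_k in document d_j; POS_i / NEG_i = positive / negative
% training examples of category c_i; beta and gamma are control parameters.
w_{ki} = \beta \cdot \frac{1}{|POS_i|}\sum_{d_j \in POS_i} w_{kj}
       \;-\; \gamma \cdot \frac{1}{|NEG_i|}\sum_{d_j \in NEG_i} w_{kj}

% Classification then compares the cosine similarity of the category vector and
% the document vector against a threshold:
\cos(\vec{c}_i, \vec{d}) = \frac{\sum_k w_{ki}\, w_{kd}}
                                {\sqrt{\sum_k w_{ki}^2}\,\sqrt{\sum_k w_{kd}^2}}
```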