Natural Language Processing And Computational Linguistics Using TF-IDF Anomalies to Cluster Documents on Subject Matter
An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies
Whitney St. Charles
Research Alliance in Math and Science, 2007
Mentors: Yu (Cathy) Jiao, Ph.D., and Robert Patton, Ph.D.
Computational Sciences and Engineering Division

Purposes of document clustering
· Data overabundance
  - YouTube generates 200 terabytes of data per day
· How do we sift through those kinds of quantities?
  - Searching
    · Reduces the set tremendously
  - Document Clustering
    · Is a knowledge discovery technique
    · Categorizes results into meaningful groups
    · Allows the user to browse quickly to the target
OAK RIDGE NATIONAL LABORATORY, U.S. DEPARTMENT OF ENERGY

Document clustering users
· Financial analysts
  - Identify certain trends to develop forecasts about a particular company
· Business Intelligence
  - Identify products that are associated with or dependent upon one another
· Military
  - Identify terrorist cells from blog activity and movement of materials
· You!
  - Narrow down hundreds of thousands of internet search results to find the kinds of sites you want

Current document clustering technique
• A word-by-word comparison of each document is made to determine similarity
• Unfortunately, this method…
  • Does not handle context very well
    • Example: Sniff Dog, Sniffer Dog
  • Compares several hundred to several thousand words for each document
  • Is very computationally expensive
  • Requires expensive SIMD machines

Contributions to the field
• Identify only those words which are more indicative of the subject matter
  – If “airline” occurs 20% more than is “normal,” it has something to do with the subject
• Examine both simple and complex noun phrases to address the context of the document
• Generate much smaller vectors, containing an average of 82% fewer terms!
• Cluster more accurately because only “important” words are chosen

Our method
1. Train the system to recognize “normal” term occurrences
2. Calculate frequencies for the test data
3. Compare the frequencies to detect anomalies
4. Analyze the “distance” of each document to each other document
5. Cluster those documents which are most similar to each other
6. Analyze result using F-Measure

Establishing the baseline
• Train the program to recognize what is “normal” for a given term
  – Need an entire English language corpus
• Corpus: a large, structured set of texts compiled to be representative of a language
  • Uses hundreds of thousands of words in every allowable way
• Using a corpus, the program can
  • Establish usage statistics
  • Learn linguistic rules
• Example: The Brown Corpus
  http://www.edict.com.hk/concordance/WWWConcappE.htm

Extracting words and phrases
1. Parts-of-speech tagger
2. Token extractor
3. Convert to vectors

Part-of-speech tagging
· Tags every word in the sentence with the correct part of speech
· Achieves an accuracy of 97.24%
  - Is necessary because token extraction methods are each dependent upon correct tagging
· Passes the tagged sentence to the token extractor

Token extractor
· Extracts
  - Words
  - Simple noun phrases
  - Complex noun phrases

Word extraction
· Uses POS-tagged data to identify only adjectives, verbs, and nouns
· Uses the Porter stemmer to identify unique words
  - Cuts common suffixes such as -ing, -tion, -es, -s
· Example: “recreation” and “recreational” are both identified as “recreat”
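
A minimal sketch of the word-extraction step above, assuming the input is already POS-tagged with Penn Treebank style tags (the slides do not name the tag set). `crude_stem` is a hypothetical, deliberately tiny suffix stripper standing in for the Porter stemmer; it is not the Porter algorithm and only happens to agree with it on the slide's example.

```python
# Tag prefixes for adjectives, verbs, and nouns (Penn Treebank style: an assumption).
CONTENT_TAG_PREFIXES = ("JJ", "VB", "NN")

# Crude suffix list: a stand-in for the Porter stemmer, NOT the real algorithm.
SUFFIXES = ("ional", "ion", "ing", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_words(tagged_tokens):
    """Return stems of the tokens tagged as adjective, verb, or noun."""
    return [
        crude_stem(word)
        for word, tag in tagged_tokens
        if tag.startswith(CONTENT_TAG_PREFIXES)
    ]


tagged = [("The", "DT"), ("recreational", "JJ"), ("hikers", "NNS"),
          ("enjoy", "VBP"), ("walking", "VBG"), ("quickly", "RB")]
print(extract_words(tagged))
```

In practice a full Porter stemmer implementation would replace `crude_stem`; the filtering logic stays the same.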

Why nouns?
· Are named entities
· Answer the question “What”
· Are less ambiguous than verbs
  - Example: “cook up a good meal” or “cook up a new solution”

Simple noun phrase extraction
· Accepts only consecutive nouns
  - Example: summer intern, union representative
· Provides a set of short, highly descriptive phrases
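
The "consecutive nouns" rule can be sketched as a single pass over tagged tokens. This assumes Penn Treebank style noun tags (prefix `NN`) and, as a further assumption, requires at least two nouns in a run so that single nouns are left to the word extractor.

```python
def simple_noun_phrases(tagged_tokens):
    """Return every run of two or more consecutive noun-tagged tokens."""
    phrases, run = [], []
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            run.append(word)
            continue
        # A non-noun ends the current run; keep it only if it is a phrase.
        if len(run) >= 2:
            phrases.append(" ".join(run))
        run = []
    if len(run) >= 2:  # flush a run that ends the sentence
        phrases.append(" ".join(run))
    return phrases


sentence = [("The", "DT"), ("summer", "NN"), ("intern", "NN"), ("met", "VBD"),
            ("the", "DT"), ("union", "NN"), ("representative", "NN")]
print(simple_noun_phrases(sentence))
```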

Complex noun phrase extraction techniques
· Static Rule-based / Finite State Automata
  - Rely on the aptitude of the linguist formulating the rule set
· Machine Learning
  - Rely on the “completeness” of the training set

Static rule-based extraction
· Establishes a list of linguistic rules
  - A determiner preceding a noun marks the beginning of a noun phrase
  - A determiner may not precede a noun phrase
[Figure: finite-state automaton with states S0, S1, and NP; edge labels: noun/pronoun/determiner, determiner/adjective, adjective, noun/pronoun, and relative clause/prepositional phrase/noun]
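
A rough sketch of a rule-based extractor as a finite-state automaton over POS tags. The state names S0, S1, and NP follow the figure, but the transition table is an assumption reconstructed from the legible edge labels: the relative-clause and prepositional-phrase loops are omitted, and Penn Treebank style tags are assumed.

```python
def tag_class(tag):
    """Map a POS tag to the coarse classes used on the automaton's edges."""
    if tag == "DT":
        return "det"
    if tag.startswith("JJ"):
        return "adj"
    if tag.startswith("NN"):
        return "noun"
    if tag == "PRP":
        return "pron"
    return None


# (state, tag class) -> next state; NP is the only accepting state.
TRANSITIONS = {
    ("S0", "det"): "S1", ("S0", "adj"): "S1",
    ("S0", "noun"): "NP", ("S0", "pron"): "NP",
    ("S1", "adj"): "S1",
    ("S1", "noun"): "NP", ("S1", "pron"): "NP",
    ("NP", "noun"): "NP",
}


def extract_noun_phrases(tagged_tokens):
    """Run the automaton left to right, emitting a phrase whenever NP is left."""
    phrases, state, current = [], "S0", []
    for word, tag in tagged_tokens:
        nxt = TRANSITIONS.get((state, tag_class(tag)))
        if nxt is None:
            if state == "NP":
                phrases.append(" ".join(current))
            # Restart and let this token try to open a new phrase.
            state, current = "S0", []
            nxt = TRANSITIONS.get(("S0", tag_class(tag)))
            if nxt is None:
                continue
        state = nxt
        current.append(word)
    if state == "NP":
        phrases.append(" ".join(current))
    return phrases


tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("man", "NN"),
          ("with", "IN"), ("the", "DT"), ("telescope", "NN")]
print(extract_noun_phrases(tagged))
```

Every rule lives in the `TRANSITIONS` table, which is why this approach depends so heavily on the rule author anticipating each construction.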

Static extraction shortcomings
· Unanticipated rules
  - The subjective nature of language
· Difficulty finding non-recursive, base NPs
  - [The man [whose red hat [I borrowed yesterday]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP.
  - [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
· Structural ambiguity

Structural ambiguity example
“I saw the man with the telescope.”
[Figure: two parse trees for the sentence, differing in where the prepositional phrase attaches]
· PP attached to the verb phrase: [S [NP I] [VP [V saw] [NP [DET the] [N man]] [PP [PRP with] [DET the] [N telescope]]]]
· PP attached to the noun phrase: [S [NP I] [VP [V saw] [NP [DET the] [N man] [PP [PRP with] [DET the] [N telescope]]]]]

Machine learning extraction
· Is all about
  - Uses a corpus
· Is based on statistics
  - The more it sees a particular occurrence, the more likely it is to prefer it
· Makes better educated guesses about structural ambiguity
· Discovers thousands of unanticipated rules

Transformation-based complex noun phrase extraction
An ‘error-driven’ approach for learning an ordered set of rules
1. Generate all rules that correct at least one error.
2. For each rule:
   (a) Apply to a copy of the most recent state of the training set.
   (b) Score result.
3. Select rule with best score.
4. Update training set by applying selected rule.
5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
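
The five steps above can be sketched as a greedy loop. This is a toy under stated assumptions, not the system described in the slides: tokens are represented only by POS tags, labels are "I" (inside a noun phrase) or "O" (outside), a rule has the hypothetical form (previous POS or None, POS, old label, new label), and a rule's score is its net reduction in errors.

```python
def previous_pos(pos_tags, i):
    """POS tag of the token before position i ('<s>' at the start)."""
    return pos_tags[i - 1] if i > 0 else "<s>"


def apply_rule(rule, pos_tags, labels):
    """Return a relabelled copy: old -> new wherever the rule's context matches."""
    prev_pos, pos, old, new = rule
    out = list(labels)
    for i, (p, label) in enumerate(zip(pos_tags, labels)):
        if p == pos and label == old and prev_pos in (None, previous_pos(pos_tags, i)):
            out[i] = new
    return out


def error_count(labels, gold):
    return sum(a != b for a, b in zip(labels, gold))


def candidate_rules(pos_tags, labels, gold):
    """Step 1: every rule (from two templates) that corrects at least one error."""
    rules = set()
    for i, (p, label, target) in enumerate(zip(pos_tags, labels, gold)):
        if label != target:
            rules.add((None, p, label, target))
            rules.add((previous_pos(pos_tags, i), p, label, target))
    return rules


def learn_rules(pos_tags, labels, gold, threshold=1):
    """Steps 2-5: score each rule on a copy, keep the best, repeat until score < T."""
    learned, labels = [], list(labels)
    while True:
        best_rule, best_score = None, 0
        before = error_count(labels, gold)
        for rule in sorted(candidate_rules(pos_tags, labels, gold), key=repr):
            score = before - error_count(apply_rule(rule, pos_tags, labels), gold)
            if score > best_score:
                best_rule, best_score = rule, score
        if best_rule is None or best_score < threshold:
            return learned, labels
        labels = apply_rule(best_rule, pos_tags, labels)
        learned.append(best_rule)


pos = ["DT", "NN", "VBD", "DT", "JJ", "NN", "IN", "NN"]
gold = ["I", "I", "O", "I", "I", "I", "O", "I"]
rules, final = learn_rules(pos, ["O"] * len(pos), gold)
print(rules)
```

The order of the learned rules matters: at tagging time they are replayed in the same sequence over the baseline labels.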

Determining anomaly sets
· TF-IDF: Term Frequency – Inverse Document Frequency
  - Number of local occurrences of term multiplied by uniqueness measure of term in document set
· TF-ICF: Term Frequency – Inverse Corpus Frequency
  - Average number of corpus occurrences of term multiplied by uniqueness measure of term in the corpus
[Chart: tf-idf / tf-icf value (y-axis, 0 to 30) per word (x-axis) for the stems air, airlin, airwai, aerospac, aircraft, and aviat; series: Corpus Data, Document 1, Document 2, Document 3, Document 5, Document 6]
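
A small sketch of the two weightings, assuming documents are lists of stemmed tokens. The slides do not give an exact formula, so the smoothed "uniqueness measure" log((N + 1) / (df + 1)) used here is an assumption. The same function covers both cases: passing the document set as `reference` gives a TF-IDF style weight, while passing a fixed background corpus gives a TF-ICF style weight.

```python
import math
from collections import Counter


def document_frequency(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for doc in collection:
        df.update(set(doc))
    return df


def weights(doc, reference):
    """Weight each term of `doc` as tf * log((N + 1) / (df + 1)).

    N and df come from `reference`: the document set itself (TF-IDF style)
    or a fixed background corpus (TF-ICF style).
    """
    df = document_frequency(reference)
    n = len(reference)
    tf = Counter(doc)
    return {term: count * math.log((n + 1) / (df[term] + 1))
            for term, count in tf.items()}


docs = [["airlin", "airlin", "safeti"], ["ocean", "safeti"], ["volcano", "ash"]]
tfidf = weights(docs[0], docs)
print(tfidf)
```

Because the corpus statistics in the TF-ICF case never change, they can be computed once and reused for every new document.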

Each document has its own anomaly vector
· Document 1: Aeronaut, Engin, Aviat, Standard, Alarm, Fly, Safeti, Certif
· Document 47: Ocean, Problem, Coastal, Committee, Enforc, Protect, Pollution, Environ
· Document 73: Ash, Ground, Erupt, Mountain, Caribbean, Smolder, Immedi, volcano
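
One way to picture how such a vector could be selected: keep the terms whose score in the document exceeds the baseline score by some margin. The function below is a hypothetical sketch; the 20% default margin is borrowed from the "airline" example earlier in the talk, and treating terms missing from the baseline as anomalous is an assumption.

```python
def anomaly_vector(doc_scores, baseline_scores, margin=0.20):
    """Return the sorted terms scoring more than (1 + margin) times the baseline.

    Terms absent from the baseline have a baseline score of 0, so any
    positive document score marks them as anomalous (an assumption).
    """
    return sorted(
        term
        for term, score in doc_scores.items()
        if score > (1 + margin) * baseline_scores.get(term, 0.0)
    )


baseline = {"aviat": 1.0, "problem": 3.0, "standard": 2.0}
document = {"aviat": 6.0, "problem": 2.0, "standard": 2.0}
print(anomaly_vector(document, baseline))
```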

Clustering the data
· Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
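
A compact sketch of UPGMA (average-linkage agglomerative clustering). The distance function and the stopping rule are assumptions: cosine distance over sparse term-weight dictionaries is used here, and merging stops once the closest pair of clusters is farther apart than a hypothetical `max_distance`. Distances are recomputed from scratch on each pass for clarity, not speed.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def upgma(vectors, max_distance):
    """Repeatedly merge the two clusters with the smallest average distance."""
    clusters = [[i] for i in range(len(vectors))]

    def average_distance(c1, c2):
        # UPGMA: mean of all pairwise distances between members of c1 and c2.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), average_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > max_distance:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


vectors = [
    {"aviat": 1.0, "fly": 1.0},
    {"aviat": 1.0, "safeti": 1.0},
    {"ocean": 1.0, "coastal": 1.0},
    {"ocean": 1.0, "pollut": 1.0},
]
print(upgma(vectors, max_distance=0.6))
```

Shorter vectors mean fewer terms in each `cosine_distance` call, so the cost of every pairwise comparison drops along with the vector size.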

Performance Metrics Used
· Precision = (number of correct responses) / (number of responses)
· Recall = (number of correct responses) / (number correct in key)
· F-measure = 2RP / (R + P)
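
The three metrics translate directly into code. This sketch assumes responses and the answer key are given as sets of item identifiers, which is an illustrative choice; zero denominators are mapped to 0.0.

```python
def precision_recall_f(responses, key):
    """Return (precision, recall, F-measure) for a response set against a key."""
    responses, key = set(responses), set(key)
    correct = len(responses & key)
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f_measure = 2 * recall * precision / total if total else 0.0
    return precision, recall, f_measure


p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r, f)
```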

RESULTS
· Cluster results using Vector Space Model: 80%
· Cluster results using modified Vector Space Model with anomaly sets: 89%
· With 82% fewer comparisons!

Future Work
· Determine clustering results for both simple and complex noun phrases
· Could be applied to other clustering techniques, such as swarming

Acknowledgements
· The Research Alliance in Math and Science program
· Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy
· Dr. Cathy Jiao
· Dr. Robert Patton
· Dr. Thomas Potok

QUESTIONS?