eClassifier: Tool for Taxonomies
Scott Spangler
spangles@almaden.ibm.com
IBM Almaden Research Center
San Jose, CA

Assertions on Taxonomy Generation
• Manual methods are too labor intensive, limit scope and scale, and are not maintainable
• Canned taxonomies are a niche solution
• There are many “natural” or “right” taxonomies, even on the same collection
• Clustering, canned taxonomies and other methods are good starting points, but not enough

Salient Features of eClassifier
• Clustering algorithm independent
  – bias towards speed for interaction
• Classification algorithm independent
  – evaluate multiple algorithms for given taxonomy
  – pick best algorithm for each level in taxonomy
• Multiple methods to seed taxonomy:
  – import, clustering, query based
• Multiple methods for evaluating, editing and validating taxonomies
• Given a taxonomy, analysis/discovery against structured and unstructured information

eClassifier Principles
• Apply multiple text mining algorithms to textual data sets in a practical manner.
• Provide consistently good results; the goal is not perfection.
• Utilize domain expertise by giving the user control over the mining process.
• Provide tools, metrics and reports to draw useful conclusions from the analysis.

The Mining Process
• Create a dictionary of terms (words and phrases)
• Prune dictionary (prune irrelevant terms)
• Cluster documents based on this dictionary
• Examine the resulting taxonomy, modifying based on domain expertise
• Create multiple taxonomies (divide and conquer)
• Do deeper analysis by creating keyword classifications, comparing taxonomies, inspecting dictionary co-occurrence, examining recent trends
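
The first three steps of the mining process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which clustering algorithm eClassifier uses (the tool is described as clustering algorithm independent), so a greedy single-pass clustering stands in for it here, and the stop-word list, the `min_docs` threshold and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list used for pruning (not taken from the tool).
STOP_WORDS = {"the", "a", "is", "to", "my", "and", "on"}

def tokenize(text):
    """Lowercase a document and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(docs, min_docs=2):
    """Create the dictionary, pruning stop words and terms found in too few documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return sorted(t for t, n in doc_freq.items()
                  if n >= min_docs and t not in STOP_WORDS)

def to_vector(doc, dictionary):
    """Represent a document as term counts over the dictionary."""
    counts = Counter(tokenize(doc))
    return [counts[t] for t in dictionary]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed vector, member indices)
    for i, v in enumerate(vectors):
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]

docs = [
    "The printer is not printing",
    "Printer driver fails to install",
    "Password reset for my account",
    "Cannot reset the account password",
]
dictionary = build_dictionary(docs)
clusters = cluster([to_vector(d, dictionary) for d in docs])
print(dictionary)
print(clusters)
```

The resulting clusters are only a starting taxonomy; the remaining steps on the slide (examining, editing, and building further taxonomies) are interactive and are not modeled here.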

The Class Table
For viewing and understanding each level in a taxonomy

Understanding Class Metrics
• Class Naming Convention
  – Shortest possible name that covers the examples
  – “,” => OR
  – “&” => AND
  – X_Y => X followed by Y
  – NONE => no useful text
  – Miscellaneous => No easy description
• Cohesion
  – A measure of similarity between documents in the same class (0 = different terms, 100 = same terms)
• Distinctness
  – A measure of dissimilarity between documents in different classes (0 = very similar, 100 = very unique)
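
The slides give the scales for cohesion and distinctness but not the formulas. The sketch below shows one plausible way to compute metrics with those scales, and should not be read as eClassifier's actual definition. It assumes cohesion is the average cosine similarity of each document to its class centroid, and distinctness is one minus the similarity to the nearest other class centroid, both scaled to 0..100. All names and the sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of term vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cohesion(vectors):
    """Assumed definition: mean similarity of documents to their class centroid, 0..100."""
    c = centroid(vectors)
    return 100.0 * sum(cosine(v, c) for v in vectors) / len(vectors)

def distinctness(name, classes):
    """Assumed definition: 100 * (1 - similarity to the nearest other centroid).

    Requires at least two classes.
    """
    own = centroid(classes[name])
    nearest = max(cosine(own, centroid(vectors))
                  for other, vectors in classes.items() if other != name)
    return 100.0 * (1.0 - nearest)

# Term-count vectors over a three-term dictionary, grouped by class.
classes = {
    "printer": [[2, 0, 0], [1, 0, 0]],   # documents share the same single term
    "password": [[0, 1, 1], [0, 2, 2]],
    "mixed": [[1, 1, 0], [1, 0, 1]],     # documents only partly overlap
}
for name, vectors in classes.items():
    print(name, round(cohesion(vectors), 1), round(distinctness(name, classes), 1))
```

Under these assumptions a class whose documents use identical terms scores 100 on cohesion, and a class that shares no terms with any other class scores 100 on distinctness.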

Dictionary Tool
• Edit -> Dictionary Tool
• Use this to edit the features on which the taxonomy is based
• Delete irrelevant or ambiguous terms
• Generate and edit synonyms

Dictionary Generation Files
• StopWords
  – words excluded from the dictionary
• Synonyms
  – different forms of the same semantic term
• IncludeWords
  – words that always appear in dictionary
• Stock Phrases
  – text to be ignored in creating dictionary
• Synonyms and Stock Phrases can be automatically generated and then edited
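
To make the roles of the four files concrete, here is a minimal sketch of how such lists could be applied while generating a dictionary. The file formats are not described in the slides, so the lists are written as in-memory Python values; their contents, the `min_docs` threshold and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical contents standing in for the four dictionary generation files.
STOCK_PHRASES = ["thank you for calling"]           # text ignored entirely
STOP_WORDS = {"the", "a", "my", "is", "for"}        # words excluded from the dictionary
SYNONYMS = {"printers": "printer", "printing": "printer", "pwd": "password"}
INCLUDE_WORDS = {"vpn"}                             # words always kept in the dictionary

def extract_terms(text):
    """Strip stock phrases, map synonyms to one form, then drop stop words."""
    text = text.lower()
    for phrase in STOCK_PHRASES:
        text = text.replace(phrase, " ")
    words = [SYNONYMS.get(w, w) for w in re.findall(r"[a-z]+", text)]
    return [w for w in words if w not in STOP_WORDS]

def build_dictionary(docs, min_docs=2):
    """Keep terms found in at least `min_docs` documents, plus every include word."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(extract_terms(doc)))
    frequent = {t for t, n in doc_freq.items() if n >= min_docs}
    return sorted(frequent | INCLUDE_WORDS)

docs = [
    "Thank you for calling. My printer is jammed",
    "Printers keep printing blank pages",
    "VPN pwd expired",
    "Reset my password",
]
print(extract_terms(docs[0]))
print(build_dictionary(docs))
```

In this sketch the include word survives even though it occurs in only one document, while the synonym mapping lets "printers", "printing" and "printer" count as a single term.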

Refinement of Classes
• Subclass Classes
  – Subdivide an existing class into multiple subclasses at the next level in the taxonomy
• Merge Classes
• Delete Classes
• Rename Class
• Undo
  – Don’t be afraid to try things
• Save
  – .obj files contain all information eClassifier uses
  – .class files contain class membership
• Read

Class View
• For understanding the concepts and contents of a given class
• View the text
  – Most typical
  – Least typical
• View the source Web page
• View distinguishing terms
• View deduced rules for classification and related documents
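
The slides do not say how distinguishing terms are chosen. One simple way to approximate the idea, shown below as an assumption rather than the tool's method, is to rank terms by how much more often they appear in the class than in the collection as a whole. Documents are modeled as sets of dictionary terms, and the function name is hypothetical.

```python
def distinguishing_terms(class_docs, all_docs, top_n=3):
    """Rank terms by (share of class documents with the term) minus (share of all documents)."""
    def share(term, docs):
        return sum(term in doc for doc in docs) / len(docs)

    vocabulary = set().union(*class_docs)
    scored = [(share(t, class_docs) - share(t, all_docs), t) for t in vocabulary]
    # Highest score first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]

# Each document is a set of dictionary terms.
all_docs = [
    {"printer", "driver", "install"},
    {"printer", "paper", "jam"},
    {"password", "reset", "account"},
    {"password", "account", "install"},
]
printer_class = all_docs[:2]
print(distinguishing_terms(printer_class, all_docs))
```

A term such as "install", which is as common outside the class as inside it, scores zero and drops to the bottom of the ranking.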

Keyword Searching
• Edit->Keyword Search
• Search for Dictionary terms
  – Use “and”, “or” and “_”
• Searching within a class
• Related Words
• Look at Trends
• Create new Classes
• See where the matching documents occur via Class Table

Document/Page Viewer
• Sorting Documents
  – Most typical
  – Least typical
• View distinguishing terms
  – Representative use of important words
• Moving documents
• Trend
• Reports
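
As an illustration of sorting by typicality, the sketch below orders the documents of one class by their cosine similarity to the class centroid. How eClassifier actually scores typicality is not stated in the slides, so this measure, the function names and the sample vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of term vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sort_by_typicality(docs):
    """docs maps document id -> term vector. Returns ids, most typical first."""
    c = centroid(list(docs.values()))
    return sorted(docs, key=lambda doc_id: cosine(docs[doc_id], c), reverse=True)

# Term-count vectors for the documents of a single class.
docs = {"d1": [2, 2, 0], "d2": [3, 1, 0], "d3": [0, 1, 3]}
ranking = sort_by_typicality(docs)
print("most typical:", ranking[0])
print("least typical:", ranking[-1])
```

Reading the least typical documents first is a quick way to spot candidates for moving to another class.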

Keyword Class Generation
• Execute->Classify by Keywords
• Open queries (KCG files)
  – One query per line
  – .AND., .OR., (, )
• Add, Rename, Delete queries
• Prioritize
  – Move up and down
• Multiple/only one class
• Ambiguous/first matching class
• Run Queries
• Save Queries
• Run eClassifier
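
The query-based classification described above can be sketched as a small evaluator for `.AND.`, `.OR.` and parentheses, applied to queries in priority order. Several details are assumptions, since the slides do not specify them: tokens are separated by whitespace, `.AND.` binds more tightly than `.OR.`, each query is paired with a class name, and queries are well formed. The KCG file format itself is not reproduced.

```python
import re

def tokenize_query(query):
    """Split a query into .AND., .OR., parentheses and plain terms."""
    return re.findall(r"\.AND\.|\.OR\.|\(|\)|[^\s()]+", query)

def matches(query, terms):
    """Evaluate a well-formed query against a set of lowercase document terms."""
    tokens = tokenize_query(query)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == ".OR.":
            pos += 1
            rhs = parse_and()      # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == ".AND.":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            pos += 1               # skip the closing ")"
            return value
        return token.lower() in terms

    return parse_or()

def classify(terms, queries, first_match_only=True):
    """queries is an ordered list of (class name, query); earlier entries have priority."""
    hits = [name for name, query in queries if matches(query, terms)]
    return hits[:1] if first_match_only else hits

queries = [
    ("Printing", "printer .AND. (jam .OR. driver)"),
    ("Accounts", "password .OR. account"),
]
doc_terms = {"printer", "driver", "password"}
print(classify(doc_terms, queries))                          # first matching class only
print(classify(doc_terms, queries, first_match_only=False))  # every matching class
```

The `first_match_only` flag mirrors the choice on the slide between assigning a document to the first matching class and allowing multiple classes; reordering the `queries` list corresponds to moving queries up and down.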

Comparing Taxonomies
• File->Compare Taxonomies
• File->Read Structured Information
• Co-occurrence counts and affinities
• Trend
• View documents
• Transpose
• Report (CSV)
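
Comparing two taxonomies over the same documents amounts to a cross-tabulation of class memberships. The sketch below builds such a table of co-occurrence counts and writes it as CSV. The data layout (a mapping from document id to class name), the function names and the sample classes are assumptions, and the layout of eClassifier's own CSV report is not known from the slides.

```python
import csv
import io
from collections import Counter

def compare_taxonomies(tax_a, tax_b):
    """Count documents for every (class in A, class in B) pair."""
    return Counter((tax_a[d], tax_b[d]) for d in tax_a if d in tax_b)

def to_csv(table):
    """Render the table with classes of A as rows and classes of B as columns."""
    rows = sorted({a for a, _ in table})
    cols = sorted({b for _, b in table})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + cols)
    for row in rows:
        writer.writerow([row] + [table[(row, col)] for col in cols])
    return buffer.getvalue()

# Two taxonomies over the same documents: document id -> class name.
by_topic = {"d1": "Printing", "d2": "Printing", "d3": "Accounts"}
by_product = {"d1": "Hardware", "d2": "Hardware", "d3": "Software"}

table = compare_taxonomies(by_topic, by_product)
print(to_csv(table))
```

Swapping the two arguments of `compare_taxonomies` gives the transposed table.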

Dictionary Co-occurrence
• View->Dictionary Cooccurrence
• Type ahead searching
• Co-occurrence counts and affinities
• Trend
• View documents
• Zoom in
• Change Metric -> dependency
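
For dictionary terms, co-occurrence counts are simply the number of documents in which two terms appear together. The slides do not define "affinity", so the sketch below uses an assumed stand-in: the observed count divided by the count expected if the two terms were independent. Values above 1 then suggest the terms attract each other. Function names and sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Return per-term document counts and per-pair co-occurrence counts."""
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs:
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))  # pairs stored in sorted order
    return term_counts, pair_counts

def affinity(a, b, term_counts, pair_counts, n_docs):
    """Assumed metric: observed co-occurrences / co-occurrences expected under independence."""
    expected = term_counts[a] * term_counts[b] / n_docs
    return pair_counts[tuple(sorted((a, b)))] / expected if expected else 0.0

# Each document is a set of dictionary terms.
docs = [
    {"printer", "driver"},
    {"printer", "driver", "install"},
    {"password", "reset"},
    {"password", "install"},
]
term_counts, pair_counts = cooccurrence(docs)
for a, b in [("printer", "driver"), ("printer", "install"), ("printer", "password")]:
    print(a, b, pair_counts[tuple(sorted((a, b)))],
          affinity(a, b, term_counts, pair_counts, len(docs)))
```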

Advanced Features
• Visualization
• Subclass from Structured Information
• Make Classifier
• Read Template
• Import Category
  – Add a category from another saved taxonomy
• Select Metrics
  – Add other columns to the Class table
• BIW

Visualization
• Look at relationships between selected classes
• Discover sub-clusters
• Find “borderline” examples
• View/Move Documents
• Navigator
• Touring