- Slides: 18
eClassifier: Tool for Taxonomies
Scott Spangler
spangles@almaden.ibm.com
IBM Almaden Research Center
San Jose, CA

Assertions on Taxonomy Generation
• Manual methods are too labor intensive, limit scope and scale, and are not maintainable
• Canned taxonomies are a niche solution
• There are many “natural” or “right” taxonomies, even on the same collection
• Clustering, canned taxonomies and other methods are good starting points, but not enough

Salient Features of eClassifier
• Clustering algorithm independent
  – bias towards speed for interaction
• Classification algorithm independent
  – evaluate multiple algorithms for given taxonomy
  – pick best algorithm for each level in taxonomy
• Multiple methods to seed taxonomy:
  – import, clustering, query based
• Multiple methods for evaluating, editing and validating taxonomies
• Given a taxonomy, analysis/discovery against structured and unstructured information

eClassifier Principles
• Apply multiple text mining algorithms to textual data sets in a practical manner.
• Provide consistently good results; the goal is not perfection.
• Utilize domain expertise by giving the user control over the mining process.
• Provide tools, metrics and reports to draw useful conclusions from the analysis.

The Mining Process
• Create a dictionary of terms (words and phrases)
• Prune dictionary (prune irrelevant terms)
• Cluster documents based on this dictionary
• Examine the resulting taxonomy, modifying based on domain expertise
• Create multiple taxonomies (divide and conquer)
• Do deeper analysis by creating keyword classifications, comparing taxonomies, inspecting dictionary co-occurrence, examining recent trends

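The first three steps of the mining process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how eClassifier builds its dictionary or which clustering algorithm it uses, so the document-frequency pruning rule, the cosine measure, the k-means-style loop, and every function name below are hypothetical stand-ins.

```python
from collections import Counter
import math

def build_dictionary(docs, stop_words=frozenset(), min_docs=2):
    """Count how many documents each term occurs in, then prune rare and stop terms."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))
    return sorted(t for t, n in doc_freq.items()
                  if n >= min_docs and t not in stop_words)

def vectorize(text, dictionary):
    """Represent a document as term counts over the pruned dictionary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in dictionary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster(vectors, k, rounds=10):
    """Small k-means-style loop, seeded with evenly spaced documents (assumption)."""
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(rounds):
        # Assign each document to the most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "printer driver install failed",
    "printer driver crash",
    "password reset email",
    "password reset link expired",
]
dictionary = build_dictionary(docs)
vectors = [vectorize(d, dictionary) for d in docs]
labels = cluster(vectors, k=2)
print(dictionary, labels)
```

The resulting labels are only a starting taxonomy; the later steps (examining and editing classes with domain expertise) are interactive and are not modeled here.
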
The Class Table
• For viewing and understanding each level in a taxonomy

Understanding Class Metrics
• Class Naming Convention
  – Shortest possible name that covers the examples
  – “,” => OR
  – “&” => AND
  – X_Y => X followed by Y
  – NONE => no useful text
  – Miscellaneous => no easy description
• Cohesion
  – A measure of similarity between documents in the same class (0 = different terms, 100 = same terms)
• Distinctness
  – A measure of how different the documents in a class are from documents in other classes (0 = very similar, 100 = very distinct)

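The slides give only the endpoints of the two scales, not the formulas. One plausible reading, offered purely as an assumption, is that cohesion is the average cosine similarity of a class's documents to the class centroid, and distinctness is one minus the similarity to the nearest other centroid, both scaled to 0-100. The function names below are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Mean term-count vector of a class."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(class_vectors):
    """Assumed definition: mean similarity of members to their centroid, 0-100."""
    c = centroid(class_vectors)
    return 100.0 * sum(cosine(v, c) for v in class_vectors) / len(class_vectors)

def distinctness(class_vectors, other_classes):
    """Assumed definition: 100 * (1 - similarity to the nearest other centroid)."""
    c = centroid(class_vectors)
    nearest = max(cosine(c, centroid(o)) for o in other_classes)
    return 100.0 * (1.0 - nearest)

printing = [[1, 0, 1, 0], [1, 0, 1, 0]]   # identical term usage
accounts = [[0, 1, 0, 1], [0, 1, 1, 1]]   # shares one term with printing
print(round(cohesion(printing)), round(distinctness(printing, [accounts])))
```

Under this reading, a class whose documents all use the same terms scores close to 100 on cohesion, and a class that shares no terms with any other class scores 100 on distinctness.
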
Dictionary Tool
• Edit->Dictionary Tool
• Use this to edit the features on which the taxonomy is based
• Delete irrelevant or ambiguous terms
• Generate and edit synonyms

Dictionary Generation Files
• StopWords
  – words excluded from the dictionary
• Synonyms
  – different forms of the same semantic term
• IncludeWords
  – words that always appear in dictionary
• Stock Phrases
  – text to be ignored in creating dictionary
• Synonyms and Stock Phrases can be automatically generated and then edited

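How the four files could interact when a dictionary is generated can be sketched as follows. The on-disk format of these files is not described in the slides, so plain Python sets, lists, and dicts stand in for them, and the order of operations (strip stock phrases, map synonyms, drop stop words, then force-include words) is an assumption.

```python
def normalize(text, stock_phrases, synonyms, stop_words):
    """Return the dictionary candidates contributed by one document."""
    text = text.lower()
    # Stock phrases are ignored entirely before terms are extracted.
    for phrase in stock_phrases:
        text = text.replace(phrase.lower(), " ")
    # Map each variant to its canonical form, then drop stop words.
    terms = [synonyms.get(tok, tok) for tok in text.split()]
    return [t for t in terms if t not in stop_words]

def dictionary_terms(docs, stock_phrases, synonyms, stop_words, include_words):
    """Union of all surviving terms plus the words that must always appear."""
    terms = set(include_words)
    for text in docs:
        terms.update(normalize(text, stock_phrases, synonyms, stop_words))
    return sorted(terms)

# Hypothetical in-memory stand-ins for the four files.
stock_phrases = ["thank you for calling support"]
synonyms = {"printers": "printer", "jammed": "jam"}
stop_words = {"the", "are", "again"}
include_words = {"toner"}

docs = [
    "Thank you for calling support the printers are jammed",
    "printer jam again",
]
terms = dictionary_terms(docs, stock_phrases, synonyms, stop_words, include_words)
print(terms)
```

Here "toner" appears in the result even though no document mentions it, which is the behavior the IncludeWords bullet describes.
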
Refinement of Classes
• Subclass Classes
  – Subdivide an existing class into multiple subclasses at the next level in the taxonomy
• Merge Classes
• Delete Classes
• Rename Class
• Undo
  – Don’t be afraid to try things
• Save
  – .obj files contain all information eClassifier uses
  – .class files contain class membership
• Read

Class View
• For understanding the concepts and contents of a given class
• View the text
  – Most typical
  – Least typical
• View the source Web page
• View distinguishing terms
• View deduced rules for classification and related documents

Keyword Searching
• Edit->Keyword Search
• Search for Dictionary terms
  – Use “and”, “or” and “_”
• Searching within a class
• Related Words
• Look at Trends
• Create new Classes
• See where the matching documents occur via Class Table

Document/Page Viewer
• Sorting Documents
  – Most typical
  – Least typical
• View distinguishing terms
  – Representative use of important words
• Moving documents
• Trend
• Reports

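Sorting by "most typical" and "least typical" can be illustrated by ranking documents on their similarity to the class centroid. The slides do not define typicality, so this measure is an assumption, and `sort_by_typicality` is a hypothetical name.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sort_by_typicality(vectors, most_typical_first=True):
    """Return document indices ordered by similarity to the class centroid."""
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    return sorted(range(len(vectors)),
                  key=lambda i: cosine(vectors[i], center),
                  reverse=most_typical_first)

# Term-count vectors for three documents in one class (made-up data).
class_vectors = [[3, 0], [2, 1], [0, 1]]
most_typical = sort_by_typicality(class_vectors)
least_typical = sort_by_typicality(class_vectors, most_typical_first=False)
print(most_typical, least_typical)
```

Documents at the "least typical" end of such a ranking are natural candidates for the Moving documents action, since they may belong in another class.
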
Keyword Class Generation
• Execute->Classify by Keywords
• Open queries (KCG files)
  – One query per line
  – .AND., .OR., (, )
• Add, Rename, Delete queries
• Prioritize
  – Move up and down
• Multiple/only one class
• Ambiguous/first matching class
• Run Queries
• Save Queries
• Run eClassifier

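A minimal sketch of how prioritized keyword queries could assign documents to classes is shown below. The operators `.AND.`, `.OR.` and parentheses come from the slide; everything else is an assumption, including the precedence (`.AND.` binding tighter than `.OR.`), whole-word matching, and the pairing of each query with a class name, since the KCG file layout beyond "one query per line" is not described.

```python
import re

TOKEN = re.compile(r"\.AND\.|\.OR\.|[()]|[^\s()]+")

def matches(query, words):
    """Evaluate a query against a set of lowercase words (recursive descent)."""
    tokens = TOKEN.findall(query)
    pos = 0

    def factor():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expr()
            pos += 1  # skip the closing ")"
            return value
        return tok.lower() in words

    def term():
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] == ".AND.":
            pos += 1
            rhs = factor()  # always parse the right side before combining
            value = value and rhs
        return value

    def expr():
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] == ".OR.":
            pos += 1
            rhs = term()
            value = value or rhs
        return value

    return expr()

def classify(text, queries, first_match_only=True):
    """queries is a priority-ordered list of (class_name, query) pairs."""
    words = set(text.lower().split())
    hits = [name for name, query in queries if matches(query, words)]
    return hits[:1] if first_match_only else hits

queries = [
    ("Printing", "printer .OR. toner"),
    ("Urgent printing", "(printer .OR. toner) .AND. urgent"),
]
print(classify("urgent printer jam", queries))
print(classify("urgent printer jam", queries, first_match_only=False))
```

The `first_match_only` flag mirrors the "first matching class" choice on the slide: with it set, the order of the query list (the Prioritize step) decides which class wins; without it, a document can land in several classes.
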
Comparing Taxonomies
• File->Compare Taxonomies
• File->Read Structured Information
• Co-occurrence counts and affinities
• Trend
• View documents
• Transpose
• Report (CSV)

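Comparing two taxonomies over the same documents amounts to a cross-tabulation of class assignments. The sketch below shows the co-occurrence counts, a transpose, and a CSV report using only the standard library; the function names and the CSV layout are assumptions, not eClassifier's actual report format.

```python
import csv
import io
from collections import Counter

def cross_tab(labels_a, labels_b):
    """Count documents for every (class in A, class in B) combination."""
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]

def transpose(rows, cols, table):
    """Swap the two taxonomies' roles in the table."""
    return cols, rows, [list(col) for col in zip(*table)]

def to_csv(rows, cols, table):
    """Render the table as CSV text with a header row of column classes."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + cols)
    for name, counts in zip(rows, table):
        writer.writerow([name] + counts)
    return buf.getvalue()

# One label per document from each taxonomy (made-up data); the second
# taxonomy here is a structured field such as a year.
topic = ["printing", "printing", "accounts", "printing"]
year = ["2001", "2002", "2001", "2002"]

rows, cols, table = cross_tab(topic, year)
report = to_csv(rows, cols, table)
print(report)
```
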
Dictionary Co-occurrence
• View->Dictionary Cooccurrence
• Type ahead searching
• Co-occurrence counts and affinities
• Trend
• View documents
• Zoom in
• Change Metric -> dependency

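Co-occurrence counts between dictionary terms can be computed by counting the documents that contain both terms. The slides do not define "affinity", so the sketch below uses one common choice as an assumption: the ratio of the observed count to the count expected if the two terms occurred independently.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Map each term pair to (documents containing both, observed/expected ratio)."""
    n = len(docs)
    term_docs = Counter()
    pair_docs = Counter()
    for text in docs:
        terms = sorted(set(text.lower().split()))
        term_docs.update(terms)
        pair_docs.update(combinations(terms, 2))
    table = {}
    for (a, b), both in pair_docs.items():
        # Expected co-occurrences if a and b were statistically independent.
        expected = term_docs[a] * term_docs[b] / n
        table[(a, b)] = (both, both / expected)
    return table

docs = ["printer jam", "printer jam", "printer toner", "password reset"]
table = cooccurrence(docs)
for pair, (count, affinity) in sorted(table.items()):
    print(pair, count, round(affinity, 2))
```

A ratio well above 1 marks a pair that appears together more often than chance would suggest, which is the kind of relationship the co-occurrence view is meant to surface.
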
Advanced Features
• Visualization
• Subclass from Structured Information
• Make Classifier
• Read Template
• Import Category
  – Add a category from another saved taxonomy
• Select Metrics
  – Add other columns to the Class table
• BIW

Visualization
• Look at relationships between selected classes
• Discover sub-clusters
• Find “borderline” examples
• View/Move Documents
• Navigator
• Touring