Guest Lecture to Singapore-MIT Alliance
Artificial Intelligence Technologies for Web Intelligence
Ah-Hwee Tan
Laboratories for Information Technology, Singapore
Oct 11, 2002

Outline
• What is Web Intelligence (WI)?
• How to do WI?
• Technologies and Tools (disclaimer: snapshots only)
• What’s next?

Web Intelligence and … spying on the web

Web Intelligence
• Scanning, tracking, and analyzing information on the world wide web for the purpose of competitive intelligence
• Intelligence as in Central Intelligence Agency

The other definition of Web Intelligence
• Web Intelligence Consortium (WIC) (http://wi-consortium.org/)
• Artificial Intelligence (AI), Information Technology (IT), + Web
• Intelligence as in Artificial Intelligence

Competitive Intelligence (CI) (Fuld & Company, 2000, 2001)
• Highlights the importance of gathering, analyzing, and distributing competitive information to gain competitive advantages
• Too risky to do business without CI
• SCIP grew from 150 (1991) to 7000 (2000)
• Press articles increased from 100 (1991) to 6000 (2000)

Competitive Intelligence Cycle (Fuld & Company, 2000, 2001)
• Planning & Direction
• Information Gathering
• Analysis & Production
• Evaluation & Tracking

AI Technologies for Web Intelligence
• Information Gathering
  – Getting the information (search, information retrieval)
• Analysis and Production
  – Putting things in perspective (clustering, categorization)
  – Gaining insights (info/knowledge extraction, discovery)
• Evaluation and Tracking

Technologies for Search
• Purpose: Getting the right information
• Challenges
  – Too much information, irrelevant information, out-of-date information
• Technologies
  – Information retrieval, PageRank
• Tools
  – General: Google, AltaVista, Excite, etc.
  – Specialized: Patent (Delphion), News (LexisNexis)

SMART (Salton, 1971)
• One of the first, and still one of the best, IR systems
• Vector space model for representing documents
• Automatic indexing
• Given a new query
  – converts it to a vector
  – uses a similarity measure to compare it to the documents
  – returns the top n documents
• Can perform relevance feedback
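The query steps on the SMART slide (convert to a vector, compare by similarity, return the top n) can be sketched roughly as follows. This is only an illustration, not SMART itself: the raw term-count vectors, the cosine measure, and the names `to_vector`, `cosine`, and `search` are assumptions chosen for brevity.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent text as a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs, n=2):
    """Return the indices of the n documents most similar to the query."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(d)), i) for i, d in enumerate(docs)]
    # Highest similarity first; ties broken by document order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:n]]


# Hypothetical mini-collection for illustration only.
docs = [
    "operating system kernel scheduling",
    "web search engine ranking",
    "competitive intelligence on the web",
]
top = search("web search", docs)
```

Here the second document shares both query terms and ranks first, the third shares one term and ranks second, and the first shares none.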

Document Representation
• Vector Space Model
  – Bag of words, e.g. operating, system
  – Terms/Phrases, e.g. operating systems
• N-grams (Huffman, TREC-4, 1995)
• Syntactic 3-tuples (Kanagasa & Pan, PRICAI-2000)
• Concept-Relation-Concept (Paik et al, US 6,263,335)

Indexing
• Goal
  – To select a set of important keyword features among all words that appear in the document set
• How
  – remove stop words, reduce to root form
  – pick terms based on part-of-speech tagging
  – keyword weighting
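A minimal sketch of the first indexing step (remove stop words, reduce to root form) is given below. The stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer; part-of-speech tagging and keyword weighting are not covered.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, drop stop words, and reduce the remaining words to stems."""
    words = [w.strip(".,;:!?()\"'") for w in text.lower().split()]
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]


terms = index_terms("Scanning and tracking the operating systems of the web")
```

The stems are not meant to be dictionary words ("scanning" becomes "scann"); they only need to map variants of a word to the same feature.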

Feature Weighting
• Goal
  – To represent a doc using a real-valued vector
• How: An example
  – For doc $d_j$ and keyword $w_i$, calculate
    • Term frequency (TF) $= TF(w_i, d_j)$
    • Inverse Document Frequency (IDF) $= \log(N / DF(w_i))$
    • TF.IDF weight $I_{ij} = TF \cdot IDF$
  – Normalize $I_j = (I_{1j}/I_m, I_{2j}/I_m, \ldots, I_{Nj}/I_m)$
    • where $I_m = \max_i I_{ij}$
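The weighting example can be sketched in a few lines. The sketch assumes raw counts for TF, the natural logarithm for IDF, and N taken as the number of documents; the function name `tfidf_vectors` and the toy documents are illustrative, not from the lecture.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute max-normalized TF.IDF vectors for a list of tokenized docs."""
    n_docs = len(docs)
    # DF(w): number of documents that contain keyword w.
    df = Counter(w for doc in docs for w in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[w] * math.log(n_docs / df[w]) for w in vocab]
        i_max = max(weights)
        # Divide by the largest weight; leave all-zero vectors unchanged.
        vectors.append([x / i_max if i_max > 0 else 0.0 for x in weights])
    return vocab, vectors


# Hypothetical tokenized documents.
docs = [
    ["web", "search", "web"],
    ["web", "clustering"],
    ["text", "clustering", "clustering"],
]
vocab, vectors = tfidf_vectors(docs)
```

After normalization the largest component of each vector is 1, so documents of different lengths become comparable.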

PageRank (Page & Brin, 1998)
• Uses the web's vast link structure as an indicator of an individual page's value
• A page that receives many links is important
• A page that receives a link from an important page is also important
• Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant
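The two "importance" rules above can be illustrated with a small power-iteration sketch. This is a textbook-style simplification, not Google's production method: the damping factor of 0.85, the fixed iteration count, and the four-page link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
```

Page c receives the most links and ends up with the highest rank; page a receives only one link, but it comes from c, so a outranks b and d.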

How to Search Tips from an Intelligence scout (Courtesy of LIT’s Planning Group)

LIT KSKS Process
1) KIT (Identify your Key Intelligence Topic)
2) Sources (and resources)
3) KIQ (Key Intelligence Questions)
4) Search Strategy

Key Intelligence Topic
• Identify your Key Intelligence Topic(s)
• Drill down
  – instead of “Ubiquitous Computing”, what sub topics are you REALLY interested in?
  – a “taxonomy” will be useful

KIT
• Start with a good descriptive paragraph on your topic, name a few applications
• Think out of the box: terminologies used by “reporters”, “journalists”, “laymen”

Sources

… and Resources
• TIME and MANPOWER and TRAINING
• Monitoring vs. Project
  – Monitoring: long periods of time, identify the delta (change)
  – Project: specific, determined period of time. Objective/goal is to know as much as possible on the topic

Key Intelligence Questions
• Known Analysis Techniques: 5F, 5C, SCP, TOWS
• LIT methodology: KIQ technique (Combo of above)
• Your KIQs form the backbone of your analysis (WYAIWYG)

…. KIQs
• Ask yourself 5-8 Key Intelligence Questions
• Establish key indicators or proxy indicators

Sample KIQs
[Diagram labels: Supply/weakness/threats; Environment/opportunities; Demand/opportunities; Strength/opportunities]
- Top industry players? (big, small, listed, unknown) Region? Profiles.
- R&D labs? Region?
- Major research trends?
- Products available? Prototypes? Technologies?
- Research challenges? (problems and issues)
- Upcoming markets (segments? size? time frame)
- IP and opportunities for LIT?

Questions
• Where are the markets for the applications?
• What time frame for market release?
• What are the price points?
• Who are the top # players? (by countries/region/labs/companies)
• What products are available? Any prototypes?
• What are the technologies behind these?
• What are the research trends/challenges?
• Any IP opportunities?

Search Strategy
• Sources and URLs
• Search “Magnets” (word/phrase spotting)
• Tools
• Reiterate!

Magnets
• Magnets are specific, well-used terms that increase the probability of a relevant hit
  – append them to your normal search string
• Trends, surveys, forecasts, estimates, units shipped, scenarios
• CEO + interview
• market research report, table of contents
• see handout “Appendix B. cheat sheet on magnets”
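As a small illustration of appending magnets to a normal search string, the sketch below generates one query per magnet. The magnet list is a sample taken from the slide, while the topic string, the phrase-quoting rule, and the helper name `magnet_queries` are assumptions.

```python
MAGNETS = ["trends", "forecasts", "units shipped", "market research report"]


def magnet_queries(topic, magnets=MAGNETS):
    """Append each magnet to the base search string, quoting phrases."""
    queries = []
    for magnet in magnets:
        # Multi-word magnets are quoted so they are matched as a phrase.
        term = f'"{magnet}"' if " " in magnet else magnet
        queries.append(f"{topic} {term}")
    return queries


# Hypothetical topic for illustration only.
queries = magnet_queries('"wearable computing"')
```

Each generated string can then be pasted into the search tool of choice and refined iteratively.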

Recap
• KIT (sub topics)
  – terms (known to you)
  – terms (used elsewhere during a search)
• Sources
  – Specific syntax
  – Magnets
• KIQs
• Tools - Search

Tools for Search
• Copernics (PC)
• Google, AltaVista Link Search (web, free)
• Lexis Nexis (web, subscription)
  – Use advanced search
  – purpose: increase relevance
  – TableBase
• InfoTech Trends (web, subscription)
• Delphion Patent Server (web, subscription)

Copernics: Search, File, Track

Google (www.google.com) - a tool for search

Google: Search

Tips for using Google
• Try the obvious first. If you're looking for information on a Java project, enter "Java project" rather than "java".
• Use words likely to appear on a site with the information you want. "Java Project Spanish Inquisition" gets better results than "spanish java".
• Make keywords as specific as possible.

All terms
• By default, Google only returns pages that include all of your search terms. There is no need to include "and" between terms. Keep in mind that the order in which the terms are typed will affect the search results.

Stop words
• If a common word is essential to getting the results you want, you can include it by putting a "+" sign in front of it. (Be sure to include a space before the "+" sign.)
• Star Wars Episode +1
• “Star Wars Episode 1”

Google: not case sensitive
• Google searches are NOT case sensitive. All letters, regardless of how you type them, will be understood as lower case. For example, searches for "george washington", "George Washington", and "gEoRgE wAsHiNgToN" will all return the same results.

Google: no stemming
• Google does not use "stemming" or support "wildcard" searches. In other words, Google searches for exactly the words that you enter in the search box.

Find out who links to you
• Find out who links to the Java Project
• link:www.xyz.com

Google: Site search
• The word "site" followed by a colon enables you to restrict your search to a specific site. To do this, use the site:sampledomain.com syntax
• spanish inquisition site:www.javadeveloper.com

Altavista: Link search
• Useful if you are looking for news surrounding a “small”, “unknown”, “unlisted” company which may be your competitor
• Instead of searching for the small company, search for who else links to or writes about that “small” company
• Who else? (what can you find out about the small company)
  – its interested investors or alliances, its suppliers, research collaborations
• Use the Good Old Alta Vista

Alta Vista “link” search
• link:infineon +"fabric" +"wearable"
  – who else links to infineon? Who else is interested in infineon?
  – note: why is www left out in the link search?
• link:lit.a-star.edu.sg -lit.a-star.edu.sg
  – everyone else except krdl (not interested in self citations)
• link:lit.a-star.edu.sg -lit.a-star.edu.sg url:edu
  – who are the edu (usually univ, including research) with interest in or collaborating with krdl
• link:lit.a-star.edu.sg -lit.a-star.edu.sg url:edu -url:edu.sg
  – same as above, but not interested in local univ.

Lexis Nexis - The Legal and News Provider

Lexis Nexis - Power Search
- Relevance, e.g. headline(“smart homes”)
- Proximity and Stemming
  e.g. comput! (stemming)
  e.g. w/10 (within 10 words)
  e.g. w/p (within paragraph)
- Limit currency (90 days, previous year), then expand

Example “red-eye correction” - (red eye) w/p patent

Lexis Nexis Power Tip 2 - Find the Elusive “Market Numbers”
Specific source within Lexis Nexis
• Select RDS TableBase
• Text articles accompanied by tabulated data from market research consultants and investment houses
• Supplement with another useful “table” database, “Infotechtrends”

Lexis Nexis’ RDS TableBase “market size” data

Results

Handset leaders?
Strategy Analytics, a Boston-based research firm, estimates that Nokia and Samsung Electronics Co. Ltd., Seoul, South Korea, were the only leading handset makers to make a profit last year.

Data and Tables (2) - InfoTech Trends
- Data compiled from various IT-related trade magazines
- Login with “ip” address

Technologies for Organizing
• Clustering
  – Organizing information into groups based on similarity functions and thresholds
  – e.g. NorthernLight, BullsEye, Vivisimo
• Categorization
  – Organizing information into a “predefined” set of classes
  – e.g. Yahoo!, Autonomy Knowledge Server

Clustering (Sch64, Wis69)
• Grouping of information based on similarities
• Unsupervised/self-organizing, requires no training or predefinition of classes
• Many methods available
  – Agglomerative, K-means, SOFM, ART, etc.
• Purpose is to identify groupings or themes automatically

Agglomerative Hierarchical Clustering (Barnard & Downs, 1992)
• Bottom up, hierarchical
• Algorithm
  – Given N inputs, begin with N clusters
  – Merge the pair of clusters that are closest
  – Update similarity matrix
  – Repeat until 1 cluster remains
• Simple
• Too slow to run
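The merge loop above can be sketched as follows. To keep the example short it recomputes all pairwise distances on each pass instead of maintaining a similarity matrix, and it assumes single linkage with Euclidean distance; neither choice is stated on the slide. The repeated all-pairs scan also hints at why the method is slow on large inputs.

```python
import math


def agglomerative(points):
    """Merge the closest pair of clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b) tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]

    def distance(a, b):
        # Single linkage: distance between the closest members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    history = []
    while len(clusters) > 1:
        pairs = [
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical 2-D points: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
history = agglomerative(points)
```

Reading the history from first to last merge gives the hierarchy: the two tight pairs form first, then join each other, and the outlier is absorbed last.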

K-means (Tou & Gonzalez, 74)
• Bottom-up, flat approach
• Algorithm
  – Initialize K reference clusters
  – Assign each data point to the nearest cluster centroid
  – Recalculate the centroid of each cluster using the mean of the inputs assigned to it
  – Repeat until convergence
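A compact sketch of the assign/recalculate loop is shown below. The initial centroids are passed in explicitly (how to initialize them is left open on the slide), Euclidean distance is assumed, and convergence is taken to mean that the centroids stop changing.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """K-means starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assign each point to the nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda k: math.dist(p, centroids[k])
            )
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups


# Hypothetical 2-D data and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, groups = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the loop converges after two passes, with one centroid per visible group of points.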

Self-Organizing Map (Kohonen, 1997)
• Initialize K cluster vectors (with neighborhood relationship)
• Given an input, identify the closest cluster
• Move that cluster vector, together with those in its neighborhood, toward the input vector
• Repeat and shrink the neighborhood until convergence
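The update rule can be sketched for a one-dimensional map, where the neighbors of a cluster are simply the adjacent entries in a list. The linear decay of learning rate and neighborhood radius, the fixed number of epochs in place of a convergence test, and the starting weights are all assumptions made for this illustration.

```python
import math


def winner(x, weights):
    """Index of the cluster vector closest to x."""
    return min(range(len(weights)), key=lambda k: math.dist(x, weights[k]))


def train_som(inputs, weights, epochs=20, rate=0.5, radius=1):
    """Train a 1-D self-organizing map; weights is a list of cluster vectors."""
    weights = [list(w) for w in weights]
    for epoch in range(epochs):
        # Shrink the neighborhood radius and learning rate over time.
        frac = 1.0 - epoch / epochs
        r = round(radius * frac)
        lr = rate * frac
        for x in inputs:
            win = winner(x, weights)
            # Move the winner and its map neighbors toward the input.
            for k in range(max(0, win - r), min(len(weights), win + r + 1)):
                weights[k] = [w + lr * (xi - w) for w, xi in zip(weights[k], x)]
    return weights


# Hypothetical inputs forming two groups, and three starting cluster vectors.
inputs = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
weights = train_som(inputs, weights=[(0.2, 0.2), (0.5, 0.5), (0.8, 0.8)])
```

After training, the two ends of the map settle on the two groups of inputs, while the middle unit is pulled back and forth between them until the neighborhood shrinks to zero.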

Tools for Search & Organizing
• BullsEye (PC)
• NorthernLight (web, free)
• Vivisimo (web, free)
• Aurigin/ThemeScape for Patents (web, subscription)

BullsEye: Search, Organize, File, Track

NorthernLight (http://www.northernlight.com)

NorthernLight Custom Search Folders™ group your results by
• Subject (e.g., hypertension, baseball, camping, expert systems, desserts)
• Type (e.g., press releases, product reviews, resumes, recipes)
• Source (e.g., personal pages, magazines, encyclopedias, databases)
• Language (e.g., English, German, French, Spanish)

Introducing Vivisimo (www.vivisimo.com) - a tool for search and clustering

Vivisimo
• Meta-search engine
• Supports the most advanced features of the major search engines using one Vivísimo syntax
• Vivísimo translates your query into the corresponding syntax of each underlying search engine

Vivisimo

Text Categorization
• A user defines a set of categories or classes
• Assigning a text document to one or more of the predefined categories or document classes
• Theme extraction
  – The simplest form of text mining

Statistical Text Categorization
• Supervised learning approach
• Examples
  – Decision tree (C4.5, C5)
  – K Nearest Neighbor (KNN)
  – Bayes classifier
  – Linear least square fit (LLSF)
  – Support vector machine (SVM)
  – Neural Networks
• Assumes the availability of a large pre-labeled training corpus
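To make the supervised setting concrete, here is a minimal sketch of one method from the list, K Nearest Neighbor. It assumes bag-of-words vectors, cosine similarity, and majority voting; the six labeled sentences are a hypothetical stand-in for the large pre-labeled corpus the slide calls for.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(text, training, k=3):
    """Label text by majority vote among its k most similar training docs."""
    query = Counter(text.lower().split())
    ranked = sorted(
        training,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pre-labeled training examples.
training = [
    ("patent filed for red eye correction", "patents"),
    ("new patent granted on image sensor", "patents"),
    ("patent dispute over camera design", "patents"),
    ("handset makers report profit", "markets"),
    ("market size forecast for handsets", "markets"),
    ("units shipped rise as market grows", "markets"),
]
label = knn_classify("camera patent for image correction", training)
```

The three nearest neighbors of the query all carry the "patents" label, so that is the predicted class.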

Autonomy’s Intelligent Data Operating Layer (IDOL) Server
• Enterprise software
• Functions
  – retrieval
  – clustering
  – categorization
  – Community & collaboration
  – XML
  – Agents
  – ...

Clustering: Pros and Cons
• Pros
  – Unsupervised/self-organizing, requires no training or predefinition of classes
  – Able to identify new themes
• Cons
  – Users have no control
  – Difficult to navigate due to ever-changing cluster structure

Categorization: Pros and Cons
• Requires learning (supervised) and/or definition of classification rules/knowledge
• Every piece of information has to be assigned to one or more class(es)
• Good control, but lacks flexibility to handle new information

User-configurable Clustering (Tan & Pan, PAKDD-02)
• New way of information organization and content management
• Combines automatic clustering with user-defined structure (preferences)
• Reduces to a clustering system if no user indication is given
• Allows personalization in a direct, intuitive, and interactive manner
• Control + flexibility

Adaptive Resonance Associative Map (ARAM) (Tan, Neural Networks, 1995)
[Figure: ARAM architecture. Category field F2 (information clusters) above two input fields, F1a (information vector A) and F1b (preference vector B), each with a vigilance check.]

FOCI (http://textmining.lit.org.sg/FOCI) - a tool for search, clustering, personalization, tracking, and sharing

Flexible Organizer for Competitive Intelligence (FOCI) (Tan et al., IJCAI-01 workshop, CIKM-01, KAIS Journal forthcoming)
• A platform for gathering, organizing, tracking, analyzing, and sharing intranet and internet based competitive information
• New way of turning raw information into competitive knowledge
• First multilingual CI software
  – Based on LIT Multilingual Efficient Analyzer
  – English and Chinese
• Domain localization (Technology)

FOCI Architecture
[Figure: components are User’s CI Portfolio, Content Gathering, Content Management, Content Mining, Domain-Specific Knowledge, Content Publishing, Visualization, Front End, Intranet/Internet.]

FOCI - Personalized Content Management
• Portfolio created through Search
• Unsupervised clustering
• Loop
  – Personalization by users
  – Reorganization of clusters
• Saving of personalized portfolio
• Tracking of new information

Personalization Functions
• Marking/labeling (selected) clusters
  – Personal interpretation
• Inserting clusters
  – Indicate preferences on groupings
• Merging clusters
  – Indicate preferences on similarities
• Splitting clusters
  – Indicate preferences on differences
• ...

Clustering by URL + Title + Description

A partially Personalized Portfolio

A fully Personalized Portfolio

Organizing New Information (Without Personalization)
42 documents from DirectHit, Netscape, and BusinessWire

Organizing New Information (Based on Personalized Portfolio)

Technologies for Analyzing
• To analyze document content in terms of entities and relations
• Challenges
  – Need to understand natural language
• Technologies
  – Information extraction
  – Knowledge extraction
  – Concept map visualization
  – Discovery of new knowledge

Information Extraction vs Knowledge Extraction
Similarity
• Text to semi-structured or structured form
Differences (information extraction vs knowledge extraction)
• Predefined/pre-trained templates vs need to handle new concepts
• Flat/relational vs deep structure
• For building databases vs for building knowledge bases
• Records and fields vs facts and rules

Knowledge Extraction by Concept Frame Graph (Kanagasa & Tan, CIKM 2002) • Concept extraction

Knowledge Extraction by Concept Frame Graph (Kanagasa & Tan, CIKM 2002)
• Concept mapping
• Q&A

Technology Landscape
• Search and organizing
  – already mature
  – many vendors
  – Autonomy, Verity, Mohomine, Semio, Stratify, ...
• Analysis
  – still in research
  – real knowledge discovery

What’s next?
• Autonomous agents
  – Personal software that spies for you
• Semantic Web (www.w3.org, www.semanticweb.org, www.ontoweb.org)
  – XML/RDF
  – web-based applications and services

Semantic Web (Tim Berners-Lee et al, Scientific American, May 2001)
Assumption
• The real power of the WWW as a platform for knowledge repository and sharing has yet to be unleashed
Vision
• Automated services, interweaving computers and human beings
• SW will bring structure to the web, creating an environment where software agents ... can readily carry out sophisticated tasks for users

Semantic Web + Agents
[Figure: layers are Information Mining/Knowledge Management, Ontology, Standard (XML, RDF), The Old Web.]

More readings
• Intelligence Software Report
  – (http://www.fuld.com/softwareguide/index.html)
  – more info integration and data analysis software
• Taxonomy & Content Classification
  – A Delphi Group White Paper (www.delphigroup.com)
  – more content/information management software