Information Society 2002, Ljubljana, Slovenija
A Shopping Agent for the WWW
Aleksander Pivk
Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia
16th October 2002
ICEIS 2002
What is an (intelligent) agent?
• An intelligent agent is a computer system capable of flexible, autonomous action in some environment.
• Examples:
  – Environment: internet agent, OS agent, desktop agent, WWW agent, etc.
  – Task: information agent, shopping agent, interface agent, email agent, notification agent, etc.
IS 2002 2
Information agents
• Task:
  – access/integrate information from a variety of data sources
• Types:
  – Information Retrieval Agents
    • search engines
  – Information Filtering Agents
    • mail agents, news-delivery agents
  – Information Extraction Agents
    • wrappers
  – Information Integration Agents
    • meta-search engines, comparison shopping
Information Extraction
• IE is the task of identifying the specific fragments of a single document that constitute its core semantic content.
• Examples:
  a) from a weather report, identify locations, dates, and temperatures (high and low);
  b) from online stores, get product names, their images, and prices.
NAME Casablanca Restaurant
STREET 220 Lincoln Boulevard
CITY Venice
PHONE (310) 392-5751
Wrappers
• A wrapper is …
  – a procedure or a rule that explains how to extract information from an information source
  – tailored to a particular document collection
  – appropriate to semi-structured information sources
• Why use wrappers?
  – heterogeneous information sources
  – different styles of user interface and different formats of output display
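As a rough illustration of the definition above, a wrapper can be as small as a table of left/right delimiters, one pair per attribute. The sketch below assumes a hypothetical page layout (the markup in `PAGE`, the `RULES` table, and the `apply_wrapper` helper are all invented for illustration) and is not the wrapper format used by the system in these slides. It recovers the restaurant record shown on the Information Extraction slide.

```python
# Hypothetical semi-structured source; this markup is an assumption.
PAGE = (
    "<b>Casablanca Restaurant</b><br>"
    "<i>220 Lincoln Boulevard</i>, <u>Venice</u><br>"
    "Phone: <tt>(310) 392-5751</tt>"
)

# One (left delimiter, right delimiter) rule per attribute.
# The rules are tailored to PAGE and would not fit another site.
RULES = {
    "NAME": ("<b>", "</b>"),
    "STREET": ("<i>", "</i>"),
    "CITY": ("<u>", "</u>"),
    "PHONE": ("<tt>", "</tt>"),
}


def apply_wrapper(page, rules):
    """Return the text between each attribute's delimiters, or None if absent."""
    record = {}
    for attribute, (left, right) in rules.items():
        start = page.find(left)
        end = page.find(right, start + len(left)) if start != -1 else -1
        record[attribute] = page[start + len(left):end] if end != -1 else None
    return record


record = apply_wrapper(PAGE, RULES)
```

Because the rules encode one site's layout, a different store needs a different rule table, which is why learning such rules automatically is attractive.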
Implemented System
• ShinA
  – (SHoppINg Assistant)
  – Customized Comparison Shopping Agent
  – simple heuristic-based approach
  – little domain knowledge used
ShinA – Shopping Assistant
Our focus
• Wrapper learning in real time
  – to realize a customized comparison shopper
• Little use of domain knowledge
  – rather use simple heuristics
  – exploit the characteristics of semi-structured documents
• Flexible and practical
  – handle both table-type and list-type displays
  – handle noisy product descriptions (missing attributes)
  – handle a single product description spread over multiple lines
Learning Query Scheme Templates
<form site="amazon.com">
  <name>searchform</name>
  <method>post</method>
  <action>www.amazon.com/exec/obidos/search-handle-form</action>
  <input type="text" name="field-keywords" size="15" />
  <input type="image" name="Go" />
  <select name="index">
    <option value="all products" selected />
    <option value="books" />
    <option value="…" />
  </select>
</form>
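A minimal sketch of how a learned template like the one above might be turned into a form submission. The `QueryTemplate` class and `build_query` helper are hypothetical names, the field values are copied from the slide, the `http://` prefix is an assumption, and nothing is sent over the network.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class QueryTemplate:
    """Hypothetical in-memory form of a learned query scheme template."""
    site: str
    method: str
    action: str
    text_field: str  # name of the free-text search box
    fixed_fields: dict = field(default_factory=dict)  # e.g. preselected options


def build_query(template, keywords):
    """Return (method, url, encoded form data) for a keyword search."""
    fields = dict(template.fixed_fields)
    fields[template.text_field] = keywords
    # Assumption: the action is a host-relative address served over plain HTTP.
    url = "http://" + template.action
    return template.method.upper(), url, urlencode(fields)


amazon = QueryTemplate(
    site="amazon.com",
    method="post",
    action="www.amazon.com/exec/obidos/search-handle-form",
    text_field="field-keywords",
    fixed_fields={"index": "all products"},
)
method, url, body = build_query(amazon, "intelligent agent")
```

Keeping the template as data means the same `build_query` call can serve every store for which a template has been learned.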
Learning product descriptions
• Table-type display of 5 different PDUs
• Task
  – recognize each PDU
  – recognize attributes within a PDU
  – learn rules to extract attributes
PDU: Product Description Unit
PDU Pattern Learning: Algorithm
• First phase
  – ignore irrelevant parts of the HTML source (header, advertisements, footer)
  – the remaining HTML source is broken into logical lines
• Second phase
  – categorize each logical line
  – 9 different categories (PRICE, TITLE, IMAGE, URL_LINK, TTAG, LBTAG, etc.)
• Third phase
  – find the most frequent pattern(s) for PDU(s) in the sequence of logical line categories
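The third phase can be sketched as a search for a repeated substring in the string of category codes. The slides do not say how candidate patterns are scored, so the rule used here (number of non-overlapping occurrences times pattern length, with at least two occurrences) is an assumption, as are the function name and the toy input.

```python
def most_frequent_pattern(categories, min_len=3, min_count=2):
    """Return the repeated substring covering most of `categories`.

    Score = non-overlapping occurrences * length (an assumed criterion).
    Returns None when nothing repeats at least `min_count` times.
    """
    best, best_score = None, 0
    seen = set()
    n = len(categories)
    # A pattern occurring min_count times cannot be longer than n // min_count.
    for length in range(min_len, n // min_count + 1):
        for start in range(n - length + 1):
            candidate = categories[start:start + length]
            if candidate in seen:
                continue
            seen.add(candidate)
            count = categories.count(candidate)  # str.count is non-overlapping
            score = count * length
            if count >= min_count and score > best_score:
                best, best_score = candidate, score
    return best


# Toy input: two leading noise lines, three identical product
# descriptions, and one trailing noise line.
listing = "99" + "244531950" * 3 + "9"
pattern = most_frequent_pattern(listing)
```

Scoring by coverage rather than raw frequency keeps very short fragments, which repeat often by chance, from beating the full product description.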
PDU Pattern Learning: Example
A fragment of the HTML source of the search result for the query “intelligent agent” to the Amazon bookstore.
<img src="http://g-images.amazon.com/images/G/01/v9/130668.jpg" width="80" height="80" vspace="2" alt="">
</td>
<p>
<a href="http://www.amazon.com/book.asp?id=010101&book=130668">
Intelligent Internet Agents: Agent-Based Information Discovery on the Internet
</a>
$59.95
Logical-line categories: 2, 4, 5, 3, 1, 9, 5, 0
{ 0: price; 1: title; 2: image; 3: link; 4: table tag; 5: line tag; 9: other tag }
Extracted PDU pattern: 244531950
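The second phase, mapping each logical line to a category code, might look like the sketch below. The code numbers come from the legend on the slide. The tag groupings (`TABLE_TAGS`, `LINE_TAGS`), the price regex, and the fallback of treating any other text as a title candidate are assumptions. Run on the seven lines printed in the fragment, it yields seven codes. The slide's own annotation and extracted pattern contain additional codes, presumably for lines not reproduced in the fragment.

```python
import re

PRICE = re.compile(r"[$€]\s*\d")
TABLE_TAGS = ("table", "tr", "td", "th")  # assumed members of "table tag"
LINE_TAGS = ("p", "br", "hr")             # assumed members of "line tag"


def categorize(line):
    """Map one logical line to a category code from the slide's legend."""
    line = line.strip()
    tag = re.match(r"</?\s*([a-zA-Z0-9]+)", line)
    if tag is None:
        # Plain text: a price if it looks like one, else a title candidate.
        return "0" if PRICE.search(line) else "1"
    name = tag.group(1).lower()
    if name == "img":
        return "2"
    if name == "a" and not line.startswith("</"):
        return "3"
    if name in TABLE_TAGS:
        return "4"
    if name in LINE_TAGS:
        return "5"
    return "9"  # any other tag, including closing </a>


fragment = [
    '<img src="http://g-images.amazon.com/images/G/01/v9/130668.jpg" '
    'width="80" height="80" vspace="2" alt="">',
    "</td>",
    "<p>",
    '<a href="http://www.amazon.com/book.asp?id=010101&book=130668">',
    "Intelligent Internet Agents: Agent-Based Information Discovery "
    "on the Internet",
    "</a>",
    "$59.95",
]
codes = "".join(categorize(line) for line in fragment)
```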
Simple Heuristics
• Recognizing a title
  – contains at least one query word
  – text line that corresponds to the title position in a pre-determined pattern
• Recognizing a price
  – contains a currency symbol ($, €)
  – contains a currency token (EUR, SIT)
  – contains digit(s) with relevant delimiters (‘,’ and ‘.’)
• Recognizing an image
  – unique image URL within the pattern
• Able to recognize attributes with heuristic rules
  – examples: ISBN numbers, dates, discount rates
• Unable to recognize other attributes
  – authors, review comments, recommendation status
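The title and price heuristics above could be written roughly as follows. The slide lists the price cues separately and does not say how they are combined, so requiring an amount together with either a currency symbol or a currency token is an assumption, and the function names are hypothetical.

```python
import re

CURRENCY_SYMBOLS = "$€"
CURRENCY_TOKENS = ("EUR", "SIT")
# Digits optionally grouped by ',' or '.', e.g. 59.95 or 12.500,00
AMOUNT = re.compile(r"\d+(?:[.,]\d+)*")


def looks_like_price(text):
    """True if the text has an amount plus a currency symbol or token."""
    if not AMOUNT.search(text):
        return False
    has_symbol = any(symbol in text for symbol in CURRENCY_SYMBOLS)
    has_token = any(
        re.search(rf"\b{token}\b", text) for token in CURRENCY_TOKENS
    )
    return has_symbol or has_token


def looks_like_title(text, query):
    """True if the text shares at least one word with the query."""
    words = set(re.findall(r"\w+", text.lower()))
    return any(word in words for word in query.lower().split())
```

For example, `looks_like_price("$59.95")` holds while `looks_like_price("Published in 2002")` does not, because the year has no currency marker next to it.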
Conclusion
• Limitations
  – query search box must exist
  – price information must exist
  – extracts only a few attributes (title, price, image, link, …)
• Future work
  – more use of domain knowledge (ontologies)
  – extract other non-price attributes
  – use of XML-based wrappers
  – applications to other domains