Enabling web browsers to augment web sites’ filtering and sorting functionalities · David Huynh · Rob Miller · David Karger · MIT Computer Science & Artificial Intelligence Laboratory · UIST 2006 · Montreux, Switzerland 1

Automatic web content scraping (2003 to now) 1. Zhai, Y., and B. Liu. Web data extraction based on partial tree alignment. WWW 2005. 2. Hogue, A., and D. Karger. Thresher: automating the unwrapping of semantic content from the World Wide Web. WWW 2005. 3. Reis, D. C., P. B. Golgher, A. S. Silva, and A. F. Laender. Automatic Web news extraction using tree edit distance. WWW 2004. 4. Lerman, K., L. Getoor, S. Minton, and C. Knoblock. Using the structure of Web sites for automatic segmentation of tables. SIGMOD 2004. 5. Ramaswamy, L., et al. Automatic detection of fragments in dynamically generated web pages. WWW 2004. 6. Wang, J.-Y., and F. Lochovsky. Data extraction and label assignment for Web databases. WWW 2003. 7. Arasu, A., and H. Garcia-Molina. Extracting structured data from Web pages. SIGMOD 2003. 8. Liu, B., R. Grossman, and Y. Zhai. Mining data records in Web pages. SIGKDD 2003. 2

… but no one has tried to put … Automatic structured web content scraping technologies in the hands of end-users 3

… let’s run through a real task … Paperback books published in 2005 or later by John Grisham on Amazon 4

… that was a demo of putting … Automatic structured web content scraping technologies in the hands of end-users 5

Sifter browser extension 6

Outline • Motivations 1. User Interface Design • Extraction • Augmentation 2. Extraction Algorithm • Evaluations 1. Extraction Algorithm 2. User Interface Design • Conclusions 7

Motivations • Not all web sites are designed based on task analysis and user analysis. • Faceted browsing? • Map view? • Calendar view? • Features are not implemented consistently across sites. • Web browsers can provide a unified sorting/filtering interface. • Not all users have exactly the same needs. • No site can ever design for all users. • Each web browser can tailor the experience to its owner. 8

Motivations 9

User Interface Design – Extraction • Web content extraction is a system precondition poorly understood by users. • If it doesn’t let me do this, … • If the web site understands that this is the original price ($8.99), … • If I can see that this is a date (“last Christmas”), … 11

User Interface Design – Extraction • Extraction is lengthy and error-prone. • We explore the UI's potential even in the face of fragile extraction. • This tells us which aspects of extraction should be improved first, and in which ways. • We minimize the steps required to kick-start extraction. • But we give the user a chance to make corrections early. 12

UI Design - Extraction • 1st click • Preview of results • Controls for making corrections • 2nd click, if all goes well 13

User Interface Design - Augmentation • Novelty • Presentation of data remains unchanged • … except for a few asterisks. • The presentation might be well-designed with domain-specific knowledge, and worth keeping as-is. • Semantics of the data are in the presentation. • We want to maintain visual context. • Filtering and sorting are supported without resorting to field names. 15

User Interface Design - Augmentation • By keeping the original visual presentation of the data, and then applying automatic content extraction technology, we can provide additional functionality without needing, trying, or pretending to understand the semantics of the data. • Format? Binding? Medium? Who cares?! 16

… ssshhhh … Semantics is Overrated 17
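One way to picture filtering and sorting without field names is to address each extracted field by its position within an item and to guess a comparable value from its raw text. The sketch below is only an illustration under that assumption: `parse_value`, `sort_items`, `filter_items`, and the sample rows are hypothetical and are not taken from Sifter's source code.

```python
import re

def parse_value(text):
    """Guess a comparable value from raw field text.

    Returns (0, number) if the text contains a number, else (1, lowercased text),
    so that numeric and textual values never get compared directly.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    if match:
        return (0, float(match.group()))
    return (1, text.strip().lower())

def sort_items(items, position, descending=False):
    """Sort items by the field at a given position; no field name is needed."""
    return sorted(items, key=lambda item: parse_value(item[position]),
                  reverse=descending)

def filter_items(items, position, keep):
    """Keep the items whose parsed field at `position` satisfies `keep`."""
    return [item for item in items if keep(parse_value(item[position]))]

# Made-up sample rows: each item is just a list of raw field strings.
items = [
    ["Book A", "$8.99", "2004"],
    ["Book B", "$12.50", "2006"],
    ["Book C", "$7.25", "2005"],
]

by_price = sort_items(items, position=1)
recent = filter_items(items, position=2,
                      keep=lambda v: v[0] == 0 and v[1] >= 2005)
print([row[0] for row in by_price])
print([row[0] for row in recent])
```

The user only ever points at "the second field" or "the third field" in the page as it is displayed, so the code never needs to know that one is a price and the other a year.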

Extraction Algorithm Detection of 1. Items of interest 2. Subsequent pages 3. Fields within items 19

Extraction Algorithm - Assumptions 1. Items occupy most of the page area. 2. Each item contains links. Find THE set of similar links whose outer containers occupy the largest page area compared to other sets of links. 20

[Animation, slides 21-33, over a DOM tree containing BODY, TABLE, TR (item 1, item 2), TD, DIV, and A nodes] • Starting from a link, the path is lengthened one ancestor at a time: A → DIV/A → TD/DIV/A → TR/TD/DIV/A → TABLE/TR/TD/DIV/A → BODY/TABLE/TR/TD/DIV/A • Found similar links! • Parent steps are then appended (BODY/TABLE/TR/TD/DIV/A/..) to move from the links up toward their outer containers. • Found one potential set of items!
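The two assumptions and the path walk-through above can be sketched roughly as follows. This is a minimal illustration, not Sifter's implementation: the `Node` class, the precomputed `area` values, and the stopping rule in `climb_to_items` (stop just before the containers merge into a single ancestor) are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a DOM element with a rendered area."""
    tag: str
    area: int = 0
    children: list["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, *kids):
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self

def tag_path(node):
    """Build a path such as BODY/TABLE/TR/TD/DIV/A by walking up to the root."""
    parts = []
    while node is not None:
        parts.append(node.tag)
        node = node.parent
    return "/".join(reversed(parts))

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def climb_to_items(links):
    """Apply parent steps (the /.. in the slides) until the containers would merge."""
    current = list(links)
    while True:
        parents = []
        for node in current:
            if node.parent is None:
                return current
            if all(node.parent is not seen for seen in parents):
                parents.append(node.parent)
        if len(parents) < 2:
            return current
        current = parents

def detect_items(root):
    """Group links by tag path, then pick the set whose containers cover the most area."""
    groups = defaultdict(list)
    for node in walk(root):
        if node.tag == "A":
            groups[tag_path(node)].append(node)
    best, best_area = [], -1
    for links in groups.values():
        if len(links) < 2:
            continue
        containers = climb_to_items(links)
        area = sum(c.area for c in containers)
        if area > best_area:
            best, best_area = containers, area
    return best

def item_row():
    row = Node("TR", area=400)
    row.add(Node("TD").add(Node("DIV").add(Node("A", area=20))))
    return row

# A toy page: a small navigation bar with two links, and a table with two item rows.
nav = Node("DIV", area=50).add(Node("A", area=10), Node("A", area=10))
table = Node("TABLE", area=800).add(item_row(), item_row())
page = Node("BODY", area=1000).add(nav, table)

items = detect_items(page)
print([n.tag for n in items])
```

In this toy page the navigation links share one parent, so they stay as a small-area set, while the item links climb to their TR rows, which together cover most of the page.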

Extraction Algorithm – Subsequent page detection 34

Extraction Algorithm – Subsequent page detection • URL parameters • http://amazon.com/...?...&page=2&... • http://amazon.com/...?...&page=3&... • http://amazon.com/...?...&page=4&... 35
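The URL-parameter idea can be illustrated with a small sketch: among the links on a page, look for URLs that are identical except for one query parameter holding different integers. The function name `find_page_param` and the `example.com` URLs are hypothetical, and the heuristic shown here is an assumption for illustration rather than the exact rule Sifter uses.

```python
from urllib.parse import parse_qsl, urlsplit

def find_page_param(urls):
    """Return (param_name, {page_number: url}) for the query parameter that
    varies over integers while the rest of the URL stays the same, or None."""
    groups = {}
    for url in urls:
        parts = urlsplit(url)
        query = parse_qsl(parts.query, keep_blank_values=True)
        for i, (name, value) in enumerate(query):
            if not value.isdigit():
                continue
            # Everything except this one parameter must match across pages.
            rest = tuple(query[:i] + query[i + 1:])
            key = (parts.scheme, parts.netloc, parts.path, name, rest)
            groups.setdefault(key, {})[int(value)] = url
    best = max(groups.items(), key=lambda kv: len(kv[1]), default=None)
    if best is None or len(best[1]) < 2:
        return None
    key, pages = best
    return key[3], dict(sorted(pages.items()))

# Hypothetical links found on a search-results page.
links = [
    "http://example.com/s?k=grisham&page=2&ref=x",
    "http://example.com/s?k=grisham&page=3&ref=x",
    "http://example.com/s?k=grisham&page=4&ref=x",
    "http://example.com/help?id=7",
]

result = find_page_param(links)
print(result)
```

Once the varying parameter is known, the remaining pages can be fetched in order and run through the same item detection as the first page.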

Evaluations – Extraction algorithm • Tests were conducted over 30 web sites: • Amazon, Best Buy, CNET Reviews, Froogle, Target, Walmart, … • Item detection • Items in 27/30 collections can be identified by XPaths (in the remaining 3, items consist of sibling/cousin nodes) • … but only 24/27 were automatically detected • Subsequent page detection • For 22/27 collections, subsequent pages could be identified. • For 19/22 collections, the original numbers of items were recovered. • Overall • 19/30 = 63% accuracy • We measure accuracy at the level of whole collections, not individual items. 37

Evaluations – User Interface Design • Extraction algorithm is still fragile • Formative evaluation of UI • Is “web content extraction” too high a conceptual barrier? • Is in-place sorting/filtering augmentation usable? • No field name – usable? • Is such augmentation useful? 38

Evaluations – User Interface Design • Task 1: Structured • This task lets subjects get familiar with the UI. • No specific help or tutorial is provided. • Subject follows a sequence of high-level instructions to ultimately perform a complex query. • sort by price • filter by date • Subject is given 5 min to perform a similar query using the web site. • Task 2: Unstructured • Subject judges whether a sale of several products is good. 40

Evaluations – User Interface Design • Task 1: Structured • 8/8 subjects completed the task using our system. • 5/8 subjects completed it using the web site within 5 minutes. • 1/8 knew about Amazon’s Advanced Search. • All subjects were familiar with Amazon. • A unified filtering/sorting UI can be more usable than different UIs on different sites. • Task 2: Unstructured • 7/8 subjects completed the task using our system. • 1 refused to complete the task. 41

Evaluations – UI Design • Survey responses indicate • Our system is usable and useful • … while it offers advanced functionalities. 42

Conclusions • In our work, we … • Preserve original presentation to leverage the semantics within it; • Provide filter/sort functionalities without field names; • Put automatic web content extraction technologies into the hands of end-users; • Show evidence that it’s usable and useful. • For future work, we will focus on … • Error recovery; • Merging data from several sites. 43

More information • http://simile.mit.edu/wiki2/Sifter • Firefox extension installation file • Open source code + build instructions • Links to video and user study data 44