Classifying and Searching "Hidden-Web" Text Databases

Panos Ipeirotis
Department of Information Systems, New York University (IOMS Department)
Motivation: "Surface" Web vs. "Hidden" Web

"Surface" Web:
- Link structure
- Crawlable
- Documents indexed by search engines

"Hidden" Web:
- No link structure
- Documents "hidden" in databases
- Documents not indexed by search engines
- Need to query each collection individually

12/19/2021 Panos Ipeirotis - New York University - IOMS Department
Hidden-Web Databases: Examples

Search on the U.S. Patent and Trademark Office (USPTO) database for [wireless network]: 39,270 matches
(the USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html)

Search on Google restricted to the USPTO site, [wireless network site:patft.uspto.gov]: 1 match

Database              Query              Database matches   Site-restricted Google matches
USPTO                 wireless network   39,270             1
Library of Congress   visa regulations   >10,000            0
PubMed                thrombopenia       29,022             2

(as of Oct 3rd, 2005)
Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories, e.g., InvisibleWeb.com and SearchEngineGuide.com, populated manually
- Searching: metasearchers
Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- Managing Changes in Hidden-Web Databases
Hierarchically Classifying the ACM Digital Library

(diagram: the ACM Digital Library being placed under candidate nodes of a topic hierarchy; the question is which categories it belongs to)
Text Database Classification: Definition

For a text database D and a category C:
- Coverage(D, C) = number of documents in D about C
- Specificity(D, C) = fraction of documents in D about C

Assign a text database to a category C if:
- Coverage(D, C) is at least Tc, the coverage threshold (e.g., > 100 documents in C)
- Specificity(D, C) is at least Ts, the specificity threshold (e.g., > 40% of documents in C)
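The thresholded rule above can be sketched in a few lines. This is a toy illustration, not the talk's implementation: `classify_database`, `doc_topics`, and the example data are hypothetical names invented here, and the sketch assumes full access to every document, which the rest of the talk shows Hidden-Web databases do not provide.

```python
# Toy sketch of the Coverage/Specificity classification rule.
# Assumes full access to the database contents (hence the need for the
# query-based approach described in the following slides).

def classify_database(doc_topics, category, t_c=100, t_s=0.40):
    """Assign the database to `category` if both thresholds are met.

    doc_topics: hypothetical mapping of document id -> topic label.
    t_c: coverage threshold Tc; t_s: specificity threshold Ts.
    """
    total = len(doc_topics)
    coverage = sum(1 for t in doc_topics.values() if t == category)
    specificity = coverage / total if total else 0.0
    return coverage >= t_c and specificity >= t_s

# Toy database: 150 of 200 documents are about "Health".
docs = {i: ("Health" if i < 150 else "Sports") for i in range(200)}
print(classify_database(docs, "Health"))  # → True  (coverage 150, specificity 0.75)
print(classify_database(docs, "Sports"))  # → False (coverage 50 < Tc)
```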
Brute-Force Classification "Strategy"
1. Extract all documents from the database
2. Classify the documents by topic (using state-of-the-art classifiers: SVMs, C4.5, RIPPER, ...)
3. Classify the database according to its topic distribution

Problem: no direct access to the full contents of Hidden-Web databases
Classification: Goal & Challenges

Goal: discover the database's topic distribution

Challenges:
- No direct access to the full contents of Hidden-Web databases
- Only limited search interfaces are available
- The databases should not be overloaded

Key observation: only queries "about" the database's topic(s) generate a large number of matches
Query-based Database Classification: Overview
1. Train a document classifier
2. Extract queries from the classifier
3. Adaptively issue the queries to the database
4. Identify the topic distribution based on the adjusted number of query matches
5. Classify the database

(pipeline diagram: TRAIN CLASSIFIER → EXTRACT QUERIES → QUERY DATABASE → IDENTIFY TOPIC DISTRIBUTION → CLASSIFY DATABASE; example queries: Sports: +nba +knicks; Health: +sars → 1,254 matches)
Training a Document Classifier
1. Get a training set (a set of pre-classified documents)
2. Select the best features to characterize documents (Zipf's law + information-theoretic feature selection) [Koller and Sahami 1996]
3. Train the classifier (SVM, C4.5, RIPPER, ...)

Output: a "black-box" model for classifying documents
Extracting Query Probes [ACM TOIS 2003]

Transform the classifier model into queries:
- Trivial for rule-based classifiers (RIPPER)
- Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules)
- Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naive Bayes, ...)

Example query for Sports: +nba +knicks
Querying the Database with Extracted Queries [SIGMOD 2001, ACM TOIS 2003]
- Issue each query to the database to obtain its number of matches, without retrieving any documents (e.g., Health: +sars → 1,254 matches)
- Increase the coverage estimate of the rule's category accordingly (e.g., #Sports = #Sports + 706)
Identifying the Topic Distribution from Query Results

Query-based estimates of the topic distribution are not perfect:
- Document classifiers are not perfect: rules for one category match documents from other categories
- Querying is not perfect: queries for the same category might overlap, and queries do not match all documents in a category

Solution: learn to adjust the results of the query probes
Confusion Matrix Adjustment of Query Probe Results

Confusion matrix M (rows: assigned class; columns: correct class):

          comp   sports  health
comp      0.80   0.10    0.00
sports    0.08   0.85    0.04
health    0.02   0.15    0.96

(e.g., 10% of "sports" documents match queries for "computers")

Correct (but unknown) topic distribution: comp = 1000, sports = 5000, health = 50

Estimated coverage derived from query probing (M × real coverage):
- comp:   0.80·1000 + 0.10·5000 + 0.00·50 = 1300
- sports: 0.08·1000 + 0.85·5000 + 0.04·50 = 4332
- health: 0.02·1000 + 0.15·5000 + 0.96·50 = 818

This "multiplication" can be inverted to get a better estimate of the real topic distribution from the probe results.
Confusion Matrix Adjustment of Query Probe Results

Coverage(D) ≈ M⁻¹ · ECoverage(D)
(adjusted estimate of the topic distribution = inverted confusion matrix × probing results)

- M is usually diagonally dominant for "reasonable" document classifiers, hence invertible
- The adjustment compensates for errors in the query-based estimates of the topic distribution
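The adjustment above is just a linear solve. A minimal sketch with NumPy, using the example numbers from the slides (the variable names and the use of `numpy.linalg.solve` are my own choices; the talk only specifies the matrix inversion):

```python
# Confusion-matrix adjustment: Coverage(D) ≈ M⁻¹ · ECoverage(D),
# using the example numbers from the slides.
import numpy as np

# Rows: assigned class; columns: correct class (comp, sports, health).
M = np.array([[0.80, 0.10, 0.00],
              [0.08, 0.85, 0.04],
              [0.02, 0.15, 0.96]])

# Raw probe-based coverage estimates (matches per category).
estimated = np.array([1300.0, 4332.0, 818.0])

# Solving M x = estimated recovers the correct distribution: 1000, 5000, 50.
adjusted = np.linalg.solve(M, estimated)
print(np.round(adjusted))
```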
Classification Algorithm (Again)
1. Train the document classifier
2. Extract queries from the classifier
3. Adaptively issue the queries to the database
4. Identify the topic distribution based on the adjusted number of query matches
5. Classify the database

Steps 1-2 are a one-time process; steps 3-5 are repeated for every database.
Experimental Setup
- 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
- 500,000 Usenet articles:
  - Newsgroups assigned by hand to hierarchy nodes (e.g., comp.hardware, rec.music.classical, rec.photo.*)
  - RIPPER trained with 54,000 articles (1,000 per leaf); 27,000 articles used to construct the confusion matrix
- 500 "controlled" databases built from 419,000 newsgroup articles (to run detailed experiments)
- 130 real Web databases picked from InvisibleWeb (the first 5 under each topic)
Experimental Results: Controlled Databases

Accuracy (using the F-measure):
- Above 80% for most ⟨Tc, Ts⟩ threshold combinations tried
- Degrades gracefully with hierarchy depth
- The confusion-matrix adjustment helps

Efficiency: a relatively small number of queries (< 500) suffices for most ⟨Tc, Ts⟩ combinations tried
Experimental Results: Web Databases

Accuracy (using the F-measure):
- ~70% for the best ⟨Tc, Ts⟩ combination
- Learned the thresholds that reproduce the human classification
- Tested the threshold choice using 3-fold cross-validation

Efficiency:
- 120 queries per database needed on average for this choice of thresholds, with no documents retrieved
- Only a small part of the hierarchy is "explored"
- Queries are short: 1.5 words on average, 4 words maximum (easily handled by most Web databases)
Hidden-Web Database Classification: Summary
- Handles autonomous Hidden-Web databases accurately and efficiently:
  - ~70% F-measure
  - Only 120 queries issued on average, with no documents retrieved
- Handles a large family of document classifiers (and can hence exploit future advances in machine learning)
Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- Managing Changes in Hidden-Web Databases
Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories
- Searching: metasearchers (this content is not accessible through Google)

(diagram: the metasearcher dispatches the user query to PubMed, USPTO, NYTimes Archives, the Library of Congress, ...)
Metasearchers Provide Access to Distributed Databases

Database selection relies on simple content summaries: the vocabulary and its word frequencies. For example, for PubMed (11,868,552 documents):

  aids            121,491
  cancer        1,562,477
  heart           691,360
  hepatitis       121,129
  thrombopenia     24,826

For the query [thrombopenia], the metasearcher compares the summaries: PubMed has 24,826 matching documents, the NYTimes Archives 0, the USPTO 18.

Problem: databases typically do not export such summaries!
Extracting a Representative Document Sample

Focused sampling:
1. Train a document classifier
2. Create queries from the classifier
3. Adaptively issue the queries to the databases
   - Retrieve the top-k matching documents for each query
   - Save the number of matches for each one-word query
4. Identify the topic distribution based on the adjusted number of query matches
5. Categorize the database
6. Generate the content summary from the document sample

Focused sampling retrieves documents only from "topically dense" areas of the database.
Sampling and Incomplete Content Summaries

Problem: summaries from small samples are highly incomplete.

(log-log plot: frequency vs. rank of the 10% most frequent words in the PubMed database; 95% of the words appear in less than 0.1% of the database. For example, "endocarditis" appears in roughly 10,000 documents, about 0.1% of the database, yet a 300-document sample is likely to miss it.)

- Many words appear in "relatively few" documents (Zipf's law)
- Low-frequency words are often important
- Small document samples miss many low-frequency words
Sample-based Content Summaries

Challenge: improve content-summary quality without increasing the sample size.

Main idea: database classification helps
- Similar topics ↔ similar content summaries
- Extracted content summaries can complement each other
Databases with Similar Topics
- CANCERLIT contains "metastasis", but the word was not found during sampling
- CancerBACUP also contains "metastasis"
- Databases under the same category have similar vocabularies and can complement each other
Content Summaries for Categories
- Databases under the same category share a similar vocabulary
- Higher-level category content summaries provide additional useful estimates
- All estimates along the category path are potentially useful
Enhancing Summaries Using "Shrinkage" [SIGMOD 2004]
- Estimates from database content summaries can be unreliable
- Category content summaries are more reliable (based on larger samples) but less specific to the database
- Combining estimates from the category and database content summaries yields better estimates
Computing Shrinkage-based Summaries

Given the category path Root → Health → Cancer → D, adjust each word-probability estimate for D as a weighted combination along the path:

  Pr[metastasis | D] = λ1·0.002 + λ2·0.05 + λ3·0.092 + λ4·0.000
  Pr[treatment | D]  = λ1·0.015 + λ2·0.12 + λ3·0.179 + λ4·0.184
  ...

- Select the λi weights to maximize the probability that the summary of D comes from a database under all its parent categories
- The λi weights are computed automatically with an EM algorithm
- The computation is performed offline: no query overhead
- Avoids the "sparse data" problem and decreases the estimation risk
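Once the λi are known, the mixture itself is a one-liner. A small sketch using the CANCERLIT numbers from the talk (the EM estimation of the λi is not shown here; the weights are taken as given):

```python
# Shrinkage-based word-probability estimate: a λ-weighted mixture of the
# estimates along the category path Root → Health → Cancer → database.

def shrinkage_estimate(path_probs, lambdas):
    """Combine per-level probability estimates with mixture weights λi."""
    assert abs(sum(lambdas) - 1.0) < 1e-9  # weights form a convex combination
    return sum(l * p for l, p in zip(lambdas, path_probs))

# λ weights reported for CANCERLIT: root, Health, Cancer, the database itself.
lambdas = [0.02, 0.13, 0.20, 0.65]
# Estimates of Pr[metastasis | ·] along the path; the database's own
# sample-based estimate is 0 because "metastasis" was never sampled.
metastasis = [0.002, 0.05, 0.092, 0.000]

print(round(shrinkage_estimate(metastasis, lambdas), 4))  # → 0.0249 (~2.5%)
```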
Shrinkage Weights and Summary

Shrinkage weights for CANCERLIT: λroot = 0.02, λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65

Word         Root    Health   Cancer   Old estimate   New (shrinkage-based) estimate
metastasis   0.2%    5%       9.2%     0%             2.5%
aids         0.8%    7%       2%       20%            14.3%
football     2%      1%       0%       0%             0.17%

Shrinkage:
- Increases the estimates of underestimated words (e.g., metastasis)
- Decreases the estimates of overestimated words (e.g., aids)
- ...but it also introduces spurious words, with small probabilities (e.g., football)
Is Shrinkage Always Necessary?
- Shrinkage is used to reduce the uncertainty (variance) of the estimates
- Small samples of large databases → high variance:
  in the sample, 10 out of 100 documents contain "metastasis"; in the database, how many out of 10,000 documents?
- Small samples of small databases → small variance:
  in the sample, 10 out of 100 documents contain "metastasis"; in the database, how many out of 200 documents?
- Shrinkage is less useful (or even harmful) when the uncertainty is low
Adaptive Application of Shrinkage
- Database selection algorithms assign a score to each database for each query
- When the word-frequency estimates are uncertain, the assigned score has high variance, and shrinkage improves the score estimate (unreliable score estimate: use shrinkage)
- When the word-frequency estimates are reliable, the assigned score has small variance, and shrinkage is unnecessary (reliable score estimate: shrinkage might hurt)

Solution: use shrinkage adaptively, in a query- and database-specific manner
Searching Algorithm

One-time process:
1. Classify the databases and extract document samples
2. Adjust the frequencies in the samples

For every query:
- For each database D:
  - Assign a score to D (using its extracted content summary)
  - Examine the uncertainty of the score
  - If the uncertainty is high, apply shrinkage and assign a new score; otherwise keep the existing score
- Query only the top-K scoring databases
Results: Database Selection

Metric: R(K) = X / Y, where
- X = number of relevant documents in the selected K databases
- Y = number of relevant documents in the best K databases

(results reported for CORI, a state-of-the-art database selection algorithm, with stemming, over the TREC6 testbed)
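The R(K) metric is simple to compute once the per-database relevant-document counts are known. A hypothetical sketch (the database names and counts below are invented for illustration):

```python
# R(K) = X / Y: relevant docs reachable through the K selected databases,
# divided by relevant docs in the best possible choice of K databases.

def r_at_k(relevant_counts, selected, k):
    """relevant_counts: db -> # relevant docs; selected: ranked list of dbs."""
    x = sum(relevant_counts[db] for db in selected[:k])
    y = sum(sorted(relevant_counts.values(), reverse=True)[:k])
    return x / y if y else 0.0

counts = {"PubMed": 40, "USPTO": 0, "NYTimes": 10, "LoC": 5}
print(r_at_k(counts, ["PubMed", "USPTO"], 2))  # → 0.8; best pick was PubMed+NYTimes
```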
Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- Managing Changes in Hidden-Web Databases
The Never-update Policy
- Naive practice: construct the summary once and never update it
- The extracted (old) summary may:
  - Miss new words (from new documents)
  - Contain obsolete words (from deleted documents)
  - Provide inaccurate frequency estimates

Example from the NY Times archive:

Word       #Docs (Oct 29, 2004)   #Docs (Mar 29, 2005)
tsunami    0                      250
recount    2,302                  0
grokster   2                      78
Updating Content Summaries: Questions [ICDE 2005]
- Do content summaries change over time?
- Which database properties affect the rate of change?
- How should updates be scheduled with constrained resources?
Data for our Study: 152 Web Databases
- Study period: Oct 2002 - Oct 2003
- 52 weekly snapshots of each database
- Approximately 5 million pages in each snapshot
- 65 GB per snapshot (3.3 TB in total)

Differences between snapshots:
- After 1 week: 5% new words, 5% of the old words disappeared
- After 20 weeks: 20% new words, 20% of the old words disappeared
Survival Analysis

A collection of statistical techniques for predicting "the time until an event occurs":
- Initially used to measure the survival time of patients under different treatments (hence the name)
- Also used to measure the effect of different parameters (e.g., weight, race) on survival time

We want to predict the "time until the next update" and find the database properties that affect this time.
Survival Analysis for Summary Updates
- "Survival time of a summary": the time until the current database summary is "sufficiently different" from the old one (i.e., an update is required)
- The old summary is considered changed at time t if KL divergence(current, old) > τ, where τ is a change-sensitivity threshold
- Survival analysis estimates the probability that a database summary changes within time t
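The change test above can be sketched directly. The smoothing of unseen words below is my own simplification (the talk does not say how zero counts are handled), and the word counts are invented:

```python
# Flag a summary update when the KL divergence between the current and old
# word distributions exceeds the change-sensitivity threshold τ.
import math

def kl_divergence(current, old, eps=1e-6):
    """current/old: word -> count. eps smooths words absent from one side."""
    vocab = set(current) | set(old)
    p = {w: current.get(w, 0) + eps for w in vocab}
    q = {w: old.get(w, 0) + eps for w in vocab}
    zp, zq = sum(p.values()), sum(q.values())
    return sum((p[w] / zp) * math.log((p[w] / zp) / (q[w] / zq)) for w in vocab)

old = {"recount": 2302, "grokster": 2}        # summary at time t0
current = {"tsunami": 250, "grokster": 78}    # summary at time t1
tau = 0.1                                     # change-sensitivity threshold
print(kl_divergence(current, old) > tau)      # → True: update needed
```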
Survival Times and "Incomplete" Data

(diagram: per-database timelines of change events over the 52-week study; observations still unchanged at week 52, the end of the study, are "censored")

- Many observations are "incomplete" (aka "censored")
- Censored data still give partial information: the database did not change during the observed period
Using “Censored” Data S(t), best fit, using censored data X X X S(t), bestcensored fit, using data censored data “as-is” S(t), best fit, ignoring n By ignoring censored cases we get (under) estimates perform more update operations than needed n By using censored cases “as-is” we get (again) underestimates n Survival analysis “extends” the lifetime of “censored” cases 12/19/2021 Panos Ipeirotis - New York University - IOMS Department 48
Database Properties and Survival Times

For our analysis, we use Cox proportional hazards regression, which:
- Uses "censored" data effectively (i.e., cases where the database did not change within time T)
- Derives the effect of database properties on the rate of change (e.g., "if you double the size of a database, it changes twice as fast")
- Makes no assumptions about the form of the survival function
Baseline Survival Functions by Domain

Effect of the top-level domain:
- GOV changes more slowly than any other domain
- EDU changes fast in the short term, but more slowly in the long term
- COM and other commercial sites change faster than the rest
Results of the Cox PH Analysis
- The Cox PH analysis gives a formula for predicting the time between updates for any database
- The rate of change depends on: the domain, the database size, the history of change, and the threshold τ
- Knowing the time between updates lets us schedule update operations better!
Scheduling Updates

Database         Rate of change λ   Average time between updates:
                                    10 weeks           40 weeks
Tom's Hardware   0.088              5 weeks            46 weeks
USPS             0.023              12 weeks           34 weeks

- With plentiful resources, we update sites according to their rate of change
- When resources are constrained, we update the sites that change "too frequently" less often
Classification & Search: Overall Contributions
- Support for browsing and searching Hidden-Web databases
- No need for cooperation: the techniques work with autonomous Hidden-Web databases
- Scalable: they work with a large number of databases
- Not restricted to "Hidden"-Web databases: they work with any searchable text database

The classification and content-summary extraction components are implemented and available for download at http://sdarts.columbia.edu
Current Work: Integrated Access to Hidden-Web Databases

Query: [good drama movies playing in newark tomorrow]

(screenshot: the current top Google result for this query, as of Oct 2nd, 2005)
Future Work: Integrated Access to Hidden-Web Databases

Query: [good drama movies playing in newark tomorrow] → query review databases, movie databases, and ticket databases

All the needed information is already available on the Web:
- Review databases: Rotten Tomatoes, NY Times, TONY, ...
- Movie databases: All Movie Guide, IMDB
- Ticket databases: Moviefone, Fandango, ...
Future Work: Integrated Access to Hidden-Web Databases

Challenges:
- Short term:
  - Learn to interface with different databases
  - Adapt database selection algorithms
- Long term:
  - Understand the semantics of the query
  - Extract "query plans" and optimize them for distributed execution
  - Personalization
  - Security and privacy
Thank you!
Other Work: Approximate Text Matching [VLDB 2001, WWW 2003]

Matching similar strings within a relational DBMS is important: that is where the data resides.

Service A                Service B
Jenny Stamatopoulou      Stamatopulou, Jenny
Panos Ipirotis           Panos Ipeirotis
John Paul McDougal       John P. McDougal
Jonh Smith               John Smith
Aldridge Rodriguez       Al Dridge Rodriguez

Exact joins are not enough: typing mistakes, abbreviations, and different conventions get in the way.

Introduced algorithms for mapping approximate text joins into SQL:
1. No need to import/export data
2. Provides a crucial building block for data-cleaning applications
3. Identifies many interesting matches

Joint work with Divesh Srivastava, Nick Koudas (AT&T Labs-Research), and others.
No Good Category for a Database
- A general "problem" with supervised learning (example: English vs. Chinese databases)
- We devised a technique to analyze whether a given database can be handled:
  1. Find candidate text fields
  2. Send top-level queries
  3. Examine the results and construct a similarity matrix
  4. If the "matrix rank" is small, many "similar" pages were returned, so:
     - the Web form is not a search interface, or
     - the text field is not a "keyword" field, or
     - the database is in a different language, or
     - the database is on an "unknown" topic
Database not Category-Focused

Extract one content summary per topic:
- Focused queries retrieve documents about a known topic
- Each database is represented multiple times in the hierarchy
Near-Future Work: Definition and Analysis of Query-based Algorithms [WebDB 2003]
- Currently, query-based algorithms are evaluated only empirically
- It is possible to model the querying process using random graph theory and:
  - Analyze the properties of the algorithms thoroughly
  - Understand better why, when, and how the algorithms work
- Interested in exploring similar directions:
  - Adapt hyperlink-based ranking algorithms
  - Use results from graph theory to design sampling algorithms
Crawling- vs. Query-based Classification for CNN Sports [IEEE DEB, March 2002]

Efficiency statistics:

                 Time              Files / Queries    Size
Crawling-based   1,325 min         270,202 files      8 GB
Query-based      2 min (-99.8%)    112 queries        357 KB (-99.9%)

Accuracy: crawling-based classification classifies CNN Sports correctly only after downloading 70% of its documents.
Real Confusion Matrix for the Top Node of the Hierarchy

             Health   Sports   Science   Computers   Arts
Health       0.753    0.018    0.124     0.021       0.017
Sports       0.006    0.171    0.021     0.016       0.064
Science      0.016    0.024    0.255     0.047       0.018
Computers    0.004    0.042    0.080     0.610       0.031
Arts         0.004    0.027    0.031     0.298
Adjusting Document Frequencies [VLDB 2002]

Zipf's law empirically connects word frequency f and rank r:

  f = A · (r + B)^c

- We know the document frequency and rank r of the words in the sample
- We know the real document frequency f of some words, from one-word queries
- We use curve-fitting to estimate the absolute database frequency of all words in the sample

(plot: frequency vs. rank of the sampled words; the known database frequencies of a few probe words anchor the fitted curve, which then yields estimated database frequencies for the rest)
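The curve-fitting step can be sketched as a least-squares fit in log space. For simplicity this sketch fixes B = 0, so it fits log f = log A + c·log r rather than the full three-parameter form used in the talk, and the frequencies below are synthetic:

```python
# Fit Zipf's law f = A·r^c to the few words whose real database frequencies
# are known from one-word query probes, then estimate the database frequency
# of every sampled word from its rank in the sample.
import numpy as np

known_ranks = np.array([1.0, 12.0, 78.0])            # ranks in the sample
known_freqs = np.array([121491.0, 9800.0, 1400.0])   # real db frequencies (synthetic)

# Linear least squares on log f = log A + c·log r.
c, log_a = np.polyfit(np.log(known_ranks), np.log(known_freqs), 1)
A = np.exp(log_a)

def estimated_frequency(rank):
    """Estimated absolute database frequency of the word at this sample rank."""
    return A * rank ** c

print(round(estimated_frequency(40)))  # estimate for the rank-40 sampled word
```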
Measuring Changes over Time
- Recall: how many words of the current summary are also in the old (extracted) summary?
  - Shows how well the old summaries cover the current (unknown) vocabulary; higher values are better
- Precision: how many words of the old (extracted) summary are still in the current summary?
  - Shows how many obsolete words exist in the old summaries; higher values are better

(results shown for complete summaries; similar for approximate ones)
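Treating each summary as a set of words, the two measures are plain set overlaps. A small sketch with invented vocabularies:

```python
# Vocabulary recall and precision between the current (true) summary and
# the old extracted summary, as defined on the slide.

def vocab_recall(current, old):
    """Fraction of the current vocabulary still covered by the old summary."""
    return len(current & old) / len(current)

def vocab_precision(current, old):
    """Fraction of the old summary that is still in the current vocabulary."""
    return len(current & old) / len(old)

old = {"recount", "grokster", "election"}
current = {"tsunami", "grokster", "election"}
print(vocab_recall(current, old))     # → 0.666...: "tsunami" is missed
print(vocab_precision(current, old))  # → 0.666...: "recount" is obsolete
```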
Modeling Goals
- Goal: estimate a database-specific survival-time distribution
- The exponential distribution S(t) = exp(-λt) is common for survival times:
  - λ captures the rate of change
  - We need to estimate λ for each database
  - Preferably, we infer λ from database properties (with no "training")
- An intuitive (but wrong) approach is data + multiple regression:
  - The study contains a large number of "incomplete" observations
  - The target variable S(t) is typically not Gaussian
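The exponential model above is easy to put in code. A sketch using the λ values the talk reports for two monitored databases (reading S(t) as "probability of no change after t weeks" follows the slide; everything else here is illustration):

```python
# Exponential survival model S(t) = exp(-λt): the probability that a database
# summary has NOT changed after t weeks, for a database with change rate λ.
import math

def survival(lam, t):
    return math.exp(-lam * t)

# λ values reported in the talk for two monitored databases.
for name, lam in [("Tom's Hardware", 0.088), ("USPS", 0.023)]:
    print(f"{name}: S(10 weeks) = {survival(lam, 10):.2f}")
```

A faster-changing database (larger λ) has a lower probability of surviving unchanged over the same horizon, which is what drives the update schedules shown earlier.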
Other Experiments [ACM TOIS 2003; IEEE Data Engineering Bulletin 2002]
- Effect of the choice of document classifier: RIPPER, C4.5, Naive Bayes, SVM
- Benefits of feature selection
- Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models
- Effect of the query-overlap elimination step
- Over crawlable databases, query-based classification is orders of magnitude faster than "brute-force" crawling-based classification