Giga-Mining
Corinna Cortes and Daryl Pregibon, AT&T Labs-Research
Presented by: Kevin R. Gee, 28 October 1999

Case Study
- Statistical modeling
- Processing of multi-GB databases
- Data warehousing
- Prediction and classification
- User interfaces

Three Goals
- Perform meaningful mining on multi-GB data every day
- Classify telephone numbers (TNs) as business or residential (pattern deviation, etc.)
- Maintain operational data for each phone number

Quantity of data
- 1997: 275 million phone calls per weekday, a total of 76 billion for the whole year
- 65 million unique TNs per weekday
- 350 million unique TNs over a 40-day period
- “Universe list”: the set of all TNs observed on the network, each with a 7-byte profile

Contents of each profile
- Inactivity: number of days since the TN was last used
- Minutes of use: average daily minutes the TN is observed on the network
- Frequency: estimated number of days between observations of the TN
- “Bizocity”: how business-like the TN's behavior is
- Stored for inbound/outbound and toll/toll-free traffic

Calculation of each variable
- Inactivity: set to 0 if the TN is observed, incremented by 1 (Inactivity++) if not
- Other variables are calculated via an exponentially weighted average:
  $X(\mathrm{TN})_{\text{new}} = \lambda\,X(\mathrm{TN})_{\text{today}} + (1-\lambda)\,X(\mathrm{TN})_{\text{old}}$, $0 < \lambda < 1$

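The two update rules on this slide can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the `Profile` class, the `update_profile` function, and the default value of `LAMBDA` are hypothetical names and values chosen for the example. It also assumes that a TN's smoothed values are left unchanged on days when the TN is not observed, in line with the later statement that inactive TNs are not updated.

```python
from dataclasses import dataclass

# Hypothetical aging factor; the slides only require 0 < lambda < 1.
LAMBDA = 0.15


@dataclass
class Profile:
    inactivity: int = 0   # days since the TN was last observed
    minutes: float = 0.0  # exponentially weighted daily minutes of use


def update_profile(profile, minutes_today, lam=LAMBDA):
    """Apply one day's update.

    minutes_today is None when the TN was not observed that day.
    """
    if minutes_today is None:
        # Not observed: only the inactivity counter changes (assumption).
        profile.inactivity += 1
        return profile
    # Observed: reset inactivity and blend today's value into the old one.
    profile.inactivity = 0
    profile.minutes = lam * minutes_today + (1 - lam) * profile.minutes
    return profile
```

Calling `update_profile(p, None)` ages a profile by one idle day, while `update_profile(p, 12.5)` blends 12.5 minutes of use into the running average.
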
Aging factor λ
- Provides an estimate that is a weighted sum of all previous daily values, with weights that decrease smoothly over time
- The most recent day's activity is weighted more heavily than activity from 2 weeks ago
- The weight of a call made k days ago is $w_k = \lambda (1-\lambda)^k$
- Old data is “aged out” as new data is “blended in”

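The weight formula follows from unrolling the daily update. The derivation below uses notation that is not on the slides: $x_t$ for the value computed from day $t$'s calls, $X_t$ for the stored smoothed value after day $t$, and $X_0$ for the initial value.

```latex
\begin{align*}
X_t &= \lambda x_t + (1-\lambda) X_{t-1} \\
    &= \lambda x_t + \lambda(1-\lambda) x_{t-1} + (1-\lambda)^2 X_{t-2} \\
    &= \lambda \sum_{k=0}^{t-1} (1-\lambda)^k x_{t-k} + (1-\lambda)^t X_0 , \\
w_k &= \lambda (1-\lambda)^k, \qquad
\sum_{k=0}^{\infty} w_k = \frac{\lambda}{1-(1-\lambda)} = 1 .
\end{align*}
```

The ratio of the weight for a day two weeks back to the weight for today is $w_{14}/w_0 = (1-\lambda)^{14}$. For an illustrative $\lambda = 0.1$ (a value not given in the slides) this is roughly 0.23.
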
“Bizocity”
- Concerns whether a TN is residential or business
- Residences and businesses are handled differently for customer care, billing, collections, fraud detection, etc.

“Bizocity” (continued)
- AT&T has confirmed residential/business status for 30% of the 350 million TNs
- The data is incomplete because of a lack of communication with local companies, additional lines, and out-of-date information
- A behavioral estimate is generated by observing the behavior of all 350 million TNs, computing a bizocity score, and combining it with previous days' scores

Generating “Bizocity”
- When a call completes, data such as originating TN, dialed TN, connect time, and call duration are recorded (note that callers are not identified, only phone numbers)
- TNs with known business/residential status are flagged, and training sets are generated from them
- Noise and outliers are usually washed out by the sheer volume of data

Generating “Bizocity”: examples
- Long calls originating at night are usually residential, not business
- Residential calls peak in the evening; business calls peak between 9 am and 5 pm
- Business calls are generally shorter, go to other businesses, or go to 800 services

Processed every 24 hours
- Provides better aggregate data for each TN
- Reduces I/O by 75%
- All call details must be stored and sorted
- Each call is reduced to a 32-byte binary record, resulting in 8 GB daily
- Sorting takes 30 minutes (3 GB RAM, 1 processor)

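As a rough check on the daily volume, the snippet below multiplies the 32-byte record size by the 275 million calls per weekday quoted earlier in the deck. The constant and function names are made up for this example. The result is about 8.8 decimal GB, or about 8.2 GiB, which is consistent with the rounded 8 GB figure on the slide.

```python
CALLS_PER_WEEKDAY = 275_000_000  # from the "Quantity of data" slide
RECORD_BYTES = 32                # size of one reduced call record


def daily_record_bytes(calls=CALLS_PER_WEEKDAY, record_bytes=RECORD_BYTES):
    """Total size in bytes of one weekday's call records."""
    return calls * record_bytes


if __name__ == "__main__":
    total = daily_record_bytes()
    print(f"{total / 10**9:.1f} GB (decimal)")
    print(f"{total / 2**30:.1f} GiB (binary)")
```
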
Processing (continued)
- A 4-D data cube is generated
- Dimensions are day-of-week, time-of-day, duration, and biz/res/800 status (7 x 6 x 5 x 3)
- Previously developed logistic regression models score each TN from its profile to estimate “Bizocity”
- $\mathrm{Biz}(\mathrm{TN})_{\text{new}} = \lambda\,\mathrm{Biz}(\mathrm{TN})_{\text{today}} + (1-\lambda)\,\mathrm{Biz}(\mathrm{TN})_{\text{old}}$, $0 < \lambda < 1$

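A minimal sketch of how one call might be binned into a 7 x 6 x 5 x 3 cube is shown below. The slides give only the dimension sizes, so the bin boundaries here are assumptions: six 4-hour time-of-day blocks, five duration bins split at 1, 5, 15, and 60 minutes, and a status dimension taken to describe the dialed TN. All names (`new_cube`, `cell_index`, `add_call`) are hypothetical.

```python
from datetime import datetime

N_DOW, N_TOD, N_DUR, N_STATUS = 7, 6, 5, 3

# Assumed encoding of the biz/res/800 dimension.
STATUS_INDEX = {"biz": 0, "res": 1, "800": 2}

# Hypothetical duration bin edges in minutes (4 edges give 5 bins).
DURATION_EDGES_MIN = (1, 5, 15, 60)


def new_cube():
    """Return an empty cube stored as a flat list of 7*6*5*3 = 630 counts."""
    return [0] * (N_DOW * N_TOD * N_DUR * N_STATUS)


def cell_index(connect_time, duration_min, status):
    """Map one call to its flat cell index in the cube."""
    dow = connect_time.weekday()   # 0 = Monday ... 6 = Sunday
    tod = connect_time.hour // 4   # six 4-hour blocks (assumption)
    dur = sum(duration_min >= edge for edge in DURATION_EDGES_MIN)
    st = STATUS_INDEX[status]
    return ((dow * N_TOD + tod) * N_DUR + dur) * N_STATUS + st


def add_call(cube, connect_time, duration_min, status):
    """Count one call in the cube."""
    cube[cell_index(connect_time, duration_min, status)] += 1
```

For example, `add_call(cube, datetime(1999, 10, 28, 14, 30), 3.0, "800")` increments the cell for a Thursday-afternoon, 1-to-5-minute call to an 800 number. A per-TN cube of this kind could then serve as the input to a scoring model such as the logistic regression mentioned on the slide.
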
Processing (continued)
- The training set is used to classify TNs with unknown status based on probabilities
- Inactive TNs are not updated
- “Bizocity” scores for unknown TNs are generated using these probabilities

- Accuracy of status prediction is 75%
- Failures are due to incorrectly provided status or shifting status (e.g. home businesses, cell phones, etc.)

Data Structures
- Exploit the “exchange” concept (the first 6 digits of a TN form an exchange)
- Only about 150,000 of the 1 million possible exchanges are in use
- All 10,000 TNs for each exchange are stored sequentially, whether used or not
- Each data structure is 2 GB per variable (the lower bound is 1.5 GB)

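The exchange layout can be sketched as one fixed-size block of 10,000 slots per exchange in use, indexed by the last 4 digits of the TN. The sketch assumes one byte per TN per variable, which is consistent with the 1.5 GB lower bound (150,000 exchanges x 10,000 TNs x 1 byte) but is not stated explicitly in the slides. The class `ExchangeStore` and its methods are hypothetical names for this illustration.

```python
TNS_PER_EXCHANGE = 10_000


class ExchangeStore:
    """One byte per TN for a single profile variable, grouped by exchange."""

    def __init__(self):
        # exchange (first 6 digits as int) -> block of 10,000 one-byte slots
        self._blocks = {}

    @staticmethod
    def split(tn):
        """Split a 10-digit TN into (exchange, line number)."""
        digits = "".join(ch for ch in tn if ch.isdigit())
        if len(digits) != 10:
            raise ValueError(f"expected 10 digits, got {tn!r}")
        return int(digits[:6]), int(digits[6:])

    def put(self, tn, value):
        """Store a value in 0..255 for the TN, allocating its exchange block."""
        exchange, line = self.split(tn)
        block = self._blocks.setdefault(exchange, bytearray(TNS_PER_EXCHANGE))
        block[line] = value

    def get(self, tn):
        """Return the stored value, or 0 if the exchange has never been seen."""
        exchange, line = self.split(tn)
        block = self._blocks.get(exchange)
        return 0 if block is None else block[line]


def lower_bound_bytes(exchanges_in_use=150_000):
    """Bytes needed if only in-use exchanges are allocated, one byte per TN."""
    return exchanges_in_use * TNS_PER_EXCHANGE
```

Storing every slot of an exchange, used or not, wastes some space but turns a lookup into simple index arithmetic with no per-TN key.
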
Interface
- Variety of visualization tools (start at the top, drill down)
- Web interface with password protection
- Images are computed on the fly
- C code directly computes the images in GIF format

Toll Fraud Detection
- Same methodology, but event-driven
- Only about 15 million TNs have to be tracked
- Profiles are about 512 bytes each (7.5 GB)