Mtech Projects 2002, Sunita Sarawagi


Sequence mining
  • Several real-life mining applications on sequence data
  • Classical applications
      - Speech, language, and handwriting are all complex sequences
  • Newer applications
      - Bio-informatics: DNA and proteins
      - Telecommunication: network alarms, network packet data
      - Retail data mining: customer behavior

Sequence mining: problems
  • Existing work is scattered and application-specific
  • Field in dire need of consolidated algorithms and software solutions
  • More technical details can be discussed after we finish this topic in class on March 3

Sensor databases and mining
  • Several distributed sensors that push data to centralized database servers
  • Example: Automatic Vehicle Location systems consisting of sensors at bus stops, with an entry in the server each time a bus passes a stop
  • Goal: Build a DBMS for managing this data and supporting queries like “When is the next bus to X going to arrive?”

Problems
  • Cross-disciplinary, covering several areas
  • A mining sub-problem: predicting arrival time based on
      - Previous arrival patterns of the same bus
      - Traffic conditions derived from other buses with common routes
  • A database query problem:
      - Approximate search based on spoken queries
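To make the mining sub-problem concrete, here is a minimal sketch of a naive baseline, not a method proposed in these slides. It assumes travel times in minutes over one route segment; the function name `predict_arrival` and its parameters are hypothetical, chosen only for illustration. It combines the two signals listed above: the bus's own past travel times, and a congestion factor estimated from other buses that recently covered the same segment.

```python
from statistics import mean


def predict_arrival(departure, history, recent_other_buses):
    """Estimate when a bus reaches the next stop (minutes since midnight).

    departure: time the bus left the previous stop (hypothetical input).
    history: past travel times of this bus over the same segment, in minutes.
    recent_other_buses: (observed, typical) travel-time pairs reported today
        by other buses sharing the segment; may be empty.
    """
    # Signal 1: previous arrival patterns of the same bus.
    baseline = mean(history)

    # Signal 2: traffic conditions derived from other buses on common routes,
    # expressed as a ratio of observed to typical travel time.
    if recent_other_buses:
        congestion = mean(obs / typ for obs, typ in recent_other_buses)
    else:
        congestion = 1.0  # no evidence of unusual traffic

    return departure + baseline * congestion


if __name__ == "__main__":
    # Bus left at 10:00 (600 minutes); other buses are running slow today.
    print(predict_arrival(600, [10, 12, 14], [(15, 10), (12, 10)]))
```

A real project would presumably replace both averages with learned models, but the sketch shows how the two data sources feed one prediction.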

Multi-relational data mining
  • Existing mining software assumes data in a single relation
  • Real-life data spans multiple relations
  • Existing tools rely on manual preprocessing before mining begins; this is time-consuming and inaccurate
  • Design and implement mining algorithms for multi-relational data

Who should apply
  • Are fascinated by the areas of data mining, databases, and machine learning
  • Want to get a flavor of cutting-edge research
  • Enjoyed the courses
  • Have a knack for algorithm design and implementation
  • Are very software savvy
  • Want to stretch their learning and knowledge rather than slide through with an “easy” project

Possible achievements
  • Understand one topic deeply, learn to innovate
  • Produce software that several people use
  • Write papers in really top-quality international conferences
  • Demo the software in leading international forums

Industries in the area
  • IBM IRL
  • Strand Genomics
  • GE Capital
  • TCS bio-informatics
  • PSPL
  • Startups like Vistaar
  • Outside India: several

Sample outcomes from some previous MTPs

Automatic segmentation of free text records, 2000 Batch
  • An HMM-based address segmenter
  • Software licensed by a data cleaning company
  • Paper in one of the two premium database conferences
      - ACM SIG on Management of Data (SIGMOD) 2001, Santa Barbara, USA

ICUBE – Intelligent Rollups
  • MTP work integrated in ICube, demoed at SIGMOD 2000 held in Texas, USA
  • ICube software adopted by a startup
  • Paper at the other premium database conference, VLDB 2001 held in Rome, Italy

Data deduplication using active learning
  • Software likely to be transferred to National Informatics Corporation, Pune
  • Practical application of an interesting idea from machine learning
  • Paper at KDD 2002 conference held in Canada
  • Demos at VLDB 2002 Hong Kong, ICDE 2003 Bangalore