YahooDAIS Seminar CS 591 DAI Orientation Cheng Xiang

  • Slides: 43
Download presentation
Yahoo!-DAIS Seminar (CS 591 DAI) Orientation Cheng. Xiang (“Cheng”) Zhai Department of Computer Science

Yahoo!-DAIS Seminar (CS 591 DAI) Orientation Cheng. Xiang (“Cheng”) Zhai Department of Computer Science University of Illinois at Urbana-Champaign

Basic Information about CS 591 DAI (=Yahoo!-DAIS seminar) n n n Meets at 4

Basic Information about CS 591 DAI (=Yahoo!-DAIS seminar) n n n Meets at 4 -5 pm, Tuesdays, 0216 SC 1 Credit Hour Miss at most two talks At each meeting, you can expect q q n n n An interesting research talk Opportunity for research discussion An attendance sheet for collecting signatures And snacks (thanks to Yahoo!) Seminar coordinators: Yucheng Chen, Hao Luo Website: http: //dais. cs. uiuc. edu/seminars. html (under construction) Mailing list: [email protected] uiuc. edu (make sure you subscribe to it)

The DAIS Group: Data & Information Systems + many grad/undergrad students + few postdocs/visitors

The DAIS Group: Data & Information Systems + many grad/undergrad students + few postdocs/visitors Prof. Emerita Kevin Chang Jiawei Han Aditya Parameswaran Saurabh Sinha Tandy Warnow (+ Bioenginering) Jian Peng Marianne Winslett Hari Sundaram (+Advertising) Bioinformatics Cheng Zhai

Landscape of DAIS Research Intelligence Intelligent information agent Decision/Task support Information analysis & Data

Landscape of DAIS Research Intelligence Intelligent information agent Decision/Task support Information analysis & Data mining Data/Information Access Recommendation Search Browsing Storage Gigabytes Terabytes. Petabytes Productivity (Web, email, Decision …) making (government, business, personal) Education Health/Medical/Biolog y Scalability Application impact 5

DAIS & Related Areas in CS Intelligence Artificial Intelligence Statistic s Intelligent information agent

DAIS & Related Areas in CS Intelligence Artificial Intelligence Statistic s Intelligent information agent Decision/Task support Information analysis & Data mining Data/Information Access Recommendation Search Browsing Parallel Storage Computing Gigabytes Terabytes. Petabytes Theory & Algorithms Scalability Productivity (Web, email, Systems & Decision making (government, business, personal) …) Education Networking Health/Medical/Biolog y Application impact Human-Computer Interaction, Graphics 6

Overview of DAIS Faculty Research

Overview of DAIS Faculty Research

Kevin C. Chang: Bridging Structured and Unstructured Data How to bring structured/semantic-rich access to

Kevin C. Chang: Bridging Structured and Unstructured Data How to bring structured/semantic-rich access to the myriad and massive unstructured data which accounts for most of the world's information? q q q How to search the Web that Google does not see? How to reach into and across pages that Google only takes you to? How to listen to buzz of the world and make sense?

Project 1. Meta. Querier: Exploring and Integrating the Deep Web Meta. Explorer • source

Project 1. Meta. Querier: Exploring and Integrating the Deep Web Meta. Explorer • source discovery • source modeling • source indexing FIND sources db of dbs Meta. Integrator • source selection • schema integration • query mediation QUERY sources unified query interface

Project 2. WISDM: Data-aware Search over the Web Data Aware Search 10

Project 2. WISDM: Data-aware Search over the Web Data Aware Search 10

Demo: Entity Search-- “university of california #location”

Demo: Entity Search-- “university of california #location”

Project Big. Social: Social Data Analytics

Project Big. Social: Social Data Analytics

Aditya Parameswaran n n Started August 2014 i. stanford. edu/~adityagp n Interests: Data Management,

Aditya Parameswaran n n Started August 2014 i. stanford. edu/~adityagp n Interests: Data Management, Mining and Algorithms n Specific topics of interest q q Interactive Data Analytics Crowd-powered Analytics Approximate Analytics Visual Analytics and Data Mining

Research Style Build Real Data Analytics Tools / Systems Data Science and Applied ML

Research Style Build Real Data Analytics Tools / Systems Data Science and Applied ML Data Systems Data Mining Theory Design Algorithms with Guarantees

Research Goal: Simplifying Data Analytics “…. (in the next few years) we project a

Research Goal: Simplifying Data Analytics “…. (in the next few years) we project a need for 1. 5 million additional analysts in the United States who can analyze data effectively…“, -- Mc. Kinsey Big Data Study, 2012 How do we make it easier for novice data analysts to get insights from data?

Simplifying Data Analytics: Four Aspects Unstructured Crowd Powered Analytics Querying Visualizations Interactive Analytics Visual

Simplifying Data Analytics: Four Aspects Unstructured Crowd Powered Analytics Querying Visualizations Interactive Analytics Visual Analytics Scale Approximate Analytics

Example Projects • Crowd-powered search • Crowd-powered data extraction & cleaning • Interactive query

Example Projects • Crowd-powered search • Crowd-powered data extraction & cleaning • Interactive query synthesis • Speculative querying and caching • Recommending visualizations automatically • Approximate visualizations with guarantees • Fundamental principles: cost, latency, error • Browsing-based query processing Crowd Query Visual Approx

Hari Sundaram n Prof. Hari Sundaram used a separate file for his presentation, which

Hari Sundaram n Prof. Hari Sundaram used a separate file for his presentation, which is available at; http: //times. cs. uiuc. edu/czhai/pub/DAIS-fall 14 -Hari. pdf

Jiawei Han: Data Mining & Information Networks How to perform data mining effectively in

Jiawei Han: Data Mining & Information Networks How to perform data mining effectively in massive data and in heterogeneous information networks? How to mine structures and construct networks from unstructured real-world data? Specific research subareas: q Effective methods for mining heterogeneous networks q Construction of heterogeneous networks from unstructured data q Multi-dimensional unstructured data summarization & OLAP q Truth discovery and outlier mining in networked data q Spatiotemporal and cyber-physical data mining (e. g. , mobile objects, sensor/mobile data mining)

Recent Research Focus on Data Mining n Network Construction, Search and Mining on Real

Recent Research Focus on Data Mining n Network Construction, Search and Mining on Real datasets q News Network: 10 M news articles + news of last 70 years q Computer Science Research Network: DBLP + citations + abstracts + Web pages of researcher + other related web pages q Tweet network and other social media network q Bio-Medical Research Network: Pub. Med and other medical sources q Networkfication of Knowledge-Bases: Wikipedia, DBPedia, Freebase q Cyber-Physical Networks (internet of things) n Other frontiers: trajectory mining, truth finding, anomaly, … Two ARL-funded projects in 2014 -2016 for Network Science Collaborative Technology Alliance (NSCTA) n 20

Constructing Unified, Structured Knowledge Networks Current State-of-the-Art § Natural language processing § Graph-based mining

Constructing Unified, Structured Knowledge Networks Current State-of-the-Art § Natural language processing § Graph-based mining § Theory of planned behavior Near perfect reliable network construction through progressive source processing and network refinement Research Topics/Technical Approach Automated Construction of adaptable knowledge networks § Preliminary network construction via NLP and social network techniques § Exploitation of links for network refinement § Streaming updates Multi-dimension truth analysis Team: H. Ji (RPI) (lead), J. Han (UIUC), G. Cao (PSU), C. § Credibility analysis by processing the interactions of the Voss (ARL), W. Wallace (RPI) physical and social/cognitive states of the social Rotations: Ji: post-doc (Hong): 3 mo. at ARL; Han: network and its interactions with the information student (Liu/Brova): 3 mo. at ARL; Wallace: student (Yulia): 3 network mo. at ARL; each PI: visits to ARL totaling 1 wk or more. RPI Information processing & social/cognitive modeling students visit UIUC for one week each. § Build socio-cognitive models to predict human behavior Army Needs and Benefits § User-oriented and constraint-aware information § Exploitation of unstructured data for improved processing situational analysis § Predictive tools to understand adversarial intent Construction of latent social and information networks require understanding connecting the dots in documents by linking entities and understanding human behavior 21

Distributed, User-Oriented Multi-Scale Network Summarization and OLAP Current State-of-the-Art § Summarization on relational and

Distributed, User-Oriented Multi-Scale Network Summarization and OLAP Current State-of-the-Art § Summarization on relational and text data § Social and cognitive computing § Online analytical processing on data cubes Support of distributed, multiscale, multi-genre network summarization, OLAP and situation analysis for diverse user groups Research Topics/Technical Approach Multi-scale network summarization & aggregation § Creation and enrichment of multi-dimension information from text and unstructured data § Network cube construction: conflict resolution, topical hierarchy generation, multidimensional indexing & selective, partial cube materialization User-oriented adaptation of network cube views Team: J. Han (UIUC) (lead), J. Hendler (RPI), § User- or social network-oriented OLAP T. Hanratty (ARL), H Ji (RPI), B. Welles (NEU) § Accommodate dynamic updates of underlying social Rotations: Han: student (Tao/Song): 3 mo. at ARL; Ji: and/or communication networks student (Zhang): 3 mo. at ARL; each PI: visits to ARL, total 1 Cost- and constraint-aware network cube wk; 3 univ. mutual-visits: PIs + students § Distributed, user- or community-oriented cube views Army Needs and Benefits § Cost-, constraint- and availability- aware drilling, search § Summarization/visualization matched to the user and analysis cognitive abilities for improved situational awareness Cognitive modeling, visualization, and supporting human § Flexible and efficient situation analysis for diverse groups decision making of users § Experimental platform: news, blogs, tweets, etc. Distributed, user-oriented, multi-scale summarization of networks to support online information processing and situation analysis for diverse groups of users 22

From Data Mining to Mining Info. Networks Han, Kamber and Pei, Data Mining, 3

From Data Mining to Mining Info. Networks Han, Kamber and Pei, Data Mining, 3 rd ed. 2011 Yu, Han and Faloutsos (eds. ), Link Mining, 2010 Sun and Han, Mining Heterogeneous Information Networks, 2012 23

Cheng. Xiang Zhai: Intelligent text information management & analysis How can we develop intelligent

Cheng. Xiang Zhai: Intelligent text information management & analysis How can we develop intelligent algorithms and systems to help people manage and exploit large amounts of text data (e. g. , Web pages, blog articles, news, email, literature…)? Two subtopics: q Information retrieval: how can we connect the right information with the right users at the right time with minimum or no user effort? q Text mining: How can we automatically discover useful knowledge from text? How can we mine text data together with non-textual data in an integrative manner? Applications: Web, biomedical, health, education, … Research methodology: q Emphasize general & principled solutions without manual effort q Mainly use statistical models, machine learning, and natural language processing techniques

Sample Project: Latent Aspect Rating Analysis How to infer aspect ratings? How to infer

Sample Project: Latent Aspect Rating Analysis How to infer aspect ratings? How to infer aspect weights? Value Location Service …. . Value Location Service

Solution: Latent Rating Regression Model Aspect Segmentation Reviews + overall ratings + Latent Rating

Solution: Latent Rating Regression Model Aspect Segmentation Reviews + overall ratings + Latent Rating Regression Aspect segments Term weights Aspect Rating Aspect Weight location: 1 amazing: 1 walk: 1 anywhere: 1 room: 1 nicely: 1 appointed: 1 comfortable: 1 nice: 1 accommodating: 1 smile: 1 friendliness: 1 attentiveness: 1 0. 0 0. 9 0. 1 0. 3 0. 1 0. 7 0. 1 0. 9 0. 6 0. 8 0. 7 0. 8 0. 9 Topic model for aspect discovery 1. 3 0. 2 1. 8 0. 2 3. 8 0. 6

Aspect-Based Opinion Summarization

Aspect-Based Opinion Summarization

Reviewer Behavior Analysis & Personalized Ranking of Entities People like expensive hotels because of

Reviewer Behavior Analysis & Personalized Ranking of Entities People like expensive hotels because of good service Query: 0. 9 value 0. 1 others Non-Personalized People like cheap hotels because of good value

Tandy Warnow

Tandy Warnow

Gene Tree Estimation: first align, then construct the tree S 1 S 2 S

Gene Tree Estimation: first align, then construct the tree S 1 S 2 S 3 S 4 = = AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC TCACGACA S 1 S 2 S 3 S 4 = = -AGGCTATCACCTGACCTCCA TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC----TCAC--GACCGACA S 2 Sounds easy, but every good approach is NP-hard, and statistical methods (based on stochastic models of evolution) are very slow. S 4 S 3 Accuracy is essential, datasets are big, and they are also messy. Species tree estimation is even harder, because gene trees can be different from the species tree!

Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI • Approx.

Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI • Approx. 50 species, whole genomes T. Warnow UT-Austin S. Mirarab Md. S. Bayzid UT-Austin Plus many other people… • 8000+ genes, UCEs Challenges: Maximum likelihood tree estimation on multi-million-site sequence alignments Massive gene tree incongruence My students and I developed a new technique (“Statistical Binning”) to enable a statistical estimation of the avian species tree, taking gene tree incongruence into account (both papers under review in Science)

1 kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia

1 kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci i. Plant T. Warnow, UT-Austin S. Mirarab, UT-Austin N. Nguyen, UT-Austin Md. S. Bayzid UT-Austin Plus many other people… Plant Tree of Life based on transcriptomes of ~1200 species More than 13, 000 gene families (most not single copy) Gene Tree Incongruence Challenges: Multiple sequence alignment of datasets with > 100, 000 sequences Gene tree incongruence My students and I developed ASTRAL – a technique to estimate species trees on large datasets (ECCB), and used it to analyze this dataset (under review in PNAS) UPP – new multiple sequence alignment method that can analyze up to 1, 000 sequences (in preparation)

Computer science and mathematics issues: Heuristics for NP-hard optimization problems Graph algorithms Statistical estimation

Computer science and mathematics issues: Heuristics for NP-hard optimization problems Graph algorithms Statistical estimation on messy data Mining sets of trees/alignments High Performance Computing Mathematical modelling Probabilistic analysis of algorithms Bioinformatics Problems Multiple Sequence Alignment Gene Tree Estimation Species tree estimation (when gene trees conflic Genome rearrangement phylogeny Phylogenetic network estimation Metagenomic data analysis Also: Computational Historical Linguistics

Jian Peng: Machine learning for computational biology Biological data Machine learning Knowledge

Jian Peng: Machine learning for computational biology Biological data Machine learning Knowledge

Biological data integration Disease-related genes Functional homologs Hypothesis Experiments

Biological data integration Disease-related genes Functional homologs Hypothesis Experiments

Biological data integration: network biology q Modeling information diffusion on biological networks q Integrating

Biological data integration: network biology q Modeling information diffusion on biological networks q Integrating networks from multiple species q Inference of gene function from network data

Other computational biology projects n Protein science q q q n Translational bioinformatics q

Other computational biology projects n Protein science q q q n Translational bioinformatics q q n Structure prediction Protein folding Viral proteins Drug discovery and optimization Drug repositioning Genomics q q Large-scale read mapping Algorithms for genome assembly

Machine learning: modeling complex data n Graphical models q q n Learning representations for

Machine learning: modeling complex data n Graphical models q q n Learning representations for heterogeneous data q n Latent variable models Efficient learning and inference algorithms Causal/correlation structures Applications to protein folding, gene expression analysis and biological network construction Low-dimensional embedding for network, text and molecular data sets Learning structured prediction with complex loss functions q Applications to biology, computer vision, speech recognition and natural language processing

Saurabh Sinha: Bioinformatics How is information about us encoded in our DNA ? How

Saurabh Sinha: Bioinformatics How is information about us encoded in our DNA ? How does this information evolve, giving rise to what Darwin called “endless forms most beautiful”? Research questions: q Gene regulation: How are genes turned on and off in precisely orchestrated ways? q Comparative genomics: What can we learn by comparing genomes of tens of different species? q Regulatory evolution: Can we build a mathematical model of evolution? q Genomics of behavior: How does DNA encode animal behavior ? http: //www. sinhalab. net/ 39

Genomics of behavior: honeybee • What causes older bees to be more aggressive than

Genomics of behavior: honeybee • What causes older bees to be more aggressive than younger ones? • What causes Africanized bees to be more aggressive than European ones? • What causes a bee to become aggressive if you annoy them? • DNA sequence analysis shows that origins of aggression are

Genomics of aging Find genes associated with aging, by searching the DNA sequence for

Genomics of aging Find genes associated with aging, by searching the DNA sequence for certain patterns. Knock down one such gene; old cells became young !

A complete bioinformatics pipeline to data …to analysis … From cells … to hypotheses

A complete bioinformatics pipeline to data …to analysis … From cells … to hypotheses & experiments

QUESTIONS? 43

QUESTIONS? 43