Big Data at BITEM Research Group: (Text|Web) Mining

Big Data at BITEM Research Group
• (Text|Web) Mining Research Group
  – patrick.ruch@hesge.ch, http://bitem.hesge.ch
• Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…
• Specialised in (semi|un)structured data
  – We like text, text and more text
  – Especially on the noisy/dirty Web
• Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…

Pharmacovigilance on Big Social Media Data
• Web sources: forums, RSS feeds, Twitter API
• NoSQL storage: CouchDB, with replication between CouchDB instances
• Cleaning and normalisation: 26’000 per day
• Indexing and search: SolrCloud
• DrugBank: 19’000 drug names checked every 10 minutes
• 7 million documents in 9 months
• Dynamic and real-time data analysis
  – Trends analysis
  – Correlation analysis
  – Novelty detection
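The cleaning, normalisation and drug-name checking steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the group's actual pipeline: the three-name `DRUG_NAMES` set is a hypothetical stand-in for the DrugBank list, the function names are invented for the example, and matching is limited to single-token names.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stand-in for the ~19'000 DrugBank names mentioned on the slide.
DRUG_NAMES = {"aspirin", "ibuprofen", "paracetamol"}


def normalise(text):
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def drug_mentions(text, names=DRUG_NAMES):
    """Return the sorted drug names found in a noisy post (single tokens only)."""
    tokens = set(normalise(text).split())
    return sorted(tokens & names)


def daily_counts(posts):
    """Count posts mentioning each drug per day; posts is an iterable of (day, text)."""
    counts = Counter()
    for day, text in posts:
        for drug in drug_mentions(text):
            counts[(day, drug)] += 1
    return counts
```

Per-day counts of this kind are one possible input for the trend, correlation and novelty analyses named on the slide; multi-word drug names would need phrase matching, which this sketch leaves out.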

Managing the data deluge for protein annotation
• Protein annotation by curators, based on the literature (GOA)
  – 23’000 annotated articles
  – Manual annotation planned for 2045! (Baumgartner et al.)
• Machine learning based on Information Retrieval methods
  – Big-scale multiclass, multilabel classifier: 40’000 concepts
  – Lazy learning!
• Assisting curators
• Macro reading of literature
• Profiling any textual content
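One common way to build a lazy, retrieval-based multilabel classifier is a k-nearest-neighbour vote over TF-IDF vectors: no model is trained, and the labels of the most similar annotated articles are reused at query time. The sketch below assumes that approach for illustration only; the slide does not say which method the group uses, and the class name, parameters and example labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class LazyKnnAnnotator:
    """Lazy learner: stores annotated documents, does all the work at query time."""

    def __init__(self, annotated_docs):
        # annotated_docs: list of (text, iterable_of_labels)
        self.docs = [(Counter(tokenize(t)), set(labels)) for t, labels in annotated_docs]
        df = Counter()
        for tf, _ in self.docs:
            df.update(tf.keys())
        n = len(self.docs)
        # Smoothed inverse document frequency for every known term.
        self.idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}

    def _vector(self, tf):
        # L2-normalised TF-IDF vector; unknown terms are dropped.
        vec = {t: c * self.idf[t] for t, c in tf.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def predict(self, text, k=3, top=2):
        """Rank labels by summed cosine similarity of the k nearest documents."""
        query = self._vector(Counter(tokenize(text)))
        scored = []
        for tf, labels in self.docs:
            doc = self._vector(tf)
            sim = sum(w * doc.get(t, 0.0) for t, w in query.items())
            scored.append((sim, labels))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        votes = defaultdict(float)
        for sim, labels in scored[:k]:
            for label in labels:
                votes[label] += sim
        ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
        return [label for label, score in ranked[:top] if score > 0]
```

At the scale on the slide (40’000 concepts, 23’000 annotated articles), the linear scan in `predict` would normally be replaced by a search-engine index, which fits the "based on Information Retrieval methods" description.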

Patent retrieval
• The real situation (0.5-1 TB) vs. experiments
  – Database: 13 million patents (real) / a sample of 1 million patents (experiments)
  – Extraction: 33 days / 2.5 days
  – XML patents: 0.221 TB / 17 GB
  – Normalization: 33 days / 2.5 days
  – XML patents + metadata: 0.234 TB / 18 GB
  – Indexing: 5 days / 10 hours
  – Index: 0.1 TB / 3 GB
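Most of the "real situation" figures are consistent with scaling the 1-million-patent sample linearly by 13. The short check below works through that arithmetic; linear scaling is an assumption made here, not something the slide states, and the index size does not follow it (3 GB × 13 is 39 GB, while the slide gives 0.1 TB).

```python
SAMPLE_PATENTS = 1_000_000
FULL_PATENTS = 13_000_000

# Figures measured on the sample, as given on the slide.
sample = {
    "extraction_days": 2.5,
    "normalization_days": 2.5,
    "indexing_hours": 10,
    "xml_gb": 17,
    "xml_with_metadata_gb": 18,
}


def extrapolate(measured, factor):
    """Scale every measured figure linearly (assumed, not confirmed by the source)."""
    return {name: value * factor for name, value in measured.items()}


full = extrapolate(sample, FULL_PATENTS / SAMPLE_PATENTS)

for name, value in full.items():
    print(f"{name}: {value:g}")
```

Under this assumption, extraction and normalization come to 32.5 days each, the XML corpus to 221 GB (234 GB with metadata), and indexing to 130 hours, or about 5.4 days.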