Multimedia search engine

Michal Krsek, UISK Charles University at Prague & CESNET
Ivan Doležal, CESNET
Michal Illich, Jyxo

Electronic Media
• TV & radio
• Organized in channels
• Zero democracy in programming (by channel management)
• Centralized production (big guys business)

Internet
• Not only web (audio/video and others)
  – remember archie.sura.net?
• IPTV / Live / Video on demand
• Navigation only via the web => not easy to find a specific program in A/V content

Search options I
• Voice recognition
  – Language identification
  – Accents
• Video recognition
  – Text interpretation (bush vs. Bush)
  – Low video quality

Search options II
• Indexing of web pages
  – Yahoo! does (google bomb target)
• Metadata
  – "Out-of-band metadata" (as in the library world)
  – Metadata in files (added during editing or encoding)

Project description
• Started in 2003 (oh yes, one year before Truveo)
• "Google for audio and video on Internet"
• No support from content owners
• Modular concept
• Start with .cz Internet

Technical description I
• Crawler
  – Crawls web and collects addresses (URL)
  – Exports URL of multimedia files
  – Software written by Jyxo (Linux console app)
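
The crawler's export step can be sketched as a filter over collected URLs. This is a minimal illustration, not the Jyxo software: the extension list and both function names are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of extensions treated as audio/video (not from the slides).
AV_EXTENSIONS = {".avi", ".mpg", ".mpeg", ".mp3", ".wmv", ".wma",
                 ".asf", ".mov", ".rm", ".ra", ".ogg"}


def is_multimedia_url(url: str) -> bool:
    """Return True if the URL path ends with a known A/V extension."""
    path = urlparse(url).path            # drops query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in AV_EXTENSIONS


def export_multimedia_urls(urls):
    """Keep only A/V URLs, dropping duplicates while preserving order."""
    seen = set()
    exported = []
    for url in urls:
        if is_multimedia_url(url) and url not in seen:
            seen.add(url)
            exported.append(url)
    return exported
```

Matching on the file extension is the simplest possible test; a real crawler might also need to look at content types or playlist files.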

Technical description II
• Distiller
  – Imports addresses of multimedia files
  – Distills metadata (and makes XML files)
  – Makes screenshots (if the file contains video)
  – C# software and mplayer (Windows apps)
  – Runs in distributed environment
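
The slides do not show the XML format the distiller produces. As a rough sketch in Python (the actual distiller is C# software), one metadata record could be serialized like this; the element and attribute names `media`, `field`, `url` and `name` are invented for the example.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(url, metadata):
    """Serialize one file's metadata as a small XML document (hypothetical schema)."""
    root = ET.Element("media", {"url": url})
    for key in sorted(metadata):         # stable field order
        field = ET.SubElement(root, "field", {"name": key})
        field.text = str(metadata[key])
    # ElementTree escapes characters such as "&" automatically.
    return ET.tostring(root, encoding="unicode")


record = metadata_to_xml(
    "http://example.cz/a.wmv",           # placeholder URL
    {"title": "News & weather", "duration": "95"},
)
```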

Technical description III
• Database
  – Imports XML metadata files to full text DB
  – Answers back-end queries for web queries
  – And other fulltext things (i.e. language)
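
The full text database itself is not described further. To illustrate the idea only, here is a toy in-memory inverted index that maps words from the metadata to URLs; the class and function names are assumptions, and a production engine would do far more (ranking, language handling, persistence).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class FullTextIndex:
    """Toy inverted index: token -> set of URLs whose metadata contains it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url, text):
        for token in tokenize(text):
            self._postings[token].add(url)

    def search(self, query):
        """Return URLs containing every token of the query (AND semantics)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        results = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            results &= self._postings.get(token, set())
        return results
```

Note that lowercasing every token makes "bush" and "Bush" indistinguishable, which is exactly the ambiguity raised under Search options I.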

[Diagram: processing pipeline]
• crawling
  – Crawls webpages
  – Gets addresses
  – Filters A/V addresses
• distillation
  – Gets metadata from multimedia files
• indexing, search
  – Holds fulltext database
  – Provides back end for queries

Distillation
• Process description
  – Get URL from DB
  – Get metadata from file available at URL
  – Get screenshots at 1, 30, 50 sec
  – Save metadata & screenshot
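
The screenshot step could be driven by building one mplayer command per offset. The slides do not give the actual command line, so the flag combination below (`-ss`, `-frames`, `-vo jpeg:outdir=...`, `-nosound`) is an assumption based on a commonly used mplayer frame-grab invocation; the function name and the output directory layout are likewise hypothetical. The sketch only builds the argument lists and does not run mplayer.

```python
# Offsets in seconds, as listed on the slide.
SCREENSHOT_OFFSETS = (1, 30, 50)


def screenshot_commands(url, out_dir):
    """Build one mplayer argument list per screenshot offset (assumed flags)."""
    commands = []
    for offset in SCREENSHOT_OFFSETS:
        commands.append([
            "mplayer",
            "-nosound",                   # audio is not needed for a frame grab
            "-ss", str(offset),           # seek to the offset in seconds
            "-frames", "1",               # stop after one decoded frame
            "-vo", f"jpeg:outdir={out_dir}/shot_{offset}",
            url,
        ])
    return commands
```

Each list could then be passed to `subprocess.run`, ideally with a timeout, since slow servers and streams that cannot seek are named on the next slide as the main cost.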

Distillation
• Use of Win32 applications
  – Native players (WMP, RP, Qt) for metadata
  – Mplayer for screenshots
• Takes one minute on average
  – Slow servers/bandwidth
  – Streaming without fast forward

Distiller.GRID
• At one minute per URL, 16 years are needed to distill 8.500.000 URLs
• Ideal application for GRID computing
  – No need for real-time response
  – Huge amount of computing time needed
• Two ways to create GRID
  – Build dedicated system
  – Use of current capacities
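
The 16-year figure follows from the one-minute average per URL quoted on the previous slide. A quick check, with the 100-machine case added as an illustration (the machine count is taken from the Configuration slide; running nonstop is an assumption, since the lab PCs only run on weekends):

```python
MINUTES_PER_YEAR = 60 * 24 * 365


def distillation_years(url_count, minutes_per_url=1.0, machines=1):
    """Wall-clock years needed if the work is split evenly across machines."""
    return url_count * minutes_per_url / machines / MINUTES_PER_YEAR


single = distillation_years(8_500_000)
grid_days = distillation_years(8_500_000, machines=100) * 365

print(f"{single:.1f} years on one machine")                    # 16.2 years on one machine
print(f"{grid_days:.0f} days on 100 machines running nonstop")  # 59 days on 100 machines running nonstop
```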

Computing machines
• PC/Windows based
• HW independent
• Secure environment
  – Security of hosting system
  – Security of distillation process
• Well connected
• No need to run 24x7
• Easy to manage

Configuration
• ~100 PCs in student labs
• Running on demand during weekends
• Virtual machines (MS VPC 2004) in hosting system (Win XP)
• Three different HW configurations
• Peak rate about 5000 URLs per minute
• SQL as background -> pull distribution of work
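
"Pull distribution of work" means each station asks the SQL database for its next batch instead of being assigned one centrally. The slides do not describe the schema, so the sketch below uses an in-process SQLite database with an invented `tasks` table and invented function names. It runs on a single connection; a real multi-station setup would need the database server's locking to make the claim atomic.

```python
import sqlite3


def create_queue(conn, urls):
    """Create the hypothetical task table and load URLs as pending work."""
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    conn.executemany("INSERT INTO tasks (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


def claim_batch(conn, worker, batch_size):
    """A station pulls up to batch_size pending URLs and marks them as claimed."""
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT id, url FROM tasks WHERE status = 'pending'"
            " ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        conn.executemany(
            "UPDATE tasks SET status = 'claimed', worker = ? WHERE id = ?",
            [(worker, task_id) for task_id, _ in rows],
        )
    return [url for _, url in rows]
```

With pulling, stations that are only up on weekends or that run on slower hardware simply claim fewer batches, and no scheduler has to track which machines are currently available.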

Actual status I
• HW
  – 20 crawlers
  – 2 servers for fulltext DB (<1.400 USD)
  – Distillation stations (X office PC)
  – Connected by 1 Gb/s to CESNET2 -> GEANT2

Actual status II
• Database
  – EU + .com, .edu
  – > 13.000 URLs
  – > 8.000 valid
  – > 2.800.000 with screenshots

Live show?

Want to test?
• URLs
  – http://multimedia.jyxo.cz
  – http://videoserver.cesnet.cz/videoarchiv_en.php
  – For XML interface send me e-mail

Questions? Comments?
Michal Krsek, Michal.Krsek@cesnet.cz (academic service, cooperation)
Michal Illich, michal@illich.cz (business service)