Language Technology Research Serving eHumanities
New Ways of Accessing the USC Shoah Foundation Archive in the Center for Visual History Malach
Jan Hajič
Institute of Formal and Applied Linguistics
Computer Science School, Charles University in Prague, Czech Republic
malach@knih.mff.cuni.cz | http://www.malach-centrum.cz

From Testimonies to Flexible Access
§ The USC VHI Archive
  § Testimonies of Holocaust survivors
§ Center for Visual History Malach
  § Access Point to the USC Archive
  § Activities of CVHM
§ Access using New Technology
  § Fulltext (transcript) Search
  § Cross-lingual Access
  § Thesaurus Translation
§ Status and Future Plans
21.11.2012 J. Hajic: CVHM & Language Technology

Center for Visual History Malach
§ Access Point to the USC VHI’s Archive, http://www.usc.edu/vhi

Contents of the Archive
§ Testimonies recorded in the 1990s

Recording the Testimonies
§ Visual History Foundation
  § California, USA (Universal Studios)
§ 1990s
  § Analog video recording technology, 30-minute tapes
  § Teams of 3 people (moderator, video, audio)
§ Volume
  § 56 countries, over 105,000 hours of video
  § 32 languages, ~52,000 testimonies
  § Half of them in English

Archiving the Testimonies
§ Digitization
  § 100s of terabytes of data (NTSC/PAL quality)
§ Cataloguing (indexing)
  § Thesaurus (55,000 keywords)
  § hierarchical, timeline, places
§ Goal:
  § Access (search)
  § Material for projects

Access (Search)
§ Search by keywords
  § At 1-minute segments, beginning of topic
§ Search by particular people, relations
§ Filter search by
  § Language spoken
  § Country of survivor
  § Experience (survivor/liberator/...)
§ Not possible: “fulltext” search
§ Video access: locally available, or on order
§ Player: usual controls, also by segment, search within video
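
The keyword search with filters described on this slide can be pictured with a small sketch. It is only an illustration, not the actual VHA implementation: the Segment record, its field names, and the sample data are all hypothetical, and the real thesaurus is hierarchical rather than a flat set of strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One indexed 1-minute segment (hypothetical record layout)."""
    testimony_id: str
    minute: int
    keywords: frozenset
    language: str
    country: str
    experience: str


def search(segments, keyword, language=None, country=None, experience=None):
    """Return (testimony_id, minute) for segments tagged with the keyword.

    Each optional filter is applied only when a value is given.
    """
    hits = []
    for seg in segments:
        if keyword not in seg.keywords:
            continue
        if language is not None and seg.language != language:
            continue
        if country is not None and seg.country != country:
            continue
        if experience is not None and seg.experience != experience:
            continue
        hits.append((seg.testimony_id, seg.minute))
    return hits


# Invented sample data, for illustration only.
SEGMENTS = [
    Segment("T001", 12, frozenset({"liberation"}), "Czech", "CZ", "survivor"),
    Segment("T002", 40, frozenset({"liberation"}), "English", "US", "liberator"),
    Segment("T001", 3, frozenset({"family life"}), "Czech", "CZ", "survivor"),
]

czech_hits = search(SEGMENTS, "liberation", language="Czech")
```

Because only manually assigned keywords are matched, a word that was spoken but never indexed cannot be found this way, which is the “fulltext” limitation the slide points out.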

Access Points
§ Internet: only limited access so far
  § Throughput (technical limitations), legal & ethical issues, ...
§ → Access Points
  § ~30 worldwide (USA; EU: Berlin, Budapest, Prague, Warsaw; secondary access available)
  § 2-20% of full archive locally
  § Fast “Internet2” connection
§ Additional Services
§ Search and view: standard Internet browser

Center for Visual History Malach
§ Charles University in Prague, est. 2009, coordinator: Jakub Mlynář

Center for Visual History Malach
§ Supported by Charles University
  § Faculty of Mathematics and Physics, CS School
    § CS School Library & Institute of Formal and Applied Linguistics
  § Part of LINDAT-Clarin, Language Data Infrastructure
    § Clarin ERIC – Pan-European network of LTH Centers
§ 12 workplaces, AV technology, materials
§ Technology (by Inst. of Formal and Applied Linguistics):
  § 1 Gbit network locally, dataserver (for video cache)
    § 2000 testimonies locally (all Czech, Slovak, Polish, many in English)
  § Geant connection, 5-10 min. for 30 min. video from USC

Center for Visual History Malach: Activities
§ Seminars
  § Anniversary seminar (January)
  § Seminars for students, teachers
    § Also: foreign visitors (Ukraine – summer 2012)
§ Workshops
  § Co-organization of Raoul Wallenberg 100th Anniversary workshop, Nov. 2012
    § w/Czech Parliament, Jewish Museum in Prague, Embassies
§ Tutorials
  § Using the Archive, How-To-...; Research on Language Technology (with Institute of Formal and Applied Linguistics)

Center for Visual History Malach: Activities
§ Newsletter
§ Web:

Center for Visual History Malach: Visitors
[Chart: visitors by category, mid-2010 – fall 2012. Categories: Students; Teachers, Researchers; Journalists, Writers, Filmmakers; Other (personal reasons, etc.)]

Why “Malach”?
§ Technology and UI Research Project
  § 2002-2007
  § Multilingual Access to Large Audio arCHives
  § malach – “angel” in Hebrew
§ Support:
  § NSF (National Science Foundation)
  § Visual History Foundation (predecessor of SFI/USC)
§ IBM Research, Yorktown Heights, NY, USA
§ Johns Hopkins Univ., Baltimore, MD, USA
§ Univ. of Maryland, College Park, MD, USA
§ Charles University in Prague, CZ (IFAL MFF UK)
§ Univ. of West Bohemia, Pilsen, CZ (Dept. of Cybernetics)

Research in the Malach Project
§ Research in the area of
  § Automatic Speech Recognition (of the testimonies)
    § English, Czech, Slovak, Russian, Polish, Hungarian
  § Automatic Translation of Thesaurus
    § Keyword translation
    § Czech, English
  § Cross-lingual Audio/Voice Search
    § Part of the world-wide CLEF 2006, 2007 competition
  § User interfaces → current VHA search interface

Automatic Speech Recognition
§ Core “Front-end” Technology
§ Current State-of-the-Art: 95% in controlled conditions
§ Problems:
  § English: non-native speakers (virtually all 26,000!)
  § Czech: colloquial speech
  § All: emotions, elderly people, imperfect recording
  § Technology issues: not enough in-domain texts
§ Some improvement reached by 2007
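
The slide does not say which metric the 95% figure refers to. Recognition quality is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. If the 95% is read as word accuracy (an assumption), it would correspond to a WER of roughly 5%. A minimal sketch of the computation, with invented example sentences:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Invented example: one deleted word and one substituted word out of five.
wer = word_error_rate("we were liberated in may", "we liberated in made")
```

The problems listed on the slide (non-native and colloquial speech, emotion, imperfect recordings) each add substitutions, deletions, or insertions, which is why figures from controlled conditions do not carry over to the testimonies.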

The AMALACH Project
§ Applied research project, 2012-2015
  § Implement and integrate (some) MALACH project results
  § Czech National Cultural Heritage Funding
§ Partners: Charles Univ., Univ. of West Bohemia (and USC)
§ Selling point: improved access for local (Czech) researchers
  § USC Archive: 558 Czech-language testimonies
    § only a fraction (~12%) of 4613 Czech survivors!
    § Rest: mostly English spoken
  § Also: 12500 segments containing keyword “Czech”
§ Solution: cross-lingual fulltext-like search
  § Needs speech recognition, automatic translation, thesaurus

Cross-lingual Search Scheme
§ Archive transcript & query translation
[Diagram. Offline: the archive (all audio) → ASR in multiple languages → transcripts in languages A, B, C, ..., Z → translation to E → archive transcript in E. Query processing: user query in A → translation to E → query in E → monolingual search over the archive transcript in E → results: Seg. 1 ... Seg. N in A, Seg. 1, Seg. 2, ... in B.]
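
The scheme in the diagram can be sketched in a few lines: transcripts in every language are translated into one pivot language E offline, the user's query is translated into E at query time, and the search itself is monolingual. Everything concrete below is a stand-in chosen for illustration: the word-for-word lexicon replaces real machine translation, the typed-in transcripts replace ASR output, English is assumed as the pivot, and all names and segment ids are hypothetical.

```python
# Hypothetical toy lexicons standing in for real machine translation.
# English is assumed to be the pivot language E, so it needs no lexicon.
LEXICONS = {
    "cs": {"osvobození": "liberation", "tábor": "camp", "rodina": "family"},
    "en": {},
}


def translate_to_pivot(text, lang):
    """Word-for-word 'translation' into the pivot language (toy stand-in)."""
    lexicon = LEXICONS.get(lang, {})
    return " ".join(lexicon.get(word, word) for word in text.lower().split())


def build_index(transcripts):
    """Offline step: translate each transcript to the pivot and index words.

    transcripts: iterable of (segment_id, language, text).
    Returns a dict mapping pivot-language word -> set of segment ids.
    """
    index = {}
    for segment_id, lang, text in transcripts:
        for word in set(translate_to_pivot(text, lang).split()):
            index.setdefault(word, set()).add(segment_id)
    return index


def search(index, query, query_lang):
    """Query-time step: translate the query, then do a monolingual AND-search."""
    words = translate_to_pivot(query, query_lang).split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented transcripts; in the real scheme these would come from ASR.
TRANSCRIPTS = [
    ("cs-1", "cs", "rodina a tábor"),
    ("en-1", "en", "the camp was liberated"),
    ("en-2", "en", "my family"),
]
INDEX = build_index(TRANSCRIPTS)
hits = search(INDEX, "tábor", "cs")
```

A Czech query for “tábor” returns both the Czech and the English segment, which is the point of the scheme: the user never needs to know which language a testimony was given in.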

Phonetic and Word Search (monolingual)
§ Automatic Speech Recognition (Univ. of WB)
[Diagram: Automatic Speech Recognition → Word and Phonetic Lattice → Transcript Database → Search System; sample results: VHF04106-0047.18, VHF04167-0146.32, VHF05103-0192.98, ...]
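
Why search a lattice rather than a single best transcript? A lattice keeps several competing recognition hypotheses, so a word can still be found even when the recognizer ranked it second. The sketch below simplifies the lattice to a list of time slots, each holding alternative (word, score) candidates; this layout, the scores, and the threshold are assumptions for illustration, not the University of West Bohemia system's actual data structures.

```python
# Simplified lattice: one list of (word, score) alternatives per time slot.
# Scores are invented.
LATTICE = [
    [("we", 0.9), ("he", 0.1)],
    [("reached", 0.6), ("reach", 0.4)],
    [("prog", 0.55), ("prague", 0.45)],
]


def best_transcript(lattice):
    """1-best transcript: keep only the top-scoring word in each slot."""
    return [max(slot, key=lambda cand: cand[1])[0] for slot in lattice]


def lattice_hits(lattice, term, min_score=0.1):
    """Return (slot_index, score) wherever the term appears as any alternative."""
    hits = []
    for position, slot in enumerate(lattice):
        for word, score in slot:
            if word == term and score >= min_score:
                hits.append((position, score))
    return hits


one_best = best_transcript(LATTICE)
found = lattice_hits(LATTICE, "prague")
```

Here the 1-best transcript misses “prague” while the lattice search still finds it. A phonetic lattice applies the same idea to phoneme sequences, which helps with names and other words outside the recognizer's vocabulary.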

Machine Translation
§ State-of-the-Art
  § Cf. Google (currently best for most language pairs)
  § Still imperfect (applications need varying levels of quality)
§ Machine translation of speech transcripts
  § Big challenge: VERY noisy input
    § Speech recognition errors
    § Ungrammatical, non-native, emotional language
  § Good news
    § Used in search only (will probably never be shown to users)

Statistical Machine Translation Technology
§ The idea (1940s/1990s) - imagine this:
  [Diagram: Czech text, “Coding”, English text]
§ Translation by the reverse process: “decoding”
  § Probabilistic model of the translation process
  § And probabilistic model of the target language
  § Probabilities learned from (human) translations
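
The “decoding” idea can be made concrete with a toy example. Given an observed sentence, the decoder picks the candidate translation that maximizes the product of two probabilities: how likely the candidate is as a sentence of the target language (the language model) and how likely the observed sentence is as a “coding” of that candidate (the translation model). The sketch assumes Czech as the target language, and all probabilities are invented; in a real system both models are learned from human translations and the candidate set is far too large to enumerate.

```python
import math

# Invented toy models. LANGUAGE_MODEL[c] plays the role of P(c) for a Czech
# candidate c; CHANNEL_MODEL[(o, c)] plays the role of P(o | c) for an
# observed English phrase o.
LANGUAGE_MODEL = {"tábor": 0.6, "kemp": 0.3, "tábora": 0.1}
CHANNEL_MODEL = {
    ("camp", "tábor"): 0.5,
    ("camp", "kemp"): 0.6,
    ("camp", "tábora"): 0.5,
}


def decode(observed, language_model, channel_model):
    """Return the candidate maximizing log P(candidate) + log P(observed | candidate)."""
    best, best_score = None, -math.inf
    for candidate, p_candidate in language_model.items():
        p_observed = channel_model.get((observed, candidate), 0.0)
        if p_candidate <= 0.0 or p_observed <= 0.0:
            continue  # candidate is impossible under one of the two models
        score = math.log(p_candidate) + math.log(p_observed)
        if score > best_score:
            best, best_score = candidate, score
    return best


translation = decode("camp", LANGUAGE_MODEL, CHANNEL_MODEL)
```

In this toy the translation model alone slightly prefers “kemp”, but the language model tips the decision to “tábor”: the two models correct each other, which is why the slide lists both.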

Speech and Language Technology in Search
[Diagram. Offline: the archive (all audio) → ASR in multiple languages → transcripts in languages A, B, C, ..., Z → translation to E → archive transcript in E. Query processing: user query in A → translation to E → query in E → monolingual search over the archive transcript in E → results: Seg. 1 ... Seg. N in A, Seg. 1, Seg. 2, ... in B.]

Status and Future Plans
§ Czech testimonies
  § Monolingual Fulltext Search System operational
    § in CVHM, users can use both VHA and the UWB UI
§ English speech recognition of the testimonies
  § Work has started: data preparation ongoing
§ Translation to Czech
  § Thesaurus: manually (high quality necessary)
    § Will be used in the current interface as well
  § Data: work ongoing, data preparation
    § “Lattice” translation experiments underway
§ Cross-lingual search: work starts in 2013

Thank you!
§ VHI: http://www.usc.edu/vhi
§ Institute of Formal and Applied Linguistics: http://ufal.mff.cuni.cz
§ Center for Visual History Malach: http://malach-centrum.cz
§ Dept. of Cybernetics, Univ. of West Bohemia, Pilsen, CZ: http://www.kky.zcu.cz
§ The project “Malach”: http://malach.umiacs.umd.edu

Closing
§ Presented at Preserving Survivors’ Memories: Digital Testimony Collections about Nazi Persecution: History, Education and Media
§ Wednesday, Nov 21, 2012, 11:00, Section A
§ http://www.preserving-survivors-memories.org