Research on Human Language Technologies in Flanders
Walter Daelemans (Ed.)
CNTS Language Technology Group, University of Antwerp
walter.daelemans@ua.ac.be

Prehistory Computational Linguistics
• Early isolated PhDs
  • Willy Martin (Leuven, 1970): Analysis of a vocabulary by means of a computer
  • Luc Steels (Antwerpen, 1977): Aspects of a Modular Theory of Language
• Early international MT involvement (Leuven)
  • EUROTRA (early eighties until 1994), METAL (mid-eighties until 1992)
  • Dutch language pairs

Prehistory Computational Linguistics
• Droste PhDs: F. G. Droste (1969), Translating with the computer, possibilities and problems.
  • Frank Van Eynde (Leuven, 1985): Meaning, translatability, and machine translation
  • Geert Adriaens (Leuven, 1986): Process Linguistics: Theory and Practice of a Cognitive-Scientific Approach to Natural Language Understanding
  • Walter Daelemans (Leuven, 1987): Studies in Language Technology: an object-oriented model of Dutch morphophonology and its applications
• First groups start late eighties, early nineties
  • Leuven CCL (Van Eynde)
  • Antwerp CNTS (Daelemans, Gillis)

Prehistory Speech Technology
• Early PhDs
  • Jean-Pierre Martens (Gent, 1982): Quality degrading aspects of filtered speech.
  • Luc Vanhove (Gent, 1984): Study and improvement of the linear prediction vocoder.
  • Werner Verhelst (Brussels, 1985): Short-time cepstra and LPC analysis-synthesis of speech.
  • Dirk Van Compernolle (Stanford, 1985): Speech Processing Strategies for a Multichannel Cochlear Prosthesis.
• First speech technology groups start up in the mid-80s
  • Gent (ELIS): Marc Vanwormhoudt, Jean-Pierre Martens
  • Leuven (ESAT): Dirk Van Compernolle
  • Brussels (ETRO): Oscar Steenhaut, Werner Verhelst

Prehistory Speech Technology
• First collaborations on national level
  • 1983-1989: First IWONL-IRSIA project on speech analysis/synthesis (with UGent, VUB, UCL, Bell Telephone, FPMS, ACEC, Philips, Correlative Systems)
  • 1987-1992: National stimulation program on artificial intelligence (with KULeuven, VUB, UCL, ULB, FPMS, UGent)

Start of an organized field
• Research Initiative on Dutch speech and language technology (1993-1994)
  • Flemish ministry of science and technology
  • To “improve and strengthen the position of Dutch”
  • Approx. 1 million euro
  • Speech recognition research, corpora (CoGeN, ANNO), pronunciation lexicon (Fonilex)
  • In preparation of a long-term research programme on speech and text translation to and from Dutch

Start of an organized field
• Computational Linguistics in Flanders (CLIF), http://clif.esat.kuleuven.be
  • Sponsor: Flemish National Science Foundation
  • Scientific Research Community 1995-2010 (2 renewals), 12,500 euro per year
  • Goals
    • Strengthen the integration of fundamental research on language and speech technology in Flanders to establish multidisciplinary, fundamental and applied research of natural language, and Dutch in particular
    • Facilitate research activities of the participating research groups, improving (re-)usability of data for spoken and written language
  • Positive effects
    • Acts as a de facto spokesman of the academic research community for government, Dutch Language Union, etc.
    • Fruitful environment for cooperation
    • All the main research groups are represented
    • International advisory board

Start of an organized field
• Flanders Language Valley campus
  • Ypres, the centre of Europe
  • Literally arising around Lernout & Hauspie Speech Products
    • 1985 founded, 1995 NASDAQ, 2001 bankrupt
  • 125 million euro investment capacity in FLV Fund 1995-2005 (liquidation)
  • “Favorable place of business for HLT companies”
  • CELE Research group (Kristiina Jokinen, Dirk Frimout)
    • University of Antwerp cooperation; long-term research on CAM Brain Machine
    • (turned out to be short-term research)

From CGN to STEVIN
• What happened to the “long-term research programme on speech and text translation to and from Dutch”?
  • CLIF recommendation: extend and valorize the “short term research projects”
  • Opportunities for cooperation with The Netherlands
• CGN (Spoken Dutch Corpus)
  • 1998-2004
  • 10 million words, linguistically annotated and linked to the signal
    • Tools, protocols, interface
    • From spontaneous to read speech
  • 5 million euro, 1/3 Flanders

From CGN to STEVIN
• The position of Dutch in Language and Speech Technology (Bouma & Schuurman, 1998)
  • Conclusion: many weak spots and omissions in the available infrastructure
  • Advice:
    • Install Dutch-Flemish platform (coordinated by NTU)
    • Stimulate both fundamental and applied research
    • Set up an inter-university HLT education program in Flanders and reinforce the existing Dutch programs
• Action plan for Dutch in language and speech technology: priorities for basic resources (Daelemans & Strik, 2002)
  • Prepared the contents and priorities for the STEVIN program (1/3 Flanders)
  • 2004-2011, 11.4 million euro

HLT Research Overview

Others
• Research within companies
  • Language and Computing
  • Nuance
  • …
• Karel De Grote university college (Antwerp)
  • Readability, subtitling, …
• Lessius university college (Leuven)
  • Terminology extraction and management, translation tools
• Erasmus university college (Brussels)
  • Terminology, translation tools, corpora

Funding situation in Flanders
• Flemish HLT research funding situation is reasonably good (except for basic research)
• Basic research grants (hard to get)
  • FWO PhD and postdoc mandates
  • FWO research projects
  • (VNC Dutch-Flemish research projects)
• University funding (Special Research Fund)
  • TOP, GOA, IUAP, …
• Application-oriented research
  • IWT PhD projects
  • IWT SBO / GBOU / STWW projects
  • TETRA projects
  • European Research (Framework Programs)

Research situation in Flanders
• The joint research groups cover a large part of the field of language and speech technology research
  • Speech recognition and speech synthesis
  • MT, QA, Information Extraction, Summarization, Information Retrieval, Ontology and Terminology Learning
  • Machine Learning / statistical methods, knowledge-based / linguistic methods, hybrid methods
  • Text analysis (from morphology via syntax to semantics and discourse)
  • Corpus development and annotation
• Less well-represented
  • Text generation from meaning representations
  • (Spoken) dialogue systems
  • Multimodality

Speech

ELIS-DSSP
Jean-Pierre Martens
Electronics & Information Systems, University of Ghent
https://speech.elis.ugent.be/

ELIS-DSSP
• Embedding
  • research group of dept. Electronics & Information Systems (ELIS)
• Key dates
  • 1982: first PhD
  • 1986: first Flemish aid for speech impaired persons, working with speech synthesis
  • 1988: speech synthesis technology sold to L&H
  • 1997: creation of spin-off company (Technology & Integration) in domain of alternative communication with speech technology
• Main research themes in speech technology
  • auditory model based speech and music analysis
  • acoustic and lexical modelling for ASR
  • speech segmentation and labelling as a pre-processing step in speech transcription systems
  • objective assessment of disordered speech

ELIS-DSSP
• Software development
  • auditory model embedding AMPEX pitch extractor
  • monophonic melody extractor
  • real-time audio indexing system comprising the isolation of speech intervals, speaker turn detection, gender and speaker clustering
  • AUTONOMATA grapheme-to-phoneme conversion toolkit supporting the development of error recovery from baseline system
  • Tool for disordered speech intelligibility assessment
• Resource development
  • CoGeN (Corpus Gesproken Nederlands): ELIS + ESAT
  • FONILEX (Phonetic Lexicon): together with UA, CCL
  • CGN (Corpus Spoken Dutch): project leader for Flanders
  • COST-278 multilingual broadcast news database

Main research results
• improved LVCSR by means of data-driven pronunciation variation modeling (ACCENT: FWO)
• real-time audio segmentation algorithm that came out first in a multilingual evaluation campaign (ATRANOS: IWT, COST 278: EU)
• improved spontaneous speech recognition by the proper treatment of disfluencies (ATRANOS: IWT)
• reliable prediction of disordered speech intelligibility by means of phonological features (SPACE: IWT)
• improved proper name recognition by means of a phonological feature model (SPACE: IWT)
• improved proper name synthesis by means of special purpose g2p converters that can be trained on very few transcribed data using a g2p-p2p approach (AUTONOMATA: STEVIN)
• improved LVCSR by means of a data-driven compound composition and decomposition strategy (NBEST: STEVIN)

ETRO-DSSP
Werner Verhelst
Electronics & Informatics, University of Brussels
http://www.etro.vub.ac.be/Research/DSSP/dssp.htm

Laboratory for Digital Speech and Audio Processing - ETRO-DSSP
• Embedding
  • research group of dept. Electronics & Informatics (ETRO) of the Vrije Universiteit Brussel
• Key dates
  • 1985: first PhD
  • 1988-1991: collaboration with Institute for Perception Research, The Netherlands
  • 2004: member of Interdisciplinary Institute for Broadband Telecommunication (IBBT)
  • 2006: joint research group for audiovisual speech processing with Northwestern Polytechnic University, Xi’an, China, and start of FWO-WOG Audiovisuele systemen
• Main research themes in speech technology
  • speech modification
  • speech enhancement
  • expressive speech analysis and synthesis
  • audiovisual speech analysis and synthesis

Main development work
• Software development
  • system for automatic synchronization of studio dialogs with lip movements in video and film postproduction (IWOIB - EOS)
  • speech synthesis for feedback in reading tutor software (IWT - SPACE)
  • audiovisual text to speech synthesis system (Flemish and English)
  • sound management system for public address systems (IWT + ESAT + Televic)

Main development work
• Resource development
  • Audiovisual recording studio
  • Database with multi-sensor speech recordings (EU-SAFIR with Thales and Voice Insight)
  • Database for Flemish unit selection TTS
  • Audiovisual database with emotional speech (new project)

Main research results
• window and sampling effects in short-time cepstra of voiced speech (IWONL)
• improved autocorrelative pitch detection with adaptive sign clipping (IWONL)
• improved voicing source model for vocoders (VUB)
• the WSOLA algorithm and its use for robust natural sounding time scaling (IWT)
• perceptual speech and audio modeling with damped sinusoids (IRMUT - IWT with ESAT)
• least squares theory and design of optimal noise shaping filters for speech and audio requantization (SMS4PA - IWT with ESAT and Televic)
• first cross-database study for expressive speech classification (VIN - IBBT)
• improved speech recognition in noisy environments with bone conducting microphones (SAFIR - FP6)
• improved spelling and syllabification modes in text to speech synthesis (SPACE - IWT)

ESAT/PSI-Spraak
Patrick Wambacq, Hugo Van hamme, Dirk Van Compernolle
Electrical Engineering, Center for Processing Speech and Images, University of Leuven
http://www.esat.kuleuven.be/psi/spraak

ESAT/PSI-Spraak
• Speech processing research at K.U.Leuven (Dept. Electrical Engineering, Center for Processing Speech and Images) since 1987
• Focus on speech recognition and its applications, using an in-house developed large vocabulary continuous speech recognition system
• 3 staff members: Dirk Van Compernolle, Hugo Van hamme, Patrick Wambacq; ≈ 10 researchers (some postdocs)
• Extensive computing facilities
• Cooperations at national and international level through research projects with both academia and industry, for fundamental and applied research
• Current coordinator of CLIF research community

ESAT/PSI-Spraak: research themes
• ASR novel architectures (episodic, hybrid, layered approaches)
• ASR robustness (noise, spontaneous speech, speaker variability, …)
• speech modeling and representation
• computational models of human language acquisition
• applications: CALL, clinical applications, indexing, subtitling, …
• tools and corpora for ASR

ESAT/PSI-Spraak: FLaVoR
• FLaVoR: “Flexible Large Vocabulary Recognition”: IWT funded, Oct. 2002 - Sept. 2006, with CNTS, University of Antwerp
• Frustrated by the inflexibility of the traditional monolithic ASR architecture, we set out to
  • Incorporate Linguistic Knowledge Sources
    • That allow for efficient modeling of morphologically productive languages
    • That allow for modeling linguistic phenomena that are not well dealt with in a traditional left-to-right architecture
    • That allow for the modeling of both short and long term dependencies

ESAT/PSI-Spraak: FLaVoR
• Through a Modular Recognizer Architecture
  • That assures a better reusability of components
  • That relies on a high degree of independence between acoustic and linguistic processing
  • That allows for a faster decoder and hence makes computational resources available for the more complex linguistic modeling

ESAT/PSI-Spraak: FLaVoR

ESAT/PSI-Spraak: SPACE
• SPACE: “SPeech Algorithms for Clinical and Educational applications”, IWT funded, Mar. 2005 - Feb. 2009, with ELIS-UGent, DSSP-VUB, ORTHO-KULeuven, COM-UAntwerp
• Main goals:
  • evaluate user’s speech in educational and clinical applications
  • improve speech recognition and speech synthesis technologies to better support these applications:
    • provide accurate classification of uttered phonemes
    • provide corrective auditory feedback
    • particular focus points: children’s speech, disfluencies, mispronunciations, deviant speech (e.g. speech of the deaf, dysarthria)
  • demonstrate the benefits of speech technology based tools for these applications, with involvement of experienced user groups

ESAT/PSI-Spraak: SPRAAK
• SPRAAK: Speech Processing, Recognition and Automatic Annotation Kit: STEVIN project, Dec. 2005 - June 2008, building on 15 years of in-house developed software
• Make a state-of-the-art LVCSR system available to the research community (free for research purposes):
  • modular toolkit (plug & play) for research on speech recognition, allowing researchers to focus on one particular aspect only and forget about the rest, with access to deep internals of the system (using low-level API)
  • recognizer with simple interface, usable by non-experts through high-level API
  • http://www.spraak.org

Text

Centrum voor Computerlinguïstiek
Frank Van Eynde
Faculty of Arts, University of Leuven
http://www.ccl.kuleuven.be/

Centrum voor Computerlinguïstiek
• founded in 1991 at the Faculty of Arts of K.U.Leuven, building on the expertise that had been acquired in the 80s in the framework of the machine translation projects Eurotra and Metal
• part of the research unit ‘Dutch, German and Computational Linguistics’ since 2005
• member of ELSNET since 1993 and of CLIF since 1995
• main objectives: (1) acquiring funds for carrying out research in formal and computational linguistics and its application in natural language processing; (2) teaching, training and dissemination

Centrum voor Computerlinguïstiek
• formal syntax and semantics (Head-driven Phrase Structure Grammar)
• corpus annotation (tagging, treebanks, semantic annotation)
• machine translation
• multilingual information retrieval
• teaching at K.U.Leuven and at international summer schools (ESSLLI, ELSNET, EMLS, LOT)
• host of ESSLLI-90, TMI-95, CLAW-96, ELSNET-97, CLIN-98, EMLS-02, HPSG-04, CLIN-07
• http://www.ccl.kuleuven.be/

Machine Translation
• METIS-II (EU/FP6), 2004-2007
  • Successor of Metis I (2003-2004)
  • Hybrid system DU-EN
  • Low resources (no full parser, no parallel data)
  • BLEU scores about the same as SMT with IBM 1 trained on Europarl
• Succeeded by PaCo-MT (NTU) (2008-2011)
  • Hybrid system FR <-> DU <-> EN
  • Full resources

Corpus Annotation
CGN / D-Coi / Lassy (STEVIN): series of joint Flemish/Dutch projects
• spoken Dutch (CGN 1998-2000), ± 10 M
• written Dutch (D-Coi '05-'06, Lassy '06-'09), ± 50 M
• PoS labels, lemmata, treebank
• Parts manually corrected (e.g. 1 M treebank in Lassy)
• Succeeded by SoNaR 2008-2011, ± 500 M
• semantic labels (corrected) for 1 M (NE, coreference, semantic roles, spatiotemporal)

CNTS Language Technology Group
Walter Daelemans, Steven Gillis
Faculty of Arts, University of Antwerp
http://www.cnts.ua.ac.be

CNTS Center for Dutch Language and Speech
• Founded in 1992 to promote research in Dutch (corpus) linguistics, psycholinguistics, and computational linguistics
• Research Center of the department of Linguistics (Faculty of Arts)
• Member of Elsnet, CLIF, Flarenet, Clarin, Pascal, CIL, …
• Co-founded ACL SIG on Computational Language Learning (SIGNLL) and the associated CoNLL conference series and CoNLL shared tasks series
• Resources and software development
  • Corpora (CGN, COREA, Knack-2002, …)
  • TiMBL, Tadpole (with ILK, Tilburg University)
  • Memory-Based Shallow Parser (MBSP)
• Spin-off: www.textkernel.nl

Research Topics
• Computational Psycholinguistics
  • Computational models of human language acquisition and processing (phonology, morphology, syntax)
• Machine Learning of Language
  • Memory-Based Learning; ML methodology
  • ML-based Text Analysis
    • Phonological and morphological analysis, prosody and grapheme-to-phoneme, POS tagging, chunking, grammatical relations, pp-attachment, named-entity recognition, semantic role labeling, word sense disambiguation, coreference resolution, …
• LT Applications
  • Biomedical information extraction; Summarization and sentence simplification; Ontology extraction from text; Question Answering
  • NL interface to graphical design packages, serious gaming, computational stylometry
  • African Language Technology

Biomedical Text Mining
http://www.biograph.be
• BioMinT (EU FP5, Quality of Life, 2003-2005)
  • Information Retrieval and Information Extraction tool for human curators of SWISSPROT (protein database)
    • With SIB Geneva, University of Manchester, PharmaDM, University of Vienna, University of Geneva
  • Results CNTS
    • Adaptation of the Memory-Based Shallow Parser for biomedical language (tagger, tokenizer, NER, grammatical relations)
• Biograph (University of Antwerp GOA project, 2007-2010)
  • Ranking genes implicated in diseases (schizophrenia, Alzheimer) using heterogeneous data (including text)
    • With data mining group and molecular biology group of University of Antwerp
  • Better text mining engine including analysis of modality and negation for better biomedical information extraction

Computational Stylometry
• Computational techniques for stylometry (FWO project, 2007-2010)
• Goals
  • Develop feature construction, feature selection and supervised and unsupervised machine learning techniques for
    • Authorship, gender, date, and personality attribution from text
    • Stylistic analysis of literary texts
  • Develop and make available a tool and benchmark datasets
• http://www.cnts.ua.ac.be/stylometry

iTEC
Piet Desmet
Faculty of Arts, University of Leuven, Campus Kortrijk
http://www.itec-research.be

iTEC
• Interdisciplinary research on Technology, Education & Communication:
  • Computer-assisted language learning
  • Corpora & digital libraries
  • Language testing
  • Authoring systems

Recent projects
• Lingu@tic
  • Language learning environment Dutch & French based on video extracts
    • www.kuleuven-kortrijk.be/linguatic
• Medi@tic
  • Database of video extracts for language learners Dutch & French
    • www.kuleuven-kortrijk.be/mediatic
• DPC
  • Dutch Parallel Corpus (10 million words Dutch, English & French)
    • www.kuleuven-kortrijk.be/dpc

Lingu@tic
• Development of a free language learning environment (Dutch & French) based on authentic broadcast video extracts
• Use of half-open exercises (e.g. translation exercises with alternative answers)
  • www.franel.eu

Medi@tic
• Development of a database of learning objects
  • Repository of free authentic video materials
  • Management tool for the description of audio and video assets
  • Exploration tool for selecting video materials which can be integrated in CALL applications

DPC
• Annotated sentence-aligned corpus
• 10 million words NL-FR and NL-EN
• Quality control
• Compatible with D-COI

LIIR
Marie-Francine Moens
Computer Science, University of Leuven
http://www.cs.kuleuven.be/~liir/

Current research topics
• Text analysis and information retrieval:
  • Content recognition in text and information extraction
  • Automatic indexing of text, topic tracking
  • Processing of noisy texts such as spam mail and blogs
  • Knowledge acquisition from text
  • (Cross-media and cross-lingual) alignment and summarization of content
  • Information retrieval and search models
  • Question answering and reasoning

LINDO
• Large scale multimedia INDexation of multimedia Objects
• LIIR task: question answering of multimedia objects
  • 11/2007-10/2010
  • EU ITEA 2 (06011)
  • In collaboration with: Thales Security Systems, France; CEA List, France; DENODO, Spain; IRIT, France; SGT, France; Space Applications Services, Belgium; SUPELEC, France; Telefónica Investigación y Desarrollo, Spain; and Infoglobal, Spain
• http://www.lindo-itea.eu/

Anticipatory Learning for Reliable Phishing Prevention
• Recognition of phishing e-mails based on machine learning techniques
• 1/2006-6/2009
• EU FP6-027978
• In collaboration with:
  • Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung (IAIS, Bonn), Germany
  • Symantec LIRIC Limited, Ireland; Symantec Ltd, USA; TISCALI Services, Italy
• http://www.antiphishresearch.org/home.html

Cognitive-Level Annotation Using Latent Statistical Structure
• The annotation of multimedia content by means of latent statistical techniques
• 1/2006-6/2009
• EU FP6-027978
• In collaboration with:
  • K.U.Leuven (ESAT-Visics); INRIA, Grenoble, France; University of Oxford, UK; University of Helsinki, Finland; Max-Planck Institute for Biological Cybernetics, Germany
• http://class.inrialpes.fr/

Language and Translation Technology Team
Véronique Hoste
Faculty of Translation Studies, University College Ghent
http://veto.hogent.be/lt3/

Language and Translation Technology Team
• Founded: May 1, 2006
• Embedding:
  • Faculty of Translation Studies
  • University College Ghent
  • Ghent University Association (research group: “language technology and computational intelligence”)

LT3 main focus
1. Terminology: terminological research, (automatic) construction of monolingual and multilingual term banks.
2. Corpora: construction of monolingual and parallel corpora for linguistic and NLP research.
3. Language technology: language and translation technology research: (multilingual) word sense disambiguation, alignment, coreference resolution, information extraction, terminology extraction, etc.

Readability projects
• ABOP
  • IWT-TETRA (2007-2009)
  • In cooperation with Artesis, University of Antwerp and University of Leuven
  • Automatic leaflet Optimizer
  • Automatic replacement of scientific terms by their popular counterpart, redundancy detection, speech act detection
• HENDI
  • HOF (2008-2011)
  • An Automatic Readability tool for Dutch and English Discourse in Parallel and Comparable Texts
  • Measuring complexity at different levels: lexical, syntactic, pragmatic, etc.

Corpus construction
• DPC (Dutch Parallel Corpus)
  • STEVIN (2nd call) (2006-2009)
  • In cooperation with the University of Leuven (campus Kortrijk)
  • Construction of a 10 million word parallel corpus (Dutch - French - English)
• SoNaR (Stevin Nederlandstalig Referentiecorpus)
  • STEVIN (tender) (2008-2011)
  • In cooperation with the Universities of Tilburg, Nijmegen, Twente, Leuven and Utrecht
  • 500 million word corpus of contemporary written Dutch: acquisition for Flanders
  • Semantic annotation of a 1-million word subcorpus: WP management

Other
• ParaSense (HOF, 2007-2012): unsupervised WSD based on parallel corpora
• Sub-sentential alignment (HOF, 2004-2010)
• Autoweb: relation extraction in biomedical texts (BOF, 2006-2010)
• Biomedical information retrieval (HOF, 2007-2012)
• Bilingual term extraction (private funding, 2008-2009)

Conclusions
• For historical reasons, the majority of language technology research takes place in Arts environments, while all speech technology research takes place in Engineering environments.
• Flemish research covers (almost) the complete field of HLT
• Cooperation
  • There is extensive cooperation both within and between the language technology and speech technology areas
  • There is extensive international cooperation, with The Netherlands as privileged partner