Compiling and using a parallel corpus for Translation Studies
Compiling and using a parallel corpus for Translation Studies Ana Frankenberg-Garcia
The study of human translation
• Traditionally not a hard science
• Difficult to be systematic
With the advances of corpus linguistics, things can change (Baker 1993)
Advantages of using corpora to study human translation
• An enormous amount of translated texts
• Systematic analyses
• Quantifiable results
Corpora used in translation research I. Comparable corpora
Bilingual: Tognini-Bonelli & Manca (2002), EN Farmhouse holidays and IT Agriturismo
Monolingual: Olohan & Baker (2000), EN BNC and EN Translational English Corpus
Corpora used in translation research II. Parallel corpora
Unidirectional (L1-L2), e.g. P-ACTRES EN-ES
Bidirectional (L1-L2 & L2-L1), e.g. COMPARA PT-EN & EN-PT
Compiling parallel corpora: text selection
• Genre (scientific, imaginative, technical, etc.)
• Mode (oral? written?)
• Variety (standard? regional?)
• Time (contemporary? older?)
• Languages (which? just two or more?)
• Translations (professional? native speakers? different translators?)
• Unidirectional or bidirectional?
Are there translations?
Compiling parallel corpora: example of interrelated factors for PT-EN (diagram labels: spoken popular scientific academic tourism literature politics)
Compiling parallel corpora: copyright
• Personal use: no hassle, but lots of work for limited use
• Shared use: permissions needed, but lots of users and uses, and results are verifiable
Compiling parallel corpora: copyright
• Two permissions, double the work
– Public domain ST, but copyright TT
• Publishers, authors and translators generally don’t know what a corpus is
• Protect
• Advertise
Compiling parallel corpora: alignment
Text? Paragraph? Sentence? Clause? Word?
Compiling parallel corpora: alignment tags
<id=EBJT1 1845> …their winglike ears pierced with plastic identity tags.
<id=EBJT1 1845> …suas orelhas, que faziam lembrar asas, furadas e umas etiquetas de plástico a identificá-las.
Other tags, e.g. textual, grammatical, semantic
What do we want tags for?
More pre-processing, less post-processing
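The id tags on this slide are what link each source-text unit to its translation. A minimal sketch in Python, assuming (hypothetically) that each side of the corpus is available as a list of lines, each starting with an `<id=...>` tag; the function name `pair_units`, the line-based layout and the exact spelling of the id are illustrative assumptions, not COMPARA's actual storage format:

```python
import re

# Assumed layout: one alignment unit per line, prefixed by an <id=...> tag.
ID_TAG = re.compile(r"<id=([^>]+)>\s*(.*)")

def pair_units(source_lines, target_lines):
    """Pair source and target units that carry the same id tag."""
    targets = {}
    for line in target_lines:
        match = ID_TAG.match(line.strip())
        if match:
            # Several target lines may share one id; keep them together.
            targets.setdefault(match.group(1), []).append(match.group(2))
    pairs = []
    for line in source_lines:
        match = ID_TAG.match(line.strip())
        if match:
            unit_id, text = match.groups()
            pairs.append((unit_id, text, " ".join(targets.get(unit_id, []))))
    return pairs

# Example unit from the slide (id spelling assumed).
en = ["<id=EBJT1 1845> …their winglike ears pierced with plastic identity tags."]
pt = ["<id=EBJT1 1845> …suas orelhas, que faziam lembrar asas, furadas e umas "
      "etiquetas de plástico a identificá-las."]
pairs = pair_units(en, pt)
```

Each element of `pairs` is an `(id, source text, target text)` triple, which is the shape most concordance-style displays need.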
Our options for COMPARA
A bidirectional parallel corpus of English and Portuguese
Funding: Portuguese Government and European Union (FEDER and FSE), contract ref. POSC/339/1.3/C/NAC
Project leaders: Ana Frankenberg-Garcia & Diana Santos
Research assistants: Pedro Sousa, Rosário Silva & Susana Inácio
Corpus structure: parallel, bi-directional
PT source texts and their EN translations
EN source texts and their PT translations
(Diagram: source texts (ST) and translations (TT) in PT and EN, labelled PT1, EN1, PT2, EN2)
• Can be used as parallel and comparable
• One part of the corpus can be used as a control for the other
Language varieties
PORTUGUESE: Portugal, Brazil, Mozambique, Angola
ENGLISH: UK, US, South Africa
Publication dates (timeline): 2002, 1997, 1988, 1914, 1880, 1837
Genre: published fiction
Portuguese authors
Angola: José Eduardo Agualusa
Brazil: Aluísio Azevedo, Autran Dourado, Chico Buarque, Jô Soares, José de Alencar, Machado de Assis, Manuel Antônio de Almeida, Marcos Rey, Patrícia Melo, Paulo Coelho, Rubem Fonseca
Mozambique: Mia Couto
Portugal: Camilo Castelo Branco, Eça de Queirós, José Cardoso Pires, José Saramago, Jorge de Sena, Lídia Jorge, Mário de Carvalho, Sá Carneiro (Fernando Pessoa)
English authors
British Isles: David Lodge, Ian McEwan, Julian Barnes, Joseph Conrad, Joanna Trollope, Kazuo Ishiguro, Lewis Carroll, Mary Shelley, Oscar Wilde
United States: Henry James, Edgar Allan Poe, Richard Zimler
South Africa: Nadine Gordimer
Portuguese translators: Ana Maria Amador, Ana Falcão Bastos, Ana Luísa Faria, Aníbal Fernandes, Carlos Grifo Babo, Cristina Ferreira de Almeida, Cristina Rodriguez, Eduardo Guerra Carneiro, Fernanda Pinto Rodrigues, Geraldo Galvão Ferraz, Helena Cardoso, Januário Leite, José Viera Lima, J. Teixeira de Aguilar, Lídia Cavalcante-Luther, Lucinda Santos Silva, Luís Lobo, Manuel João Gomes, M. F. Gonçalves de Azevedo, Maria Carlota Pracana, Maria do Carmo Figueira, Mário Martins de Carvalho, Nina Videira, Paula Reis, Yolanda Artiaga.
English translators: Adria Frizzi, Alan Clarke, Alexis Levitin, Alice Clemente, Cliff Landers, David Brookshaw, David Rosenthal, Elizabeth Lowe, Ellen Watson, Helen Caldwell, Giovanni Pontiero, Graeme MacNicoll, Gregory Rabassa, Isabel Burton, John Gledson, John Parker, John Byrne, John Vetch, Margaret Jull Costa, Mary Fitton, Natália Costa, Peter Bush, Richard Zenith, Ronald W. Sousa.
Can any text be included in the corpus?
• Only published source texts and translations
• Only English translated directly from Portuguese and Portuguese translated directly from English
• Only human translations!
Texts: 75 translations of 72 source texts (extracts)
Size: 1,543,514 words in English; 1,436,187 words in Portuguese
Tags for highlighted text
PMMC1.en: His wife, who had returned from the machamba <tnote> Small plot of land for cultivation</tnote>, interrupted his thoughts:
EBDL2T1.en: When we sat on the sofa together to watch <title>News at Ten</title>
EBDL1T1.pt: …the fellow was off on his patrol boat, <named> The Bahian </named>
EBJB1.en: the white bear, <foreign> thalassarctos maritimus </foreign>, is the aristocrat of bears...
EBLC1.en: `I wish <emph>I</emph> could manage to be glad!´ the Queen said.
(tags inserted manually during OCR output cleaning)
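Inline tags like these can be pulled out, or stripped away, with a few lines of code. The following is a rough sketch in Python, assuming the tags are well-formed and never nested; the tag list is copied from the slides, while the function names `extract_highlights` and `unwrap` are hypothetical helpers, not part of any COMPARA tool:

```python
import re

# Tag names taken from the slides; the corpus may use others as well.
HIGHLIGHT_TAGS = ("tnote", "title", "named", "foreign", "emph")
TAG_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(HIGHLIGHT_TAGS), re.DOTALL)

def extract_highlights(text):
    """Return (tag, content) pairs for every highlighted span in text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]

def unwrap(text):
    """Drop translator's notes and unwrap the other highlight tags."""
    text = re.sub(r"\s*<tnote>.*?</tnote>", "", text, flags=re.DOTALL)
    return TAG_RE.sub(lambda m: m.group(2).strip(), text)

sentence = ("the white bear, <foreign> thalassarctos maritimus </foreign>, "
            "is the aristocrat of bears")
highlights = extract_highlights(sentence)
plain = unwrap(sentence)
```

Removing translator's notes before counting words is one design choice among several; a study of explicitation might instead want to keep and count them.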
Grammar tags: Portuguese (PALAVRAS + human revision)
Petrus/PROP pediu/V_fmc a/DETartd especialidade/N da/PRP+DETartd casa/N --/PU uma/DETarti paella/N valenciana/ADJ --/PU que/SPECrel comemos/V em/PRP silêncio/N ,/PU acompanhados/V apenas/ADV do/PRP+DETartd saboroso/ADJ vinho/N Rioja/PROP ./PU
Grammar tags: English (CLAWS7 + human revision)
Strangers/NN2 come/VV0 here/RL for/IF the/AT feria/NN1 ,/PU expecting/VVG to/TO be/VBI swept/VVN up/RP into/II a/AT1 great/JJ flamenco/NN1 carnival/NN1
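Part-of-speech annotation in this word/TAG style is easy to read back into a program. A small sketch in Python, assuming whitespace-separated tokens with the tag after the last slash; the function name `parse_tagged` is hypothetical, and the sample is a shortened, cleaned-up version of the English example on the slide:

```python
def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words containing '/' survive.
        word, sep, tag = token.rpartition("/")
        if sep and word:
            pairs.append((word, tag))
    return pairs

sample = "Strangers/NN2 come/VV0 here/RL for/IF the/AT feria/NN1 ,/PU"
tokens = parse_tagged(sample)

# CLAWS7 noun tags begin with "NN" (e.g. NN1 singular, NN2 plural).
nouns = [word for word, tag in tokens if tag.startswith("NN")]
```

The same function works on the Portuguese line, since PALAVRAS tags such as `PRP+DETartd` contain no slash.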
Semantic tags
Colour: manually inserted (Santos et al. 2008)
I did, too --changed over to the knitted tie at a <sem="colour"> red </sem>light.
`Or that sort of pale <sem="colour"> olive </sem>, ´ says Sandra.
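Once colour words are tagged, frequency counts become straightforward. The sketch below, in Python, assumes the `<sem=...>` markup shown on the slide and tolerates either straight or typographic quotes around the attribute value; the function name `count_sem` is hypothetical, and the two sample sentences are the ones from the slide:

```python
import re
from collections import Counter

# Accept straight or typographic quotes around the attribute value.
SEM_RE = re.compile(r'<sem=["“]([^"”]+)["”]>\s*(.*?)\s*</sem>', re.DOTALL)

def count_sem(sentences, category="colour"):
    """Count the words tagged with the given semantic category."""
    counts = Counter()
    for sentence in sentences:
        for cat, word in SEM_RE.findall(sentence):
            if cat == category:
                counts[word.lower()] += 1
    return counts

sample = [
    'I did, too --changed over to the knitted tie at a <sem="colour"> red </sem>light.',
    "`Or that sort of pale <sem=“colour”> olive </sem>, ´ says Sandra.",
]
colour_counts = count_sem(sample)
```

Run separately over source texts and translations, counts of this kind give the quantifiable, comparable results that the opening slides list as an advantage of corpora.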
Alignment
1 alignment unit = 1 source-text sentence
(Table of ST-to-TT sentence correspondences; cells as extracted: S S S S 2 S½ Ø S(+S))
EasyAlign + human revision
Special alignment tags
Encoding: IMS Corpus Workbench format (Christ 1994)
Interface: DISPARA, Web (Santos 2002)
URL: www.linguateca.pt/COMPARA/
Interface: www.linguateca.pt/COMPARA/
• Free, no registration required
• PT and EN service
• Easy to use by people who have never heard of corpora before
• Powerful and flexible tool for experienced corpus users
• Results good for research and education
Research in Translation Studies
• Word-sense disambiguation
• Loan words
• Text length
• Explicitation
• Colour
• Distinctive lexical distributions
Published articles available at www.linguateca.pt/COMPARA/
Studies unthinkable before corpora… Plenty of room for further research!