Compiling and using a parallel corpus for Translation Studies
Ana Frankenberg-Garcia

The study of human translation
  • Traditionally not a hard science
  • Difficult to be systematic
With the advances of corpus linguistics, things can change (Baker 1993)

Advantages of using corpora to study human translation
  • An enormous amount of translated texts
  • Systematic analyses
  • Quantifiable results

Corpora used in translation research
I. Comparable corpora
  • Bilingual: Tognini-Bonelli & Manca (2002), EN Farmhouse holidays and IT Agriturismo
  • Monolingual: Olohan & Baker (2000), EN BNC and EN Translational English Corpus

Corpora in translation research
II. Parallel corpora
  • Unidirectional: L1-L2, e.g. P-ACTRES EN-ES
  • Bidirectional: L1-L2 & L2-L1, e.g. COMPARA PT-EN & EN-PT

Compiling parallel corpora: text selection
  • Genre (scientific, imaginative, technical, etc.)
  • Mode (oral? written?)
  • Variety (standard? regional?)
  • Time (contemporary? older?)
  • Languages (which? just two or more?)
  • Translations (professional? native speakers? different translators?)
  • Unidirectional or bidirectional?
Are there translations?

Compiling parallel corpora: example of interrelated factors for PT-EN
Diagram labels: spoken popular scientific academic tourism literature politics

Compiling parallel corpora: copyright
  • Personal use: no hassle, but for limited use
  • Shared use: permissions mean lots of work, but lots of users and uses, and results are verifiable

Compiling parallel corpora: copyright
  • Two permissions, double the work
      – Public domain ST, but copyright TT
  • Publishers, authors and translators generally don’t know what a corpus is
  • Protect
  • Advertise

Compiling parallel corpora: alignment
Text? Paragraph? Sentence? Clause? Word?

Compiling parallel corpora: alignment tags
<id=EBJT 1 1845> …their winglike ears pierced with plastic identity tags.
<id=EBJT 1 1845> …suas orelhas, que faziam lembrar asas, furadas e umas etiquetas de plástico a identificá-las.
Other tags, e.g. textual, grammatical, semantic
What do we want tags for?
More pre-processing, less post-processing
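A shared id tag is what lets a source segment and its translation be retrieved together. The Python sketch below pairs segments by that tag. It assumes each segment sits on its own line and starts with a tag of the form `<id=...>`, copied as shown on the slide; `pair_segments` is a hypothetical helper for illustration, not part of any corpus tool.

```python
import re

# Assumed layout: one segment per line, "<id=...>" followed by the text.
ID_TAG = re.compile(r"<id=([^>]+)>\s*(.*)")

def pair_segments(source_lines, target_lines):
    """Return (id, source_text, target_text) for ids found on both sides."""
    targets = {}
    for line in target_lines:
        match = ID_TAG.match(line.strip())
        if match:
            targets[match.group(1)] = match.group(2)
    pairs = []
    for line in source_lines:
        match = ID_TAG.match(line.strip())
        if match and match.group(1) in targets:
            seg_id = match.group(1)
            pairs.append((seg_id, match.group(2), targets[seg_id]))
    return pairs

source = ["<id=EBJT 1 1845> …their winglike ears pierced with plastic identity tags."]
target = ["<id=EBJT 1 1845> …suas orelhas, que faziam lembrar asas, furadas e "
          "umas etiquetas de plástico a identificá-las."]
pairs = pair_segments(source, target)
for seg_id, st, tt in pairs:
    print(seg_id, "|", st, "|", tt)
```

Segments whose id appears on only one side are silently skipped here; a real workflow would probably want to report them for human revision.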

Our options for a bidirectional parallel corpus of English and Portuguese
  • Funding: Portuguese Government and European Union (FEDER and FSE), contract ref. POSC/339/1.3/C/NAC
  • Project leaders: Ana Frankenberg-Garcia & Diana Santos
  • Research assistants: Pedro Sousa, Rosário Silva & Susana Inácio

Corpus structure: parallel, bi-directional
  • PT source texts and their EN translations
  • EN source texts and their PT translations

Diagram labels: PT, PT1, EN1, ST, EN, PT2, EN2, TT
  • Can be used as parallel and comparable
  • One part of the corpus can be used as a control for the other

Language varieties
  • Portuguese: Portugal, Brazil, Mozambique, Angola
  • English: UK, US, South Africa

Publication dates
2002, 1997, 1988, 1914, 1880, 1837

Genre: published fiction

Portuguese authors
  • Angola: José Eduardo Agualusa
  • Brazil: Aluísio Azevedo, Autran Dourado, Chico Buarque, Jô Soares, José de Alencar, Machado de Assis, Manuel Antônio de Almeida, Marcos Rey, Patrícia Melo, Paulo Coelho, Rubem Fonseca
  • Mozambique: Mia Couto
  • Portugal: Camilo Castelo Branco, Eça de Queirós, José Cardoso Pires, José Saramago, Jorge de Sena, Lídia Jorge, Mário de Carvalho, Sá Carneiro (Fernando Pessoa)

English authors
  • British Isles: David Lodge, Ian McEwan, Julian Barnes, Joseph Conrad, Joanna Trollope, Kazuo Ishiguro, Lewis Carroll, Mary Shelley, Oscar Wilde
  • United States: Henry James, Edgar Allan Poe, Richard Zimler
  • South Africa: Nadine Gordimer

Portuguese translators
Ana Maria Amador, Ana Falcão Bastos, Ana Luísa Faria, Aníbal Fernandes, Carlos Grifo Babo, Cristina Ferreira de Almeida, Cristina Rodriguez, Eduardo Guerra Carneiro, Fernanda Pinto Rodrigues, Geraldo Galvão Ferraz, Helena Cardoso, Januário Leite, José Viera Lima, J. Teixeira de Aguilar, Lídia Cavalcante-Luther, Lucinda Santos Silva, Luís Lobo, Manuel João Gomes, M. F. Gonçalves de Azevedo, Maria Carlota Pracana, Maria do Carmo Figueira, Mário Martins de Carvalho, Nina Videira, Paula Reis, Yolanda Artiaga.

English translators
Adria Frizzi, Alan Clarke, Alexis Levitin, Alice Clemente, Cliff Landers, David Brookshaw, David Rosenthal, Elizabeth Lowe, Ellen Watson, Helen Caldwell, Giovanni Pontiero, Graeme MacNicoll, Gregory Rabassa, Isabel Burton, John Gledson, John Parker, John Byrne, John Vetch, Margaret Jull Costa, Mary Fitton, Natália Costa, Peter Bush, Richard Zenith, Ronald W. Sousa.

Can any text be included in the corpus?
  • Only published source texts and translations
  • Only English translated directly from Portuguese, and Portuguese translated directly from English
  • Only human translations!

Texts
  • 75 translations
  • 72 source texts (extracts)

Size
  • 1,543,514 words in English
  • 1,436,187 words in Portuguese

Tags for highlighted text
PMMC1.en: His wife, who had returned from the machamba <tnote> Small plot of land for cultivation</tnote>, interrupted his thoughts:
EBDL2T1.en: When we sat on the sofa together to watch <title>News at Ten</title>
EBDL1T1.pt: …the fellow was off on his patrol boat, <named> The Bahian </named>

Tags for highlighted text
EBJB1.en: the white bear, <foreign> thalassarctos maritimus </foreign>, is the aristocrat of bears...
EBLC1.en: `I wish <emph>I</emph> could manage to be glad!´ the Queen said.
(tags inserted manually during OCR output cleaning)
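One practical use of such inline tags is to list everything marked as a title, foreign word, translator's note and so on. The sketch below is a minimal illustration in Python. It assumes simple, non-nested `<name>...</name>` pairs like those on the slides, and `extract_tags` is a hypothetical helper, not a function of the corpus interface.

```python
import re

# Matches a simple, non-nested tag pair such as <title>...</title>.
INLINE_TAG = re.compile(r"<(\w+)>(.*?)</\1>")

def extract_tags(text):
    """Return (tag_name, tagged_text) pairs, with surrounding spaces trimmed."""
    return [(name, content.strip()) for name, content in INLINE_TAG.findall(text)]

samples = [
    "When we sat on the sofa together to watch <title>News at Ten</title>",
    "the white bear, <foreign> thalassarctos maritimus </foreign>, is the aristocrat of bears",
    "`I wish <emph>I</emph> could manage to be glad!´ the Queen said.",
]
found = [pair for sample in samples for pair in extract_tags(sample)]
for name, content in found:
    print(name, "->", content)
```

Nested or attribute-bearing tags would need a proper markup parser rather than a regular expression.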

Grammar tags
Portuguese: PALAVRAS + human revision
Petrus/PROP pediu/V_fmc a/DETartd especialidade/N da/PRP+DETartd casa/N --/PU uma/DETarti paella/N valenciana/ADJ --/PU que/SPECrel comemos/V em/PRP silêncio/N ,/PU acompanhados/V apenas/ADV do/PRP+DETartd saboroso/ADJ vinho/N Rioja/PROP ./PU

Grammar tags
English: CLAWS7 + human revision
Strangers/NN2 come/VV0 here/RL for/IF the/AT feria/NN1 ,/PU expecting/VVG to/TO be/VBI swept/VVN up/RP into/II a/AT1 great/JJ flamenco/NN1 carnival/NN1
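Annotation in this word/TAG style is easy to process once each token is split at its last slash. The Python sketch below shows one way to do that and to count tags. It assumes tokens are separated by whitespace and written as word/TAG with no internal spaces; `parse_tagged` is a hypothetical helper, and the sample is a shortened version of the tagged sentence on the slide.

```python
from collections import Counter

def parse_tagged(line):
    """Split 'word/TAG' items into (word, tag) pairs."""
    tokens = []
    for item in line.split():
        # Split at the LAST slash so tags such as PRP+DETartd stay intact.
        word, sep, tag = item.rpartition("/")
        if sep and word:  # skip items with no slash or with an empty word
            tokens.append((word, tag))
    return tokens

sample = "Strangers/NN2 come/VV0 here/RL expecting/VVG to/TO be/VBI swept/VVN up/RP"
tokens = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in tokens)
print(tokens[:3])
print(tag_counts.most_common(3))
```

The same function handles the Portuguese example, since punctuation tokens such as `,/PU` and contracted forms such as `da/PRP+DETartd` still contain exactly one separating slash.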

Semantic tags
Colour, manually inserted (Santos et al 2008)
I did, too --changed over to the knitted tie at a <sem="colour"> red </sem> light.
`Or that sort of pale <sem="colour"> olive </sem>,´ says Sandra.
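With colour terms marked up like this, counting them becomes a matter of pattern matching. The sketch below is a minimal Python illustration. It assumes the `<sem=...>` syntax shown on the slide, accepts straight or curly quotes around the category name, and uses a hypothetical helper `sem_terms`; it is not the query mechanism of the corpus interface.

```python
import re
from collections import Counter

# Category name in straight or curly quotes, then the tagged term.
SEM_TAG = re.compile(r'<sem=["“”]([^"“”]+)["“”]>\s*(.*?)\s*</sem>')

def sem_terms(text, category):
    """Return the lower-cased terms tagged with the given semantic category."""
    return [term.lower() for cat, term in SEM_TAG.findall(text) if cat == category]

lines = [
    'changed over to the knitted tie at a <sem="colour"> red </sem>light.',
    'Or that sort of pale <sem=“colour”> olive </sem>, says Sandra.',
]
colour_counts = Counter(term for line in lines for term in sem_terms(line, "colour"))
print(colour_counts)
```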

Alignment
1 alignment unit = 1 source-text sentence
Diagram labels: ST, TT; S S S S 2 S½ Ø S(+S)
EasyAlign + human revision
Special alignment tags
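The rule that one alignment unit equals one source-text sentence can be pictured as a small data structure: exactly one source sentence, linked to whatever the translator produced for it. The Python sketch below is only an illustration under that reading. The class name, the field names and the three category labels are hypothetical, and it does not try to detect units that correspond to part of a target sentence.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AlignmentUnit:
    source: str                                       # exactly one source-text sentence
    targets: List[str] = field(default_factory=list)  # zero or more target sentences

    def kind(self) -> str:
        """Classify the unit by how many target sentences it is linked to."""
        if not self.targets:
            return "omitted"       # source sentence with no translation (Ø)
        if len(self.targets) == 1:
            return "one-to-one"
        return "one-to-many"       # source sentence split into several target sentences

units = [
    AlignmentUnit("Sentence A.", ["Frase A."]),
    AlignmentUnit("Sentence B, which is long.", ["Frase B.", "Que é longa."]),
    AlignmentUnit("Sentence C."),
]
kinds = [unit.kind() for unit in units]
print(kinds)
```

Anchoring every unit on the source sentence keeps the source text intact and pushes all the variation onto the target side, which is where human revision of the automatic alignment is most likely to be needed.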

Encoding: IMS Corpus Workbench format (Christ 1994)
Interface: DISPARA Web (Santos 2002)
URL: www.linguateca.pt/COMPARA/

Interface: www.linguateca.pt/COMPARA/
  • Free, no registration required
  • PT and EN service
  • Easy to use by people who have never heard of corpora before
  • Powerful and flexible tool for experienced corpus users
  • Results good for research and education

Research in Translation Studies
  • Word-sense disambiguation
  • Loan words
  • Text length
  • Explicitation
  • Colour
  • Distinctive lexical distributions
Published articles available at www.linguateca.pt/COMPARA/
Studies unthinkable before corpora…
Plenty of room for further research!