UNIVERSITÀ DEGLI STUDI DI MACERATA
Dipartimento di Studi Umanistici – Lingue, Mediazione, Storia, Lettere, Filosofia
Corso di Laurea Magistrale in Lingue Moderne per la Comunicazione e la Cooperazione Internazionale (Classe LM-38)
Traduzione per la Comunicazione Internazionale – inglese – mod. B
STRUMENTI E TECNOLOGIE PER LA TRADUZIONE SPECIALISTICA (Tools and Technologies for Specialised Translation)
a.a. 2019/2020
Francesca Raffi
francesca.raffi@unimc.it

What is a corpus? Some (authoritative) definitions
• “a collection of naturally-occurring language text, chosen to characterize a state or variety of a language” (Sinclair, 1991: 171)
• “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992: 7)
• “a closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria” (Engwall, 1992: 167)
• “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery & Wilson, 1996: 23)
• “a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006: 5)

What is / is not a corpus…?
• A newspaper archive on CD-ROM?
• An online glossary?
• A digital library (e.g. Project Gutenberg)?
• All RAI 1 programmes (e.g. for spoken TV language)?
The answer is always “NO” (see definition)

Corpora vs. web
• Corpora:
  – Usually stable
    • searches can be replicated
  – Control over contents
    • we can select the texts to be included, or have control over selection strategies
  – Ad-hoc linguistically-aware software to investigate them
    • concordancers can sort / organise concordance lines
• Web (as accessed via Google or other search engines):
  – Very unstable
    • results can change at any time for reasons beyond our control
  – No control over contents
    • what/how many texts are indexed by Google’s robots?
  – Limited control over search results
    • cannot sort or organise hits meaningfully; their order is decided by the search engine, not by linguistic criteria

What types of corpora exist? A brief overview
• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify corpora:
  – general vs. specialised
  – written vs. spoken (transcribed) vs. multimodal (audio/video)
  – balanced (sample) vs. opportunistic
  – synchronic vs. diachronic
  – static vs. dynamic
  – closed / finite vs. open-ended (monitor)
  – raw (pre-corpus) vs. marked-up, POS-tagged or annotated (augmented)
  – monolingual vs. bi- / multilingual
  – parallel vs. comparable

An example of planned balance: the British National Corpus
• 100 million words of contemporary spoken and written British English
• Representative of British English “as a whole”
• Designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)
• Balanced with regard to genre, subject matter and style
• Sampling and representativeness very difficult to ensure

BNC
• 4,124 texts: 90% written, 10% spoken
• Largest collection of spoken English ever collected (10 million words), but reflects typical imbalance in favour of written text (for understandable practical reasons)
• Written portion: 75% informative, 25% imaginative

BNC written material
Sources:
• 60% books
• 25% periodicals
• 5% brochures and other ephemera
  – e.g. bus tickets, produce containers, junk mail
• 5% unpublished letters, essays, minutes
• 5% plays, speeches (written to be spoken)
Register levels:
• 30% literary or technical “high”
• 45% “middle”
• 25% informal “low”

BNC Subject coverage
• Planned to reflect pattern of book publishing in UK over last 20 years

Subject             Number of texts   % of total written
Imaginative         625               22
World affairs       453               18
Social science      510               15
Leisure             374               11
Applied science     364               8
Commerce            284               8
Arts                259               8
Natural science     144               4
Belief & thought    146               3
Unclassified        50                3

BNC Spoken corpus
• Context-governed material
  – Lectures, tutorials, classrooms
  – News reports
  – Product demonstrations, consultations, interviews
  – Sermons, political speeches, public meetings, parliamentary debates
  – Sports commentaries, phone-ins, chat shows
• Samples from 12 different regions

BNC Spoken corpus
• Ordinary conversation
  – 2000 hrs from 124 volunteers, 38 different regions
  – Four different socio-economic groupings
  – Equal male and female, age range 15 to 60+
  – All conversations over a 2-day period recorded
  – No secret recording; participants were allowed to erase recordings
  – Systematic details kept of time, location, details of participants (sex, age, race, occupation, education, social group), topic, etc.
• Transcription issues:
  – include false starts, hesitations, etc.
  – some paralinguistic features (shouting, whispering), use of dialect words/grammar
  – but no phonetic information

Dynamic (Monitor) vs static (Finite)
• A static corpus will give a snapshot of language use at a given time
  – Easier to control balance of content
  – May limit usefulness, esp. as time passes
• A dynamic corpus is ever-changing
  – Called “monitor” corpus because it allows us to monitor language change over time

Key concepts and technical notions in corpus-based translation studies
• Wordlist, frequency list, keyword list
• Types, tokens, type/token ratio (lexical variation)
• Function/grammatical words vs. content/lexical words (lexical density)
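A frequency list is simply a wordlist ordered by number of occurrences. The Python sketch below is only an illustration of the idea: the function name `frequency_list`, the very simple regular-expression tokeniser and the sample sentence are assumptions made here, not part of any concordancer.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first.

    Tokenisation is deliberately naive: lowercase the text and keep
    runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

# Invented sample sentence, for illustration only.
sample = "Researchers took to the streets and the squares"
for word, count in frequency_list(sample):
    print(f"{count:>3}  {word}")
```

Real tools differ mainly in how they tokenise (hyphens, numbers, clitics), which is why two programs can report slightly different counts for the same text.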

“Type” and “token”
• “Token” means an individual occurrence of a word
• “Type” means a distinct word, counted once however many times it occurs
• The man saw the girl with the telescope
  – 8 tokens, 6 types
• “Type” may refer to lexeme, or individual word form
  – run, runs, ran, running: 1 or 4 types?
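The count for the example sentence can be checked with a short Python sketch. It treats a type as a distinct lowercased word form (so run, runs, ran, running would count as 4 types); the function names and the tokeniser are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_stats(text):
    """Return (number of tokens, number of types, type/token ratio)."""
    tokens = tokenize(text)
    types = set(tokens)  # each distinct word form counted once
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ratio

n_tokens, n_types, ttr = type_token_stats("The man saw the girl with the telescope")
print(n_tokens, n_types, ttr)  # 8 6 0.75
```

The three occurrences of "the" are three tokens but one type, which is where the difference between 8 and 6 comes from. Counting lexemes instead of word forms would require lemmatisation first.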

Key concepts and technical notions
• Wordlist, frequency list, keyword list
• Types, tokens, type/token ratio (lexical variation)
• Function/grammatical words vs. content/lexical words (lexical density)
• Concordance (concordancing software)
• KWIC (keyword in context)
• Nodeword
• Sorting

Concordance for nodeword “eyes” (sorted 1L) generated from the BNC [screenshot not reproduced]
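What a KWIC concordance sorted on the first word to the left (1L) involves can be sketched in Python. This is a toy version under stated assumptions: the function names, the three-word context window and the sample text are invented here, and real concordancers such as those used with the BNC work on indexed corpora rather than raw strings.

```python
import re

def kwic(text, node, width=3):
    """Return (left context, node, right context) for every hit of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return lines

def sort_1l(lines):
    """Sort concordance lines by the first word to the left of the node."""
    return sorted(lines, key=lambda line: line[0][-1] if line[0] else "")

# Invented sample text, for illustration only.
sample = "The eyes of the world. She closed her eyes. His eyes were blue."
for left, node, right in sort_1l(kwic(sample, "eyes")):
    print(f"{' '.join(left):>20}  [{node}]  {' '.join(right)}")
```

Sorting on 1L groups together lines such as "her eyes" and "his eyes", which is what makes recurrent patterns to the left of the nodeword visible at a glance.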

Key concepts and technical notions
• Wordlist, frequency list, keyword list
• Types, tokens, type/token ratio (lexical variation)
• Function/grammatical words vs. content/lexical words (lexical density)
• Concordance (concordancing software)
• KWIC (keyword in context)
• Nodeword
• Sorting
• Collocation (collocates)
• Lemmatisation (morphological analysis)
• (POS-)Tagging (grammatical analysis)
• Parsing (syntactic analysis)
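Collocates are the words that occur near a nodeword. A minimal Python sketch of counting them within a fixed span is given below; the function name `collocates`, the span of two words on each side and the sample text are assumptions for illustration, and no statistical association measure is applied.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count the words found within `span` positions of each hit of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, for illustration only.
sample = "They took to the streets. Crowds took to the streets again."
print(collocates(sample, "streets").most_common())
```

Raw counts like these are dominated by function words ("to", "the"), which is one reason corpus tools usually rank collocates with an association score rather than by frequency alone.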

www.nature.com/nature/journal/v455/n7215/full/455835b.html

General / reference monolingual corpora (of English)
Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.

Took to the streets
• http://corpus.leeds.ac.uk/internet.html
  – English
• Let’s try to understand:
  – Meaning
  – Extended (sentential) co-text, preferential co-selections
  – Context(s) of use
  – Semantic preference
  – Semantic prosody

Using general / reference monolingual corpora (from/on the Web): Leeds Internet corpora
http://corpus.leeds.ac.uk/internet.html

Let’s explore internal variation - Examples of (possible) useful queries
• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• Other verbs? (collocational flexibility)
• Other nouns? (collocational flexibility)
• Select “CQP syntax only” (automatic POS-tagging!)
  – http://cwb.sourceforge.net/files/CQP_Tutorial/
• Look at the examples on the following slides for guidance and adapt those models to your searches
• Try out a number of different options to familiarise yourself with the search syntax, and understand what kinds of searches it can support

Examples of (possible) useful queries
• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• [lemma="take"] "to" "the" [lemma="street"]
  – Lemmatised search: finds all possible forms of verb and noun
• [pos="V.*"] "to" "the" [lemma="street"]
  – Lemmatised and POS-specific search: as above but finds all verbs
• [lemma="take"] "to" "the" [pos="N.*"]
  – Lemmatised and POS-specific search: as above but finds all nouns
• Click on the link to the left of the concordance line for context
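How the three queries differ can be seen on a toy scale with the Python sketch below. It is not the CWB/CQP engine: it only imitates the idea that each bracket constrains one token through regular expressions over attributes. The annotated tokens, the tag labels and the function names are all assumptions made for illustration.

```python
import re

# Hand-annotated toy data: (word, lemma, POS). The tags are illustrative only.
ROWS = [
    ("Researchers", "researcher", "NNS"), ("took", "take", "VBD"),
    ("to", "to", "TO"), ("the", "the", "DT"), ("streets", "street", "NNS"),
    ("crowds", "crowd", "NNS"), ("went", "go", "VBD"),
    ("to", "to", "TO"), ("the", "the", "DT"), ("streets", "street", "NNS"),
    ("he", "he", "PRP"), ("takes", "take", "VBZ"),
    ("to", "to", "TO"), ("the", "the", "DT"), ("stage", "stage", "NN"),
]
TOKENS = [dict(zip(("word", "lemma", "pos"), row)) for row in ROWS]

def token_matches(token, constraint):
    """True if every attribute regex in the constraint matches the whole value."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())

def search(tokens, pattern):
    """Return every token sequence that satisfies the pattern position by position."""
    hits = []
    size = len(pattern)
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(" ".join(t["word"] for t in window))
    return hits

# Python counterparts of the three queries on the slide.
Q1 = [{"lemma": "take"}, {"word": "to"}, {"word": "the"}, {"lemma": "street"}]
Q2 = [{"pos": "V.*"}, {"word": "to"}, {"word": "the"}, {"lemma": "street"}]
Q3 = [{"lemma": "take"}, {"word": "to"}, {"word": "the"}, {"pos": "N.*"}]

for query in (Q1, Q2, Q3):
    print(search(TOKENS, query))
```

On this toy data the first pattern finds only "took to the streets", the second also admits other verbs ("went to the streets"), and the third also admits other nouns ("takes to the stage"). An empty constraint `{}` would play the role of `[]`, matching any token.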

Now the translation into Italian of “took to the streets”
• Verb?
  – andare
  – scendere
  – …?
• Preposition?
  – in
  – nella/nelle
  – per la/per le?
  – …?
• Noun?
  – strada/strade
  – piazza/piazze
  – …?
Which queries do we need? How many are necessary?

Now the translation into Italian of “take to the street”
• Very general (slower):
  – [pos="V.*"] [] [lemma="strada"]
  – [pos="V.*"] [] [lemma="piazza"]
  – NB: [] means ‘any word in that position’
• More specific/restrictive:
  – [pos="V.*"] [word="(in|nella|nelle)"] [lemma="strada"]
  – [pos="V.*"] [word="(in|nella|nelle)"] [lemma="piazza"]
  – NB: | is called ‘pipe’, lists alternatives
• Very specific/restrictive:
  – [lemma="scendere"] [word="(in|nella|nelle)"] [lemma="strada"]
  – [lemma="andare"] [word="(in|nella|nelle)"] [lemma="piazza"]
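The question of how many very specific queries are needed can be answered by enumerating the verb and noun combinations. The Python sketch below builds the query strings in the same syntax as above; the helper name `build_queries` and the choice of candidate lists are assumptions for illustration.

```python
from itertools import product

def build_queries(verbs, prepositions, nouns):
    """Build one CQP-style query string per verb/noun combination."""
    prep = "|".join(prepositions)
    return [
        f'[lemma="{verb}"] [word="({prep})"] [lemma="{noun}"]'
        for verb, noun in product(verbs, nouns)
    ]

VERBS = ["scendere", "andare"]
PREPOSITIONS = ["in", "nella", "nelle"]
NOUNS = ["strada", "piazza"]

queries = build_queries(VERBS, PREPOSITIONS, NOUNS)
for query in queries:
    print(query)
print(len(queries))  # 4
```

Two verbs and two nouns give four very specific queries. Since attribute values are regular expressions, the pipe could presumably also be used inside the lemma constraints, e.g. `[lemma="(scendere|andare)"]`, to cover the same ground with a single query; this should be checked against the CQP tutorial before relying on it.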

Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.

REGISTER ONE’S OPPOSITION
• Now search the BNC for this expression.
• What does it mean?
• Which “feelings” are usually “registered”?
  – interest, concern, support, dismay, frustrations, dissatisfaction, disapproval, protest, commitment, …

Monolingual general / reference corpora available online (at least partially, i.e. as demos)
• British National Corpus (BNC, British English)
  – www.natcorp.ox.ac.uk
• COCA (American English)
  – http://corpus.byu.edu/coca/
• The CORIS corpus (Italian)
  – http://corpora.dslo.unibo.it/coris_ita.html
• Leeds Internet corpora
  – English, Chinese, Arabic, French, German, Italian, Japanese, Polish, Portuguese, Russian, Spanish: http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
  – http://corpora.ids-mannheim.de/ccdb
• Corpus del Español (Spanish)
  – www.corpusdelespanol.org
• CREA (Spanish)
  – http://corpus.rae.es/creanet.html
Explore the Web to see what other corpora are available!

A translation-relevant corpus typology
Corpora:
• general / reference (monolingual)
• specialised (monolingual)
• multilingual (usually):
  – comparable: comparable texts in terms of genre/text type or topic. Usually rather small, created ad hoc for specific tasks («DIY»), «disposable»
  – parallel: original texts aligned to corresponding translations. Typically available and precompiled (as for general / reference monolingual corpora)

Sentence-level alignment (new line delimited)

EN: She is the author of numerous articles regarding learning disabilities and she speaks often before parent and teacher groups concerning learning and behavior problems.
IT: È autrice di numerosi articoli riguardanti le disabilità di apprendimento e ha tenuto spesso conferenze davanti a gruppi di genitori e insegnanti sui problemi del comportamento e dell’apprendimento.

EN: All expectations need to be direct and explicit. Don't require this child to 'read between the lines' to glean your intentions.
IT: Esplicitare chiaramente tutte le aspettative, in modo da non richiedere al bambino di “leggere tra le righe” per cogliere le intenzioni.

EN: Obviously, the child with nonverbal learning disorders would not be expected to be the 'scribe' in a cooperative grouping - her contribution should be in the verbal arena.
IT: Ovviamente non ci si deve aspettare che sia lo “scriba” del gruppo cooperativo, il suo contributo deve essere inserito nell’arena verbale.
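With newline-delimited alignment, segment n of the source text corresponds to segment n of the target text, so pairing them is a matter of reading both texts line by line. The Python sketch below illustrates this under stated assumptions: the function name `align_segments` is invented here, and each text is assumed to hold exactly one segment per non-empty line.

```python
def align_segments(source_text, target_text):
    """Pair the n-th non-empty line of the source with the n-th of the target."""
    source = [line.strip() for line in source_text.splitlines() if line.strip()]
    target = [line.strip() for line in target_text.splitlines() if line.strip()]
    if len(source) != len(target):
        raise ValueError(
            f"segment counts differ: {len(source)} source vs {len(target)} target"
        )
    return list(zip(source, target))

# Two of the aligned segments shown above, one segment per line.
english = (
    "All expectations need to be direct and explicit. "
    "Don't require this child to 'read between the lines' to glean your intentions.\n"
    "Obviously, the child with nonverbal learning disorders would not be expected "
    "to be the 'scribe' in a cooperative grouping - her contribution should be in "
    "the verbal arena.\n"
)
italian = (
    "Esplicitare chiaramente tutte le aspettative, in modo da non richiedere al "
    "bambino di “leggere tra le righe” per cogliere le intenzioni.\n"
    "Ovviamente non ci si deve aspettare che sia lo “scriba” del gruppo "
    "cooperativo, il suo contributo deve essere inserito nell’arena verbale.\n"
)

for en, it in align_segments(english, italian):
    print("EN:", en)
    print("IT:", it)
    print()
```

Note that a segment is not always a single sentence: in the first pair, two English sentences correspond to one Italian sentence, so they share one line. The length check catches files whose line counts have drifted apart.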

Bilingual parallel corpora on the web
• OPUS corpus, opus.lingfil.uu.se
  – A variety of multilingual parallel corpora
    • European Parliament debates (EuroParl corpus)
    • European Central Bank corpus
    • UN documents
    • Subtitles (open subtitle project)
    • Software manuals (PHP, OO)
    • …
  – With linguistic annotation
  – Online interface based on CWB/CQP syntax
  – Corpora can also be downloaded for local use
• COMPARA (EN-PT)
• OSLO Multilingual Corpus

OPUS search interfaces, http://opus.lingfil.uu.se/ [screenshots not reproduced]
• EuroParl v7 search interface: help; choose SL; query; choose TL(s); other useful functions; sort; launch the query
• EuroParl v7 search interface, example query: [word="a|an|the"] [tnt="JJ.*"] "issue"
• OPUS multilingual search interface > Europarl: query; launch the query; choose TL(s); format of search results

A translation-relevant corpus typology
Corpora:
• general / reference (monolingual)
• specialised (monolingual)
• multilingual (usually):
  – comparable: comparable texts in terms of genre/text type or topic. Normally relatively small, created ad hoc for specific translation assignments («DIY»), «disposable», for texts belonging to specialised domains
  – parallel

Using comparable corpora for translation
• Learn something about a specific domain/topic
• Understand the source text
• Choose the “right” TL term/word/collocation
• Identify and reproduce the features of the specific genre/register in the TL
• Look for equivalents, definitions and contexts of use in both the source and target language

Source text (we are the translator/interpreter)
P. R. O. Bally (1959) “Monadenium arborescens”. Candollea 17: 25-26.
Coming from Tanzania, this is a robust growing species and is a semi woody succulent, forming a lightly branched shrub/tree up to 4.25 metres high. The stems can grow to 10 cm. thick, are five angled and may be slightly spirally twisted. They are erect and may be solitary or in twos. If branched, the branches are quite slender, grow erect, and are some 30 – 60 cm. apart. They are smooth, and covered in a green bloom. Leaf scars, which are 10 mm. in diameter, are borne 4 – 7 cm. apart and below each leaf scar is a small tubercle which on older plants has a small reddish/brown spine, but a more robust one up to 2 cm. long on is produced on younger plants. The leaves are crowded terminally around the ends of the stems and are produced from the angles of the stems. They are obovate, pointed and heart shaped, 7 – 19 cm long and 5.6 – 11 cm. wide. Flowering takes place from an eye situated directly above the leaf scar and several cymes may be produced near the apex of the branches with peduncles 6 – 7 cm. long and 5 – 6 mm. thick. The colour of the inflorescence is red. This species is not in general cultivation due to its rapid

The process for manual corpus construction
• We want to build a bilingual specialised comparable corpus for the translation task (English → Italian)
• Two stages:
  a) Source language corpus component (English)
  b) Target language corpus component (Italian)

Searching for similar SL (English) texts for the corpus
• We look for:
  – web pages in English, as similar to our ST as possible
  – e.g. searching for ‘monadenium’ on google.co.uk
• We find, e.g.:
  – en.wikipedia.org/wiki/Monadenium_arborescens
  – www.sdcss.com/monadenium.html
  – davesgarden.com/guides/pf/go/65135/
  – www.gardening.eu/plants/Succulent-Plants/Monadenium-guentheri/3708/
• You can add to the search string: monadenium filetype:pdf
  – In general, pages in pdf format tend to be more informative and authoritative

Evaluating the pages found [screenshots not reproduced]:
• Uninformative, different genre
• Very informative, authoritative (source: San Diego Cactus Society), similar genre (journal article)
• Uninformative, little connected text, different function (promotional) and genre
• Low quality, unreliable (language)

Searching for TL texts
• We look for “monadenium” in Italian (reliable) webpages, e.g.:
  – http://www.giardinaggio.it/grasse/singolegrasse/Monadenium.asp

Practical considerations: file types
• Corpus files must be downloaded/saved in this format:
  – Simple/pure text (.txt)
  – save as “text only”
• Common formats of online texts (these must be converted into, i.e. saved as, .txt format):
  – HTML
    • File → Save as → xxx.txt
    • (just modify the file extension)
  – Microsoft Word
    • Save as → xxx.txt
    • File type → plain text “.txt” → OK (ignore any error message)
  – pdf
    • image/“dead pdf” (not good) vs. searchable pdf (OK)
    • Edit → Select all → Copy → Paste in a new text file → Save
• Plan separate folders for each corpus (sub-)component
  – e.g. SL/TL, but also more/less authoritative, different genres etc.
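If a saved web page still contains markup, the tags can also be stripped programmatically before the text goes into the corpus. The Python sketch below uses only the standard library; the class and function names, the list of block-level tags and the sample page are assumptions for illustration, and real pages with menus or adverts will need extra manual cleaning.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the plain text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"[ \t]+", " ", "".join(parser.parts))
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)

# Invented sample page, for illustration only.
page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Monadenium</h1><p>A succulent <b>shrub</b>.</p></body></html>"
)
print(html_to_text(page))
```

The resulting string can then be written to a `.txt` file in the folder for the relevant corpus component, for example with `pathlib.Path("xxx.txt").write_text(text, encoding="utf-8")`.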

Practical considerations: corpus query tools
Now that we have built the corpus, what concordancing tools are available?
• AntConc – user-friendly, many functionalities, and you can download it (for free and legally!!) from this URL:
  – www.laurenceanthony.net/software.html
• TextStat – free, includes an interesting web-spider which downloads as many pages as you want from a particular website (good if you have identified a reliable website)
  – http://neon.niederlandistik.fu-berlin.de/en/textstat/
• WordSmith Tools – commercial tool
  – (older) version 4.0 now freely available
  – http://lexically.net/wordsmith/version4/index.htm

• AntConc – user-friendly, many functionalities, and you can download it (for free and legally!!) from this URL:
  – www.laurenceanthony.net/software.html
And what can we do with it?