UNIVERSITÀ DEGLI STUDI DI MACERATA
Dipartimento di Studi Umanistici – Lingue, Mediazione, Storia, Lettere, Filosofia
Corso di Laurea Magistrale in Lingue Moderne per la Comunicazione e la Cooperazione Internazionale (Classe LM-38)
Traduzione per la Comunicazione Internazionale – inglese – mod. B
STRUMENTI E TECNOLOGIE PER LA TRADUZIONE SPECIALISTICA (Tools and Technologies for Specialised Translation)
Academic year 2019/2020
Francesca Raffi
francesca.raffi@unimc.it
What is a corpus? Some (authoritative) definitions
• “a collection of naturally-occurring language text, chosen to characterize a state or variety of a language” (Sinclair, 1991: 171)
• “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992: 7)
• “a closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria” (Engwall, 1992: 167)
• “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery & Wilson, 1996: 23)
• “a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006: 5)
What is / is not a corpus…?
• A newspaper archive on CD-ROM?
• An online glossary?
• A digital library (e.g. Project Gutenberg)?
• All RAI 1 programmes (e.g. for spoken TV language)?
The answer is always “NO” (see definition)
Corpora vs. web
• Corpora:
  – Usually stable
    • searches can be replicated
  – Control over contents
    • we can select the texts to be included, or have control over selection strategies
  – Ad-hoc linguistically-aware software to investigate them
    • concordancers can sort / organise concordance lines
• Web (as accessed via Google or other search engines):
  – Very unstable
    • results can change at any time for reasons beyond our control
  – No control over contents
    • what/how many texts are indexed by Google’s robots?
  – Limited control over search results
    • cannot sort or organise hits meaningfully; they are presented randomly
What types of corpora exist? A brief overview
• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify corpora:
  – general / specialised
  – written / spoken (transcribed) / multimodal (audio/video)
  – balanced (sample) / opportunistic
  – synchronic / diachronic
  – static / dynamic
  – closed / finite vs. open-ended (monitor)
  – raw (pre-corpus) vs. augmented: marked-up, POS-tagged, annotated
  – monolingual / bi- / multilingual
  – parallel / comparable
An example of planned balance: the British National Corpus
• 100 million words of contemporary spoken and written British English
• Representative of British English “as a whole”
• Designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)
• Balanced with regard to genre, subject matter and style
• Sampling and representativeness are very difficult to ensure
BNC
• 4,124 texts: 90% written, 10% spoken
• Largest collection of spoken English ever collected (10 million words), but reflects typical imbalance in favour of written text (for understandable practical reasons)
• Written portion: 75% informative, 25% imaginative
BNC written material
Sources:
• 60% books
• 25% periodicals
• 5% brochures and other ephemera
  – e.g. bus tickets, produce containers, junk mail
• 5% unpublished letters, essays, minutes
• 5% plays, speeches (written to be spoken)
Register levels:
• 30% literary or technical “high”
• 45% “middle”
• 25% informal “low”
BNC Subject coverage
• Planned to reflect pattern of book publishing in UK over last 20 years

| Subject | Number of texts | % of total written |
|---|---|---|
| Imaginative | 625 | 22 |
| World affairs | 453 | 18 |
| Social science | 510 | 15 |
| Leisure | 374 | 11 |
| Applied science | 364 | 8 |
| Commerce | 284 | 8 |
| Arts | 259 | 8 |
| Natural science | 144 | 4 |
| Belief & thought | 146 | 3 |
| Unclassified | 50 | 3 |
BNC Spoken corpus
• Context-governed material
  – Lectures, tutorials, classrooms
  – News reports
  – Product demonstrations, consultations, interviews
  – Sermons, political speeches, public meetings, parliamentary debates
  – Sports commentaries, phone-ins, chat shows
• Samples from 12 different regions
BNC Spoken corpus
• Ordinary conversation
  – 2000 hrs from 124 volunteers, 38 different regions
  – Four different socio-economic groupings
  – Equal male and female, age range 15 to 60+
  – All conversations over a 2-day period recorded
  – No secret recording, and participants were allowed to erase recordings
  – Systematic details kept of time, location, details of participants (sex, age, race, occupation, education, social group), topic, etc.
• Transcription issues:
  – include false starts, hesitations, etc.
  – some paralinguistic features (shouting, whispering), use of dialect words/grammar
  – but no phonetic information
Dynamic (monitor) vs static (finite)
• A static corpus will give a snapshot of language use at a given time
  – Easier to control balance of content
  – May limit usefulness, esp. as time passes
• A dynamic corpus is ever-changing
  – Called “monitor” corpus because it allows us to monitor language change over time
Key concepts and technical notions in corpus-based translation studies • Wordlist, frequency list, keyword list • Types, tokens, type/token ratio (lexical variation) • Function/grammatical words vs. content/lexical words (lexical density)
“Type” and “token”
• “Token” means an individual occurrence of a word
• “Type” means a distinct word, counted once however many times it occurs
• The man saw the girl with the telescope
  – 8 tokens, 6 types
• “Type” may refer to a lexeme or to an individual word form
  – run, runs, ran, running: 1 or 4 types?
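The count for the example sentence can be checked with a short script. This is only a minimal sketch: the function names `tokenize` and `type_token_stats` are invented for illustration, the tokenizer is a crude regular expression, and types are counted as lower-cased word forms (so run, runs, ran, running would count as 4 types, not 1 lexeme).

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_stats(text):
    """Return (number of tokens, number of types, type/token ratio)."""
    tokens = tokenize(text)
    types = set(tokens)  # each distinct word form counts once
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ratio

sentence = "The man saw the girl with the telescope"
n_tokens, n_types, ttr = type_token_stats(sentence)
print(n_tokens, n_types, ttr)  # 8 6 0.75
```

Counting lexemes instead of word forms would require a lemmatiser, which this sketch deliberately leaves out.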
Key concepts and technical notions • Wordlist, frequency list, keyword list • Types, tokens, type/token ratio (lexical variation) • Function/grammatical words vs. content/lexical words (lexical density) • Concordance (concordancing software) • KWIC (keyword in context) • Nodeword • Sorting
Concordance for nodeword “eyes” (sorted 1L, i.e. by the first word to the left of the node), generated from the BNC
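What a concordancer does when it builds KWIC lines and sorts them 1L can be sketched in a few lines. This is an illustration under simplifying assumptions, not how any particular tool is implemented: the functions `kwic` and `sort_1l` and the sample text are made up, and the tokenizer ignores sentence boundaries.

```python
import re

def kwic(text, node, width=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_1l(lines):
    """Sort concordance lines by the first word to the left of the node (1L)."""
    return sorted(lines, key=lambda line: line[0][-1] if line[0] else "")

sample = "His eyes were blue. She closed her eyes. The cat opened its eyes."
for left, node, right in sort_1l(kwic(sample, "eyes")):
    # Right-align the left context so the node words line up in a column
    print(f"{' '.join(left):>20}  [{node}]  {' '.join(right)}")
```

Sorting on the word to the left groups lines such as "her eyes" and "his eyes" together, which is what makes recurrent patterns visible at a glance.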
Key concepts and technical notions • Wordlist, frequency list, keyword list • Types, tokens, type/token ratio (lexical variation) • Function/grammatical words vs. content/lexical words (lexical density) • Concordance (concordancing software) • KWIC (keyword in context) • Nodeword • Sorting • Collocation (collocates) • Lemmatisation (morphological analysis) • (POS-)Tagging (grammatical analysis) • Parsing (syntactic analysis)
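Collocates can be approximated by counting the words that occur within a fixed span around the node. The sketch below only counts raw co-occurrence frequency; the function name `collocates`, the span of 2 and the sample text are assumptions made for illustration, and concordancing tools typically also offer statistical association measures that are not shown here.

```python
from collections import Counter
import re

def collocates(text, node, span=2):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

sample = "Protesters took to the streets. Students took to the streets again."
print(collocates(sample, "streets").most_common(3))
```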
www.nature.com/nature/journal/v455/n7215/full/455835b.html
General / reference monolingual corpora (of English)
Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.
Took to the streets
• http://corpus.leeds.ac.uk/internet.html
• English
• Let’s try to understand:
  – Meaning
  – Extended (sentential) co-text, preferential co-selections
  – Context(s) of use
  – Semantic preference
  – Semantic prosody
Using general / reference monolingual corpora (from/on the Web): Leeds Internet corpora
http://corpus.leeds.ac.uk/internet.html
Let’s explore internal variation - Examples of (possible) useful queries
• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• Other verbs? (collocational flexibility)
• Other nouns? (collocational flexibility)
• Select “CQP syntax only” (automatic POS-tagging!)
  – For the query syntax, see http://cwb.sourceforge.net/files/CQP_Tutorial/
• Look at the examples on the following slides for guidance and adapt those models to your searches
• Try out a number of different options to familiarise yourself with the search syntax, and understand what kinds of searches it can support
Examples of (possible) useful queries
• Any other forms of the verb take? Plural/singular of the noun street? (colligational constraints)
• [lemma="take"] "to" "the" [lemma="street"]
  – Lemmatised search: finds all possible forms of verb and noun
• [pos="V.*"] "to" "the" [lemma="street"]
  – Lemmatised and POS-specific search: as above but finds all verbs
• [lemma="take"] "to" "the" [pos="N.*"]
  – Lemmatised and POS-specific search: as above but finds all nouns
• Click on the link to the left of the concordance line for context
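To see what these token-level constraints actually do, here is a toy matcher in Python. It is not CQP and does not reproduce its behaviour in full: the functions `matches` and `search`, the hand-annotated sample tokens and the tag labels are all assumptions made for illustration. Each query is a list of constraints, one per token position, and each constraint is a regular expression that must match the whole attribute value.

```python
import re

# Hand-annotated toy data (word, lemma, POS); tags are illustrative only.
RAW = [
    ("Researchers", "researcher", "NNS"), ("took", "take", "VBD"),
    ("to", "to", "TO"), ("the", "the", "DT"), ("streets", "street", "NNS"),
    ("while", "while", "IN"), ("others", "other", "NNS"),
    ("returned", "return", "VBD"), ("to", "to", "TO"), ("the", "the", "DT"),
    ("street", "street", "NN"), ("or", "or", "CC"), ("take", "take", "VBP"),
    ("to", "to", "TO"), ("the", "the", "DT"), ("hills", "hill", "NNS"),
]
TOKENS = [{"word": w, "lemma": l, "pos": p} for w, l, p in RAW]

def matches(token, constraint):
    """True if every attribute regex in the constraint matches the whole value."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())

def search(tokens, query):
    """Return the word sequences where the constraints match consecutive tokens."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(matches(tok, con) for tok, con in zip(window, query)):
            hits.append(" ".join(tok["word"] for tok in window))
    return hits

# Counterparts of the three queries on the slide
lemma_query = [{"lemma": "take"}, {"word": "to"}, {"word": "the"}, {"lemma": "street"}]
any_verb = [{"pos": "V.*"}, {"word": "to"}, {"word": "the"}, {"lemma": "street"}]
any_noun = [{"lemma": "take"}, {"word": "to"}, {"word": "the"}, {"pos": "N.*"}]

for query in (lemma_query, any_verb, any_noun):
    print(search(TOKENS, query))
```

Relaxing one position at a time (verb first, then noun) is what lets us test collocational flexibility without losing the rest of the pattern.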
Now the translation into Italian of “took to the streets”
• Verb?
  – andare
  – scendere
  – …?
• Preposition?
  – in
  – nella/nelle
  – per la/per le?
  – …?
• Noun?
  – strada/strade
  – piazza/piazze
  – …?
Which queries do we need? How many are necessary?
Now the translation into Italian of “took to the streets”
• Very general (slower):
  – [pos="V.*"] [] [lemma="strada"]
  – [pos="V.*"] [] [lemma="piazza"]
  – NB: [] means ‘any word in that position’
• More specific/restrictive:
  – [pos="V.*"] [word="(in|nella|nelle)"] [lemma="strada"]
  – [pos="V.*"] [word="(in|nella|nelle)"] [lemma="piazza"]
  – NB: | is called ‘pipe’ and lists alternatives
• Very specific/restrictive:
  – [lemma="scendere"] [word="(in|nella|nelle)"] [lemma="strada"]
  – [lemma="andare"] [word="(in|nella|nelle)"] [lemma="piazza"]
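When several candidate verbs and nouns have to be checked, the most specific queries can be generated rather than typed one by one. The helper below is a hypothetical sketch (`build_queries` is not part of any corpus tool): it only builds query strings in the syntax shown above, one per verb/noun pair, with the prepositions listed as alternatives.

```python
from itertools import product

def build_queries(verbs, prepositions, nouns):
    """Build one query string per verb/noun pair, with the prepositions as alternatives."""
    prep_slot = '[word="(' + "|".join(prepositions) + ')"]'
    return [
        f'[lemma="{verb}"] {prep_slot} [lemma="{noun}"]'
        for verb, noun in product(verbs, nouns)
    ]

queries = build_queries(["scendere", "andare"], ["in", "nella", "nelle"], ["strada", "piazza"])
for q in queries:
    print(q)
```

Two verbs and two nouns give four queries; each new candidate verb or noun multiplies the number, which is one reason to start from the more general patterns.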
Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.
REGISTER ONE’S OPPOSITION
• Now search the BNC for this expression.
• What does it mean?
• Which “feelings” are usually “registered”?
  – interest, concern, support, dismay, frustrations, dissatisfaction, disapproval, protest, commitment, …
Monolingual general / reference corpora available online (at least partially, i.e. as demos)
• British National Corpus (BNC, British English)
  – www.natcorp.ox.ac.uk
• COCA (American English)
  – http://corpus.byu.edu/coca/
• The CORIS corpus (Italian)
  – http://corpora.dslo.unibo.it/coris_ita.html
• Leeds Internet corpora
  – English, Chinese, Arabic, French, German, Italian, Japanese, Polish, Portuguese, Russian, Spanish: http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
  – http://corpora.ids-mannheim.de/ccdb
• Corpus del Español (Spanish)
  – www.corpusdelespanol.org
• CREA (Spanish)
  – http://corpus.rae.es/creanet.html
Explore the Web to see what other corpora are available!
A translation-relevant corpus typology
Corpora
• general / reference: monolingual
• specialised: monolingual or (usually) multilingual
  – comparable: Comparable texts in terms of genre/text type or topic. Usually rather small, created ad hoc for specific tasks («DIY»), «disposable»
  – parallel: Original texts aligned to corresponding translations. Typically available and precompiled (as for general / reference monolingual corpora)
EN: She is the author of numerous articles regarding learning disabilities and she speaks often before parent and teacher groups concerning learning and behavior problems.
IT: È autrice di numerosi articoli riguardanti le disabilità di apprendimento e ha tenuto spesso conferenze davanti a gruppi di genitori e insegnanti sui problemi del comportamento e dell’apprendimento.

EN: All expectations need to be direct and explicit. Don't require this child to 'read between the lines' to glean your intentions.
IT: Esplicitare chiaramente tutte le aspettative, in modo da non richiedere al bambino di “leggere tra le righe” per cogliere le intenzioni.

EN: Obviously, the child with nonverbal learning disorders would not be expected to be the 'scribe' in a cooperative grouping - her contribution should be in the verbal arena.
IT: Ovviamente non ci si deve aspettare che sia lo “scriba” del gruppo cooperativo, il suo contributo deve essere inserito nell’arena verbale.

Sentence-level alignment (new line delimited)
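With new-line-delimited alignment, the n-th line of the source file corresponds to the n-th line of the target file, so pairing the two is a matter of reading both line by line. The sketch below assumes exactly that layout; the function names `align` and `search_aligned` are invented for illustration, and the sample strings are two of the segment pairs shown above.

```python
def align(source_text, target_text):
    """Pair the n-th non-empty source line with the n-th non-empty target line."""
    src = [line.strip() for line in source_text.splitlines() if line.strip()]
    tgt = [line.strip() for line in target_text.splitlines() if line.strip()]
    if len(src) != len(tgt):
        raise ValueError(f"{len(src)} source lines vs {len(tgt)} target lines")
    return list(zip(src, tgt))

def search_aligned(pairs, term):
    """Return the aligned pairs whose source side contains `term` (case-insensitive)."""
    return [(s, t) for s, t in pairs if term.lower() in s.lower()]

EN = """She is the author of numerous articles regarding learning disabilities and she speaks often before parent and teacher groups concerning learning and behavior problems.
All expectations need to be direct and explicit. Don't require this child to 'read between the lines' to glean your intentions."""

IT = """È autrice di numerosi articoli riguardanti le disabilità di apprendimento e ha tenuto spesso conferenze davanti a gruppi di genitori e insegnanti sui problemi del comportamento e dell’apprendimento.
Esplicitare chiaramente tutte le aspettative, in modo da non richiedere al bambino di “leggere tra le righe” per cogliere le intenzioni."""

pairs = align(EN, IT)
for source, target in search_aligned(pairs, "explicit"):
    print(source)
    print(target)
```

Note that one aligned unit may contain two source sentences rendered as a single target sentence, as in the second pair; the line, not the sentence, is the unit of alignment here.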
Bilingual parallel corpora on the web
• OPUS corpus, opus.lingfil.uu.se
  – A variety of multilingual parallel corpora
    • European Parliament debates (Europarl corpus)
    • European Central Bank corpus
    • UN documents
    • Subtitles (open subtitle project)
    • Software manuals (PHP, OO)
    • …
  – With linguistic annotation
  – Online interface based on CWB/CQP syntax
  – Corpora can also be downloaded for local use
• COMPARA (EN-PT)
• OSLO Multilingual Corpus
http://opus.lingfil.uu.se/ Europarl v7 search interface
[Screenshot labels: help; choose SL; query; choose TL(s); other useful functions; sort; launch the query]
http://opus.lingfil.uu.se/ Europarl v7 search interface
Example query: [word="a|an|the"] [tnt="JJ.*"] "issue"
http://opus.lingfil.uu.se/ OPUS multilingual search interface > Europarl
[Screenshot labels: query; launch the query; choose TL(s); format of search results]
A translation-relevant corpus typology
Corpora
• general / reference: monolingual
• specialised: monolingual or (usually) multilingual
  – comparable: Comparable texts in terms of genre/text type or topic. Normally relatively small, created ad hoc for specific translation assignments («DIY»), «disposable», for texts belonging to specialised domains
  – parallel
Using comparable corpora for translation • Learn something about a specific domain/topic • Understand the source text • Choose the “right” TL term/word/collocation • Identify and reproduce the features of the specific genre/register in the TL • Look for equivalents, definitions and contexts of use in both the source and target language
Source text (we are the translator/interpreter)
P.R.O. Bally (1959) “Monadenium arborescens”. Candollea 17: 25-26.
Coming from Tanzania, this is a robust growing species and is a semi woody succulent, forming a lightly branched shrub/tree up to 4.25 metres high. The stems can grow to 10 cm. thick, are five angled and may be slightly spirally twisted. They are erect and may be solitary or in twos. If branched, the branches are quite slender, grow erect, and are some 30 – 60 cm. apart. They are smooth, and covered in a green bloom. Leaf scars, which are 10 mm. in diameter, are borne 4 – 7 cm. apart and below each leaf scar is a small tubercle which on older plants has a small reddish/brown spine, but a more robust one up to 2 cm. long on is produced on younger plants. The leaves are crowded terminally around the ends of the stems and are produced from the angles of the stems. They are obovate, pointed and heart shaped, 7 – 19 cm long and 5.6 – 11 cm. wide. Flowering takes place from an eye situated directly above the leaf scar and several cymes may be produced near the apex of the branches with peduncles 6 – 7 cm. long and 5 – 6 mm. thick. The colour of the inflorescence is red. This species is not in general cultivation due to its rapid […]
The process for manual corpus construction
• We want to build a bilingual specialised comparable corpus for the translation task (English → Italian)
• Two stages:
  a) Source language corpus component (English)
  b) Target language corpus component (Italian)
Searching for similar SL (English) texts for the corpus
• We look for:
  – web pages in English, as similar to our ST as possible
  – e.g. searching for ‘monadenium’ on google.co.uk
• We find, e.g.:
  – en.wikipedia.org/wiki/Monadenium_arborescens
  – www.sdcss.com/monadenium.html
  – davesgarden.com/guides/pf/go/65135/
  – www.gardening.eu/plants/Succulent-Plants/Monadenium-guentheri/3708/
• You can add to the search string: monadenium filetype:pdf
  – In general, pages in pdf format tend to be more informative and authoritative
Uninformative, different genre
Very informative, authoritative (source: San Diego Cactus Society), similar genre (journal article)
Uninformative, little connected text, different function (promotional) and genre
Low quality, unreliable (language)
Searching for TL texts
• We look for “monadenium” in Italian (reliable) webpages, e.g.:
  – http://www.giardinaggio.it/grasse/singolegrasse/Monadenium.asp
Practical considerations: file types
• Corpus files must be downloaded/saved in this format:
  – Simple/pure text (.txt)
  – save as “text only”
• Common formats of online texts (these must be converted into, i.e. saved as, .txt format):
  – HTML
    • File → Save as → xxx.txt (just modify the file extension)
  – Microsoft Word
    • Save as → xxx.txt
    • File type: plain text “.txt” → OK (ignore any error message)
  – pdf
    • image/“dead pdf” (not good) vs. searchable pdf (OK)
    • Edit → Select all → Copy → paste into a new text file → Save
• Plan separate folders for each corpus (sub-)component
  – e.g. SL/TL, but also more/less authoritative, different genres etc.
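If a saved web page still contains markup, the visible text can be extracted before the file goes into the corpus. The following is a minimal sketch using only Python's standard library; the class `TextExtractor`, the function `html_to_text`, the list of block-level tags and the sample page are assumptions made for illustration, and real pages (menus, adverts, cookie banners) usually need extra manual cleaning.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}
    BLOCK = {"p", "h1", "h2", "h3", "li", "div", "tr", "title"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # end of a block element: start a new line

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)

page = ("<html><head><style>p {color: red}</style></head><body>"
        "<h1>Monadenium</h1><p>A succulent <b>shrub</b>.</p></body></html>")
print(html_to_text(page))
```

The resulting string can then be written to a .txt file (for example with UTF-8 encoding) in the folder planned for that corpus component.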
Practical considerations: corpus query tools
Now that we have built the corpus, what concordancing tools are available?
• AntConc – user-friendly, many functionalities, and you can download it (for free and legally!!) from this URL:
  – www.laurenceanthony.net/software.html
• TextStat – free, includes an interesting web-spider which downloads as many pages as you want from a particular website (good if you have identified a reliable website)
  – http://neon.niederlandistik.fu-berlin.de/en/textstat/
• WordSmith Tools – commercial tool
  – (older) version 4.0 now freely available
  – http://lexically.net/wordsmith/version4/index.htm
• AntConc – user-friendly, many functionalities, and you can download it (for free and legally!!) from this URL:
  – www.laurenceanthony.net/software.html
And what can we do with it?