Workshop: AntConc as a corpus-linguistic program for ad-hoc LSP purposes
Presentation for the NDSU Conference by Birthe Mousten, MA, Ph.D., 19 July 2016

ABSTRACT
Corpus linguistics is a field associated with tracking language in mega-corpora such as those behind Oxford’s, Longman’s and Webster’s dictionaries. Even though such dictionaries are based on huge corpora, they often fail to answer even fairly simple questions about technical and scientific terminology and lexicography. This knowledge representation gap does not mean that corpus linguistics is useless for LSP purposes; it stems from the lack of LSP input in the corpora used for dictionary tracking. This is where an ad-hoc corpus comes into the picture. An ad-hoc corpus is collected on the fly, typically for a certain LSP task at a certain time. It can therefore be used as a tool for technical writers and translators who need to swiftly map the lexicographic, terminological and genre characteristics of a new, delimited field. The ad-hoc corpus tool used here is the freeware program AntConc. Join me for an ad-hoc LSP task.

CORPUS LINGUISTICS - OVERVIEW
• Articles about ad-hoc corpus linguistics
• Mega-corpora
• Ad-hoc corpora
• Traditional corpus use
• Cross-linguistic corpus use
• Specialized language corpora
• Training with AntConc, a freeware program

CORPUS LINGUISTICS - ARTICLES Our humble start:

CORPUS LINGUISTICS - TUTORIAL

CORPUS LINGUISTICS – CASE STUDY

Mega corpora
• American National Corpora: http://corpus.byu.edu/
• British National Corpus: http://www.natcorp.ox.ac.uk/
• Wortschatz: http://wortschatz.uni-leipzig.de/
• KorpusDK: http://ordnet.dk/

OVERVIEW AMERICAN NATIONAL CORPORA

BRITISH NATIONAL CORPUS

GERMAN NATIONAL CORPUS

DANISH NATIONAL CORPUS

TASK LATER: KNOWLEDGE ABOUT INSULIN ANALOGS
• You will be asked to work with insulin analogs later, so why not test that term right now?
• We will search the megacorpora for insulin analog and see where it gets us.
• Let us try the American megacorpus first.

AMERICAN NOW CORPUS
Result: 8 hits = 8 texts; on closer inspection maybe only 4, of which one is not American English but probably Indian English, and one is international English. The three hits from the FDA are probably from only one text. So in practice, from an American point of view, there is the FDA text and the Seeking Alpha text. Not very impressive: it shows the need for further searching to work with the area. However, why not take the FDA text for our corpus now that we have it? So we click the text and…

THE FIRST REFERENCE RENDERS THIS:
… we get this text, which we copy for our text corpus.

OUR CORPUS TEXT NO. 1 The same text copied to Word.

STARTING OUR COLLECTION FOR THE CORPUS
1) Copy the text
2) Open Word
3) Save the text in Word as a file in a folder:
- Save as -> source/title/date (or your chosen parameters)
- Save as .txt by choosing plain text => all HTML codes are removed
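For those who prefer to script the saving step instead of going through Word, here is a minimal Python sketch of the same idea. The function names, the underscore-separated naming scheme and the UTF-8 encoding are assumptions made for illustration; the workshop itself only prescribes "source/title/date" and plain text.

```python
import re
import tempfile
from pathlib import Path


def corpus_filename(source: str, title: str, date: str) -> str:
    """Build a name like source_title_date.txt (hypothetical naming scheme)."""
    parts = [source, title, date]
    # Replace every run of characters that is not a letter, digit or hyphen
    # with one underscore so the name is safe on most file systems.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "_", p).strip("_") for p in parts]
    return "_".join(cleaned) + ".txt"


def save_corpus_text(folder: Path, source: str, title: str, date: str, text: str) -> Path:
    """Write the copied text as a plain-text file in the corpus folder."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / corpus_filename(source, title, date)
    path.write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    demo_folder = Path(tempfile.mkdtemp()) / "insulin_corpus"
    saved = save_corpus_text(demo_folder, "FDA", "Insulin analog", "2016-07-19", "Sample text.")
    print(saved.name)  # FDA_Insulin_analog_2016-07-19.txt
```

Keeping source, title and date in the file name means the origin of every hit stays visible later, when the corpus tool lists results per file.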

Save as My chosen folder Source text date Plain text

BUILDING UP THE CORPUS
• I have to search elsewhere for text, and why not in the largest big-data corpus in the world: Google.
• I use Google advanced search; it is easier and quicker.
• But one small step before that: I read my wiki.
• https://www.google.ca/advanced_search

WIKIPEDIA
To help me make a very precise search on Google, I sneak a peek at Wikipedia to see whether it knows anything about insulin analog:

An insulin analog is an altered form of insulin, different from any occurring in nature, but still available to the human body for performing the same action as human insulin in terms of glycemic control. Through genetic engineering of the underlying DNA, the amino acid sequence of insulin can be changed to alter its ADME (absorption, distribution, metabolism, and excretion) characteristics. Officially, the U.S. Food and Drug Administration (FDA) refers to these as "insulin receptor ligands", although they are more commonly referred to as insulin analogs.

These modifications have been used to create two types of insulin analogs: those that are more readily absorbed from the injection site and therefore act faster than natural insulin injected subcutaneously, intended to supply the bolus level of insulin needed at mealtime (prandial insulin); and those that are released slowly over a period of between 8 and 24 hours, intended to supply the basal level of insulin during the day and particularly at nighttime (basal insulin). The first insulin analog approved for human therapy (insulin Lispro rDNA) was manufactured by Eli Lilly and Company.

THEN GOOGLE ADVANCED I want these parameters My key concept Most recent year
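The two parameters on this slide, the key concept and the most recent year, can also be written directly into a search URL. The sketch below is only an illustration: the query parameter names (as_epq for an exact phrase, as_qdr for the time window, lr for the language) are the field names I assume the advanced search form uses, and the function name is hypothetical. Check the URL in a browser before relying on it.

```python
from urllib.parse import urlencode


def advanced_search_url(exact_phrase: str, period: str = "y", language: str = "lang_en") -> str:
    """Build a Google search URL for an exact phrase within a recent period.

    Assumed parameter names (not taken from the slides):
      as_epq = exact phrase, as_qdr = time window ("y" = past year),
      lr = language restriction.
    """
    params = {"as_epq": exact_phrase, "as_qdr": period, "lr": language}
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    print(advanced_search_url("insulin analog"))
```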

GOOGLE ADVANCED RESULTS
OK, let’s get to it then. Give yourself 10 minutes to copy and paste into your folder.

TEN MINUTES LATER I HAVE MY CORPUS
Then I am ready for my corpus work.

Now: our task
We just got a task from a company, say Eli Lilly or Novo Nordisk, about writing or translating something about insulin analogs. How can a corpus help you?
1) Register
2) Collocations
3) Definitions
4) Synonyms
5) Knowledge!
6) Etc.

Getting the program
Find Laurence Anthony’s website; just search for the name on Google. By the way, the address is here: http://www.antlab.sci.waseda.ac.jp/software.html#antpconc
Press: AntConc 3.4.4 (or the Mac or other version that is compatible with your computer).
NB: The language encoding must be Western Latin 1! (Check under Global Settings – Language encoding and set it to Western Latin 1.)
Please join me here!
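If some corpus files were saved in another encoding, they can be converted so that they match the Western Latin 1 setting. This is a minimal sketch under the assumption that the originals are UTF-8; the function name is hypothetical, and characters that do not exist in Latin 1 (curly quotes, for instance) are replaced by a question mark.

```python
from pathlib import Path


def reencode_to_latin1(src: Path, dst: Path, source_encoding: str = "utf-8") -> None:
    """Read a text file in source_encoding and write it again as Latin 1 (ISO 8859-1)."""
    text = src.read_text(encoding=source_encoding)
    # errors="replace" turns characters outside Latin 1 into "?" instead of failing.
    dst.write_bytes(text.encode("latin-1", errors="replace"))


if __name__ == "__main__":
    import tempfile

    folder = Path(tempfile.mkdtemp())
    original = folder / "utf8.txt"
    original.write_text("café", encoding="utf-8")
    converted = folder / "latin1.txt"
    reencode_to_latin1(original, converted)
    print(converted.read_bytes())  # b'caf\xe9'
```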

Loading your corpus into the program
Guide:
• Press AntConc 3.4.4 (the language version must be Latin 1)
• Load your folder (Windows Explorer method)
• Then you are ready to search.

ANTCONC OPENED – BUT EMPTY
AntConc is now open.
1) Press File
2) Open Directory
3) Load your corpus folder in the Windows

YOUR CORPUS LIST

PRESS WORD LIST
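To see what a word list is made of, the same kind of ranking can be sketched in a few lines of Python. This is not AntConc's own implementation: the token pattern, the function names and the extra column counting how many texts contain each word are assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed token definition: letters, optionally joined by hyphens or apostrophes.
TOKEN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


def load_corpus(folder: Path, encoding: str = "latin-1") -> dict[str, str]:
    """Read every .txt file in the folder into a {file name: text} mapping."""
    return {p.name: p.read_text(encoding=encoding) for p in sorted(folder.glob("*.txt"))}


def word_list(texts: dict[str, str]) -> list[tuple[str, int, int]]:
    """Return (word, frequency, number of texts containing it), most frequent first."""
    freq: Counter = Counter()
    spread: Counter = Counter()
    for text in texts.values():
        tokens = tokenize(text)
        freq.update(tokens)
        spread.update(set(tokens))  # count each word once per text
    rows = [(w, n, spread[w]) for w, n in freq.items()]
    return sorted(rows, key=lambda r: (-r[1], r[0]))


if __name__ == "__main__":
    demo = {
        "a.txt": "Lantus is an insulin analog.",
        "b.txt": "Insulin analogs such as Lantus.",
    }
    for word, count, n_texts in word_list(demo)[:2]:
        print(word, count, n_texts)
```

The third column helps to separate words that run through the whole field from words that occur in only a few sources.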

LANTUS – WHAT IS THAT? Only three sources: -> Product name? Check in texts.

WHAT IS HYPOGLYCEMIA?
If you click some of the words, you go directly into the texts. Scroll down -> different sources use the word => ESP word

THE OR TRICK
Finding:
• Content knowledge
• Alternatives
• Synonyms
Try also: aka / referred to / known as / (
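The OR trick is a keyword-in-context search for small linking expressions. A rough Python sketch of such a search follows; the function name, the context width and the exact list of expressions in the pattern are assumptions for illustration, not a copy of what the program does.

```python
import re

# Expressions that often introduce an alternative name or a synonym.
LINKERS = r"\b(?:or|aka|referred to|known as)\b"


def kwic(text: str, pattern: str = LINKERS, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per hit, with the hit in square brackets."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = "Insulin glargine, also known as Lantus, is a basal or long-acting insulin."
    for line in kwic(sample):
        print(line)
```

In the sample sentence the two hits point to a product name ("known as Lantus") and to a near-synonym pair ("basal or long-acting").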

THE ( TRICK – FINDING PARENTHETICAL INFO Which kind of info would you find in a parenthesis and what does that data tell you?
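A parenthesis search can be imitated with a regular expression that captures the word in front of the opening parenthesis together with the content inside. The pattern and the function name below are assumptions for illustration; nested parentheses are not handled.

```python
import re

# One word (letters, digits, hyphens) directly before "(...)" plus the content inside.
PAREN = re.compile(r"([\w-]+)\s*\(([^()]+)\)")


def parentheticals(text: str) -> list[tuple[str, str]]:
    """Return (word before the parenthesis, text inside the parenthesis) pairs."""
    return [(m.group(1), m.group(2).strip()) for m in PAREN.finditer(text)]


if __name__ == "__main__":
    sample = (
        "The sequence can be changed to alter its ADME (absorption, distribution, "
        "metabolism, and excretion) characteristics. The Food and Drug "
        "Administration (FDA) refers to these as insulin receptor ligands."
    )
    for before, inside in parentheticals(sample):
        print(before, "->", inside)
```

In the sample, one parenthesis expands an abbreviation and the other introduces one, which is typical of the definitions and short forms found this way.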

FINDING COLLOCATIONS - RIGHT
Set the sorting parameters at the bottom: 1R, 2R, 3R.
Findings:
• Metabolic changes
• Metabolic control
• Metabolic decompensation
• Metabolic deterioration
All of them are concepts in their own right.
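Sorting on 1R, 2R and 3R means ordering the hits alphabetically by the first, second and third word to the right of the search word. Here is a small Python sketch of that sorting, with hypothetical function names and a simple assumed tokenizer; negative positions stand for words to the left (for example -1 for 1L).

```python
import re


def tokenize(text: str) -> list[str]:
    """Split lower-cased text into simple word tokens (assumed definition)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def sorted_concordance(text, node, positions=(1, 2, 3), span=3):
    """Return (left words, node, right words) hits sorted by the given positions.

    Positive positions count to the right of the node (1 = 1R),
    negative positions count to the left (-1 = 1L).
    """
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            hits.append((tokens[max(0, i - span):i], tok, tokens[i + 1:i + 1 + span]))

    def sort_key(hit):
        left, _, right = hit
        key = []
        for p in positions:
            if p > 0:
                key.append(right[p - 1] if len(right) >= p else "")
            else:
                key.append(left[p] if len(left) >= -p else "")
        return key

    return sorted(hits, key=sort_key)


if __name__ == "__main__":
    sample = "Poor metabolic control. Good metabolic changes. Rapid metabolic deterioration."
    for left, node, right in sorted_concordance(sample, "metabolic"):
        print(" ".join(left), f"[{node}]", " ".join(right))
```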

FINDING COLLOCATIONS - LEFT
Set the sorting parameters at the bottom: 1L, 2L, 3L.
Findings:
• Good metabolic
• Poor metabolic
• Rapid metabolic
… but, for instance, not bad metabolic!

STATISTICS & COLLOCATION – FOR THE NERDS
Shows ranking, frequency left, frequency right, statistics and the collocate. Note, for instance, that regimen can be used with dosing as both a left and a right collocate. The same meaning?
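The frequency columns of such a collocate table can be reproduced with two counters, one for each side of the search word. The sketch below uses hypothetical names and leaves out the statistics column, since the slides do not say which measure the program reports.

```python
from collections import Counter


def collocates(tokens, node, span=1):
    """Return (collocate, total, frequency left, frequency right) rows for node.

    span is the number of words inspected on each side of every hit.
    """
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left.update(tokens[max(0, i - span):i])
        right.update(tokens[i + 1:i + 1 + span])
    total = left + right
    rows = [(w, total[w], left[w], right[w]) for w in total]
    return sorted(rows, key=lambda r: (-r[1], r[0]))


if __name__ == "__main__":
    sample = "the dosing regimen and regimen dosing differ".split()
    for row in collocates(sample, "regimen"):
        print(row)
```

A collocate with counts on both sides, like dosing in the toy sentence, is exactly the kind of case worth checking in context to see whether the meaning is the same.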

FILE VIEW – FOR BETTER CONTEXT

CONCORDANCE PLOT FOR PRECAUTIONS
Shows in which texts a word occurs and how its use is distributed throughout each text. Precautions tends to appear near the front of the texts. A possible genre tool?
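The idea behind such a plot is simple: every hit is placed at its relative position in the text, from 0 (start) to 1 (end). A minimal text-only sketch, with hypothetical function names and an arbitrary plot width:

```python
def hit_positions(tokens, node):
    """Return the relative position (0 = start of text, towards 1 = end) of each hit."""
    n = len(tokens)
    return [i / n for i, tok in enumerate(tokens) if tok == node]


def plot_line(positions, width=40):
    """Draw one text as a row of dashes with a bar at every hit position."""
    cells = ["-"] * width
    for p in positions:
        cells[min(int(p * width), width - 1)] = "|"
    return "".join(cells)


if __name__ == "__main__":
    tokens = ["precautions"] + ["word"] * 9
    print(plot_line(hit_positions(tokens, "precautions"), width=10))  # |---------
```

One row per text, stacked under each other, shows at a glance whether a word clusters at the beginning, the middle or the end of the documents.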

YOUR TURN
• Let your imagination loose.
• What do you find out?
• What can it be used for?
• Is it useful in the first place?

MY OPINION – GOOD
• The program is intuitive to use; I never learned it from anyone.
• General Microsoft processes work in AntConc.
• Good as an ad-hoc writing and translation tool.
• Good for register work.
• Good for collocations.
• Good for proof of what you are doing.
• Forget about your intuitive ideas and check how it works.
• Define a problem and devise your own solution method.
• Even 10 texts, as in our case, can provide a wealth of information.
• Ten times faster and 100 times more reliable than tiresomely opening and reading x number of Google docs.
• Must nowadays necessarily replace any old-fashioned pencil-and-paper work.

MY OPINION - LIMITATIONS
• The quality of the findings depends on the user: quality in => quality out.
• You have to get started in order to like it.
• You have to learn approx. five shortcuts in order not to tire out.

• Thank you for joining me in this.
• I wish you the best of luck with the rest of the conference.
• If you want to contact me, please write to me:
• bmo@dac.au.dk / bmo@expo-com.dk