Writing II
Word processors
Spell checkers
Grammar checkers

Word processors
• Typical functions (with linguistic relevance)
  – Text formatting via a graphical user interface
  – Automatic completion/expansion/correction
  – Spelling correction
  – Grammar and style correction
  – Dictionary and thesaurus functions
  – Sorting and collating (of tables)
  – Count words
  – Compare and merge documents
• All of the above according to local norms
  – As determined by language and/or area

Low-level functions
• Text justification
  – Usually involves inserting spaces, though not randomly
  – Arabic allows letter shapes to be stretched
• Counting words
  – What counts as a word?
• Collation, sorting
  – Alphabetical order may differ from language to language
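The question "What counts as a word?" can be made concrete with a small sketch. The two counting conventions below are illustrative assumptions, not what any particular word processor does; the function names and the sample sentence are hypothetical.

```python
import re

def count_whitespace_tokens(text):
    """Count words as maximal runs of non-whitespace characters."""
    return len(text.split())

def count_alphabetic_tokens(text):
    """Count words as maximal runs of letters, so hyphens and
    apostrophes act as word boundaries."""
    return len(re.findall(r"[^\W\d_]+", text))

sample = "It's a well-known state-of-the-art spell checker"

print(count_whitespace_tokens(sample))  # counts "state-of-the-art" as one word
print(count_alphabetic_tokens(sample))  # counts it as four words
```

The same sentence has 6 words under the first convention and 11 under the second, which is why a word count always depends on a definition of "word".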

Low-level functions
• Hyphenation
  – Printing conventions (don’t hyphenate short words, or isolate too few characters)
  – There are rules, which differ by language
    • Eng (morph) present-ation ~ Fr (phon) présen-tation
  – Some rules change spelling
    • Swe trafikk+kultur = trafikkultur, but trafikk-kultur
    • Ger ck → k-k, eg dicke → dik-ke
    • Swiss Ger ss can be split, but not in a compound, eg Stras-se but gross-artig (*gros-sartig)
  – Hyphen is repeated on second line in some languages
  – Hyphenation should avoid misleading the reader, or other unfortunate consequences, eg mish-ap, leg-end, Arse-nal
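As a minimal sketch of a hyphenation rule that changes spelling, the function below applies the German ck → k-k convention from the slide. The function name and the idea of passing the break position by hand are assumptions for illustration; a real hyphenator would also have to decide where breaks are allowed.

```python
def hyphenate_at(word, index):
    """Break `word` before position `index` and insert a hyphen.

    If the break falls between 'c' and 'k', the 'c' is respelled
    as 'k' (the ck -> k-k convention), so the spelling changes.
    """
    head, tail = word[:index], word[index:]
    if head.endswith("c") and tail.startswith("k"):
        head = head[:-1] + "k"
    return head + "-" + tail

print(hyphenate_at("dicke", 3))         # the break falls inside "ck"
print(hyphenate_at("presentation", 7))  # an ordinary break, spelling unchanged
```

The first call gives dik-ke and the second gives present-ation, so the same routine covers both the spelling-changing case and the plain case.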

Spelling checkers
• First developed in early 1980s
• Now standardly available for all text-based software (not just word processors)
• Available for many languages (at least, those for which “spelling” is a relevant concept)
• Nevertheless, still quite crude in design and application

I have a spelling checker,
It came with my PC,
It plain lee marks four my revue
Miss steaks aye can knot sea.
Eye ran this poem threw it,
Your sure reel glad two no.
Its vary polished in it's weigh,
My checker tolled me sew.
[...]
Candidate for a Pullet Surprise or Owed to a Spell Checker
Jerrold H. Zar
http://www.bios.niu.edu/zar/poem.pdf

Spelling checkers
• Operate only at the word level
• Checking each word independently against a word list (dictionary)
  – For most languages this implies some knowledge of morphology for handling inflections
  – Though see what happens when you add a word to the dictionary
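Word-list lookup with a little morphology might be sketched as follows. The toy lexicon, the suffix rules and the function name are all assumptions made for illustration; real checkers use far richer morphological descriptions.

```python
# Hypothetical toy lexicon of base forms.
WORD_LIST = {"check", "word", "spell", "carry", "make"}

# (suffix, replacement) pairs: a deliberately tiny, illustrative rule set.
SUFFIX_RULES = [
    ("ies", "y"),   # carries -> carry
    ("ing", "e"),   # making  -> make
    ("ing", ""),    # spelling -> spell
    ("ed", ""),     # checked -> check
    ("es", ""),
    ("s", ""),      # words -> word
]

def is_known(word, word_list=WORD_LIST):
    """Return True if the word, or a stem obtained by undoing one
    suffix rule, is found in the word list."""
    word = word.lower()
    if word in word_list:
        return True
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            if stem in word_list:
                return True
    return False

for candidate in ["checked", "carries", "making", "wrod", "carryed"]:
    print(candidate, is_known(candidate))
```

Note that this naive rule set accepts the non-word carryed (carry + ed), which shows why suffix stripping alone is not enough and some real knowledge of morphology is needed.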

Spelling checkers
• For not-found words, possible alternatives are suggested
  – Calculated using “Levenshtein distance”
    • Simple string difference calculation
  – May or may not take account of likely errors due to
    • Transpositions of symbols (eg langauge)
    • Transpositions of neighbouring keys (eg levture)
    • Phonetic misspellings (eg fizix)
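To see what "taking account of transpositions" changes, here is a minimal sketch of an edit distance with an optional transposition step (the variant often called optimal string alignment). It is one possible way to do this, not necessarily what any given spelling checker implements; the function and parameter names are hypothetical.

```python
def edit_distance(source, target, allow_transposition=False):
    """Edit distance by dynamic programming.

    With allow_transposition=True, swapping two adjacent symbols
    counts as a single edit instead of two substitutions.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i              # delete all of source[:i]
    for j in range(cols):
        dist[0][j] = j              # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if (allow_transposition and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # adjacent symbols swapped
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]

print(edit_distance("langauge", "language"))
print(edit_distance("langauge", "language", allow_transposition=True))
```

Without the transposition step langauge is two edits away from language; with it, only one, so the intended word moves up the list of suggestions.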

Levenshtein distance
• Smallest number of substitutions, insertions and deletions needed to change one string into another
• Most efficient computer algorithm for calculating this discovered by V. I. Levenshtein in 1965
• (Particular) substitutions, transpositions, etc. may be “weighted” to bias the score
• Considering size of dictionary, processing must be lightning fast
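The definition and the idea of weighting can be sketched together. The code below is a standard two-row dynamic-programming computation of Levenshtein distance with a pluggable substitution cost. The `NEIGHBOUR_KEYS` set is a hypothetical, partial list of adjacent QWERTY keys and the 0.5 cost is an arbitrary illustrative weight.

```python
def weighted_levenshtein(source, target, substitution_cost=None):
    """Levenshtein distance keeping only two rows of the table.

    substitution_cost(a, b) gives the cost of replacing a by b;
    by default every substitution costs 1.
    """
    if substitution_cost is None:
        substitution_cost = lambda a, b: 1.0
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            sub = 0.0 if s_char == t_char else substitution_cost(s_char, t_char)
            current.append(min(
                previous[j] + 1.0,       # deletion
                current[j - 1] + 1.0,    # insertion
                previous[j - 1] + sub,   # substitution or match
            ))
        previous = current
    return previous[-1]

# Hypothetical, partial set of neighbouring keys on a QWERTY keyboard.
NEIGHBOUR_KEYS = {frozenset("cv"), frozenset("xc"), frozenset("vb")}

def keyboard_cost(a, b):
    """Cheaper substitution when the two keys are neighbours."""
    return 0.5 if frozenset((a, b)) in NEIGHBOUR_KEYS else 1.0

print(weighted_levenshtein("kitten", "sitting"))
print(weighted_levenshtein("levture", "lecture", keyboard_cost))
print(weighted_levenshtein("levture", "leisure", keyboard_cost))
```

With the keyboard weighting, levture is closer to lecture (0.5) than to leisure (2.0), which is the kind of bias the slide describes.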

Spelling checkers
• Display of suggestions may or may not take account of likelihood considering
  – Levenshtein distance score (is the word with the lowest score necessarily the likeliest correction?)
  – Frequency of use
  – Matching part of speech
  – Readability (long list of alternatives not helpful to a bad speller, eg dyslexic)
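One simple way to combine two of these criteria is to sort by distance first and by frequency second, then truncate the list. This is only a sketch: the candidate list for the misspelling ther is hand-built, and the frequency figures are invented for illustration.

```python
def rank_suggestions(candidates, max_suggestions=3):
    """Order (word, distance, frequency) tuples by ascending distance,
    then descending frequency, and keep only a short list."""
    ordered = sorted(candidates, key=lambda c: (c[1], -c[2]))
    return [word for word, _, _ in ordered[:max_suggestions]]

# Candidates for the misspelling "ther"; frequencies are made up.
candidates = [
    ("ether", 1, 5),
    ("there", 1, 1500),
    ("their", 1, 900),
    ("other", 1, 700),
    ("tether", 2, 3),
]

print(rank_suggestions(candidates))
```

Four candidates tie on distance here, so distance alone cannot pick the likeliest correction; frequency breaks the tie, and the cut-off keeps the list short for readability.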

Spelling checkers
• In general do not handle true homograph errors
• They could quite easily deal with
  – Very frequent errors that can be identified by immediate context (eg its~it’s, there~their, no~know, ...)
  – (Some) errors that can be identified by part-of-speech tagging (eg practice~practise)
• More difficult to deal with errors that depend on meaning

Spelling checkers
• Dictionary size: “the bigger the better?”
  – Including rare words disadvantageous
    • especially if they are same as common misspellings (eg bhat)
    • They clutter up the list of suggestions
  – Most spelling checkers now compromise
    • 90,000 entries according to Wikipedia
  – Sensible handling of morphology (inflections and derivations) can reduce size considerably

Spelling checkers for other languages
• Concept of “spelling” not appropriate for some writing systems
• If writing system is really phonetic, spell checker only has to deal with true typos (miskeying), not alternative phonetic realisations
• Compounding rules in languages like Ger, Dutch mean many “new” words – checker should not flag these if they are potentially correct
• Spelling is much less standardized for some languages, eg Heb עיראק ~ ערק ‘Iraq’
• Languages with very rich morphology have potentially infinite different word forms, so simple dictionary lookup is not appropriate

Grammar checking
• “Grammar” as in “good grammar”?
• Early grammar checkers were really style checkers
  – Still word-based, will flag use of “weak” words like nice, very, etc. and use of clichés,
  – and mechanical errors, eg double words, apparent punctuation errors
• Now grammar checking involves genuine text analysis
• Several companies were involved but Microsoft has now become dominant
  – Arguably resulting in stagnation (see Wikipedia)

Grammar checking
• Who wants it and what for?
• What mistakes do native speakers make?
  – Borderline between style and grammar?
    • media/data is/are; less~fewer; compared to/with
    • Comma after subject of sentence
    • however as a conjunction
  – Some mistakes clear-cut
• Do people type ungrammatical sentences?
• Mistakes introduced by editing

Grammar checking for learners
• Language learners have different needs from ordinary users
• Mistakes are somewhat predictable
• They make different mistakes
• They might also like an explanation or link to a grammar (in the pedagogic sense) tutorial
• Grammar checker can be predictive, i.e. go looking for specific mistakes
• Could be set at an appropriate level

Grammar checking
• True grammar checking would involve syntactic analysis ...
  – Needs a dictionary indicating parts of speech
  – Morphological processing (as before)
  – Rules of grammar
• ... and possibly some semantic processing
• Actually, it’s too hard to do completely
• But a lot can be done

Grammar checking
• Subject-verb agreement
• Modifier-noun agreement (Eng this~these etc, but more extensive for other languages)
• Verb complement checking (wait for, depend on, etc)
• Inclusion of a main clause
• All of the above only if the sentence is fairly simple

Grammar checking
• For real grammar-checking, use a tagger and/or parser (see later)
• Some things can be done with statistical models
  – Learn probability of word sequences (n-grams) from a large corpus
  – Use this model to judge grammaticality of text
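The statistical idea can be sketched with a bigram model. Everything here is an illustrative assumption: the four-sentence "corpus" stands in for a large one, add-one smoothing is just the simplest choice, and the function names are hypothetical.

```python
from collections import Counter
import math

def train_bigram_model(sentences):
    """Collect context and bigram counts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    vocabulary = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocabulary.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocabulary)

def log_probability(sentence, model):
    """Log probability of a sentence under the add-one smoothed model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total

# Toy corpus standing in for "a large corpus".
corpus = ["the dog barks", "the dogs bark", "a dog barks", "the cat sleeps"]
model = train_bigram_model(corpus)

print(log_probability("the dog barks", model))
print(log_probability("the dog bark", model))
```

The agreement error in "the dog bark" gets the lower score because the sequence "dog bark" was never seen in training, so the model flags it without any explicit grammar rule.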

Using a language model
• Language model can also be used to distinguish between
  – Homophones
  – Near synonyms
• In either case by looking at collocations
  – Again, n-grams
  – Or co-occurrence of words in the sentence
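Co-occurrence within the sentence can be sketched as follows. The toy sentences, the function names and the scoring by a plain sum of counts are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered pair of distinct words
    appears together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts

def choose_by_context(candidates, context_words, counts):
    """Pick the candidate that co-occurs most often with the context words."""
    def score(candidate):
        return sum(counts[tuple(sorted((candidate, w)))] for w in context_words)
    return max(candidates, key=score)

# Toy corpus; a real system would use a very large one.
corpus = [
    "the camel crossed the desert",
    "a camel lives in the desert",
    "we ate dessert after dinner",
    "chocolate dessert with cream",
]
counts = cooccurrence_counts(corpus)

print(choose_by_context(["desert", "dessert"], ["camel", "crossed"], counts))
print(choose_by_context(["desert", "dessert"], ["chocolate", "dinner"], counts))
```

Unlike an n-gram, the context word does not have to be adjacent to the candidate, only in the same sentence.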

Using a language model: n-grams
• Counting hits on Google
• Homophone distinction
  – principle reason (110k) ~ principal reason (1.03m)
  – stationary cupboard (831) ~ stationery cupboard (37.7k)
  – could of gone (27.7k) ~ could have gone (1.92m)
  – I wonder weather (2.78k) ~ I wonder whether (1.74m)
  – dessert + camel (307k) ~ desert + camel (2.07m)
• Near synonym distinction
  – “strong coffee” (443k) ~ “powerful coffee” (668)
  – “strong engine” (86k) ~ “powerful engine” (614k)
  – strong + coffee (17.6m) ~ powerful + coffee (8.9m)
  – strong + engine (28.6m) ~ powerful + engine (28.8m)
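The comparison of hit counts can be written out as a small sketch. The counts are the figures quoted on the slide, entered by hand; the dictionary and the function name are hypothetical, and no search engine is actually queried.

```python
# Hit counts as quoted on the slide (k = thousand, m = million).
HIT_COUNTS = {
    "principle reason": 110_000,
    "principal reason": 1_030_000,
    "stationary cupboard": 831,
    "stationery cupboard": 37_700,
    "could of gone": 27_700,
    "could have gone": 1_920_000,
    "I wonder weather": 2_780,
    "I wonder whether": 1_740_000,
    "strong coffee": 443_000,
    "powerful coffee": 668,
    "strong engine": 86_000,
    "powerful engine": 614_000,
}

def prefer(phrase_a, phrase_b, counts=HIT_COUNTS):
    """Return the more frequent phrase and how many times
    more frequent it is than the alternative."""
    a, b = counts[phrase_a], counts[phrase_b]
    winner = phrase_a if a >= b else phrase_b
    return winner, max(a, b) / min(a, b)

for pair in [("principle reason", "principal reason"),
             ("strong coffee", "powerful coffee"),
             ("strong engine", "powerful engine")]:
    winner, ratio = prefer(*pair)
    print(f"{pair[0]} ~ {pair[1]}: prefer '{winner}' ({ratio:.1f}x)")
```

The ratio matters as much as the winner: strong coffee wins by a factor of several hundred, while powerful engine wins by only about seven, so the second preference is much weaker evidence.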

Using a language model: collocation
• Homophone distinction
  – dessert + camel (307k) ~ desert + camel (2.07m)
• Near synonym distinction
  – strong + coffee (17.6m) ~ powerful + coffee (8.9m)
  – strong + engine (28.6m) ~ powerful + engine (28.8m)
• Similar distinctions can also be measured with reference to a structured thesaurus such as WordNet (next week’s topic)