LIS 654 lecture 4: representation of text
Thomas Krichel
2012-10-05

what’s up doc?
• We have seen the representation of images.
• Today we look more closely at the representation of character data.
• This is more difficult than the representation of images because it involves a more sophisticated representation of human culture.
• After that we start with copyright.

introduction
• From Omeka, we have seen that databases store records.
• Records contain fields, and fields have values.
• Here we talk about how, fundamentally, those values are composed.
  – Numerical values are easy.
  – String values are harder.

literature
• The library textbooks are hopelessly short and confused about this topic.
• Most of what I present here comes from my own experience.
• I recommend Wikipedia; it has fascinating articles about these topics.

all gone to a number
• In all modern information systems, information is stored to be processed on a computer.
• A computer can only deal with numbers.
• As a consequence, all information has to be converted into numbers.
• It's a huge job.
• Let’s start from the ground: numbers.

a bit
• A bit is the elementary unit of information.
• It takes a binary value. We can label it true/false, black/white, +/-, etc.
• Every piece of information in all modern information storage systems has to be reduced to a sequence of bits.
• We will denote the two values 0/1 here.

byte
• A byte is a sequence of 8 bits, from '00000000' to '11111111'. There are 2 to the power 8, meaning 256, possible bytes.
• If the byte is required to start with 0, then we can only write '00000000' to '01111111'. This leaves us with 2 to the power 7, meaning 128, possibilities.
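The counts on this slide can be checked with a few lines of Python. This is only a small sketch; the function name `count_patterns` is made up for illustration.

```python
# Number of distinct values that a sequence of `bits` bits can take.
def count_patterns(bits: int) -> int:
    return 2 ** bits

full_byte = count_patterns(8)       # all 8 bits are free
leading_zero = count_patterns(7)    # first bit fixed to 0, 7 bits are free

# Smallest and largest values, written out as 8 binary digits.
lowest = format(0, "08b")
highest = format(255, "08b")
highest_with_leading_zero = format(127, "08b")

print(full_byte, leading_zero)
print(lowest, highest, highest_with_leading_zero)
```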

hex numbers
• Hex numbers use the usual digits 0 to 9, as well as A to F. A means 10, B means 11, and so on up to F, which means 15.
• One hex digit can represent 2 to the power 4, meaning 16, possibilities (0 to 15).
• Two hex digits can represent 2 to the power 8, meaning 256, possibilities.

bytes and hex numbers
• Since two hex digits convey the same number of possibilities as a byte, a byte is often represented as two hex digits.
• Thus, for example
  – '00000000' in binary is '00' in hex,
  – '11111111' in binary is 'FF' in hex,
  – '01111111' in binary is '7F' in hex.
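The correspondence between a byte and two hex digits can be tried out as follows. This is a sketch only; `byte_to_hex` and `hex_to_byte` are hypothetical helper names, not part of any library.

```python
def byte_to_hex(bits: str) -> str:
    """Convert a string of 8 binary digits into two hex digits."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 8 binary digits")
    return format(int(bits, 2), "02X")

def hex_to_byte(hex_digits: str) -> str:
    """Convert two hex digits back into 8 binary digits."""
    return format(int(hex_digits, 16), "08b")

for bits in ("00000000", "11111111", "01111111"):
    print(bits, "->", byte_to_hex(bits))
```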

converting information to numbers
• Many problems in converting information arise when some part of the information is encoded in one form and some other part in another form.
• Example: “15 Juillet 1923” vs “July 17, 1923”
• Often such inconsistencies require manual reformatting, which is very expensive.

numerical information
• Some information can be converted to a number using a simple conversion.
• Examples:
  – A recent point in time is often converted into a number by taking the number of seconds since the first of January 1970.
  – A date is often written as an ISO date in the form yyyymmdd, where yyyy is the year, mm is the month and dd is the day, with leading zeros.
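Both conversions can be sketched with Python's standard `datetime` module. The date of this lecture serves as the example; taking it at midnight UTC is an assumption made only to have a definite point in time.

```python
from datetime import datetime, timezone

# The lecture date at midnight UTC (the time of day is an assumption).
moment = datetime(2012, 10, 5, tzinfo=timezone.utc)

# Number of seconds since the first of January 1970.
seconds_since_1970 = int(moment.timestamp())

# The same date in the form yyyymmdd.
iso_basic = moment.strftime("%Y%m%d")

# Converting the number back gives the same point in time.
restored = datetime.fromtimestamp(seconds_since_1970, tz=timezone.utc)

print(seconds_since_1970, iso_basic, restored.isoformat())
```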

numerizing
• In the design of every information system, it is a good idea to convert information into something that is directly a number.
• There are cases where it is possible to use a number directly, such as
  – colours
  – times and dates
  – locations.

another hex number example
• Colors on the world wide web follow the red/green/blue color model.
• Each color is given as a number #rrggbb, where rr is the amount of red, gg is the amount of green and bb is the amount of blue. All these numbers are hex numbers.
• Examples:
  – #FFFFFF white
  – #00FFFF aqua
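Splitting such a color into its three amounts is a matter of reading two hex digits at a time. A minimal sketch, where `parse_web_color` is a made-up name:

```python
def parse_web_color(code: str) -> tuple:
    """Split '#rrggbb' into red, green and blue amounts (0 to 255 each)."""
    if len(code) != 7 or not code.startswith("#"):
        raise ValueError("expected a color of the form #rrggbb")
    # Positions 1-2 hold red, 3-4 hold green, 5-6 hold blue.
    return tuple(int(code[i:i + 2], 16) for i in (1, 3, 5))

print(parse_web_color("#FFFFFF"))  # white
print(parse_web_color("#00FFFF"))  # aqua
```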

non-numerical information
• A lot of information is not numerical by its nature. For example
  – the name of a person
  – the title of an expression of a work
• Such information is a character string by nature.
• To store character strings in an information system, each character has to be converted to a number.

character
• A character is an indivisible unit of textual information.
• Textual information is composed of characters, and nothing else.

characters and computer
• Computers cannot deal with characters directly. They can only deal with numbers.
• Therefore we need to associate a number with every character that we want to use in an information encoding system.
• A character set associates characters with numbers.

ASCII
• ASCII is an old character set developed in the United States. It is a seven-bit character set.
• In hex notation, it goes from '00' to '7F'.
• Because of Anglo-Saxon cultural imperialism, the first 128 characters in Unicode are the same as in ASCII.
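Python's built-in `ord` gives the number associated with a character, which lets us look at ASCII directly. The sample string is arbitrary and chosen only for illustration.

```python
text = "LIS 654"

# The number associated with each character.
numbers = [ord(ch) for ch in text]

# The same numbers as two hex digits each.
hex_codes = [format(n, "02X") for n in numbers]

# Every ASCII character has a number below 128, so it fits in seven bits.
all_seven_bit = all(n < 128 for n in numbers)

# For pure ASCII text, the ASCII bytes and the UTF-8 bytes are identical.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(numbers)
print(hex_codes)
print(all_seven_bit, ascii_bytes == utf8_bytes)
```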

notable characters in ASCII
decimal  hex  byte  Unicode  name
8        8    08    U+0008   backspace
9        9    09    U+0009   horizontal tab
10       A    0A    U+000A   line feed
13       D    0D    U+000D   carriage return
32       20   20    U+0020   space
127      7F   7F    U+007F   delete

wikipedia notation
• Wikipedia denotes every character in the BMP as U+hhhh, where each h is a hex digit 0-F.
• We will follow this notation here.

UCS / Unicode
• UCS is the Universal Character Set.
• It is maintained by the International Organization for Standardization (ISO).
• Unicode is an industry standard for characters. It is better documented than UCS.
• For what we discuss here, UCS and Unicode are the same.

Basic multilingual plane
• This is the name for the first 65536 characters in Unicode.
• Each of these characters fits into two bytes and is conveniently represented by four hex digits.
• Even for these characters, there are numerous complications.

dashes
• figure dash ‒ U+2012 to link numbers without a range
• en dash – U+2013 to link numbers with a range
• em dash — U+2014 for interjections in a sentence
• minus sign − U+2212 for mathematics
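These characters look almost alike on screen, but they have different numbers. Python's standard `unicodedata` module can report the official Unicode name for each one. The ASCII hyphen-minus is added here for comparison; it is not on the slide.

```python
import unicodedata

# The four dashes from the slide, plus the ASCII hyphen-minus for comparison.
dashes = "\u2012\u2013\u2014\u2212-"

# Map the U+hhhh notation to the official Unicode character name.
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in dashes}

for code, name in names.items():
    print(code, name)
```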

“smart” quotes
• U+201C “ is the opening double quote.
• U+201D ” is the closing double quote.
• U+2019 ’ is the apostrophe.
• The single quote of the ASCII character set is considered to be of mixed usage; it should be avoided when a more specific character can be used.
• Similarly, the double quote of the ASCII character set is imprecise.

spaces
• The non-breaking space, U+00A0, is used when you want to avoid a line break between the two spaced items. For example, in hyperlink text it is good practice to replace spaces with non-breaking spaces, so that a link broken across lines does not appear to be two links.
• In content where whitespace is collapsed, it can also be used to add extra space.

beyond ascii, foreign languages
• Everything becomes difficult.
• As an example consider the characters
  – o
  – ő
  – ö
• The latter two can be considered either as o with diacritics or as separate characters.
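Unicode actually supports both views: ö exists as a single character and also as an o followed by a combining diacritic. A short sketch using the standard `unicodedata` module shows the two forms and how normalization converts between them.

```python
import unicodedata

single = "\u00f6"        # ö as one character
combined = "o\u0308"     # o followed by a combining diaeresis

# The two strings look the same when displayed but are not equal.
print(single == combined, len(single), len(combined))

# NFD splits a character into base letter plus diacritic,
# NFC joins them back into a single character where one exists.
decomposed = unicodedata.normalize("NFD", single)
recomposed = unicodedata.normalize("NFC", combined)

# ő (U+0151) decomposes the same way, with a different diacritic.
double_acute = unicodedata.normalize("NFD", "\u0151")
print([f"U+{ord(ch):04X}" for ch in double_acute])
```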

most problematic: encoding
• One issue is how to map characters to numbers. This is complicated for languages other than English. But assume UCS/Unicode has solved this.
• But this is not the main problem that we have when working

encoding
• The encoding determines how the number of each character is put into bytes.
• If you have a character set that has one byte for each character, you have no encoding issue.
• But then you are limited to 256 characters in your character set.

fixed-length encoding
• If you have a fixed-length encoding, all characters take the same number of bytes.
• Say, for the basic multilingual plane of Unicode, you need two bytes for each character, and then you are limited to that plane.
• If you are writing only ASCII, this appears a waste.
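The waste is easy to see in a small experiment. UTF-16 is used here as a stand-in for a two-byte fixed-length encoding, because it uses exactly two bytes for every character in the basic multilingual plane; that substitution is an assumption of this sketch.

```python
text = "text"

# One byte per character in ASCII.
one_byte_each = text.encode("ascii")

# Two bytes per character in big-endian UTF-16 (for BMP characters).
two_bytes_each = text.encode("utf-16-be")

print(len(one_byte_each), one_byte_each)
print(len(two_bytes_each), two_bytes_each)
```

For ASCII text, every second byte in the two-byte form is zero.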

variable length encoding
• The most widely used scheme to encode Unicode is a variable-length scheme called UTF-8.
• I will leave out the technical details of how it works.
• But it is important to understand that the encoding needs to be known and correct.
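Without going into the technical details, one can at least observe the variable length. The three sample characters below are arbitrary choices from the basic multilingual plane.

```python
# An ASCII letter, a letter with a diacritic, and the en dash.
samples = ["o", "\u00f6", "\u2013"]

# Number of bytes each character occupies in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X}", n, ch.encode("utf-8").hex())
```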

ASCII vs UTF-8
• The ASCII representation of a character in a byte has the first bit set to zero.
• This is the same in UTF-8.
• Any other character occupies at least two bytes in UTF-8.
• This is in contrast to character sets such as ISO Latin-1, which place additional characters in the second half of the byte range.
• This is THE major problem in practical work!
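Here is what that problem looks like in practice: the same bytes read with the wrong encoding. The word "Köln" is just an illustrative example and does not come from the slides.

```python
word = "K\u00f6ln"  # "Köln"

utf8_bytes = word.encode("utf-8")      # ö becomes two bytes
latin1_bytes = word.encode("latin-1")  # ö becomes one byte

# Reading UTF-8 bytes as if they were Latin-1 gives garbage characters.
misread = utf8_bytes.decode("latin-1")

# Reading Latin-1 bytes as if they were UTF-8 fails outright.
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

print(len(utf8_bytes), len(latin1_bytes))
print(misread, strict_decode_failed)
```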

ligature
• In fine traditional typography, certain characters appear to be linked to each other.
• The most common examples in English usage are fi, ff, fl, ffi, ffl.
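Unicode contains some of these ligatures as characters in their own right, for compatibility with older character sets. A small sketch with the standard `unicodedata` module shows how compatibility normalization (NFKC) turns them back into ordinary letters.

```python
import unicodedata

fi_ligature = "\ufb01"    # single character that looks like "fi"
ffi_ligature = "\ufb03"   # single character that looks like "ffi"

ligature_name = unicodedata.name(fi_ligature)

# NFKC replaces a ligature with the separate letters it stands for.
fi_plain = unicodedata.normalize("NFKC", fi_ligature)
ffi_plain = unicodedata.normalize("NFKC", ffi_ligature)

print(ligature_name, fi_plain, ffi_plain)
```

This matters for searching: a search for "fi" will not find the ligature character unless the text is normalized first.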

ligatures growing up
• In certain cases, ligatures have become so common that they have become characters of their own.
• A prominent example is the German sz ligature, the Eszett (ß). It looks a bit like a beta because it is derived from the Fraktur forms of the two characters.
• Another example, apparently, is &.

collations
• Collation is a topic that is related to characters.
• A collation is a sorting order of character strings.
• You may think this is trivial: just follow the alphabetic order.
• But in many languages, diacritics complicate matters.

example German
• Here are the extra letters of German: Ä/ä, Ö/ö, Ü/ü, ß
• In German, there are two collations.
  – DIN 5007-1 “dictionary collation” treats umlauted characters as if they had no umlaut, and ß as ss.
  – DIN 5007-2 “phonebook collation” treats an umlauted character as the base letter followed by e (e.g. ä → ae), and ß as ss.
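The difference between the two collations can be sketched with two sort keys. This is a simplification and not a full implementation of DIN 5007: the function names and the word list are made up, tie-breaking rules are ignored, and mapping ß to ss in both keys is an assumption of this sketch.

```python
# Simplified sort key in the spirit of the dictionary collation:
# umlauted letters sort like their base letters.
def dictionary_key(word: str) -> str:
    table = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
    return word.lower().translate(table)

# Simplified sort key in the spirit of the phonebook collation:
# umlauted letters sort like base letter plus e.
def phonebook_key(word: str) -> str:
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
    return word.lower().translate(table)

words = ["Ofen", "Öl", "Oase", "Oder"]

dictionary_order = sorted(words, key=dictionary_key)
phonebook_order = sorted(words, key=phonebook_key)

print(dictionary_order)
print(phonebook_order)
```

The word "Öl" moves: it sorts as "ol" in the first case and as "oel" in the second, so it lands after "Ofen" in one list and before it in the other.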

transliterations
• When non-English characters are to be entered in a system used by English-speaking people, a transliteration might be used.
• This can also be the case if the original script is not commonly understood. Japanese road signs are an example.
• Wikipedia lists 20 different ways to do that for Russian, say. The Library of Congress scheme is apparently the most widely used.

http://openlib.org/home/krichel
Please shut down the computers when you are done.
Thank you for your attention!