Unicode in ALEPH

Session Outline
• Key concepts
• Pre-UNICODE ALEPH
• ALEPH 500.14.2 - full UNICODE version
• Innovations in character conversion mechanism
• Implementation of UNICODE conversion, useful remarks, tips

Key Concepts
• Character - the smallest component of written text
• Character set - an agreed-upon set of characters. For example:
  - English alphabet: 52 upper- and lower-case letters
  - ISO 8859-5: basic Latin + Cyrillic characters

Key Concepts
• Encoding - unique assignment of characters to numerical codes. For example:
  - ASCII: capital letter ‘A’ = 65
  - ISO 8859-8: Hebrew letter ‘א’ = 224
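The two code positions quoted on this slide can be checked with Python's built-in codecs. This is only an illustrative sketch; it assumes the Hebrew letter meant on the slide is alef (U+05D0), which is the character at position 224 in ISO 8859-8.

```python
# Look up the numeric code assigned to a character by two encodings.
ascii_a = "A".encode("ascii")[0]               # capital letter A in ASCII
hebrew_alef = "\u05d0".encode("iso8859_8")[0]  # alef in ISO 8859-8 (assumed letter)

print(ascii_a, hebrew_alef)
```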

Key Concepts
• Encoding types:
  – single byte (e.g. English + another character set): one byte = one character
  – double byte (e.g. ANSEL, UNICODE): 2 bytes = one character
  – multi-byte (e.g. CJK, UTF-8): 1, 2 or 3 bytes = one character
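A small sketch of the difference between a fixed double-byte encoding and a multi-byte one, using Python's standard codecs. The three sample characters are arbitrary choices for illustration, all taken from the 16-bit code range.

```python
# Compare how many bytes each sample character needs in UTF-8 and UTF-16.
samples = {"Latin A": "A", "Cyrillic Zhe": "\u0416", "CJK zhong": "\u4e2d"}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
utf16_lengths = {name: len(ch.encode("utf-16-be")) for name, ch in samples.items()}

for name in samples:
    print(f"{name}: UTF-8 = {utf8_lengths[name]} byte(s), UTF-16 = {utf16_lengths[name]} bytes")
```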

Non-UNICODE Systems
• Based on single-byte encoding schemes
• The ASCII 7-bit code space and its 8-bit extensions are limited to 128 and 256 code positions respectively.

Non-UNICODE Systems
• Restricting the character repertoire to at most 256 characters proved too rigid: even covering all European languages written in Latin script requires more than 400 characters.

Non-UNICODE Systems
• As a result, multiple national standards developed, each fitting the character repertoire of a specific language into the limited code space.

Non-UNICODE Systems
• For example, ISO 8859 is a full series of 10 standardized multilingual single-byte (8-bit) coded character sets for writing in alphabetic languages:
  - Latin 1 (West European)
  - Latin 2 (East European)
  - Latin 3 (South European)
  - Latin 4 (North European)
  - Cyrillic
  - Arabic
  - etc.

Non-UNICODE Systems
Results:
1. Use of multiple inconsistent character codes because of conflicting character sets. For example, in Western European software environments one often finds confusion between Windows Latin 1 (code page 1252) and ISO 8859-1.
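The code page 1252 versus ISO 8859-1 confusion can be reproduced with a single byte. The sketch below uses Python's standard codecs; byte 0x80 is chosen only because the two code pages disagree on it.

```python
# The same byte means different things under two "Latin 1" code pages.
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")     # euro sign in Windows code page 1252
as_latin1 = raw.decode("iso8859_1")  # an invisible C1 control code in ISO 8859-1

print(repr(as_cp1252), repr(as_latin1))
```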

Non-UNICODE Systems
2. No easy way to input multilingual data
3. No transparent transfer of textual data between computer systems: high risk of code-page-related misinterpretation

Unicode
Solution provided by the UNICODE standard:
• Definition of a set of characters that encompasses most of the major languages of the world

Unicode
• Based on 16-bit character codes
• Any given 16-bit value always represents the same character.

Unicode
• Allocation areas:
  – The codes are grouped in linguistic and functional categories.
  – The Unicode standard code space is divided into several areas, which are themselves divided into character blocks.

Unicode
Encoding schemes:
• UTF-16: double-byte encoding using the Unicode standard character codes
• UTF-8: multi-byte encoding utilizing the full 8 bits of each byte
• UTF-7: multi-byte encoding utilizing only 7 bits of each byte
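To make the three schemes concrete, the sketch below encodes one character with each of them using Python's standard codecs. The Hebrew letter alef (U+05D0) is an arbitrary sample.

```python
# Encode the same 16-bit Unicode character with three encoding schemes.
text = "\u05d0"  # Hebrew letter alef, Unicode code 05D0

utf16 = text.encode("utf-16-be")  # the 16-bit code itself, as two bytes
utf8 = text.encode("utf-8")       # two bytes, both using the high bit
utf7 = text.encode("utf-7")       # several bytes, all below 128

print(utf16.hex(), utf8.hex(), utf7)
```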

Unicode
Mappings:
• Transformation between Unicode encodings is based on an algorithm, not a table.
• Conversion tables from standard character sets to Unicode are readily available.
• Unicode can act as an intermediate encoding.
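The idea of Unicode as an intermediate encoding can be sketched with two Cyrillic code pages: decode from the source character set into Unicode, then encode into the target character set. The choice of code page 1251 and ISO 8859-5 is an assumption made for illustration only.

```python
# Convert between two single-byte character sets by going through Unicode.
def convert(data: bytes, source: str, target: str) -> bytes:
    unicode_text = data.decode(source)   # source character set -> Unicode
    return unicode_text.encode(target)   # Unicode -> target character set


word = "\u0416\u0443\u043a"               # a short Cyrillic word
source_bytes = word.encode("cp1251")
converted = convert(source_bytes, "cp1251", "iso8859_5")

print(source_bytes.hex(), converted.hex())
```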

Pre-Unicode ALEPH
Pre-Unicode ALEPH differentiated between two types of data:
• Bibliographic: this also includes all authority and holdings records
• Administrative: patrons, items, acquisition data, serials, etc.

Pre-Unicode ALEPH
Administrative data:
• Inherently homogeneous
• Data can be stored in a single-byte encoding of a given character set.

Pre-Unicode ALEPH
Bibliographic data:
• In all versions of ALEPH, bibliographic information can be defined in as many languages as needed, regardless of Windows multilingual support.

Pre-Unicode ALEPH
• Multiscript functionality in the non-UNICODE versions of ALEPH is possible due to the presence of ALPHA, a script identifier, in the field.

Pre-Unicode ALEPH
• ALPHA defines the input, display, and filing characteristics of the field.

Pre-Unicode ALEPH
Input: one of the configuration files in the GUI client defines the font in which each script can be input.
catalog.ini:
  FontL=Courier New
  FontH=Web Hebrew Monospace
  FontA=Aleph Fixed Arabic Egypt
  FontS=Courier New Cyr
  FontR=Courier New Greek

Pre-Unicode ALEPH
Output: a similar definition exists for the display characteristics of the bibliographic data.
alephcom.ini:
  FontL01=11 MS Sans Serif
  FontH01=16 Web Hebrew AD
  FontA01=16 Aleph Fixed Arabic Egypt
  FontS01=18 Courier New Cyr
  FontR01=16 Courier New Greek

Pre-Unicode ALEPH
[Screen capture from MLT]

Pre-Unicode ALEPH
Filing order is defined per script:
char_conv.A:
  AL 235 AH 235 000 235

Pre-Unicode ALEPH
Creation of indexes is ALPHA specific:
  z01_rec_key
    03 acc_code ...... AUT
    03 alpha ......... H
    03 filing_text ... … צורות חשיבה
  z01_rec_key
    03 acc_code ...... AUT
    03 alpha ......... L
    03 filing_text ... aamodt agnar

Pre-Unicode ALEPH
• Pre-UNICODE ALEPH is ALPHA dependent

Pre-Unicode ALEPH
Restrictions:
1. GUI input and output within a single field are limited to the 256 characters of one code page. It is not possible to input and display Latin characters with diacritics and non-Latin characters in one field (e.g., a Russian title containing several French words).

Pre-Unicode ALEPH
2. Indexing and retrieval are script dependent. Both FIND and BROWSE are performed within the ALPHA-restricted groups of index records.

Pre-Unicode ALEPH
• For example, an ‘S’-designated field will be indexed as Cyrillic (marked as ‘S’ in the indexing tables) in both the browse index (z01) and the words index (z97).

Pre-Unicode ALEPH
• ‘S’-marked headings and words can be retrieved only when an ‘S’-designated query is sent.

UNICODE ALEPH
• 14.2 is the full UNICODE version

UNICODE ALEPH
• Data (bibliographic + administrative) is stored in UTF-8
• GUI client is UNICODE compatible
• No need for character conversion for input and display
• ALPHA loses its meaning

UNICODE ALEPH - Indexing
Words:
• Creation of the words index is no longer ALPHA dependent.
• The index is created in UTF-8.
• Index records (z97) are increased in size to accommodate Unicode data.

UNICODE ALEPH - Indexing
Browse index:
• The browse index is no longer ALPHA specific either.
• The index is created in UNICODE 16-bit codes.
• Index records (z01) are increased in size to accommodate Unicode data.

UNICODE ALEPH - GUI client
• Unicode data processing

UNICODE ALEPH - GUI client
• Catalog and Search clients: no limitations in input and display of UNICODE data
• Administrative clients:
  – no limitations in display of UNICODE data in the Navigation Map, View windows, and Lists
  – BUT input forms use Windows controls, which can display only data corresponding to the Windows code page. Data which cannot be displayed properly appears as question marks. The fields are locked for editing.

UNICODE ALEPH - WEB OPAC
• WEB OPAC: UTF-8 input and display

UNICODE ALEPH - WEB OPAC
• ALEPH is sensitive to browser types.
• If the browser is older than Netscape 6 or Internet Explorer 5, we assume that it does not support UTF-8.
• www_server_defaults defines the default character set for browsers that are not UTF compatible. Example:
  setenv server_default_charset "iso-8859-1"

UNICODE ALEPH tables and html pages
• Tables and html pages are written in an ISO character set and are converted to UTF-8 on load.
• The UTF-8 variants of the WEB pages and tables are stored under ./alephe/utf_files.

UNICODE ALEPH tables and html pages
• The system converts tables and html pages in accordance with the default character conversion definition in $alephe_root/aleph_start_505:
  setenv default_character_conversion 8859_1_TO_UTF

UNICODE ALEPH Printing
Printouts produced from the GUI client:
- If UNICODE data processing does not succeed, the data is converted to the Windows code page. Unrecognized characters are displayed as question marks.
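The question-mark behaviour described here resembles what a generic "replace on error" conversion does. The sketch below uses Python's standard codecs with code page 1252 as an assumed Windows code page; it illustrates the effect and is not ALEPH's own printing code.

```python
# Convert Unicode text to a Windows code page, replacing anything
# outside that code page's repertoire with a question mark.
def to_windows_codepage(text: str, codepage: str = "cp1252") -> str:
    encoded = text.encode(codepage, errors="replace")
    return encoded.decode(codepage)


printed = to_windows_codepage("\u017b\u00f3\u0142w")  # Polish word with two letters missing from cp1252
print(printed)
```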

UNICODE ALEPH Printing
Printouts produced from the WEB OPAC:
- The data is converted to a single-byte code page. Transliteration of unrecognized characters is possible.

UNICODE ALEPH Services
1. Processing of UTF data is enabled in the batch services.
2. Html pages of the batch jobs which are intended for UTF data processing must contain the following tag:
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Character Conversion Mechanism - Innovations

Character Conversion (old)
• /alephe/char_conv
• Separate table for each instance where character conversion is required, e.g.:
  char_conv.1: Internal -> Display
  char_conv.3: Catalog -> Internal
  char_conv.4: Input -> Internal
  char_conv.A: filing of bib data
  char_conv.K: user names
  char_conv.N: order indexes

Character Conversion Mechanism Innovations
• Char_conv tables have been replaced by new unicode2xxx tables
• All tables convert hexadecimal rather than decimal values
  Examples: unicode2filing-a, unicode2pinyin

Character Conversion Mechanism Innovations
• The values are the Unicode 16-bit codes.
• There is a built-in algorithm for translation of Unicode values to UTF-8 ones, where necessary.
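The translation from a 16-bit Unicode value to UTF-8 follows the standard UTF-8 bit layout, which can be sketched in a few lines. The function name `utf8_from_code` is hypothetical, and this is the generic algorithm rather than ALEPH's actual routine.

```python
# Standard UTF-8 encoding of a 16-bit Unicode code value (1, 2 or 3 bytes).
def utf8_from_code(code: int) -> bytes:
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit code value")
    if code < 0x80:        # 7 bits: 0xxxxxxx
        return bytes([code])
    if code < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])


print(utf8_from_code(0x05D0).hex())
```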

Character Conversion Mechanism Innovations
• All the tables are stored in the directory /alephe/unicode
• The character conversion mechanism is driven by the table tab_character_conversion_line

Character Conversion Mechanism Innovations
• tab_character_conversion_line provides parameters for the process of character conversion

Character Conversion Mechanism Innovations
• tab_character_conversion_line
  UTF_TO_URL       ##### # line_utf2line_sb  unicode_to_8859_1
  UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb
  LOCATE           ##### # line_utf2line_utf unicode_to_locate
  FILING-KEY-01    ##### # line_utf2line_sb  unicode_to_filing_01
  FILING-KEY-02    ##### # line_utf2line_sb  unicode_to_filing_02
  WORD-FIX         ##### # line_utf2line_utf unicode_to_word_gen

tab_character_conversion_line
• col. 1 - name of the procedure
  WORD-FIX         ##### # line_utf2line_utf unicode_to_word_gen

tab_character_conversion_line
• col. 2 - server type (PC, WWW, #####)
  It is possible to apply different types of character conversion when transactions are performed by different servers.
  Example:
  UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb

tab_character_conversion_line
• col. 3 - ALPHA of the field (wildcards possible)
  Example:
  UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb
  ALEPH300_TO_UTF  ##### L line_sb2line_utf  8859_1_to_unicode  Y
  ALEPH300_TO_UTF  ##### S line_sb2line_utf  8859_5_to_unicode  Y
  ALEPH300_TO_UTF  ##### A line_sb2line_utf  8859_6_to_unicode  Y
  ALEPH300_TO_UTF  ##### R line_sb2line_utf  8859_7_to_unicode  Y
  ALEPH300_TO_UTF  ##### H line_sb2line_utf  8859_8_to_unicode  Y

tab_character_conversion_line
• col. 4 - program to run
  Example:
  LOCATE           ##### # line_utf2line_utf unicode2locate
  UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  unicode_to_8859_1

tab_character_conversion_line
Major programs:
– line_utf2line_sb (UTF -> single byte). Example of usage: conversion of data for printing/mailing from the WEB OPAC
– line_sb2line_utf (single byte -> UTF). Example of usage: conversion of single-byte data before upload into an ALEPH library
– line_utf2line_utf. Example of usage: creation of administrative indexes (vendor, users)

tab_character_conversion_line
• col. 5 - character conversion table to use
  Example:
  UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  unicode_to_8859_1
  LOCATE           ##### # line_utf2line_utf unicode2locate

tab_character_conversion_line
• col. 6 - defines the display of characters which fall outside the code page repertoire
  Values: Y - display; N or blank - do not display
  Example:
  UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb
  ALEPH300_TO_UTF  ##### L line_sb2line_utf  8859_1_to_unicode  Y
  ALEPH300_TO_UTF  ##### S line_sb2line_utf  8859_5_to_unicode  Y
  ALEPH300_TO_UTF  ##### A line_sb2line_utf  8859_6_to_unicode  Y
  ALEPH300_TO_UTF  ##### R line_sb2line_utf  8859_7_to_unicode  Y
  ALEPH300_TO_UTF  ##### H line_sb2line_utf  8859_8_to_unicode  Y
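The six columns described on the preceding slides can be modelled as a small record. The sketch below is purely illustrative: it assumes the columns are separated by whitespace, that `#####` and `#` act as "match anything" wildcards in columns 2 and 3, and that the first matching line wins. The names `ConversionLine`, `parse_line` and `find_entry` are hypothetical, and the real table layout and lookup rules may differ.

```python
from dataclasses import dataclass


@dataclass
class ConversionLine:
    procedure: str          # col. 1 - name of the procedure
    server: str             # col. 2 - server type (PC, WWW, #####)
    alpha: str              # col. 3 - ALPHA of the field
    program: str            # col. 4 - program to run
    table: str              # col. 5 - character conversion table
    keep_unconverted: bool  # col. 6 - Y / N or blank


def parse_line(line: str) -> ConversionLine:
    # Assumption: columns are whitespace separated.
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns: {line!r}")
    keep = len(fields) > 5 and fields[5] == "Y"
    return ConversionLine(*fields[:5], keep_unconverted=keep)


def find_entry(entries, procedure, server, alpha):
    # Assumption: wildcards match any value, first match wins.
    for entry in entries:
        if (entry.procedure == procedure
                and entry.server in ("#####", server)
                and entry.alpha in ("#", alpha)):
            return entry
    return None


entries = [parse_line(row) for row in (
    "UTF_TO_WEB_MAIL WWW   # line_utf2line_sb web_unicode_to_sb",
    "ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y",
)]
print(find_entry(entries, "ALEPH300_TO_UTF", "PC", "L"))
```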

Implementation, conversion, useful tips

Conversion
• The whole set of data must be converted to UTF-8

How to convert bibliographic data
• Use the appropriate character conversion tables in $alephe_unicode:
  8859_1_to_unicode
  8859_5_to_unicode
  8859_6_to_unicode
  8859_7_to_unicode
  8859_8_to_unicode
• Create an instance in tab_character_conversion_line for the character conversion you are going to run:
  ALEPH300_TO_UTF  ##### L line_sb2line_utf  8859_1_to_unicode  Y
  ALEPH300_TO_UTF  ##### S line_sb2line_utf  8859_5_to_unicode  Y
  ALEPH300_TO_UTF  ##### A line_sb2line_utf  8859_6_to_unicode  Y
  ALEPH300_TO_UTF  ##### R line_sb2line_utf  8859_7_to_unicode  Y
  ALEPH300_TO_UTF  ##### H line_sb2line_utf  8859_8_to_unicode  Y

How to convert bibliographic data
• Note: col. 6 = ‘Y’ indicates that a character whose conversion did not succeed will still be included in the file.
  ALEPH300_TO_UTF  ##### L line_sb2line_utf  8859_1_to_unicode  Y

How to convert bibliographic data
• Run p_manage_22 (character conversion utility) to test the character conversion process without uploading to the database

How to convert bibliographic data
• Run p_manage_18 (Load Catalog Records) with the Character Conversion parameter to perform character conversion at the time of load.

How to convert administrative data
1. Upload utilities p_file_04 and p_file_06 have two new parameters, which enable character conversion handling (more details in the lecture on conversion).
NOTE: All functional codes must be in ASCII only!

Conversion of tables and html pages
• In order to have tables and html files converted to UTF correctly:
  – Make sure that you have the proper character conversion definition in $alephe_root/aleph_start_505:
    setenv default_character_conversion 8859_1_TO_UTF
  – If necessary, modify the corresponding tables in $alephe_unicode

Conversion of tables and html pages
• If there is a need to include several scripts in a table / html page, use the following command:
  !CHARACTER_CONVERSION=8859_8_TO_UTF

Conversion of tables and html pages
Example: ../pc_tab/catalog/codes.eng
  !CHARACTER_CONVERSION=8859_8_TO_UTF
  100 Y N N L סופר L Main Entry - סופר
  !CHARACTER_CONVERSION=8859_1_TO_UTF
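The effect of switching conversions inside one table can be sketched as a line-by-line decoder that changes codec whenever it meets a directive line. This is an illustration only: the mapping from conversion names to Python codec names and the function `convert_table` are assumptions, not part of ALEPH.

```python
# Hypothetical mapping from conversion names to Python codecs.
CODECS = {"8859_1_TO_UTF": "iso8859_1", "8859_8_TO_UTF": "iso8859_8"}
DIRECTIVE = b"!CHARACTER_CONVERSION="


def convert_table(raw_lines, default="8859_1_TO_UTF"):
    """Decode single-byte table lines, switching codec at each directive."""
    codec = CODECS[default]
    converted = []
    for raw in raw_lines:
        if raw.startswith(DIRECTIVE):
            name = raw[len(DIRECTIVE):].decode("ascii").strip()
            codec = CODECS[name]  # applies to all following lines
            continue
        converted.append(raw.decode(codec))
    return converted


table = [
    b"!CHARACTER_CONVERSION=8859_8_TO_UTF",
    "100 L \u05e1\u05d5\u05e4\u05e8".encode("iso8859_8"),
    b"!CHARACTER_CONVERSION=8859_1_TO_UTF",
    "245 L Titre abr\u00e9g\u00e9".encode("iso8859_1"),
]
result = convert_table(table)
print(result)
```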

Conversion of tables and html pages
• Character conversion of tables is performed in accordance with the structure specified in the table header. It is highly important to keep the headers up to date!

Low Versions of WEB Browsers
• If the browser is older than Netscape 6 or Internet Explorer 5, we assume that it does not support UTF-8. Therefore, "charset=UTF-8" is translated to "charset=xxx", where xxx is taken from the www_server_defaults variable "server_default_charset":
  setenv server_default_charset "iso-8859-1"
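The fallback rule on this slide can be sketched as a small decision function. Everything named here (`MIN_UTF8_VERSION`, `response_charset`, the browser keys) is hypothetical, and the sketch assumes that browsers other than the two named on the slide are treated as UTF-8 capable.

```python
# Minimum major versions assumed to support UTF-8, per the slide.
MIN_UTF8_VERSION = {"netscape": 6, "msie": 5}


def response_charset(browser: str, major_version: int,
                     server_default_charset: str = "iso-8859-1") -> str:
    minimum = MIN_UTF8_VERSION.get(browser.lower())
    if minimum is not None and major_version < minimum:
        # Old browser: fall back to the configured single-byte charset.
        return server_default_charset
    # Assumption: anything else is treated as UTF-8 capable.
    return "UTF-8"


def content_type(browser: str, major_version: int) -> str:
    return f"text/html; charset={response_charset(browser, major_version)}"


print(content_type("Netscape", 4))
print(content_type("MSIE", 5))
```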

Low Versions of WEB Browsers
• The system uses the following tables for fallback display and input in browsers that are not Unicode compatible:
  – web_unicode_to_sb (display)
  – sb_to_web_unicode (input)

Low Versions of WEB Browsers
• The tables web_unicode_to_sb and sb_to_web_unicode must be adjusted to your local needs (depending on the code page used for display).
• Characters which fall outside the repertoire of the code page you have chosen can be transliterated.
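Transliteration of characters outside the chosen code page can be sketched as a per-character fallback. The table `TRANSLIT` below is a made-up two-entry example and `to_single_byte` is a hypothetical name; the real web_unicode_to_sb table is site specific.

```python
# Hypothetical transliterations for characters missing from ISO 8859-1.
TRANSLIT = {"\u0161": "s", "\u0142": "l"}


def to_single_byte(text: str, codec: str = "iso8859_1", translit=TRANSLIT) -> bytes:
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # Outside the code page: transliterate, or use "?" as a last resort.
            out += translit.get(ch, "?").encode(codec)
    return bytes(out)


encoded = to_single_byte("Ko\u0161ice caf\u00e9")
print(encoded)
```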

tab_character_conversion_line important definitions
1. Character conversion for browse index creation
   FILING-KEY-01    ##### # line_utf2line_sb  unicode_to_filing_01
   FILING-KEY-02    ##### # line_utf2line_sb  unicode_to_filing_02
2. Character conversion for words index creation
   WORD-FIX         ##### # line_utf2line_utf unicode_to_word_gen
3. Administration data - creation of keys:
   VENDOR_NAME_KEY, COURSE_NAME_KEY, ADM_KEYWORD_KEY, BORROWER_NAME_KEY, ACQ_INDEX
   ##### # line_utf2line_utf (tables: adm_name_key, acq_index)
4. Conversion of mail messages sent from the WEB OPAC to single-byte encoding (transliteration possible)
   UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb

Settings - PC client - fonts
• alephcom/fonts.ini: it is possible to define different fonts for different Unicode ranges. This allows using “light” fonts when possible and a “heavy” Unicode font only when necessary.
  ListBox##  0000 00FF  Tahoma
             0401 045F  Tahoma
             0384 03CE  Tahoma
             05D0 05EA  Tahoma
             0000 FFFF  Bitstream Cyberbit
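The range-to-font idea can be sketched as an ordered lookup. The ranges are copied from the slide; the first-match-wins rule and the name `font_for` are assumptions, not documented client behaviour.

```python
# (first code, last code, font) in the order given on the slide.
FONT_RANGES = [
    (0x0000, 0x00FF, "Tahoma"),
    (0x0401, 0x045F, "Tahoma"),
    (0x0384, 0x03CE, "Tahoma"),
    (0x05D0, 0x05EA, "Tahoma"),
    (0x0000, 0xFFFF, "Bitstream Cyberbit"),  # catch-all "heavy" font
]


def font_for(ch: str):
    """Return the font of the first range containing the character."""
    code = ord(ch)
    for low, high, font in FONT_RANGES:
        if low <= code <= high:
            return font
    return None


print(font_for("A"), "/", font_for("\u4e2d"))
```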

GUI client - font settings
If you cannot achieve proper display for a certain Unicode range, try adjusting CHARSET. Possible values are:
  ANSI_CHARSET
  DEFAULT_CHARSET
  SYMBOL_CHARSET
  SHIFTJIS_CHARSET
  HANGEUL_CHARSET
  GB2312_CHARSET
  CHINESEBIG5_CHARSET