Status of Proposed Unicode Changes to the SQL






















- Slides: 22

Status of Proposed Unicode Changes to the SQL Standard by Michael G. Mc. Kenna / Sybase, Inc. Stefan Buchta, Hirotaka Yoshioka / Oracle Corporation v. 1. 3 March 1999 ® (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 1 SQL Issues UIC#14

Introduction l SQL Character Set Internationalization is out-of- date l New accepted standards • Unicode • WWW • Posix • XML • WG 20 • Windows/NT • Java • Oracle/Sybase l Some changes made to I 18 N for SQL 3 • Still under review by ISO and ANSI for future (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 2 SQL Issues UIC#14

Issues with SQL 92 and SQL 3 l Old I 18 N (pre standards) • non conformant • awkward • not implemented l Character Data/Character Columns • Multiple Character Sets • CREATE CHARACTERSET • Character set introducer (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 5 SQL Issues UIC#14

Issues, continued (2) l SQL Names and Literals Example of SQL 92/SQL 3 literal: SELECT * from employee WHERE name = _iso 88591'Müller’ • Now uses Unicode lexical types for identifiers l SQL_TEXT • Superset of all installed character sets • Ideally, should explicitly be Unicode (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 8 SQL Issues UIC#14

Issues, continued (3) l Collation Handling • SQL 92 contains features that (almost) allow the definition of collations • Example: CREATE COLLATION german_dictionary FOR iso 8859_1 FROM (USING(german_default), MODIFY (A < Ä, a < ä, O < Ö, o < ö, U < Ü, u < ü, ß = ss), WHEN NOT FOUND MAX) • No multi-pass ordering like ISO 14651 • Drastically changed for SQL 3 (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 10 SQL Issues UIC#14

Issues, continued (4) l Text element versus Unicode character (10646 levels, applies to collations) • How long is a character? l Composite Characters/Canonical Equivalence ñ ? = n (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle ~ 13 SQL Issues UIC#14

Issues, continued (5) l Upper-/Lowercase Translations (FOLD) l Example: • German Ü <-> ü • But: German “ß” has the uppercase equivalent “SS”, but not all sequences “SS” correspond to “ß” when returned to lowercase. l FOLD function to use Unicode case-mapping, as of January 1999 (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 14 SQL Issues UIC#14

Issues, continued (6) l Client Character Encoding through CLI (Locale negotiation) l MESSAGE TEXT l User Defined Characters (UDC) (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 15 SQL Issues UIC#14

Proposed Changes to SQL l Synchronize with present standards l Character Handling l Collations l Locales (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 16 SQL Issues UIC#14

Synchronize with present standards l JTC 1/SC 22/WG 20 • Programming Languages and I 18 N l JTC 1/SC 2/WG 2 : ISO 10646 -1 • Character Set handling • Unicode concepts l JTC 1/SC 2/ WG 3 • Single byte character sets l ISO 14651/14652 • Standardized collations • Unicode Technical Report (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 17 SQL Issues UIC#14

Synchronize with present de Facto standards l Java • Unicode String type l RFC 2277 • UTF-8 as default internet encoding l XML • Potential Universal data stream • Default encoding is Unicode l ODBC 3. 5 • Mapping with SQL_WCHAR (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 18 SQL Issues UIC#14

(c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 19 SQL Issues UIC#14

Character Sets l SQL_TEXT º Unicode l Eliminate introducer for identifiers • Uxxxx • \ escape l Keep schema default character set l Add UNICHAR datatype (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 20 SQL Issues UIC#14

Character Sets (2) l Surrogate characters l User-defined character mechanism CREATE UDC <char value> FOR <charset name> AS <unicode binary value> [ WHERE LEXICAL PROPERTY LIKE <unicode binary value> [ WITH { UPPER | LOWER } = <unicode binary value> ] ] (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 22 SQL Issues UIC#14

Character Sets (3) l Canonical Equivalence for Identifiers Entry Level 1 Intermediate Level 2 (Vietnamese, Indic, Arabic) Full Level 3 (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle ¹ ’ ¹ ’ A+^ + ’ = ’ = A+^+ ’ 24 SQL Issues UIC#14

Collations l Unicode Consortium Technical Paper #10 for Universal sorting • Has mechanism for cultural differences, “overlays” • Proven in actual implementation (Java, Sybase internal testing) • Issue: No standard cultural variations yet Being developed by National Bodies, de-Facto, TC 304/Europe, ISO 14652 Cultural Registries l Map all data to Unicode for collation results (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 25 SQL Issues UIC#14

Collations (2) l Handle “Text Elements” in comparison • example: Marillo (ll) like “mari_o” like “maril_o” like “mari_lo” } } marillo, marilo } l Handle “Text Elements” in string functions • example (Vietnamese): SUBSTRING (‘ FROM 2 FOR 5 ) should be “ not “ (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle ‘ COLLATE viet_te “ (five text elements) “ (five unicode characters) 26 SQL Issues UIC#14

Locale Architecture l Linguistic profile preferences on connection • Like HTTP language preferences protocol • Consistent SQL locales independent of O/S l Communicate language to called external functions • to handle MESSAGE_TEXT in diagnostic functions l Locale-sensitive FOLD • e. g. Capitalized accents with fr_FR versus fr_CA (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 27 SQL Issues UIC#14

Locale Architecture (2) l Hierarchical Retrieval Methods • Application development • Language/message retrieval • Preferred choice retrieval SELECT * FROM table_foo WHERE HIERARCHY(foo_lang, SELECT lang FROM lang_list ORDER BY japan_nec) (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 28 SQL Issues UIC#14

Open Issues l Meta tagging of languages l Lexical equivalence • not canonically equivalent • columns • compatibility zone • in-line • keywords? • identifiers • non-Latin digits l Current I 18 N proposals for SQL (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 29 SQL Issues UIC#14

Current SQL Proposal vs. This Paper l Agree • Unicode as SQL_TEXT l Disagree • Character set introducers • Simultaneous multiple character set support • Canonical equivalence • UNICHAR • Surrogate characters • Locales • Use of standard character sets and collations • Language for MESSAGE_TEXT • Unicode identifiers • Text elements • FOLD • User-defined characters • Schema default character set • Hierarchical retrieval methods (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 31 SQL Issues UIC#14

Conclusion Unicode + Collations + Locale negotiation “opens the door” for other cultural processing of data • date/time • numbers • conversion to/from strings Fosters better SQL integration with: • Java • XML • the Internet • ODBC Feedback? Please visit: www. g 11 n. org/SQL (c) 1999, M. Mc. Kenna/Sybase Inc. , S. Buchta/Oracle 32 SQL Issues UIC#14