Optimizing the Usage of Normalization
Vladimir Weinstein, Markus Scherer
IBM Globalization Center of Competency
27th Internationalization and Unicode Conference
Berlin, Germany, April 2005

Introduction
1. The Unicode standard has multiple ways to encode equivalent strings
   NFD: r e ´ s u m e ´ (résumé with each é stored as e + combining acute accent)
   NFC: r é s u m é (résumé with each é stored as a single code point)
2. Accents that don't interact are put into a unique order
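
The two points above can be tried out with a small sketch. It uses Python's standard `unicodedata` module rather than ICU, and the variable names are only illustrative.

```python
import unicodedata

composed = "r\u00e9sum\u00e9"        # each é is one code point (U+00E9)
decomposed = "re\u0301sume\u0301"    # each é is e + U+0301 COMBINING ACUTE ACCENT

# The same text, converted to each normalization form.
nfc = unicodedata.normalize("NFC", decomposed)   # 6 code points
nfd = unicodedata.normalize("NFD", composed)     # 8 code points

# Accents that do not interact (acute above, cedilla below) end up in
# one fixed order, whichever order they were entered in.
acute_first = "a\u0301\u0327"
cedilla_first = "a\u0327\u0301"
ordered = unicodedata.normalize("NFD", acute_first)
```

Here `nfc` equals `composed`, `nfd` equals `decomposed`, and `ordered` equals `cedilla_first`, because the cedilla has the lower canonical combining class.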

Introduction (contd.)
§ Normalization provides a way to transform a string to a unique form (NFD, NFC)
§ Strings that become identical when transformed to a unique form are called canonically equivalent
§ Time-critical applications need to minimize the number of passes over the text
§ ICU provides a number of tools to deal with this problem
§ We will use collation (language-sensitive string comparison) as an example

Avoiding Normalization
§ Force users to provide already normalized data
§ The performance problem does not go away
§ When the strings are processed many times, it could be beneficial to normalize them beforehand
§ Forcing users to provide a specific form can be unpopular

Check for Normalized Text
§ Most strings are already in normalized form
§ Quick Check is significantly faster than full normalization
§ Needs canonical combining class data and additional data for checking the relation between a code point and a normalization form
§ Algorithm in UAX #15 Annex 8 (http://www.unicode.org/unicode/reports/tr15/#Annex8)
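
A minimal sketch of a quick check for NFD, assuming Python's `unicodedata` module as the data source. It is not ICU's implementation, and the helper names `decomposes` and `quick_check_nfd` are made up for this example. For NFD the check only needs to know whether a code point decomposes and whether the combining classes are in order. The NFC check additionally has a "maybe" answer for characters that can combine with a preceding one.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables

def decomposes(ch):
    """True if ch has a canonical decomposition, so it cannot occur in NFD."""
    if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return True  # decomposed algorithmically, not listed per character
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; NFD ignores them.
    return bool(mapping) and not mapping.startswith("<")

def quick_check_nfd(text):
    """Single pass, no allocation: is text already in NFD?"""
    last_class = 0
    for ch in text:
        if decomposes(ch):
            return False
        cc = unicodedata.combining(ch)
        if cc != 0 and cc < last_class:
            return False  # combining marks are out of canonical order
        last_class = cc
    return True
```

Python 3.8 and later also ship `unicodedata.is_normalized(form, text)`, which is the more practical choice when it is available.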

Normalize Incrementally
§ Instead of normalizing the whole string at once, normalize one piece at a time
§ This technique is usually combined with an incremental Quick Check
§ Useful for procedures with early exit, such as string comparison or scanning
§ Normalizes up to the next safe point

Incremental Normalization: Example
[Diagram: the initial string "résumé", processed in two ways]
§ Non-incremental normalization: if normalized regularly, the whole string is processed by normalization
§ Quick check and incremental normalization: normalize just the parts that fail quick check
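
The idea can be sketched as a lazy generator, again with Python's `unicodedata` standing in for ICU and NFD as the target form. The function names are hypothetical, and `unicodedata.is_normalized` requires Python 3.8 or later. A safe point is assumed to lie before any character whose decomposition begins with a starter (combining class 0).

```python
import unicodedata
from itertools import zip_longest

def starts_segment(ch):
    """True if a normalization-safe point lies just before ch (for NFD)."""
    first = unicodedata.normalize("NFD", ch)[0]
    return unicodedata.combining(first) == 0

def nfd_segments(text):
    """Lazily yield NFD pieces of text, one safe segment at a time."""
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or starts_segment(text[i]):
            piece = text[start:i]
            # Quick check first; normalize only the pieces that fail it.
            if not unicodedata.is_normalized("NFD", piece):
                piece = unicodedata.normalize("NFD", piece)
            yield piece
            start = i

def nfd_chars(text):
    for piece in nfd_segments(text):
        yield from piece

def canonically_equal(a, b):
    """Early exit: stops pulling segments at the first differing code point."""
    return all(x == y for x, y in zip_longest(nfd_chars(a), nfd_chars(b)))
```

Because the generators are lazy, `canonically_equal("abc...", "xyz...")` touches only the first segment of each string, however long the strings are.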

Optimized Concatenation
§ Simple concatenation of two normalized strings can yield a string that is not normalized
§ One option is to normalize the result
§ Unnecessarily duplicates normalization

Optimized Concatenation: Example
[Diagram: two ways to build "résumé" from the pieces "re" and "sumé": concatenate and then normalize the whole string, or find the boundaries, concatenate, and normalize only up to the boundaries]
§ It is enough to normalize the boundary parts
§ Incremental normalization is used
§ Much faster than redoing the whole resulting string
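
A rough sketch of boundary-only normalization. To keep the boundary test simple, this version works in NFD rather than the composing form used in the slide's example: in NFD text a boundary is just a character with combining class 0. An NFC version would also need to know which characters can combine with a preceding one. `concat_nfd` is a hypothetical name, and both inputs are assumed to be in NFD already.

```python
import unicodedata

def concat_nfd(left, right):
    """Concatenate two strings that are already NFD; the result is NFD."""
    # Find the first starter in `right`; marks before it may need reordering.
    j = 0
    while j < len(right) and unicodedata.combining(right[j]) != 0:
        j += 1
    if j == 0:
        return left + right  # `right` begins at a boundary: nothing to fix

    # Find the last starter in `left`; the boundary part begins there.
    i = len(left)
    while i > 0:
        i -= 1
        if unicodedata.combining(left[i]) == 0:
            break

    # Normalize only the boundary part and leave the rest untouched.
    middle = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + middle + right[j:]
```

For example, joining `"a\u0301"` (a + acute) and `"\u0327b"` (cedilla + b) reorders only the two marks around the seam, giving `"a\u0327\u0301b"`.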

Accepting the FCD Form
§ Fast Composed or Decomposed form is a partially normalized form
§ Not unique
§ More lenient than NFD or NFC form
§ It requires that the procedure has support for all the canonically equivalent strings on input
§ It is possible to quick check the FCD form

FCD Form: Examples
SEQUENCE             FCD  NFC  NFD
A-ring                Y    Y
Angstrom              Y
A + ring              Y         Y
A + grave             Y         Y
A-ring + grave        Y
A + cedilla + ring    Y         Y
A + ring + cedilla
A-ring + cedilla           Y
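
The FCD column above can be reproduced with a short check. This sketch assumes the usual definition of FCD: for each adjacent pair of code points, the combining class of the last character of the first one's decomposition must not exceed the combining class of the first character of the second one's decomposition, unless the latter is 0. It uses Python's `unicodedata`, and `is_fcd` is a name chosen for this example, not an ICU function.

```python
import unicodedata

def is_fcd(text):
    """True if canonical decomposition alone, without reordering, yields NFD."""
    prev_trail = 0
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(nfd[0])    # leading combining class
        trail = unicodedata.combining(nfd[-1])  # trailing combining class
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = trail
    return True

A_RING, ANGSTROM = "\u00c5", "\u212b"
RING, GRAVE, CEDILLA = "\u030a", "\u0300", "\u0327"

rows = {
    "A-ring": A_RING,
    "Angstrom": ANGSTROM,
    "A + ring": "A" + RING,
    "A + grave": "A" + GRAVE,
    "A-ring + grave": A_RING + GRAVE,
    "A + cedilla + ring": "A" + CEDILLA + RING,
    "A + ring + cedilla": "A" + RING + CEDILLA,
    "A-ring + cedilla": A_RING + CEDILLA,
}
fcd_column = {name: is_fcd(seq) for name, seq in rows.items()}
```

Only the last two rows fail: in both, a ring (class 230) is followed by a cedilla (class 202), so reordering would be needed.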

Canonical Closure
§ Preprocessing data to support the FCD form
§ Ensures that if data is assigned to a sequence (or a code point) it will also be assigned to all canonically equivalent FCD sequences
Before closure:
  A-ring (U+00C5) = X
After closure:
  A-ring (U+00C5) = X
  Angstrom sign (U+212B) = X
  A + combining ring above (U+0041 U+030A) = X
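
A deliberately simplified sketch of the closure step, assuming Python's `unicodedata` and hypothetical function names. It only adds the fully decomposed form, the fully composed form, and single code points that decompose to the same sequence. A production closure, such as the one ICU performs at build time, also has to cover partially composed variants of longer sequences.

```python
import unicodedata

def is_fcd(text):
    prev_trail = 0
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(nfd[0])
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = unicodedata.combining(nfd[-1])
    return True

def build_equivalents():
    """Map each NFD string to the single code points that decompose to it."""
    table = {}
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            table.setdefault(nfd, []).append(ch)
    return table

def canonical_closure(mapping, equivalents):
    """Copy each value to the canonically equivalent FCD spellings of its key."""
    closed = dict(mapping)
    for key, value in mapping.items():
        nfd = unicodedata.normalize("NFD", key)
        variants = {nfd, unicodedata.normalize("NFC", key)}
        variants.update(equivalents.get(nfd, []))
        for variant in variants:
            if is_fcd(variant):
                closed.setdefault(variant, value)  # never overwrite explicit data
    return closed

closed = canonical_closure({"\u00c5": "X"}, build_equivalents())
```

The expensive scan in `build_equivalents` happens once, ahead of time, which mirrors the point that the cost is paid when the data is built rather than at lookup time.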

Collation
§ Locale-specific sorting of strings
§ Relation between code points and collation elements
§ Context sensitive:
  – Contractions: H < Z, but CZ < CH
  – Expansions: OE < Œ < OF
  – Both: カー < カイ or キー > キイ
See or read "Collation in ICU"
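
The contraction and expansion cases can be illustrated with a toy key function. Everything here is hypothetical: the weights, the two-level key, and the tables are invented for the example and have nothing to do with real collation data or the ICU API.

```python
# Toy tailoring: "ch" is one letter sorting just after "h" (a contraction),
# and "œ" sorts like "oe" with a lower-level difference (an expansion).
CONTRACTIONS = {"ch": ord("h") + 0.5}
EXPANSIONS = {"\u0153": "oe"}

def sort_key(text):
    """Return (primary weights, tertiary weights) for a lowercase comparison."""
    primary, tertiary = [], []
    text = text.lower()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            # Two code points map to a single collation element.
            primary.append(CONTRACTIONS[pair])
            tertiary.append(0)
            i += 2
        elif text[i] in EXPANSIONS:
            # One code point maps to several collation elements.
            for c in EXPANSIONS[text[i]]:
                primary.append(ord(c))
                tertiary.append(1)
            i += 1
        else:
            primary.append(ord(text[i]))
            tertiary.append(0)
            i += 1
    return (primary, tertiary)
```

With these tables, `sorted(["CH", "CZ"], key=sort_key)` places CZ first even though H sorts before Z on its own, and Œ lands between OE and OF.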

Collation Implementation in ICU
§ Two modes of operation:
  – Normalization OFF: expects the users to pass in FCD strings
  – Normalization ON: accepts any strings
§ Some locales require normalization to be turned on
§ Canonical closure done for contractions and regular mappings
§ Two important services
  – Sort key generation
  – String compare function
More about ICU at the end of the presentation

FCD Support in Collation
§ Much higher performance
§ Values assigned to a code point or a contraction are equal to those for its FCD canonically equivalent sequences
§ This process is time consuming, but it is done at build time
§ May increase data set

Sort Key Generation
§ Whole strings are processed
§ Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible
§ Two modes of operation
  – Normalization ON: strings are quick checked and normalization is performed, if required
  – Normalization OFF: depends on strings being in FCD form. The performance increases by 20% to 50%

String Compare
§ Very time critical
§ Result is usually determined before fully processing both strings
§ First step is binary comparison for equality
§ When it fails, comparison continues from a safe spot
[Diagram: example strings "A c z" and "Å c h", annotated with three cases]
  – No need to back up (normal situation)
  – Must back up to the start of a contraction
  – Must back up to the normalization safe spot
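
The steps above can be sketched as follows, under two loud simplifications: the "collation" is plain code point order on NFD text, and only the normalization safe spot is handled (a real collator compares collation elements and must also back up to the start of a contraction). The function names are hypothetical and Python's `unicodedata` stands in for ICU.

```python
import unicodedata

def is_safe_spot(text, i):
    """True if text can be split at index i without affecting NFD."""
    if i == 0 or i >= len(text):
        return True
    first = unicodedata.normalize("NFD", text[i])[0]
    return unicodedata.combining(first) == 0

def compare(a, b):
    """Return -1, 0 or 1, ordering a and b by their NFD code points."""
    # Step 1: binary comparison, skipping the identical prefix.
    i, limit = 0, min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    if i == len(a) == len(b):
        return 0  # bitwise identical: no normalization needed at all

    # Step 2: back up until the split is safe in both strings.
    while not (is_safe_spot(a, i) and is_safe_spot(b, i)):
        i -= 1

    # Step 3: continue from the safe spot, on the remainders only.
    rest_a = unicodedata.normalize("NFD", a[i:])
    rest_b = unicodedata.normalize("NFD", b[i:])
    return (rest_a > rest_b) - (rest_a < rest_b)
```

For long strings that share a long prefix, only the tails after the safe spot are ever normalized.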

String Compare Continued
§ Normalization ON: incremental FCD check and incremental FCD normalization if required
§ Normalization OFF: assumes that the source strings are FCD
§ Most locales don't require normalization to be on and are thus 20% faster by using FCD

International Components for Unicode
§ International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
§ The ICU normalization engine supports the optimizations mentioned here
§ Library services accept FCD strings as input
§ Wide variety of supported platforms
§ Open source (X license – non-viral)
§ C/C++ and Java versions
http://ibm.com/software/globalization/icu

Conclusion
§ The presented techniques allow much faster string processing
§ In case of collation, sort key generation gets up to 50% faster than if normalizing beforehand
§ String compare function becomes up to 3 times faster!
§ May increase data size
§ Canonical closure preprocessing takes more time to build, but pays off at runtime

Q&A

Summary
§ Introduction
§ Avoiding normalization
§ Check for normalized text
§ Normalize incrementally
§ Concatenation of normalized strings
§ Accepting the FCD form
§ Implementation of collation in ICU