Unicode and the Unicode logo are trademarks of Unicode, Inc., used with permission

PHP meets ŬŋǐcøðΣ
Nuno Lopes, NEIIST, 4th Presentation Cycle (4º Ciclo de Apresentações), 13 October 2005

Agenda:
• Why l10n/i18n?
• Challenges of l10n
• Introduction to Unicode
• Current implementation (PHP 4/5)
• Future implementation (PHP 6)
• Links
• Questions

Agenda, next section: Why l10n/i18n?

Why l10n/i18n?
• There is more than one country in the world
• Ce n’est pas tout le monde qui parle anglais (French: "Not everyone speaks English")
• Tjueseks karakterer holder ikke mål (Norwegian: "Twenty-six characters are not enough")
• Нот эврибади из юзин зэ сэйм скрипт ивэн (English spelled in Cyrillic: "Not everybody is using the same script even")
• 它变得更加复杂的与亚洲语言 (Chinese: "It gets more complicated with Asian languages")

Why l10n/i18n?
• Support the languages you need without rewriting the application
• Add new characters transparently (for example, €)

Agenda, next section: Challenges of l10n

Challenges of l10n
• Differences between charsets
• Multi-byte vs. single-byte encodings
• Different algorithms for sorting, spelling, dates, ...

Example: Sorting (aka Collation)
• In Lithuanian, 'y' sorts between 'i' and 'k'
• In Traditional Spanish, 'ch' is treated as a single letter and sorts between 'c' and 'd'
• In Swedish, 'v' and 'w' are considered variants of the same letter
• In German dictionary order, 'of' sorts before 'öf'. In phone books it is the opposite, because 'ö' is treated like 'oe'
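
The German case can be tried out with the `Collator` class from PHP's intl extension. This is only a sketch: it assumes intl is installed and that the underlying ICU data ships the German phonebook tailoring (requested here with the `@collation=phonebook` locale keyword).

```php
<?php
// Assumption: intl extension loaded, file saved as UTF-8,
// ICU data includes the German "phonebook" collation.
$dictionary = new Collator('de_DE');
$phonebook  = new Collator('de_DE@collation=phonebook');

// Collator::compare() returns a negative, zero, or positive integer,
// in the same spirit as strcmp().
$dictionaryOrder = $dictionary->compare('of', 'öf'); // negative: 'of' first
$phonebookOrder  = $phonebook->compare('öf', 'of');  // negative: 'öf' first

var_dump($dictionaryOrder < 0);
var_dump($phonebookOrder < 0);
```

The same two words come out in a different order depending only on the collation requested, which is why byte-wise comparison cannot stand in for locale-aware sorting.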

Example: Capitalization
• Greek: Σ ⇨ σ (in the middle of a word)
• Greek: Σ ⇨ ς (at the end of a word)
• Turkish: i ⇨ İ, ı ⇨ I
• German: ß ⇨ SS (but lower[SS] = ss, so the round trip is lossy)
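
The German round trip can be reproduced with mbstring. This sketch assumes PHP 7.3 or later, where mbstring applies full case mapping; older versions leave 'ß' unchanged when uppercasing.

```php
<?php
// Assumption: mbstring on PHP 7.3+, file saved as UTF-8.
$original  = 'straße';
$upper     = mb_strtoupper($original, 'UTF-8'); // 'ß' expands to 'SS'
$roundTrip = mb_strtolower($upper, 'UTF-8');    // 'SS' becomes 'ss', not 'ß'

echo $upper, "\n";     // STRASSE
echo $roundTrip, "\n"; // strasse

// The original spelling is not recovered.
var_dump($roundTrip === $original); // bool(false)

// mbstring takes no locale argument, so the Turkish rule (i -> İ) is not applied.
$dotlessRule = mb_strtoupper('i', 'UTF-8');
echo $dotlessRule, "\n"; // I
```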

Agenda, next section: Introduction to Unicode

Introduction to Unicode
• Supports all languages
• 100,000+ characters
• 1 character != 1 byte
• Compatible with ASCII
• BOM (byte order mark) identifies the encoding used
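
A quick way to see "1 character != 1 byte" is to compare a byte count with a character count. A minimal sketch, assuming the mbstring extension and a source file saved as UTF-8:

```php
<?php
// Assumption: mbstring loaded, file saved as UTF-8.
$word = 'caçada'; // 6 characters; 'ç' takes 2 bytes in UTF-8

$bytes = strlen($word);             // strlen() counts bytes
$chars = mb_strlen($word, 'UTF-8'); // mb_strlen() counts characters

echo "$bytes bytes, $chars characters\n"; // 7 bytes, 6 characters

// ASCII-only text has the same byte and character count in UTF-8.
$asciiSame = strlen('cacto') === mb_strlen('cacto', 'UTF-8');
var_dump($asciiSame); // bool(true)
```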

Technical terms (UTF-16)
• Code point: the number that represents a character (U+1234)
• Code unit: a sequence of two bytes
• Surrogates (high and low): 2 code units that together represent a single character (> U+FFFF)
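
The surrogate arithmetic is small enough to write out. `toSurrogatePair()` below is a hypothetical helper written for this illustration, not a PHP built-in; it assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(int $codePoint): array
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Expected a code point in U+10000..U+10FFFF');
    }
    $offset = $codePoint - 0x10000;     // 20-bit value
    $high   = 0xD800 + ($offset >> 10); // top 10 bits
    $low    = 0xDC00 + ($offset & 0x3FF); // bottom 10 bits
    return [$high, $low];
}

// U+1D11E MUSICAL SYMBOL G CLEF needs two code units.
[$high, $low] = toSurrogatePair(0x1D11E);
printf("U+1D11E -> %04X %04X\n", $high, $low); // U+1D11E -> D834 DD1E
```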

Encodings
• UTF-7 (obsolete)
• UTF-8 (up to 4 bytes)
• UTF-16 (LE & BE) (2 or 4 bytes)
• UTF-32 (LE & BE) (4 bytes)
• UTF-EBCDIC (up to 5 bytes)
• ...
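
The byte counts listed above can be checked by converting a few sample characters. A sketch assuming mbstring and PHP 7+ (for the `"\u{...}"` escape); `encodedLength()` is a helper name made up for this example.

```php
<?php
// Hypothetical helper: number of bytes one UTF-8 character occupies
// after conversion to another encoding (requires mbstring).
function encodedLength(string $char, string $encoding): int
{
    return strlen(mb_convert_encoding($char, $encoding, 'UTF-8'));
}

// ASCII letter, Latin letter with cedilla, euro sign, and a character above U+FFFF.
$samples = ['a', 'ç', '€', "\u{1D11E}"];

foreach ($samples as $char) {
    printf(
        "%s: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes\n",
        $char,
        strlen($char),
        encodedLength($char, 'UTF-16BE'),
        encodedLength($char, 'UTF-32BE')
    );
}
// a: UTF-8=1, UTF-16=2, UTF-32=4 bytes
// ç: UTF-8=2, UTF-16=2, UTF-32=4 bytes
// €: UTF-8=3, UTF-16=2, UTF-32=4 bytes
// 𝄞: UTF-8=4, UTF-16=4, UTF-32=4 bytes
```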

Character composition
a + ˆ + . = ậ    U+0061 + U+0302 + U+0323 = U+1EAD
a + . + ˆ = ậ    U+0061 + U+0323 + U+0302 = U+1EAD

Normalization
• Equivalent characters are reduced to a standard form (for example, the extended ASCII characters)
• Makes algorithms simpler
Å != Å    U+00C5 != U+0041 + U+030A
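
The equivalences on the composition and normalization slides can be checked with the intl extension's `Normalizer` class. A sketch assuming intl is loaded and PHP 7+ (for `"\u{...}"` escapes); NFC is chosen here only as one example of a standard form.

```php
<?php
// Assumption: intl extension loaded.
$circumflexFirst = "a\u{0302}\u{0323}"; // a + circumflex + dot below
$dotFirst        = "a\u{0323}\u{0302}"; // a + dot below + circumflex
$precomposed     = "\u{1EAD}";          // ậ as a single code point

// Compared byte by byte, the two sequences are different strings.
var_dump($circumflexFirst === $dotFirst); // bool(false)

// After normalization to NFC both collapse to the precomposed character.
$nfcA = Normalizer::normalize($circumflexFirst, Normalizer::FORM_C);
$nfcB = Normalizer::normalize($dotFirst, Normalizer::FORM_C);
var_dump($nfcA === $precomposed && $nfcB === $precomposed); // bool(true)

// Same idea for Å: U+0041 + U+030A composes to U+00C5.
$ring = Normalizer::normalize("A\u{030A}", Normalizer::FORM_C);
var_dump($ring === "\u{00C5}"); // bool(true)
```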

Properties
• Characters have properties, such as:
  ◦ Whitespace
  ◦ Letters (lower/upper case)
  ◦ Numbers
  ◦ Punctuation
  ◦ ...
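
In shipping PHP, character properties are reachable through PCRE's `\p{...}` classes with the `/u` modifier. The `classify()` function below is a hypothetical helper for illustration and covers only the categories named on the slide.

```php
<?php
// Hypothetical helper: name the broad Unicode category of a single
// UTF-8 character, using PCRE property classes (requires the /u modifier).
function classify(string $char): string
{
    if (preg_match('/^\p{Lu}$/u', $char)) {
        return 'uppercase letter';
    }
    if (preg_match('/^\p{Ll}$/u', $char)) {
        return 'lowercase letter';
    }
    if (preg_match('/^\p{N}$/u', $char)) {
        return 'number';
    }
    if (preg_match('/^\p{P}$/u', $char)) {
        return 'punctuation';
    }
    if (preg_match('/^\p{Z}$/u', $char)) {
        return 'space';
    }
    return 'other';
}

// Greek capital sigma, c with cedilla, Arabic-Indic digit three,
// inverted question mark, and a plain space.
foreach (['Σ', 'ç', '٣', '¿', ' '] as $char) {
    printf("'%s' is: %s\n", $char, classify($char));
}
```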

Agenda, next section: Current implementation (PHP 4/5)

Iconv
• iconv_strlen()
• iconv_substr()
• iconv_strpos()
• iconv()
• Does not solve most of the problems
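
A short sketch of the four iconv functions listed above, assuming the iconv extension and a source file saved as UTF-8. Note that every call needs the encoding spelled out, and none of them knows anything about collation or case rules.

```php
<?php
// Assumption: iconv extension loaded, file saved as UTF-8.
$utf8 = 'caçada';

// iconv() converts between charsets: 'ç' is 2 bytes in UTF-8, 1 in ISO-8859-1.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);
echo strlen($utf8), "\n";   // 7
echo strlen($latin1), "\n"; // 6

// The iconv_str*() functions work in characters rather than bytes.
$length   = iconv_strlen($utf8, 'UTF-8');           // 6
$position = iconv_strpos($utf8, 'ada', 0, 'UTF-8'); // 3
$third    = iconv_substr($utf8, 2, 1, 'UTF-8');     // ç

echo "$length $position $third\n"; // 6 3 ç
```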

Mbstring
• mb_strlen()
• mb_strpos()
• ...
• Focused on Asian charsets
• Also does not solve most of the problems

Agenda, next section: Future implementation (PHP 6)

PHP 6
• Detects the script's encoding via the BOM
• Transparent overloading of existing functions
• Unicode variable and function names
• Support for POSIX locales
• Uses IBM's ICU library
• UTF-16 internally

Settings

Hello World

    <?php
    ini_set('unicode.output_encoding', 'iso-8859-1');

    function こんにちは() {
        $世界 = 'Hello World!';
        echo $世界;
    }

    こんにちは();
    ?>

Sorting

    <?php
    // the list of the strings to sort
    $array = array(
        'caramelo',
        'cacto',
        'caçada'
    );

    // set our locale (Portuguese, in this case)
    i18n_loc_set_default('pt_PT');

    // sort using the locale we previously set
    sort($array, SORT_LOCALE_STRING);
    ?>
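
On a PHP build that lacks `i18n_loc_set_default()`, roughly the same comparison can be sketched with the intl extension's `Collator` class. This is a substitute technique, not the PHP 6 API shown on the slide, and it assumes intl is installed and the file is saved as UTF-8.

```php
<?php
// Assumption: intl extension loaded, file saved as UTF-8.
$words = ['caramelo', 'cacto', 'caçada'];

// Plain sort() compares bytes: 'ç' (0xC3 0xA7) lands after every ASCII letter.
$bytewise = $words;
sort($bytewise);
print_r($bytewise); // cacto, caramelo, caçada

// A Portuguese collator treats 'ç' as a variant of 'c'.
$collator  = new Collator('pt_PT');
$localized = $words;
$collator->sort($localized);
$localized = array_values($localized); // normalize keys for display
print_r($localized); // caçada, cacto, caramelo
```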

Normalization

    <?php
    // U+212B (ANGSTROM SIGN) normalizes to U+00C5, so both names
    // refer to the same variable
    $GLOBALS["\u212B"] = '승인';

    // U+00C5 = Å
    echo $GLOBALS["\u00C5"];
    ?>

String types
• binary: raw strings
• string: uses the script's encoding (for backward compatibility)
• unicode: UTF-16

Binary vs Unicode

    <?php
    $unicode = '傀�两亨乄了�刄';
    $binary  = b'傀�两亨乄了�刄';
    $binary2 = (binary) $unicode;

    echo strlen($unicode);  // 8
    echo strlen($binary);   // 24
    echo strlen($binary2);  // 24

    var_inspect($unicode[2]); // unicode(1) "两" { 4e24 }
    var_dump($binary[2]);     // Ç
    ?>

Escapes

    <?php
    // '\Uxxxxxx'
    $str = 'U+123: \U000123';

    // '\uxxxx'
    $str = 'U+123: \u0123';

    // unicode(8) "U+123: ģ"
    var_dump($str);
    ?>

New functions
• unicode_decode(input, encoding)
• string unicode_encode(input, encoding)
• string i18n_loc_get_default()
• bool i18n_loc_set_default(locale)
• text i18n_strtotitle(str)
• ...?

Stream Filters
• unicode.to.*: Unicode -> String
• unicode.from.*: String -> Unicode
• unicode.tidy.*: "magic" filter

Agenda, next section: Links

Links
• www.php.net/unicode
• http://www.derickrethans.nl/files/php6unicode.pdf
• http://www.gravitonic.com/do_download.php?download_file=talks/oscon2005/php_unicode_oscon2005.pdf
• http://mega.ist.utl.pt/~ncpl/pres/

PHP meets ŬŋǐcøðΣ
Questions?