PHP Internals: Strings and Textual Content
by Martin Kruliš (v1.0), 26. 2. 2015

Select Your Charset
• One Charset to Rule Them All
  ◦ HTML, PHP, database (connection), text files, …
  ◦ Determined by the language(s) used
    ▪ Unicode covers almost every language
  ◦ Early incoming, late outgoing conversions
• Charset in Meta-data
  ◦ Must be in HTTP headers
      header('Content-Type: text/html; charset=utf-8');
  ◦ Do not use HTML meta element with http-equiv
    ▪ Except special cases (like saving HTML file locally)

Multi-byte Strings
• Multibyte Character Encoding
  ◦ Some charsets (e.g., UTF-8, UTF-16, …) use more than one byte per character
  ◦ Standard string functions are single-byte (ANSI) based
    ▪ They treat each byte as a char
• Multibyte String Functions Library
  ◦ The mbstring extension, optional but often present in PHP
  ◦ Duplicates most of the standard string functions, but with prefix mb_ (mb_strlen, mb_strpos, …)
  ◦ Encoding conversions: mb_convert_encoding()
  ◦ mb_internal_encoding() – gets or sets the internal encoding used by the mb_ functions
Example 1
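The difference between the byte-based and the mb_ functions can be sketched as follows. This is a minimal illustration, assuming the mbstring extension is loaded and PHP 7 or later (for the "\u{…}" escape); the variable names are made up for the example.

```php
<?php
// "Kruliš" written with an escape so the example does not depend on the file encoding.
$name = "Kruli\u{161}";                      // UTF-8; the 'š' takes two bytes

$byteLength = strlen($name);                 // counts bytes
$charLength = mb_strlen($name, 'UTF-8');     // counts characters

// Upper-casing that understands multibyte characters
$upper = mb_strtoupper($name, 'UTF-8');

// Conversion to a single-byte charset: every character becomes one byte
$latin2 = mb_convert_encoding($name, 'ISO-8859-2', 'UTF-8');

printf("%d bytes, %d characters\n", $byteLength, $charLength);
```

Passing the encoding explicitly, as above, avoids depending on whatever mb_internal_encoding() happens to be set to.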

Data Encoding
• Encoding Input Data from HTTP
  ◦ Usually done transparently
    ▪ Check the “mbstring” section of php.ini
  ◦ Can be done manually: mb_parse_str()
• Databases
  ◦ The database or the database connection usually needs to be configured
  ◦ An example for MySQL database
    ▪ mysqli_set_charset()
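A minimal sketch of both points, assuming the mbstring extension with its default settings. The query string is invented for the example, and the MySQL part is left as comments because the host, credentials, database name, and the choice of utf8mb4 are placeholders, not something the slides specify.

```php
<?php
// A raw query string as it might arrive in a URL (the value is URL-encoded UTF-8).
$query = 'name=Kruli%C5%A1&lang=cs';

$params = [];
mb_parse_str($query, $params);   // decodes the pairs into $params

// For MySQL, the connection charset is set right after connecting
// (placeholder connection details, not executed here):
// $db = mysqli_connect('localhost', 'user', 'password', 'test');
// mysqli_set_charset($db, 'utf8mb4');
```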

Comparisons and Conversions
• Lexicographical Comparison of Strings
  ◦ Best to be done elsewhere (in DBMS for instance)
  ◦ The strcmp() function is binary safe (it compares bytes and ignores the locale)
  ◦ For locale-aware comparison (strcoll()), the locale must be set correctly (setlocale())
• Iconv Library
  ◦ An alternative to Multibyte String Functions
  ◦ Fewer functions
  ◦ Easier for encoding conversions
    ▪ Can deal with missing mappings and replacements
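As a rough illustration, assuming the iconv extension is available, a conversion round trip and a byte-wise comparison might look like this. iconv() also accepts //TRANSLIT and //IGNORE suffixes on the target charset for characters without a mapping; their exact behaviour depends on the underlying iconv implementation, so they are not exercised here.

```php
<?php
$utf8 = "Kruli\u{161}";

// UTF-8 -> ISO-8859-2 and back again with iconv()
$latin2    = iconv('UTF-8', 'ISO-8859-2', $utf8);
$roundTrip = iconv('ISO-8859-2', 'UTF-8', $latin2);

// strcmp() compares raw bytes: the result is negative, zero, or positive
$order       = strcmp('apple', 'banana');
$caseMatters = strcmp('Zebra', 'apple');   // 'Z' (0x5A) sorts before 'a' (0x61)
```

The second comparison shows why byte order is not the same as dictionary order, which is one reason to leave sorting to the DBMS.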

Input Verification/Sanitization
• What to Sanitize
  ◦ Everything that possibly comes from users: $_GET, $_POST, $_COOKIE, …
  ◦ Data that comes from external sources (database, text files, …)
• When to Sanitize
  ◦ On input
    ▪ At the beginning of the script
  ◦ On output
    ▪ When inserted into HTML, into SQL queries, …

Input Verification/Sanitization
• How to Verify
  ◦ Regular expressions
  ◦ Filter functions
    ▪ filter_input(), filter_var(), …
    ▪ Useful for special validations (e-mail, URL, IP, …)
• How to Sanitize
  ◦ String and filter functions, regular expressions
  ◦ htmlspecialchars() – encoding for HTML
  ◦ urlencode() – encoding for URL
  ◦ DBMS-specific functions (mysqli_escape_string())
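The verification and sanitization functions listed above might be used like this. It is only a sketch with invented sample values; filter_input() is left out because it reads from a real HTTP request.

```php
<?php
// Verification: filter_var() returns the value on success, false on failure
$goodEmail = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);
$badEmail  = filter_var('not an e-mail', FILTER_VALIDATE_EMAIL);
$ip        = filter_var('192.168.0.1', FILTER_VALIDATE_IP);

// Sanitization on output: encode for the target context
$comment = '<b>"hi"</b>';
$forHtml = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
$forUrl  = urlencode('a b&c');
```

Since a failed validation returns false, the result should be compared with === rather than used as a plain truth value.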

Regular Expressions
• String Search Patterns
  ◦ Special syntax that encodes a program (language) for a regular (finite) automaton
  ◦ Simple to use
    ▪ Encoding is (mostly) human readable
  ◦ POSIX and Perl-compatible (PCRE) standards
• Usage
  ◦ Searching strings, listing matches
  ◦ Find and replace
  ◦ Splitting a string into an array of strings

Regular Expression Syntax
• Expression
  ◦ <separator>expr<separator>modifiers
  ◦ Separator is a single character (usually /, #, %, …)
  ◦ Pattern modifiers are flags that affect the evaluation
• Base Syntax
  ◦ Sequence of atoms
  ◦ Atom could be
    ▪ Simple (non-meta) character (letter, number, …)
    ▪ Dot (.) represents any character (except a newline by default)
    ▪ A list of characters in [] ([abc], [0-9a-z_], …)

Regular Expression Syntax
• Important Meta-characters
  ◦ \ – an escaping character for other meta-characters
  ◦ Anchors ^, $ marking start/end of a string/line
    ▪ ^ in character class definition inverts the set
  ◦ [, ] – character class definition
  ◦ {, } – min/max quantifier atom{n}, atom{min,max}
    ▪ [0-9]{8} (8-digit number), .{1,9} (1 to 9 chars)
  ◦ (, ) – subpattern (treated like an atom)
  ◦ *, +, ? – repetitions, shorthand notations of {0,}, {1,}, and {0,1} respectively
  ◦ | – branches (ptrn1|ptrn2)
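A few of these meta-characters in action, as a small sketch using preg_match() with / as the separator. The subject strings are arbitrary; each call yields 1 for a match and 0 otherwise.

```php
<?php
$isEightDigits = preg_match('/^[0-9]{8}$/', '26022015');   // quantifier {n}
$tooShort      = preg_match('/^[0-9]{8}$/', '2602');       // only 4 digits
$noDigits      = preg_match('/^[^0-9]+$/', 'slides');      // ^ inverts the class
$branch        = preg_match('/^(cat|dog)s?$/', 'dogs');    // subpattern, |, and ?
$escapedDot    = preg_match('/^v1\.0$/', 'v1.0');          // \ makes the dot literal
$unescapedDot  = preg_match('/^v1.0$/', 'v1x0');           // bare dot matches any char
```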

Regular Expression Syntax
• Character Classes
  ◦ Pre-defined classes identified by names [:name:]
    ▪ For example [ab[:digit:]] matches a, b, and 0-9
  ◦ alpha – letters
  ◦ digit – decimal digits
  ◦ alnum – letters and digits
  ◦ blank – horizontal whitespace (space and tab)
  ◦ space – any whitespace (including line breaks)
  ◦ lower, upper – lowercase/uppercase letters
  ◦ cntrl – control characters
  ◦ xdigit – hexadecimal digits
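The named classes are only valid inside a bracket expression, which is why they appear as [[:name:]] when used alone. A short sketch with made-up inputs:

```php
<?php
$mixed     = preg_match('/^[ab[:digit:]]+$/', 'a1b2');     // a, b, or a digit
$hex       = preg_match('/^[[:xdigit:]]+$/', 'DEADbeef');  // hexadecimal digits
$notHex    = preg_match('/^[[:xdigit:]]+$/', 'xyz');
$upperOnly = preg_match('/^[[:upper:]]+$/', 'PHP');

// [[:space:]] covers spaces, tabs, and line breaks alike
$words = preg_split('/[[:space:]]+/', "one two\tthree\nfour");
```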

Regular Expression Syntax
• Modifiers
  ◦ i – case insensitive
  ◦ m – multiline mode (^, $ match start/end of a line)
  ◦ s – '.' matches also a newline character
  ◦ x – ignore whitespace in regex (except in character class constructs)
  ◦ S – more extensive performance optimizations
  ◦ U – switch to not greedy evaluation
    ▪ Greedy evaluation means that patterns with *, +, or ? try to match as many characters as possible
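The effect of the individual modifiers can be checked with small experiments like the following sketch. The subjects are invented, and preg_match_all() (not mentioned on the slide) is used only to count matching lines.

```php
<?php
$caseless   = preg_match('/hello/i', 'HELLO');              // i: ignore case
$lineCount  = preg_match_all('/^\w+$/m', "one\ntwo", $m);   // m: ^ and $ per line
$dotDefault = preg_match('/a.b/', "a\nb");                  // '.' skips newlines
$dotAll     = preg_match('/a.b/s', "a\nb");                 // s: '.' matches newline
$extended   = preg_match('/ \d+  \s  eggs /x', '6 eggs');   // x: spaces in pattern ignored

// Greedy by default; U flips quantifiers to non-greedy
preg_match('/<.+>/',  '<a><b>', $greedy);
preg_match('/<.+>/U', '<a><b>', $lazy);
```

Here $greedy[0] holds the whole subject, while $lazy[0] stops at the first closing bracket.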

Regular Expression Syntax
• Subpatterns
  ◦ To ensure correct operation precedence
      (one|two|three){1,3}
  ◦ To add modifiers to only a part of the expression
      (?modifiers:ptrn)
  ◦ To mark important parts of the expression
    ▪ Used to retrieve parts of a string after matching
    ▪ Named subpatterns (?<name>ptrn), or (?'name'ptrn)
    ▪ Non-capturing subpatterns (grouping only, nothing is stored) (?:ptrn)
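The three uses of subpatterns might look like this in practice. The date pattern and the group names day, month, and year are illustrative choices, not taken from the slides.

```php
<?php
// Precedence: the quantifier applies to the whole group
$repeat = preg_match('/^(one|two|three){1,3}$/', 'onetwo');

// Modifier limited to one part of the expression
$partInsensitive = preg_match('/(?i:php) internals/', 'PHP internals');
$restSensitive   = preg_match('/(?i:php) internals/', 'PHP INTERNALS');

// Named subpatterns capture; (?:...) only groups and stores nothing
$pattern = '/(?<day>\d{1,2})(?:\.\s*)(?<month>\d{1,2})(?:\.\s*)(?<year>\d{4})/';
preg_match($pattern, '26. 2. 2015', $date);
```

Named groups are stored twice in $date, once under the name and once under the numeric index, so $date['year'] and $date[3] refer to the same substring.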

Regular Expression Example
• E-mail Verification (RFC 2822)
    (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
    |"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
    @(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
    |\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
    (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
    (?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
  (line breaks are for layout only; the pattern is a single line)

Regular Expression Functions
• preg_match($ptrn, $subj [, &$matches])
  ◦ Searches given string by a regex
  ◦ Returns 1 if the pattern matches the subject, 0 if it does not, and false on error
  ◦ The matches array gathers the matched substrings of subject with respect to the expression and subpatterns
    ▪ Subpatterns are indexed from 1
    ▪ At index 0 is the entire expression
    ▪ Named patterns are indexed by their names
      "6 eggs, 3 spoons of oil, 250 g of flour" ~ /[[:digit:]]+/
      array(1) { [0] => string("6") }
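A runnable sketch of a recipe match like the one on the slide, extended with one capturing and one named subpattern to show how the matches array is indexed. The second pattern, the group name what, and the use of preg_match_all() to list every number are additions for illustration.

```php
<?php
$recipe = '6 eggs, 3 spoons of oil, 250 g of flour';

// Only the first match is reported
$found = preg_match('/[[:digit:]]+/', $recipe, $first);

// Index 0 is the whole match, 1.. are subpatterns, names are added as keys
preg_match('/([[:digit:]]+) g of (?<what>[[:alpha:]]+)/', $recipe, $m);

// preg_match_all() collects every match instead of stopping at the first
$count = preg_match_all('/[[:digit:]]+/', $recipe, $all);

var_dump($first);
```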

Regular Expression Functions
• preg_replace($ptrn, $repl, $str)
  ◦ Search and replace substrings in a string
    ▪ Each match of the pattern is replaced
    ▪ Replacement may contain references to subpatterns
• preg_split($ptrn, $str [, $limit])
  ◦ Similar to explode() function
  ◦ Split a string into an array of strings
  ◦ The pattern is used to match delimiters
    ▪ Delimiters are not part of the result
Example 2
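Both functions can be tried out with a few one-liners. This is a sketch with invented inputs; the $1, $2, $3 references in the replacement refer to the subpatterns of the pattern.

```php
<?php
// Subpattern references in the replacement reorder the date parts
$swapped = preg_replace('/(\d+)\. (\d+)\. (\d+)/', '$3-$2-$1', '26. 2. 2015');

// Every match is replaced, not just the first one
$masked = preg_replace('/[[:digit:]]/', '#', 'PIN 1234');

// The delimiter pattern also swallows the whitespace around the commas
$parts = preg_split('/\s*,\s*/', '6 eggs , 3 spoons of oil,250 g of flour');

// With a limit, the last element keeps the unsplit rest
$limited = preg_split('/,/', 'a,b,c', 2);
```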

POSIX Regular Expressions
• Differences
  ◦ The expression is not enclosed by separators
    ▪ No modifiers can be added
  ◦ Only simple subpatterns
  ◦ Only a few escape sequences
• Functions
  ◦ ereg(), ereg_replace(), split()
  ◦ Each function has an -i version (case insensitive)
    ▪ eregi() – case insensitive version of ereg()
  ◦ Deprecated since PHP 5.3 (and removed in PHP 7.0)
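Because the POSIX functions are deprecated, existing calls are usually rewritten with their preg_ counterparts. The mapping below is a sketch for simple patterns only; the old calls appear as comments and the sample inputs are invented.

```php
<?php
// ereg('^[0-9]+$', $s)            becomes
$isNumber = preg_match('/^[0-9]+$/', '2015') === 1;

// eregi('^php$', $s)              becomes (the i modifier replaces the -i variant)
$isPhp = preg_match('/^php$/i', 'PHP') === 1;

// ereg_replace('[ ]+', ' ', $s)   becomes
$squeezed = preg_replace('/[ ]+/', ' ', 'a   b  c');

// split(':', $s)                  becomes
$fields = preg_split('/:/', 'root:x:0');
```

The main mechanical change is adding the separators; a pattern that itself contains the separator character needs escaping or a different separator.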

Discussion