PHP Internals: Strings and Textual Content
by Martin Kruliš (v1.0), 26. 2. 2015

Select Your Charset
• One Charset to Rule Them All
  ◦ HTML, PHP, database (connection), text files, …
  ◦ Determined by the language(s) used
    ▪ Unicode covers almost every language
  ◦ Early incoming, late outgoing conversions
• Charset in Meta-data
  ◦ Must be in HTTP headers
      header('Content-Type: text/html; charset=utf-8');
  ◦ Do not use HTML meta element with http-equiv
    ▪ Except special cases (like saving HTML file locally)

Multi-byte Strings
• Multibyte Character Encoding
  ◦ Some charsets (e.g., UTF-8, UTF-16, …) use more than one byte per character
  ◦ Standard string functions are single-byte (ANSI) based
    ▪ They treat each byte as a char
• Multibyte String Functions Library
  ◦ Standard library, often present in PHP
  ◦ Duplicates most of the standard string functions, but with prefix mb_ (mb_strlen, mb_strpos, …)
  ◦ Encoding conversions: mb_convert_encoding()
  ◦ mb_internal_encoding() – specifies the internal encoding used in PHP
Example 1
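The difference between byte-oriented and multibyte functions can be sketched as follows. This is a minimal illustration that is not taken from the slides; it assumes the mbstring extension is loaded, and the sample string and variable names are hypothetical.

```php
<?php
// Assumption: the mbstring extension is available.
// "\u{0161}" is the letter š, which takes two bytes in UTF-8.
$name = "Kruli\u{0161}";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($name);
$charLength = mb_strlen($name, 'UTF-8');

// Case conversion that understands non-ASCII letters.
$upper = mb_strtoupper($name, 'UTF-8');

// Converting to a single-byte charset changes the byte representation.
$latin2 = mb_convert_encoding($name, 'ISO-8859-2', 'UTF-8');

printf("%d bytes, %d characters\n", $byteLength, $charLength);
```

Passing the encoding explicitly keeps the sketch independent of the value set by mb_internal_encoding().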

Data Encoding
• Encoding Input Data from HTTP
  ◦ Usually done transparently
    ▪ Check the “mbstring” section of php.ini
  ◦ Can be done manually: mb_parse_str()
• Databases
  ◦ The database or the database connection usually needs to be configured
  ◦ An example for MySQL database
    ▪ mysqli_set_charset()
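Both steps might look roughly like the sketch below. It assumes that mbstring is loaded and that the internal encoding is UTF-8 (the PHP default). The query string, the helper name openUtf8Connection(), and all connection parameters are hypothetical placeholders; the helper is only defined here, not called, because it needs a running MySQL server.

```php
<?php
// Manual parsing of URL-encoded input (hypothetical raw query string).
$query = 'name=Kruli%C5%A1&lang=cs';
$fields = [];
mb_parse_str($query, $fields);   // URL-decodes and converts to the internal encoding

// Hypothetical helper: open a MySQL connection that talks UTF-8.
// Host, credentials, and database name are placeholders supplied by the caller.
function openUtf8Connection(string $host, string $user, string $password, string $database): mysqli
{
    $db = new mysqli($host, $user, $password, $database);
    // Tell the server which charset the client sends and expects.
    if (!$db->set_charset('utf8mb4')) {
        throw new RuntimeException('Cannot set connection charset: ' . $db->error);
    }
    return $db;
}
```

The object-oriented set_charset() call is equivalent to the procedural mysqli_set_charset() named on the slide; utf8mb4 is MySQL's charset for full UTF-8.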

Comparisons and Conversions
• Lexicographical Comparison of Strings
  ◦ Best to be done elsewhere (in DBMS for instance)
  ◦ The strcmp() function is binary safe (it compares bytes and ignores the locale)
  ◦ For locale-aware comparison (strcoll()), the locale must be set correctly (setlocale())
• Iconv Library
  ◦ An alternative to Multibyte String Functions
  ◦ Fewer functions
  ◦ Easier for encoding conversions
    ▪ Can deal with missing mappings and replacements
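A small sketch of byte-wise versus locale-aware comparison and of an iconv conversion, assuming the iconv extension is available. The locale name cs_CZ.UTF-8 and the sample strings are assumptions; the locale may not be installed on a given system, in which case setlocale() returns false.

```php
<?php
// strcmp() compares raw bytes: "Z" (0x5A) sorts before "a" (0x61).
$bytewise = strcmp('Zebra', 'apple');          // negative number

// strcoll() follows the current locale. The locale name is an assumption;
// the result of the comparison depends on whether it could be set.
$localeSet = setlocale(LC_COLLATE, 'cs_CZ.UTF-8');
$localeAware = strcoll('Zebra', 'apple');

// Early incoming conversion: 0xB9 is the letter š in ISO-8859-2.
$latin2 = "Kruli\xB9";
$utf8 = iconv('ISO-8859-2', 'UTF-8', $latin2);

// Missing mappings: //TRANSLIT asks iconv to approximate characters that
// the target charset lacks. The exact result depends on the iconv
// implementation, so no particular output is assumed here.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
```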

Input Verification/Sanitization
• What to Sanitize
  ◦ Everything that possibly comes from users: $_GET, $_POST, $_COOKIE, …
  ◦ Data that comes from external sources (database, text files, …)
• When to Sanitize
  ◦ On input
    ▪ At the beginning of the script
  ◦ On output
    ▪ When inserted into HTML, into SQL queries, …

Input Verification/Sanitization
• How to Verify
  ◦ Regular expressions
  ◦ Filter functions
    ▪ filter_input(), filter_var(), …
    ▪ Useful for special validations (e-mail, URL, IP, …)
• How to Sanitize
  ◦ String and filter functions, regular expressions
  ◦ htmlspecialchars() – encoding for HTML
  ◦ urlencode() – encoding for URL
  ◦ DBMS-specific functions (mysqli_escape_string())
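Verification and output encoding could be sketched like this. The input values and variable names are made up for illustration; in a real script they would come from $_GET, $_POST, or a similar source.

```php
<?php
// Verification: filter_var() returns the (converted) value on success
// and false on failure.
$email   = filter_var('john@example.com', FILTER_VALIDATE_EMAIL);
$badMail = filter_var('not-an-email', FILTER_VALIDATE_EMAIL);
$age     = filter_var('42', FILTER_VALIDATE_INT, [
    'options' => ['min_range' => 1, 'max_range' => 150],
]);

// Sanitization on output: encode for the context the data is inserted into.
$comment = '<b>Tom & "Jerry"</b>';
$forHtml = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
$forUrl  = 'search.php?q=' . urlencode('rock & roll');

echo $forHtml, "\n", $forUrl, "\n";
```

The same value needs a different encoding for each target (HTML, URL, SQL), which is why the encoding is applied late, at the point of output.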

Regular Expressions
• String Search Patterns
  ◦ Special syntax that encodes a program (language) for a finite automaton
  ◦ Simple to use
    ▪ Encoding is (mostly) human readable
  ◦ POSIX and Perl Standards
• Usage
  ◦ Searching strings, listing matches
  ◦ Find and replace
  ◦ Splitting a string into an array of strings

Regular Expression Syntax
• Expression
  ◦ <separator>expr<separator>modifiers
  ◦ Separator is a single character (usually /, #, %, …)
  ◦ Pattern modifiers are flags that affect the evaluation
• Base Syntax
  ◦ Sequence of atoms
  ◦ Atom could be
    ▪ Simple (non-meta) character (letter, number, …)
    ▪ Dot (.) represents any character (except a newline, unless the s modifier is set)
    ▪ A list of characters in [] ([abc], [0-9a-z_], …)
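The separator/modifier layout can be tried out with preg_match(). The patterns and subjects below are illustrative assumptions, not examples from the lecture.

```php
<?php
// The same pattern with two different separators. Using # avoids
// escaping the slashes that occur in the pattern itself.
$path = '/users/42';
$withSlash = preg_match('/^\/users\/[0-9]+$/', $path);
$withHash  = preg_match('#^/users/[0-9]+$#', $path);

// A modifier after the closing separator changes the evaluation
// (i = case insensitive).
$sensitive   = preg_match('/php/', 'PHP Internals');
$insensitive = preg_match('/php/i', 'PHP Internals');

// Three atoms in sequence: a literal character, a dot, and a character list.
$atoms = preg_match('/a.[0-9a-z_]/', 'abc');
```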

Regular Expression Syntax
• Important Meta-characters
  ◦ \ – an escaping character for other meta-characters
  ◦ Anchors ^, $ marking start/end of a string/line
    ▪ ^ in character class definition inverts the set
  ◦ [, ] – character class definition
  ◦ {, } – min/max quantifier atom{n}, atom{min,max}
    ▪ [0-9]{8} (8-digit number), .{1,9} (1–9 chars)
  ◦ (, ) – subpattern (treated like an atom)
  ◦ *, +, ? – repetitions, shorthand notations of {0,}, {1,}, and {0,1} respectively
  ◦ | – branches (ptrn1|ptrn2)
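The meta-characters listed above can be exercised with a few small patterns. The helper fullMatch() and all subjects are hypothetical names chosen for this sketch.

```php
<?php
// Hypothetical helper: true when preg_match() reports a match.
function fullMatch(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$checks = [
    // Quantifier {n}: exactly eight digits between the anchors.
    'eight digits'   => fullMatch('/^[0-9]{8}$/', '12345678'),
    'seven digits'   => fullMatch('/^[0-9]{8}$/', '1234567'),
    // Quantifier {min,max}: one to nine arbitrary characters.
    'one to nine'    => fullMatch('/^.{1,9}$/', 'abc'),
    'empty string'   => fullMatch('/^.{1,9}$/', ''),
    // ^ inside a character class inverts the set.
    'no digits'      => fullMatch('/^[^0-9]+$/', 'abc'),
    'has a digit'    => fullMatch('/^[^0-9]+$/', 'a1c'),
    // Subpattern with branches, followed by the ? shorthand.
    'branches'       => fullMatch('/^(cat|dog)s?$/', 'dogs'),
    // Escaped dot matches only a literal dot; an unescaped dot matches anything.
    'escaped dot'    => fullMatch('/^3\.14$/', '3x14'),
    'unescaped dot'  => fullMatch('/^3.14$/', '3x14'),
];

foreach ($checks as $label => $result) {
    echo $label, ': ', $result ? 'match' : 'no match', "\n";
}
```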

Regular Expression Syntax
• Character Classes
  ◦ Pre-defined classes identified by names [:name:]
    ▪ For example [ab[:digit:]] matches a, b, and 0-9
  ◦ alpha – letters
  ◦ digit – decimal digits
  ◦ alnum – letters and digits
  ◦ blank – horizontal whitespace (space and tab)
  ◦ space – any whitespace (including line breaks)
  ◦ lower, upper – lowercase/uppercase letters
  ◦ cntrl – control characters
  ◦ xdigit – hexadecimal digits
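A short sketch of the named classes in use. Note that a named class is only valid inside a character class, hence the doubled brackets. The subjects are illustrative assumptions.

```php
<?php
// The slide's example: a, b, and the digits 0-9.
$onlyAbDigits = preg_match('/^[ab[:digit:]]+$/', 'a1b2');
$hasOther     = preg_match('/^[ab[:digit:]]+$/', 'abc');

// xdigit: a six-digit hexadecimal colour code.
$isHexColor = preg_match('/^#[[:xdigit:]]{6}$/', '#1A2b3C');

// blank covers only space and tab, space also covers line breaks.
$blankMatchesNewline = preg_match('/[[:blank:]]/', "\n");
$spaceMatchesNewline = preg_match('/[[:space:]]/', "\n");

// Collapse every run of whitespace into a single space.
$normalized = preg_replace('/[[:space:]]+/', ' ', "6 eggs,\t3 spoons\n of oil");
```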

Regular Expression Syntax
• Modifiers
  ◦ i – case insensitive
  ◦ m – multiline mode (^, $ match start/end of a line)
  ◦ s – '.' matches also a newline character
  ◦ x – ignore whitespace in regex (except in character class constructs)
  ◦ S – more extensive performance optimizations
  ◦ U – switch to not greedy evaluation
    ▪ Greedy evaluation means that patterns with *, +, or ? try to match as many characters as possible
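The effect of the individual modifiers can be observed on small inputs. The following sketch uses made-up subjects; each pair of calls differs only in the modifier.

```php
<?php
// m: ^ matches at the start of every line, not only at the start of the string.
$lineStarts = preg_match_all('/^\w+/m', "one\ntwo\nthree", $firstWords);

// s: the dot also matches a newline.
$dotDefault = preg_match('/a.b/', "a\nb");
$dotAll     = preg_match('/a.b/s', "a\nb");

// x: whitespace in the pattern is ignored, which allows a spaced-out layout.
$extended = preg_match('/ \d{3} - \d{4} /x', '555-1234');

// U: quantifiers become not greedy and stop at the first possible end.
preg_match('/<.+>/', '<b>bold</b>', $greedy);
preg_match('/<.+>/U', '<b>bold</b>', $lazy);

echo $greedy[0], ' vs ', $lazy[0], "\n";
```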

Regular Expression Syntax
• Subpatterns
  ◦ To ensure correct operation precedence
      (one|two|three){1,3}
  ◦ To add modifiers to only a part of the expression
      (?modifiers:ptrn)
  ◦ To mark important parts of the expression
    ▪ Used to retrieve parts of a string after matching
    ▪ Named subpatterns (?<name>ptrn), or (?'name'ptrn)
    ▪ Non-capturing subpatterns (?:ptrn) are not stored in the matches
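The three uses of subpatterns might be demonstrated as follows. The date and the other subjects are illustrative assumptions.

```php
<?php
// Named subpatterns are available under their name and under their number.
preg_match('/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/', '2015-02-26', $date);

// A capturing group that repeats keeps only its last iteration,
// a non-capturing group (?:...) stores nothing.
preg_match('/^(one|two|three){1,3}$/', 'onetwo', $captured);
preg_match('/^(?:one|two|three){1,3}$/', 'onetwo', $notCaptured);

// A modifier applied to one part only: "php" is case insensitive, "rocks" is not.
$partial = preg_match('/(?i:php) rocks/', 'PHP rocks');
$strict  = preg_match('/(?i:php) rocks/', 'PHP ROCKS');

echo $date['day'], '. ', $date['month'], '. ', $date['year'], "\n";
```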

Regular Expression Example
• E-mail Verification (RFC 2822)
    (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
    |"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
    |\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
    @(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
    |\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
    (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
    (?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
    |\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Regular Expression Functions
• preg_match($ptrn, $subj [, &$matches])
  ◦ Searches given string by a regex
  ◦ Returns 1 if the pattern matches the subject, 0 if it does not, and false on error
  ◦ The matches array gathers the matched substrings of subject with respect to the expression and subpatterns
    ▪ Subpatterns are indexed from 1
    ▪ At index 0 is the entire expression
    ▪ Named patterns are indexed by their names
  "6 eggs, 3 spoons of oil, 250 g of flour" ~ /[[:digit:]]+/
  array(1) { [0] => string("6") }
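The slide's recipe example can be reproduced in a few lines. The preg_match_all() call and the named subpatterns are additions for illustration and are not part of the original example.

```php
<?php
$recipe = '6 eggs, 3 spoons of oil, 250 g of flour';

// preg_match() stops after the first match.
$found = preg_match('/[[:digit:]]+/', $recipe, $matches);

// preg_match_all() collects every match and returns their count.
$count = preg_match_all('/[[:digit:]]+/', $recipe, $all);

// Subpatterns appear after index 0, both by number and by name.
preg_match('/(?<amount>[[:digit:]]+) g of (?<what>[[:alpha:]]+)/', $recipe, $parts);

var_dump($matches);
```

Comparing the result with === 1 rather than treating it as a boolean distinguishes "no match" (0) from an error (false).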

Regular Expression Functions
• preg_replace($ptrn, $repl, $str)
  ◦ Search and replace substrings in a string
    ▪ Each match of the pattern is replaced
    ▪ Replacement may contain references to subpatterns
• preg_split($ptrn, $str [, $limit])
  ◦ Similar to explode() function
  ◦ Split a string into an array of strings
  ◦ The pattern is used to match delimiters
    ▪ Delimiters are not part of the result
Example 2
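A minimal sketch of both functions, not taken from the slides. The date format and the list of ingredients are assumptions chosen for illustration.

```php
<?php
// References $1, $2, $3 in the replacement refer to the subpatterns.
$iso = '2015-02-26';
$dotted = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3. $2. $1', $iso);

// The pattern describes the delimiter (a comma with optional whitespace),
// so the delimiter itself does not appear in the result.
$items = preg_split('/\s*,\s*/', 'eggs , oil,flour');

// With a limit, the last element holds the unsplit rest of the string.
$limited = preg_split('/\s*,\s*/', 'eggs , oil,flour', 2);

echo $dotted, "\n";
```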

POSIX Regular Expressions
• Differences
  ◦ The expression is not enclosed by separators
    ▪ No modifiers can be added
  ◦ Only simple subpatterns
  ◦ Only a few escape sequences
• Functions
  ◦ ereg(), ereg_replace(), split()
  ◦ Each function has an i variant (case insensitive)
    ▪ eregi() – case insensitive version of ereg()
  ◦ Deprecated since PHP 5.3 (removed in PHP 7.0)
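Code that still uses the POSIX functions can usually be ported to their PCRE counterparts by adding separators and, for the case-insensitive variants, the i modifier. The mapping below is a rough sketch with made-up patterns; the old calls appear only in comments because they no longer exist in current PHP.

```php
<?php
// ereg('^[0-9]+$', $s)            ->  preg_match('/^[0-9]+$/', $s)
// eregi('^php', $s)               ->  preg_match('/^php/i', $s)
// ereg_replace('[0-9]+', '#', $s) ->  preg_replace('/[0-9]+/', '#', $s)
// split('[,;]', $s)               ->  preg_split('/[,;]/', $s)

$isNumber  = preg_match('/^[0-9]+$/', '2015') === 1;
$startsPhp = preg_match('/^php/i', 'PHP Internals') === 1;
$masked    = preg_replace('/[0-9]+/', '#', 'v 1.0, 26. 2. 2015');
$parts     = preg_split('/[,;]/', 'a,b;c');
```

A pattern that contains the chosen separator needs that character escaped or a different separator.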

Discussion