SCRAPING WITH PHP
BASIC PRINCIPLES • What is scraping? • Automated extraction of data from a website • Transforming it into structured data that’s useful
BUT WHY BOTHER? • Save money • Early notifications • Track trends • Competitive intelligence • Build businesses • Supplementary data to feed/API • Fun
LONDON HOTELS
Date        Hotel                        Price (£)
20/04/2009  Travelodge, Covent Garden    19
22/04/2009  Travelodge, Covent Garden    69
27/04/2009  Travelodge, Covent Garden    59
28/04/2009  Travelodge, Covent Garden    19
29/04/2009  Travelodge, Covent Garden    19
04/05/2009  Travelodge, Covent Garden    19
06/05/2009  Travelodge, Covent Garden    59
13/05/2009  Travelodge, Covent Garden    59
20/05/2009  Travelodge, Covent Garden    59
15/07/2009  Travelodge, Marylebone       19
12/08/2009  Travelodge, Covent Garden    59
04/11/2009  Travelodge, Marylebone       49
17/11/2009  Travelodge, Marylebone       49
18/11/2009  Travelodge, Covent Garden    59
25/11/2009  Travelodge, Covent Garden    59
01/12/2009  Travelodge, Covent Garden    59
09/12/2009  Travelodge, Covent Garden    59
16/12/2009  Travelodge, Marylebone       49
06/01/2010  Travelodge, Euston           19
13/01/2010  Travelodge, Euston           19
19/01/2010  Travelodge, Euston           19
20/01/2010  Travelodge, Covent Garden    29
190 hotels booked between 2009 and 2014
• 22 Tune Hotels: average price £12.34 (cheapest £1.51)
• 168 Travelodges: average price £32.15 (cheapest £10)
EXAMPLES
BBC NEWS
• Latest headline
https://www.bbc.co.uk/news
<h3 class="gs-c-promo-heading__title gel-paragon-bold nw-o-link-split__text">Turkey to reveal 'naked truth' on Khashoggi</h3>
LATEST BBC NEWS STORY
Evolution of code - exploding
<?php
$html = file_get_contents("https://www.bbc.co.uk/news");
// Opening tag that sits immediately before the headline text
$string = '<h3 class="gs-c-promo-heading__title gel-paragon-bold nw-o-link-split__text">';
// Everything after the first opening tag...
$content = explode($string, $html);
// ...cut off at the first closing tag
$content = explode("</h3>", $content[1]);
$title = $content[0];
echo $title;
LATEST BBC NEWS STORY
Evolution of code - substr
<?php
$html = file_get_contents("https://www.bbc.co.uk/news");
$string = '<h3 class="gs-c-promo-heading__title gel-paragon-bold nw-o-link-split__text">';
// Start of the headline text (just after the opening tag)
$pos1 = strpos($html, $string) + strlen($string);
// End of the headline text (the next closing tag)
$pos2 = strpos($html, '</h3>', $pos1);
$title = substr($html, $pos1, $pos2 - $pos1);
echo $title;
LATEST BBC NEWS STORY
Evolution of code - substring / get_string_between
http://www.justin-cook.com/2006/03/31/php-parse-a-string-between-two-strings/
<?php
include_once("substring.php");
$html = file_get_contents("https://www.bbc.co.uk/news");
$title = substring($html, '<h3 class="gs-c-promo-heading__title gel-paragon-bold nw-o-link-split__text">', '</h3>');
echo $title;
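The slide relies on a substring() helper loaded from substring.php, which is not shown. A minimal sketch of what such a helper might look like follows; the body is an assumption written for illustration, not the code from the linked article, and the sample markup is made up so the example runs without a network request.

```php
<?php
// Hypothetical stand-in for the substring() helper in substring.php:
// returns the text between the first $start and the next $end,
// or an empty string when either marker is missing.
function substring(string $haystack, string $start, string $end): string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return '';
    }
    $from += strlen($start);
    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return '';
    }
    return substr($haystack, $from, $to - $from);
}

// Made-up sample markup standing in for a downloaded page.
$html = '<div><h3 class="title">First headline</h3><h3 class="title">Second</h3></div>';
echo substring($html, '<h3 class="title">', '</h3>'), PHP_EOL;
```

Returning an empty string for a missing marker keeps calling code simple, at the cost of not distinguishing "not found" from "found but empty".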
ALL BBC NEWS STORIES
<?php
$html = file_get_contents("https://www.bbc.co.uk/news");
$string = '<h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">';
$pos1 = 0;
while (true) {
    // Find the next opening tag, starting from where the last one ended
    $pos1 = strpos($html, $string, $pos1);
    if ($pos1 === false) break; // no more headlines
    $pos1 += strlen($string);
    $pos2 = strpos($html, '</h3>', $pos1);
    if ($pos2 === false) break; // unterminated tag
    $title = substr($html, $pos1, $pos2 - $pos1);
    echo $title, PHP_EOL;
}
BETTER WAY? • Code vulnerable to any tiny mark-up changes • XML is your friend • XPath -> XML Path Language • Identify and navigate nodes • SimpleXML converts XML to a usable object
LET’S CONVERT HTML TO XML
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
We now have something usable.
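Real pages are rarely valid markup, and loadHTML() can emit a warning for each problem it finds. The sketch below shows one way to keep those warnings quiet using libxml_use_internal_errors(), and then reads a value from the resulting SimpleXML object. The sample HTML is made up for illustration.

```php
<?php
// Made-up sample markup with an unclosed <p>, standing in for a real page.
$html = '<html><body><div id="main"><p>Hello <b>world</b></div></body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xml = simplexml_import_dom($doc);

// The repaired tree can now be walked like any SimpleXML object.
echo $xml->body->div['id'], PHP_EOL;   // attribute access
echo $xml->body->div->p->b, PHP_EOL;   // element text
```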
WORKING OUT XPATHS XPath Helper is one of the better tools
<?php
$html = file_get_contents("https://www.bbc.co.uk/news");
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Select the <h3> inside every promo link
$titles = $xml->xpath("//div/a[@class='gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor']/h3");
foreach ($titles as $title) {
    echo (string) $title, PHP_EOL;
}
XPATH FEATURES • / = from the root node • // = any nodes that match • . = current node • @ = attributes, e.g. [@class='classname'] • contains() • not()
XPATH EXAMPLES • //div/a[@class='gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor']/h3 • //div/a/h3[contains(@class, 'gs-c-promo-heading')] • //div/a[h3[contains(@class, 'gs-c-promo-heading')]] • ->h3 = text • ['href'] = URL
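The contains() and not() expressions above can be tried against a small inline document. This is a sketch only: the markup below is invented and much simpler than the live BBC page, and only the class name fragment 'gs-c-promo-heading' is taken from the slides.

```php
<?php
// Invented sample markup, loosely modelled on the promo links above.
$html = <<<HTML
<html><body>
<div><a class="gs-c-promo-heading nw-o-link-split__anchor" href="/news/one"><h3 class="gs-c-promo-heading__title">Story one</h3></a></div>
<div><a class="gs-c-promo-heading nw-o-link-split__anchor" href="/news/two"><h3 class="gs-c-promo-heading__title">Story two</h3></a></div>
<div><a class="nav-link" href="/sport"><h3 class="nav-title">Sport</h3></a></div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xml = simplexml_import_dom($doc);

// contains(): match the <a> wrappers whose <h3> carries the promo class,
// without spelling out the full class attribute.
$links = $xml->xpath("//div/a[h3[contains(@class, 'gs-c-promo-heading')]]");
foreach ($links as $link) {
    // ->h3 gives the headline text, ['href'] the link target.
    echo (string) $link->h3, ' => ', (string) $link['href'], PHP_EOL;
}

// not(): every <h3> that is not a promo heading.
$others = $xml->xpath("//h3[not(contains(@class, 'gs-c-promo-heading'))]");
echo (string) $others[0], PHP_EOL;
```

Selecting the <a> rather than the <h3> is what makes both the text and the URL available from a single result node.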
WHITELABELS
OPEN GRAPH
// String matching
$price = substring($html, '<meta property="og:price:amount" content="', '"');
echo $price;
// XPath
$price = $xml->xpath("//meta[@property='og:price:amount']");
echo (string)$price[0]['content'];
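When a page exposes several Open Graph tags, it can be convenient to collect them all in one pass. The following sketch assumes nothing beyond the XPath approach above; the meta tags and their values are made up for illustration.

```php
<?php
// Made-up sample head section; only og:price:amount appears in the slides.
$html = '<html><head>'
      . '<meta property="og:title" content="Example Hotel">'
      . '<meta property="og:price:amount" content="19.00">'
      . '<meta property="og:price:currency" content="GBP">'
      . '</head><body></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xml = simplexml_import_dom($doc);

// Collect every og:* meta tag into a property => content map.
$og = [];
foreach ($xml->xpath("//meta[starts-with(@property, 'og:')]") as $meta) {
    $og[(string) $meta['property']] = (string) $meta['content'];
}

echo $og['og:price:amount'], ' ', $og['og:price:currency'], PHP_EOL;
```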
HELPERS
• file_get_contents - allow_url_fopen must be on
• Snoopy - github.com/duantianyu/Snoopy
  • Supports cookies
  • Last updated 2014
• Guzzle - github.com/guzzle
  • Set up a cookie jar
• Zend HTTP Client
ZEND HTTP CLIENT
include_once("Zend/Http/Client.php");
$client = new Zend_Http_Client;
$client->setCookieJar(); // enable cookies
// Pass only the user agent value; the client adds the "User-Agent:" header name itself
$client->setConfig(array('useragent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'));
$client->setConfig(array('keepalive' => true));
// Set URL
$client->setUri($website);
// Authentication
$client->setAuth($user, $pass);
// Cookies
$client->setCookie($name, $value);
// JSON request
$client->setHeaders('Content-Type', 'application/json');
$client->setRawData($json, 'application/json');
// SOAP request
$client->setHeaders('SOAPAction', $soap_action);
$client->setRawData($post_data, 'text/xml; charset=utf8');
// Get request
$client->request('GET');
// Post request
$client->setParameterPost($name, $value);
$client->request('POST');
// Get last URL
$client->getLastURI();
ISSUES • WAF/Robot blockers • Captcha • Georestrictions • .net
SOLUTIONS Georestrictions • Virtual Machines • Proxies
SOLUTIONS Captcha • OCR/online services • Mechanical Turk
SOLUTIONS .net • Very painful
SOLUTIONS WAF/Robot blockers • Googlebot • User Agent • Mimic user behaviour • Proxies • Scrapoxy - scrapoxy. io
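For plain file_get_contents() requests, the user agent and proxy can be supplied through a stream context. The sketch below is one possible arrangement: build_scrape_context() is a hypothetical helper name, the Accept-Language header is an arbitrary example, and the proxy address is a placeholder to be replaced with your own proxy (for instance a local Scrapoxy instance).

```php
<?php
// Hypothetical helper: builds a stream context for file_get_contents()
// with a browser-like User-Agent and, optionally, a proxy.
function build_scrape_context(string $userAgent, ?string $proxy = null)
{
    $http = [
        'method'     => 'GET',
        'user_agent' => $userAgent,
        'header'     => "Accept-Language: en-GB,en;q=0.9\r\n",
        'timeout'    => 10,
    ];
    if ($proxy !== null) {
        $http['proxy'] = $proxy;           // placeholder address, see below
        $http['request_fulluri'] = true;   // many proxies expect the full URI
    }
    return stream_context_create(['http' => $http]);
}

$context = build_scrape_context(
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2',
    'tcp://127.0.0.1:8888' // placeholder proxy address
);

// Usage (not executed here, as it needs network access):
// $html = file_get_contents($url, false, $context);
// sleep(random_int(2, 6)); // pause between requests to mimic a human reader
```

The random pause is a crude way to mimic user behaviour; the right interval depends on the site and is a judgment call.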
SCRAPOXY
ETHICS
IMAGES/ATTACHMENTS/PDFS?
• Images & Attachments
  • Download them onto the file system
  • Parse them as you see fit
  • Be careful with relative paths
• PDFs
  • Many online tools
  • Had success with A-PDF (Windows) - www.a-pdf.com/text/
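The relative-path warning is easiest to see with an example: an image referenced as img/room.jpg must be resolved against the URL of the page it was found on before it can be downloaded. The sketch below uses a hypothetical absolute_url() helper and made-up example.com URLs. It is deliberately simplified (it ignores query strings, fragments and schemes such as mailto:) and is not a full RFC 3986 resolver.

```php
<?php
// Hypothetical, simplified resolver: turns a link found in a page
// into an absolute URL, given the URL of that page.
function absolute_url(string $base, string $link): string
{
    // Already absolute (has a scheme).
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'];
    $origin = $scheme . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //cdn.example.com/a.png
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }
    // Root-relative: /static/logo.png
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }
    // Path-relative: resolve against the directory of the base path.
    $basePath = $parts['path'] ?? '/';
    $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);

    $segments = [];
    foreach (explode('/', $dir . $link) as $segment) {
        if ($segment === '..') {
            array_pop($segments);          // step up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    return $origin . '/' . implode('/', $segments);
}

$page = 'https://www.example.com/hotels/london/index.html';
echo absolute_url($page, 'img/room.jpg'), PHP_EOL;
echo absolute_url($page, '../brochure.pdf'), PHP_EOL;
echo absolute_url($page, '/static/logo.png'), PHP_EOL;
```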
JAVASCRIPT/API/REACT/NODE
• Network inspector is your friend
• Replicate exactly how the browser makes requests:
  • Use the correct content-type
  • Work out which data attributes are important
  • Send the correct headers and cookies (either copy them or work out how they are generated; look at the response cookie headers, i.e. Set-Cookie)
• Look at the response and work out how to parse it (XML/JSON/JSONP)
  • XML - SimpleXML
  • JSON - json_decode($json, TRUE)
  • JSONP - substring($json, 'callback(', ');') - then json_decode…
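The JSONP case can be sketched with built-in string functions alone. The response body below is made up, and "callback" stands for whatever function name the real request uses. Instead of the substring() helper, this variant cuts between the first "(" and the last ")", which also copes with a missing trailing semicolon.

```php
<?php
// Made-up JSONP response; "callback" is whatever name the request asked for.
$jsonp = 'callback({"hotel":"Travelodge, Covent Garden","prices":[19,69,59]});';

// Strip the wrapper: keep everything between the first "(" and the last ")".
$start = strpos($jsonp, '(');
$end   = strrpos($jsonp, ')');
$json  = substr($jsonp, $start + 1, $end - $start - 1);

// TRUE returns associative arrays instead of stdClass objects.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

echo $data['hotel'], ': ', min($data['prices']), PHP_EOL;
```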
EDGE CASES
FURTHER CONSIDERATIONS/THOUGHTS
THANKS Questions?