Scraping the Web with SAS
Tom Kari, Tom Kari Consulting
OASUS (www.OASUS.ca), June 12, 2013
“Come out of the desert of ignorance to the OASUS of knowledge”

Google is wonderful, but…
• The first page is full of junk!
• I can’t tell how many pages I’m getting from each site.
• I KNOW the page I want is in here somewhere, how can I find it?
• I’m not using SAS when I use Google!
• How can I keep ALL the results to analyze?

The Basics

    /* Read a web page into a data set, one record per line of HTML */
    filename HTML_In url "http://www.dolphinsdance.ca";

    data URL_Retrieval_Results;
       length HTML_Rec $32767;
       infile HTML_In lrecl=32767;
       input;
       HTML_Rec = _infile_;
    run;

The Process
1. What goes in the reference to Google?
2. Get results from Google
3. How do I find the web sites listed by Google?
4. Extract the web sites
5. Figure out how to get 1000 web site listings
6. Post-process the results (SAS data management)

1. How to send a search to Google?
• In Internet Explorer:
  • F12 to open Developer Tools
  • Network → Start Capturing
  • Enter your search string
  • Stop Capturing
  • Dig around in the results

URL found in the capture:

    http://www.google.ca/s?gs_rn=14&gs_ri=psy-ab&cp=41&gs_id=a&xhr=t&q=beautiful%20vacation%20resort%20puerto%20vallarta&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.47008514,d.dmg&fp=5ad817295c2c0080&biw=1123&bih=374&tch=1&ech=1&psi=8xOlUdWjBOT84AO1iYCwDw.1369773041400.1

Simplified form used in the rest of this talk:

    http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta&start=1

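A minimal sketch of how the simplified search URL could be assembled from a plain phrase in a DATA step. The variable names (phrase, query, url) are made up for illustration, and only blanks are encoded here; other special characters would need their own handling.

```sas
data _null_;
   length query url $200;
   phrase = 'beautiful vacation resort puerto vallarta';

   /* Replace each blank with a plus sign, as in the simplified URL */
   query = translate(strip(phrase), '+', ' ');

   /* Single quotes keep SAS from treating the ampersand as a macro reference */
   url = cats('http://www.google.ca/search?q=', query, '&start=1');
   put url=;
run;
```

The resulting string is only written to the log; it is not sent to Google.
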
2. Get Results from Google

    filename HTML_In url
       "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";

    data GoogleResults;
       length HTML_Rec $32767;   /* 32,767 bytes: the longest character variable SAS allows */
       infile HTML_In lrecl=32767;
       input;
       HTML_Rec = _infile_;
    run;

3. How do I find the web sites listed by Google?

    <div id="res"><div id="topstuff"></div><div id="search"><div id="ires"><ol>
    <li class="g"><h3 class="r">
    <a href="/url?q=http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-Puerto_Vallarta.html&amp;sa=U&amp;ei=bhmlUbyUHPKw0QHk1YFg&amp;ved=0CCYQFjAAOAE&amp;usg=AFQjCNFLqCMjy4b4raYjbA8nvqHjJARGlA">
    Dreams <b>Puerto Vallarta Resort</b> &amp; Spa - All-inclusive <b>Resort</b> Reviews <b>...</b></a></h3>
    <div class="s"><div class="kv" style="margin-bottom:2px">
    <cite>www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_<b>Puerto</b>_<b>Vallarta</b>_<b>Resort</b>_Spa-<b>Puerto</b>_<b>Vallarta</b>.html</cite>
    <span class="flc"> - <a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:gaglProuhbkJ:http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-

3. How do I find the web sites listed by Google? (cont’d)

The magic of PRX routines!

"Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step. Pattern matching also enables you to make several substitutions in a string in one step. You do this by using the PRX functions and CALL routines in the DATA step. For example, you can search for multiple occurrences of a string and replace those strings with another string. You can search for a string in your source file and return the position of the match. You can find words in your file that are doubled."

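A small illustration of the capabilities the quotation lists, using a made-up sample string (an assumption, not from the talk). PRXMATCH returns the position of a match, PRXCHANGE makes substitutions, and a back-reference finds doubled words.

```sas
data _null_;
   length text fixed $80;
   text = 'the resort has has a pool and a a spa';

   /* Position of the first match (0 if there is none) */
   pos = prxmatch('/pool/', text);

   /* Replace every occurrence (-1 means all) of "resort" with "hotel" */
   fixed = prxchange('s/resort/hotel/', -1, text);

   /* Position of the first doubled word, found with a back-reference */
   dbl = prxmatch('/\b(\w+) \1\b/', text);

   put pos= / fixed= / dbl=;
run;
```
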
4. Extract the web sites

    filename HTML_In url
       "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";

    data GoogleHTMLResult;
       retain prxid;
       /* Compile the pattern once. The hyphen is in the character class */
       /* so that URLs like the TripAdvisor one match in full.           */
       if _n_=1 then prxid=prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]\-._~:\/?#\[\]@!\$''()*+,;=]+(?=&)/o');
       length HTML_Rec $32767;
       infile HTML_In lrecl=32767;
       input;
       HTML_Rec = _infile_;
       /* Position and length of the first result link in this record (0 if none) */
       call prxsubstr(prxid, HTML_Rec, HTML_Pos, HTML_Len);
       if HTML_Pos > 0 then CiteData = substr(HTML_Rec, HTML_Pos, HTML_Len);
       output;
    run;

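CALL PRXSUBSTR returns only the first match in each record, and a results page can hold several links on one line. A possible variation, sketched here with CALL PRXNEXT, loops until no further match is found. The two-link record, the data set name AllLinks, and the simplified pattern are assumptions made for illustration; nothing is fetched from the web.

```sas
data AllLinks;
   length HTML_Rec $32767 CiteData $2000;
   retain prxid;
   /* Simplified pattern: everything after /url?q= up to the next ampersand */
   if _n_ = 1 then prxid = prxparse('/(?<=<a href="\/url\?q=)[^&"]+(?=&)/');

   /* Made-up record standing in for one line of a results page */
   HTML_Rec = '<a href="/url?q=http://a.example/x&amp;sa=U">A</a>'
           || '<a href="/url?q=http://b.example/y&amp;sa=U">B</a>';

   start = 1;
   stop  = length(HTML_Rec);

   /* PRXNEXT advances start past each match; pos is 0 when none is left */
   call prxnext(prxid, start, stop, HTML_Rec, pos, len);
   do while (pos > 0);
      CiteData = substr(HTML_Rec, pos, len);
      output;
      call prxnext(prxid, start, stop, HTML_Rec, pos, len);
   end;
   keep CiteData;
run;
```
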
5. Figure out how to get 1000 web site listings

Quirks to remember
• Many characters can’t appear in Google search strings, so they must be encoded (spaces to +, etc.)
• Ampersands in your URL need %nrstr or will fail in SAS
• To use a new url infile in SAS, you need a new data step. This is easy with a macro loop.
• Every now and then it fails with “ERROR: Invalid reply received from the HTTP server. Use the debug option for more info.” Beats me!

5. Figure out how to get 1000 web site listings (cont’d)

Code is in “Example 4 Extract 1000 URLs”

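The Example 4 code is not reproduced here. As a rough sketch of the macro-loop idea only, the following runs one data step per results page. The macro name GetPages, its pages parameter, the Page1, Page2, ... data set names, and the assumption of 10 results per page (start=1, 11, 21, ...) are all hypothetical, and Google may reject or reformat such requests.

```sas
%macro GetPages(pages=3);
   %local i start;
   %do i = 1 %to &pages;
      /* Assumed paging scheme: 10 results per page */
      %let start = %eval((&i - 1) * 10 + 1);

      /* %nrstr keeps the literal "&start" in the URL; the second &start resolves */
      filename HTML_In url
         "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=&start";

      /* A new data step for each new URL */
      data Page&i;
         length HTML_Rec $32767;
         infile HTML_In lrecl=32767;
         input;
         HTML_Rec = _infile_;
         Page = &i;
      run;

      filename HTML_In clear;
   %end;

   /* Stack the per-page data sets */
   data AllPages;
      set Page1-Page&pages;
   run;
%mend GetPages;

%GetPages(pages=3)
```

Reaching 1000 listings under these assumptions would mean pages=100, with some allowance for the intermittent HTTP error mentioned among the quirks.
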
6. Post-process the results
• Count how many times each URL appears
• For each unique URL, retain the page and index where it first appears
• Create a nice-looking HTML page
• Code is in “Example 5 Post-processed”

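The Example 5 code is not reproduced here. One way the three steps might look, sketched on a made-up sample: the data set names (Links, URLSummary), the variable names, the sample rows, and the output file name are all assumptions.

```sas
/* Made-up sample of extracted links: page number, position on page, URL */
data Links;
   length URL $200;
   input Page Idx URL $;
   datalines;
1 1 http://a.example/
1 2 http://b.example/
2 1 http://a.example/
3 4 http://a.example/
;
run;

/* Sort so the first row for each URL is its earliest appearance */
proc sort data=Links;
   by URL Page Idx;
run;

/* One row per URL: how often it appears and where it first appears */
data URLSummary;
   set Links;
   by URL;
   retain FirstPage FirstIdx;
   if first.URL then do;
      Count = 0;
      FirstPage = Page;
      FirstIdx = Idx;
   end;
   Count + 1;
   if last.URL then output;
   keep URL Count FirstPage FirstIdx;
run;

/* Write the summary as a simple HTML page in the current directory */
ods html file='url_summary.html';
proc print data=URLSummary noobs;
run;
ods html close;
```
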
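Appendix: PRX parse strings

    prxid=prxparse('parse string');

    /(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]\-._~:\/?#\[\]@!\$''()*+,;=]+(?=&)/o

• Outer control: the opening / and the closing /o (compile the pattern once)
• Non-captured group: (?<=...) and (?=...), text that must come before and after the match but is not part of it
• Any-of: the character class [...]
• One or more: the + after the character class
• As-is: literal text such as <h3 class="r"><a href="
• Escaped: characters preceded by a backslash, such as \/ and \?
• Grouping: the parentheses around each group

The doubled '' stands for a single quote inside the SAS string.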