Web Scraping Lecture 10 - Selenium
Topics
  • Selenium WebDriver
  • ChromeDriver, PhantomJS
Readings:
  • Chapter 10
January 26, 2017
Overview
Last Time: Lecture 8, Slides 1-29
  • Chapter 9: the Requests Library – filling out forms
    • 1-simpleForm.py
    • 2-fileSubmission.py
    • 3-cookies.py
    • 4-sessionCookies.py
    • 5-BasicAuth.py
  • Software Architecture of systems
Today:
  • Chapter 13
References: Chapter 13, websites
CSCE 590 Web Scraping Spring 2017
Selenium WebDriver
Big Picture = Software Architecture – how the components of the software fit together
References
Windows Installation
  § YouTube video
  § https://www.youtube.com/watch?v=V69wc4Tmwjc
Linux Installation
  § http://blog.likewise.org/2015/01/setting-up-chromedriver-and-the-selenium-webdriver-python-bindings-on-ubuntu-14-dot-04/
ChromeDriver
  § https://sites.google.com/a/chromium.org/chromedriver/getting-started
PhantomJS
Selenium Site
JavaScript
    <script> alert("This creates a pop-up using JavaScript"); </script>
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3813-3814). O'Reilly Media. Kindle Edition.
Examples of JavaScript
jQuery is an extremely common library,
  § used by 70% of the most popular Internet sites and
  § about 30% of the rest of the Internet.
  § A site using jQuery is readily identifiable because it will contain an import to jQuery somewhere in its code, such as:
    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
  § jQuery can dynamically create HTML content that appears only after the JavaScript is executed.
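The jQuery check described above can be sketched in Python with only the standard library. This is a rough heuristic, not a definitive detector: the names `ScriptSrcCollector`, `uses_jquery`, and `SAMPLE` are made up for illustration, and a substring match on "jquery" will also flag jQuery plugins.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def uses_jquery(html):
    """Return True if any script src looks like a jQuery import (heuristic)."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return any("jquery" in src.lower() for src in collector.sources)


# Hypothetical sample page containing the import shown on the slide
SAMPLE = (
    "<html><head>"
    '<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>'
    "</head><body></body></html>"
)

print(uses_jquery(SAMPLE))
```

If the check comes back true, the static HTML may be missing content that only appears after the JavaScript runs, which is the case Selenium is meant to handle.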
Google Analytics
Google Maps embedded in websites
Executing JavaScript with Selenium
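As a minimal sketch of executing JavaScript through Selenium: a WebDriver object has an `execute_script` method that runs a JavaScript snippet in the current page and hands back whatever the snippet returns. The helper name `scroll_to_bottom` is an assumption for illustration, and the function expects an already-open driver to be passed in.

```python
def scroll_to_bottom(driver):
    """Scroll the current page to the bottom and return the page height.

    `driver` is assumed to be a Selenium WebDriver with a page already loaded.
    """
    return driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
        "return document.body.scrollHeight;"
    )
```

With a real browser this would typically be called after `driver.get(...)`, for example to trigger content that a page loads only when the user scrolls.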
Selenium Self Service Carolina Demo
Ajax and Dynamic HTML
Installation
Not just pip here: there is also the separate ChromeDriver executable, which forms the interface between your Python program using Selenium and the browser (in this case Chrome).
ChromeDriver - WebDriver for Chrome
  § Latest Release: ChromeDriver 2.27
  § https://sites.google.com/a/chromium.org/chromedriver/downloads
  § Pick your OS
  § Unzip and remember where it is
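One way to "remember where it is" is to let a small helper look for the executable. The sketch below uses only the standard library; `find_chromedriver` and its `extra_dirs` parameter are hypothetical names, and the directories you pass in are whatever location you unzipped ChromeDriver to.

```python
import os
import shutil


def find_chromedriver(extra_dirs=()):
    """Return the path to a chromedriver executable, or None if not found.

    Directories in `extra_dirs` are checked first, then the system PATH.
    """
    name = "chromedriver.exe" if os.name == "nt" else "chromedriver"
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fall back to searching the PATH environment variable
    return shutil.which("chromedriver")
```

A result of `None` is a hint that the driver still needs to be downloaded or that its folder is not on the PATH.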
PhantomJS – headless WebDriver
http://phantomjs.org/download.html
Setting Up ChromeDriver and the Selenium-WebDriver Python bindings on Ubuntu 14.04
Install Google Chrome for Debian/Ubuntu:
    sudo apt-get install libxss1 libappindicator1 libindicator7
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    sudo dpkg -i google-chrome*.deb
    sudo apt-get install -f
Install xvfb so we can run Chrome headlessly:
    sudo apt-get install xvfb
https://christopher.su/2015/selenium-chromedriver-ubuntu/
ChromeDriver – Ubuntu 14.04
    sudo apt-get install unzip
    wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
    unzip chromedriver_linux64.zip
    chmod +x chromedriver
    sudo mv -f chromedriver /usr/local/share/chromedriver
    sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
    sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Install Selenium and pyvirtualdisplay
    pip install pyvirtualdisplay selenium
Now, we can do stuff like this with Selenium in Python:
    from pyvirtualdisplay import Display
    from selenium import webdriver
    # Start a virtual X display (xvfb) so Chrome can run without a screen
    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Chrome()
    driver.get('http://christopher.su')
    print(driver.title)
    # Close the browser and stop the virtual display when finished
    driver.quit()
    display.stop()
Selenium Selectors
You can still use BeautifulSoup
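A minimal sketch of combining the two, assuming `bs4` is installed: Selenium exposes the rendered DOM as the string `driver.page_source`, which can be handed to BeautifulSoup like any other HTML. The helper name `links_from_rendered_page` is made up for illustration.

```python
from bs4 import BeautifulSoup


def links_from_rendered_page(driver):
    """Parse the browser's current DOM with BeautifulSoup and return link targets.

    `driver` is assumed to be a Selenium WebDriver with a page already loaded.
    """
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

This keeps Selenium responsible for running the JavaScript and leaves the parsing to the BeautifulSoup calls used in earlier lectures.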
from selenium.webdriver.common.by import By
By selection strategies
PhantomJS – headless WebDriver, again
http://phantomjs.org/download.html
XPath Syntax
  § XPath (short for XML Path) is a query language used for navigating and selecting portions of an XML document.
  § It was published by the W3C in 1999.
  § It is used in languages such as Python, Java, and C# when dealing with XML documents.
  § Although BeautifulSoup does not support XPath, many of the other libraries in this book do.
  § It can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with more generalized XML documents rather than HTML documents in particular.
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 4051-4056). O'Reilly Media. Kindle Edition.
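To get a feel for the syntax without a browser, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The document `DOC` and its ids are invented for the example; the expression selects every `a` element directly under the `div` whose `id` is `content`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document
DOC = """
<html>
  <body>
    <div id="nav"><a href="/home">Home</a></div>
    <div id="content">
      <a href="/a">First</a>
      <a href="/b">Second</a>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# .//div            any div below the root
# [@id='content']   keep only the one with this id attribute
# /a                its direct <a> children
links = root.findall(".//div[@id='content']/a")

hrefs = [a.get("href") for a in links]
texts = [a.text for a in links]
print(hrefs, texts)
```

The same expression written in full XPath would be `//div[@id='content']/a`, which is the form to pass to Selenium with the `By.XPATH` strategy.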
XPath
Selenium Self Service Carolina Demo
    if __name__ == "__main__":
        driver = init_driver()
        # Module-level name; lookup() reads it as a global
        password = "MyPassword"
        #password = input("Enter MySC password: ")
        lookup(driver, "Selenium")
        time.sleep(5)
        driver.quit()
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from bs4 import BeautifulSoup

    def init_driver():
        # Path to the ChromeDriver executable that was unzipped during installation
        driver = webdriver.Chrome("E:/chromedriver_win32/chromedriver.exe")
        # Explicit wait: give elements up to 5 seconds to appear
        driver.wait = WebDriverWait(driver, 5)
        return driver
    def lookup(driver, query):
        driver.get("https://my.sc.edu/")
        print("SSC opened")
        try:
            link = driver.wait.until(EC.presence_of_element_located(
                (By.PARTIAL_LINK_TEXT, "Sign in to")))
            #https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
            print("Found link", link)
            link.click()
            print("Clicked link")
            #button = driver.wait.until(EC.element_to_be_clickable(
            #    (By.NAME, "btnK")))
            #box.send_keys(query)
            #button.click()
        except TimeoutException:
            print("Houston we have a problem First Page")
        # Now try to login
        try:
            user_box = driver.wait.until(EC.presence_of_element_located(
                (By.NAME, "username")))
            #https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
            print("Found box", user_box)
            user_box.send_keys("01069379")
            print("ID entered")
            passwd_box = driver.wait.until(EC.presence_of_element_located(
                (By.ID, "vipid-password")))
            print("Found password box", passwd_box)
            passwd_box.send_keys(password)
            print("password entered")
            button = driver.wait.until(EC.element_to_be_clickable(
                (By.NAME, "submit")))
            print("Found submit button", button)
            #box.send_keys(query)
            button.click()
        except TimeoutException:
            print("Houston we have a problem Login Page")
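The `driver.wait.until(...)` calls in the demo are explicit waits: the condition is checked repeatedly until it succeeds or the timeout expires. The sketch below shows roughly that polling idea in plain Python. It is not Selenium's implementation; `wait_until` and its parameters are hypothetical, and it raises the built-in `TimeoutError` rather than Selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)
```

Compared with a fixed `time.sleep(5)`, polling returns as soon as the element or value is available and only waits the full timeout when something has gone wrong.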