TDM 40100: Project 12 — 2022
Motivation: In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially sqlite3, containerization, and analysis work as well.
Context: This is the third in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we’ve touched on in previous Data Mine courses. For this project, we continue to build our suite of tools designed to scrape public housing data.
Scope: playwright, Python, web scraping
Questions
Question 1
This project series has (maybe) been a bit intense. This project is going to give you a little break: there is nothing new to do except change the package we are using.
playwright is a modern web scraping tool backed by Microsoft that, like selenium, allows you to interact with a web page before scraping. playwright is not necessarily better (yet); however, it is different and actively maintained.
Implement the get_links and link_to_blob functions using playwright instead of selenium. You can find the documentation for playwright here.
Before you get started, you will need to run the following in a bash cell.
%%bash
python3 -m playwright install
Finally, we aren’t going to force you to fight with the playwright documentation to get started, so the following is an example of code that will run in a Jupyter notebook and perform many of the same basic operations you are accustomed to with selenium.
import asyncio
from playwright.async_api import async_playwright

# so we can run this from within Jupyter, which is already async
import nest_asyncio
nest_asyncio.apply()

async def main():
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://purdue.edu/directory")

        # print the page source
        print(await page.content())

        # get html element
        e = page.locator("xpath=//html")

        # print the inner html of the element
        print(await e.inner_html())

        # isolate the search bar "input" element
        inp = e.locator("xpath=.//input")

        # print the outer html, or the element and contents
        print(await inp.evaluate("el => el.outerHTML"))

        # fill the input with "mdw"
        await inp.fill("mdw")
        print(await inp.evaluate("el => el.outerHTML"))

        # find the search button and click it
        await page.locator("xpath=//a[@id='glass']").click()

        # we can delay the program to allow the page to load
        # (asyncio.sleep pauses without blocking the event loop, unlike time.sleep)
        await asyncio.sleep(5)

        # find the table in the page with Dr. Ward's content
        table = page.locator("xpath=//table[@class='more']")

        # print the table and contents
        print(await table.evaluate("el => el.outerHTML"))

        # find the alias; if a selector starts with // or .. it is assumed to be xpath
        print(await page.locator("//th[@class='icon-key']").evaluate("el => el.outerHTML"))

        # you can print an attribute
        print(await page.locator("//th[@class='icon-key']").get_attribute("scope"))

        # similarly, you can print an element's content
        print(await page.locator("//th[@class='icon-key']").inner_text())

        # you could use the regular xpath stuff, no problem
        print(await page.locator("//th[@class='icon-key']/following-sibling::td").inner_text())

        await browser.close()

asyncio.run(main())
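The fixed five-second pause in the example works, but it waits the full time even when the page is ready sooner, and it can still be too short on a slow connection. As a minimal sketch of one alternative, the hypothetical helper `wait_for_html` below (the name and the 10-second default are assumptions, not part of the project) asks playwright to wait until the element is visible using `Locator.wait_for`, then returns its outer HTML. It accepts any `page` object created as in the example above.

```python
async def wait_for_html(page, selector, timeout_ms=10_000):
    """Wait for `selector` to become visible, then return its outer HTML.

    Hypothetical helper: `page` is a playwright Page created as in the
    example above; `timeout_ms` is an assumed default, in milliseconds.
    """
    locator = page.locator(selector)
    # Waits (without blocking the event loop) until the element is visible,
    # or raises a playwright TimeoutError once `timeout_ms` has elapsed.
    await locator.wait_for(state="visible", timeout=timeout_ms)
    return await locator.evaluate("el => el.outerHTML")
```

Inside `main`, a call such as `print(await wait_for_html(page, "xpath=//table[@class='more']"))` could stand in for the delay followed by the table lookup.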
- Code used to solve this problem.
- Output from running the code.
Question 2
Implement the get_links function using playwright. Test it out so the example below is the same (or close; listed houses may change).
Here is the
Use the
Don’t forget to
Unlike in
Instead, you’ll have to use the useful
To clear cookies, search for "cookie" in the playwright documentation. Hint: you can clear cookies using the context object.
The following provides a working skeleton to run the asynchronous code in Jupyter.
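As a rough sketch of what such a skeleton might look like (not the course's official skeleton), the block below wraps the browser setup from Question 1 around a placeholder scraping step. The names `collect_hrefs`, `get_links_sketch`, and `run_in_jupyter` are hypothetical, the real `get_links` signature and target site come from the earlier projects, and collecting every `a` element is only a stand-in for the real selection logic. It relies on `Locator.count` and `Locator.nth` to walk over multiple matches, and on `BrowserContext.clear_cookies` for the cookie hint.

```python
import asyncio


async def collect_hrefs(page, selector="xpath=//a[@href]"):
    """Return the href of every element matching `selector` (placeholder logic)."""
    matches = page.locator(selector)
    hrefs = []
    # A locator can match many elements; count() and nth() walk over them.
    for i in range(await matches.count()):
        href = await matches.nth(i).get_attribute("href")
        if href:
            hrefs.append(href)
    return hrefs


async def get_links_sketch(url):
    """Open `url` in headless Firefox and return the links found on the page."""
    # Imported here so the helper above can be used without a browser installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context()
        await context.clear_cookies()  # start from a clean cookie jar
        page = await context.new_page()
        await page.goto(url)
        links = await collect_hrefs(page)
        await browser.close()
    return links


def run_in_jupyter(coro):
    """Run a coroutine from a notebook cell, where an event loop is already running."""
    import nest_asyncio  # third-party; patches asyncio to allow nested loops

    nest_asyncio.apply()
    return asyncio.run(coro)
```

In a notebook cell, `run_in_jupyter(get_links_sketch("https://purdue.edu/directory"))` would exercise the skeleton; swap in the URL and selector that your own `get_links` needs.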
- Code used to solve this problem.
- Output from running the code.
Question 3
Implement the link_to_blob function using playwright. Test it out so the example below functions.
To test your code, run the following.
output
[['11/9/2022', 'Price change', '275000'], ['11/2/2022', 'Listed for sale', '289900'], ['1/13/2000', 'Sold', '19000']]
output
[[2021, '1344', 124511], [2020, '1310', 122792], [2019, '1290', 120031], [2018, '1260', 117793], [2017, '1260', 115370], [2016, '1252', 112997], [2015, '1262', 112212], [2014, '1277', 113120], [2013, '1295', 112920], [2012, '1389', 124535], [2011, '1557', 134234], [2010, '1495', 132251], [2009, '1499', 128776], [2008, '1483', 128647], [2007, '1594', 124900], [2006, '1608', 121900], [2005, '1704', 118400], [2004, '1716', 115000], [2003, '1624', 112900], [2002, '1577', 110300], [2000, '288', 15700]]
Please note that exact numbers may change slightly; that is okay! Prices and things change.
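The expected output mixes types: price-history rows are all strings, while tax-history rows hold an integer year, a string tax amount, and an integer assessment. A minimal sketch of the cleanup that could produce those shapes follows. The helper names and the raw formats (`"$275,000"`, `"2021"`) are assumptions for illustration, since the text the site returns may differ, and the digit-stripping approach only suits whole-dollar amounts.

```python
import re


def digits_only(text):
    """Strip everything except digits, e.g. '$275,000' -> '275000'."""
    return re.sub(r"\D", "", text)


def price_row(date, event, price):
    """Build a price-history row: all three fields stay strings."""
    return [date.strip(), event.strip(), digits_only(price)]


def tax_row(year, tax, assessment):
    """Build a tax-history row: int year, str tax paid, int assessment."""
    return [int(digits_only(year)), digits_only(tax), int(digits_only(assessment))]


# Assumed raw strings, shaped like the first rows of the expected output.
print(price_row("11/9/2022", "Price change", "$275,000"))
print(tax_row("2021", "$1,344", "$124,511"))
```

Keeping the conversion in small functions like these makes it easy to check the row shapes separately from the scraping itself.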
- Code used to solve this problem.
- Output from running the code.
Question 4
Test out a playwright feature from the documentation that is new to you. This could be anything. One suggestion that could be interesting would be screenshots. As long as you demonstrate something new, you will receive credit for this question. Have fun, and happy Thanksgiving!
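The screenshot suggestion, for instance, could be explored along these lines. This is only a sketch: the helper name `save_screenshot` and the default file name are hypothetical, and `page` is assumed to be a playwright Page created as in Question 1. `Page.screenshot` accepts a `path` to write to and a `full_page` flag to capture the whole scrollable page.

```python
async def save_screenshot(page, path="page.png", full_page=True):
    """Save a screenshot of `page` to `path` and return the image size in bytes."""
    # screenshot() writes the file and also returns the image as bytes.
    image = await page.screenshot(path=path, full_page=full_page)
    return len(image)
```

Inside `main`, after `page.goto(...)`, a call such as `await save_screenshot(page, "directory.png")` would write the image next to the notebook.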
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.