In Part 1, I’ll show my steps for scraping a website with an example. For instance, I want to create a random quote generator, so first I need a list of quotes — web scraping to the rescue! 🙌
It’ll be helpful to have a basic knowledge of HTML and Python to follow along. My example uses Python 3 and requires requests and bs4.
1. Find a website you’re interested in scraping
I googled “famous quotes” and the first result was http://www.keepinspiring.me/famous-quotes/.
2. Inspect the HTML to find patterns for extraction
After navigating to the website, I right-clicked the section I was interested in and clicked “Inspect” to show the HTML structure. See below:
Then I usually open up a few more of the HTML elements, and skim through the whole document to get a sense of the structure.
3. Use Python to extract from the HTML document
We’ll use BeautifulSoup to parse the HTML, select the elements we’re interested in, and just print the output for a start:
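Here's a minimal sketch of what such a first pass could look like. It's a sketch under assumptions rather than the exact script: the class name author-quotes, the helper names extract_quotes and fetch_html, and the tiny SAMPLE_HTML stand-in are all made up for illustration, so check the real markup in the inspector before relying on them.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://www.keepinspiring.me/famous-quotes/"

# Assumption: each quote sits in its own <div class="author-quotes">.
# Confirm the real class name in the browser inspector.
QUOTE_CLASS = "author-quotes"


def fetch_html(url):
    """Download a page and raise on HTTP errors instead of continuing."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_quotes(html):
    """Return the text of every <div> carrying QUOTE_CLASS."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text() for div in soup.find_all("div", class_=QUOTE_CLASS)]


# Placeholder markup standing in for the real page, so the parsing
# step can be tried without hitting the server.
SAMPLE_HTML = """
<div class="author-quotes">“First sample quote.” – Author One</div>
<div class="author-quotes">“Second sample quote.” – Author Two</div>
"""

quotes = extract_quotes(SAMPLE_HTML)
for i, quote in enumerate(quotes):
    print(i, quote)
```

Against the live page, the same function would be called as extract_quotes(fetch_html(URL)).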
After running the program (view the script output at https://gist.github.com/nickwu241/3d9ec7eb46608d4b88164f8e0f9e7ede), we get 108 quotes, as expected since the post title is “108 famous quotes on life, love, and success”. But some of the output is unexpected, e.g. quotes[64] is empty and quotes[107] has extra text after the quote.
4. Refine the Python script
We can refine our Python script by inspecting the page and correcting our assumptions about the HTML structure until we’re happy with the results!
In this example, let’s fix the empty quotes[64] first:
This is just an empty div, probably a mistake on keepinspiring.me. A fix is to ignore empty divs.
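One way to skip empty divs is to keep only the elements whose stripped text is non-empty. The sketch below assumes the same hypothetical author-quotes class and uses placeholder markup, not the real page:

```python
from bs4 import BeautifulSoup

# Placeholder markup: the middle div is empty, like quotes[64].
SAMPLE_HTML = """
<div class="author-quotes">“First sample quote.” – Author One</div>
<div class="author-quotes"> </div>
<div class="author-quotes">“Second sample quote.” – Author Two</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find_all("div", class_="author-quotes")

quotes = []
for div in divs:
    text = div.get_text().strip()
    if text:  # an empty or whitespace-only div yields "", which is falsy
        quotes.append(text)

print(len(divs), "divs found,", len(quotes), "quotes kept")
```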
Next, we can fix quotes[107] by just taking the first line of text.
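Taking the first line can be sketched with plain string handling. This assumes the extra text starts on a new line after the quote; first_line is a hypothetical helper name, and it returns the first non-empty line so that leading blank lines don't get in the way:

```python
def first_line(text):
    """Return the first non-empty line of text, stripped of whitespace."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""  # nothing but whitespace


# Placeholder text shaped like a quote followed by unrelated extra text.
raw = "\n“Last sample quote.” – Author Three\nSome extra text after the quote\n"
print(first_line(raw))
```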
Finally, we can clean up the text content before storing it, i.e. splitting the string into quote and author, and removing all quotation marks & dashes.
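Here's one possible sketch of that cleanup using a regular expression. It assumes every line looks like “quote text” – Author Name (quotation marks around the quote, then a dash, then the author); parse_quote and LINE_PATTERN are hypothetical names, and the function raises instead of guessing when a line doesn't fit that shape:

```python
import re

# Assumed line shape:  “quote text” – Author Name
# The quote group is greedy, so it runs to the LAST closing quotation mark,
# which keeps dashes inside the quote or the author's name intact.
LINE_PATTERN = re.compile(
    r'^\s*[“"](?P<quote>.+)[”"]\s*[-–—]+\s*(?P<author>.+?)\s*$'
)


def parse_quote(line):
    """Split one scraped line into a dict with 'quote' and 'author' keys."""
    match = LINE_PATTERN.match(line)
    if match is None:
        # Fail loudly so a wrong assumption about the format is noticed.
        raise ValueError(f"unexpected quote format: {line!r}")
    return {
        "quote": match.group("quote").strip(),
        "author": match.group("author"),
    }


print(parse_quote("“First sample quote.” – Author One"))
```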
View the script output file at: https://gist.github.com/nickwu241/8b3227005d80cdfb6ff2388ccb6faf77
5. Use the data
Now we can use the data for our applications, analysis, etc.
For me, I created a simple random quote generator using Flask and deployed it on Heroku:
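The Flask app itself isn't reproduced here, but the core of a random quote generator can be sketched with the standard library alone. The QUOTES list holds placeholder entries in the quote/author shape described in step 4, and random_quote is a hypothetical helper; a web framework would only need to return its result from a route:

```python
import random

# Placeholder data in the cleaned-up shape: one dict per quote.
QUOTES = [
    {"quote": "First sample quote.", "author": "Author One"},
    {"quote": "Second sample quote.", "author": "Author Two"},
    {"quote": "Third sample quote.", "author": "Author Three"},
]


def random_quote(quotes, rng=random):
    """Pick one entry at random and format it for display."""
    entry = rng.choice(quotes)
    return f"“{entry['quote']}” – {entry['author']}"


print(random_quote(QUOTES))
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which is handy while testing.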
BONUS TIPS FOR PROTOTYPING
Most of the time will be spent inspecting the HTML and tweaking your program, so:
- get familiar with the BeautifulSoup interface; the docs and Stack Overflow are good resources: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- download the URLs locally to be nice to the server and quicken your iteration cycle
- start simple and have the program crash instead of failing silently when your assumptions are incorrect
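The last two tips can be sketched together: cache the page in a local file so repeated runs don't hit the server, and raise as soon as the result doesn't match expectations. The names load_html and check_quote_count and the cache file name are hypothetical choices for illustration:

```python
from pathlib import Path

import requests


def load_html(url, cache_file):
    """Return the page HTML, downloading it only if no local copy exists."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # don't cache an error page
    cache_file.write_text(response.text, encoding="utf-8")
    return response.text


def check_quote_count(quotes, expected):
    """Crash loudly when the scrape doesn't match what the page promises."""
    if len(quotes) != expected:
        raise ValueError(f"expected {expected} quotes, got {len(quotes)}")
```

For the example in this post, that could look like html = load_html("http://www.keepinspiring.me/famous-quotes/", "famous-quotes.html") followed by check_quote_count(quotes, 108) once the quotes are extracted. Delete the cache file whenever you want a fresh copy.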
Closing
And that’s it for this part, to keep the post short. Thank you for reading! 😄 Give this post a 👏 or two if you enjoyed it and let me know what you think, what type of data you’re interested in scraping, and what other types of blog posts you’re interested in reading next! See you next time! 😉