Web Scraping with Python — Part 1 (requests, BeautifulSoup)

3 min read · Aug 8, 2018

In Part 1, I’ll show my steps for scraping a website with an example. For instance, I want to create a random quote generator, so first I need a list of quotes — web scraping to the rescue! 🙌

It’ll be helpful to have a basic knowledge of HTML and Python to follow along. My example uses Python 3 and requires the requests and bs4 packages (pip install requests beautifulsoup4).

1. Find a website you’re interested in scraping

I googled “famous quotes” and the first result was http://www.keepinspiring.me/famous-quotes/.

2. Inspect the HTML to find patterns for extraction

After navigating to the website, I right-clicked the section I was interested in and clicked “Inspect” to show the HTML structure.

Then I usually open up a few more of the HTML elements, and skim through the whole document to get a sense of the structure.

3. Use Python to extract from the HTML document

We’ll use BeautifulSoup to parse the HTML and select elements we’re interested in and just print the output for a start:
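The embedded script isn't reproduced here, so what follows is a minimal sketch of this step rather than the exact code. It assumes each quote sits in a div with the class author-quotes. That class name, the SAMPLE_HTML snippet, and the helper names fetch_html and extract_quotes are all assumptions for illustration, so check them against what you see in the inspector.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://www.keepinspiring.me/famous-quotes/"
# Assumption: each quote lives in <div class="author-quotes">.
QUOTE_CLASS = "author-quotes"

# Tiny stand-in for the real page so the parsing can be tried offline.
SAMPLE_HTML = """
<div class="author-quotes">“Quote one.” – Author One</div>
<div class="author-quotes"></div>
<div class="author-quotes">“Quote two.” – Author Two
Some extra text after the quote</div>
"""


def fetch_html(url):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_quotes(html):
    """Return the text of every div that has the assumed quote class."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text() for div in soup.find_all("div", class_=QUOTE_CLASS)]


# Swap SAMPLE_HTML for fetch_html(URL) to run against the live page.
quotes = extract_quotes(SAMPLE_HTML)
for i, quote in enumerate(quotes):
    print(i, repr(quote))
```

Printing with repr makes empty strings and stray newlines easy to spot, which helps when the output doesn't look the way you expected.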

After running the program (view the script output at https://gist.github.com/nickwu241/3d9ec7eb46608d4b88164f8e0f9e7ede), we get 108 quotes, as expected, since the post title is “108 famous quotes on life, love, and success”. But some of the output is unexpected: quotes[64] is empty and quotes[107] has extra text after the quote.

4. Refine the Python script

We can refine our Python script by inspecting the page and correcting our assumptions about the HTML structure until we’re happy with the results!

In this example, let’s fix the empty quotes[64] first:

Inspecting the element after “Life isn’t about finding yourself. Life is about creating yourself.” shows just an empty div, probably a mistake on keepinspiring.me. A fix is to ignore empty divs.
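One way to skip the empty divs is to filter on the stripped text. This is a sketch on a made-up snippet; the author-quotes class name and the sample HTML are assumptions, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one empty and one whitespace-only div.
SAMPLE_HTML = """
<div class="author-quotes">“Quote one.” – Author One</div>
<div class="author-quotes"></div>
<div class="author-quotes">   </div>
<div class="author-quotes">“Quote two.” – Author Two</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find_all("div", class_="author-quotes")

# Keep only divs that still contain text after stripping whitespace.
quotes = [div.get_text().strip() for div in divs if div.get_text().strip()]
print(len(divs), "divs,", len(quotes), "non-empty quotes")
```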

Next, we can fix quotes[107] by just taking the first line of text.
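Taking the first line needs no BeautifulSoup at all. Here's a small sketch; the helper name first_line and the sample string are made up, and it returns the first non-empty line so a leading blank line doesn't trip it up.

```python
def first_line(text):
    """Return the first non-empty line of text, stripped of whitespace."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Hypothetical raw text: a quote followed by unrelated trailing text.
raw = "“Quote two.” – Author Two\nSome extra text after the quote"
print(first_line(raw))
```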

Finally, we can clean up the text content before storing it, i.e. split the string into quote and author and remove the quotation marks and dashes.
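The cleanup could look something like the sketch below. It assumes every entry has the shape “quote” – author and splits at the last closing quotation mark, so hyphens inside the quote or the author's name survive. The function name split_quote and the sample strings are hypothetical, and the real page may need other rules.

```python
# Dash characters that may separate quote and author: hyphen, en dash,
# em dash and horizontal bar (written as escapes to keep them visible).
DASHES = "-\u2013\u2014\u2015"


def split_quote(text):
    """Split '“quote” – author' into a (quote, author) tuple.

    Assumption: the author follows the last closing quotation mark.
    """
    cut = max(text.rfind("”"), text.rfind('"'))
    if cut == -1:
        # Crash instead of silently storing a malformed entry.
        raise ValueError(f"no closing quotation mark in {text!r}")
    quote = text[:cut].strip("“”\" ")
    author = text[cut + 1:].strip(DASHES + " \u00a0")
    return quote, author


print(split_quote("“Quote one.” – Author One"))
```

Raising on entries without a closing quotation mark means a wrong assumption about the format shows up right away.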

View the script output file at: https://gist.github.com/nickwu241/8b3227005d80cdfb6ff2388ccb6faf77

5. Use the data

Now we can use the data for our applications, analysis, etc.
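As a tiny illustration of using the data, here is a sketch that picks a random quote from a list of cleaned (quote, author) pairs. The QUOTES list and the random_quote helper are placeholders, not part of any real app.

```python
import random

# Hypothetical sample of cleaned (quote, author) pairs.
QUOTES = [
    ("Quote one.", "Author One"),
    ("Quote two.", "Author Two"),
    ("Quote three.", "Author Three"),
]


def random_quote(quotes, rng=random):
    """Pick one (quote, author) pair at random and return it as a dict."""
    quote, author = rng.choice(quotes)
    return {"quote": quote, "author": author}


print(random_quote(QUOTES))
```

Returning a dict keeps the result easy to serialize as JSON if you later put it behind a web endpoint.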

For me, I created a simple random quote generator using Flask and deployed it on Heroku:

Goody Quotes (goody-quotes.herokuapp.com): Web app to discover random quotes and provide API endpoints for many quotes.

nickwu241/quotes (github.com): Flask application to serve quotes, deployed on Heroku using containers.

BONUS TIPS FOR PROTOTYPING

Most of the time will go into inspecting the HTML and tweaking your program, so:

get familiar with the BeautifulSoup interface; the docs (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and Stack Overflow are good resources
download the pages locally to be nice to the server and speed up your iteration cycle
start simple and have the program crash instead of failing silently when your assumptions are incorrect
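The last two tips can be sketched in a few lines. The helper names cached_html and check_quote_count and the cache file name are made up for illustration.

```python
from pathlib import Path

import requests


def cached_html(url, cache_path):
    """Return the page HTML, downloading only if no local copy exists."""
    path = Path(cache_path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    response = requests.get(url, timeout=10)
    # Crash on HTTP errors instead of caching and parsing an error page.
    response.raise_for_status()
    path.write_text(response.text, encoding="utf-8")
    return response.text


def check_quote_count(quotes, expected):
    """Fail loudly when the scrape does not match what the page promises."""
    if len(quotes) != expected:
        raise ValueError(f"expected {expected} quotes, got {len(quotes)}")
    return quotes
```

For this post you might call cached_html("http://www.keepinspiring.me/famous-quotes/", "famous-quotes.html") once and iterate on the local copy, then pass the extracted list through check_quote_count(quotes, 108), since the post title promises 108 quotes.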

Closing

That’s it for this part, to keep the post short. Thank you for reading! 😄 Give this post a 👏 or two if you enjoyed it, and let me know what you think, what type of data you’re interested in scraping, and what other blog posts you’d like to read next. See you next time! 😉
