Thanks to: https://blog.hartleybrody.com/web-scraping/
If you’ve ever needed to pull data from a third party website, chances are you started by checking to see if they had an official API. But did you know that there’s a source of structured data that virtually every website on the internet supports automatically, by default?
That’s right, we’re talking about pulling our data straight out of HTML — otherwise known as web scraping. Here’s why web scraping is awesome:
Any content that can be viewed on a webpage can be scraped. Period.
If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this article, I’ll show you how.
Over the past few years, I’ve scraped dozens of websites — from music blogs and fashion retailers to the USPTO and undocumented JSON endpoints I found by inspecting network traffic in my browser.
There are some tricks that site owners will use to thwart this type of access — which we’ll dive into later — but they almost all have simple work-arounds.
Why You Should Scrape
But first we’ll start with some great reasons why you should consider web scraping first, before you start looking for APIs or RSS feeds or other, more traditional forms of structured data.
Websites are More Important Than APIs
The biggest one is that site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds.
We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning.
Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices.
Whereas if the website goes down or is having issues, that’s a more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and gets dealt with quickly.
Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites.
Aside from the occasional captchas on sign up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues.
Unless you’re making concurrent requests, you probably won’t be viewed as a DDOS attack, you’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.
There are also fewer ways for the website’s administrators to track your behavior, which can be useful if you want gather data more privately.
With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.
The Data’s Already in Your Face
Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to open up an API or even contact anyone at the organization. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns — which we’ll talk about next.
Let’s Get to Scraping
So you’ve decided you want to dive in and start grabbing data like a true hacker. Awesome.
Just like reading API docs, it takes a bit of work up front to figure out how the data is structured and how you can access it. Unlike APIs however, there’s really no documentation so you have to be a little clever about it.
I’ll share some of the tips I’ve learned along the way.
Fetching the Data
So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.
If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.
The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like
q= that always changes based on you search term.
Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning
? to start the query string and a
& between each key/value pair.
Dealing with Pagination
At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination issue keeping you from seeing all of it at once. Most regular APIs do this as well, to keep single requests from slamming the database.
Usually, clicking to page 2 adds some sort of
offset= parameter to the URL, which is usually either the page number or else the number of items displayed on the page. Try changing this to some really high number and see what response you get when you “fall off the end” of the data.
With this information, you can now iterate over every page of results, incrementing the
offset parameter as necessary, until you hit that “end of data” condition.
The other thing you can try doing is changing the “Display X Per Page” which most pagination UIs now have. Again, look for a new GET parameter to be appended to the URL which indicates how many items are on the page.
Try setting this to some arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot since it can cut down on the number of pages you must paginate through to get all the data you need.
AJAX Isn’t That Bad!
Sometimes people see web pages with URL fragments
# and AJAX content loading and think a site can’t be scraped. On the contrary! If a site is using AJAX to load the data, that probably makes it even easier to pull the information you need.
The AJAX response is probably coming back in some nicely-structured way (probably JSON!) in order to be rendered on the page with Javscript.
All you have to do is pull up the network tab in Web Inspector or Firebug and look through the XHR requests for the ones that seem to be pulling in your data.
Once you find it, you can leave the crufty HTML behind and focus instead on this endpoint, which is essentially an undocumented API.
Now that you’ve figured out how to get the data you need from the server, the somewhat tricky part is getting the data you need out of the page’s markup.
Use CSS Hooks
In my experience, this is usually straightforward since most web designers litter the markup with tons of
ids to provide hooks for their CSS.
You can piggyback on these to jump to the parts of the markup that contain the data you need.
Just right click on a section of information you need and pull up the Web Inspector or Firebug to look at it. Zoom up and down through the DOM tree until you find the outermost
<div> around the item you want.
<div> should be the outer wrapper around a single item you want access to. It probably has some
class attribute which you can use to easily pull out all of the other wrapper elements on the page. You can then iterate over these just as you would iterate over the items returned by an API response.
Once you find the right node in the DOM tree, you should always view the source of the page (“right click” > “View Source”) to make sure the elements you need are actually showing up in the raw HTML.
This issue has caused me a number of head-scratchers.
Get a Good HTML Parsing Library
It is probably a horrible idea to try parsing the HTML of the page as a long string (although there are times I’ve needed to fall back on that). Spend some time doing research for a good HTML parsing library in your language of choice.
Most of the code I write is in Python, and I love BeautifulSoup for its error handling and super-simple API. I also love its motto:
You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help.
You’re going to have a bad time if you try to use an XML parser since most websites out there don’t actually validate as properly formed XML (sorry XHTML!) and will give you a ton of errors.
A good library will read in the HTML that you pull in using some HTTP library (hat tip to the Requests library if you’re writing Python) and turn it into an object that you can traverse and iterate over to your heart’s content, similar to a JSON object.
Some Traps To Know About
For two-thirds of the website I’ve scraped, the above steps are all you need. Just fire off a request to your “endpoint” and parse the returned data.
But sometimes, you’ll find that the response you get when scraping isn’t what you saw when you visited the site yourself.
When In Doubt, Spoof Headers
Some websites require that your User Agent string is set to something they allow, or you need to set certain cookies or other headers in order to get a proper response.
Depending on the HTTP library you’re using to make requests, this is usually pretty straightforward. I just browse the site in my web browser and then grab all of the headers that my browser is automatically sending. Then I put those in a dictionary and send them along with my request.
Note that this might mean grabbing some login or other session cookie, which might identify you and make your scraping less anonymous. It’s up to you how serious of a risk that is.
Content Behind A Login
Sometimes you might need to create an account and login to access the information you need. If you have a good HTTP library that handles logins and automatically sending session cookies (did I mention how awesome Requestsis?), then you just need your scraper login before it gets to work.
Note that this obviously makes you totally non-anonymous to the third party website so all of your scraping behavior is probably pretty easy to trace back to you if anyone on their side cared to look.
I’ve never actually run into this issue myself, although I did have to plan for it one time. I was using a web service that had a strict rate limit that I knew I’d exceed fairly quickly.
This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.
Depending on your application, this could work for you.
Poorly Formed Markup
Sadly, this is the one condition that there really is no cure for. If the markup doesn’t come close to validating, then the site is not only keeping you out, but also serving a degraded browsing experience to all of their visitors.
It’s worth digging into your HTML parsing library to see if there’s any setting for error tolerance. Sometimes this can help.
If not, you can always try falling back on treating the entire HTML document as a long string and do all of your parsing as string splitting or — God forbid — a giant regex.
Well there’s 2000 words to get you started on web scraping. Hopefully I’ve convinced you that it’s actually a legitimate way of collecting data.
It’s a real hacker challenge to read through some HTML soup and look for patterns and structure in the markup in order to pull out the data you need. It usually doesn’t take much longer than reading some API docs and getting up to speed with a client. Plus it’s way more fun!