You will definitely want to start with a good web scraping framework. Later on you may decide that they are too limiting and you can put together your own stack of libraries but without a lot of scraping experience your design will be much worse than pjscrape or scrapy.
Note: I use the terms crawling and scraping basically interchangeable here. This is a copy of my answer to your Quora question, it’s pretty long.
Get very familiar with either Firebug or Chrome dev tools depending on your preferred browser. This will be absolutely necessary as you browse the site you are pulling data from and map out which urls contain the data you are looking for and what data formats make up the responses.
You will need a good working knowledge of HTTP as well as HTML and will probably want to find a decent piece of man in the middle proxy software. You will need to be able to inspect HTTP requests and responses and understand how the cookies and session information and query parameters are being passed around. Fiddler (http://www.telerik.com/fiddler) and Charles Proxy (http://www.charlesproxy.com/) are popular tools. I use mitmproxy (http://mitmproxy.org/) a lot as I’m more of a keyboard guy than a mouse guy.
Some kind of console/shell/REPL type environment where you can try out various pieces of code with instant feedback will be invaluable. Reverse engineering tasks like this are a lot of trial and error so you will want a workflow that makes this easy.
A note on Regular expressions: DO NOT USE REGULAR EXPRESSIONS TO PARSE HTML. A lot of beginners do this because they are already familiar with regexes. It’s a huge mistake, use xpath or css selectors to navigate html and only use regular expressions to extract data from actual text inside an html node. This might already be obvious to you, it becomes obvious quickly if you try it but a lot of people waste a lot of time going down this road for some reason. Don’t be scared of xpath or css selectors, they are WAY easier to learn than regexes and they were designed to solve this exact problem.
There are two approaches to this problem:
The low level approach:
The embedded browser approach:
Which approach is best?
The embedded browser is also much slower at scale because of the rendering overhead but that will almost certainly not matter unless you are scraping a lot of different domains. Your need to rate limit your requests will make the rendering time completely negligible in the case of a single domain.
Rate Limiting/Bot behaviour
You need to be very aware of this. You need to make requests to your target domains at a reasonable rate. You need to write a well behaved bot when crawling websites, and that means respecting robots.txt and not hammering the server with requests. Mistakes or negligence here is very unethical since this can be considered a denial of service attack. The acceptable rate varies depending on who you ask, 1req/s is the max that the Google crawler runs at but you are not Google and you probably aren’t as welcome as Google. Keep it as slow as reasonable. I would suggest 2-5 seconds between each page request.
Identify your requests with a user agent string that identifies your bot and have a webpage for your bot explaining it’s purpose. This url goes in the agent string.
You will be easy to block if the site wants to block you. A smart engineer on their end can easily identify bots and a few minutes of work on their end can cause weeks of work changing your scraping code on your end or just make it impossible. If the relationship is antagonistic then a smart engineer at the target site can completely stymie a genius engineer writing a crawler. Scraping code is inherently fragile and this is easily exploited. Something that would provoke this response is almost certainly unethical anyway, so write a well behaved bot and don’t worry about this.
Not a unit/integration test person? Too bad. You will now have to become one. Sites change frequently and you will be changing your code frequently. This is a large part of the challenge.
There are a lot of moving parts involved in scraping a modern website, good test practices will help a lot. Many of the bugs you will encounter while writing this type of code will be the type that just return corrupted data silently. Without good tests to check for regressions you will find out that you’ve been saving useless corrupted data to your database for a while without noticing. This project will make you very familiar with data validation (find some good libraries to use) and testing. There are not many other problems that combine requiring comprehensive tests and being very difficult to test.
The second part of your tests involve caching and change detection. While writing your code you don’t want to be hammering the server for the same page over and over again for no reason. While running your unit tests you want to know if your tests are failing because you broke your code or because the website has been redesigned. Run your unit tests against a cached copy of the urls involved. A caching proxy is very useful here but tricky to configure and use properly.
You also do want to know if the site has changed. If they redesigned the site and your crawler is broken your unit tests will still pass because they are running against a cached copy! You will need either another, smaller set of integration tests that are run infrequently against the live site or good logging and error detection in your crawling code that logs the exact issues, alerts you to the problem and stops crawling. Now you can update your cache, run your unit tests and see what you need to change.
The law here can be slightly dangerous if you do stupid things. If the law gets involved you are dealing with people who regularly refer to wget and curl as “hacking tools”. You don’t want this.
The ethical reality of the situation is that there is no difference between using browser software to request a url and look at some data and using your own software to request a url and look at some data. Google is the largest scraping company in the world and they are loved for it. Identifying your bots name in the user agent and being open about the goals and intentions of your web crawler will help here as the law understands what Google is. If you are doing anything shady, like creating fake user accounts or accessing areas of the site that you shouldn’t (either “blocked” by robots.txt or because of some kind of authorization exploit) then be aware that you are doing something unethical and the law’s ignorance of technology will be extraordinarily dangerous here. It’s a ridiculous situation but it’s a real one.
It’s literally possible to try and build a new search engine on the up and up as an upstanding citizen, make a mistake or have a bug in your software and be seen as a hacker. Not something you want considering the current political reality.
Who am I to write this giant wall of text anyway?
I’ve written a lot of web crawling related code in my life. I’ve been doing web related software development for more than a decade as a consultant, employee and startup founder. The early days were writing perl crawlers/scrapers and php websites. When we were embedding hidden iframes loading csv data into webpages to do ajax before Jesse James Garrett named it ajax, before XMLHTTPRequest was an idea. Before jQuery, before json. I’m in my mid-30’s, that’s apparently considered ancient for this business.
I’ve written successful anti-crawling software systems as well. It’s remarkably easy to write nigh-unscrapable sites if you want to or to identify and sabotage bots you don’t like.
I like writing crawlers, scrapers and parsers more than any other type of software. It’s challenging, fun and can be used to create amazing things.
Our related scraping jobs: