Web scraping tactics and considerations

2024-04-18 Thijs

ReCAPTCHA: Select all images with a fire hydrant.

Somehow, recent hobby projects have often involved collecting and parsing data from websites automatically, an activity known as web scraping. Since most website owners value the information they have, you will often encounter measures to prevent this programmatic extraction of information. One of the most well-known is the (Re)CAPTCHA, which has had me point out more fire hydrants than I would have liked. Lengthy discussions can be had about the ethical, moral and legal implications of the practice. All in all, I believe the exchange of information is the very purpose of the internet, and a little bit of automation can be expected as part of that process. Nonetheless, I also acknowledge that a carefully constructed data set enclosed in a website can be an important asset worth protecting.

Because circumventing web scraper protection measures is an interesting challenge in itself, in this post I will share some basic techniques I have learned.

Realistic user agents

The User-Agent header communicates the general set-up of your browser and operating system to the server. Because a programmatic retrieval of a web page usually sends no user agent at all, or one that clearly does not describe a real web browser, some websites block requests based on it. The solution is, of course, to configure a realistic user agent for your web scraper. Someone published a list of 1000 options on GitHub to get started.
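
As a minimal sketch, assuming the Python requests library and a placeholder target URL, configuring a realistic user agent could look like this:

import requests

# A user agent string copied from a real desktop browser (example value;
# pick a current one from such a list instead).
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    )
}

# https://example.com is a placeholder for the site you are scraping.
response = requests.get("https://example.com", headers=HEADERS, timeout=10)
print(response.status_code)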

Throttle yourself

If you need to fetch 50 pages from a website, you obviously would like to have them quickly. However, requesting every page at the same time might not be the best idea if you don't want to be blocked. Many web servers have methods to deny service to IP addresses that send too many requests in a short period of time; there is even an HTTP status code for it (429 Too Many Requests). The solution is to take it a bit slower: fetching 50 pages at a pace of 0.5 pages per second still takes under two minutes. In fact, not bombarding a website with requests is just the nice thing to do.
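
A throttled fetch loop along these lines, sketched with the Python requests library and placeholder URLs, could look like this:

import time

import requests

# Placeholder URLs; substitute the pages you actually want to fetch.
urls = [f"https://example.com/page/{i}" for i in range(1, 51)]

for url in urls:
    response = requests.get(url, timeout=10)
    if response.status_code == 429:
        # The server asked us to slow down; back off and try once more.
        time.sleep(30)
        response = requests.get(url, timeout=10)
    # ... parse response.text here ...
    time.sleep(2)  # roughly 0.5 pages per second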

A friendly robot browsing the internet, generated by DALL-E 3.

Proxies

Here it gets a bit messy. If the methods above do not completely prevent you from being blocked, you may consider switching to an entirely new internet connection to continue trying. Anonymous proxies are a way to do this. Once you reach this point, however, you are walking on thin ice: if the basic methods were not enough, the website you are trying to scrape really does not want you to do this, and that should make you think. Nevertheless, there are more and more proxy services nowadays that provide you with a large range of IP addresses to make use of. Just make sure you don't become complicit in running a botnet!
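
For completeness, routing traffic through a proxy is straightforward with the Python requests library; the sketch below uses a placeholder proxy address, which a proxy service would provide in practice:

import requests

# Placeholder address and credentials from a (reputable) proxy service.
proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)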