5 Guidelines for Web Scraping
I've done my fair share of web scraping to gather sports data, along with some automated testing and validation. I also happen to sit next to the team that manages application security and bot detection at work. Here are a few guidelines for scraping responsibly, avoiding the dreaded block/sinkhole, and being a good netizen. I've generally used Python, so I name the Python libraries I use, but it's not hard to find the equivalent in your language of choice.
1. Set your headers correctly.
Headers are the easiest way to detect a bot. If you don't change them, you're literally telling the web server that you're using headless Chrome or Python Requests. A large volume of traffic from a single IP with suspect-looking headers is a quick route to the block list, or at least to a partial block or request sinkhole.
My strategy has always been to grab the headers my own browser sends when hitting the site. In Chrome, you can get that information by viewing your requests under the Network tab in DevTools. Right-click on any request in that window and choose Copy as cURL, which captures all the relevant request context: the URL, the headers, and any payload if it's a PUT or POST. I paste that cURL command directly into my IDE and copy the headers over into a Python dictionary that can be passed to both Selenium and Requests.
Sometimes, if you're traversing multiple pages, you may need to change the Referer header so that it matches the flow that would occur naturally from browser traffic. You can find the right value with the same method; most of the time, though, it will just be the previous page's URL.
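As a rough sketch of what that ends up looking like with Requests (the URL and header values below are placeholders; substitute whatever you copied from your own browser session):

```python
import requests

# Placeholder values -- copy the real ones from Chrome's "Copy as cURL" output.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Match the flow a real browser would take to reach this page.
    "Referer": "https://example.com/schedule",
}

response = requests.get("https://example.com/schedule/scores", headers=headers)
response.raise_for_status()
html = response.text
```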
2. Know when to use requests/urllib or headless Chrome
The biggest deciding factor here is whether any of the information you need is dynamically populated by JavaScript. Requests and urllib3 don't run JavaScript; they only return the server's initial response. If a page is rendered server side and sent as HTML, this isn't a concern, and you can use the lighter and easier Requests or urllib3 (or the equivalent libraries in your language of choice). However, if the information you need isn't in that response, you probably need to switch over to headless Chrome and Selenium WebDriver.
Be aware, though, that headless Chrome is way heavier than a simple web client, and you need to remember to close it. Otherwise you're going to have a ton of Chrome windows open, hidden or not.
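Here's a minimal sketch of both paths against a made-up URL; note the try/finally so the browser always gets closed:

```python
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

URL = "https://example.com/scores"  # hypothetical page

# Case 1: the data is rendered server side -- a plain HTTP client is enough.
# (Pass the headers dictionary you built in guideline 1.)
html = requests.get(URL).text

# Case 2: the data is filled in by JavaScript -- fall back to headless Chrome,
# and make sure the browser is closed even if something throws.
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)
    html = driver.page_source
finally:
    driver.quit()  # otherwise Chrome processes pile up
```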
3. If there's an API, use it.
This one definitely sounds obvious, especially if a site has documented APIs. It's less obvious when the APIs aren't documented, but a lot of the time the API you need exists and you can find it pretty easily.
You can use the same Network tab of DevTools from guideline 1 to find which endpoint on the vendor's side is populating the data you need. I used this approach to get real-time betting lines from Bovada; after some trial and error, I had an API call that worked perfectly.
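Once you've spotted the endpoint in the Network tab, calling it directly is usually simpler than scraping the page. The endpoint and query parameter below are made up for illustration; use whatever you actually see in the recorded request:

```python
import requests

# Hypothetical endpoint -- look for XHR/fetch requests in the Network tab
# that return JSON and carry the data you're after.
API_URL = "https://example.com/services/sports/events"

headers = {
    "User-Agent": "Mozilla/5.0 ...",  # copied from your browser, per guideline 1
    "Accept": "application/json",
}

resp = requests.get(API_URL, headers=headers, params={"league": "nfl"}, timeout=10)
resp.raise_for_status()
data = resp.json()  # structured data, no HTML parsing required
```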
Some sites obfuscate their undocumented APIs. That's for good reason, but it makes collecting their data a lot more difficult. There's probably a way around it, but I haven't found a good use case for running that down yet.
4. Be aware of legal and ethical concerns
This one is less technical and more judgement-based. There are a lot of insecure and badly built websites out there. Just because a web server will give you information does not mean you're authorized to have it. If a page contains personally identifiable information, or anything you suspect the site wouldn't want you to have, your best bet is not to scrape it. A good rule of thumb: if you can't get to the page by clicking through links in your browser, it's probably not a great idea to scrape it.
I agree this isn't how it should be. Sites should have effective role-based access control, and any time a server sends you a 200 response with data, it should be safe to assume you're allowed to access it. But chances are, you won't have me on your jury if you end up in court.
Another thing to keep in mind is how big the site is. A lot of smaller sites pay for hosting and bandwidth on every request they serve, so crawling a large chunk of their pages costs them real money. I'm not saying never scrape smaller sites, but again, use your best judgement.
5. Use Selenium IDE to simplify building workflows in WebDriver
If you need Selenium, I strongly recommend using Selenium IDE. It's a browser extension that records your actions and lets you export them as code in your preferred language. It's as simple as it sounds. The biggest win I've gotten from Selenium IDE is when I don't know how to locate an element: it will find a way to select that element for you, even when it's named something weird or obscure in the DOM. Big shout out to my friend AJ for showing me this.
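For a sense of what you end up with, here's roughly the kind of locator-based WebDriver code a recorded flow produces once you tidy it up; the URL and selectors are invented for illustration:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/login")
    driver.find_element(By.ID, "username").send_keys("me")
    driver.find_element(By.ID, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    # An awkwardly named element the recorder still found a selector for:
    score = driver.find_element(By.XPATH, "//div[@data-qa='score-widget']").text
finally:
    driver.quit()
```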
Those are my basic guidelines. Now, go out there and have fun getting some data.