Web scraping can be difficult because many websites try to block developers from scraping them. They do this by checking IP addresses, inspecting HTTP request headers, serving CAPTCHAs, requiring JavaScript execution, and more. In response, web scrapers can be made very hard to detect, because the signals these checks rely on can all be imitated by a carefully configured scraper. Here are some tips for scraping a website without getting blocked:
1. Real User Agent
Some websites examine the User-Agent header to determine which browser you're using, and may block requests that don't appear to come from one of the major browsers.
A scraper that doesn't configure this header is easy to identify, because HTTP libraries announce themselves with defaults like "python-requests". Don't be one of those developers: look up a current, popular User Agent string and set it on your web crawler.
It's important to keep your crawler's user agents up to date. New versions of Safari, Firefox, Google Chrome, etc. change their user agent strings, so a crawler that never updates its user agent looks more suspicious over time. The Googlebot User Agent is an option for advanced users, since many websites want to be listed on Google and therefore let Googlebot through.
It's also useful to rotate through several different user agents so the site doesn't see a sudden spike in requests all carrying the same user-agent string (another pattern that is easy to spot).
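As a rough sketch of what this looks like in practice, here is one way to rotate user agents with Python's requests library (the User-Agent strings and the fetch helper are illustrative assumptions, not part of any particular service):

import random
import requests

# Example pool of desktop User-Agent strings -- replace these with current,
# verified strings for real use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Pick a different User-Agent for each request so that no single string
    # accounts for a suspicious share of your traffic.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text

html = fetch("https://example.com/")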
2. IP Rotation
The easiest way for a website to detect a web crawler is to look at its IP address: a flood of requests from one address is a giveaway. That's why most web scrapers spread their requests over many IP addresses, so no single address gets banned. To avoid sending everything through the same IP address, you can use an IP rotation service (such as BrowserCloud or another proxy service) to route requests through a pool of different addresses. This lets you scrape most websites without trouble.
For sites that block aggressively, residential or mobile proxies are the hardest to detect. If you're not familiar with these terms, you can find out more in our article about the different types of proxy servers. These are IP addresses that internet providers assign to ordinary subscribers, which makes them very difficult to block: a site can't ban them without also banning real visitors.
A regular user only has one such address, so a pool of, say, 1 million residential or mobile IPs lets your scraper blend in as if it were 1 million regular internet users, without arousing suspicion.
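Here is a minimal sketch of proxy rotation with Python's requests library; the proxy URLs are placeholders, and a rotation service such as BrowserCloud will typically hand you a single gateway endpoint that swaps the outgoing IP for you:

import random
import requests

# Placeholder proxy endpoints -- substitute whatever your proxy provider gives you.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch_via_proxy(url):
    # Route each request through a randomly chosen proxy so the target site
    # never sees all of your traffic coming from one IP address.
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

response = fetch_via_proxy("https://example.com/")
print(response.status_code)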
3. Custom Headers
Many websites block web scrapers by inspecting the full set of request headers, not just the User-Agent. One way past this check is to visit https://httpbin.org/anything in your own browser and copy the headers it sends.
Sending those headers makes your scraper look like a real browser, which is exactly what you want. Headers such as "Upgrade-Insecure-Requests", "Accept-Language", and "Accept" are what detection software inspects to decide whether a request comes from a real browser.
The most common Google Chrome headers:
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip",
"Accept-Language": "en-US,en;q=0.9,es;q=0.8",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
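Putting those together in Python's requests library looks roughly like this (the header values are the ones listed above, copied from a real Chrome session; they drift out of date, so refresh them from your own browser via https://httpbin.org/anything):

import requests

# Headers copied from a real Chrome session (see the list above). Keep them
# consistent with the User-Agent you send.
CHROME_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip",
    "Accept-Language": "en-US,en;q=0.9,es;q=0.8",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36",
}

response = requests.get("https://example.com/", headers=CHROME_HEADERS, timeout=30)
print(response.request.headers)  # check exactly what was sent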
4. 'Referer' header
When sending an HTTP request, you can set the Referer header to make it look as though you arrived at the page from another site, such as a search engine. The header is optional, but including it makes your traffic look more like a normal visit.
"Referer": "https://www.google.com/"
Many popular websites get most of their traffic from a handful of referral sources; using a tool like SimilarWeb (https://www.similarweb.com), you can easily look up a site's most common referrers and use one of those in your header. For sites in other countries, such as the UK, you'd use "https://www.google.co.uk/" instead of "https://www.google.com/".
This makes your request look more authentic, because it appears to be traffic from a source the site's webmaster would expect to send visitors, even when no such visit actually happened.
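As a small, hedged illustration (the referrer mapping below is made up; look up real referral sources on SimilarWeb for the sites you actually target):

import requests

# Hypothetical mapping of target domains to plausible referral sources.
REFERERS = {
    "example.com": "https://www.google.com/",
    "example.co.uk": "https://www.google.co.uk/",
}

def fetch_with_referer(url, domain):
    headers = {"Referer": REFERERS.get(domain, "https://www.google.com/")}
    return requests.get(url, headers=headers, timeout=30)

response = fetch_with_referer("https://example.co.uk/some-page", "example.co.uk")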
5. Randomize delay between requests
A scraping program that sends requests around the clock at a steady rate is easy to spot: no real person browses a website 24 hours a day, and no real person produces such a regular pattern.
Random delays between 2 and 10 seconds are a great way to build a web scraper that won’t get blocked. Additionally, be courteous when scraping websites.
If you notice the site's responses slowing down, back off and reduce the load you're putting on it. Also check the website's robots.txt file at http://www.example.com/robots.txt or http://example.com/robots.txt for a crawl-delay line; if it's there, it tells you how many seconds to wait between requests so you don't disrupt the server's performance.
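Both ideas fit in a few lines of Python; urllib.robotparser reads the Crawl-delay directive for you, and random.uniform spreads the remaining requests out (the URLs are placeholders):

import random
import time
import urllib.robotparser
import requests

# Read the site's robots.txt and honour Crawl-delay if it is set.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()
crawl_delay = robots.crawl_delay("*")  # None if the site doesn't specify one

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    response = requests.get(url, timeout=30)
    # Wait either the advertised crawl delay or a random 2-10 seconds.
    time.sleep(crawl_delay if crawl_delay else random.uniform(2, 10))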
6. Use a Headless Browser
Many websites use tricks to decide whether a request is legitimate: checking browser cookies, JavaScript execution, loaded fonts, and other signals that a plain HTTP client (or a carelessly configured headless browser) can't fake. To scrape these sites, you can let BrowserCloud run a real browser for you or drive your own headless browser.
Tools like Puppeteer and Selenium provide the ability to write a program that controls a real web browser and simulates a real user's actions. This allows for scraping websites that are difficult or impossible to access without tools like this.
However, this comes at a cost: driving a real browser is extremely resource-intensive and can be flaky. Use these tools only when necessary, because automated browsers are hard on memory and CPU.
For the vast majority of sites, a simple GET request will suffice, so reach for a headless browser only if you're being blocked for not using a real one!
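Here is a minimal sketch with Selenium in Python (Puppeteer offers the same idea in JavaScript); it assumes a local Chrome install, with the driver managed by recent Selenium versions:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window; the page still executes JavaScript,
# sets cookies and loads fonts like a normal browser session.
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/")
    html = driver.page_source  # fully rendered HTML, after JavaScript has run
finally:
    driver.quit()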
7. CAPTCHA Solving
Certain websites restrict crawlers by requiring them to solve a CAPTCHA. Services like AntiCAPTCHA and 2Captcha, or BrowserCloud with its full integration, will solve these for you so your crawler can keep going, and you may need one of them when submitting a form protected by a CAPTCHA. Keep in mind that these services cost money and add noticeable delay to every solve, so if a site demands constant CAPTCHA solving, it may be hard to justify scraping it at all.
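As a rough illustration of how a solving service is usually wired in, here is a sketch against 2Captcha's public HTTP API (the endpoints and parameter names follow their documentation but should be verified against the provider's current docs; the API key, site key and page URL are placeholders):

import time
import requests

API_KEY = "YOUR_2CAPTCHA_KEY"           # placeholder
SITE_KEY = "TARGET_SITE_RECAPTCHA_KEY"  # placeholder, found in the target page's source
PAGE_URL = "https://example.com/form"

# Submit the reCAPTCHA to the solving service.
submit = requests.post("http://2captcha.com/in.php", data={
    "key": API_KEY,
    "method": "userrecaptcha",
    "googlekey": SITE_KEY,
    "pageurl": PAGE_URL,
    "json": 1,
}, timeout=30).json()
task_id = submit["request"]

# Poll until the service returns a token -- this routinely takes tens of seconds.
while True:
    time.sleep(10)
    result = requests.get("http://2captcha.com/res.php", params={
        "key": API_KEY,
        "action": "get",
        "id": task_id,
        "json": 1,
    }, timeout=30).json()
    if result["status"] == 1:
        token = result["request"]  # include this token when submitting the form
        break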