User agents are a vital, though often underrated and misunderstood, element of web scraping.
Obtaining the data your business needs from the internet becomes a lot more complicated without user agents and, just as importantly, without using them correctly.
If you’re involved in scraping data, you need to understand user agents inside out.
A user agent is simply a string of information sent by your browser when surfing the web. When you connect to a website, your browser includes this string in the HTTP header and sends it to the website’s server.
There’s no standard for how to set out a user agent, but the information sent across will identify:
Your operating system
The type of device you’re using (such as a mobile device)
The browser you’re using and its version
Any extra information needed to ensure compatibility
The website’s server uses this information to tailor the response sent to you. For example, a website will need to know if you’re using a mobile device to send a mobile-optimised version of the site to you.
Here’s a real-life example of a user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36
This user agent identifies a device running Windows 10 on a 64-bit machine (Windows NT 10.0; Win64; x64) and using Chrome version 89 (Chrome/89.0.4389.128).
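Picking that string apart programmatically makes its structure easier to see. Here's a minimal Python sketch (standard library only) that extracts the platform details and the browser version from the example user agent above:

```python
import re

# The example user agent from above
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/89.0.4389.128 Safari/537.36")

# The platform details sit inside the first pair of parentheses
platform = re.search(r"\(([^)]*)\)", ua).group(1)

# The browser token and its version follow the "Chrome/" marker
browser = re.search(r"(Chrome)/([\d.]+)", ua)

print(platform)                            # Windows NT 10.0; Win64; x64
print(browser.group(1), browser.group(2))  # Chrome 89.0.4389.128
```

Real-world parsing is messier than this (many browsers imitate each other's tokens for historical reasons), so treat this as an illustration of the format rather than a production parser.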
User agents perform just one job: they identify your software and device to each website you visit. Without one, the website you want to view would not know how to present its pages to you.
That could result in the information on the page not displaying correctly or not appearing at all. Sending a user agent across allows a website to adapt its content and serve a version of the site that's compatible with your device's operating system and the browser you're using.
When it comes to anti-scraping defences, some websites may block certain user agents immediately because they associate them with scraping. It's not unheard of for scraping software to declare itself in the user agent, screaming out that the traffic is automated.
The easy solution might seem to be not declaring a user agent at all. However, this won't work: most HTTP clients fill the void with their own default user agent, which loudly advertises automation, and many sites block missing or default user agents as standard, so you won't be able to access the site.
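You can see this default-filling behaviour without sending a single request. A short sketch using Python's standard-library `urllib.request` (the URL below is just a placeholder; nothing is actually fetched):

```python
import urllib.request

# An opener built with no explicit headers announces itself as
# "Python-urllib/<version>" -- an instant giveaway to many sites.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]

# Overriding the default with a realistic browser string instead:
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/89.0.4389.128 Safari/537.36")
req = urllib.request.Request("https://example.com",
                             headers={"User-Agent": ua})
print(req.get_header("User-agent"))
```

Other popular clients behave the same way: the `requests` library, for instance, sends `python-requests/<version>` unless you override it.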
So, how do you make sure your user agent is going to be ok and won’t get you blocked?
The key to scraping data from the web is anonymity. And the best way to stay anonymous is to lose yourself in the crowd.
The most significant factor dictating anonymity is the size of your IP address pool. It used to be that if every request you sent originated from a different IP address, the website couldn't connect the dots and work out that a single web scraper was sitting behind those addresses.
Unfortunately, it’s no longer that simple. Websites have stepped up their game and improved their anti-scraping defences significantly. A part of that is a heavy focus on user agents.
To see how user agents can hinder and help us, we need to go back and look at some of the information they provide:
Old operating systems are rarely used, even less so if they're no longer supported and present a security risk for the user. A tiny number of user agents with old operating systems can still appear occasionally. But if a comparatively large number of queries suddenly comes from old operating systems, there's a good chance automation is behind them, and the website will block requests using those user agents. As such, your user agents should always reference up-to-date versions of operating systems.
Not only that, but the operating system element should always be for the most popular operating systems. Think about it from the website’s point of view: if you see everyone using Windows 10, and suddenly a user agent pops up using Windows XP, that’s going to stick out a mile. Likewise for variants and flavours of Linux or other little-used operating systems. If a website suddenly receives lots of queries from user agents it hardly sees, it’ll raise red flags.
The principles here are just the same as with operating systems; you need to lose yourself in the crowd and use up-to-date and popular user agents. However, browser versions are arguably even more critical.
With operating systems, there is an element of choice on the part of the end user. Some may not want to upgrade to the latest version, or cannot because of hardware limitations, so the website may grant some leeway.
Browser versions are different. Browsers update far more often than operating systems, and they typically update automatically in the background. This means older versions should be far less visible to websites than the current one, so a large number of requests coming from an old version may raise red flags, even if it's only a few months out of date.
The user agent is sent as an HTTP header at the very beginning of communication with a website. But note that headers can carry a great deal more information than just the user agent.
If you send headers across, you need to make sure that your user agent is consistent with the rest of the information in those headers. Otherwise, your target site will pick up on the mismatch and block your queries.
The best practice is to send your headers in the same order a real browser does and to set the previously visited page in the 'Referer' header. This is especially useful if you need to visit a product page directly, as you can set the website's homepage as the referring page. Ideally, you would visit the homepage first and then navigate to the product page to look more natural.
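As an illustration, here's a hypothetical header set for a direct product-page request, again using only standard-library `urllib.request`. The URLs are placeholders, and no request is actually sent; the point is that the user agent, the accompanying headers, and the Referer should all tell one consistent story:

```python
import urllib.request

# A header set a real Chrome-on-Windows visitor might plausibly send.
# Python dicts preserve insertion order, so the headers are defined in
# roughly the order a browser would send them.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/89.0.4389.128 Safari/537.36"),
    "Accept": ("text/html,application/xhtml+xml,application/xml;"
               "q=0.9,*/*;q=0.8"),
    "Accept-Language": "en-US,en;q=0.9",
    # The site's homepage as the referring page (placeholder URL)
    "Referer": "https://example.com/",
}

req = urllib.request.Request("https://example.com/products/123",
                             headers=headers)
print(req.header_items())
```

If the User-Agent claimed Firefox on macOS while the other headers matched Chrome's typical values, a well-defended site could spot the inconsistency.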
You may have worked out that if you only stick to the most popular operating systems and browser versions, the number of user agents at your disposal will diminish. There’s less scope to provide lots of variation.
While that's true, it's far better to use a smaller number of 'high-quality' user agents that you know will help you than to throw in a large number of user agents that might hinder you, just for the sake of adding a few more to the mix.
In all honesty, it’s not worth the risk. Quality wins over quantity here. Stick with the most popular operating systems and browser versions.
As you can see, user agents must be maintained and updated regularly. The best practice is to check and replace all user agents at least once every six months.
Always use real and up-to-date user agents
User agents must always be:
From an up-to-date and popular browser
From an up-to-date and popular operating system
Rotated through a high-quality pool
Matched to the headers
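Putting the rotation point into practice is simple. The sketch below keeps a small, hypothetical pool of popular, roughly contemporaneous user agents and picks one at random per request; in a real scraper you would refresh these strings every few months, as discussed above:

```python
import random

# A small, illustrative pool of popular user agents from the same era.
# These strings are examples only and should be replaced with current,
# real-world values before use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) "
    "Gecko/20100101 Firefox/87.0",
]

def pick_user_agent():
    """Pick a user agent at random from the pool for the next request."""
    return random.choice(USER_AGENTS)

headers = {"User-Agent": pick_user_agent()}
print(headers["User-Agent"])
```

Because each string in the pool names an up-to-date, popular browser and operating system, any one of them blends into the crowd; rotating among them just stops a single identity from accumulating too many requests.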