Browser Automation with Selenium: Fingerprints, recognizability and traceability?

I want to use Selenium/WebDriver to simulate a browser and scrape some website content with it. Even if it's not the fastest method, it has many advantages for me, such as executing scripts.

Many websites, for example search engines like Google or Bing, forbid access via automated methods.

For one tool I need to scrape the estimated result stats from Google for several keywords. It would work like this: the simulated browser visits google.com, types in a keyword and scrapes the results, then after a short pause types in the next keyword, scrapes the results, and so on…

My question is: Is it possible for a website to recognize that I’m using Selenium to simulate the browser instead of using the browser by hand? The Google case especially gives me some doubts. I know Selenium is partly developed by Google, or at least by some people working for Google. So does Selenium leave some fingerprints, or is it impossible to tell whether I’m using the browser myself or driving it with Selenium, even for Google?

No. With WebDriver, nobody can actually see that you’re using Selenium rather than operating the browser by hand. I’m not sure about the old Selenium RC, but it should work the same way. Here’s how it works:

  1. Selenium opens up a browser with a clean profile (or with a profile you selected).
  2. Selenium is hooked up to the browser so it can steer and control it. But the browser still does most of the work. Basically, Selenium replaces the user inputs to the browser, but nothing more.

You can easily verify this by reading the contents of the HTTP headers sent by your browser.
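One way to do that check is to point both the Selenium-driven browser and your normal browser at a tiny local server that echoes back whatever headers it receives. The following is only a minimal sketch using Python's standard library; the names `HeaderEchoHandler`, `start_echo_server` and `fetch_echoed_headers` are made up for this illustration, and the `urllib` client merely stands in for a browser so the script can be tried on its own.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderEchoHandler(BaseHTTPRequestHandler):
    """Reply to every GET with the request headers, encoded as JSON."""

    def do_GET(self):
        # self.headers holds exactly what the client sent.
        body = json.dumps(dict(self.headers.items()), indent=2).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_echo_server(port=0):
    """Start the echo server on localhost in a background thread.

    port=0 lets the OS pick a free port; the chosen port is returned.
    """
    server = HTTPServer(("127.0.0.1", port), HeaderEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def fetch_echoed_headers(url, extra_headers=None):
    """Request the echo URL with urllib and return the echoed headers."""
    # An empty ProxyHandler bypasses any system proxy for the local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=extra_headers or {})
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    server, port = start_echo_server()
    url = "http://127.0.0.1:%d/" % port
    print("Open this URL in each browser and compare the output:", url)
    print(json.dumps(fetch_echoed_headers(url), indent=2))
    input("Press Enter to stop the server...")
    server.shutdown()
    server.server_close()
```

Load the printed URL once by hand and once through Selenium (for example with `driver.get(url)`), then compare the two JSON documents.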

If you ever actually need Selenium to be recognized by your server, you can use BrowserMob Proxy and add a custom header to your requests.

All that said, there is one thing you must be aware of. While there’s no way to detect Selenium directly, the website you’re visiting can pick up some indirect clues. The usual one is too many requests made in virtually no time, which might be an issue for you. Make sure your Selenium script behaves like a user.
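As a rough illustration of pacing, the sketch below runs one action per keyword and sleeps for a random interval between them. The helper names (`human_delay`, `run_paced`) and the 2 to 8 second bounds are assumptions chosen for this example, not values any website publishes, and random pauses alone are no guarantee that a site will treat the traffic as human.

```python
import random
import time


def human_delay(min_seconds=2.0, max_seconds=8.0, rng=random):
    """Return a random pause length, in seconds, within the given bounds."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("need 0 <= min_seconds <= max_seconds")
    return rng.uniform(min_seconds, max_seconds)


def run_paced(keywords, action, min_seconds=2.0, max_seconds=8.0,
              sleep=time.sleep, rng=random):
    """Call action(keyword) for each keyword, pausing between calls.

    No pause is added before the first keyword or after the last one.
    The sleep and rng parameters exist so the pacing can be tested
    without actually waiting.
    """
    results = []
    for index, keyword in enumerate(keywords):
        if index > 0:
            sleep(human_delay(min_seconds, max_seconds, rng))
        results.append(action(keyword))
    return results


if __name__ == "__main__":
    # Placeholder action; in a real script this would drive the browser,
    # e.g. type the keyword into the search box and read the result.
    def fake_search(keyword):
        return "searched for " + keyword

    for line in run_paced(["selenium", "webdriver"], fake_search, 0.5, 1.5):
        print(line)
```

In a real script, `action` would be the function that types the keyword via Selenium and scrapes the result; keeping the pacing separate from the browser code makes the delays easy to tune.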