User Agent Parsing

Created: 2013/09/25 21:48:59+0000

For my Web Utilities project I have been parsing user agent strings. You can take a look at the results for your browser by going to my user agent parser. All the APIs I looked at focus on extracting the browser, OS, layout engine and device, and ignore many of the additional details. There are sites such as http://www.useragentstring.com/, but I didn't want to make requests to a web API to get the information, so I wrote the parser myself.

It has been slow, boring and unpleasant work: lots of data processing, refactoring and testing. I've gathered 349 user agent strings to parse so far, and there are lots of special cases to cater for.

Many user agent parsers are based on regular expressions. I instead attempted to parse the string one token at a time, with the intention of accounting for every token. The hope was that even unfamiliar user agents would still yield some information. This works for many user agents, but some only identify themselves and do not provide any informative tokens.

Examples of the user agents that may need to be parsed are:

  • Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.22+ (KHTML, like Gecko) Chromium/17.0.963.56 Chrome/17.0.963.56 Safari/535.22+ Debian/7.0 (3.4.2-2.1) Epiphany/3.4.2
  • BlackBerry9000/4.6.0.167 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/102 ips-agent
  • panscient.com
  • Java/1.7.0_09
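
To make the token-at-a-time idea concrete, here is a minimal sketch in Python. It is not the parser from my project, just an illustration. It assumes only two kinds of token: a "product" (a run of non-space characters, optionally split into name and version at the first "/") and a "comment" (everything between "(" and the next ")"). The names Token and tokenize are made up for this example, and nested parentheses are not handled.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    kind: str               # "product" or "comment"
    name: str               # product name, or the raw comment text
    version: Optional[str]  # text after the first "/", if any


def tokenize(user_agent: str) -> List[Token]:
    """Split a user agent string into product and comment tokens."""
    tokens: List[Token] = []
    i, n = 0, len(user_agent)
    while i < n:
        ch = user_agent[i]
        if ch.isspace():
            i += 1
        elif ch == "(":
            # Comment: runs to the next ")", or to the end if unterminated.
            end = user_agent.find(")", i)
            if end == -1:
                end = n
            tokens.append(Token("comment", user_agent[i + 1:end].strip(), None))
            i = end + 1
        else:
            # Product: runs to the next whitespace or "(".
            j = i
            while j < n and not user_agent[j].isspace() and user_agent[j] != "(":
                j += 1
            name, sep, version = user_agent[i:j].partition("/")
            tokens.append(Token("product", name, version if sep else None))
            i = j
    return tokens
```

With this sketch, "Java/1.7.0_09" becomes a single product token with the name "Java" and the version "1.7.0_09", while "panscient.com" becomes a product token with no version at all. Even an unfamiliar string gets split into pieces, but working out what each piece means is still a separate job.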

A user agent may consist of tokens describing what the browser is based on, very specific vendor-based information, an identifier for the user agent, or a vague description of the library used to make the request. Without already knowing what these tokens are, it is not immediately clear what each one means.

This is why I'd like to see all these old traditions, like starting with Mozilla, vanish and be replaced with structured user agent strings. My first thought was to use a Lisp-like syntax. This would provide clear tree-based structures and well-defined token separation. That said, I don't want to propose a single format. My preferred solution would be to start all user agents with a short code to indicate the scheme and parse the remaining string according to this scheme. I'll try to present some examples soon.
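
In the meantime, here is a rough sketch of how scheme-based dispatch could work. Everything in it is hypothetical: the scheme code "sx1", the field names in the sample string, and the function names parse_sexpr and parse_structured are invented for illustration and are not a proposed format. It assumes the scheme code is the first space-separated word and that the rest of the string is handed to whichever parser is registered for that code, here a small Lisp-like reader that produces nested lists.

```python
from typing import Callable, Dict, List, Union

SExpr = Union[str, List["SExpr"]]


def parse_sexpr(text: str) -> SExpr:
    """Parse one parenthesised expression into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> SExpr:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items: List[SExpr] = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # skip the ")"
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing input after expression")
    return tree


# Hypothetical registry: scheme code -> parser for the rest of the string.
SCHEMES: Dict[str, Callable[[str], SExpr]] = {"sx1": parse_sexpr}


def parse_structured(user_agent: str) -> SExpr:
    """Read the leading scheme code and dispatch to the matching parser."""
    scheme, _, body = user_agent.partition(" ")
    try:
        parser = SCHEMES[scheme]
    except KeyError:
        raise ValueError(f"unknown scheme code: {scheme!r}") from None
    return parser(body)
```

Given the made-up string "sx1 (epiphany 3.4.2 (engine webkit 535.22) (os linux x86_64))", parse_structured returns a list whose first two items are "epiphany" and "3.4.2", followed by one nested list per sub-expression. A string that starts with an unregistered code, such as a traditional "Mozilla/5.0 ..." user agent, raises ValueError, so a real parser could fall back to the old heuristics at that point.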