Tuesday, March 3, 2009

Jaro-Winkler implementation in Ruby for User-Agent matching

Working in the mobile web, it's hard to keep up with the constant glut of devices on the market and their rapidly evolving capabilities. Of course, progressive enhancement is key; the markup you use in a mobile site should be minimal and look good without any CSS, JavaScript, or images, which is the worst-case. But we don't want to stop there; we want our sites to look as good as they can on as many mobile devices as possible.

WURFL is here to help with that. A crowd-sourced database of mobile devices, updated regularly, WURFL gives you detailed information on the capabilities of over 10,000 mobile devices indexed by user-agent string and grouped by manufacturer. It's relatively straightforward to set up a cronjob for a rake task that downloads an updated WURFL file and stores it in a table in your database; using an ActiveRecord model, you can then query the data provided by WURFL to get all the details for any device which has a user-agent string in the database.

But device makers throw a wrench in these efforts: user-agent strings are frequently updated or modified from their canonical form without rhyme or reason.

Fortunately, we have the Jaro-Winkler distance algorithm, which gives us a quick-and-dirty matching algorithm for strings that are mostly the same but may vary in arbitrary ways. Using this algorithm, we can run indexed queries for exact matches, then pull in some subset (perhaps matching the first 20 characters of the mystery UA string), iterate over these devices, and find the best match to return. Failing to find any implementation of the Jaro-Winkler algorithm in Ruby, I had to cobble together my own; I hope you find it useful.

module JaroWinkler

    def self.distance(str1, str2)
      str1.strip!
      str2.strip!

      if str1 == str2
        return 1
      end

      # str2 should be the longer string
      if str1.length > str2.length
        tmp = str2
        str2 = str1
        str1 = tmp
      end

      lmax = str2.length

      # arrays to keep track of positions of matches
      found1 = Array.new(str1.length, false)
      found2 = Array.new(str2.length, false)

      midpoint = ((str1.length / 2) - 1).to_i

      common = 0

      for i in 0..str1.length
        first = 0
        last = 0
        if midpoint >= i
            first = 1
            last = i + midpoint
        else
            first = i - midpoint
            last = i + midpoint
        end

        last = lmax if last > lmax

        for j in first..last
          if str2[j] == str1[i] and found2[j] == false
            common += 1
            found1[i] = true
            found2[j] = true
            break
          end
        end
      end

      last_match = 1
      tr = 0
      for i in 0..found1.length
        if found1[i]
          for j in (last_match..found2.length)
            if found2[j]
              last_match = j + 1
              tr += 0.5 if str1[i] != str2[j]
            end
          end
        end
      end

      onethird = 1.0/3
      if common > 0
        return [(onethird * common / str1.length) +
                (onethird * common / str2.length) +
                (onethird * (common - tr) / common), 1].min
      else
        return 0
      end
    end

end

This will return a value such that 0 <= value <= 1. Matches near the beginning of the strings are weighted more heavily than matches near the end, making this a good algorithm to use with user-agent strings which are typically long and have minor variations toward the end.

That said, this isn't necessarily the optimal method to detect user-agent strings, and since it can easily give false positives in cases where a desktop and a mobile version of the same browser exist, it should be used in conjunction with a whitelist of known desktop browsers (which, though substantial, is much shorter than the list of all mobile browsers). So I offer this code with the disclaimer that Jaro-Winkler may not be the most effective algorithm for user-agent string matching.

By the way, I'm just learning Ruby, so if I'm doing anything atrocious here please let me know.

4 comments:

Anonymous said...

You might also want to look into the algorithms implemented in the amatch ruby library and see if any of them suit your dataset a little better:

http://amatch.rubyforge.org/doc/classes/Amatch.html

David Golightly said...

Thanks marzagao, I hadn't seen that library. Will give it a go.

Anonymous said...
This comment has been removed by a blog administrator.
Anonymous said...
This comment has been removed by a blog administrator.