Metaphone in R

I was working on a data merge this weekend with some county-level data related to the NFIP. One of the datasets did not include FIPS codes, just county names. There are plenty of rational ways one could deal with this, but I saw before me a glorious yak who desperately needed a shave. So I did the obvious thing and decided to Metaphone the county names to normalize them.
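As a rough illustration of the idea, the merge can be keyed on a normalized form of each county name rather than on the raw string. The R sketch below is not the code from this project: `name_key()` is a hypothetical stand-in that only upper-cases names and strips a trailing "County" and any non-letters, and the two data frames are made-up examples. In practice a phonetic encoder such as Metaphone would take the place of `name_key()`.

```r
# Hypothetical stand-in for a phonetic encoder such as Metaphone.
# It only normalizes case, a trailing "County", and non-letter characters;
# it does nothing about spelling variations.
name_key <- function(x) {
  x <- toupper(x)
  x <- sub("[[:space:]]+COUNTY$", "", x)
  gsub("[^A-Z]", "", x)
}

# Illustrative data only: one table has FIPS codes, the other has names alone.
with_fips <- data.frame(
  fips = c("24003", "24510"),
  county = c("Anne Arundel County", "Baltimore City"),
  stringsAsFactors = FALSE
)
names_only <- data.frame(
  county = c("ANNE ARUNDEL", "Baltimore city"),
  policies = c(10, 20),  # made-up values
  stringsAsFactors = FALSE
)

# Build the same key on both sides, then join on it.
with_fips$key <- name_key(with_fips$county)
names_only$key <- name_key(names_only$county)

merged <- merge(
  names_only,
  with_fips[, c("fips", "key")],
  by = "key",
  all.x = TRUE  # keep unmatched rows so failures are visible as NA
)
```

County names repeat across states, so a real merge would presumably need to key on the state as well as the encoded name.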

This, of course, required writing an implementation of Metaphone in R. Before that, I had to decide which Metaphone to use, because Metaphone is a family of three algorithms. The first is the original Metaphone, which is widely regarded as flawed and yet is still one of the best options for phonetic spelling. The second is Double Metaphone, which can produce two different encodings for the same word. The third is Metaphone 3, which is patented and therefore essentially unusable for a few more years.

I went with the original Metaphone, but this got complicated quickly. It turns out nobody can agree on what the correct standard for Metaphone is. There’s a version provided as part of Apache Commons and another included with base PHP. Apparently, and I cannot stress enough that I have not tested this, they produce different results in some edge cases. My version is derived from a JavaScript version that is itself based on the PHP version.

For regression testing, I use a test suite of 83 examples taken from the JavaScript implementation.

The package is available for download from GitHub and I have set up a dedicated page here.

Over the next few weeks, I hope to add other phonetic algorithms. Of course, contributions are welcome.

Image by Arian Zwegers / Flickr.

  • tacitus voltaire

    Metaphone 3 is available in C++, C#, Java, PL/SQL, and Perl source for $240. (www.amorphics.com) We do allow for special purchase terms of $60 for non commercial purposes – if you feel that you qualify please write me at [email protected]

    You could easily, I’m sure, port it to R for your own use. I don’t know why you call it “unusable”!!

    I developed the original Metaphone algorithm 25 years ago but frankly I consider it deprecated now, and if you absolutely must only use free software please implement it as Double Metaphone from the reference implementation in C++ at http://aspell.net/metaphone/dmetaph.cpp

    Please feel free to write me anytime in regard to any problems or questions you have regarding Metaphone algorithms!

    – Lawrence Philips

    • First, thanks for the quick reply. Second, Double Metaphone is on my agenda. I started it, but I got caught up in the semantic question of how to return two values (the primary and secondary metaphone values) within the framework provided by R. Most R functions should work on single strings and also on character vectors, and it is unclear how the return value should be structured in the vector case. So as soon as I wrap my head around it, I will probably use the reference implementation.

      As for Metaphone3, the licensing terms do not seem to support one of the traditional free software licenses…and this implementation of Metaphone is released under a 2-clause BSD license. I completely respect that, and that’s fine, but it is a significant barrier to entry. If I have misread the licensing terms, let me know.

      • tacitus voltaire

        Aha, I see.

        Double Metaphone should only return a second encoding when there is more than one way to pronounce a word. In practice the algorithm will only give a second encoding in a small percentage of cases, although it still gives a number of incorrect secondary encodings, which I am not happy with and worked to reduce in Metaphone 3. On the other hand, in the cases where the secondary pronunciation is valid, you risk missing matches if you don’t use it.

        As a practical matter, you can use a delimiter in your return string to indicate that a secondary encoding will be found there after the first, and separate them out after the value is returned from the routine.

        Good luck!

        • Oh, I know. Frankly, the hit-or-miss nature of the secondary encoding makes the semantics even worse. What I will probably do is return a complex structure and, if the secondary is not different, just return the primary twice. That way callers can be sure of what they are getting back and can sort it out themselves.

      • J. Aravind

        I have implemented Double Metaphone as part of my R package PGRdup. It returns a `list` of two `vectors`. The first character `vector` contains the primary double metaphone encodings, while the second character `vector` contains the alternate encodings. Where there is no alternate encoding, both vectors hold the primary encoding. The `C` code for the double metaphone algorithm was adapted from Maurice Aubrey’s Perl module hosted in the gitpan/Text-DoubleMetaphone public GitHub repository.
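
The two return shapes discussed in this thread, a delimited string and a list of two parallel character vectors, can be connected with a short R sketch. This is only an illustration under stated assumptions: `split_encodings()` is a hypothetical helper, the `|` delimiter is an arbitrary choice, and the input codes are made up rather than real Double Metaphone output. It is not taken from PGRdup or from any package mentioned above.

```r
# Hypothetical helper: turn "PRIMARY|SECONDARY" strings into a list of two
# parallel character vectors. Where no secondary encoding is present, the
# primary is repeated, so both vectors always have the same length as the input.
split_encodings <- function(encoded, sep = "|") {
  parts <- strsplit(encoded, sep, fixed = TRUE)
  primary <- vapply(parts, function(p) p[1], character(1))
  secondary <- vapply(
    parts,
    function(p) if (length(p) > 1) p[2] else p[1],
    character(1)
  )
  list(primary = primary, secondary = secondary)
}

# Made-up codes for illustration, not real encoder output.
codes <- c("ABC|ABD", "XYZ")
result <- split_encodings(codes)
```

Because both vectors line up with the input, a caller can match on `result$primary`, on `result$secondary`, or on both, without checking whether a secondary encoding exists for each element.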