The Soundex phonetic algorithms

soundex(word, maxCodeLen = 4L, clean = TRUE)

refinedSoundex(word, maxCodeLen = 10L, clean = TRUE)

Arguments

word

string or vector of strings to encode

maxCodeLen

maximum length of the resulting encodings, in characters

clean

if TRUE, return NA for unknown alphabetical characters

Value

soundex encoded character vector

Details

The function soundex phonetically encodes the given string using the soundex algorithm. The function refinedSoundex uses Apache's refined soundex algorithm. Both implementations are loosely based on the Apache Commons Java editions.

The argument maxCodeLen limits the length of the returned encoding, in characters.
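
The encoding and the length limit described above can be illustrated with a minimal base R sketch of the classic soundex rules. This is not the package's implementation: the helper name soundex_sketch and its max_len argument are hypothetical, and the choices to pad with zeros up to max_len and to return NA for input with no A-Z letters are assumptions made for illustration.

```r
# Hypothetical helper: a plain-R sketch of classic soundex, for illustration only.
soundex_sketch <- function(word, max_len = 4L) {
  # Digit for each letter A..Z; "0" marks vowels and Y.
  codes <- strsplit("01230120022455012623010202", "")[[1]]
  names(codes) <- LETTERS
  # H and W are silent and do not separate letters that share a code.
  codes[c("H", "W")] <- "-"

  encode_one <- function(w) {
    if (is.na(w)) return(NA_character_)
    chars <- strsplit(toupper(w), "")[[1]]
    chars <- chars[chars %in% LETTERS]           # keep only A-Z
    if (length(chars) == 0L) return(NA_character_)  # assumption: nothing to encode
    digits <- unname(codes[chars])
    out <- chars[1]                              # the first letter is kept as-is
    last <- digits[1]
    for (d in digits[-1]) {
      if (d == "-") next                         # skip H and W entirely
      if (d != "0" && d != last) out <- c(out, d)
      last <- d                                  # a vowel resets the "previous code"
    }
    out <- paste(out, collapse = "")
    # Assumption: pad with zeros, then truncate to max_len characters.
    substr(paste0(out, strrep("0", max_len)), 1L, max_len)
  }

  vapply(as.character(word), encode_one, character(1), USE.NAMES = FALSE)
}

soundex_sketch(c("wheel", "school", "benji"))
#> [1] "W400" "S400" "B520"
```

For these three words the sketch reproduces the encodings shown in the Examples section; agreement with soundex on other inputs, or for other values of maxCodeLen, is not guaranteed.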

The soundex and refinedSoundex algorithms are only defined for inputs over the standard English alphabet, i.e., "A-Z." Non-alphabetical characters, such as spaces, hyphens, and numbers, are removed from the string in a locale-dependent fashion. Other letters, such as "Ü," may be permissible in the current locale but are unknown to soundex and refinedSoundex. For inputs outside of the known range, the output is undefined: NA is returned and a warning is thrown. If clean is FALSE, soundex and refinedSoundex attempt to process the strings anyway. The default is TRUE.
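
The two kinds of input handling described above (characters that are silently stripped versus letters that are flagged as unknown) can be sketched in base R. The helper flag_unknown is hypothetical and is not the package's code; it only assumes that stripping uses the locale's notion of a letter and that anything outside A-Z which survives the strip counts as unknown.

```r
# Hypothetical helper illustrating the pre-processing described above.
flag_unknown <- function(word) {
  # Drop everything that is not a letter in the current locale
  # (spaces, hyphens, digits, punctuation).
  stripped <- gsub("[^[:alpha:]]", "", toupper(word))
  # Letters outside A-Z survive the strip but are unknown to soundex.
  chars <- strsplit(stripped, "")
  unknown <- vapply(chars, function(ch) any(!ch %in% LETTERS), logical(1))
  data.frame(input = word, stripped = stripped, unknown = unknown,
             stringsAsFactors = FALSE)
}

flag_unknown(c("O'Brien-Smith 3rd", "wheel"))
```

Both inputs here reduce to plain A-Z letters, so neither is flagged. A string containing a letter such as "Ü" would be flagged in a locale that treats it as alphabetic; with the default clean = TRUE, that is the case in which soundex and refinedSoundex return NA with a warning.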

Caveats

The soundex and refinedSoundex algorithms are only defined for inputs over the standard English alphabet, i.e., "A-Z." For inputs outside this range, the output is undefined.

References

Charles P. Bourne and Donald F. Ford, "A study of methods for systematically abbreviating English words and names," Journal of the ACM, vol. 8, no. 4 (1961), p. 538-552.

James P. Howard, II, "Phonetic Spelling Algorithm Implementations for R," Journal of Statistical Software, vol. 95, no. 8 (2020), p. 1-21, <doi:10.18637/jss.v095.i08>.

Howard B. Newcombe, James M. Kennedy, "Record linkage: making maximum use of the discriminating power of identifying information," Communications of the ACM, vol. 5, no. 11 (1962), p. 563-566.

See also

Other phonics: caverphone(), cologne(), lein(), metaphone(), mra_encode(), nysiis(), onca(), phonex(), phonics(), rogerroot(), statcan()

Examples

soundex("wheel")
#> [1] "W400"
soundex(c("school", "benji"))
#> [1] "S400" "B520"