The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.
similarity(s1, s2, dist)is now the canonical API for similarity scores.compare(s1, s2, dist)is still available as a deprecated alias for compatibility.- For q-gram semimetrics other than
QGram, if both inputs are shorter thanq, the distance is now defined as:0.0when the inputs are equal1.0when the inputs differ
String distances act over any pair of iterators that define length (e.g. AbstractStrings, GraphemeIterators, or AbstractVectors)
The available distances are:
- Edit Distances
- Hamming Distance
Hamming() <: Metric - Jaro and Jaro-Winkler Distance
Jaro()JaroWinkler() <: SemiMetric - Levenshtein Distance
Levenshtein() <: Metric - Optimal String Alignment Distance (a.k.a. restricted Damerau-Levenshtein)
OptimalStringAlignment() <: SemiMetric - Damerau-Levenshtein Distance
DamerauLevenshtein() <: Metric - RatcliffObershelp Distance
RatcliffObershelp() <: SemiMetric
- Hamming Distance
- Q-gram distances (which compare the set of all substrings of length
qin each string)- QGram Distance
QGram(q::Int) <: SemiMetric - Cosine Distance
Cosine(q::Int) <: SemiMetric - Jaccard Distance
Jaccard(q::Int) <: SemiMetric - Overlap Distance
Overlap(q::Int) <: SemiMetric - Sorensen-Dice Distance
SorensenDice(q::Int) <: SemiMetric - MorisitaOverlap Distance
MorisitaOverlap(q::Int) <: SemiMetric - Normalized Multiset Distance
NMD(q::Int) <: SemiMetric
- QGram Distance
Following the Distances.jl package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric or StringMetric <: Metric.
You can always compute a certain distance between two strings using the following syntax
r = evaluate(dist, x, y)
r = dist(x, y)Here, dist is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein. You can compute the Levenshtein distance between x and y as
r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)The function similarity returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.
compare is kept as a deprecated alias for compatibility.
Levenshtein()("martha", "martha")
#> 0
similarity("martha", "martha", Levenshtein())
#> 1.0Consider X and Y two AbstractVectors of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j]), as follows:
pairwise(dist, X, Y)For instance,
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])pairwise is optimized in various ways (e.g., for the case of QGram-distances, dictionary of qgrams are pre-computed)
The package also adds convenience functions to find elements in a iterator of strings closest to a given string
-
findnearestreturns the value and index of the element initrwith the highest similarity score withs. Its syntax is:findnearest(s, itr, dist)Missing entries in
itrare ignored. -
findallreturns the indices of all elements initrwith a similarity score withshigher than a minimum score. Its syntax is:findall(s, itr, dist; min_score = 0.8)
Missing entries in
itrare ignored.
The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignment distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.
The package also defines Distance "modifiers" that are inspired by the Python package - fuzzywuzzy. These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).
- Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
- TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
- TokenMax normalizes the distance, and combine the
Partial,TokenSortandTokenSetmodifiers, with penalty terms depending on string.TokenMax(Levenshtein())corresponds to the distance defined in fuzzywuzzy
Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0- All string distances are case sensitive.
- For q-gram semimetrics other than
QGram, when both inputs are shorter thanq, identical inputs have distance0.0and different inputs have distance1.0.