GitHub - matthieugomez/StringDistances.jl: String Distances in Julia

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

What's New In 1.0.0

similarity(s1, s2, dist) is now the canonical API for similarity scores. compare(s1, s2, dist) is still available as a deprecated alias for compatibility.
For q-gram semimetrics other than QGram, if both inputs are shorter than q, the distance is now defined as:
- 0.0 when the inputs are equal
- 1.0 when the inputs differ

Supported Distances

String distances act on any pair of iterators that define length (e.g. AbstractStrings, GraphemeIterators, or AbstractVectors).

The available distances are:

Edit Distances
- Hamming Distance Hamming() <: Metric
- Jaro and Jaro-Winkler Distance Jaro() JaroWinkler() <: SemiMetric
- Levenshtein Distance Levenshtein() <: Metric
- Optimal String Alignment Distance (a.k.a. restricted Damerau-Levenshtein) OptimalStringAlignment() <: SemiMetric
- Damerau-Levenshtein Distance DamerauLevenshtein() <: Metric
- RatcliffObershelp Distance RatcliffObershelp() <: SemiMetric
Q-gram distances (which compare the set of all substrings of length q in each string)
- QGram Distance QGram(q::Int) <: SemiMetric
- Cosine Distance Cosine(q::Int) <: SemiMetric
- Jaccard Distance Jaccard(q::Int) <: SemiMetric
- Overlap Distance Overlap(q::Int) <: SemiMetric
- Sorensen-Dice Distance SorensenDice(q::Int) <: SemiMetric
- MorisitaOverlap Distance MorisitaOverlap(q::Int) <: SemiMetric
- Normalized Multiset Distance NMD(q::Int) <: SemiMetric
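
To make the q-gram idea concrete, here is a minimal sketch in plain Julia of a set-based Jaccard distance over q-grams, including the rule for inputs shorter than q. The names qgram_set and jaccard_sketch are hypothetical and this is not the package's implementation, only an illustration of the definition.

```julia
# Collect every contiguous window of q characters as a Set of Strings.
# `qgram_set` and `jaccard_sketch` are hypothetical names, not package functions.
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)                       # index by character, not by byte
    return Set(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B| over the two q-gram sets.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    if isempty(a) && isempty(b)
        # Both inputs are shorter than q: 0.0 if equal, 1.0 otherwise.
        return s1 == s2 ? 0.0 : 1.0
    end
    return 1.0 - length(intersect(a, b)) / length(union(a, b))
end

jaccard_sketch("martha", "marhta", 2)   # 2 shared bigrams out of 8 distinct: 0.75
```

The other q-gram distances differ only in how they combine the counts of shared and distinct q-grams.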

Syntax

Following the Distances.jl package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric or StringMetric <: Metric.

Computing the distance between two strings (or iterators)

You can always compute a given distance between two strings using either of the following forms:

r = evaluate(dist, x, y)
r = dist(x, y)

Here, dist is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein. You can compute the Levenshtein distance between x and y as

r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)
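
The two call forms are equivalent because a distance is an instance of a type that can be called like a function. The following sketch shows that pattern in plain Julia with a toy Hamming-style distance; ToyHamming and evaluate_sketch are hypothetical names chosen for illustration and do not come from the package.

```julia
# A distance is a small struct; making it callable gives the `dist(x, y)` form.
# `ToyHamming` and `evaluate_sketch` are hypothetical names for illustration.
struct ToyHamming end

function (d::ToyHamming)(x, y)
    length(x) == length(y) ||
        throw(DimensionMismatch("inputs must have equal length"))
    # Count the positions at which the two iterators disagree.
    return count(t -> t[1] != t[2], zip(x, y))
end

# The functional form simply forwards to the callable instance.
evaluate_sketch(dist, x, y) = dist(x, y)

ToyHamming()("martha", "marhta")                  # 2
evaluate_sketch(ToyHamming(), "martha", "marhta") # 2
```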

The function similarity returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar. compare is kept as a deprecated alias for compatibility.

Levenshtein()("martha", "martha")
#> 0
similarity("martha", "martha", Levenshtein())
#> 1.0
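
As an illustration of "1 minus the normalized distance", the sketch below computes a Levenshtein distance in plain Julia and turns it into a score in [0, 1]. It assumes the distance is normalized by the length of the longer input; that normalization, and the names levenshtein_sketch and similarity_sketch, are assumptions made for this example rather than a description of the package internals.

```julia
# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
# `levenshtein_sketch` and `similarity_sketch` are hypothetical helper names.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))              # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: divide by the longer length, then take 1 minus that.
function similarity_sketch(s1::AbstractString, s2::AbstractString)
    maxlen = max(length(s1), length(s2))
    maxlen == 0 && return 1.0                # two empty strings are identical
    return 1.0 - levenshtein_sketch(s1, s2) / maxlen
end

levenshtein_sketch("kitten", "sitting")   # 3
similarity_sketch("martha", "martha")     # 1.0
```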

Computing the distance between two AbstractVectors of strings (or iterators)

Let X and Y be two AbstractVectors of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j]), as follows:

pairwise(dist, X, Y)

For instance,

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

pairwise is optimized in various ways (e.g., for q-gram distances, the dictionaries of q-grams are precomputed).
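
The shape of the result can be sketched without the package: entry [i, j] of the matrix holds dist(X[i], Y[j]). The example below is an unoptimized stand-in with hypothetical names (pairwise_sketch, length_gap), using a toy distance so that it runs in plain Julia.

```julia
# Build the distance matrix with a two-dimensional comprehension:
# entry [i, j] is dist(X[i], Y[j]). `pairwise_sketch` is a hypothetical name.
pairwise_sketch(dist, X, Y) = [dist(x, y) for x in X, y in Y]

# Any two-argument callable works as `dist`; here, a toy length difference.
length_gap(x, y) = abs(length(x) - length(y))

D = pairwise_sketch(length_gap, ["martha", "kitten"], ["marhta", "sitting"])
size(D)   # (2, 2): one row per element of X, one column per element of Y
```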

Find closest string

The package also adds convenience functions to find the elements in an iterator of strings that are closest to a given string.

findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:
```
 findnearest(s, itr, dist)
```
Missing entries in itr are ignored.
findall returns the indices of all elements in itr with a similarity score with s higher than a minimum score. Its syntax is:
```
 findall(s, itr, dist; min_score = 0.8)
```
Missing entries in itr are ignored.

The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignment distances, as these algorithms can stop early once the distance exceeds a certain threshold.
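
The behavior described above (best match, threshold filter, missing entries skipped) can be sketched in plain Julia as follows. All names ending in _sketch, and the toy prefix_score function, are hypothetical; the sketch does no early stopping and is not the package's implementation.

```julia
# Toy score: share of leading characters in common, relative to the longer input.
function prefix_score(a, b)
    n = 0
    for (ca, cb) in zip(a, b)
        ca == cb || break
        n += 1
    end
    return n / max(length(a), length(b), 1)
end

# Return (value, index) of the entry with the highest score, skipping `missing`.
function findnearest_sketch(s, itr, score)
    best_val, best_idx, best_score = nothing, nothing, -Inf
    for (i, x) in pairs(itr)
        ismissing(x) && continue
        sc = score(s, x)
        if sc > best_score
            best_val, best_idx, best_score = x, i, sc
        end
    end
    return best_val, best_idx
end

# Indices of all non-missing entries whose score is higher than `min_score`.
findall_sketch(s, itr, score; min_score = 0.8) =
    [i for (i, x) in pairs(itr) if !ismissing(x) && score(s, x) > min_score]

findnearest_sketch("martha", ["kitten", missing, "marhta"], prefix_score)  # ("marhta", 3)
findall_sketch("martha", ["martha", missing, "marhta"], prefix_score)      # [1]
```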

fuzzywuzzy

The package also defines distance "modifiers" inspired by the Python package fuzzywuzzy. These modifiers are particularly helpful for matching strings composed of multiple words (e.g. addresses, company names).

- Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word order by returning the distance between the two strings after reordering their words alphabetically.
- TokenSet adjusts for differences in word order and word count by returning the distance between the intersection of the two strings and each string.
- TokenMax normalizes the distance and combines the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.

Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0
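
To show what the Partial and TokenSort modifiers do conceptually, here is a rough sketch in plain Julia. It wraps an arbitrary two-argument distance; mismatch_distance is a toy stand-in, and every name in the block is hypothetical. It assumes Partial compares the shorter input against windows of the same length in the longer one, which may differ from what the package does for some distances.

```julia
# Toy distance: mismatched positions plus the difference in length.
# Stands in for any two-argument distance; all names here are hypothetical.
function mismatch_distance(a, b)
    ca, cb = collect(a), collect(b)
    n = min(length(ca), length(cb))
    return count(i -> ca[i] != cb[i], 1:n) + abs(length(ca) - length(cb))
end

# Partial: slide a window the size of the shorter input over the longer one
# and keep the smallest distance found.
function partial_sketch(dist, s1, s2)
    short, long = sort([collect(s1), collect(s2)]; by = length)
    n = length(short)
    return minimum(dist(short, long[i:i+n-1]) for i in 1:(length(long) - n + 1))
end

# TokenSort: sort the whitespace-separated words, rejoin them, then compare.
token_sort_sketch(dist, s1, s2) =
    dist(join(sort(split(s1)), " "), join(sort(split(s2)), " "))

partial_sketch(mismatch_distance, "this string", "this string is longer")  # 0
token_sort_sketch(mismatch_distance, "new york mets", "mets new york")     # 0
```

Passing the wrapped distance as an argument mirrors how the modifiers in this section take a distance such as Levenshtein() and return a new one.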

Notes

All string distances are case sensitive.
For q-gram semimetrics other than QGram, when both inputs are shorter than q, identical inputs have distance 0.0 and different inputs have distance 1.0.
