
I Benchmarked Every Go String Similarity Library: Building strsim

Go's string similarity ecosystem is fragmented across 8+ libraries with Unicode bugs, NaN edge cases, and missing phonetic support. Here's what I found and why I built strsim.

Tags: Go · Open Source · Algorithms · Benchmarks · NLP

The Problem

String similarity is everywhere: search suggestions, spell checking, duplicate detection, record matching, autocomplete. You’d expect a mature language like Go to have a solid, comprehensive library for this. It doesn’t.

Instead, the ecosystem is fragmented across 8+ small libraries, each implementing a subset of algorithms with inconsistent APIs. Worse, several have bugs that produce incorrect results — silently. I audited them all, found real problems, and built strsim to consolidate everything into one correct, fast, zero-dependency package.

The Go Ecosystem

Here are the Go libraries with meaningful adoption for string similarity:

Library | Stars | Algorithms | Phonetic | Unicode-safe | Empty-string safe
--- | --- | --- | --- | --- | ---
go-edlib | 595 | 10 (edit + token) | No | Yes | NaN
strutil | 415 | 8 (edit + token) | No | Yes | NaN
smetrics | 236 | 5 (edit) + Soundex | Soundex only | No | Panic
matchr | ~100 | 12 (edit + phonetic) | Yes | Yes | OK
strsim | – | 15 (edit + token + phonetic) | Yes (4) | Yes | Yes

None of the existing libraries combine edit distance, token-based similarity, and phonetic encoding in a single package. And the bugs I found during auditing were more concerning than I expected.

What I Found by Actually Reading the Code

I didn’t just read READMEs — I cloned every repo, ran their tests, and systematically probed edge cases. Here’s what I found.

smetrics: Broken Unicode

smetrics operates on bytes, not runes. Every function compares a[i] == b[j] using byte indexing, which silently produces incorrect results for any non-ASCII text.

For the Japanese strings "日本語" vs "日本語テスト" (3 runes apart), smetrics reports a Levenshtein distance of 9 (counting individual byte differences) instead of the correct 3.
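The failure is easy to reproduce by running the same Levenshtein recurrence once over bytes and once over runes. This is a self-contained sketch, not smetrics' actual code:

```go
package main

import "fmt"

// levenshtein runs the classic two-row recurrence over any comparable
// slice, so the same code can index either bytes or runes.
func levenshtein[T comparable](a, b []T) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	a, b := "日本語", "日本語テスト"
	fmt.Println(levenshtein([]byte(a), []byte(b))) // 9 — byte indexing counts UTF-8 bytes
	fmt.Println(levenshtein([]rune(a), []rune(b))) // 3 — rune indexing counts characters
}
```

The byte version reports 9 because テスト is three runes but nine UTF-8 bytes.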

It also has a panic bug: Soundex("") crashes with an index-out-of-range error because it accesses s[0] without checking for empty input.

And Soundex("Ashcraft") returns A226 instead of the correct A261 — it doesn’t implement H/W transparency per the American Soundex specification.
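Both bugs are avoidable in a few lines. Here is a compact American Soundex with an empty-input guard and H/W transparency — an illustrative sketch, not strsim's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// soundex returns the four-character American Soundex code, treating
// H and W as transparent and collapsing adjacent duplicate codes.
func soundex(s string) string {
	if s == "" {
		return "" // guard: never touch s[0] on empty input
	}
	code := func(r rune) byte {
		switch r {
		case 'B', 'F', 'P', 'V':
			return '1'
		case 'C', 'G', 'J', 'K', 'S', 'X', 'Z':
			return '2'
		case 'D', 'T':
			return '3'
		case 'L':
			return '4'
		case 'M', 'N':
			return '5'
		case 'R':
			return '6'
		}
		return 0
	}
	rs := []rune(strings.ToUpper(s))
	out := []byte{byte(rs[0])}
	prev := code(rs[0])
	for _, r := range rs[1:] {
		if len(out) == 4 {
			break
		}
		switch c := code(r); {
		case c != 0:
			if c != prev {
				out = append(out, c)
			}
			prev = c
		case r == 'H' || r == 'W':
			// H/W transparency: consonants with the same code on
			// either side of H or W collapse, so prev stays as-is
		default:
			prev = 0 // vowels separate duplicate codes
		}
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Ashcraft")) // A261, not A226
	fmt.Println(soundex("Robert"))   // R163
	fmt.Println(soundex(""))         // empty, no panic
}
```

The H/W case is exactly what smetrics skips: "Ashcraft" has S and C separated by a transparent H, so both map to code 2 and collapse into one digit.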

go-edlib and strutil: NaN on Empty Strings

Both libraries return NaN for Similarity("", ""). The correct answer is 1.0 — two identical strings (both empty) have perfect similarity. This happens because they divide by max(len(a), len(b)) without checking for zero.

In production, a single NaN can propagate through your entire scoring pipeline. If you’re ranking search results and one comparison hits empty strings, your sort becomes undefined.
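The fix is a zero-length guard before normalizing. A sketch of the general pattern — distFn here is a stand-in for any edit-distance function:

```go
package main

import "fmt"

// similarity normalizes an edit distance into [0, 1], guarding the
// case where max(len, len) == 0, which yields NaN if divided through.
func similarity(a, b string, distFn func(a, b string) int) float64 {
	n := max(len([]rune(a)), len([]rune(b)))
	if n == 0 {
		return 1.0 // two empty strings are identical, not NaN
	}
	return 1.0 - float64(distFn(a, b))/float64(n)
}

func main() {
	zeroDist := func(a, b string) int { return 0 } // stand-in distance
	fmt.Println(similarity("", "", zeroDist)) // 1, not NaN

	// Why NaN is dangerous: every comparison against it is false,
	// so any sort keyed on a NaN score has undefined order.
	zero := 0.0
	nan := zero / zero
	fmt.Println(nan > 0.5, nan < 0.5, nan == nan) // false false false
}
```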

matchr: GPLv3

matchr actually has the best phonetic coverage (Double Metaphone, NYSIIS, Phonex). But it’s licensed under GPLv3, which means any binary using it must also be GPL-licensed. For a utility library consumed by other projects, this is a deal-breaker for most companies.

Benchmark Methodology

I compared strsim against go-edlib, strutil, and smetrics across 7 scenarios using Go’s testing.Benchmark for accurate timing:

  1. Short strings (7 chars) — baseline
  2. Medium strings (25 chars) — typical use
  3. Long strings (117 chars) — scale
  4. Unicode strings (Japanese) — correctness
  5. Empty vs non-empty — edge case
  6. Identical strings — fast path
  7. Batch matching (1000 candidates) — real-world search

I measured correctness, performance (ns/op), and memory (B/op, allocs/op). The benchmark code is available on GitHub for verification.
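For reference, the harness shape: testing.Benchmark can drive the same measurement from a plain program. The metric below is a stand-in so the sketch is self-contained; the real suite calls each library's API:

```go
package main

import (
	"fmt"
	"testing"
)

var sink int // keeps the compiler from eliding the benchmarked call

// standIn is a placeholder metric, not a real similarity function.
func standIn(a, b string) int {
	return len([]rune(a)) + len([]rune(b))
}

func main() {
	query, candidate := "kitten", "sitting"
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink += standIn(query, candidate)
		}
	})
	// The same three figures reported in the results tables.
	fmt.Printf("%d ns/op  %d B/op  %d allocs/op\n",
		res.NsPerOp(), res.AllocedBytesPerOp(), res.AllocsPerOp())
}
```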

Results

Correctness: Edge Cases

Test | Expected | strsim | go-edlib | strutil | smetrics
--- | --- | --- | --- | --- | ---
Levenshtein("", "") similarity | 1.0 | 1.0 | NaN | NaN | N/A
JaroWinkler("", "") similarity | 1.0 | 1.0 | 0.0 | 1.0 | 1.0
Soundex("") | (empty) | (empty) | N/A | N/A | PANIC
Soundex("Ashcraft") | A261 | A261 | N/A | N/A | A226
Levenshtein("日本語", "日本語テスト") | 3 | 3 | 3 | 3 | 9

strsim is the only library that passes every correctness check.

Performance: Jaro-Winkler Similarity

Jaro-Winkler is the most popular metric for name matching, deduplication, and fuzzy search. This is where strsim shines.

Scenario | strsim | go-edlib | strutil | smetrics
--- | --- | --- | --- | ---
Short (7 chars) | 52 ns | 69 ns | 133 ns | 48 ns
Medium (25 chars) | 182 ns | 268 ns | 619 ns | 194 ns
Long (117 chars) | 1,952 ns | 3,197 ns | 6,463 ns | 1,991 ns
Unicode | 130 ns | 150 ns | 297 ns | 135 ns

strsim is the fastest for medium and long strings — the typical case for real-world text. For short strings, smetrics is close (48 vs 52 ns) by operating on bytes, but that comes with broken Unicode.
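For readers new to the metric: Jaro counts matching runes inside a sliding window and penalizes transpositions, then Winkler adds a common-prefix boost. A condensed rune-based sketch (strsim's tuned implementation will differ):

```go
package main

import "fmt"

// jaroWinkler computes Jaro similarity plus the Winkler prefix boost,
// using the conventional defaults: boost threshold 0.7, max prefix 4.
func jaroWinkler(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	la, lb := len(ra), len(rb)
	if la == 0 && lb == 0 {
		return 1.0
	}
	window := max(la, lb)/2 - 1
	if window < 0 {
		window = 0
	}
	matchA, matchB := make([]bool, la), make([]bool, lb)
	matches := 0
	for i := 0; i < la; i++ {
		lo, hi := max(0, i-window), min(lb-1, i+window)
		for j := lo; j <= hi; j++ {
			if !matchB[j] && ra[i] == rb[j] {
				matchA[i], matchB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Transpositions: matched runes that line up out of order.
	t, j := 0, 0
	for i := 0; i < la; i++ {
		if !matchA[i] {
			continue
		}
		for !matchB[j] {
			j++
		}
		if ra[i] != rb[j] {
			t++
		}
		j++
	}
	m := float64(matches)
	jaro := (m/float64(la) + m/float64(lb) + (m-float64(t)/2)/m) / 3
	prefix := 0
	for prefix < 4 && prefix < la && prefix < lb && ra[prefix] == rb[prefix] {
		prefix++
	}
	if jaro > 0.7 {
		jaro += float64(prefix) * 0.1 * (1 - jaro)
	}
	return jaro
}

func main() {
	fmt.Printf("%.4f\n", jaroWinkler("martha", "marhta")) // 0.9611
}
```

The "martha"/"marhta" pair is the textbook example: two transposed runes plus a three-rune common prefix.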

Performance: Levenshtein Distance

Scenario | strsim | go-edlib | strutil | smetrics
--- | --- | --- | --- | ---
Short (7 chars) | 90 ns | 70 ns | 116 ns | 67 ns
Medium (25 chars) | 1,042 ns | 1,111 ns | 1,788 ns | 573 ns
Long (117 chars) | 21,158 ns | 19,753 ns | 38,585 ns | 14,003 ns

smetrics wins on Levenshtein because it has a tighter inner loop operating on bytes. But this speed comes at the cost of incorrect results for Unicode input. Among Unicode-correct libraries, strsim beats go-edlib on medium strings and is close on long strings.

Performance: Batch Matching (1000 Candidates)

In production, you often compare a query against hundreds or thousands of candidates. strsim includes built-in FindBestMatch, FindTopN, and FindAboveThreshold functions.

Library | ns/op | B/op | allocs/op
--- | --- | --- | ---
strsim (FindBestMatch) | 161,478 | 43,520 | 2,000
go-edlib (manual loop) | 212,586 | 280,322 | 2,000
strutil (manual loop) | 395,804 | 168,322 | 6,880
smetrics (manual loop) | 159,813 | 43,520 | 2,000
strsim ASCII (FindBestMatch) | 93,146 | 43,520 | 2,000

With ASCIIOnly mode enabled, strsim is the fastest — 1.7x faster than smetrics on batch matching with ASCII input.
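For comparison, the "manual loop" rows above mean writing the scan yourself — the pattern FindBestMatch packages up. A sketch; the similarity function here is an illustrative stand-in, not a strsim metric:

```go
package main

import "fmt"

// bestMatch scans candidates and keeps the highest-scoring one.
func bestMatch(query string, candidates []string, sim func(a, b string) float64) (string, float64) {
	best, bestScore := "", -1.0
	for _, c := range candidates {
		if s := sim(query, c); s > bestScore {
			best, bestScore = c, s
		}
	}
	return best, bestScore
}

func main() {
	// Stand-in metric: fraction of the longer string covered by the
	// common prefix (illustrative only).
	prefixSim := func(a, b string) float64 {
		ra, rb := []rune(a), []rune(b)
		n := 0
		for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
			n++
		}
		if longer := max(len(ra), len(rb)); longer > 0 {
			return float64(n) / float64(longer)
		}
		return 1.0
	}
	v, s := bestMatch("golang", []string{"golng", "java", "gopher"}, prefixSim)
	fmt.Println(v, s) // golng 0.5
}
```

Besides saving the boilerplate, a built-in batch function can reuse buffers across comparisons, which is where the allocation difference in the table comes from.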

The ASCIIOnly Fast Path

Many workloads are pure ASCII: English names, product codes, URLs, programming identifiers. For these, converting to []rune is wasted work. strsim’s ASCIIOnly mode skips the conversion entirely.

// Unicode-safe (default)
m := strsim.NewJaroWinkler()

// ASCII fast path — 2x faster for ASCII input
fast := &strsim.JaroWinklerMetric{
    BoostThreshold: 0.7,
    PrefixSize:     4,
    ASCIIOnly:      true,
}

The trade-off is explicit: you opt in, the Godoc warns you, and you get measurably faster results.

Algorithm | Unicode (ns/op) | ASCII (ns/op) | Speedup
--- | --- | --- | ---
Hamming | 117 | 111 | 1x
Damerau-Levenshtein | 17,477 | 6,804 | 2.6x
Jaro-Winkler | 493 | 234 | 2.1x
Levenshtein | 3,144 | 2,787 | 1.1x

This is the key difference from smetrics: smetrics always operates on bytes, with no opt-in and no warning. strsim makes the trade-off explicit.
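The guard behind such a fast path is a single byte scan: if every byte is below 0x80, the string's bytes and runes coincide and the []rune conversion can be skipped. A sketch of the pattern, not strsim's internals:

```go
package main

import "fmt"

// isASCII reports whether every byte of s is below 0x80, meaning
// byte indexing and rune indexing give identical results.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII("golang")) // true
	fmt.Println(isASCII("日本語"))  // false
}
```

A library could use this check to auto-select the fast path per call; making it an explicit flag instead keeps the cost model predictable.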

Feature Comparison

Feature | strsim | go-edlib | strutil | smetrics
--- | --- | --- | --- | ---
Levenshtein | Yes | Yes | Yes | Yes
Damerau-Levenshtein | Yes | Yes | No | No
OSA | Yes | Yes | No | No
Hamming | Yes | Yes | Yes | Yes
LCS | Yes | Yes | No | No
Jaro / Jaro-Winkler | Yes | Yes | Yes | Yes
Cosine (n-gram) | Yes | Yes | No | No
Jaccard / Dice | Yes | Yes | Yes | No
Overlap Coefficient | Yes | No | Yes | No
Soundex | Yes | No | No | Yes
Metaphone | Yes | No | No | No
Double Metaphone | Yes | No | No | No
NYSIIS | Yes | No | No | No
Unified Metric interface | Yes | No | Yes | No
Batch operations | Yes | No | No | No
ASCII fast path | Yes | No | No | No
Total algorithms | 15 | 10 | 8 | 5

Architecture

strsim is a flat package — no sub-packages, no internal directories. Every algorithm is in its own file with its own tests and benchmarks.

Interfaces

Every metric implements at least one interface:

// Similarity metric — returns [0, 1] where 1.0 = identical.
type Metric interface {
    Similarity(a, b string) float64
}

// Distance metric — also returns raw edit distance.
type DistanceMetric interface {
    Metric
    Distance(a, b string) int
}

// Phonetic encoder.
type Encoder interface {
    Encode(s string) string
}

This means you can swap metrics without changing your code:

func findSimilar(query string, items []string, m strsim.Metric) {
    matches := strsim.FindAboveThreshold(query, items, 0.8, m)
    for _, match := range matches {
        fmt.Printf("%s (%.2f)\n", match.Value, match.Similarity)
    }
}

// Same function, different algorithms
findSimilar("golang", items, strsim.NewJaroWinkler())
findSimilar("golang", items, strsim.NewLevenshtein())
findSimilar("golang", items, &strsim.NgramMetric{Size: 3})

Configurable Structs

Top-level functions use sensible defaults. When you need control, use the struct directly:

// Custom costs for Levenshtein
m := &strsim.LevenshteinMetric{
    InsertCost:  1,
    DeleteCost:  1,
    ReplaceCost: 2,  // Penalize substitutions more
    ASCIIOnly:   true,
}

// Custom Jaro-Winkler parameters
jw := &strsim.JaroWinklerMetric{
    BoostThreshold: 0.7,
    PrefixSize:     4,
}

Phonetic Algorithms

strsim includes four phonetic encoders — the most of any MIT-licensed Go library:

  • Soundex — American Soundex, the classic
  • Metaphone — Lawrence Philips’ original algorithm
  • Double Metaphone — handles Germanic, Slavic, Celtic, Greek, Italian, Spanish, and Chinese name origins
  • NYSIIS — New York State Identification and Intelligence System, particularly effective for American names

Each has an Encode() and a Match() function:

strsim.SoundexMatch("Robert", "Rupert")     // true  (both R163)
strsim.MetaphoneMatch("Smith", "Smyth")     // true  (both SM0)
strsim.DoubleMetaphoneMatch("Smith", "Schmidt") // true (shared XMT)

Combining edit distance with phonetic matching is powerful for name deduplication:

func matchName(query string, candidates []string) (strsim.Match, float64) {
    jw := strsim.NewJaroWinkler()
    best := strsim.FindBestMatch(query, candidates, jw)

    bonus := 0.0
    if strsim.SoundexMatch(query, best.Value) { bonus += 0.05 }
    if strsim.DoubleMetaphoneMatch(query, best.Value) { bonus += 0.05 }

    return best, min(best.Similarity + bonus, 1.0)
}

Key Findings

1. Unicode Bugs Are Rampant

smetrics operates entirely on bytes. For any non-ASCII input — accented characters, CJK text, emoji — it produces wrong results. This isn’t documented. If you’re processing international names (García, Müller, 田中), smetrics will silently give you garbage.

2. Empty String Handling Is Broken

go-edlib and strutil both return NaN for Similarity("", ""). In a scoring pipeline, one NaN can corrupt your entire ranking. strsim returns 1.0 for identical strings (including both empty) and 0.0 for completely different strings, always.

3. OSA and Damerau-Levenshtein Catch Transpositions

Standard Levenshtein treats “recieve” → “receive” as 2 edits (delete i, insert i). OSA and Damerau-Levenshtein recognize it as 1 transposition. For spell checking and typo detection, this matters. Of the other libraries, only go-edlib offers this.
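The difference is one extra case in the Levenshtein recurrence: allow a swap of the two preceding runes at cost 1. A sketch with the transposition case behind a flag so both behaviors are visible:

```go
package main

import "fmt"

// editDistance is plain Levenshtein; with allowSwap it becomes OSA
// (optimal string alignment), counting an adjacent transposition as 1.
func editDistance(a, b string, allowSwap bool) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
			// OSA's extra case: "ie" vs "ei" is one swap, not two edits
			if allowSwap && i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				d[i][j] = min(d[i][j], d[i-2][j-2]+1)
			}
		}
	}
	return d[len(ra)][len(rb)]
}

func main() {
	fmt.Println(editDistance("recieve", "receive", false)) // 2 — plain Levenshtein
	fmt.Println(editDistance("recieve", "receive", true))  // 1 — one transposition
}
```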

4. Phonetic Matching Handles What Edit Distance Can’t

“Smith” and “Schmidt” have a Jaro-Winkler similarity of roughly 0.70 — below a typical matching threshold. But they share a Double Metaphone code (XMT). Combining both approaches catches matches that neither alone would find.

5. The Performance Gap Is in Medium/Long Strings

For short strings (< 10 chars), all libraries are within 20-30 ns of each other. The real differences emerge at 25+ characters, where strsim’s Jaro-Winkler is 30-70% faster than go-edlib and 60-200% faster than strutil.

When to Use What

Algorithm | Best For
--- | ---
Jaro-Winkler | Name matching, short strings, fuzzy search
Levenshtein | Spell checking, typo detection
OSA / Damerau-Levenshtein | Spell checking with transposition awareness
Cosine / Jaccard / Dice | Document similarity, longer texts
Soundex | Fast phonetic grouping, English names
Double Metaphone | Multi-origin name matching (European, Asian)
NYSIIS | American name matching

For a complete scoring pipeline, combine Jaro-Winkler similarity with phonetic matching:

jw := strsim.NewJaroWinkler()
matches := strsim.FindAboveThreshold(query, candidates, 0.7, jw)

// Boost phonetically similar results
for i, m := range matches {
    if strsim.DoubleMetaphoneMatch(query, m.Value) {
        matches[i].Similarity = min(m.Similarity + 0.1, 1.0)
    }
}

Try It

go get github.com/jcoruiz/strsim

// Edit distance
d := strsim.Levenshtein("kitten", "sitting")  // 3

// Normalized similarity [0, 1]
s := strsim.JaroWinklerSimilarity("martha", "marhta")  // ~0.961

// Phonetic matching
match := strsim.SoundexMatch("Robert", "Rupert")  // true

// Find best match from a list
best := strsim.FindBestMatch("golang", candidates, strsim.NewJaroWinkler())

Zero dependencies. Go 1.22+. MIT licensed. Full documentation on pkg.go.dev. Source on GitHub. Benchmark results. Examples.