I Benchmarked Every Go String Similarity Library: Building strsim
Go's string similarity ecosystem is fragmented across 8+ libraries with Unicode bugs, NaN edge cases, and missing phonetic support. Here's what I found and why I built strsim.
The Problem
String similarity is everywhere: search suggestions, spell checking, duplicate detection, record matching, autocomplete. You’d expect a mature language like Go to have a solid, comprehensive library for this. It doesn’t.
Instead, the ecosystem is fragmented across 8+ small libraries, each implementing a subset of algorithms with inconsistent APIs. Worse, several have bugs that produce incorrect results — silently. I audited them all, found real problems, and built strsim to consolidate everything into one correct, fast, zero-dependency package.
The Go Ecosystem
Here are the Go libraries with meaningful adoption for string similarity:
| Library | Stars | Algorithms | Phonetic | Unicode-safe | Empty string safe |
|---|---|---|---|---|---|
| go-edlib | 595 | 10 (edit + token) | No | Yes | NaN |
| strutil | 415 | 8 (edit + token) | No | Yes | NaN |
| smetrics | 236 | 5 (edit) + Soundex | Soundex only | No | Panic |
| matchr | ~100 | 12 (edit + phonetic) | Yes | Yes | OK |
| strsim | — | 15 (edit + token + phonetic) | Yes (4) | Yes | Yes |
None of the existing libraries combine edit distance, token-based similarity, and phonetic encoding in a single package. And the bugs I found during auditing were more concerning than I expected.
What I Found by Actually Reading the Code
I didn’t just read READMEs — I cloned every repo, ran their tests, and systematically probed edge cases. Here’s what I found.
smetrics: Broken Unicode
smetrics operates on bytes, not runes. Every function compares a[i] == b[j] using byte indexing, which silently produces incorrect results for any non-ASCII text.
For the Japanese strings "日本語" vs "日本語テスト" (3 runes apart), smetrics reports a Levenshtein distance of 9 (the edit distance measured in bytes; テスト is 9 bytes in UTF-8) instead of the correct 3.
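The bug is easy to reproduce with a generic textbook Levenshtein (two-row DP; my own sketch, not smetrics' or strsim's actual code) run once over bytes and once over runes:

```go
package main

import "fmt"

// levenshtein is a textbook two-row DP over any comparable slice.
// Instantiating it with []byte reproduces the smetrics-style bug;
// with []rune it counts characters, as Unicode text requires.
func levenshtein[T comparable](a, b []T) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	a, b := "日本語", "日本語テスト"
	fmt.Println(levenshtein([]byte(a), []byte(b))) // 9: byte-wise, wrong
	fmt.Println(levenshtein([]rune(a), []rune(b))) // 3: rune-wise, correct
}
```

Same algorithm, same inputs; the only difference is the element type it compares.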
It also has a panic bug: Soundex("") crashes with an index-out-of-range error because it accesses s[0] without checking for empty input.
And Soundex("Ashcraft") returns A226 instead of the correct A261: it doesn't implement H/W transparency per the American Soundex specification.
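The H/W rule is worth spelling out, since it's exactly what smetrics gets wrong. In a minimal American Soundex sketch (my own illustration for ASCII names, not strsim's code), H and W are transparent: they are skipped without breaking a run of same-coded consonants, while vowels do break the run:

```go
package main

import (
	"fmt"
	"strings"
)

// soundex is a minimal American Soundex sketch for ASCII names.
func soundex(s string) string {
	code := func(r rune) byte {
		switch r {
		case 'B', 'F', 'P', 'V':
			return '1'
		case 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z':
			return '2'
		case 'D', 'T':
			return '3'
		case 'L':
			return '4'
		case 'M', 'N':
			return '5'
		case 'R':
			return '6'
		}
		return 0 // vowels, H, W, Y: no code
	}
	s = strings.ToUpper(s)
	if s == "" {
		return "" // guard: avoids the s[0] panic seen in smetrics
	}
	runes := []rune(s)
	out := []byte{byte(runes[0])}
	prev := code(runes[0])
	for _, r := range runes[1:] {
		c := code(r)
		switch {
		case r == 'H' || r == 'W':
			// transparent: no code emitted, and prev is left alone,
			// so "sh-c" in Ashcraft still collapses to one '2'
		case c == 0:
			prev = 0 // vowels separate: a repeated code may re-emit
		case c != prev:
			out = append(out, c)
			prev = c
		default:
			prev = c // same code as previous consonant: collapse
		}
		if len(out) == 4 {
			break
		}
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Ashcraft")) // A261: H is transparent, s/c collapse
	fmt.Println(soundex("Robert"))   // R163
	fmt.Println(soundex(""))         // empty, no panic
}
```

Dropping the H/W case (treating H like a vowel) is what yields A226 for Ashcraft.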
go-edlib and strutil: NaN on Empty Strings
Both libraries return NaN for Similarity("", ""). The correct answer is 1.0 — two identical strings (both empty) have perfect similarity. This happens because they divide by max(len(a), len(b)) without checking for zero.
In production, a single NaN can propagate through your entire scoring pipeline. If you’re ranking search results and one comparison hits empty strings, your sort becomes undefined.
matchr: GPLv3
matchr actually has the best phonetic coverage (Double Metaphone, NYSIIS, Phonex). But it's licensed under GPLv3, so any program that links it and is distributed must be released under the GPL as well. For a utility library consumed by other projects, this is a deal-breaker for most companies.
Benchmark Methodology
I compared strsim against go-edlib, strutil, and smetrics across 7 scenarios using Go’s testing.Benchmark for accurate timing:
- Short strings (7 chars) — baseline
- Medium strings (25 chars) — typical use
- Long strings (117 chars) — scale
- Unicode strings (Japanese) — correctness
- Empty vs non-empty — edge case
- Identical strings — fast path
- Batch matching (1000 candidates) — real-world search
I measured correctness, performance (ns/op), and memory (B/op, allocs/op). The benchmark code is available on GitHub for verification.
Results
Correctness: Edge Cases
| Test | Expected | strsim | go-edlib | strutil | smetrics |
|---|---|---|---|---|---|
| Levenshtein("", "") similarity | 1.0 | 1.0 | NaN | NaN | N/A |
| JaroWinkler("", "") similarity | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| Soundex("") | (empty) | (empty) | N/A | N/A | PANIC |
| Soundex("Ashcraft") | A261 | A261 | N/A | N/A | A226 |
| Levenshtein("日本語", "日本語テスト") | 3 | 3 | 3 | 3 | 9 |
strsim is the only library that passes every correctness check.
Performance: Jaro-Winkler Similarity
Jaro-Winkler is the most popular metric for name matching, deduplication, and fuzzy search. This is where strsim shines.
| Scenario | strsim | go-edlib | strutil | smetrics |
|---|---|---|---|---|
| Short (7 chars) | 52 ns | 69 ns | 133 ns | 48 ns |
| Medium (25 chars) | 182 ns | 268 ns | 619 ns | 194 ns |
| Long (117 chars) | 1,952 ns | 3,197 ns | 6,463 ns | 1,991 ns |
| Unicode | 130 ns | 150 ns | 297 ns | 135 ns |
strsim is the fastest for medium and long strings — the typical case for real-world text. For short strings, smetrics is close (48 vs 52 ns) by operating on bytes, but that comes with broken Unicode.
Performance: Levenshtein Distance
| Scenario | strsim | go-edlib | strutil | smetrics |
|---|---|---|---|---|
| Short (7 chars) | 90 ns | 70 ns | 116 ns | 67 ns |
| Medium (25 chars) | 1,042 ns | 1,111 ns | 1,788 ns | 573 ns |
| Long (117 chars) | 21,158 ns | 19,753 ns | 38,585 ns | 14,003 ns |
smetrics wins on Levenshtein because it has a tighter inner loop operating on bytes. But this speed comes at the cost of incorrect results for Unicode input. Among Unicode-correct libraries, strsim beats go-edlib on medium strings and is close on long strings.
Performance: Batch Matching (1000 Candidates)
In production, you often compare a query against hundreds or thousands of candidates. strsim includes built-in FindBestMatch, FindTopN, and FindAboveThreshold functions.
| Library | ns/op | B/op | allocs/op |
|---|---|---|---|
| strsim (FindBestMatch) | 161,478 | 43,520 | 2,000 |
| go-edlib (manual loop) | 212,586 | 280,322 | 2,000 |
| strutil (manual loop) | 395,804 | 168,322 | 6,880 |
| smetrics (manual loop) | 159,813 | 43,520 | 2,000 |
| strsim ASCII (FindBestMatch) | 93,146 | 43,520 | 2,000 |
With ASCIIOnly mode enabled, strsim is the fastest — 1.7x faster than smetrics on batch matching with ASCII input.
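The "manual loop" rows above are the code each library forces you to write yourself: score every candidate and keep the argmax. A hypothetical helper along those lines (any similarity function slots in):

```go
package main

import "fmt"

// bestMatch scores every candidate against the query and returns
// the highest-scoring one. Hypothetical helper illustrating the
// "manual loop" benchmark rows, not any library's API.
func bestMatch(query string, candidates []string, sim func(a, b string) float64) (string, float64) {
	best, bestScore := "", -1.0
	for _, c := range candidates {
		if s := sim(query, c); s > bestScore {
			best, bestScore = c, s
		}
	}
	return best, bestScore
}

func main() {
	// Toy metric for the demo: exact match only.
	exact := func(a, b string) float64 {
		if a == b {
			return 1
		}
		return 0
	}
	m, s := bestMatch("golang", []string{"gopher", "golang", "java"}, exact)
	fmt.Println(m, s) // golang 1
}
```

A built-in batch function wins mostly on allocations: it can reuse scratch buffers across all 1000 comparisons instead of allocating per call.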
The ASCIIOnly Fast Path
Many workloads are pure ASCII: English names, product codes, URLs, programming identifiers. For these, converting to []rune is wasted work. strsim’s ASCIIOnly mode skips the conversion entirely.
// Unicode-safe (default)
m := strsim.NewJaroWinkler()
// ASCII fast path: about 2x faster for ASCII input
fast := &strsim.JaroWinklerMetric{
    BoostThreshold: 0.7,
    PrefixSize:     4,
    ASCIIOnly:      true,
}
The trade-off is explicit: you opt in, the Godoc warns you, and you get measurably faster results.
| Algorithm | Unicode (ns/op) | ASCII (ns/op) | Speedup |
|---|---|---|---|
| Hamming | 117 | 11 | 11x |
| Damerau-Levenshtein | 17,477 | 6,804 | 2.6x |
| Jaro-Winkler | 493 | 234 | 2.1x |
| Levenshtein | 3,144 | 2,787 | 1.1x |
This is the key difference vs smetrics: smetrics always operates on bytes with no opt-in and no warning. strsim makes the trade-off explicit.
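The detection behind such a fast path is cheap. Here is the kind of check it relies on (my sketch, not strsim's actual code): if every byte of a string is below 0x80, byte indexing and rune indexing denote the same characters, so the []rune conversion is pure overhead:

```go
package main

import "fmt"

// isASCII reports whether s contains only single-byte (ASCII) runes.
// When it returns true, s[i] and []rune(s)[i] refer to the same
// character, so byte-wise algorithms are safe to use.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII("kitten"))  // true
	fmt.Println(isASCII("日本語"))   // false
	fmt.Println(isASCII("Müller")) // false: ü is two bytes in UTF-8
}
```

A library could even run this check automatically per call; an explicit flag instead moves the decision (and the responsibility) to the caller, which is the design strsim documents.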
Feature Comparison
| Feature | strsim | go-edlib | strutil | smetrics |
|---|---|---|---|---|
| Levenshtein | Yes | Yes | Yes | Yes |
| Damerau-Levenshtein | Yes | Yes | No | No |
| OSA | Yes | Yes | No | No |
| Hamming | Yes | Yes | Yes | Yes |
| LCS | Yes | Yes | No | No |
| Jaro / Jaro-Winkler | Yes | Yes | Yes | Yes |
| Cosine (n-gram) | Yes | Yes | No | No |
| Jaccard / Dice | Yes | Yes | Yes | No |
| Overlap Coefficient | Yes | No | Yes | No |
| Soundex | Yes | No | No | Yes |
| Metaphone | Yes | No | No | No |
| Double Metaphone | Yes | No | No | No |
| NYSIIS | Yes | No | No | No |
| Unified Metric interface | Yes | No | Yes | No |
| Batch operations | Yes | No | No | No |
| ASCII fast path | Yes | No | No | No |
| Total algorithms | 15 | 10 | 8 | 5 |
Architecture
strsim is a flat package — no sub-packages, no internal directories. Every algorithm is in its own file with its own tests and benchmarks.
Interfaces
Every metric implements at least one interface:
// Similarity metric — returns [0, 1] where 1.0 = identical.
type Metric interface {
Similarity(a, b string) float64
}
// Distance metric — also returns raw edit distance.
type DistanceMetric interface {
Metric
Distance(a, b string) int
}
// Phonetic encoder.
type Encoder interface {
Encode(s string) string
}
This means you can swap metrics without changing your code:
func findSimilar(query string, items []string, m strsim.Metric) {
matches := strsim.FindAboveThreshold(query, items, 0.8, m)
for _, match := range matches {
fmt.Printf("%s (%.2f)\n", match.Value, match.Similarity)
}
}
// Same function, different algorithms
findSimilar("golang", items, strsim.NewJaroWinkler())
findSimilar("golang", items, strsim.NewLevenshtein())
findSimilar("golang", items, &strsim.NgramMetric{Size: 3})
Configurable Structs
Top-level functions use sensible defaults. When you need control, use the struct directly:
// Custom costs for Levenshtein
m := &strsim.LevenshteinMetric{
InsertCost: 1,
DeleteCost: 1,
ReplaceCost: 2, // Penalize substitutions more
ASCIIOnly: true,
}
// Custom Jaro-Winkler parameters
jw := &strsim.JaroWinklerMetric{
BoostThreshold: 0.7,
PrefixSize: 4,
}
Phonetic Algorithms
strsim includes four phonetic encoders — the most of any MIT-licensed Go library:
- Soundex — American Soundex, the classic
- Metaphone — Lawrence Philips’ original algorithm
- Double Metaphone — handles Germanic, Slavic, Celtic, Greek, Italian, Spanish, and Chinese name origins
- NYSIIS — New York State Identification and Intelligence System, particularly effective for American names
Each has an Encode() and a Match() function:
strsim.SoundexMatch("Robert", "Rupert") // true (both R163)
strsim.MetaphoneMatch("Smith", "Smyth") // true (both SM0)
strsim.DoubleMetaphoneMatch("Smith", "Schmidt") // true (shared XMT)
Combining edit distance with phonetic matching is powerful for name deduplication:
func matchName(query string, candidates []string) (strsim.Match, float64) {
jw := strsim.NewJaroWinkler()
best := strsim.FindBestMatch(query, candidates, jw)
bonus := 0.0
if strsim.SoundexMatch(query, best.Value) { bonus += 0.05 }
if strsim.DoubleMetaphoneMatch(query, best.Value) { bonus += 0.05 }
return best, min(best.Similarity + bonus, 1.0)
}
Key Findings
1. Unicode Bugs Are Rampant
smetrics operates entirely on bytes. For any non-ASCII input — accented characters, CJK text, emoji — it produces wrong results. This isn’t documented. If you’re processing international names (García, Müller, 田中), smetrics will silently give you garbage.
2. Empty String Handling Is Broken
go-edlib and strutil both return NaN for Similarity("", ""). In a scoring pipeline, one NaN can corrupt your entire ranking. strsim returns 1.0 for identical strings (including both empty) and 0.0 for completely different strings, always.
3. OSA and Damerau-Levenshtein Catch Transpositions
Standard Levenshtein treats "recieve" → "receive" as 2 edits (delete i, insert i). OSA and Damerau-Levenshtein recognize it as 1 transposition. For spell checking and typo detection, this matters. Among the other libraries, only go-edlib offers these.
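The transposition case is one extra comparison in the DP recurrence. A rune-based OSA sketch (the textbook algorithm, not strsim's implementation):

```go
package main

import "fmt"

// osa computes Optimal String Alignment distance over runes: plain
// Levenshtein plus a one-edit swap of adjacent characters.
func osa(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
			// The extra case: adjacent characters swapped.
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				d[i][j] = min(d[i][j], d[i-2][j-2]+1)
			}
		}
	}
	return d[len(ra)][len(rb)]
}

func main() {
	fmt.Println(osa("recieve", "receive")) // 1: "ie" -> "ei" is one transposition
	fmt.Println(osa("kitten", "sitting"))  // 3: no transpositions, same as Levenshtein
}
```

Without the extra case, the same code is plain Levenshtein and scores "recieve" → "receive" as 2.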
4. Phonetic Matching Handles What Edit Distance Can’t
"Robert" and "Rupert" score only about 0.8 on Jaro-Winkler, below strict matching thresholds, but they share the same Soundex code (R163). Combining edit distance with phonetic matching catches pairs that either signal alone would miss.
5. The Performance Gap Is in Medium/Long Strings
For short strings (< 10 chars), all libraries are within 20-30 ns of each other. The real differences emerge at 25+ characters, where strsim’s Jaro-Winkler is 30-70% faster than go-edlib and 60-200% faster than strutil.
When to Use What
| Algorithm | Best For |
|---|---|
| Jaro-Winkler | Name matching, short strings, fuzzy search |
| Levenshtein | Spell checking, typo detection |
| OSA / Damerau-Levenshtein | Spell checking with transposition awareness |
| Cosine / Jaccard / Dice | Document similarity, longer texts |
| Soundex | Fast phonetic grouping, English names |
| Double Metaphone | Multi-origin name matching (European, Asian) |
| NYSIIS | American name matching |
For a complete scoring pipeline, combine Jaro-Winkler similarity with phonetic matching:
jw := strsim.NewJaroWinkler()
matches := strsim.FindAboveThreshold(query, candidates, 0.7, jw)
// Boost phonetically similar results
for i, m := range matches {
if strsim.DoubleMetaphoneMatch(query, m.Value) {
matches[i].Similarity = min(m.Similarity + 0.1, 1.0)
}
}
Try It
go get github.com/jcoruiz/strsim
// Edit distance
d := strsim.Levenshtein("kitten", "sitting") // 3
// Normalized similarity [0, 1]
s := strsim.JaroWinklerSimilarity("martha", "marhta") // ~0.961
// Phonetic matching
match := strsim.SoundexMatch("Robert", "Rupert") // true
// Find best match from a list
best := strsim.FindBestMatch("golang", candidates, strsim.NewJaroWinkler())
Zero dependencies. Go 1.22+. MIT licensed. Full documentation on pkg.go.dev. Source on GitHub. Benchmark results. Examples.