Levenshtein Distance between each word in a given string

Question

From "Calculate Levenshtein distance between two strings in Python", it is possible to calculate the distance and similarity between two given strings (sentences).

And from "Levenshtein Distance and Text Similarity in Python", it is possible to return the character-level matrix and the distance for two strings.

Is there a way to calculate the distance and similarity between each word in a string, and to print the word-level matrix for two strings (sentences)?

a = "This is a dog."
b = "This is a cat."

import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros((size_x, size_y))
    # First column and row: cost of building each prefix from the empty sequence.
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # Substitution is free when the two elements match.
            cost = 0 if seq1[x-1] == seq2[y-1] else 1
            matrix[x, y] = min(
                matrix[x-1, y] + 1,         # deletion
                matrix[x-1, y-1] + cost,    # match or substitution
                matrix[x, y-1] + 1          # insertion
            )
    print(matrix)
    return matrix[size_x - 1, size_y - 1]

levenshtein(a, b)

Outputs

>> 3

Matrix

[[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
 [ 1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13.]
 [ 2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12.]
 [ 3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11.]
 [ 4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
 [ 6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.]
 [ 7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.]
 [ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.]
 [10.  9.  8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.]
 [11. 10.  9.  8.  7.  6.  5.  4.  3.  2.  1.  1.  2.  3.  4.]
 [12. 11. 10.  9.  8.  7.  6.  5.  4.  3.  2.  2.  2.  3.  4.]
 [13. 12. 11. 10.  9.  8.  7.  6.  5.  4.  3.  3.  3.  3.  4.]
 [14. 13. 12. 11. 10.  9.  8.  7.  6.  5.  4.  4.  4.  4.  3.]]

The matrix above is the general character-level Levenshtein distance.

Is it possible to calculate the Levenshtein distance at the word level?

Required Matrix

          This is a cat

This
is
a
dog

Do you mean other distance metrics, or other ways to code the Levenshtein distance computation? If the former, there are methods such as word embeddings (e.g. word2vec) that work like a charm. If the latter, why not use an existing library? The `NLTK` library has a Levenshtein function; take a look at http://www.nltk.org/howto/metrics.html. Please elaborate if you are looking for a more precise answer — Alireza, Apr 24 '20 at 10:59

score 0 · Answer 1 · answered May 11 '20 at 15:20

Simply add `.split()` at the end of your first two lines:

a = "This is a dog.".split()
b = "This is a cat.".split()

Your algorithm works on any indexable sequence. Given a string, it compares individual characters. After the split, a and b are lists of words, so the same algorithm compares whole words and works at the word level.

Output on your example:

[[0. 1. 2. 3. 4.]
 [1. 0. 1. 2. 3.]
 [2. 1. 0. 1. 2.]
 [3. 2. 1. 0. 1.]
 [4. 3. 2. 1. 1.]]

1.0
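To get closer to the labelled "Required Matrix" in the question, the same dynamic-programming table can be printed with the words as row and column headers. The following is only a sketch in plain Python: the helper names `tokenize`, `word_levenshtein` and `format_matrix` are made up for illustration, and stripping punctuation from each token is an assumption about what the asker wants (without it, "dog." and "cat." are the tokens being compared).

```python
import string


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation (an assumption)."""
    words = (w.strip(string.punctuation) for w in sentence.split())
    return [w for w in words if w]


def word_levenshtein(words1, words2):
    """Return the full edit-distance table for two lists of words."""
    rows, cols = len(words1) + 1, len(words2) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        matrix[x][0] = x
    for y in range(cols):
        matrix[0][y] = y
    for x in range(1, rows):
        for y in range(1, cols):
            # Whole words are compared, so a differing word costs one edit.
            cost = 0 if words1[x - 1] == words2[y - 1] else 1
            matrix[x][y] = min(
                matrix[x - 1][y] + 1,         # deletion
                matrix[x][y - 1] + 1,         # insertion
                matrix[x - 1][y - 1] + cost,  # match or substitution
            )
    return matrix


def format_matrix(words1, words2, matrix):
    """Render the table with words as labels; '#' marks the empty prefix."""
    row_labels = ["#"] + words1
    col_labels = ["#"] + words2
    width = max(len(w) for w in row_labels + col_labels) + 1
    lines = ["".join(label.rjust(width) for label in [""] + col_labels)]
    for label, row in zip(row_labels, matrix):
        lines.append(label.rjust(width) + "".join(str(v).rjust(width) for v in row))
    return "\n".join(lines)


a = tokenize("This is a dog.")
b = tokenize("This is a cat.")
matrix = word_levenshtein(a, b)
print(format_matrix(a, b, matrix))
print("distance:", matrix[-1][-1])
```

The bottom-right cell is the word-level distance between the two sentences; every other cell is the distance between the corresponding prefixes.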

alvas · Accepted Answer · 2020-05-11T15:37:10.960

Maybe try this:

from functools import lru_cache
from itertools import product

@lru_cache(maxsize=4095)
def ld(s, t):
    """
    Levenshtein distance memoized implementation from Rosetta code:
    https://rosettacode.org/wiki/Levenshtein_distance#Python
    """
    if not s: return len(t)
    if not t: return len(s)
    if s[0] == t[0]: return ld(s[1:], t[1:])
    l1 = ld(s, t[1:])      # Deletion.
    l2 = ld(s[1:], t)      # Insertion.
    l3 = ld(s[1:], t[1:])  # Substitution.
    return 1 + min(l1, l2, l3)


a = "this is a sentence".split()
b = "yet another cat thing".split()

# To get the triplets.
for i, j in product(a, b):
    print((i, j, ld(i, j)))
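Note that the loop above gives the character-level distance for every pair of words. If a single word-level distance between the two sentences is wanted instead, the same memoized function can be reused, with one caveat: `lru_cache` requires hashable arguments, so the word lists have to be passed as tuples rather than lists. A minimal sketch, repeating `ld` so that it runs on its own:

```python
from functools import lru_cache


@lru_cache(maxsize=4095)
def ld(s, t):
    """Memoized Levenshtein distance; s and t must be hashable sequences."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    if s[0] == t[0]:
        return ld(s[1:], t[1:])
    # One edit plus the cheapest of the three ways to shorten the problem.
    return 1 + min(ld(s, t[1:]), ld(s[1:], t), ld(s[1:], t[1:]))


# Tuples are hashable (lists are not), and slicing a tuple yields a tuple,
# so the recursion keeps working with the cache.
a = tuple("This is a dog.".split())
b = tuple("This is a cat.".split())

word_distance = ld(a, b)
print(word_distance)
```

Here each word is treated as one symbol, so only "dog." versus "cat." counts as an edit.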

To get a matrix:

from scipy.sparse import coo_matrix
import numpy as np

a = "this is a sentence".split()
b = "yet another cat thing , yes".split()

# Build (row index, column index, distance) triplets for every word pair.
triplets = np.array([(i, j, ld(w1, w2)) for (i, w1), (j, w2) in product(enumerate(a), enumerate(b))])
row, col, data = [np.squeeze(splt) for splt in np.hsplit(triplets, triplets.shape[-1])]
coo_matrix((data, (row, col))).toarray()

[out]:

array([[4, 5, 4, 2, 4, 3],
       [3, 7, 3, 4, 2, 2],
       [3, 6, 2, 5, 1, 3],
       [6, 7, 7, 7, 8, 7]])
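The question also asks about similarity. One common convention, assumed here and not part of the answer above, is to normalize the distance by the length of the longer word: similarity = 1 - distance / max(len(w1), len(w2)). The sketch below uses a small iterative `levenshtein` helper (a hypothetical stand-in for `ld`) so that it is self-contained:

```python
def levenshtein(s, t):
    """Edit distance between two sequences, keeping only two table rows."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def similarity(s, t):
    """Normalized similarity in [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - levenshtein(s, t) / longest


a = "this is a sentence".split()
b = "yet another cat thing".split()

# One row per word in a, one column per word in b.
sim_matrix = [[round(similarity(w1, w2), 2) for w2 in b] for w1 in a]
for word, row in zip(a, sim_matrix):
    print(word, row)
```

Other normalizations exist (for example dividing by the sum of the lengths), so the choice should match whatever the downstream comparison expects.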
