Levenshtein Distance and Distance Similarity Functions

Russian mathematician Vladimir Levenshtein introduced the Levenshtein Distance algorithm in 1965. The algorithm produces the minimum number of edits (i.e., insertions, deletions, and substitutions) required to change one string into another. Consider the distance between “Steven” and “Stephen”:

Step-1: Substitute "p" for the "v"
Step-2: Insert "h" after the "p"

It takes two edits to change “Steven” to “Stephen”. The distance is 2.
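For longer strings, the minimum edit count is typically computed with dynamic programming. The following plain-Python sketch of the classic Wagner–Fischer approach illustrates the idea (it is not the implementation Spark uses); each cell of the matrix holds the distance between prefixes of the two strings:

def levenshtein_distance(a, b):
    ## Build a (len(a)+1) x (len(b)+1) matrix of prefix distances
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          ## deleting i characters from a
    for j in range(cols):
        d[0][j] = j          ## inserting j characters into a
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        ## deletion
                          d[i][j - 1] + 1,        ## insertion
                          d[i - 1][j - 1] + cost) ## substitution
    return d[rows - 1][cols - 1]

print(levenshtein_distance('Steven', 'Stephen'))  ## 2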

The distance similarity expands on the distance algorithm by converting the number of edits (the distance algorithm’s output) into a percentage. The percentage indicates how similar the two strings are to one another. We needed two edits to change Steven to Stephen in the example above, which could be expressed by saying Steven is roughly 71% similar to Stephen.

It’s calculated with the following formula:

[distance_similarity] = 100 - ([distance] / [length_of_longest_string] * 100)
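Applied to the example above, the longest string (“Stephen”) is seven characters long, so the similarity works out to 100 - (2 / 7 * 100) ≈ 71.4%. A minimal sketch of that calculation in plain Python, taking the already-computed distance as an input:

def distance_similarity(distance, a, b):
    ## Scale the edit count by the length of the longer string
    return 100 - (distance / max(len(a), len(b)) * 100)

print(round(distance_similarity(2, 'Steven', 'Stephen'), 1))  ## 71.4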

Apache Spark includes an implementation of the Levenshtein Distance function. To derive the distance similarity, your code needs to perform the extra calculation itself:

## Import the `levenshtein` function
from pyspark.sql.functions import levenshtein

## Create a single DataFrame with the values I want to compare
df0 = spark.createDataFrame([('kitten', 'sitting',)], ['l', 'r'])

## Let's compare the two values...
df0.select(levenshtein('l', 'r').alias('d')).collect()
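For “kitten” and “sitting”, the call returns a distance of 3 (substitute "s" for "k", substitute "i" for "e", and insert a "g"), so collect() yields [Row(d=3)].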

The following list of name pairs (saved locally as sample.csv) can be enriched to provide their distance similarity.

foo,bar
Suzie Chappman,Suzy Chappman
Billy Wonka,William Wonka
Kelly Schmitt,Kelly Schmitt
Devon Parker,Devon Jones
Kylie LeBlanc,Stacy LeBow

In the following, we use PySpark to establish the similarity between each pair of names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import levenshtein, length, when, col

def spark_session_get():
    ## getOrCreate() returns the active SparkSession if one already exists,
    ## so this is safe to call from a notebook or a standalone script
    app_name = "Sketch distance similarity between two words"
    return SparkSession.builder.appName(app_name).getOrCreate()

def source_get(source_file_path, format='csv', header=True):
    ## Relies on the module-level `spark` session created below
    return spark.read.load(source_file_path, header=header, format=format)

def source_enrich_distance_similarity(source_raw):
    ## The sample.csv columns are named `foo` and `bar`
    source_with_distance = source_raw.withColumn('customer_foo_size', length(source_raw.foo)) \
        .withColumn('customer_bar_size', length(source_raw.bar)) \
        .withColumn('distance', levenshtein(source_raw.foo, source_raw.bar))

    ## [distance_similarity] = 100 - ([distance] / [length_of_longest_string] * 100)
    length_of_longest = when(col('customer_bar_size') >= col('customer_foo_size'), col('customer_bar_size')) \
        .otherwise(col('customer_foo_size'))

    return source_with_distance \
        .withColumn('similarity', 100 - (col('distance') / length_of_longest * 100))

spark = spark_session_get()
source_file_path = './sample.csv'

source_raw = source_get(source_file_path)
source_enriched = source_enrich_distance_similarity(source_raw)
source_enriched.show()
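As a sanity check on one row: “Suzie Chappman” and “Suzy Chappman” differ by two edits (substitute "y" for "i", delete the "e"), and the longer string is 14 characters, so the similarity should come out to 100 - (2 / 14 * 100) ≈ 85.7%.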

Analysts can build on this example to establish the similarity between various words and phrases.

KeyLookup Exception Thrown When Calling awsglue.utils.getResolvedOptions() Locally

I was locally testing a PySpark script I’d written for an AWS Glue job I was building when I ran across an error relating to the call to getResolvedOptions(). The call was raising a KeyLookup exception. The problem turned out to be the argv parameter I supplied.

When a Glue job executes, parameters are passed to the script through sys.argv. Typically, you pass sys.argv to getResolvedOptions(args, options) along with the option names you want to tease out of the list – see Accessing Parameters Using getResolvedOptions for details.

You can mimic this behavior when running this script locally:

from pprint import pprint as pp
from awsglue.utils import getResolvedOptions

argv = ['whatevs', '--JOB_NAME=ThisIsMySickJobName']
args = getResolvedOptions(argv, ['JOB_NAME'])

pp(args)

The following is me running a script that contains the above code locally:

[Screenshot: calling getResolvedOptions() with fabricated arguments]

The trick is that the values in the list passed as the argv parameter need to follow the pattern:

--KEY=VALUE

For example…

--JOB_NAME=ThisIsMySickJobName
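One way to keep a single script that runs both in Glue and on a workstation is to fall back to a fabricated list only when the expected flag is absent from sys.argv. A minimal sketch, where 'local-test' and 'LocalTestRun' are placeholder values I made up:

import sys
from awsglue.utils import getResolvedOptions

## Use the real sys.argv when --JOB_NAME is supplied (as Glue does); otherwise
## fabricate an argv so the script can be exercised locally.
if any(arg.startswith('--JOB_NAME') for arg in sys.argv):
    argv = sys.argv
else:
    argv = ['local-test', '--JOB_NAME=LocalTestRun']

args = getResolvedOptions(argv, ['JOB_NAME'])
print(args['JOB_NAME'])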