Calculate the Dice-Sorensen distance between two string-typed columns. Returns a float: 0 is no match, 1 is a full match.
Algorithm reference: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient Library reference: https://github.com/rockymadden/stringmetric
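The bigram-based Dice coefficient behind this metric can be sketched in plain Scala; this is an illustration of the formula, and the stringmetric implementation may differ in tokenization and edge-case details:

```scala
// Dice coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|)
def diceCoefficient(a: String, b: String): Double = {
  def bigrams(s: String): Seq[String] = s.sliding(2).toSeq
  val (ba, bb) = (bigrams(a), bigrams(b))
  if (ba.isEmpty && bb.isEmpty) 1.0
  else 2.0 * ba.intersect(bb).size / (ba.size + bb.size)
}
```

For example, "night" and "nacht" share one bigram ("ht") out of four each, giving 2 * 1 / 8 = 0.25.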
Calculate the Jaro–Winkler distance between two string-typed columns. Returns a float: 0 is no match, 1 is a full match.
Algorithm reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance Library reference: https://github.com/rockymadden/stringmetric
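A plain-Scala sketch of the Jaro–Winkler computation (match window, transposition count, then a bonus for a shared prefix of up to four characters); the stringmetric library's exact edge-case handling may differ:

```scala
// Jaro similarity: fraction of matching characters, penalized by transpositions.
def jaro(s1: String, s2: String): Double = {
  if (s1.isEmpty && s2.isEmpty) return 1.0
  if (s1.isEmpty || s2.isEmpty) return 0.0
  // Characters match if equal and within half the longer length of each other.
  val window = math.max(0, math.max(s1.length, s2.length) / 2 - 1)
  val m1 = Array.fill(s1.length)(false)
  val m2 = Array.fill(s2.length)(false)
  var matches = 0
  for (i <- s1.indices) {
    val lo = math.max(0, i - window)
    val hi = math.min(s2.length - 1, i + window)
    var j = lo
    while (j <= hi && !m1(i)) {
      if (!m2(j) && s1(i) == s2(j)) { m1(i) = true; m2(j) = true; matches += 1 }
      j += 1
    }
  }
  if (matches == 0) return 0.0
  // Count matched characters that appear in a different order (half-transpositions).
  var k = 0
  var halfTranspositions = 0
  for (i <- s1.indices if m1(i)) {
    while (!m2(k)) k += 1
    if (s1(i) != s2(k)) halfTranspositions += 1
    k += 1
  }
  val m = matches.toDouble
  (m / s1.length + m / s2.length + (m - halfTranspositions / 2.0) / m) / 3.0
}

// Jaro-Winkler: boost the Jaro score for a common prefix (at most 4 chars).
def jaroWinkler(s1: String, s2: String, p: Double = 0.1): Double = {
  val j = jaro(s1, s2)
  val prefix = s1.zip(s2).take(4).takeWhile { case (a, b) => a == b }.size
  j + prefix * p * (1 - j)
}
```

The textbook example "MARTHA" vs "MARHTA" has six matches, one transposition, and a three-character common prefix, scoring about 0.961.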
Calculate the N-gram (N=2) distance between two string-typed columns. Returns a float: 0 is no match, 1 is a full match.
Algorithm reference: https://en.wikipedia.org/wiki/N-gram Library reference: https://github.com/rockymadden/stringmetric
Calculate the N-gram (N=3) distance between two string-typed columns. Returns a float: 0 is no match, 1 is a full match.
Algorithm reference: https://en.wikipedia.org/wiki/N-gram Library reference: https://github.com/rockymadden/stringmetric
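Both the N=2 and N=3 variants compare the strings' n-gram multisets. The exact scoring formula used by stringmetric is an assumption here; this sketch scores shared n-grams over the larger n-gram count:

```scala
// Shared-n-gram ratio: |shared n-grams| / max(|grams(a)|, |grams(b)|).
// The precise normalization in the stringmetric library may differ.
def nGramScore(a: String, b: String, n: Int): Double = {
  val (ga, gb) = (a.sliding(n).toSeq, b.sliding(n).toSeq)
  if (ga.isEmpty || gb.isEmpty) 0.0
  else ga.intersect(gb).size.toDouble / math.max(ga.size, gb.size)
}
```

With n = 2, "night" vs "nacht" shares one bigram of four, scoring 0.25; identical strings score 1.0.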
Calculate the Normalized Levenshtein distance between two string-typed columns. Returns a float: 0 is no match, 1 is a full match.
Algorithm reference: https://en.wikipedia.org/wiki/Levenshtein_distance Library reference: https://github.com/rockymadden/stringmetric
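A plain-Scala sketch of the computation: standard dynamic-programming edit distance, then scaled into [0, 1] so that 1 means a full match. Normalizing by the longer string's length is an assumption about how the library scales the score:

```scala
// Classic DP edit distance: d(i)(j) = edits to turn a.take(i) into b.take(j).
def levenshtein(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (j == 0) i else if (i == 0) j else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
  }
  d(a.length)(b.length)
}

// Scale to [0, 1]: 1 is a full match, 0 is no match.
def normalizedLevenshtein(a: String, b: String): Double = {
  val maxLen = math.max(a.length, b.length)
  if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
}
```

For example, "kitten" vs "sitting" has edit distance 3, giving a normalized score of 1 - 3/7 ≈ 0.571.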
For an Array column, create a String column by concatenating the array values.
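The semantics can be sketched in plain Scala over an in-memory sequence; the separator parameter is an assumption about the function's signature:

```scala
// Join an array's values into one string with the given separator
// (separator parameter assumed; the real column function may differ).
def arrayCat(sep: String, arr: Seq[Any]): String =
  arr.map(_.toString).mkString(sep)
```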
Coalesce boolean columns into a String bitmap.
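A plain-Scala sketch of the bitmap encoding, modeling nullable booleans as `Option[Boolean]`; mapping both null and false to '0' is an assumption about how the function coalesces values:

```scala
// true -> '1'; false and null both -> '0' (null handling is an assumption).
def boolsToBitmap(bools: Option[Boolean]*): String =
  bools.map { case Some(true) => "1"; case _ => "0" }.mkString
```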
Spark 1.6 will have a collect_set aggregation function.
Count the number of distinct values, including null
Count non-null false values
Count the number of null values
Aggregate function that counts the number of rows satisfying a given condition.
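The four counting aggregations above can be illustrated on an in-memory "column" of nullable booleans, with null modeled as `None`:

```scala
// Sample nullable boolean column: true, false, null, true, null
val col: Seq[Option[Boolean]] = Seq(Some(true), Some(false), None, Some(true), None)

val distinctCount = col.distinct.size            // distinct values, null included
val falseCount    = col.count(_ == Some(false))  // non-null false values
val nullCount     = col.count(_.isEmpty)         // null values
val condCount     = col.count(_.contains(true))  // rows satisfying a condition (is true)
```

Here the distinct count is 3 (true, false, and null each count once).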
smvFirst: by default, return null if the first record is null.
Since Spark 1.5, "first" returns the first non-null value, so we created our own version, smvFirst, which returns the true first value even if it is null. The alternative form will try to return the first non-null value.
Spark 2.1 enhanced the first function to take a nonNull parameter. We can simply forward the call and maintain the old interface.
the column
whether to return the first non-null value instead of the true first value
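The two behaviors can be sketched over an in-memory sequence, with null modeled as `None`; the real function is a Spark aggregation, so this only illustrates the semantics:

```scala
// smvFirst semantics: by default return the true first value (even if null);
// with nonNull = true, return the first non-null value instead.
def smvFirst[T](xs: Seq[Option[T]], nonNull: Boolean = false): Option[T] =
  if (nonNull) xs.find(_.isDefined).flatten
  else xs.headOption.flatten
```

Given values (null, 1, 2), the default form returns null while the nonNull form returns 1.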
True if any of the columns is not null
Create a unique id from the primary key list.
Return "prefix" + the MD5 hex string (32 characters) as the unique key.
MD5's collision rate on real data records can be ignored, based on the following discussion:
https://marc-stevens.nl/research/md5-1block-collision/ The shortest known messages with the same MD5 hash are 512-bit (64-byte) messages, such as the pair below:
4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2 and the (different by two bits) 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2 both have MD5 hash 008ee33a9d58b51cfeb425b0959121c9
There are other such pairs, but all were carefully constructed. Theoretically, random collisions only become likely as the data size approaches 2^64 records (since MD5 is 128 bits), which is much larger than the number of records we deal with (a billion is about 2^30). Therefore, using MD5 to hash the primary key columns is good enough for creating a unique key.
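A plain-Scala sketch of the key construction using the JDK's `MessageDigest`; joining the key columns with "," before hashing is an assumption, and the real function's column-concatenation rules may differ:

```scala
import java.security.MessageDigest

// prefix + 32-char MD5 hex of the concatenated key columns.
// The "," separator between key values is an assumption for this sketch.
def hashKey(prefix: String, keys: String*): String = {
  val md = MessageDigest.getInstance("MD5")
  val digest = md.digest(keys.mkString(",").getBytes("UTF-8"))
  prefix + digest.map("%02x".format(_)).mkString
}
```

The result is always the prefix followed by exactly 32 hex characters, regardless of how many key columns are hashed.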
Patch Spark's concat and concat_ws to treat null as empty string in concatenation.
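The null-as-empty-string behavior can be sketched over in-memory values, with null modeled as `None`; the real functions operate on Spark columns, so this only illustrates the intended semantics:

```scala
// concat_ws-style join where null contributes an empty string
// rather than being skipped or nulling out the whole result.
def strCat(sep: String, parts: Option[String]*): String =
  parts.map(_.getOrElse("")).mkString(sep)
```

Note the difference from Spark's built-in concat_ws, which skips null values entirely: here a null still produces an empty slot between separators.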
Commonly used functions
Since: 1.5