Object

org.tresamigos.smv

smvfuncs

Related Doc: package smv

object smvfuncs

Commonly used functions

Since

1.5

Linear Supertypes
AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. val boolsToBitmap: (Row) ⇒ String

  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. def diceSorensen(c1: Column, c2: Column): Column

    Calculates the Dice-Sorensen similarity between two string-typed columns. Returns a float between 0 (no match) and 1 (full match).

    Algorithm reference: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
    Library reference: https://github.com/rockymadden/stringmetric
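
    As an illustration of the score described above, here is a minimal plain-Scala sketch of the Dice-Sorensen coefficient over character bigrams. It is not the stringmetric implementation used by diceSorensen; the object name, the helper names, and the handling of strings shorter than two characters are assumptions made for this example.

    ```scala
    object DiceSorensenSketch {
      // Split a string into overlapping character bigrams, e.g. "night" -> ni, ig, gh, ht.
      // Strings shorter than 2 characters yield no bigrams (an assumption of this sketch).
      def bigrams(s: String): Seq[String] =
        if (s.length < 2) Seq.empty else s.sliding(2).toSeq

      // Dice-Sorensen coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|).
      // Shared bigrams are counted with a multiset intersection.
      def similarity(a: String, b: String): Double = {
        val x = bigrams(a)
        val y = bigrams(b)
        if (x.isEmpty || y.isEmpty) 0.0
        else 2.0 * x.intersect(y).size / (x.size + y.size)
      }
    }
    ```

    With this sketch, DiceSorensenSketch.similarity("night", "nacht") shares one bigram ("ht") out of four on each side, giving 2 * 1 / 8 = 0.25, while two identical strings score 1.0.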

  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. def jaroWinkler(c1: Column, c2: Column): Column

    Calculates the Jaro–Winkler similarity between two string-typed columns. Returns a float between 0 (no match) and 1 (full match).

    Algorithm reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
    Library reference: https://github.com/rockymadden/stringmetric

  15. def nGram2(c1: Column, c2: Column): Column

    Calculates the N-gram (N=2) similarity between two string-typed columns. Returns a float between 0 (no match) and 1 (full match).

    Algorithm reference: https://en.wikipedia.org/wiki/N-gram
    Library reference: https://github.com/rockymadden/stringmetric

  16. def nGram3(c1: Column, c2: Column): Column

    Calculates the N-gram (N=3) similarity between two string-typed columns. Returns a float between 0 (no match) and 1 (full match).

    Algorithm reference: https://en.wikipedia.org/wiki/N-gram
    Library reference: https://github.com/rockymadden/stringmetric

  17. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  18. def normlevenshtein(c1: Column, c2: Column): Column

    Calculates the normalized Levenshtein similarity between two string-typed columns. Returns a float between 0 (no match) and 1 (full match).

    Algorithm reference: https://en.wikipedia.org/wiki/Levenshtein_distance
    Library reference: https://github.com/rockymadden/stringmetric
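
    To make the 0-to-1 scale concrete, the following plain-Scala sketch computes a Levenshtein edit distance and normalizes it. The normalization used here (1 minus the distance divided by the length of the longer string) and the treatment of two empty strings are assumptions for illustration; the stringmetric library behind normlevenshtein may differ in such details.

    ```scala
    object NormLevenshteinSketch {
      // Classic dynamic-programming edit distance, keeping only the previous row.
      def distance(a: String, b: String): Int = {
        var prev = Array.range(0, b.length + 1)
        for (i <- 1 to a.length) {
          val curr = new Array[Int](b.length + 1)
          curr(0) = i
          for (j <- 1 to b.length) {
            val cost = if (a(i - 1) == b(j - 1)) 0 else 1
            // Cheapest of: insertion, deletion, substitution (or match).
            curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
          }
          prev = curr
        }
        prev(b.length)
      }

      // Assumed normalization: 1 - distance / length of the longer string.
      // Two empty strings are treated as a full match (also an assumption).
      def similarity(a: String, b: String): Double = {
        val longest = math.max(a.length, b.length)
        if (longest == 0) 1.0 else 1.0 - distance(a, b).toDouble / longest
      }
    }
    ```

    For example, "kitten" and "sitting" are 3 edits apart and the longer string has 7 characters, so this sketch scores them 1 - 3/7, roughly 0.57.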

  19. final def notify(): Unit

    Definition Classes
    AnyRef
  20. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  21. def smvArrayCat(sep: String, col: Column): Column

  22. def smvArrayCat(sep: String, col: Column, fn: (Any) ⇒ String): Column

    For an Array column, creates a String column by joining the Array values with the separator sep. The fn argument converts each element to a String.

  23. def smvBoolsToBitmap(headColumnName: String, tailColumnNames: String*): Column

    Coalesces boolean columns, given by name, into a String bitmap.

  24. def smvBoolsToBitmap(boolColumns: Column*): Column

    Coalesces boolean columns into a String bitmap.
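
    The bitmap idea can be sketched on plain Scala values as follows. This is an illustration only, not the library code: the object name is hypothetical, None stands in for a SQL null, and mapping a null flag to '0' is an assumption of this sketch.

    ```scala
    object BitmapSketch {
      // Map each flag to one character: true -> '1', false or missing -> '0'.
      // Treating a missing (null) value as '0' is an assumption of this sketch.
      def boolsToBitmap(flags: Seq[Option[Boolean]]): String =
        flags.map(f => if (f.getOrElse(false)) '1' else '0').mkString
    }
    ```

    Under these assumptions, the flags true, false, null, true become the string "1001", one character per input column in the order given.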

  25. def smvCollectSet(c: Column, dt: DataType): Column

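    Aggregate function that collects the values of column c, with element type dt, into a set. Spark 1.6 and later provide a built-in collect_set aggregation function.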

  26. def smvCountDistinctWithNull(colN: String, colNs: String*): Column

  27. def smvCountDistinctWithNull(cols: Column*): Column

    Count number of distinct values including null

  28. def smvCountFalse(cond: Column): Column

    Count non-null false values

  29. def smvCountNull(cond: Column): Column

    Count number of null values

  30. def smvCountTrue(cond: Column): Column

    Aggregate function that counts the number of rows satisfying a given condition.
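
    The counting aggregates above (smvCountDistinctWithNull, smvCountFalse, smvCountNull, smvCountTrue) differ mainly in how they treat null. The sketch below mirrors their descriptions on a plain Scala sequence, with None standing in for a SQL null. The object and method names are hypothetical and the code only illustrates the semantics; it is not how the Column-based functions are implemented.

    ```scala
    object CountSketch {
      // None stands in for a SQL null.

      // Rows where the condition is true (null and false are not counted).
      def countTrue(xs: Seq[Option[Boolean]]): Int = xs.count(_.contains(true))

      // Rows where the condition is a non-null false.
      def countFalse(xs: Seq[Option[Boolean]]): Int = xs.count(_.contains(false))

      // Rows where the value is null.
      def countNull(xs: Seq[Option[Boolean]]): Int = xs.count(_.isEmpty)

      // Distinct values, with null counted as one more distinct value.
      def countDistinctWithNull[A](xs: Seq[Option[A]]): Int = xs.distinct.size
    }
    ```

    For the sample values true, false, null, true, null, this gives 2 true, 1 false, 2 null, and 3 distinct values (true, false, and null).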

  31. def smvFirst(c: Column, nonNull: Boolean = false): Column

    Returns the first value of a column. By default, returns null if the first record is null.

    Since Spark 1.5, "first" returns the first non-null value, so smvFirst was created to return the real first value, even if it is null. With nonNull set to true, it tries to return the first non-null value instead.

    Spark 2.1 enhanced the first function to accept a flag that controls whether nulls are skipped, so smvFirst can simply forward the call and maintain the old interface.

    c: the column
    nonNull: switches whether the function will try to find the first non-null value
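
    The two behaviours selected by nonNull can be illustrated on a plain Scala sequence, where None stands in for a SQL null. This is a sketch of the semantics described above under that assumption, with a hypothetical object name; it is not the Spark-based implementation.

    ```scala
    object FirstSketch {
      // None stands in for a SQL null.
      def first[A](values: Seq[Option[A]], nonNull: Boolean = false): Option[A] =
        if (nonNull) values.flatten.headOption // skip nulls, then take the first value
        else values.headOption.flatten         // take the first record as-is, null included
    }
    ```

    For the values null, 1, 2, the default call returns None (the real first record is null), while nonNull = true returns Some(1).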

  32. def smvHasNonNull(columns: Column*): Column

    True if any of the columns is not null

  33. def smvHashKey(cols: Column*): Column

  34. def smvHashKey(prefix: String, cols: Column*): Column

    Creates a unique id from the primary key list.

    Returns the prefix followed by the MD5 hex string (32 characters) as the unique key.

    MD5's collision rate on real data records is negligible, based on the following discussion.

    https://marc-stevens.nl/research/md5-1block-collision/
    The shortest messages that have the same MD5 are the 512-bit (64-byte) messages

    4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2
    and (differing by two bits)
    4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2

    which both have MD5 hash 008ee33a9d58b51cfeb425b0959121c9.

    Other such pairs exist, but all are carefully constructed. In theory, random collisions become likely only as the data size approaches 2^64 (since MD5 is a 128-bit hash), which is far larger than the number of records we deal with (a billion is about 2^30). Therefore, using MD5 to hash the primary key columns is good enough for creating a unique key.
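
    A rough plain-Scala sketch of the scheme and of the collision estimate follows. The object and method names are hypothetical. How the key columns are combined before hashing is an assumption here (concatenated without a separator, with null as an empty string), and the probability uses the standard birthday approximation n^2 / 2^129 for a 128-bit hash.

    ```scala
    import java.nio.charset.StandardCharsets
    import java.security.MessageDigest

    object HashKeySketch {
      // MD5 digest rendered as a 32-character lowercase hex string.
      def md5Hex(s: String): String =
        MessageDigest.getInstance("MD5")
          .digest(s.getBytes(StandardCharsets.UTF_8))
          .map(b => f"${b & 0xff}%02x")
          .mkString

      // Assumption: key parts are concatenated without a separator, null (None) as "".
      def hashKey(prefix: String, parts: Seq[Option[String]]): String =
        prefix + md5Hex(parts.map(_.getOrElse("")).mkString)

      // Birthday approximation: with n random keys and a 128-bit hash, the chance
      // of at least one collision is roughly n^2 / 2^129.
      def collisionProbability(n: Double): Double = n * n / math.pow(2, 129)
    }
    ```

    With these assumptions, HashKeySketch.hashKey("id_", Seq(Some("a"), None, Some("bc"))) hashes the string "abc" and returns "id_900150983cd24fb0d6963f7d28e17f72". For about 2^30 records the estimated collision probability is 2^-69, on the order of 10^-21.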

  35. def smvStrCat(sep: String, columns: Column*): Column

  36. def smvStrCat(columns: Column*): Column

    Patch Spark's concat and concat_ws to treat null as empty string in concatenation.
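
    For context, Spark's concat returns null when any input is null, and concat_ws skips null inputs. The sketch below shows the null-as-empty-string behaviour described above on plain Scala values, with None standing in for a SQL null. The names are hypothetical, and keeping the separator around an empty value is an assumption drawn from the description rather than from the library source.

    ```scala
    object StrCatSketch {
      // None stands in for a SQL null and is replaced by the empty string.
      def strCat(sep: String, values: Seq[Option[String]]): String =
        values.map(_.getOrElse("")).mkString(sep)

      // Variant without a separator.
      def strCat(values: Seq[Option[String]]): String = strCat("", values)
    }
    ```

    Under these assumptions, the values "a", null, "c" concatenate to "ac" without a separator and to "a--c" with the separator "-".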

  37. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  38. def toString(): String

    Definition Classes
    AnyRef → Any
  39. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  40. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  41. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Deprecated Value Members

  1. def collectSet(dt: DataType)(c: Column): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1) Replaced by smvCollectSet(col, datatype)
