Package

org.tresamigos.smv

edd

package edd

Provides Extended Data Dictionary functions for ad hoc data analysis

scala> val res1 = df.edd.summary($"amt", $"time")
scala> res1.eddShow

scala> val res2 = df.edd.histogram(AmtHist("amt"), Hist("county"), Hist("pop", binSize = 1000))
scala> res2.eddShow
scala> res2.saveReport("file/path")

Depending on the data types of the columns, the Edd summary method computes different statistics.

The histogram method takes a group of HistColumn values as parameters. Alternatively, when a group of Strings is given as column names, it uses the default HistColumn parameters. Two types of HistColumn are supported: Hist and AmtHist.

The eddShow method prints the report to the console. The saveReport method saves the report as an RDD[String], where each string is a JSON string.

Linear Supertypes
AnyRef, Any

Type Members

  1. case class AmtHist(colName: String) extends HistColumn with Product with Serializable

    Specifies an amount-binned histogram, with predefined bins for amount-like columns chosen to best report on log-normally distributed amount fields.

    Bins:

    • "<0.0" => bin by 1000
    • "0.0" => keep 0.0
    • (0.0, 10.0) => 0.01
    • [10.0, 200.0) => bin by 10, floor
    • [200.0, 1000.0) => bin by 50, floor
    • [1000.0, 10000.0) => bin by 500, floor
    • [10000.0, 1000000.0) => bin by 5000, floor
    • [1000000.0, Inf) => bin by 1000000, floor
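The bin table above can be read as a function from an amount to the lower edge of its bin. The following is a minimal sketch of that reading in plain Scala, not the SMV implementation. The names AmtBinSketch, floorTo, and amtBin are hypothetical, and two points are assumptions: that negative values are floored to a multiple of 1000, and that every value in (0.0, 10.0) falls into a single bin labeled 0.01.

```scala
// Hypothetical helper, not part of the SMV API.
object AmtBinSketch {
  // Floor v down to the nearest multiple of width.
  private def floorTo(v: Double, width: Double): Double =
    math.floor(v / width) * width

  // Map an amount to the lower edge of its bin, following the bin table.
  def amtBin(v: Double): Double =
    if (v < 0.0) floorTo(v, 1000.0)             // assumption: negatives are floored as well
    else if (v == 0.0) 0.0                      // zero keeps its own bin
    else if (v < 10.0) 0.01                     // assumption: one bin labeled 0.01
    else if (v < 200.0) floorTo(v, 10.0)
    else if (v < 1000.0) floorTo(v, 50.0)
    else if (v < 10000.0) floorTo(v, 500.0)
    else if (v < 1000000.0) floorTo(v, 5000.0)
    else floorTo(v, 1000000.0)
}
```

Under these assumptions, AmtBinSketch.amtBin(123.45) gives 120.0 and AmtBinSketch.amtBin(1234.0) gives 1000.0. The bin width grows with the amount, which keeps the number of bins manageable for long-tailed amount fields.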
  2. class Edd extends AnyRef

    Implements the edd method of DFHelper.

    Provides the summary and histogram methods.

  3. case class EddResultFunctions(eddRes: DataFrame) extends Product with Serializable

    Implements methods on Edd results.

    scala> import org.tresamigos.smv.edd._
    scala> df.edd.summary().eddShow
    scala> df.edd.summary().saveReport("file/path")
    scala> val eddResult: DataFrame = df.edd.summary()

    With the edd package imported, an EddResultFunctions value can be implicitly converted to a DataFrame.
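The conversion mechanism can be illustrated without Spark. The sketch below uses hypothetical stand-in types (Frame for DataFrame, ResultFunctions for EddResultFunctions, toFrame for makeEddDFCvrt) and only shows how an implicit def lets a wrapper be used where the wrapped type is expected. It is not the SMV code.

```scala
import scala.language.implicitConversions

// Hypothetical stand-ins, used only to illustrate the implicit conversion pattern.
object ImplicitSketch {
  final case class Frame(rows: Seq[String])          // stands in for DataFrame

  final case class ResultFunctions(res: Frame) {     // stands in for EddResultFunctions
    // Extra method available on the wrapper only.
    def report: String = res.rows.mkString("\n")
  }

  // Plays the role of makeEddDFCvrt: unwraps the result when a Frame is expected.
  implicit def toFrame(rf: ResultFunctions): Frame = rf.res

  val result: ResultFunctions = ResultFunctions(Frame(Seq("a", "b")))

  // The type annotation triggers the implicit conversion.
  val frame: Frame = result
}
```

With this pattern the wrapper can carry report-oriented methods while still being assignable to the underlying type, which matches the `val eddResult: DataFrame = ...` usage shown above.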

  4. case class Hist(colName: String, binSize: Double = 100.0, sortByFreq: Boolean = false) extends HistColumn with Product with Serializable

    Defines histogram parameters for the specified column.

    colName

    column name as a String

    binSize

    bin size for numeric column, default 100.0

    sortByFreq

    whether to sort the histogram result by frequency, default false (sort by key)
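To make the binSize and sortByFreq parameters concrete, here is a minimal sketch over an in-memory Seq[Double] rather than a DataFrame. HistSketch and its histogram function are hypothetical and are not the SMV implementation. Two assumptions are made: each value is binned to the floor multiple of binSize, and sortByFreq orders bins by descending count with the bin key breaking ties.

```scala
// Hypothetical helper, not part of the SMV API.
object HistSketch {
  // Count values per bin; each bin is labeled by its lower edge.
  def histogram(
      values: Seq[Double],
      binSize: Double = 100.0,
      sortByFreq: Boolean = false
  ): Seq[(Double, Int)] = {
    val counts: Seq[(Double, Int)] = values
      .groupBy(v => math.floor(v / binSize) * binSize) // assumption: floor binning
      .map { case (bin, vs) => (bin, vs.size) }
      .toSeq

    if (sortByFreq)
      // Assumption: most frequent bin first, ties broken by bin key.
      counts.sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
    else
      counts.sortWith(_._1 < _._1)                     // default: sort by key
  }
}
```

For example, HistSketch.histogram(Seq(5.0, 150.0, 120.0, 199.0, 250.0)) returns Seq((0.0, 1), (100.0, 3), (200.0, 1)), and passing sortByFreq = true moves the (100.0, 3) bin to the front.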

Value Members

  1. object Edd

  2. implicit def makeEddDFCvrt(erf: EddResultFunctions): DataFrame
