Class

org.tresamigos.smv

SmvDataSet

abstract class SmvDataSet extends FilenamePart

The unit of dependency management in the SMV application framework. The framework derives execution order from the dependencies between SmvDataSet instances. An instance is either a file or a module; in either case it produces a single result DataFrame.
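
As a rough illustration of how an execution order can be derived from dependencies, the sketch below uses a hypothetical stand-in type `Node` (not part of SMV) whose `requires` list plays the role of requiresDS. It orders nodes depth-first so that every dependency precedes its dependents. This is only a minimal model of the idea, not SMV's own scheduling code.

```scala
// Hypothetical stand-in for SmvDataSet; `requires` plays the role of requiresDS.
final case class Node(name: String, requires: Seq[Node] = Nil)

object ExecutionOrderSketch {
  // Depth-first post-order traversal: dependencies are emitted before dependents.
  // Assumes the dependency graph is acyclic.
  def executionOrder(targets: Seq[Node]): Seq[Node] = {
    def visit(done: Vector[Node], n: Node): Vector[Node] =
      if (done.contains(n)) done
      else n.requires.foldLeft(done)(visit) :+ n
    targets.foldLeft(Vector.empty[Node])(visit)
  }
}
```

Calling `ExecutionOrderSketch.executionOrder` on a target that depends on an input and on another module lists the input first, then the intermediate module, then the target.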

Linear Supertypes
FilenamePart, AnyRef, Any

Instance Constructors

  1. new SmvDataSet()

Abstract Value Members

  1. abstract def description(): String

  2. abstract def dsType(): String

    Dataset type. One of four values: Input, Link, Module, or Output.

  3. abstract def isEphemeral: Boolean

    Flags whether this module is ephemeral (short-lived), in which case it is not persisted when a graph is executed. This is handy for "filter" or "map" type modules, because it avoids an extra I/O step when none is needed. By default, all modules are persisted unless the flag is overridden to true. Note: the module is still persisted if the user specifically selected it to run.
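
    The persistence rule described for this flag can be summarized as a small decision function. This is a sketch of the documented behavior only; `shouldPersist` and its parameter `explicitlyRequested` are hypothetical names, not part of the SMV API.

```scala
object EphemeralSketch {
  // A module is persisted unless it is ephemeral, except that a module the
  // user explicitly selected to run is persisted regardless of the flag.
  def shouldPersist(isEphemeral: Boolean, explicitlyRequested: Boolean): Boolean =
    !isEphemeral || explicitlyRequested
}
```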

  4. abstract def requiresDS(): Seq[SmvDataSet]

    Modules must override this to provide the set of datasets they depend on. This is no longer the canonical list of dependencies; internally, query resolvedRequiresDS for dependencies instead.

Concrete Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def allDeps: Seq[SmvDataSet]

    All dependencies with the dependency hierarchy flattened

  5. lazy val ancestors: Seq[SmvDataSet]

  6. def app: SmvApp

  7. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  8. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. def datasetHash(): Int

    Hash computed from the dataset. Can be overridden to include things other than the CRC.

  10. def dqm(): SmvDQM

    Define the DQM rules, fixes and policies to be applied to this DataSet. See org.tresamigos.smv.dqm, org.tresamigos.smv.dqm.DQMRule, and org.tresamigos.smv.dqm.DQMFix for details on creating rules and fixes.

    Concrete modules and files should override this method to define rules/fixes to apply. The default is to provide an empty set of DQM rules/fixes.

  11. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  12. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  13. def exportToHive(collector: SmvRunInfoCollector): Serializable

    Exports a DataFrame to a Hive table.

  14. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. def fnpart: String

    Names the persisted file for the result of this SmvDataSet.

    Definition Classes
    SmvDataSet → FilenamePart
  16. def fqn: String

    The FQN of an SmvDataSet is its classname for Scala implementations.

    Scala proxies for implementations in other languages must override this to name the proxied FQN.

  17. def getAncillary[T <: SmvAncillary](anc: T): T

    TODO: remove this method, as checkDependency has replaced it.

  18. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  19. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  20. def instanceValHash(): Int

    Hash computed based on instance values of the dataset, such as the timestamp of an input file.

  21. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  22. val isObjectInShell: Boolean

    Objects defined in the Spark shell have class names that start with $.

  23. def metadata(df: DataFrame): SmvMetadata

    Can be overridden to supply custom metadata.

    TODO: make SmvMetadata more user friendly or find an alternative format for user metadata.

  24. def moduleCsvPath(prefix: String = ""): String

    Returns the path for the module's CSV output.

  25. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  26. final def notify(): Unit

    Definition Classes
    AnyRef
  27. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  28. def persist(dataframe: DataFrame, prefix: String = ""): Unit

  29. def publishHiveSql: Option[String]

    An optional SQL query to run to publish the results of this module when the --publish-hive command-line option is used. The DataFrame result of running this module is available to the query as the "dftable" table. For example, return "insert overwrite table mytable select * from dftable". If this method is not overridden, the default is to create the table specified by tableName() with the results of the module.
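
    The choice between a custom query and the default can be sketched as follows. `HivePublishable` is a hypothetical stand-in exposing only the two members involved, and the fallback SQL text is an assumption about what creating the table from "dftable" looks like, not the statement SMV actually issues.

```scala
// Hypothetical stand-in for the two members involved; not the SMV class itself.
trait HivePublishable {
  def tableName: String
  def publishHiveSql: Option[String] = None
}

object PublishHiveSketch {
  // Use the module's own query if it defines one; otherwise fall back to
  // creating the table from the temporary "dftable" table.
  // The fallback statement below is an assumed form, for illustration only.
  def publishStatement(m: HivePublishable): String =
    m.publishHiveSql.getOrElse(
      s"create table ${m.tableName} as select * from dftable")
}
```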

  30. def rdd(forceRun: Boolean = false, genEdd: Boolean = app.genEdd, collector: SmvRunInfoCollector): DataFrame

    Returns the DataFrame from this dataset (file/module). The value is cached, so this function can be called repeatedly. The cache is external to SmvDataSet, so the DataFrame is not recalculated even after the same SmvDataSet is dynamically reloaded. If the forceRun argument is true, the cache is skipped. Note: the RDD graph is cached and NOT the data (i.e. rdd.cache is NOT called here).
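
    The caching behavior described here can be modeled with a cache that lives outside the dataset object. In this sketch `ResultCache`, `getOrCompute`, and the string key are hypothetical, and `Plan` stands in for a DataFrame (the lazy execution graph, not the data). Storing the fresh value back into the cache on a forced run is an assumption.

```scala
import scala.collection.mutable

// Cache kept outside the dataset object and keyed by name, so a reloaded
// instance of the same dataset still finds the previously built value.
final class ResultCache[Plan] {
  private val cache = mutable.Map.empty[String, Plan]

  def getOrCompute(key: String, forceRun: Boolean = false)(compute: => Plan): Plan =
    if (forceRun) {
      // Bypass the cache; assumed here to also refresh the stored entry.
      val fresh = compute
      cache.update(key, fresh)
      fresh
    } else cache.getOrElseUpdate(key, compute)
}
```

    Repeated calls with the same key evaluate `compute` once; passing `forceRun = true` evaluates it again.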

  31. def readFile(path: String, attr: CsvAttributes = CsvAttributes.defaultCsv): DataFrame

    Read a DataFrame from a persisted file path, which is usually an input data set or the output of an upstream SmvModule.

    The default format is headerless CSV with '"' as the quote character

  32. def requiresAnc(): Seq[SmvAncillary]

  33. def resolve(resolver: DataSetResolver): SmvDataSet

  34. var resolvedRequiresDS: Seq[SmvDataSet]

    fixed list of SmvDataSet dependencies

  35. def runInfo: SmvRunInfo

    Returns the run information from this dataset's last run.

    If the dataset has never been run, returns an empty run info with null for its components.

  36. def setTimestamp(dt: DateTime): Unit

  37. def sourceCodeHash(): Int

    Hash computed based on the source code of the dataset's class.

  38. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  39. def tableName: String

    Full name of the Hive output table if this module is published to Hive.

  40. def toString(): String

    Definition Classes
    SmvDataSet → AnyRef → Any
  41. def urn: URN

  42. def validateMetadata(metadata: SmvMetadata, history: Seq[SmvMetadata]): Option[String]

    Override to validate module results based on current and historic metadata. If the result is Some, DQM will fail. Defaults to None.
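
    A validation override might look roughly like the sketch below. `Meta` is a hypothetical stand-in for SmvMetadata with a single made-up field, `rowCount`; the rule itself and the assumption that the first element of `history` is the most recent run are illustrative only.

```scala
// Hypothetical stand-in for SmvMetadata carrying a single field.
final case class Meta(rowCount: Long)

object ValidateMetadataSketch {
  // Returns Some(message) to signal a failure and None to pass.
  // Example rule: fail when the row count falls below half of the previous run.
  // Assumes history.head is the most recent prior run.
  def validateMetadata(current: Meta, history: Seq[Meta]): Option[String] =
    history.headOption.collect {
      case prev if current.rowCount * 2 < prev.rowCount =>
        s"row count dropped from ${prev.rowCount} to ${current.rowCount}"
    }
}
```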

  43. def verHex: String

  44. def version(): Int

    User-tagged code "version". Derived classes should update the value when code or data

  45. def versionedFqn: String

  46. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  47. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  48. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
