All dependencies with the dependency hierarchy flattened
SmvModuleLinks should not cache or validate their data
Hash computed from the dataset; can be overridden to include things other than the CRC.
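A minimal, self-contained sketch of this pattern (class and member names are hypothetical, not the real SMV API): a base hash computed via CRC32, with an override hook that mixes in additional inputs.

```scala
import java.util.zip.CRC32

// Hypothetical sketch: a dataset hash computed from an identifying string
// via CRC32, overridable to include inputs other than the CRC.
class DataSetHasher(val fqn: String) {
  protected def crcOf(s: String): Long = {
    val crc = new CRC32()
    crc.update(s.getBytes("UTF-8"))
    crc.getValue
  }

  // Subclasses may override to mix in more, e.g. a schema fingerprint.
  def datasetHash: Long = crcOf(fqn)
}

// Example override: include a (hypothetical) modification time in the hash.
class TimestampedHasher(fqn: String, mtime: Long) extends DataSetHasher(fqn) {
  override def datasetHash: Long = super.datasetHash + mtime
}
```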
Define the DQM rules, fixes and policies to be applied to this DataSet.
See org.tresamigos.smv.dqm, org.tresamigos.smv.dqm.DQMRule, and org.tresamigos.smv.dqm.DQMFix
for details on creating rules and fixes.
Concrete modules and files should override this method to define rules/fixes to apply. The default is to provide an empty set of DQM rules/fixes.
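To illustrate the rule/fix split, here is a self-contained toy model — the `Record`, `DQMRule`, and `DQMFix` types below are stand-ins, not the real classes from org.tresamigos.smv.dqm, which operate on DataFrames:

```scala
// Hypothetical stand-ins: a rule flags rows that fail a check,
// a fix repairs rows in place.
case class Record(name: String, age: Int)
case class DQMRule(name: String, check: Record => Boolean)
case class DQMFix(name: String, fix: Record => Record)

// Apply all fixes to every row and collect rule violations on the raw rows.
def applyDqm(rows: Seq[Record],
             rules: Seq[DQMRule],
             fixes: Seq[DQMFix]): (Seq[Record], Seq[String]) = {
  val violations =
    for (r <- rows; rule <- rules if !rule.check(r)) yield s"${rule.name}: $r"
  val fixed = rows.map(r => fixes.foldLeft(r)((acc, f) => f.fix(acc)))
  (fixed, violations)
}
```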
DataSet type: one of four values — Input, Link, Module, Output.
Exports a DataFrame to a Hive table.
Names the persisted file for the result of this SmvDataSet
The FQN of an SmvDataSet is its classname for Scala implementations.
Scala proxies for implementations in other languages must override this to name the proxied FQN.
TODO: remove this method as checkDependency replaced this function
If the linked SmvModule has a published version, SmvModuleLink's datasetHash depends on the version string and the target's FQN (even with versioned data, the hash should change if the target changes). Otherwise, it depends on the SmvModule's hashOfHash.
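The two branches can be sketched as follows — a minimal model with hypothetical names, not the real SMV implementation:

```scala
// Sketch of the link-hash logic described above (names hypothetical).
def linkDatasetHash(publishedVersion: Option[String],
                    targetFqn: String,
                    targetHashOfHash: Int): Int =
  publishedVersion match {
    // Versioned: hash the version string together with the target's FQN,
    // so the hash still changes if the link points at a different target.
    case Some(v) => (v + targetFqn).hashCode
    // Unversioned: follow the target module's own hashOfHash.
    case None    => targetHashOfHash
  }
```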
Flag indicating whether this module is ephemeral (short-lived), so that it will not be persisted when a graph is executed. This is quite handy for "filter" or "map" type modules, so that we don't force an extra I/O step when it is not needed. By default all modules are persisted unless the flag is overridden to true. Note: the module will still be persisted if it was specifically selected to run by the user.
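A toy model of this persistence decision (the trait and helper below are hypothetical illustrations, not the real API):

```scala
// Hypothetical model: modules persist by default; ephemeral ones skip the
// I/O step unless the user selected them directly.
trait Module {
  def isEphemeral: Boolean = false // default: persist everything
}

def shouldPersist(m: Module, userSelected: Boolean): Boolean =
  !m.isEphemeral || userSelected

// A cheap "filter"-style module opts out of persistence.
class FilterMod extends Module {
  override def isEphemeral = true
}
```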
Objects defined in the Spark Shell have class names starting with $.
Can be overridden to supply custom metadata. TODO: make SmvMetadata more user-friendly or find an alternative format for user metadata.
Returns the path for the module's CSV output.
An optional SQL query to run to publish the results of this module when the --publish-hive command line option is used. The DataFrame result of running this module will be available to the query as the "dftable" table. For example: return "insert overwrite table mytable select * from dftable". If this method is not specified, the default is to create the table specified by tableName() with the results of the module.
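A sketch of the override contract — the trait below is a hypothetical stand-in, but the query string follows the example given above:

```scala
// Hypothetical model of the publish-to-Hive contract: None means "create
// tableName() from the module's result"; Some(sql) runs a custom query that
// can reference the result as the "dftable" table.
trait HivePublishable {
  def tableName: String
  def publishHiveSql: Option[String] = None
}

class MyOutput extends HivePublishable {
  def tableName = "mytable"
  override def publishHiveSql =
    Some("insert overwrite table mytable select * from dftable")
}
```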
"Running" a link requires that we read the published output from the upstream DataSet.
When a publish version is specified, it will try to read from the published dir. Otherwise
it will either "follow-the-link", which means resolve the modules the linked DS depends on
and run the DS, or "not-follow-the-link", which will try to read from the persisted data dir
and fail if not found.
Read a dataframe from a persisted file path, that is usually an input data set or the output of an upstream SmvModule.
The default format is headerless CSV with '"' as the quote character
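The real reader is Spark-based; purely to illustrate the quoting convention, here is a minimal, self-contained splitter for one headerless CSV line with '"' as the quote character (no escape handling):

```scala
// Toy illustration of the format only: split one CSV line, honoring
// '"'-quoted fields so embedded separators are kept.
def splitCsvLine(line: String, quote: Char = '"', sep: Char = ','): Seq[String] = {
  val out = scala.collection.mutable.Buffer.empty[String]
  val cur = new StringBuilder
  var inQuote = false
  for (c <- line) c match {
    case `quote`           => inQuote = !inQuote
    case `sep` if !inQuote => out += cur.toString; cur.clear()
    case other             => cur += other
  }
  out += cur.toString
  out.toSeq
}
```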
Override the module's run/requiresDS methods to be no-ops, as they will never be called (we override doRun as well).
Resolve the target SmvModule and wrap it in a new SmvModuleLink
fixed list of SmvDataSet dependencies
Returns the run information from this dataset's last run.
If the dataset has never been run, returns an empty run info with null for its components.
Create a snapshot in the current module at some result DataFrame. This is useful for debugging a long SmvModule by creating snapshots along the way.

object MyMod extends SmvModule("...") {
  override def requiresDS = Seq(...)
  override def run(...) = {
    val s1 = ...
    snapshot(s1, "s1")
    val s2 = f(s1)
    snapshot(s2, "s2")
    ...
  }
}
Hash computed based on the source code of the dataset's class.
Full name of the Hive output table if this module is published to Hive.
Override to validate module results based on current and historic metadata. If Some, DQM will fail. Defaults to None.
User-tagged code "version". Derived classes should update the value when code or data changes.
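A minimal sketch of the override pattern (the surrounding trait and default value are hypothetical; only the idea of bumping the version on change comes from the text above):

```scala
// Hypothetical model: a default version, bumped by a derived class
// when its code or data changes, to invalidate previously persisted output.
trait Versioned {
  def version: String = "0"
}

class MyMod extends Versioned {
  override def version = "1" // bumped because the code/data changed
}
```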
Link to an output module in another stage. Because modules in a given stage cannot access modules in another stage, this class enables the user to link an output module from one stage as an input into the current stage. Similar to File/Module, a dqm() method can also be overridden in the link.