Create a DataFrame from a string for temporary use (in a test or the shell). By default, the validation result is not persisted.
Passing null for data creates an empty DataFrame with the specified schema.
Returns the app-level dependency graph as a dot string
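As a rough illustration of what a "dot string" for a dependency graph looks like, here is a minimal sketch. The object name `DotGraphSketch`, the `toDot` helper, and the edge representation are assumptions made for this example; they are not SMV's actual API, and the real output may carry more attributes.

```scala
object DotGraphSketch {
  // Renders (from, to) edges as a Graphviz dot digraph string.
  // Hypothetical helper for illustration only.
  def toDot(edges: Seq[(String, String)]): String = {
    val lines = edges.map { case (from, to) => s"""  "$from" -> "$to";""" }
    (Seq("digraph G {") ++ lines ++ Seq("}")).mkString("\n")
  }
}
```

Calling `DotGraphSketch.toDot(Seq("a.Input" -> "a.Output"))` yields a three-line digraph that Graphviz tools can render.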
Returns the app-level dependency graph as a json string
Get the DataFrame associated with a data set. The DataFrame plan (not the data) is cached in dfCache to ensure that only a single DataFrame exists for a given data set (file/module). Note: the cache is keyed by the "versioned" dataset FQN.
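The caching behavior described above can be sketched as a map keyed by the versioned FQN. This is a simplified stand-in, not SMV's code: `Plan` replaces the Spark DataFrame, and `PlanCacheSketch`, `resolve`, and the key format are hypothetical names chosen for the example.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a DataFrame plan; the real cache holds Spark DataFrames.
final case class Plan(description: String)

object PlanCacheSketch {
  // Keyed by the "versioned" FQN; the exact key format is an assumption here.
  private val dfCache = mutable.Map.empty[String, Plan]
  private var builds = 0

  // Number of times a plan actually had to be built.
  def buildCount: Int = builds

  // Returns the cached plan for the key, building it only on the first request.
  def resolve(versionedFqn: String)(build: => Plan): Plan =
    dfCache.getOrElseUpdate(versionedFqn, { builds += 1; build })
}
```

Because the key includes the version, a changed data set gets a new entry instead of reusing a stale plan.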
Zero-parameter wrapper around dependencyGraphJsonString that can be called directly from Python. TODO: remove this once we pass args to dependencyGraphJsonString.
List of all the files with a specific suffix in the given directory.
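A minimal sketch of such a listing using only `java.io.File` follows. Whether the real method returns names or full paths, and whether it recurses into subdirectories, is not stated in the description; this example assumes file names, non-recursive, sorted. `FileListingSketch` and `filesWithSuffix` are hypothetical names.

```scala
import java.io.File

object FileListingSketch {
  // Names of the regular files directly inside `dir` that end with `suffix`, sorted.
  // listFiles() returns null for a missing or unreadable directory, so wrap it in Option.
  def filesWithSuffix(dir: File, suffix: String): Seq[String] =
    Option(dir.listFiles())
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .filter(f => f.isFile && f.getName.endsWith(suffix))
      .map(_.getName)
      .sorted
}
```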
Returns metadata for a given urn
Returns the run information for a given dataset and all its dependencies (including transitive dependencies), from the last run
Sequence of SmvModules to run based on the command line arguments. Returns the union of the modules selected by the -a/-m/-s command line flags.
Sequence of SmvModules to run + all of their ancestors
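One way to picture "modules to run plus all of their ancestors" is a transitive walk over a dependency map. The sketch below assumes that "ancestors" means the modules a module depends on, directly or transitively, and it uses plain strings in place of SmvModule instances. `AncestorsSketch`, `Deps`, and `withAncestors` are hypothetical names, and the result is a Set rather than an ordered sequence.

```scala
object AncestorsSketch {
  // Hypothetical dependency map: module name -> names of the modules it directly requires.
  type Deps = Map[String, Seq[String]]

  // Union of the requested modules and everything they depend on, directly or transitively.
  def withAncestors(requested: Seq[String], deps: Deps): Set[String] = {
    @scala.annotation.tailrec
    def loop(pending: List[String], seen: Set[String]): Set[String] = pending match {
      case Nil                  => seen
      case m :: rest if seen(m) => loop(rest, seen) // already visited, skip
      case m :: rest            => loop(deps.getOrElse(m, Nil).toList ++ rest, seen + m)
    }
    loop(requested.toList, Set.empty)
  }
}
```

Tracking the `seen` set keeps shared ancestors from being visited twice and guards against cycles in malformed input.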
If the publish-to-Hive flag is set, then publish to Hive.
If the export-csv option is specified, then publish locally.
Publish through JDBC if the --publish-jdbc flag is set
The main entry point into the app. This will parse the command line arguments to determine which modules should be run/graphed/etc.
Proceeds with the execution of an smvDS passed from runModule or runModuleByName. TODO: the name of this function should make its distinction from runModule clear (this is an implementation).
Run a module by its fully qualified name in its respective language environment. If the force argument is true, any existing persisted results will be deleted and the module's DataFrame cache will be ignored, forcing the module to run again. If a version is specified, try to read the module from the published data for the given version. If a dynamic runtime configuration is specified, run the module with the configuration provided.
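The effect of the force argument can be illustrated with a toy runner. This is only a sketch under simplifying assumptions: strings stand in for DataFrames, two in-memory maps stand in for the persisted results and the DataFrame cache, and `ForceRunSketch`, `run`, and `runCount` are hypothetical names. Version lookup and dynamic runtime configuration are not modeled.

```scala
import scala.collection.mutable

// Toy runner; `compute` stands in for actually running a module (hypothetical).
final class ForceRunSketch(compute: String => String) {
  private val persisted = mutable.Map.empty[String, String] // stands in for persisted output
  private val cache = mutable.Map.empty[String, String]     // stands in for the DataFrame cache
  private var runs = 0

  // Number of times `compute` was actually invoked.
  def runCount: Int = runs

  def run(fqn: String, force: Boolean = false): String = {
    if (force) {
      // Drop both layers so the module is recomputed from scratch.
      persisted.remove(fqn)
      cache.remove(fqn)
    }
    cache.getOrElseUpdate(fqn, persisted.getOrElseUpdate(fqn, { runs += 1; compute(fqn) }))
  }
}
```

Without force, a second call for the same name is served from the cache; with force, both layers are cleared first and the module is computed again.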
Run a module based on the end of its name (must be unique). If the force argument is true, any existing persisted results will be deleted and the module's DataFrame cache will be ignored, forcing the module to run again. If a version is specified, try to read the module from the published data for the given version.
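The "end of its name (must be unique)" rule can be sketched as a suffix lookup that fails when zero or several names match. Matching on whole dot-separated trailing segments is an assumption of this example, as are the names `NameResolutionSketch` and `resolveByEndOfName`; the actual matching rule may differ.

```scala
object NameResolutionSketch {
  // Resolves a partial name against known FQNs; exactly one FQN must match.
  def resolveByEndOfName(partial: String, knownFqns: Seq[String]): Either[String, String] = {
    // Match a whole trailing segment, so "Out" does not match "a.b.MyOut".
    val matches = knownFqns.filter(fqn => fqn == partial || fqn.endsWith("." + partial))
    matches match {
      case Seq(only) => Right(only)
      case Seq()     => Left(s"No module name ends with '$partial'")
      case many =>
        val listed = many.mkString(", ")
        Left(s"Ambiguous name '$partial': $listed")
    }
  }
}
```

Returning an `Either` keeps the "not found" and "ambiguous" cases explicit, so a caller can report them instead of running the wrong module.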
Register Kryo classes. Since none of the SMV classes will be put in an RDD, registering them or not makes no significant difference to performance.
val allSerializables = SmvReflection.objectsInPackage[Serializable]("org.tresamigos.smv")
sparkConf.registerKryoClasses(allSerializables.map{_.getClass}.toArray)
Driver for SMV applications. Most apps do not need to override this class and should just be launched using the SmvApp object (defined below)