ColumnHelper class provides additional methods/operators on Column
import org.tresamigos.smv
will import the implicit conversion from Column to ColumnHelper.
Policy for validating a module's current metadata against its historical metadata. This policy is added to every module's DQM.
DataSetMgr (DSM) is the entrypoint for SmvApp to load the SmvDataSets in a project. Every DSM method that loads SmvDataSets creates a new transaction, within which all of the indicated SmvDataSets are loaded from the most recent source and resolved. All SmvDataSets provided by DSM are resolved. DSM delegates to DataSetRepo to discover SmvDataSets and to DataSetResolver to load and resolve them. DSM methods like load which look up SmvDataSets by name accept an arbitrary number of names, so that all the target SmvDataSets are loaded within the same transaction (which is much faster).
DataSetRepo is the entity responsible for discovering and loading the datasets in a given language. A new repo is created for each new transaction.
DataSetResolver (DSR) is the entrypoint through which the DataSetMgr acquires SmvDataSets. A DSR object represents a single transaction. Each DSR creates a set of DataSetRepos at instantiation. When asked for an SmvDataSet, DSR queries the repos for that SmvDataSet and resolves it. The SmvDataSet is responsible for resolving itself, given access to the DSR to load/resolve the SmvDataSet's dependencies. DSR caches the SmvDataSets it has already resolved, to ensure that any SmvDataSet is resolved only once.
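The resolve-once caching described above can be sketched in plain Scala (the names Resolver/resolve are illustrative, not SMV's actual API):

```scala
import scala.collection.mutable

// Illustrative sketch of resolve-once caching: a resolver function is
// invoked at most once per key; subsequent requests hit the cache.
// The resolver receives the Resolver itself so it can resolve dependencies.
class Resolver[K, V](resolveOnce: (K, Resolver[K, V]) => V) {
  private val cache = mutable.Map.empty[K, V]
  def resolve(key: K): V =
    cache.getOrElseUpdate(key, resolveOnce(key, this))
}
```

A dataset asked for twice in the same transaction is thus resolved only the first time.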
A module's file name part is stackable, e.g. with Using[SmvRunConfig].
Adapts a Java InputStream object to the IAnyInputStream interface, so it can be used in I/O methods that can work with input streams from both Java and Python sources.
Maps SmvDataSet to DataFrame by FQN. This is the type of the parameter expected by SmvModule's run method.
Subclasses Function1[SmvDataSet, DataFrame] so it can be used the
same way as before, when runParams was type-aliased to
Map[SmvDataSet, DataFrame]
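The backward-compatibility trick above can be sketched with a plain Map-backed Function1 (names here are illustrative, not SMV's exact definitions):

```scala
// Sketch: wrapping a Map in a Function1 subclass so call sites that
// treated runParams as a Map (i.e. called it like a function) still work
// after the type alias was replaced by a real class.
class RunParams[K, V](byKey: Map[K, V]) extends Function1[K, V] {
  def apply(k: K): V = byKey(k)
  def size: Int = byKey.size
}
```

Code that previously wrote `runParams(ds)` against the Map continues to compile unchanged.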
Driver for SMV applications. Most apps do not need to override this class and should just be launched using the SmvApp object (defined below).
Container of all SMV config driven elements (cmd line, app props, user props, etc).
Represents a raw input file with a given file path (can be local or hdfs) and CSV attributes.
A built-in SmvModule from a schema string and a data string.
E.g.
SmvCsvStringData("a:String;b:Double;c:String", "aa,1.0,cc;aa2,3.5,CC")
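The shape of the schema and data strings in the example above can be illustrated with a simplified plain-Scala parser (SMV's real parser handles quoting, type coercion, etc.; this helper name is hypothetical):

```scala
// Simplified sketch: ";" separates schema entries and data rows,
// ":" separates column name from type, "," separates row fields.
def parseCsvStrings(schema: String, data: String): (Seq[(String, String)], Seq[Seq[String]]) = {
  val cols = schema.split(";").toSeq.map { f =>
    val Array(name, tpe) = f.split(":").map(_.trim)
    (name, tpe)
  }
  val rows = data.split(";").toSeq.map(_.split(",", -1).toSeq.map(_.trim))
  (cols, rows)
}
```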
SmvFile and SmvCsvStringData both share the parser validation logic, so the common part is extracted into a new abstract base class: SmvDSWithParser.
Dependency management unit within the SMV application framework. Execution order within the SMV application framework is derived from the dependencies between SmvDataSet instances. Instances of this class can be either a file or a module; in either case, there is a single result DataFrame.
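Deriving execution order from dependencies, as described above, amounts to a topological sort. A minimal sketch (illustrative only, not SMV's actual scheduler):

```scala
// Depth-first topological ordering: each target's dependencies are
// visited (and thus ordered) before the target itself.
def execOrder[T](targets: Seq[T], deps: T => Seq[T]): Seq[T] = {
  val seen = scala.collection.mutable.LinkedHashSet.empty[T]
  def visit(t: T): Unit =
    if (!seen.contains(t)) { deps(t).foreach(visit); seen += t }
  targets.foreach(visit)
  seen.toSeq
}
```

Running a module therefore runs all of its upstream dependencies first, each exactly once.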
Class for declaring datasets defined in another language. Resolves to an instance of SmvExtModulePython.
Declarative class for links to datasets defined in another language. Resolves to a link to an SmvExtModulePython.
Concrete SmvDataSet representation of modules defined in Python. Created exclusively by DataSetRepoPython. Wraps an ISmvModule.
SMV operations that can be applied to grouped data. For example:
df.smvGroupBy("k").smvDecile("amt")
We cannot use the standard Spark GroupedData because its internal DataFrame and keys are not exposed.
Defines whether to keep the name columns and parent columns when performing a rollup operation.
Config to add a prefix_name column in addition to the type and value fields.
Specifies the parent hierarchy's name. When specified, parent_prefix_type/value fields will be added based on the specified hierarchy.
SmvHierarchies is an SmvAncillary which combines a sequence of SmvHierarchy instances.
Through the SmvHierarchyFuncs it provides rollup methods on the hierarchy structure.
object GeoHier extends SmvHierarchies("geo",
  SmvHierarchy("county", ZipRefTable, Seq("zip", "County", "State", "Country")),
  SmvHierarchy("terr", ZipRefTable, Seq("zip", "Territory", "Devision", "Region", "Country"))
)
object MyModule extends SmvModule("...") {
  override def requiresDS() = Seq(...)
  override def requiresAnc() = Seq(GeoHier)
  override def run(...) = {
    ...
    GeoHier.levelRollup(df, "zip3", "State")(
      sum($"v") as "v",
      avg($"v2") as "v2"
    )(SmvHierOpParam(true, Some("terr")))
  }
}
The methods provided by SmvHierarchies, levelRollup, etc., will output
{prefix}_type and {prefix}_value columns. For the above example, they are geo_type and
geo_value. The values of those 2 columns are the name of the original hierarchy level
and its values, respectively. For example:

geo_type, geo_value
zip,      92127
County,   06073
SmvHierarchy combines a hierarchy Map (an SmvOutput) with
the hierarchy structure. The hierarchy sequence is ordered from "small" to "large".
For example:
SmvHierarchy("zipmap", ZipTable, Seq("zip", "county", "state"))
where ZipTable is a SmvOutput which has zip, county, state as its
columns and zip is the primary key (unique) of that table.
SMV dataset wrapper around a Hive table.
Wrapper for a database table accessed via JDBC
Provides a file-based mutex control, or non-reentrant lock.
A typical use should use the SmvLock.withLock method in the companion
object, or follow the idiom below:
val sl = SmvLock("/path/to/my/file.lock")
sl.lock()
try {
  // access lock-protected resource
} finally {
  sl.unlock()
}

The parentheses () are recommended to indicate use of a side effect.
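What such a file-based, non-reentrant lock might look like can be sketched in plain Scala (illustrative only; the real SmvLock implementation and its withLock signature may differ):

```scala
import java.io.File

// Sketch of a file-based mutex: acquisition relies on File.createNewFile,
// which atomically creates the file only if it does not already exist;
// unlock deletes the file so another process can acquire the lock.
class FileMutex(path: String) {
  private val file = new File(path)
  def lock(): Unit = {
    // spin until we win the atomic create
    while (!file.createNewFile()) Thread.sleep(10)
  }
  def unlock(): Unit = file.delete()
}

object FileMutex {
  // Convenience in the spirit of SmvLock.withLock: run a block while holding the lock
  def withLock[T](path: String)(body: => T): T = {
    val m = new FileMutex(path)
    m.lock()
    try body finally m.unlock()
  }
}
```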
Representation of module metadata which can be saved to file.
TODO: Add getter methods and more types of metadata (e.g. validation results)
Interface for updating metadata history.
Base module class. All SMV modules need to extend this class and provide their description and dependency requirements (what they depend on). The module's run method will be provided the results of all dependent inputs, and the result of the run is the result of this module. All modules that depend on this module will be provided the DataFrame result from this module's run method. Note: the module should *not* persist any RDD itself.
Link to an output module in another stage. Because modules in a given stage cannot access modules in another stage, this class enables the user to link an output module from one stage as an input into the current stage.
package stage2.input

object Account1Link extends SmvModuleLink(stage1.Accounts)
Similar to File/Module, a dqm() method can also be overridden in the link.
Instead of a single input file, specify a data dir with files which have the same schema and CsvAttributes.

SmvCsvFile can also take a dir as its path parameter, but then all files are considered
slices. In that case, if none of them has a header, it is equivalent to SmvMultiCsvFiles.
However, if every file has a header, SmvCsvFile will not remove the headers correctly.
A marker trait that indicates that a SmvDataSet/SmvModule decorated with this trait is an output DataSet/module.
Base marker trait for run configuration objects
CSV file schema definition. This class should only be used for parsing/persisting the schema file associated with a CSV file. It is no longer needed as a general purpose schema definition, as Spark now has that covered.
Represents a single raw input file with a given file path. E.g. SmvCsvFile or SmvFrlFile.
Abstraction of the transaction boundary for loading SmvDataSets. A TX object will instantiate a set of repos when it is itself instantiated, and will reuse the same repos for all queries. This means that each new TX object will reload the SmvDataSets from source **once** during its lifetime.

NOTE: Once a new TX is created, the well-formedness of the SmvDataSets provided by the previous TX is not guaranteed. In particular, it may become impossible to run modules from the previous TX.
SmvDataSet that can be configured to return different DataFrames.
Computes the CRC32 checksum for the code of the given class name. The class must be reachable through the configured Java class path.
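A self-contained sketch of how such a checksum can be computed (the helper name classCrc32 is illustrative, not SMV's actual method):

```scala
import java.util.zip.CRC32

// Sketch: stream a class's bytecode off the classpath and feed it to CRC32.
// Assumes the .class file is reachable via the system class loader.
def classCrc32(className: String): Long = {
  val resource = className.replace('.', '/') + ".class"
  val in = ClassLoader.getSystemResourceAsStream(resource)
  require(in != null, s"class file for $className not found on classpath")
  try {
    val crc = new CRC32
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n >= 0) { crc.update(buf, 0, n); n = in.read(buf) }
    crc.getValue
  } finally in.close()
}
```

A checksum like this lets SMV detect when a module's code has changed and its persisted output is stale.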
Factory object to create InputStreamAdapters from input sources
Common entry point for all SMV applications. This is the object that should be provided to spark-submit.
Factory for SmvExtModulePython. Creates an SmvExtModulePython with SmvOutput if the Python dataset is an SmvOutput.
Instead of using a String for the join type, always use the links here. If there is a typo in the join type, using the link in client code will cause a compile-time failure, whereas using the string itself would cause a run-time failure. Spark's (as of 1.4) join type is a String. We could use an enum or case objects here, but there are clients using the String API, so that change is deferred.
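The case-object approach mentioned above can be sketched as follows (names and the set of join types here are illustrative, not SMV's actual constants):

```scala
// Sketch: each join type is a distinct object carrying the String that
// Spark's API expects. A typo like "Iner" fails at compile time because
// no such object exists, instead of failing at run time inside Spark.
sealed abstract class SmvJoinType(val name: String)
case object Inner      extends SmvJoinType("inner")
case object LeftOuter  extends SmvJoinType("leftouter")
case object RightOuter extends SmvJoinType("rightouter")
case object FullOuter  extends SmvJoinType("fullouter")
```

Client code passes e.g. `Inner.name` to Spark's join, keeping the String confined to one place.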
Factory which constructs the correct type of URN object given the URN as a string.
DQM (Data Quality Module) providing classes for DF data quality assurance.
The main class org.tresamigos.smv.dqm.SmvDQM can be used with the SmvApp/Module
framework or on a stand-alone DF.

With the SmvApp/Module framework, a dqm method is defined at the
org.tresamigos.smv.SmvDataSet level, and can be overridden to define DQM rules,
fixes and policies, which will then be checked automatically when the SmvDataSet
gets resolved.

For working on a stand-alone DF, please refer to the SmvDQM class's documentation.
Provides Extended Data Dictionary functions for ad hoc data analysis.
scala> val res1 = df.edd.summary($"amt", $"time")
scala> res1.eddShow
scala> val res2 = df.edd.histogram(AmtHist($"amt"), $"county", Hist($"pop", binSize=1000))
scala> res2.eddShow
scala> res2.saveReport("file/path")
Depending on the data types of the columns, the Edd summary method will perform different statistics.

The histogram method takes a group of HistColumns as parameters.
Alternatively, when a group of Strings is given as the column names, it will use the default
HistColumn parameters.

Two types of HistColumn are supported.

The eddShow method will print the report to the console; saveReport will save the report as an RDD[String],
where the strings are JSON strings.
Predicate functions working with urns
Converts a link urn to the mod urn representing its target
implicitly convert Column to ColumnHelper
implicitly convert DataFrame to SmvDFHelper
implicitly convert SmvGroupedData to GroupedData
implicitly convert SmvGroupedData (created by the smvGroupBy method)
to SmvDFHelper
implicitly convert DataFrame to SmvGroupedData
implicitly convert StructType to StructTypeHelper
implicitly convert Symbol to ColumnHelper
Create an urn for a link from its target fqn
Create an urn for a module from its fqn
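The urn helpers above can be sketched with simple string functions, assuming urns of the form "mod:&lt;fqn&gt;" and "link:&lt;fqn&gt;" (the exact encoding and helper names are assumptions, not SMV's verified API):

```scala
// Hypothetical sketch of urn construction and link-to-mod conversion.
def modUrn(fqn: String): String  = "mod:" + fqn
def linkUrn(fqn: String): String = "link:" + fqn
def isLink(urn: String): Boolean = urn.startsWith("link:")

// Convert a link urn to the mod urn representing its target.
def link2mod(urn: String): String =
  if (isLink(urn)) modUrn(urn.stripPrefix("link:")) else urn

// Recover the fqn from either kind of urn.
def urn2fqn(urn: String): String =
  urn.stripPrefix("mod:").stripPrefix("link:")
```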
Repeatedly changes candidate so that it is not found in the collection.
Useful when choosing a unique column name to add to a data frame.
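A minimal sketch of this idea (the name mkUniq and the "_" prefix are assumptions for illustration):

```scala
// Prepend a character until the candidate no longer collides with any
// existing name; terminates because each step strictly lengthens the name.
@annotation.tailrec
def mkUniq(existing: Set[String], candidate: String): String =
  if (existing.contains(candidate)) mkUniq(existing, "_" + candidate)
  else candidate
```

For example, adding a helper column to a DataFrame whose columns are known avoids silent name clashes.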
Provide functions for the interactive shell.
In SMV's tools/conf/smv_shell_init.scala or project's conf/shell_init.scala add
import org.tresamigos.smv.shell._
Create a UDF from a Map, e.g.

val lookUpGender = smvCreateLookUp(Map(0 -> "m", 1 -> "f"))
val res = df.select(lookUpGender($"sex") as "gender")
IsAny aggregate function. Returns true if any of the values of the aggregated column are true, otherwise false. NOTE: it returns false if all the values are null.
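The IsAny semantics can be captured in plain Scala over nullable booleans (a sketch only; the real aggregate operates on Spark columns):

```scala
// true if any non-null value is true; false when everything is
// false or null (including the all-null case noted above).
def isAny(xs: Seq[Option[Boolean]]): Boolean = xs.flatten.contains(true)
```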
Restores the 1.1 sum behaviour (which is coming back in 1.4), where if all values are null, the sum is 0.
Commonly used functions
1.5
Converts a possible urn to the module's fqn
True if any of the columns is not null
(Since version 2.1) use smvHasNonNull in smvfuncs package instead
smvFirst: by default returns null if the first record is null.

Since Spark's "first" will return the first non-null value, we had to create our own version, smvFirst, which returns the real first value even if it is null. The alternative form will try to return the first non-null value.
the column
switches whether the function will try to find the first non-null value
(Since version 1.6) use the one in smvfuncs package instead
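The two "first" semantics described above can be illustrated in plain Scala over optional values (a sketch only; the real smvFirst is a Spark aggregate function):

```scala
// nonNull = false: the real first value, even if it is null/None.
// nonNull = true:  the first non-null value, like Spark's own "first".
def smvFirst[T](xs: Seq[Option[T]], nonNull: Boolean = false): Option[T] =
  if (nonNull) xs.collectFirst { case Some(v) => v }
  else xs.headOption.flatten
```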
Patch Spark's concat and concat_ws to treat null as empty string in concatenation.
(Since version 2.1) use smvHasNonNull in smvfuncs package instead
Patch Spark's concat and concat_ws to treat null as empty string in concatenation.
(Since version 2.1) use smvHasNonNull in smvfuncs package instead
Spark Modularized View (SMV)
Main classes:
smvSelectPlus
smvStrToTimestamp
smvPivot

Main packages:
df.edd.summary().eddShow
SmvDQM().add(DQMRule(...))