ColumnHelper class provides additional methods/operators on Column
import org.tresamigos.smv
will import the implicit conversion from Column to ColumnHelper.
Policy for validating a module's current metadata against its historical metadata. This policy is added to every module's DQM.
DataSetMgr (DSM) is the entrypoint for SmvApp to load the SmvDataSets in a project. Every DSM method that loads SmvDataSets creates a new transaction, within which all of the indicated SmvDataSets are loaded from the most recent source and resolved. All SmvDataSets provided by DSM are resolved. DSM delegates to DataSetRepo to discover SmvDataSets and to DataSetResolver to load and resolve them. DSM methods like load which look up SmvDataSets by name accept an arbitrary number of names, so that all the target SmvDataSets are loaded within the same transaction (which is much faster).
DataSetRepo is the entity responsible for discovering and loading the datasets in a given language. A new repo is created for each new transaction.
DataSetResolver (DSR) is the entrypoint through which the DataSetMgr acquires SmvDataSets. A DSR object represents a single transaction. Each DSR creates a set of DataSetRepos at instantiation. When asked for an SmvDataSet, DSR queries the repos for that SmvDataSet and resolves it. The SmvDataSet is responsible for resolving itself, given access to the DSR to load/resolve the SmvDataSet's dependencies. DSR caches the SmvDataSets it has already resolved, to ensure that any SmvDataSet is resolved only once.
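The resolve-once caching described above can be sketched in plain Scala (the names Resolver/resolve are illustrative, not SMV's actual API):

```scala
import scala.collection.mutable

// Illustrative sketch of resolve-once caching: a resolver function is
// invoked at most once per key; subsequent requests hit the cache.
// The resolver receives the Resolver itself so it can resolve dependencies.
class Resolver[K, V](resolveOnce: (K, Resolver[K, V]) => V) {
  private val cache = mutable.Map.empty[K, V]
  def resolve(key: K): V =
    cache.getOrElseUpdate(key, resolveOnce(key, this))
}
```

A dataset asked for twice in the same transaction is thus resolved only the first time.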
A module's file name part is stackable, e.g. with Using[SmvRunConfig].
Adapts a Java InputStream object to the IAnyInputStream interface, so it can be used in I/O methods that can work with input streams from both Java and Python sources.
Maps SmvDataSet to DataFrame by FQN. This is the type of the parameter expected by SmvModule's run method.
Subclasses Function1[SmvDataSet, DataFrame] so it can be used the
same way as before, when runParams was type-aliased to
Map[SmvDataSet, DataFrame]
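The backward-compatibility trick above can be sketched with a plain Map-backed Function1 (names here are illustrative, not SMV's exact definitions):

```scala
// Sketch: wrapping a Map in a Function1 subclass so call sites that
// treated runParams as a Map (i.e. called it like a function) still work
// after the type alias was replaced by a real class.
class RunParams[K, V](byKey: Map[K, V]) extends Function1[K, V] {
  def apply(k: K): V = byKey(k)
  def size: Int = byKey.size
}
```

Code that previously wrote `runParams(ds)` against the Map continues to compile unchanged.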
Driver for SMV applications. Most apps do not need to override this class and should just be launched using the SmvApp object (defined below).
Container of all SMV config driven elements (cmd line, app props, user props, etc).
Represents a raw input file with a given file path (can be local or hdfs) and CSV attributes.
A built-in SmvModule from a schema string and a data string.
E.g.
SmvCsvStringData("a:String;b:Double;c:String", "aa,1.0,cc;aa2,3.5,CC")
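The shape of the schema and data strings in the example above can be illustrated with a simplified plain-Scala parser (SMV's real parser handles quoting, type coercion, etc.; this helper name is hypothetical):

```scala
// Simplified sketch: ";" separates schema entries and data rows,
// ":" separates column name from type, "," separates row fields.
def parseCsvStrings(schema: String, data: String): (Seq[(String, String)], Seq[Seq[String]]) = {
  val cols = schema.split(";").toSeq.map { f =>
    val Array(name, tpe) = f.split(":").map(_.trim)
    (name, tpe)
  }
  val rows = data.split(";").toSeq.map(_.split(",", -1).toSeq.map(_.trim))
  (cols, rows)
}
```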
SmvFile and SmvCsvStringData both share the parser validation logic, so the common part is extracted into a new abstract base class: SmvDSWithParser.
Dependency management unit within the SMV application framework. Execution order within the SMV application framework is derived from the dependencies between SmvDataSet instances. Instances of this class can be either a file or a module; in either case, there is a single result DataFrame.
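Deriving execution order from dependencies, as described above, amounts to a topological sort. A minimal sketch (illustrative only, not SMV's actual scheduler):

```scala
// Depth-first topological ordering: each target's dependencies are
// visited (and thus ordered) before the target itself.
def execOrder[T](targets: Seq[T], deps: T => Seq[T]): Seq[T] = {
  val seen = scala.collection.mutable.LinkedHashSet.empty[T]
  def visit(t: T): Unit =
    if (!seen.contains(t)) { deps(t).foreach(visit); seen += t }
  targets.foreach(visit)
  seen.toSeq
}
```

Running a module therefore runs all of its upstream dependencies first, each exactly once.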
Class for declaring datasets defined in another language. Resolves to an instance of SmvExtModulePython.
Declarative class for links to datasets defined in another language. Resolves to a link to an SmvExtModulePython.
Concrete SmvDataSet representation of modules defined in Python. Created exclusively by DataSetRepoPython. Wraps an ISmvModule.
SMV operations that can be applied to grouped data. For example:
df.smvGroupBy("k").smvDecile("amt")
We cannot use the standard Spark GroupedData because its internal DataFrame and keys are not exposed.
Defines whether to keep the name columns and parent columns when performing a rollup operation.
Config to add a prefix_name column in addition to the type and value fields.
Specifies the parent hierarchy's name. When specified, parent_prefix_type/value fields will be added based on the specified hierarchy.
SmvHierarchies is an SmvAncillary which combines a sequence of SmvHierarchy instances.
Through the SmvHierarchyFuncs it provides rollup methods on the hierarchy structure.
object GeoHier extends SmvHierarchies("geo",
  SmvHierarchy("county", ZipRefTable, Seq("zip", "County", "State", "Country")),
  SmvHierarchy("terr", ZipRefTable, Seq("zip", "Territory", "Devision", "Region", "Country"))
)
object MyModule extends SmvModule("...") {
  override def requiresDS() = Seq(...)
  override def requiresAnc() = Seq(GeoHier)
  override def run(...) = {
    ...
    GeoHier.levelRollup(df, "zip3", "State")(
      sum($"v") as "v",
      avg($"v2") as "v2"
    )(SmvHierOpParam(true, Some("terr")))
  }
}
The methods provided by SmvHierarchies, levelRollup, etc., will output
{prefix}_type and {prefix}_value columns. For the above example, they are geo_type and
geo_value. The values of those 2 columns are the name of the original hierarchy level
and its values, respectively. For example:

geo_type, geo_value
zip,      92127
County,   06073
SmvHierarchy combines a hierarchy Map (an SmvOutput) with
the hierarchy structure. The hierarchy sequence is ordered from "small" to "large".
For example:
SmvHierarchy("zipmap", ZipTable, Seq("zip", "county", "state"))
where ZipTable is a SmvOutput which has zip, county, state as its
columns and zip is the primary key (unique) of that table.
SMV dataset wrapper around a Hive table.
Wrapper for a database table accessed via JDBC
Provides a file-based mutex control, or non-reentrant lock.
A typical use should use the SmvLock.withLock method in the companion
object, or follow the idiom below:
val sl = SmvLock("/path/to/my/file.lock")
sl.lock()
try {
  // access lock-protected resource
} finally {
  sl.unlock()
}

The parentheses () are recommended to indicate use of a side effect.
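What such a file-based, non-reentrant lock might look like can be sketched in plain Scala (illustrative only; the real SmvLock implementation and its withLock signature may differ):

```scala
import java.io.File

// Sketch of a file-based mutex: acquisition relies on File.createNewFile,
// which atomically creates the file only if it does not already exist;
// unlock deletes the file so another process can acquire the lock.
class FileMutex(path: String) {
  private val file = new File(path)
  def lock(): Unit = {
    // spin until we win the atomic create
    while (!file.createNewFile()) Thread.sleep(10)
  }
  def unlock(): Unit = file.delete()
}

object FileMutex {
  // Convenience in the spirit of SmvLock.withLock: run a block while holding the lock
  def withLock[T](path: String)(body: => T): T = {
    val m = new FileMutex(path)
    m.lock()
    try body finally m.unlock()
  }
}
```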
Representation of module metadata which can be saved to file.
TODO: Add getter methods and more types of metadata (e.g. validation results)
Interface for updating metadata history.
Base module class. All SMV modules need to extend this class and provide their description and dependency requirements (what they depend on). The module's run method will be provided the results of all dependent inputs, and the result of the run is the result of this module. All modules that depend on this module will be provided the DataFrame result from this module's run method. Note: the module should *not* persist any RDD itself.
Link to an output module in another stage. Because modules in a given stage cannot access modules in another stage, this class enables the user to link an output module from one stage as an input into the current stage.
package stage2.input

object Account1Link extends SmvModuleLink(stage1.Accounts)
Similar to File/Module, a dqm() method can also be overridden in the link.
Instead of a single input file, specify a data dir with files which have the same schema and CsvAttributes.

SmvCsvFile can also take a dir as its path parameter, but then all files are considered
slices. In that case, if none of them has a header, it is equivalent to SmvMultiCsvFiles.
However, if every file has a header, SmvCsvFile will not remove the headers correctly.
A marker trait that indicates that a SmvDataSet/SmvModule decorated with this trait is an output DataSet/module.
Base marker trait for run configuration objects
CSV file schema definition. This class should only be used for parsing/persisting the schema file associated with a CSV file. It is no longer needed as a general purpose schema definition, as Spark now has that covered.
Represents a single raw input file with a given file path. E.g. SmvCsvFile or SmvFrlFile.
Abstraction of the transaction boundary for loading SmvDataSets. A TX object will instantiate a set of repos when it is itself instantiated, and will reuse the same repos for all queries. This means that each new TX object will reload the SmvDataSets from source **once** during its lifetime.

NOTE: Once a new TX is created, the well-formedness of the SmvDataSets provided by the previous TX is not guaranteed. In particular, it may become impossible to run modules from the previous TX.
SmvDataSet that can be configured to return different DataFrames.
Computes the CRC32 checksum for the code of the given class name. The class must be reachable through the configured Java class path.
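A self-contained sketch of how such a checksum can be computed (the helper name classCrc32 is illustrative, not SMV's actual method):

```scala
import java.util.zip.CRC32

// Sketch: stream a class's bytecode off the classpath and feed it to CRC32.
// Assumes the .class file is reachable via the system class loader.
def classCrc32(className: String): Long = {
  val resource = className.replace('.', '/') + ".class"
  val in = ClassLoader.getSystemResourceAsStream(resource)
  require(in != null, s"class file for $className not found on classpath")
  try {
    val crc = new CRC32
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n >= 0) { crc.update(buf, 0, n); n = in.read(buf) }
    crc.getValue
  } finally in.close()
}
```

A checksum like this lets SMV detect when a module's code has changed and its persisted output is stale.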
Factory object to create InputStreamAdapters from input sources
Common entry point for all SMV applications. This is the object that should be provided to spark-submit.
Factory for SmvExtModulePython. Creates an SmvExtModulePython with SmvOutput if the Python dataset is an SmvOutput.
Instead of using a String for the join type, always use the links here. If there is a typo in the join type, using the link in client code will cause a compile-time failure, whereas using the string itself would cause a run-time failure. Spark's (as of 1.4) join type is a String. We could use an enum or case objects here, but there are clients using the String API, so that change is deferred.
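The case-object approach mentioned above can be sketched as follows (names and the set of join types here are illustrative, not SMV's actual constants):

```scala
// Sketch: each join type is a distinct object carrying the String that
// Spark's API expects. A typo like "Iner" fails at compile time because
// no such object exists, instead of failing at run time inside Spark.
sealed abstract class SmvJoinType(val name: String)
case object Inner      extends SmvJoinType("inner")
case object LeftOuter  extends SmvJoinType("leftouter")
case object RightOuter extends SmvJoinType("rightouter")
case object FullOuter  extends SmvJoinType("fullouter")
```

Client code passes e.g. `Inner.name` to Spark's join, keeping the String confined to one place.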
Factory which constructs the correct type of URN object given the URN as a string.
DQM (Data Quality Module) providing classes for DF data quality assurance.
The main class org.tresamigos.smv.dqm.SmvDQM can be used with the SmvApp/Module
framework or on a stand-alone DF.

With the SmvApp/Module framework, a dqm method is defined at the
org.tresamigos.smv.SmvDataSet level, and can be overridden to define DQM rules,
fixes and policies, which will then be checked automatically when the SmvDataSet
gets resolved.

For working on a stand-alone DF, please refer to the SmvDQM class's documentation.
Provides Extended Data Dictionary functions for ad hoc data analysis.
scala> val res1 = df.edd.summary($"amt", $"time")
scala> res1.eddShow
scala> val res2 = df.edd.histogram(AmtHist($"amt"), $"county", Hist($"pop", binSize=1000))
scala> res2.eddShow
scala> res2.saveReport("file/path")
Depending on the data types of the columns, the Edd summary method will perform different statistics.

The histogram method takes a group of HistColumns as parameters.
Alternatively, when a group of Strings is given as the column names, it will use the default
HistColumn parameters.

Two types of HistColumn are supported.

The eddShow method will print the report to the console; saveReport will save the report as an RDD[String],
where the strings are JSON strings.
Predicate functions working with urns
Converts a link urn to the mod urn representing its target
implicitly convert Column to ColumnHelper
implicitly convert DataFrame to SmvDFHelper
implicitly convert SmvGroupedData to GroupedData
implicitly convert SmvGroupedData (created by the smvGroupBy method)
to SmvDFHelper
implicitly convert DataFrame to SmvGroupedData
implicitly convert StructType to StructTypeHelper
implicitly convert Symbol to ColumnHelper
Create an urn for a link from its target fqn
Create an urn for a module from its fqn
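The urn helpers above can be sketched with simple string functions, assuming urns of the form "mod:&lt;fqn&gt;" and "link:&lt;fqn&gt;" (the exact encoding and helper names are assumptions, not SMV's verified API):

```scala
// Hypothetical sketch of urn construction and link-to-mod conversion.
def modUrn(fqn: String): String  = "mod:" + fqn
def linkUrn(fqn: String): String = "link:" + fqn
def isLink(urn: String): Boolean = urn.startsWith("link:")

// Convert a link urn to the mod urn representing its target.
def link2mod(urn: String): String =
  if (isLink(urn)) modUrn(urn.stripPrefix("link:")) else urn

// Recover the fqn from either kind of urn.
def urn2fqn(urn: String): String =
  urn.stripPrefix("mod:").stripPrefix("link:")
```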
Repeatedly changes candidate so that it is not found in the collection.
Useful when choosing a unique column name to add to a data frame.
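A minimal sketch of this idea (the name mkUniq and the "_" prefix are assumptions for illustration):

```scala
// Prepend a character until the candidate no longer collides with any
// existing name; terminates because each step strictly lengthens the name.
@annotation.tailrec
def mkUniq(existing: Set[String], candidate: String): String =
  if (existing.contains(candidate)) mkUniq(existing, "_" + candidate)
  else candidate
```

For example, adding a helper column to a DataFrame whose columns are known avoids silent name clashes.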
Provide functions for the interactive shell.
In SMV's tools/conf/smv_shell_init.scala or project's conf/shell_init.scala add
import org.tresamigos.smv.shell._
Create a UDF from a Map, e.g.

val lookUpGender = smvCreateLookUp(Map(0 -> "m", 1 -> "f"))
val res = df.select(lookUpGender($"sex") as "gender")
IsAny aggregate function. Returns true if any of the values of the aggregated column are true, otherwise false. NOTE: it returns false if all the values are null.
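The IsAny semantics can be captured in plain Scala over nullable booleans (a sketch only; the real aggregate operates on Spark columns):

```scala
// true if any non-null value is true; false when everything is
// false or null (including the all-null case noted above).
def isAny(xs: Seq[Option[Boolean]]): Boolean = xs.flatten.contains(true)
```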
Restores the 1.1 sum behaviour (which is coming back in 1.4), where if all values are null, the sum is 0.
Commonly used functions
1.5
Converts a possible urn to the module's fqn
True if any of the columns is not null
(Since version 2.1) use smvHasNonNull in smvfuncs package instead
smvFirst: by default returns null if the first record is null.

Since Spark's "first" will return the first non-null value, we had to create our own version, smvFirst, which returns the real first value even if it is null. The alternative form will try to return the first non-null value.
the column
switches whether the function will try to find the first non-null value
(Since version 1.6) use the one in smvfuncs package instead
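The two "first" semantics described above can be illustrated in plain Scala over optional values (a sketch only; the real smvFirst is a Spark aggregate function):

```scala
// nonNull = false: the real first value, even if it is null/None.
// nonNull = true:  the first non-null value, like Spark's own "first".
def smvFirst[T](xs: Seq[Option[T]], nonNull: Boolean = false): Option[T] =
  if (nonNull) xs.collectFirst { case Some(v) => v }
  else xs.headOption.flatten
```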
Patch Spark's concat and concat_ws to treat null as empty string in concatenation.
(Since version 2.1) use smvHasNonNull in smvfuncs package instead
Patch Spark's concat and concat_ws to treat null as empty string in concatenation.
(Since version 2.1) use smvHasNonNull in smvfuncs package instead
Spark Modularized View (SMV)
Main classes:
smvSelectPlus
smvStrToTimestamp
smvPivot

Main packages:
df.edd.summary().eddShow
SmvDQM().add(DQMRule(...))