Package

org.tresamigos

smv

package smv

Spark Modularized View (SMV)

Main classes

Main packages

Linear Supertypes
AnyRef, Any

Type Members

  1. class ColumnHelper extends AnyRef

    ColumnHelper class provides additional methods/operators on Column.

    import org.tresamigos.smv._

    will import the implicit conversion from Column to ColumnHelper.
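
    The implicit-conversion pattern behind this can be sketched in plain Scala. The sketch below is only an illustration: Col and ColHelper are hypothetical stand-ins for Spark's Column and for ColumnHelper, and withSuffix is an invented method, not part of SMV.

    ```scala
    import scala.language.implicitConversions

    object ImplicitHelperSketch {
      // Hypothetical stand-in for Spark's Column
      final case class Col(name: String)

      // Hypothetical stand-in for ColumnHelper: adds a method that Col itself lacks
      class ColHelper(c: Col) {
        def withSuffix(s: String): Col = Col(c.name + s)
      }

      // With this implicit in scope, ColHelper's methods become callable on any Col
      implicit def makeColHelper(c: Col): ColHelper = new ColHelper(c)

      // withSuffix is not defined on Col; the compiler inserts makeColHelper
      def demo(): Col = Col("amt").withSuffix("_usd")
    }
    ```

    Client code outside the defining object would need the wildcard import of that object for the conversion to be in scope.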

  2. case class CsvAttributes(delimiter: Char = ',', quotechar: Char = '\"', hasHeader: Boolean = false) extends Product with Serializable

  3. class DQMMetadataPolicy extends DQMPolicy

    Policy for validating a module's current metadata against its historical metadata. This policy is added to every module's DQM

  4. class DataSetMgr extends AnyRef

    DataSetMgr (DSM) is the entrypoint for SmvApp to load the SmvDataSets in a project. Every DSM method that loads SmvDataSets creates a new transaction, within which all of the indicated SmvDataSets are loaded from the most recent source and resolved. All SmvDataSets provided by DSM are resolved. DSM delegates to DataSetRepo to discover SmvDataSets and to DataSetResolver to load and resolve them. DSM methods like load, which look up SmvDataSets by name, accept an arbitrary number of names so that all the target SmvDataSets are loaded within the same transaction (which is much faster).

  5. abstract class DataSetRepo extends AnyRef

    DataSetRepo is the entity responsible for discovering and loading the datasets in a given language. A new repo is created for each new transaction.

  6. abstract class DataSetRepoFactory extends AnyRef

  7. class DataSetRepoFactoryPython extends DataSetRepoFactory with InterfacesWithPy4J

  8. class DataSetRepoFactoryScala extends DataSetRepoFactory

  9. class DataSetRepoPython extends DataSetRepo with InterfacesWithPy4J

  10. class DataSetRepoScala extends DataSetRepo

  11. class DataSetResolver extends AnyRef

    DataSetResolver (DSR) is the entrypoint through which the DataSetMgr acquires SmvDataSets. A DSR object represents a single transaction. Each DSR creates a set of DataSetRepos at instantiation. When asked for an SmvDataSet, DSR queries the repos for that SmvDataSet and resolves it. The SmvDataSet is responsible for resolving itself, given access to the DSR to load/resolve the SmvDataSet's dependencies. DSR caches the SmvDataSets it has already resolved to ensure that any SmvDataSet is resolved only once.
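
    The resolve-once caching described above can be illustrated with a small, self-contained sketch. Everything here (Spec, Resolved, Resolver, resolveCount) is a hypothetical model and not the actual DataSetResolver code; it assumes an acyclic dependency graph and does no cycle detection.

    ```scala
    import scala.collection.mutable

    object ResolverCacheSketch {
      // Hypothetical unresolved dataset: a name plus the names of its dependencies
      final case class Spec(name: String, deps: Seq[String])
      // Hypothetical resolved dataset: dependencies are resolved objects
      final class Resolved(val name: String, val deps: Seq[Resolved])

      // One Resolver instance models one transaction
      final class Resolver(repo: Map[String, Spec]) {
        private val cache = mutable.Map.empty[String, Resolved]
        var resolveCount = 0 // counts actual resolutions, for demonstration

        def load(name: String): Resolved = cache.get(name) match {
          case Some(done) => done // already resolved in this transaction
          case None =>
            val spec = repo(name)
            // the dataset resolves its own dependencies through the resolver
            val resolved = new Resolved(name, spec.deps.map(load))
            resolveCount += 1
            cache(name) = resolved
            resolved
        }
      }
    }
    ```

    Loading a dataset whose dependencies overlap resolves each shared dependency a single time, and repeated calls return the same cached instance.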

  12. trait FilenamePart extends AnyRef

    A module's file name part is stackable, e.g. with Using[SmvRunConfig]

  13. trait IAnyInputStream extends AnyRef

  14. trait IDataSetRepoFactoryPy4J extends AnyRef

  15. trait IDataSetRepoPy4J extends AnyRef

  16. trait IPythonResponsePy4J[T] extends AnyRef

  17. trait ISmvModule extends AnyRef

  18. class InputStreamAdapter extends IAnyInputStream

    Adapts a Java InputStream object to the IAnyInputStream interface, so it can be used in I/O methods that can work with input streams from both Java and Python sources.

  19. case class LinkURN(fqn: String) extends URN with Product with Serializable

  20. case class ModURN(fqn: String) extends URN with Product with Serializable

  21. class RunParams extends (SmvDataSet) ⇒ DataFrame

    Maps SmvDataSet to DataFrame by FQN. This is the type of the parameter expected by SmvModule's run method.

    Subclasses Function1[SmvDataSet, DataFrame] so it can be used the same way as before, when runParams was type-aliased to Map[SmvDataSet, DataFrame]
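
    As an illustration of why subclassing Function1 keeps the old call syntax working, here is a minimal sketch. DataSet, Frame and Params are hypothetical stand-ins for SmvDataSet, DataFrame and RunParams; the constructor shown is an assumption, not the real one.

    ```scala
    object RunParamsSketch {
      // Hypothetical stand-ins for SmvDataSet and DataFrame
      final case class DataSet(fqn: String)
      final case class Frame(rows: Int)

      // Looks results up by FQN, yet is applied exactly like a Map[DataSet, Frame]
      class Params(byFqn: Map[String, Frame]) extends (DataSet => Frame) {
        override def apply(ds: DataSet): Frame = byFqn(ds.fqn)
      }

      val accounts = DataSet("stage1.Accounts")
      val params = new Params(Map("stage1.Accounts" -> Frame(3)))

      // Same call syntax as a Map lookup: params(dataset)
      val df: Frame = params(accounts)
    }
    ```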

  22. class SchemaDiscoveryHelper extends AnyRef

  23. abstract class SmvAncillary extends AnyRef

  24. class SmvApp extends AnyRef

    Driver for SMV applications. Most apps do not need to override this class and should just be launched using the SmvApp object (defined below)

  25. class SmvConfig extends AnyRef

    Container of all SMV config driven elements (cmd line, app props, user props, etc).

  26. class SmvCsvFile extends SmvSingleFile

    Represents a raw input file with a given file path (can be local or HDFS) and CSV attributes.

  27. class SmvCsvStringData extends SmvInputDataSet with SmvDSWithParser

    A built-in SmvModule created from a schema string and a data string, e.g.

    SmvCsvStringData("a:String;b:Double;c:String", "aa,1.0,cc;aa2,3.5,CC")
  28. class SmvDFHelper extends AnyRef

  29. trait SmvDSWithParser extends SmvDataSet

    SmvFile and SmvCsvStringData share the parser validation logic; that common part is extracted into this new ABC, SmvDSWithParser.

  30. abstract class SmvDataSet extends FilenamePart

    Dependency management unit within the SMV application framework. Execution order within the SMV application framework is derived from dependency between SmvDataSet instances. Instances of this class can either be a file or a module. In either case, there would be a single result DataFrame.

  31. class SmvDqmValidationError extends SmvRuntimeException

  32. case class SmvExtModule(modFqn: String) extends SmvModule with Product with Serializable

    Class for declaring datasets defined in another language. Resolves to an instance of SmvExtModulePython.

  33. case class SmvExtModuleLink(modFqn: String) extends SmvModuleLink with Product with Serializable

    Declarative class for links to datasets defined in another language. Resolves to a link to an SmvExtModulePython.

  34. class SmvExtModulePython extends SmvDataSet with InterfacesWithPy4J

    Concrete SmvDataSet representation of modules defined in Python. Created exclusively by DataSetRepoPython. Wraps an ISmvModule.

  35. abstract class SmvFile extends SmvInputDataSet with SmvDSWithParser

  36. class SmvFrlFile extends SmvSingleFile

  37. class SmvGroupedDataFunc extends AnyRef

    SMV operations that can be applied to grouped data. For example:

    df.smvGroupBy("k").smvDecile("amt")

    We can not use the standard Spark GroupedData because the internal DataFrame and keys are not exposed.

  38. case class SmvHierOpParam(hasName: Boolean, parentHier: Option[String]) extends Product with Serializable

    Defines whether to keep name columns and parent columns when performing a rollup operation.

    hasName

    whether to add a prefix_name column in addition to the type and value fields

    parentHier

    the parent hierarchy's name; when specified, parent_prefix_type/value fields are added based on that hierarchy

  39. class SmvHierarchies extends SmvAncillary

    SmvHierarchies is a SmvAncillary which combines a sequence of SmvHierarchy. Through the SmvHierarchyFuncs it provides rollup methods on the hierarchy structure.

    Define an SmvHierarchies
    object GeoHier extends SmvHierarchies("geo",
      SmvHierarchy("county", ZipRefTable, Seq("zip", "County", "State", "Country")),
      SmvHierarchy("terr", ZipRefTable, Seq("zip", "Territory", "Devision", "Region", "Country"))
    )
    Use the SmvHierarchies
    object MyModule extends SmvModule("...") {
       override def requiresDS() = Seq(...)
       override def requiresAnc() = Seq(GeoHier)
       override def run(...) = {
         ...
         GeoHier.levelRollup(df, "zip3", "State")(
           sum($"v") as "v",
           avg($"v2") as "v2"
         )(SmvHierOpParam(true, Some("terr")))
       }
    }

    The methods provided by SmvHierarchies (levelRollup, etc.) output {prefix}_type and {prefix}_value columns. In the example above, they are geo_type and geo_value. These two columns hold the name of the original hierarchy level and its value, respectively. For example:

    geo_type, geo_value
    zip,      92127
    County,   06073
  40. case class SmvHierarchy(name: String, hierarchyMap: SmvOutput, hierarchy: Seq[String], nameColPostfix: String = "_name") extends Product with Serializable

    SmvHierarchy combines a hierarchy Map (a SmvOutput) with the hierarchy structure. The hierarchy sequence is ordered from "small" to "large". For example:

    SmvHierarchy("zipmap", ZipTable, Seq("zip", "county", "state"))

    where ZipTable is a SmvOutput which has zip, county, state as its columns and zip is the primary key (unique) of that table.

  41. class SmvHiveTable extends SmvInputDataSet

    SMV Dataset Wrapper around a Hive table.

  42. class SmvJdbcTable extends SmvInputDataSet

    Wrapper for a database table accessed via JDBC

  43. trait SmvKeys extends AnyRef

  44. case class SmvLock(path: String, timeout: Long = Long.MaxValue) extends Product with Serializable

    Provides a file-based mutex control, or non-reentrant lock.

    A typical use should use the SmvLock.withLock method in the companion object, or follow the idiom below:

    val sl = SmvLock("/path/to/my/file.lock")
    sl.lock()
    try {
      // access lock-protected resource
    } finally {
      sl.unlock()
    }

    The parentheses () are recommended to indicate use of a side effect.
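
    For readers unfamiliar with file-based locks, the sketch below shows one common way such a mutex can work: atomically creating the lock file to acquire, deleting it to release. It is a hypothetical illustration and makes no claim about how SmvLock is implemented; withFileLock, timeoutMs and pollMs are invented names.

    ```scala
    import java.nio.file.{FileAlreadyExistsException, Files, Paths}
    import java.util.concurrent.TimeoutException

    object FileLockSketch {
      // Hypothetical helper: runs body while holding a lock file at the given path
      def withFileLock[T](path: String, timeoutMs: Long = 5000L, pollMs: Long = 50L)(body: => T): T = {
        val lockFile = Paths.get(path)
        val deadline = System.currentTimeMillis() + timeoutMs
        var acquired = false
        while (!acquired) {
          try {
            Files.createFile(lockFile) // atomic: fails if the file already exists
            acquired = true
          } catch {
            case _: FileAlreadyExistsException =>
              if (System.currentTimeMillis() >= deadline)
                throw new TimeoutException(s"could not acquire lock $path")
              Thread.sleep(pollMs) // another holder exists; wait and retry
          }
        }
        // always release, even if body throws
        try body finally Files.deleteIfExists(lockFile)
      }
    }
    ```

    Because acquisition is "create the file or fail", a second attempt by the same holder would block as well, which matches the non-reentrant behaviour described above.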

  45. class SmvMetadata extends AnyRef

    Representation of module metadata which can be saved to file.

    TODO: Add getter methods and more types of metadata (e.g. validation results)

  46. class SmvMetadataHistory extends AnyRef

    Interface for updating metadata history.

  47. abstract class SmvModule extends SmvDataSet

    Base module class. All SMV modules need to extend this class and provide their description and dependency requirements (what they depend on). The module's run method is given the results of all dependent inputs, and the result of run is the result of this module. All modules that depend on this module are provided the DataFrame result from its run method. Note: the module should *not* persist any RDD itself.

  48. class SmvModuleLink extends SmvModule

    Link to an output module in another stage. Because modules in a given stage can not access modules in another stage, this class enables the user to link an output module from one stage as an input into current stage.

    package stage2.input
    
    object Account1Link extends SmvModuleLink(stage1.Accounts)

    Similar to File/Module, a dqm() method can also be overridden in the link.

  49. class SmvMultiCsvFiles extends SmvFile

    Instead of a single input file, specify a data dir with files which have the same schema and CsvAttributes.

    SmvCsvFile can also take a dir as its path parameter, but all files are then considered slices. In that case, if none of them has a header, it is equivalent to SmvMultiCsvFiles. However, if every file has a header, SmvCsvFile will not remove the headers correctly.

  50. class SmvMultiJoin extends AnyRef

  51. trait SmvOutput extends AnyRef

    A marker trait that indicates that a SmvDataSet/SmvModule decorated with this trait is an output DataSet/module.

  52. trait SmvRunConfig extends AnyRef

    Base marker trait for run configuration objects

  53. case class SmvRunInfo(validation: DqmValidationResult, metadata: SmvMetadata, metadataHistory: SmvMetadataHistory) extends Product with Serializable

  54. class SmvRunInfoCollector extends AnyRef

  55. class SmvRuntimeException extends RuntimeException

  56. class SmvSchema extends Serializable

    CSV file schema definition. This class should only be used for parsing/persisting the schema file associated with a CSV file. It is no longer needed as a general purpose schema definition as Spark now has that covered.

  57. abstract class SmvSingleFile extends SmvFile

    Represents a single raw input file with a given file path. E.g. SmvCsvFile or SmvFrlFile

  58. class SmvUnsupportedType extends SmvRuntimeException

  59. class TX extends AnyRef

    Abstraction of the transaction boundary for loading SmvDataSets. A TX object will instantiate a set of repos when it is itself instantiated and will reuse the same repos for all queries. This means that each new TX object will reload the SmvDataSet from source **once** during its lifetime.

    NOTE: Once a new TX is created, the well-formedness of the SmvDataSets provided by the previous TX is not guaranteed. Particularly it may become impossible to run modules from the previous TX.

  60. sealed abstract class URN extends AnyRef

  61. trait Using[+T <: SmvRunConfig] extends FilenamePart

    SmvDataSet that can be configured to return different DataFrames.

Value Members

  1. object ClassCRC

    Computes the CRC32 checksum for the code of the given class name. The class must be reachable through the configured Java class path.
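
    A rough sketch of the idea, using only the JDK: locate the class's .class resource on the class path and feed its bytes to java.util.zip.CRC32. The names crcOf and classCrc are hypothetical, and the real ClassCRC may read or normalize the class bytes differently.

    ```scala
    import java.io.ByteArrayOutputStream
    import java.util.zip.CRC32

    object ClassCrcSketch {
      // CRC32 checksum of a byte array
      def crcOf(bytes: Array[Byte]): Long = {
        val crc = new CRC32()
        crc.update(bytes)
        crc.getValue
      }

      // Hypothetical: checksum the bytes of the .class resource for className
      def classCrc(className: String): Long = {
        val resource = "/" + className.replace('.', '/') + ".class"
        val in = getClass.getResourceAsStream(resource)
        require(in != null, s"class not reachable on the class path: $className")
        try {
          val buf = new ByteArrayOutputStream()
          val chunk = new Array[Byte](4096)
          var n = in.read(chunk)
          while (n >= 0) { buf.write(chunk, 0, n); n = in.read(chunk) }
          crcOf(buf.toByteArray)
        } finally in.close()
      }
    }
    ```

    A checksum of this kind changes whenever the compiled code of the class changes, which makes it usable as a cheap "has this class been modified" signal.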

  2. object CsvAttributes extends Serializable

  3. object InputStreamAdapter

    Factory object to create InputStreamAdapters from input sources

  4. val LinkDsPrefix: String

  5. val ModDsPrefix: String

  6. object SmvApp

    Common entry point for all SMV applications. This is the object that should be provided to spark-submit.

  7. object SmvConfig

  8. object SmvCsvFile

  9. object SmvCsvStringData

  10. object SmvExtModulePython extends InterfacesWithPy4J

    Factory for SmvExtModulePython. Creates an SmvExtModulePython with SmvOutput if the Python dataset is an SmvOutput.

  11. object SmvFrlFile

  12. object SmvHiveTable

  13. object SmvJoinType

    Instead of using String for join type, always use the link here.

    If there is a typo in the join type, using the link in client code causes a compile-time failure, whereas using the string itself causes a run-time failure.

    Spark's join type (as of 1.4) is a String. We could use an enum or case objects here, but there are clients using the String API, so that change is postponed.

  14. object SmvLock extends Serializable

  15. object SmvMetadata

  16. object SmvMetadataHistory

  17. object SmvSchema extends Serializable

  18. object URN

    Factory which constructs the correct type of URN object given the URN as a string.

  19. package cds

  20. package classloaders

  21. package dqm

    DQM (Data Quality Module) providing classes for DF data quality assurance

    Main class org.tresamigos.smv.dqm.SmvDQM can be used with the SmvApp/Module Framework or on stand-alone DF. With the SmvApp/Module framework, a dqm method is defined on the org.tresamigos.smv.SmvDataSet level, and can be overridden to define DQM rules, fixes and policies, which then will be automatically checked when the SmvDataSet gets resolved.

    For working on a stand-alone DF, please refer to the SmvDQM class's documentation.

  22. package edd

    Provides Extended Data Dictionary functions for ad hoc data analysis

    scala> val res1 = df.edd.summary($"amt", $"time")
    scala> res1.eddShow
    
    scala> val res2 = df.edd.histogram(AmtHist($"amt"), $"county", Hist($"pop", binSize=1000))
    scala> res2.eddShow
    scala> res2.saveReport("file/path")

    Depending on the data types of the columns, the Edd summary method computes different statistics.

    The histogram method takes a group of HistColumn as parameters. When a group of Strings is given as column names instead, it uses the default HistColumn parameters. Two types of HistColumns are supported.

    The eddShow method prints the report to the console; saveReport saves the report as an RDD[String], where the strings are JSON strings.

  23. package git

  24. object histBoolean extends Histogram

  25. object histDouble extends Histogram

  26. object histInt extends Histogram

  27. object histStr extends Histogram

  28. def isLink(modUrn: String): Boolean

    Predicate functions working with urn

  29. def link2mod(linkUrn: String): String

    Converts a link urn to the mod urn representing its target

  30. implicit def makeColHelper(col: Column): ColumnHelper

    implicitly convert Column to ColumnHelper

  31. implicit def makeDFHelper(df: DataFrame): SmvDFHelper

    implicitly convert DataFrame to SmvDFHelper

  32. implicit def makeSmvGDCvrt(sgd: SmvGroupedData): RelationalGroupedDataset

    implicitly convert SmvGroupedData to GroupedData

  33. implicit def makeSmvGDFunc(sgd: SmvGroupedData): SmvGroupedDataFunc

    implicitly convert SmvGroupedData (created by smvGroupBy method) to SmvGroupedDataFunc

  34. implicit def makeSmvGroupedData(df: DataFrame): SmvGroupedData

    implicitly convert DataFrame to SmvGroupedData

  35. implicit def makeStructTypeHelper(schema: StructType): StructTypeHelper

    implicitly convert StructType to StructTypeHelper

    Annotations
    @DeveloperApi()
  36. implicit def makeSymColHelper(sym: Symbol): ColumnHelper

    implicitly convert Symbol to ColumnHelper

  37. package matcher

  38. package matcher_old

  39. object mfvInt extends MostFrequentValue

  40. object mfvStr extends MostFrequentValue

  41. def mkLinkUrn(targetFqn: String): String

    Create an urn for a link from its target fqn

  42. def mkModUrn(modFqn: String): String

    Create an urn for a module from its fqn

  43. def mkUniq(collection: Seq[String], candidate: String, ignoreCase: Boolean = false, postfix: String = null): String

    Repeatedly changes candidate so that it is not found in the collection.

    Useful when choosing a unique column name to add to a data frame.

    Annotations
    @tailrec()
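
    One possible shape for such a function is sketched below. It is an assumption-laden illustration, not the SMV implementation: in particular, the page does not say how the candidate is changed on each step, so this sketch simply appends a postfix (defaulting to "_"), and the name mkUniqSketch is hypothetical.

    ```scala
    import scala.annotation.tailrec

    object MkUniqSketch {
      // Hypothetical sketch: append postfix until the candidate is absent from collection
      @tailrec
      def mkUniqSketch(collection: Seq[String], candidate: String,
                       ignoreCase: Boolean = false, postfix: String = "_"): String = {
        val taken =
          if (ignoreCase) collection.exists(_.equalsIgnoreCase(candidate))
          else collection.contains(candidate)
        if (taken) mkUniqSketch(collection, candidate + postfix, ignoreCase, postfix)
        else candidate
      }
    }
    ```

    With existing column names Seq("id", "id_"), the candidate "id" is changed twice before a free name is found.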
  44. package panel

  45. package python

  46. package shell

    Provide functions for the interactive shell

    In SMV's tools/conf/smv_shell_init.scala or project's conf/shell_init.scala add

    import org.tresamigos.smv.shell._
  47. def smvCreateLookUp[S, D](map: Map[S, D])(implicit st: scala.reflect.api.JavaUniverse.TypeTag[S], dt: scala.reflect.api.JavaUniverse.TypeTag[D]): UserDefinedFunction

    Creates a UDF from a map, e.g.

    val lookUpGender = smvCreateLookUp(Map(0->"m", 1->"f"))
    val res = df.select(lookUpGender($"sex") as "gender")
  48. def smvIsAny(col: Column): Column

    IsAny aggregate function. Returns true if any of the values of the aggregated column are true, otherwise false. NOTE: It returns false if all the values are null.
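
    The null handling can be made concrete with a plain Scala model in which None stands for a SQL null. This is only a sketch of the documented semantics; the function name isAny is hypothetical and no Spark code is involved.

    ```scala
    object IsAnySketch {
      // None models a null value in the aggregated column
      def isAny(values: Seq[Option[Boolean]]): Boolean =
        values.exists(_.contains(true))
    }
    ```

    Under this model an all-null group and a group containing only false both yield false.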

  49. def smvSum0(col: Column): Column

    Restores the 1.1 sum behaviour (which is also coming back in 1.4): if all values are null, the sum is 0.
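
    The difference can be shown with a plain Scala model where None stands for null. The sketch assumes the usual SQL convention that a sum over only nulls is itself null; sqlSum and sum0 are hypothetical names, not SMV or Spark functions.

    ```scala
    object Sum0Sketch {
      // Null-skipping sum that yields None (null) when there is nothing to add
      def sqlSum(values: Seq[Option[Double]]): Option[Double] = {
        val present = values.flatten
        if (present.isEmpty) None else Some(present.sum)
      }

      // Same sum, but an all-null input gives 0 instead of null
      def sum0(values: Seq[Option[Double]]): Double =
        sqlSum(values).getOrElse(0.0)
    }
    ```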

  50. object smvfuncs

    Commonly used functions

    Since

    1.5

  51. def urn2fqn(modUrn: String): String

    Converts a possible urn to the module's fqn
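
    How the URN helpers on this page (isLink, link2mod, mkLinkUrn, mkModUrn, urn2fqn) could fit together is sketched below. The prefix values "mod:" and "link:" are assumptions, since the values of ModDsPrefix and LinkDsPrefix are not shown here, and the bodies are illustrative rather than the actual implementations.

    ```scala
    object UrnSketch {
      // Assumed prefix values; the real ModDsPrefix and LinkDsPrefix are not given on this page
      val ModPrefix = "mod:"
      val LinkPrefix = "link:"

      def mkModUrn(modFqn: String): String = ModPrefix + modFqn
      def mkLinkUrn(targetFqn: String): String = LinkPrefix + targetFqn

      def isLink(urn: String): Boolean = urn.startsWith(LinkPrefix)

      // A link urn and the mod urn of its target differ only in the prefix
      def link2mod(linkUrn: String): String = mkModUrn(linkUrn.stripPrefix(LinkPrefix))

      // Accepts either kind of urn, or a bare fqn, and returns the fqn
      def urn2fqn(urn: String): String = urn.stripPrefix(ModPrefix).stripPrefix(LinkPrefix)
    }
    ```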

  52. package util


Deprecated Value Members

  1. def hasNonNull(columns: Column*): Column

    True if any of the columns is not null

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1) use smvHasNonNull in smvfuncs package instead

  2. def smvFirst(c: Column, nonNull: Boolean = false): Column

    smvFirst: by default, returns null if the first record is null.

    Since Spark's "first" returns the first non-null value, we have to create our own version, smvFirst, which returns the real first value, even if it's null. The alternative form will try to return the first non-null value.

    c

    the column

    nonNull

    switches whether the function will try to find the first non-null value

    Annotations
    @deprecated
    Deprecated

    (Since version 1.6) use the one in smvfuncs package instead
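
    The two modes can be modelled in plain Scala with None standing for null. This is a sketch of the documented semantics only; the function name first is hypothetical and nothing here touches Spark.

    ```scala
    object FirstSketch {
      // None models a null record value
      def first[A](values: Seq[Option[A]], nonNull: Boolean = false): Option[A] =
        if (nonNull) values.flatten.headOption // first non-null value, if any
        else values.headOption.flatten         // the real first value, even if null
    }
    ```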

  3. def smvStrCat(sep: String, columns: Column*): Column

    Patch Spark's concat and concat_ws to treat null as empty string in concatenation.

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1) use smvHasNonNull in smvfuncs package instead

  4. def smvStrCat(columns: Column*): Column

    Patch Spark's concat and concat_ws to treat null as empty string in concatenation.

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1) use smvHasNonNull in smvfuncs package instead
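
    "Treat null as empty string" can be read as in the plain Scala model below, where None stands for null. The name strCat is hypothetical, and the sketch does not cover any special handling the real function may apply, for example when every input is null.

    ```scala
    object StrCatSketch {
      // Null (None) inputs contribute an empty string; separators are still emitted
      def strCat(sep: String, values: Option[String]*): String =
        values.map(_.getOrElse("")).mkString(sep)
    }
    ```

    Under this reading a null in the middle leaves two adjacent separators rather than being skipped.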
