Package

org.tresamigos.smv

dqm

Permalink

package dqm

DQM (Data Quality Module) providing classes for DF data quality assurance

Main class org.tresamigos.smv.dqm.SmvDQM can be used with the SmvApp/Module Framework or on stand-alone DF. With the SmvApp/Module framework, a dqm method is defined on the org.tresamigos.smv.SmvDataSet level, and can be overridden to define DQM rules, fixes and policies, which then will be automatically checked when the SmvDataSet gets resolved.

For working on a stand-alone DF, please refer the SmvDQM class's documentation.

Linear Supertypes
AnyRef, Any
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. dqm
  2. AnyRef
  3. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. All

Type Members

  1. case class DQMFix(condition: Column, fix: Column, fixName: String = null, taskPolicy: DQMTaskPolicy = FailNone) extends DQMTask with Product with Serializable

    Permalink

    DQMFix will fix a column with a default value

    DQMFix will fix a column with a default value

    val f = DQMFix($"age" > 100, lit(100) as "age", "age_cap100", FailNone)

    If "age" greater than 100, make it 100. Task name "age_cap100", which can be referred in the org.tresamigos.smv.dqm.DQMState This task will not trigger a DF fail

  2. abstract class DQMPolicy extends AnyRef

    Permalink

    DQMPolicy defines a requirement on an entire DF

  3. case class DQMRule(rule: Column, ruleName: String = null, taskPolicy: DQMTaskPolicy = FailNone) extends DQMTask with Product with Serializable

    Permalink

    DQMRule defines a requirement on the records of a DF

    DQMRule defines a requirement on the records of a DF

    val r = DQMRule($"a" + $"b" < 100.0, "a_b_sum_lt100", FailPercent(0.01))

    Require the sum of "a" and "b" columns less than 100. Rule name "a_b_sum_lt100", which can be referred in the org.tresamigos.smv.dqm.DQMState If 1% or more of the records fail this rule, the entire DF will fail

  4. class DQMRuleError extends Exception with Serializable

    Permalink
  5. class DQMState extends Serializable

    Permalink

    DQMState keeps tracking of org.tresamigos.smv.dqm.DQMTask behavior on a DF

    DQMState keeps tracking of org.tresamigos.smv.dqm.DQMTask behavior on a DF

    Since the logs are implemented with Aggregators, it need a SparkContext to construct. A list of DQMRule names and a list of DQMFix names are needed also.

  6. abstract class DQMTask extends AnyRef

    Permalink
  7. sealed abstract class DQMTaskPolicy extends AnyRef

    Permalink

    Each DQMTask (DQMRule/DQMFix) need to have a DQMTaskPolicy

  8. class DQMValidator extends AnyRef

    Permalink

    Validates data against DQM rules

  9. case class DqmStateSnapshot(totalRecords: Long, parseError: ErrorReport, fixCounts: Map[String, Int], ruleErrors: Map[String, ErrorReport]) extends Serializable with Product

    Permalink

    A serializable snapshot of a DQMState

  10. case class DqmValidationResult(passed: Boolean, dqmStateSnapshot: DqmStateSnapshot, errorMessages: Seq[(String, String)] = Nil, checkLog: Seq[String] = Nil) extends Product with Serializable

    Permalink

    DqmValidator will generate DqmValidationResult, which has

    DqmValidator will generate DqmValidationResult, which has

    passed

    whether the validation passed or not

    errorMessages

    detailed messages for sub results which the passed flag depends on

    checkLog

    useful logs for reporting

  11. case class ErrorReport(total: Int, firstN: Seq[String]) extends Serializable with Product

    Permalink

    Error recorded by RejectLoggers

  12. case class FailCount(threshold: Int) extends DQMTaskPolicy with Product with Serializable

    Permalink

    Tasks with FailCount(n) will fail the DF if the task is triggered >= n times

  13. case class FailParserCountPolicy(threshold: Int) extends DQMPolicy with Product with Serializable

    Permalink

    If the total time of parser fails >= threshold, fail the DF

  14. case class FailPercent(threshold: Double) extends DQMTaskPolicy with Product with Serializable

    Permalink

    Tasks with FailPercent(r) will fail the DF if the task is triggered >= r percent of the total number of records in the DF.

    Tasks with FailPercent(r) will fail the DF if the task is triggered >= r percent of the total number of records in the DF. "r" is between 0.0 and 1.0

  15. case class FailTotalFixCountPolicy(threshold: Int) extends DQMPolicy with Product with Serializable

    Permalink

    For all the fixes in a DQM, if the total time of them be triggered is >= threshold, the DF will Fail

  16. case class FailTotalFixPercentPolicy(threshold: Double) extends DQMPolicy with Product with Serializable

    Permalink

    For all the fixes in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail.

    For all the fixes in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail. The threshold is between 0.0 and 1.0.

  17. case class FailTotalRuleCountPolicy(threshold: Int) extends DQMPolicy with Product with Serializable

    Permalink

    For all the rules in a DQM, if the total time of them be triggered is >= threshold, the DF will Fail

  18. case class FailTotalRulePercentPolicy(threshold: Double) extends DQMPolicy with Product with Serializable

    Permalink

    For all the rules in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail.

    For all the rules in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail. The threshold is between 0.0 and 1.0.

  19. class SmvDQM extends AnyRef

    Permalink

    DQM class for data quality check and fix

    DQM class for data quality check and fix

    Support 2 types of recode level tasks: Rule and Fix. A "rule" is a requirement on a record, if a record can't satisfy a rule, the record will be filtered. A "fix" is a requirement on a field with a default value, so that it can fix a record. DQM also support different "Policies". Policies are requirements on the entire DF level. A policy is a function on (DF, org.tresamigos.smv.dqm.DQMState). By given a df and the DQMState, which are results from the rules and fixes, a policy determine whether the df is passed the DQM or failed.

    Create a DQM:

    val dqm = SmvDQM().
      add(DQMRule($"amt" < 1000000.0, "rule1", FailAny)).
      add(DQMFix($"age" > 100, lit(100) as "age", "fix1")).
      add(DQMFix($"weight" < 5, lit(5) as "weight", "fix2")).
      add(FailTotalFixCountPolicy(20))

    In this example, "amt" field is required to be lower than one million, if any record does not satisfy it, the DF will fail this DQM. The "age" field will be capped to 100, and the "weight" field will be capped on the lower bound to 5. None of the 2 fixes will trigger a DF fail. However, we added a policy which require no more than 20 fixes in the entire DF, otherwise the DF will fail this DQM.

    Attach DQM to a DF:

    val dfWithDqm = dqm.attachTasks(df)

    Check the DQM policies: Since all the rules and fixes are performed when the DF has an action, user need to make sure that there is one and only one action operation happened on the DF. Please note that actions like "count" might be optimized so that transformations which have no impact on "count" might be totally ignored. If there no natural action to be apply, you may need to do convert DF to RDD first

    dfWithDqm.rdd().count

    After the action, we can check the policies

    val result = dqm.validate(dfWithDqm)

    The result is a org.tresamigos.smv.ValidationResult

  20. case class UDPolicy(_policy: (DataFrame, DQMState) ⇒ Boolean, name: String) extends DQMPolicy with Product with Serializable

    Permalink

Value Members

  1. def BoundRule[T](col: Column, lower: T, upper: T)(implicit arg0: Ordering[T]): DQMRule

    Permalink

    BoundRule requires lower <= col < upper

  2. object DQMPolicy

    Permalink
  3. object DQMValidator

    Permalink
  4. object DqmTaskPolicies

    Permalink

    For access by Python modules

  5. object FailAny extends DQMTaskPolicy with Product with Serializable

    Permalink

    Any rule fail or fix with FailAny will cause the entire DF fail

  6. object FailNone extends DQMTaskPolicy with Product with Serializable

    Permalink

    Task with FailNone will not trigger any DF level policy

  7. def FormatFix(col: Column, fmt: String, default: Any): DQMFix

    Permalink

    FormatFix to assign default if col does not match fmt

  8. def FormatRule(col: Column, fmt: String): DQMRule

    Permalink

    FormatRule requires col matches fmt

  9. def SetFix(col: Column, set: Set[Any], default: Any): DQMFix

    Permalink

    SetFix to assign default if col not in set

  10. def SetRule(col: Column, set: Set[Any]): DQMRule

    Permalink

    SetRule requires col in set

  11. object SmvDQM

    Permalink

Inherited from AnyRef

Inherited from Any

Ungrouped