
org.tresamigos.smv.dqm

SmvDQM


class SmvDQM extends AnyRef

DQM class for data quality checks and fixes

Supports two types of record-level tasks: Rule and Fix. A "rule" is a requirement on a record; a record that does not satisfy a rule is filtered out. A "fix" is a requirement on a field, paired with a default value, so that a record violating it can be repaired instead of dropped. DQM also supports "policies", which are requirements at the level of the entire DF. A policy is a function on (DF, org.tresamigos.smv.dqm.DQMState). Given a DF and the DQMState, which holds the results of the rules and fixes, a policy determines whether the DF passes or fails the DQM.

Create a DQM:

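// Assumes `$` is in scope via the SQLContext/SparkSession implicits.
import org.apache.spark.sql.functions.lit
import org.tresamigos.smv.dqm._
val dqm = SmvDQM()
  .add(DQMRule($"amt" < 1000000.0, "rule1", FailAny))
  .add(DQMFix($"age" > 100, lit(100) as "age", "fix1"))
  .add(DQMFix($"weight" < 5, lit(5) as "weight", "fix2"))
  .add(FailTotalFixCountPolicy(20))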

In this example, the "amt" field is required to be less than one million; if any record does not satisfy this rule, the DF fails this DQM. The "age" field is capped at 100, and the "weight" field is floored at 5. Neither of the two fixes will by itself cause the DF to fail. However, the added policy requires no more than 20 fixes across the entire DF; otherwise the DF fails this DQM.
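
The semantics of this example can be sketched in plain Scala, without Spark or SMV. This is only a toy model under stated assumptions, not how SmvDQM is implemented: the names Rec, State, DqmSketch, applyTasks and passes are hypothetical, fixes are counted only on records that survive the rule, and the policy threshold is read as "at most 20", following the wording above.

```scala
// Toy model of the example in plain Scala (no Spark, no SMV).
// All names here (Rec, State, DqmSketch, ...) are hypothetical.
final case class Rec(amt: Double, age: Int, weight: Int)

// Counts gathered while records are processed, loosely mirroring DQMState.
final case class State(ruleFailures: Int, fixCount: Int)

object DqmSketch {
  // Apply rule1 and the two fixes record by record.
  def applyTasks(recs: Seq[Rec]): (Seq[Rec], State) = {
    var ruleFailures = 0
    var fixCount = 0
    val kept = recs.flatMap { r =>
      if (!(r.amt < 1000000.0)) {
        ruleFailures += 1 // rule1 violated: the record is filtered out
        None
      } else {
        // fix1: cap age at 100; fix2: floor weight at 5
        val age = if (r.age > 100) { fixCount += 1; 100 } else r.age
        val weight = if (r.weight < 5) { fixCount += 1; 5 } else r.weight
        Some(r.copy(age = age, weight = weight))
      }
    }
    (kept, State(ruleFailures, fixCount))
  }

  // DF-level decision: any rule1 failure fails the DF (FailAny),
  // and so does exceeding maxFixes fixes in total (assumed inclusive bound).
  def passes(state: State, maxFixes: Int = 20): Boolean =
    state.ruleFailures == 0 && state.fixCount <= maxFixes
}
```

Note how the decision in passes depends only on the accumulated counts, not on the individual records. This mirrors the split between record-level tasks and DF-level policies.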

Attach DQM to a DF:

val dfWithDqm = dqm.attachTasks(df)

Check the DQM policies:

All rules and fixes are evaluated lazily, when an action runs on the DF, so the user needs to make sure that one and only one action is performed on the DF. Note that an action such as "count" may be optimized so that transformations with no effect on the count are skipped entirely. If there is no natural action to apply, you may need to convert the DF to an RDD first:

dfWithDqm.rdd.count

After the action, we can check the policies:

val result = dqm.validate(dfWithDqm)

The result is an org.tresamigos.smv.ValidationResult
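
The "one and only one action" requirement can be illustrated with a plain-Scala analogy. The sketch below uses a lazy collection view in place of a DF and a local counter in place of DQM's internal state. It is not Spark or SMV code, and the names LazyActionSketch and counts are hypothetical. It shows that a side-effecting counter inside a lazy pipeline stays at zero until the pipeline is traversed, and is incremented again on every further traversal.

```scala
object LazyActionSketch {
  // Returns the counter value before any traversal, after one, and after two.
  def counts(): (Int, Int, Int) = {
    var fixCount = 0 // stands in for DQM's internal counters

    // A lazy pipeline: defining it runs nothing, much like a DF transformation.
    val capped = Vector(120, 30, 150).view.map { age =>
      if (age > 100) { fixCount += 1; 100 } else age
    }
    val before = fixCount

    capped.foreach(_ => ()) // first "action": the mapping runs once per element
    val afterOne = fixCount

    capped.foreach(_ => ()) // second "action": the mapping runs again
    val afterTwo = fixCount

    (before, afterOne, afterTwo)
  }

  def main(args: Array[String]): Unit =
    println(counts()) // prints (0,2,4)
}
```

With no traversal the counts are empty, and with two traversals they are doubled. Presumably this is why the policies should be checked only after exactly one action on the DF.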

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new SmvDQM(rules: Seq[DQMRule] = Nil, fixes: Seq[DQMFix] = Nil, policies: Seq[DQMPolicy] = Nil, needAction: Boolean = false)


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def add(policy: DQMPolicy): SmvDQM

  5. def add(fix: DQMFix): SmvDQM

  6. def add(rule: DQMRule): SmvDQM

  7. def addAction(): SmvDQM

  8. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  9. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  10. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  12. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  14. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  15. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  16. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  17. val needAction: Boolean

  18. final def notify(): Unit

    Definition Classes
    AnyRef
  19. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  20. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  21. def toString(): String

    Definition Classes
    AnyRef → Any
  22. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  24. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
