DQM supports two types of record-level tasks: rules and fixes. A "rule" is a requirement on a record;
if a record does not satisfy a rule, the record is filtered out. A "fix" is a requirement on a field,
paired with a default value that replaces the field when the requirement is not met, so the record is fixed.
DQM also supports "policies". Policies are requirements at the level of the entire DF. A policy
is a function on (DF, org.tresamigos.smv.dqm.DQMState). Given a DF and the DQMState, which holds
the results of the rules and fixes, a policy determines whether the DF passes or fails the DQM.
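To make the "policy is a function of the data and the collected state" idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names TaskCounts, Policy, failAnyRule, and maxTotalFixes are hypothetical stand-ins invented for illustration; they are not the SMV types, and the real DQMState may track results differently.

```scala
// Hypothetical stand-in for the state gathered from rules and fixes.
// (Not the real org.tresamigos.smv.dqm.DQMState.)
final case class TaskCounts(ruleFailures: Map[String, Long], fixes: Map[String, Long]) {
  def totalFixes: Long = fixes.values.sum
}

object PolicySketch {
  // A policy inspects the data set and the collected state,
  // and returns true when the data set passes.
  type Policy[D] = (D, TaskCounts) => Boolean

  // Passes only if the named rule rejected no record at all.
  def failAnyRule[D](name: String): Policy[D] =
    (_, state) => state.ruleFailures.getOrElse(name, 0L) == 0L

  // Passes only if the total number of applied fixes is at most `limit`.
  // Assumption: the boundary follows the prose ("no more than").
  def maxTotalFixes[D](limit: Long): Policy[D] =
    (_, state) => state.totalFixes <= limit

  // The data set passes only if every policy passes.
  def passes[D](data: D, state: TaskCounts, policies: Seq[Policy[D]]): Boolean =
    policies.forall(p => p(data, state))
}
```

In this sketch the data argument is unused by both policies, but keeping it in the signature mirrors the (DF, DQMState) shape described above and leaves room for policies that inspect the data itself.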
Create a DQM:
val dqm = SmvDQM().
  add(DQMRule($"amt" < 1000000.0, "rule1", FailAny)).
  add(DQMFix($"age" > 100, lit(100) as "age", "fix1")).
  add(DQMFix($"weight" < 5, lit(5) as "weight", "fix2")).
  add(FailTotalFixCountPolicy(20))
In this example, the "amt" field is required to be lower than one million; if any record does not
satisfy this, the DF fails the DQM. The "age" field is capped at 100, and the "weight"
field is raised to a lower bound of 5.
Neither of the two fixes triggers a DF failure on its own. However, the added policy requires no more
than 20 fixes in the entire DF; otherwise the DF fails the DQM.
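The record-level behavior of this example can be imitated on an ordinary Scala collection. The sketch below is only an illustration under stated assumptions: Rec, Outcome, and RecordLevelSketch are hypothetical names, fixes are applied only to records that survive the rule, and nothing here reflects how SMV implements these tasks on a DataFrame.

```scala
// Hypothetical record type mirroring the three fields used in the example.
final case class Rec(amt: Double, age: Int, weight: Int)

object RecordLevelSketch {
  // Result of one pass: surviving (possibly fixed) records plus counters.
  final case class Outcome(records: Seq[Rec], ruleFailures: Int, fixCount: Int)

  def run(input: Seq[Rec]): Outcome = {
    // rule1: keep records with amt < 1,000,000; the others are filtered out.
    val (kept, rejected) = input.partition(_.amt < 1000000.0)

    var fixCount = 0
    val fixed = kept.map { r =>
      // fix1: cap age at 100.
      val age = if (r.age > 100) { fixCount += 1; 100 } else r.age
      // fix2: raise weight to the lower bound of 5.
      val weight = if (r.weight < 5) { fixCount += 1; 5 } else r.weight
      r.copy(age = age, weight = weight)
    }
    Outcome(fixed, rejected.size, fixCount)
  }

  // rule1 fails the data set on any rejected record (FailAny in the example),
  // and the total-fix check allows no more than 20 fixes.
  def passes(o: Outcome): Boolean = o.ruleFailures == 0 && o.fixCount <= 20
}
```

For instance, running the sketch on Rec(500.0, 130, 60), Rec(2000000.0, 40, 70), and Rec(10.0, 30, 2) keeps the first and third records with age capped and weight raised, counts two fixes, and still fails overall because one record violated rule1.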
Attach DQM to a DF:
val dfWithDqm = dqm.attachTasks(df)
Check the DQM policies:
Since all the rules and fixes are performed when an action runs on the DF, the user needs to make sure
that one and only one action is executed on the DF. Please note that actions
like "count" might be optimized so that transformations which have no impact on the count are
skipped entirely. If there is no natural action to apply, you may need to convert the DF to an RDD first:
dfWithDqm.rdd.count
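Why exactly one action matters can be illustrated with a lazy Scala view, used here purely as an analogy. The sketch assumes that the DQM counters behave like counters updated while records are processed, so that nothing is counted before an action and a second action counts everything again; SingleActionSketch and its fields are hypothetical and involve no Spark or SMV code.

```scala
object SingleActionSketch {
  // Returns the counter value before any traversal, after one, and after two.
  def demo(): (Int, Int, Int) = {
    var fixCount = 0 // stands in for a counter updated as records are processed

    // A lazy pipeline: the function below does not run until the view is traversed.
    val ages = Seq(120, 45, 101).view.map { age =>
      if (age > 100) { fixCount += 1; 100 } else age
    }
    val beforeAction = fixCount // no traversal yet, so nothing has been counted

    ages.foreach(_ => ())       // first "action": the pipeline runs once
    val afterOne = fixCount

    ages.foreach(_ => ())       // second "action": the pipeline runs again
    val afterTwo = fixCount     // the same fixes have now been counted twice

    (beforeAction, afterOne, afterTwo)
  }
}
```

Calling SingleActionSketch.demo() returns (0, 2, 4): with no traversal the counter is empty, and with two traversals it is inflated, which is the situation a single action avoids.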
After the action, we can check the policies:
val result = dqm.validate(dfWithDqm)
The result is an org.tresamigos.smv.ValidationResult.
DQM class for data quality check and fix