DQMFix will fix a column with a default value
DQMFix will fix a column with a default value
val f = DQMFix($"age" > 100, lit(100) as "age", "age_cap100", FailNone)
If "age" greater than 100, make it 100. Task name "age_cap100", which can be referred in the org.tresamigos.smv.dqm.DQMState This task will not trigger a DF fail
DQMPolicy defines a requirement on an entire DF
DQMRule defines a requirement on the records of a DF
DQMRule defines a requirement on the records of a DF
val r = DQMRule($"a" + $"b" < 100.0, "a_b_sum_lt100", FailPercent(0.01))
Require the sum of "a" and "b" columns less than 100. Rule name "a_b_sum_lt100", which can be referred in the org.tresamigos.smv.dqm.DQMState If 1% or more of the records fail this rule, the entire DF will fail
DQMState keeps tracking of org.tresamigos.smv.dqm.DQMTask behavior on a DF
DQMState keeps tracking of org.tresamigos.smv.dqm.DQMTask behavior on a DF
Since the logs are implemented with Aggregators, it need a SparkContext to construct. A list of DQMRule names and a list of DQMFix names are needed also.
Each DQMTask (DQMRule/DQMFix) need to have a DQMTaskPolicy
Validates data against DQM rules
A serializable snapshot of a DQMState
DqmValidator will generate DqmValidationResult, which has
DqmValidator will generate DqmValidationResult, which has
whether the validation passed or not
detailed messages for sub results which the passed flag depends on
useful logs for reporting
Error recorded by RejectLoggers
Tasks with FailCount(n) will fail the DF if the task is triggered >= n times
If the total time of parser fails >= threshold, fail the DF
Tasks with FailPercent(r) will fail the DF if the task is triggered >= r percent of the total number of records in the DF.
Tasks with FailPercent(r) will fail the DF if the task is triggered >= r percent of the total number of records in the DF. "r" is between 0.0 and 1.0
For all the fixes in a DQM, if the total time of them be triggered is >= threshold, the DF will Fail
For all the fixes in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail.
For all the fixes in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail. The threshold is between 0.0 and 1.0.
For all the rules in a DQM, if the total time of them be triggered is >= threshold, the DF will Fail
For all the rules in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail.
For all the rules in a DQM, if the total time of them be triggered is >= threshold * total Records, the DF will Fail. The threshold is between 0.0 and 1.0.
DQM class for data quality check and fix
DQM class for data quality check and fix
Support 2 types of recode level tasks: Rule and Fix. A "rule" is a requirement on a record, if a record can't satisfy a rule, the record will be filtered. A "fix" is a requirement on a field with a default value, so that it can fix a record. DQM also support different "Policies". Policies are requirements on the entire DF level. A policy is a function on (DF, org.tresamigos.smv.dqm.DQMState). By given a df and the DQMState, which are results from the rules and fixes, a policy determine whether the df is passed the DQM or failed.
Create a DQM:
val dqm = SmvDQM(). add(DQMRule($"amt" < 1000000.0, "rule1", FailAny)). add(DQMFix($"age" > 100, lit(100) as "age", "fix1")). add(DQMFix($"weight" < 5, lit(5) as "weight", "fix2")). add(FailTotalFixCountPolicy(20))
In this example, "amt" field is required to be lower than one million, if any record does not satisfy it, the DF will fail this DQM. The "age" field will be capped to 100, and the "weight" field will be capped on the lower bound to 5. None of the 2 fixes will trigger a DF fail. However, we added a policy which require no more than 20 fixes in the entire DF, otherwise the DF will fail this DQM.
Attach DQM to a DF:
val dfWithDqm = dqm.attachTasks(df)Check the DQM policies: Since all the rules and fixes are performed when the DF has an action, user need to make sure that there is one and only one action operation happened on the DF. Please note that actions like "count" might be optimized so that transformations which have no impact on "count" might be totally ignored. If there no natural action to be apply, you may need to do convert DF to RDD first
dfWithDqm.rdd().count
After the action, we can check the policies
val result = dqm.validate(dfWithDqm)The result is a org.tresamigos.smv.ValidationResult
BoundRule requires lower <= col < upper
For access by Python modules
Any rule fail or fix with FailAny will cause the entire DF fail
Task with FailNone will not trigger any DF level policy
FormatFix to assign default if col does not match fmt
FormatRule requires col matches fmt
SetFix to assign default if col not in set
SetRule requires col in set
DQM (Data Quality Module) providing classes for DF data quality assurance
Main class org.tresamigos.smv.dqm.SmvDQM can be used with the SmvApp/Module Framework or on stand-alone DF. With the SmvApp/Module framework, a
dqmmethod is defined on the org.tresamigos.smv.SmvDataSet level, and can be overridden to define DQM rules, fixes and policies, which then will be automatically checked when the SmvDataSet gets resolved.For working on a stand-alone DF, please refer the SmvDQM class's documentation.