Some of the most common data quality tests

To Nha Notes | Sept. 20, 2022, 8:39 p.m.

  • Null values

Are any values unknown (NULL)?

  • Volume

Did I get any data at all? Did I get too much or too little?

  • Distribution

Is my data within an accepted range? Do the values in a given column fall where you expect them to?

  • Uniqueness

Are any values duplicated?

  • Known invariants

Do known business rules still hold across the data (e.g., profit always equals revenue minus cost)?

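Each of these categories boils down to a query you can run against your data. Below is a minimal sketch in Python with an in-memory SQLite database standing in for the warehouse; the `orders` table and its columns are hypothetical:

```python
import sqlite3

# Hypothetical warehouse with a tiny `orders` table for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL, revenue REAL, cost REAL);
    INSERT INTO orders VALUES (1, 19.99, 25.00, 5.01), (2, 5.00, 8.00, 3.00);
""")

# Each query counts offending rows, so a healthy table returns 0.
checks = {
    # Null values: are any values unknown?
    "null_order_ids": "SELECT COUNT(*) FROM orders WHERE order_id IS NULL",
    # Uniqueness: are any values duplicated?
    "duplicate_order_ids": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
        )""",
    # Distribution: are the values in a given column within an accepted range?
    "amount_out_of_range": "SELECT COUNT(*) FROM orders WHERE amount NOT BETWEEN 0 AND 10000",
    # Known invariants: profit should equal revenue minus cost (float tolerance).
    "broken_profit_invariant": "SELECT COUNT(*) FROM orders WHERE ABS(amount - (revenue - cost)) > 0.01",
}
for name, sql in checks.items():
    print(name, conn.execute(sql).fetchone()[0])

# Volume: did we get any data at all, and roughly the amount we expected?
print("row_count", conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
```
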
From our own experience, two of the best tools out there to test your data are dbt tests and Great Expectations (a more general-purpose tool). Both tools are open source and allow you to discover data quality issues before they end up in the hands of stakeholders. While dbt is not a testing solution per se, its out-of-the-box tests work well if you're already using the framework to model and transform your data.
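
dbt's built-in generic tests (`unique`, `not_null`, `accepted_values`, `relationships`) are declared in your models' YAML, while Great Expectations expresses the same ideas as Python "expectations". Here is a minimal sketch using the classic (pre-1.0) Great Expectations pandas API; the DataFrame and column names are made up for illustration:

```python
import pandas as pd
import great_expectations as ge

# Wrap a hypothetical orders DataFrame so it supports expectations.
orders = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
}))

orders.expect_column_values_to_not_be_null("order_id")          # null values
orders.expect_column_values_to_be_unique("order_id")            # uniqueness
orders.expect_column_values_to_be_between("amount", 0, 10_000)  # distribution
orders.expect_table_row_count_to_be_between(min_value=1)        # volume

results = orders.validate()
print(results.success)  # True only if every expectation passed
```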

To run data quality tests, you need to do two simple things:

  • Load the transformed data into a temporary staging table/data set.

  • Run tests to ensure that the data in the staging table falls within the thresholds demanded of production (i.e., you need to answer “yes” to the question: is this what reliable data looks like?).
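
A minimal sketch of these two steps, again with an in-memory SQLite database standing in for the warehouse; the table names `orders_transformed` and `orders_staging` are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_transformed (order_id INTEGER, amount REAL);
    INSERT INTO orders_transformed VALUES (1, 19.99), (2, 5.00);
""")

# Step 1: load the transformed data into a temporary staging table.
conn.executescript("""
    DROP TABLE IF EXISTS orders_staging;
    CREATE TABLE orders_staging AS SELECT * FROM orders_transformed;
""")

# Step 2: test that the staged data falls within production thresholds.
# Each entry maps a check to the maximum number of offending rows allowed.
thresholds = {
    "SELECT COUNT(*) FROM orders_staging WHERE order_id IS NULL": 0,
    "SELECT COUNT(*) FROM orders_staging WHERE amount NOT BETWEEN 0 AND 10000": 0,
}
failures = [sql for sql, allowed in thresholds.items()
            if conn.execute(sql).fetchone()[0] > allowed]
print("is this what reliable data looks like?", not failures)
```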

If a data quality test fails, an alert is sent to the data engineer or analyst responsible for that asset, and the pipeline does not run. This allows data engineers to catch unexpected data quality issues before they impact end users or downstream systems. Data testing can be done before transformation and after each step in the transformation process.
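
A minimal sketch of that failure path; every function here (`run_quality_checks`, `send_alert`, `promote_to_production`) is a hypothetical stand-in rather than part of any particular library:

```python
def run_quality_checks() -> list:
    # Stub: in practice, run the SQL checks or expectations sketched above
    # and return the names of any that failed.
    return []

def send_alert(message: str) -> None:
    # Stub: in practice, notify the engineer or analyst who owns the asset
    # (e.g., via Slack, PagerDuty, or email).
    print("ALERT:", message)

def promote_to_production() -> None:
    # Stub: in practice, swap the validated staging table into production.
    print("staging table promoted to production")

def run_pipeline() -> None:
    failures = run_quality_checks()
    if failures:
        send_alert(f"Data quality checks failed: {failures}")
        raise RuntimeError("halting pipeline before bad data reaches end users")
    promote_to_production()  # only reached when every check passes

run_pipeline()
```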

Open source tools

Apache Griffin