To Nha Notes | Sept. 20, 2022, 8:39 p.m.
Are any values unknown (NULL)?
Did I get any data at all? Did I get too much or too little?
Are the values in each column within an accepted range?
Are any values duplicated?
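The questions above can be sketched as concrete checks. Here is a minimal illustration in plain Python over a hypothetical list of row dicts (the column names and thresholds are invented for the example; real tools express the same checks declaratively):

```python
# Hypothetical rows; "order_id" and "amount" are invented column names.
rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 99.5},
    {"order_id": 3, "amount": None},   # an unknown (NULL) value
    {"order_id": 3, "amount": 10.0},   # a duplicated key
]

# Are any values unknown (NULL)?
has_nulls = any(v is None for row in rows for v in row.values())

# Did I get any data at all? Too much or too little?
row_count_ok = 1 <= len(rows) <= 1_000_000  # assumed volume window

# Are the values in the "amount" column within an accepted range?
amounts_in_range = all(
    0 <= row["amount"] <= 10_000 for row in rows if row["amount"] is not None
)

# Are any key values duplicated?
ids = [row["order_id"] for row in rows]
has_duplicates = len(ids) != len(set(ids))

print(has_nulls, row_count_ok, amounts_in_range, has_duplicates)
# → True True True True
```

Each boolean is one test verdict; a real test suite would fail the run when `has_nulls` or `has_duplicates` is true.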
In our experience, two of the best tools for testing your data are dbt tests and Great Expectations (a more general-purpose tool). Both are open source and let you surface data quality issues before they reach stakeholders. While dbt is not a testing solution per se, its out-of-the-box tests work well if you’re already using the framework to model and transform your data.
To run data quality tests, you need to do two simple things:
Load the transformed data into a temporary staging table/data set.
Run tests to ensure that the data in the staging table falls within the thresholds demanded of production (i.e., you need to answer “yes” to the question: is this what reliable data looks like?).
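The two steps above can be sketched with the stdlib `sqlite3` module standing in for the warehouse; the table name, columns, and test queries are invented for illustration (dbt's built-in `not_null` and `unique` tests compile to SQL of this shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Load the transformed data into a temporary staging table.
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?)",
    [(1, 25.0), (2, 99.5), (3, 10.0)],
)

# 2. Run tests against the staging table before promoting to production.
null_count = conn.execute(
    "SELECT COUNT(*) FROM stg_orders WHERE amount IS NULL"
).fetchone()[0]
dup_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT order_id FROM stg_orders "
    "GROUP BY order_id HAVING COUNT(*) > 1)"
).fetchone()[0]

# Promote only if every test passes, i.e. the answer to
# "is this what reliable data looks like?" is yes.
ready_for_production = null_count == 0 and dup_count == 0
print(ready_for_production)  # → True
```

Only when `ready_for_production` holds would the staging data be swapped into the production table.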
If a data quality test fails, an alert is sent to the data engineer or analyst responsible for that asset, and the pipeline does not run. This lets data engineers catch unexpected data quality issues before they impact end users or downstream systems. Data testing can be done before transformation and after each step in the transformation process.
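A test-gated pipeline of this kind can be sketched as follows; the step names, transform, and alert hook are invented for illustration (in practice the alert would go to Slack, email, or PagerDuty):

```python
def alert(owner, message):
    """Stand-in for a real notification channel (Slack, email, etc.)."""
    print(f"ALERT to {owner}: {message}")

def run_pipeline(data, steps):
    """Run each (name, transform, tests) step; halt and alert on failure."""
    for name, transform, tests in steps:
        data = transform(data)
        failures = [t.__name__ for t in tests if not t(data)]
        if failures:
            alert("data-engineer", f"step '{name}' failed: {failures}")
            return None  # halt: bad data never reaches end users
    return data

def no_negatives(values):
    return all(v >= 0 for v in values)

# A single hypothetical step whose test passes after the transform runs.
result = run_pipeline(
    [1, -2, 3],
    [("clean", lambda xs: [abs(x) for x in xs], [no_negatives])],
)
print(result)  # → [1, 2, 3]
```

Attaching tests to every step, rather than only at the end, pinpoints which transformation introduced the bad data.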