Data quality tools to check data quality from an Airflow DAG

To Nha Notes | Nov. 19, 2022, 8:25 p.m.

There are multiple open source tools that can be used to check data quality from an Airflow DAG. This guide lists the most commonly used tools and focuses on the two that also integrate with OpenLineage:

  • SQL Check operators: A group of operators that you can use to define data quality checks in Python dictionaries and SQL from within your DAGs.
  • Great Expectations: An open source data validation framework where checks are defined in JSON. Airflow offers a provider package including the GreatExpectationsOperator for easy integration.

Other tools that can be used for data quality checks include:

  • Soda: An open source data validation framework that uses YAML to define checks, which can be run in Airflow using the BashOperator. Soda also offers the ability to write custom checks in SQL.
  • dbt test: A testing framework for dbt models using the dbt test CLI command, which you can run in Airflow with the BashOperator or PythonOperator (a BashOperator sketch follows this list). dbt can emit data to OpenLineage when run through the dbt-ol wrapper, but data quality metric information from the dbt test command is not currently collected; only the test results are. See the --store-failures flag to collect more information from tests.
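
As a rough illustration of the BashOperator approach, the sketch below runs dbt test from an Airflow task. The project and profiles directories are placeholders, and the snippet assumes dbt is installed in the Airflow environment.

from airflow.operators.bash import BashOperator

# Run dbt's test command against a placeholder project directory.
# --store-failures persists the failing rows of each test for later inspection.
dbt_test = BashOperator(
    task_id="dbt_test",
    bash_command=(
        "dbt test --store-failures "
        "--project-dir /usr/local/airflow/dbt "
        "--profiles-dir /usr/local/airflow/dbt"
    ),
)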

Choosing a tool

Which tool you choose is determined by the needs and preferences of your organization. Astronomer recommends using SQL Check operators if you want to:

  • Write checks without needing to set up software in addition to Airflow.
  • Write checks as Python dictionaries and in SQL.
  • Use any SQL statement that returns a single row of booleans as a data quality check.
  • Implement many different downstream dependencies depending on the outcome of different checks.
  • Have full observability of which checks failed from within Airflow task logs, including the full SQL statements of failed checks.

Astronomer recommends using a data validation framework such as Great Expectations or Soda in the following circumstances:

  • You want to collect the results of your data quality checks in a central place.
  • You prefer to write checks in JSON (Great Expectations) or YAML (Soda).
  • Most or all of your checks can be implemented by the predefined checks in the solution of your choice.
  • You want to abstract your data quality checks from the DAG code.

Currently, only the SQL Check operators and the GreatExpectationsOperator offer data lineage extraction through OpenLineage.
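
For reference, the sketch below shows roughly what a GreatExpectationsOperator task might look like. It assumes the Great Expectations provider package is installed and that a Great Expectations project with a JSON Expectation Suite already exists; the directory, connection, table, and suite names are placeholders, and parameter names can differ between provider versions.

from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

ge_check = GreatExpectationsOperator(
    task_id="ge_check",
    # Placeholder path to a Great Expectations project inside the Airflow deployment.
    data_context_root_dir="/usr/local/airflow/include/great_expectations",
    conn_id="example_conn",                   # placeholder Airflow connection
    data_asset_name="example_table",          # placeholder table to validate
    expectation_suite_name="example_suite",   # placeholder suite defined in JSON
)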

SQL Check operators

The following SQL Check operators are recommended for implementing data quality checks:

  • SQLColumnCheckOperator: Runs multiple predefined data quality checks on multiple columns within the same task.
  • SQLTableCheckOperator: Runs multiple user-defined checks on one or more columns of a table (a short sketch follows this list).
  • SQLCheckOperator: Takes any SQL query and returns a single row that is evaluated to booleans. This operator is useful for more complicated checks that could span several tables of your database.
  • SQLIntervalCheckOperator: Checks current data against historical data.
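
As a brief illustration of the SQLTableCheckOperator, the sketch below defines two table-level checks as SQL statements. The connection, table, and column names are placeholders.

from airflow.providers.common.sql.operators.sql import SQLTableCheckOperator

check_table = SQLTableCheckOperator(
    task_id="check_table",
    conn_id="example_conn",    # placeholder Airflow connection
    table="example_table",     # placeholder table
    checks={
        # The table must contain at least 1000 rows.
        "row_count_check": {"check_statement": "COUNT(*) >= 1000"},
        # Every row must satisfy this condition on placeholder columns.
        "total_check": {"check_statement": "MY_NUM_COL + MY_OTHER_NUM_COL = MY_TOTAL_COL"},
    },
)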

Additionally, two older SQL Check operators exist that can run one check at a time against a defined value or threshold (both are sketched after this list):

  • SQLValueCheckOperator: A simpler operator that can be used when a specific, known value is being checked either as an exact value or within a percentage threshold.
  • SQLThresholdCheckOperator: An operator with flexible upper and lower thresholds, where the threshold bounds may also be described as SQL queries that return a numeric value.
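
A minimal sketch of both older operators follows; the connection, table, column, and expected values are placeholders.

from airflow.operators.sql import SQLThresholdCheckOperator, SQLValueCheckOperator

# Fail unless the row count is within 10% of the expected value.
value_check = SQLValueCheckOperator(
    task_id="value_check",
    conn_id="example_conn",                    # placeholder Airflow connection
    sql="SELECT COUNT(*) FROM example_table",  # placeholder table
    pass_value=10000,
    tolerance=0.1,
)

# Fail unless the maximum of a column falls between the two thresholds;
# a threshold can be a number or a SQL query returning a single numeric value.
threshold_check = SQLThresholdCheckOperator(
    task_id="threshold_check",
    conn_id="example_conn",
    sql="SELECT MAX(MY_NUM_COL) FROM example_table",
    min_threshold=90,
    max_threshold="SELECT MAX(MY_NUM_COL) * 1.1 FROM example_table_history",
)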

Astronomer recommends using the SQLColumnCheckOperator and SQLTableCheckOperator over the SQLValueCheckOperator and SQLThresholdCheckOperator whenever possible to improve code readability.

Requirements

The SQLColumnCheckOperator and the SQLTableCheckOperator are available in the common SQL provider package, which can be installed with:

pip install apache-airflow-providers-common-sql

The SQLCheckOperator, SQLIntervalCheckOperator, SQLValueCheckOperator, and SQLThresholdCheckOperator are built into core Airflow and do not require a separate package installation. For an example, see snowflake_check_operators.py.

Example SQLColumnCheckOperator

The SQLColumnCheckOperator has a column_mapping parameter which stores a dictionary of checks. Using this dictionary, it can run many checks within one task and still show in the Airflow logs which individual checks passed and which failed.

In the example below, 5 checks are performed on 3 different columns using the SQLColumnCheckOperator:

  • "MY_DATE_COL" is checked to ensure that it contains only unique dates.
  • "MY_TEXT_COL" is checked to ensure it has at least 10 distinct values and no NULL values.
  • "MY_NUM_COL" is checked to ensure it contains a minimum value of less than 10 and a maximum value of 100 with a 10% tolerance (maximum values between 90 and 110 are accepted).

from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator

check_columns = SQLColumnCheckOperator(
    task_id="check_columns",
    conn_id="example_conn",    # placeholder Airflow connection
    table="example_table",     # placeholder table containing the columns to check
    column_mapping={
        "MY_DATE_COL": {
            # the unique check counts duplicate values, so 0 means all dates are unique
            "unique_check": {"equal_to": 0},
        },
        "MY_TEXT_COL": {
            "distinct_check": {"geq_to": 10},   # at least 10 distinct values
            "null_check": {"equal_to": 0},      # no NULL values
        },
        "MY_NUM_COL": {
            "min": {"less_than": 10},                     # minimum value below 10
            "max": {"equal_to": 100, "tolerance": 0.1},   # maximum value of 100, +/- 10%
        },
    },
)

If the resulting boolean value is True, the check passes; otherwise, it fails. Airflow generates logs that show the set of returned records for every check that passes, and the full query and result for every check that fails.

References

https://docs.astronomer.io/learn/data-quality

https://docs.astronomer.io/learn/airflow-sql-data-quality

https://github.com/astronomer/airflow-data-quality-demo

https://github.com/astronomer/airflow-data-quality-demo/blob/main/dags/snowflake_examples/complex_snowflake_transform.py