Passing Data Between Airflow Tasks

To Nha Notes | March 10, 2022, 5 p.m.

Sharing data between tasks is a very common use case in Airflow. If you’ve been writing DAGs, you probably know that breaking them up into appropriately small tasks is a best practice for debugging and recovering quickly from failures. But maybe one of your downstream tasks requires metadata about an upstream task, or needs to process the results of the task immediately before it.

There are a few methods you can use to implement data sharing between your Airflow tasks. In this guide, we’ll walk through the two most commonly used methods, discuss when to use each, and show some example DAGs to demonstrate the implementation.

Knowing the size of the data you are passing between Airflow tasks is important when deciding which implementation method to use. As we’ll describe in detail below, XComs are one method of passing data between tasks, but they are only appropriate for small amounts of data. Larger datasets require a method that uses intermediate storage, and possibly an external processing framework.

XCom

XComs are a relative of Variables, with the main difference being that XComs are per-task-instance and designed for communication within a DAG run, while Variables are global and designed for overall configuration and value sharing.
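For a small value like a row count or a filename, a minimal sketch using the TaskFlow API might look like the following (the DAG and task names here are hypothetical):

```python
# A minimal sketch of XCom-based data passing with the TaskFlow API.
# DAG and task names are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2022, 3, 1), catchup=False)
def xcom_example():

    @task
    def extract_row_count() -> int:
        # The return value is pushed to XCom automatically.
        return 42

    @task
    def report(row_count: int):
        # The argument is pulled from XCom at runtime.
        print(f"Upstream task processed {row_count} rows")

    report(extract_row_count())


xcom_example_dag = xcom_example()
```

Behind the scenes, the returned value is serialized and stored in Airflow’s metadata database, which is why this pattern is only appropriate for small values.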

Intermediary Data Storage

As mentioned above, XCom can be a great option for sharing data between tasks because it doesn’t rely on any tools external to Airflow itself. However, it is only designed to be used for very small amounts of data. What if the data you need to pass is a little bit larger, for example a small dataframe?

The best way to manage this use case is to use intermediary data storage. This means saving your data to some system external to Airflow at the end of one task, then reading it in from that system in the next task. This is commonly done using cloud file storage such as S3, GCS, or Azure Blob Storage, but it could also be done by loading the data into a temporary or persistent table in a database.
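As an illustration, here is one possible sketch of this pattern using S3 as the intermediary store, via the S3Hook from the Amazon provider package. The bucket name, object key, and the "aws_default" connection are assumptions made for the example:

```python
# A sketch of intermediary data storage: write a small dataframe to S3 at
# the end of one task, then read it back in the downstream task.
# Assumes an Airflow connection named "aws_default" and a hypothetical bucket.
import io
from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "my-data-bucket"          # hypothetical bucket name
KEY = "intermediate/results.csv"   # hypothetical object key


@dag(schedule_interval=None, start_date=datetime(2022, 3, 1), catchup=False)
def intermediary_storage_example():

    @task
    def write_dataframe():
        # Save the dataframe to S3 at the end of this task.
        df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
        S3Hook().load_string(
            df.to_csv(index=False), key=KEY, bucket_name=BUCKET, replace=True
        )

    @task
    def read_dataframe():
        # Read the same object back from S3 in the next task.
        csv_data = S3Hook().read_key(key=KEY, bucket_name=BUCKET)
        df = pd.read_csv(io.StringIO(csv_data))
        print(df.head())

    write_dataframe() >> read_dataframe()


intermediary_storage_dag = intermediary_storage_example()
```

Because the dataframe is written to and read from S3 directly, nothing beyond task state (and, at most, a small reference) is stored in Airflow’s metadata database.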

We’ll note here that while this is a great way to pass data that is too large to be managed with XCom, you should still exercise caution. Airflow is meant to be an orchestrator, not an execution framework. If your data is very large, it is probably a good idea to complete any processing with a framework like Spark, or to push transformations down to a compute-optimized data warehouse such as Snowflake (for example, using dbt to manage the SQL).
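For instance, rather than pulling a large table through a worker’s memory, the DAG can simply submit SQL to the warehouse and let it do the heavy lifting. The sketch below uses the SnowflakeOperator and assumes a Snowflake connection named "snowflake_default" plus hypothetical table names:

```python
# A hypothetical sketch of keeping heavy processing out of Airflow:
# the operator only submits SQL, and Snowflake does the actual work.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="pushdown_example",
    schedule_interval=None,
    start_date=datetime(2022, 3, 1),
    catchup=False,
) as dag:
    aggregate_orders = SnowflakeOperator(
        task_id="aggregate_orders",
        snowflake_conn_id="snowflake_default",  # assumed connection name
        sql="""
            CREATE OR REPLACE TABLE analytics.daily_order_totals AS
            SELECT order_date, SUM(amount) AS total_amount
            FROM raw.orders
            GROUP BY order_date
        """,
    )
```

Airflow only tracks whether the statement succeeded or failed; the computation itself happens entirely inside the warehouse.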

References:

https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html

https://www.astronomer.io/guides/airflow-passing-data-between-tasks/

https://www.astronomer.io/guides/custom-xcom-backends/

https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#custom-xcom-backends

https://www.astronomer.io/guides/airflow-great-expectations/