How to work with Great Expectations and a Snowflake data source

To Nha Notes | Aug. 16, 2022, 4:05 p.m.

Install Great Expectations locally

pip install great_expectations

https://docs.greatexpectations.io/docs/guides/setup/installation/local
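
To confirm the installation, check that the package imports and print its version (the v3 (Batch Request) API used below requires a reasonably recent release):

import great_expectations as ge

# Print the installed Great Expectations version
print(ge.__version__)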

Initialize a new Data Context with the CLI

great_expectations init

https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_a_new_data_context_with_the_cli
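
The init command scaffolds a great_expectations/ directory containing the great_expectations.yml configuration file. As a quick sketch, the new Data Context can be loaded from Python to confirm it is readable:

import great_expectations as ge

# Loads great_expectations/great_expectations.yml from the current project
context = ge.get_context()
print(context.list_datasources())  # empty until a Datasource is added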

Connect to a Snowflake database

great_expectations datasource new

Using v3 (Batch Request) API

What data would you like Great Expectations to connect to?
    1. Files on a filesystem (for processing with Pandas or Spark)
    2. Relational database (SQL)
: 2

Which database backend are you using?
    1. MySQL
    2. Postgres
    3. Redshift
    4. Snowflake
    5. BigQuery
    6. other - Do you have a working SQLAlchemy connection string?
: 4
What authentication method would you like to use?
    1. User and Password
    2. Single sign-on (SSO)
    3. Key pair authentication
: 1
Because you requested to create a new Datasource, we'll open a notebook for you now to complete it!

Run the Python code in the notebook great_expectations/uncommitted/datasource_new.ipynb (via Jupyter in VS Code or a locally installed Jupyter) to complete the data source initialization above.

Note: if you get a SQL syntax error, exclude the expectation expect_column_quantile_values_to_be_between when the suite is generated via the Jupyter notebook, using the settings below for UserConfigurableProfiler.

excluded_expectations = [
    # "expect_table_row_count_to_be_between",
    # "expect_table_columns_to_match_ordered_list",
    # "expect_column_values_to_not_be_null",
    # "expect_column_proportion_of_unique_values_to_be_between",
    # "expect_column_values_to_be_in_type_list",
    # "expect_column_values_to_be_in_set",
    # "expect_column_values_to_be_between",
    # "expect_column_min_to_be_between",
    # "expect_column_max_to_be_between",
    # "expect_column_mean_to_be_between",
    # "expect_column_median_to_be_between",
    "expect_column_quantile_values_to_be_between",
    # "expect_column_values_to_be_unique",
    # "expect_compound_columns_to_be_unique",
]

Uncomment other expectations in this list if more of them need to be excluded; a sketch of wiring the list into the profiler follows.
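
A minimal sketch, assuming the profiler notebook has already created a validator for the batch being profiled:

from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# Build the suite while skipping the expectation that triggers the SQL syntax error
profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=excluded_expectations,
)
suite = profiler.build_suite()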

Alternatively, we can run the Python code below in a Jupyter session to achieve the same result.

import os

from ruamel import yaml

import great_expectations as ge
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest

sfAccount = os.environ.get("SNOWFLAKE_ACCOUNT")
sfUser = os.environ.get("SNOWFLAKE_USER")
sfPswd = os.environ.get("SNOWFLAKE_PW")
sfDatabase = os.environ.get("SNOWFLAKE_DATABASE")
sfSchema = os.environ.get("SNOWFLAKE_SCHEMA")
sfWarehouse = os.environ.get("SNOWFLAKE_WAREHOUSE")

CONNECTION_STRING = f"snowflake://{sfUser}:{sfPswd}@{sfAccount}/{sfDatabase}/{sfSchema}?warehouse={sfWarehouse}&application=great_expectations_oss"

context = ge.get_context()

# Put your connection string in this template:
datasource_config = {
    "name": "my_snowflake_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": "snowflake://<USER_NAME>:<PASSWORD>@<ACCOUNT_NAME>/<DATABASE_NAME>/<SCHEMA_NAME>?warehouse=<WAREHOUSE_NAME>&role=<ROLE_NAME>&application=great_expectations_oss",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        },
    },
}

# Override the placeholder connection string in the template above with the
# real one built from environment variables, keeping credentials out of the config.
datasource_config["execution_engine"]["connection_string"] = CONNECTION_STRING

# Test the configuration
context.test_yaml_config(yaml.dump(datasource_config))

# Save the Datasource configuration to your DataContext
context.add_datasource(**datasource_config)

# Check that the new Datasource has been saved to the configuration file
# great_expectations/great_expectations.yml

# Test new Datasource
batch_request = BatchRequest(
    datasource_name="my_snowflake_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name=f"{sfSchema.lower()}.taxi_data",  # this is the name of the table you want to retrieve
)
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
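
The RuntimeBatchRequest imported above can also be used to validate the result of an arbitrary SQL query rather than a whole table. A sketch, where the query and the data_asset_name label are hypothetical:

# Validate a custom query via the runtime data connector
runtime_batch_request = RuntimeBatchRequest(
    datasource_name="my_snowflake_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="taxi_data_query",  # arbitrary label for this batch of data
    runtime_parameters={"query": "SELECT * FROM taxi_data LIMIT 1000"},
    batch_identifiers={"default_identifier_name": "default_identifier"},
)
validator = context.get_validator(
    batch_request=runtime_batch_request, expectation_suite_name="test_suite"
)
print(validator.head())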

https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/database/snowflake

Create an Expectation Suite

great_expectations suite new

Using v3 (Batch Request) API

How would you like to create your Expectation Suite?
    1. Manually, without interacting with a sample batch of data (default)
    2. Interactively, with a sample batch of data
    3. Automatically, using a profiler
: 3

A batch of data is required to edit the suite - let's help you to specify it.


Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
    1. public.customers
    2. public.demonstration1
    3. public.order_details
    4. public.orders
    5. public.product_variations
    6. public.products

When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "snowflake_suit_001" here:

  file:///home/tonha/workspace/mp-mwaa/great_expectations/expectations/snowflake_suit_001.json
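
Once the notebook has run, the stored suite can be inspected from Python; a quick sketch using the suite name from the run above:

# Load the suite the profiler stored and list the expectations it generated
suite = context.get_expectation_suite("snowflake_suit_001")
for expectation in suite.expectations:
    print(expectation.expectation_type)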

Validate your data using a Checkpoint

In normal usage, the best way to validate data is with Checkpoints: they bundle Batches of data with corresponding Expectation Suites for validation.

great_expectations --v3-api checkpoint new my_checkpoint

https://legacy.docs.greatexpectations.io/en/stable/guides/tutorials/getting_started_v3_api/validate_your_data.html
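
The CLI opens another notebook to finish configuring the Checkpoint; the same thing can be done directly in Python. A minimal sketch, assuming the Datasource, the public.customers table, and the suite created in the steps above:

# Configure a SimpleCheckpoint that validates one table against the profiled suite
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1.0,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": {
                "datasource_name": "my_snowflake_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "public.customers",
            },
            "expectation_suite_name": "snowflake_suit_001",
        }
    ],
}
context.add_checkpoint(**checkpoint_config)

# Run the validation and report the overall result
result = context.run_checkpoint(checkpoint_name="my_checkpoint")
print(result["success"])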


References

Snowflake python example

How to instantiate a Data Context without a yml file

How to populate credentials from a secrets store

https://docs.greatexpectations.io/docs/tutorials/getting_started/tutorial_overview

https://qxf2.com/blog/data-validation-great-expectations-real-example/

https://legacy.docs.greatexpectations.io/en/stable/guides/how_to_guides/configuring_datasources/how_to_configure_a_snowflake_datasource.html

https://www.digitalocean.com/community/tutorials/how-to-test-your-data-with-great-expectations

https://techblog.cisco.com/blog/optimize-your-data-pipeline-with-apache-airflow-and-great-expectations

https://github.com/great-expectations/great_expectations/tree/develop/docs_rtd/guides/how_to_guides

https://github.com/great-expectations/airflow-provider-great-expectations/blob/main/great_expectations_provider/example_dags/example_great_expectations_dag.py

https://github.com/datarootsio/tutorial-great-expectations/blob/main/tutorial_great_expectations.ipynb

https://legacy.docs.greatexpectations.io/en/stable/guides/workflows_patterns/deployment_airflow.html