To Nha Notes | Aug. 16, 2022, 4:05 p.m.

pip install great_expectations
https://docs.greatexpectations.io/docs/guides/setup/installation/local
great_expectations init
great_expectations datasource new
Using v3 (Batch Request) API
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 2
Which database backend are you using?
1. MySQL
2. Postgres
3. Redshift
4. Snowflake
5. BigQuery
6. other - Do you have a working SQLAlchemy connection string?
: 4
What authentication method would you like to use?
1. User and Password
2. Single sign-on (SSO)
3. Key pair authentication
: 1
Because you requested to create a new Datasource, we'll open a notebook for you now to complete it!
Run the Python code in the generated notebook great_expectations/uncommitted/datasource_new.ipynb (via Jupyter in VS Code, or a locally installed Jupyter) to complete the datasource initialization above.
Note: if you get a SQL syntax error during suite generation, exclude the expectation expect_column_quantile_values_to_be_between by passing the settings below to UserConfigurableProfiler in the generated notebook.
excluded_expectations = [
    # "expect_table_row_count_to_be_between",
    # "expect_table_columns_to_match_ordered_list",
    # "expect_column_values_to_not_be_null",
    # "expect_column_proportion_of_unique_values_to_be_between",
    # "expect_column_values_to_be_in_type_list",
    # "expect_column_values_to_be_in_set",
    # "expect_column_values_to_be_between",
    # "expect_column_min_to_be_between",
    # "expect_column_max_to_be_between",
    # "expect_column_mean_to_be_between",
    # "expect_column_median_to_be_between",
    "expect_column_quantile_values_to_be_between",
    # "expect_column_values_to_be_unique",
    # "expect_compound_columns_to_be_unique",
]
Uncomment additional expectations in the list above if more need to be excluded. A sketch of how the list is passed to the profiler follows.
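For reference, a minimal sketch of wiring the exclusion list into UserConfigurableProfiler (assuming the validator variable created in the earlier cells of the generated suite notebook):

from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# "validator" is created in an earlier cell of the generated notebook.
profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=excluded_expectations,  # the list above
)
suite = profiler.build_suite()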
Alternatively, instead of completing the generated notebook, we can create the datasource by running the Python code below in Jupyter.
import os
from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest

sfAccount = os.environ.get("SNOWFLAKE_ACCOUNT")
sfUser = os.environ.get("SNOWFLAKE_USER")
sfPswd = os.environ.get("SNOWFLAKE_PW")
sfDatabase = os.environ.get("SNOWFLAKE_DATABASE")
sfSchema = os.environ.get("SNOWFLAKE_SCHEMA")
sfWarehouse = os.environ.get("SNOWFLAKE_WAREHOUSE")

CONNECTION_STRING = f"snowflake://{sfUser}:{sfPswd}@{sfAccount}/{sfDatabase}/{sfSchema}?warehouse={sfWarehouse}&application=great_expectations_oss"

context = ge.get_context()

# Put your connection string in this template:
datasource_config = {
    "name": "my_snowflake_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": "snowflake://<USER_NAME>:<PASSWORD>@<ACCOUNT_NAME>/<DATABASE_NAME>/<SCHEMA_NAME>?warehouse=<WAREHOUSE_NAME>&role=<ROLE_NAME>&application=great_expectations_oss",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        },
    },
}

# Override the placeholder connection string with the one built from the
# environment variables above.
datasource_config["execution_engine"]["connection_string"] = CONNECTION_STRING

# Test the configuration
context.test_yaml_config(yaml.dump(datasource_config))

# Save the Datasource configuration to your DataContext
context.add_datasource(**datasource_config)
# Check that the new datasource is saved to great_expectations/great_expectations.yml

# Test the new Datasource
batch_request = BatchRequest(
    datasource_name="my_snowflake_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name=f"{sfSchema.lower()}.taxi_data",  # the name of the table you want to retrieve
)
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/database/snowflake
great_expectations suite new
Using v3 (Batch Request) API
How would you like to create your Expectation Suite?
1. Manually, without interacting with a sample batch of data (default)
2. Interactively, with a sample batch of data
3. Automatically, using a profiler
: 3
A batch of data is required to edit the suite - let's help you to specify it.
Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
1. public.customers
2. public.demonstration1
3. public.order_details
4. public.orders
5. public.product_variations
6. public.products
When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "snowflake_suit_001" here:
file:///home/tonha/workspace/mp-mwaa/great_expectations/expectations/snowflake_suit_001.json
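The final cells of the generated suite notebook persist the profiled suite and rebuild Data Docs so the new expectations can be reviewed. A minimal sketch, continuing from the profiler sketch above (context and suite are assumed from the earlier cells; the suite name matches the one reported by the CLI):

# "suite" is the result of profiler.build_suite() in the earlier sketch.
context.save_expectation_suite(
    expectation_suite=suite, expectation_suite_name="snowflake_suit_001"
)
# Rebuild Data Docs so the new suite shows up in the local HTML site.
context.build_data_docs()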
In normal usage, the best way to validate data is with Checkpoints, which bundle Batches of data with corresponding Expectation Suites for validation.
great_expectations --v3-api checkpoint new my_checkpoint
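A checkpoint created this way can also be run from Python (for example, from an Airflow task, per the deployment guide referenced below). A minimal sketch, assuming the checkpoint name above:

import great_expectations as ge

context = ge.get_context()

# Runs every validation bundled in the checkpoint; result.success is True
# only if all expectations passed.
result = context.run_checkpoint(checkpoint_name="my_checkpoint")
print(result.success)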
References
How to instantiate a Data Context without a yml file
How to populate credentials from a secrets store
https://docs.greatexpectations.io/docs/tutorials/getting_started/tutorial_overview
https://qxf2.com/blog/data-validation-great-expectations-real-example/
https://www.digitalocean.com/community/tutorials/how-to-test-your-data-with-great-expectations
https://github.com/great-expectations/great_expectations/tree/develop/docs_rtd/guides/how_to_guides
https://legacy.docs.greatexpectations.io/en/stable/guides/workflows_patterns/deployment_airflow.html