How to use Great Expectations with Amazon Web Services using Redshift
Great Expectations can work within many frameworks. This guide demonstrates a workflow for using Great Expectations with AWS and cloud storage: you will configure a local Great Expectations project to store Expectations, Validation Results, and Data Docs in Amazon S3 buckets, and you will configure Great Expectations to access data from a Redshift database.
This guide will demonstrate each of the steps necessary to go from installing a new instance of Great Expectations to Validating your data for the first time and viewing your Validation Results as Data Docs.
Prerequisites
- Python 3. To download and install Python, see Python downloads.
- The AWS CLI. To download and install the AWS CLI, see Installing or updating the latest version of the AWS CLI.
- AWS credentials. See Configuring the AWS CLI.
- Permissions to install the Python packages (boto3 and great_expectations) with pip.
- An S3 bucket and prefix to store Expectations and Validation Results.
Steps
Part 1: Setup
1.1 Ensure that the AWS CLI is ready for use
1.1.1 Verify that the AWS CLI is installed
You can verify that the AWS CLI has been installed by running the command:
aws --version
If this command does not return version information for the AWS CLI, you may need to install the AWS CLI or otherwise troubleshoot your current installation. For detailed guidance, please refer to Amazon's documentation on how to install the AWS CLI.
1.1.2 Verify that your AWS credentials are properly configured
Run the following command in the AWS CLI to verify that your AWS credentials are properly configured:
aws sts get-caller-identity
When your credentials are properly configured, your UserId, Account, and Arn are returned. If your credentials are not configured correctly, an error message appears.
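For reference, a correctly configured setup returns output shaped like the following (all values here are illustrative placeholders):
{
    "UserId": "AIDASAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/your_user_name"
}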
If an error message appears, or if you couldn't use the AWS CLI to verify your credentials configuration, see Configuring the AWS CLI.
1.2 Prepare a local installation of Great Expectations
1.2.1 Verify that your Python version meets requirements
First, check the version of Python that you have installed. As of this writing, Great Expectations supports versions 3.7 through 3.10 of Python.
You can check your version of Python by running:
python --version
If this command returns something other than a Python 3 version number (like Python 3.X.X), you may need to try this:
python3 --version
If you do not have Python 3 installed, please refer to python.org for the necessary downloads and guidance to perform the installation.
1.2.2 Create a virtual environment for your Great Expectations project
Once you have confirmed that Python 3 is installed locally, you can create a virtual environment with venv before installing your packages with pip.
Python Virtual Environments
Depending on whether you found that you needed to run python or python3 in the previous step, you will create your virtual environment by running either:
python -m venv my_venv
or
python3 -m venv my_venv
This command will create a new directory called my_venv where your virtual environment is located. To activate the virtual environment, run:
source my_venv/bin/activate
You can name your virtual environment anything you like. Simply replace my_venv in the examples above with the name that you would like to use.
1.2.3 Ensure you have the latest version of pip
Once your virtual environment is activated, you should ensure that you have the latest version of pip, the tool used to install Python packages. If you have Python 3 installed, you can update pip by running either:
python -m ensurepip --upgrade
or
python3 -m ensurepip --upgrade
1.2.4 Install boto3
Python interacts with AWS through the boto3 library. Great Expectations makes use of this library in the background when working with AWS. Although you won't use boto3 directly, you'll need to install it in your virtual environment.
Run one of the following pip commands to install boto3 in your virtual environment:
python -m pip install boto3
or
python3 -m pip install boto3
To set up boto3 with AWS, and use boto3 from within Python, see the Boto3 documentation.
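If you would like to confirm that boto3 can reach S3 with your configured credentials, a minimal sanity check is sketched below (the bucket name is a placeholder for your own):
import boto3

# Illustrative check: request at most one object from your bucket.
# Replace <your_s3_bucket_name> with the bucket you created for this guide.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="<your_s3_bucket_name>", MaxKeys=1)
print(response["ResponseMetadata"]["HTTPStatusCode"])  # 200 indicates success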
1.2.5 Install Great Expectations
You can use pip to install Great Expectations by running the appropriate pip command below:
python -m pip install great_expectations
or
python3 -m pip install great_expectations
1.2.6 Verify that Great Expectations installed successfully
You can confirm that installation worked by running:
great_expectations --version
This should return something like:
great_expectations, version 0.16.16
1.2.7 Install additional dependencies for Redshift
To connect to your Redshift database, Great Expectations requires the installation of additional dependencies. Fortunately, it is simple to install the necessary dependencies for Redshift by using pip and running the following from your terminal:
pip install sqlalchemy sqlalchemy-redshift psycopg2
# or if on macOS:
pip install sqlalchemy sqlalchemy-redshift psycopg2-binary
As of this writing, Great Expectations is not compatible with SQLAlchemy version 2 or greater. We recommend using the latest non-version-2 release.
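For example, you can constrain pip to a pre-2.0 release when installing (the version bounds here are illustrative):
python -m pip install 'sqlalchemy>=1.4,<2.0'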
1.3 Create your Data Context
The simplest way to create a new Data Context is by using the create() method.
From a Notebook or script where you want to deploy Great Expectations, run the following command. Here, full_path_to_project_directory can be an empty directory where you intend to build your Great Expectations configuration:
import great_expectations as gx
context = gx.data_context.FileDataContext.create(full_path_to_project_directory)
1.4 Configure your Expectations Store on Amazon S3
1.4.1 Identify your Data Context Expectations Store
Your Expectation Store configuration is in your Data Context.
The following section in your Data Context's great_expectations.yml file tells Great Expectations to look for Expectations in a Store named expectations_store:
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

expectations_store_name: expectations_store
The default base_directory for expectations_store is expectations/.
1.4.2 Update your configuration file to include a new Store for Expectations on Amazon S3
To manually add an Expectations Store to your configuration, add the following configuration to the stores section of your great_expectations.yml file:
stores:
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'

expectations_store_name: expectations_S3_store
As shown in the previous example, you need to change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Expectations are located.
The following example shows the additional options that are available to customize TupleS3StoreBackend:
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
In the previous example, the Store name is expectations_S3_store. If you use a personalized Store name, you must also update the value of the expectations_store_name key to match the Store name. For example:
expectations_store_name: expectations_S3_store
When you update the expectations_store_name key value, Great Expectations uses the new Store for Expectations.
Add the following code to great_expectations.yml to configure the IAM user:
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    aws_session_token: ${AWS_SESSION_TOKEN}
Add the following code to great_expectations.yml to configure the IAM Assume Role:
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    assume_role_arn: '<your_role_to_assume>'
    region_name: '<your_aws_region_name>'
    assume_role_duration: session_duration_in_seconds
If you are also storing Validation Results or Data Docs in S3, make sure that the prefix values are disjoint and that neither is a substring of the other. For example, expectations/ and validations/ are safe, whereas data/ and data/expectations/ are not.
1.4.3 (Optional) Copy existing Expectation JSON files to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Expectations saved that you want to transfer to your S3 bucket.
To copy Expectations into Amazon S3, use the aws s3 sync command as shown in the following example:
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
The base_directory is set to expectations/ by default.
In the following example, the Expectations exp1 and exp2 are copied to Amazon S3 and a confirmation message is returned:
upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json
1.4.4 (Optional) Verify that copied Expectations can be accessed from Amazon S3
If you copied your existing Expectation Suites to the S3 bucket, run the following Python code to confirm that Great Expectations can find them:
import great_expectations as gx
context = gx.get_context()
context.list_expectation_suite_names()
The Expectations you copied to S3 are returned as a list. Expectations that weren't copied to the new Store aren't listed.
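For example, if the exp1 and exp2 Suites shown earlier were copied, the call would return:
['exp1', 'exp2']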
1.5 Configure your Validation Results Store on Amazon S3
1.5.1 Identify your Data Context's Validation Results Store
Your Validation Results Store configuration is in your Data Context.
The following section in your Data Context's great_expectations.yml file tells Great Expectations to look for Validation Results in a Store named validations_store. It also creates a ValidationsStore named validations_store that is backed by a Filesystem and stores Validation Results under the base_directory uncommitted/validations/ (the default).
stores:
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

validations_store_name: validations_store
1.5.2 Update your configuration file to include a new Store for Validation Results on Amazon S3
To manually add a Validation Results Store, add the following configuration to the stores section of your great_expectations.yml file:
stores:
  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
As shown in the previous example, you need to change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Validation Results are located.
The following example shows the additional options that are available to customize TupleS3StoreBackend:
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
In the previous example, the Store name is validations_S3_store. If you use a personalized Store name, you must also update the value of the validations_store_name key to match the Store name. For example:
validations_store_name: validations_S3_store
When you update the validations_store_name key value, Great Expectations uses the new Store for Validation Results.
Add the following code to great_expectations.yml to configure the IAM user:
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    aws_session_token: ${AWS_SESSION_TOKEN}
Add the following code to great_expectations.yml to configure the IAM Assume Role:
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    assume_role_arn: '<your_role_to_assume>'
    region_name: '<your_aws_region_name>'
    assume_role_duration: session_duration_in_seconds
If you are also storing Expectations in S3 (see How to configure an Expectation store to use Amazon S3) or Data Docs in S3 (see How to host and share Data Docs on Amazon S3), make sure that the prefix values are disjoint and that neither is a substring of the other.
1.5.3 (Optional) Copy existing Validation results to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Validation Results saved that you want to transfer to your S3 bucket.
To copy Validation Results into Amazon S3, use the aws s3 sync command as shown in the following example:
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
The base_directory is set to uncommitted/validations/ by default.
In the following example, the Validation Results val1 and val2 are copied to Amazon S3 and a confirmation message is returned:
upload: uncommitted/validations/val1/val1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val1.json
upload: uncommitted/validations/val2/val2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val2.json
1.6 Configure Data Docs for hosting and sharing from Amazon S3
1.6.1 Create an Amazon S3 bucket for your Data Docs
You can create an S3 bucket configured for a specific location using the AWS CLI. Make sure you modify the bucket name and region for your situation.
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
    "Location": "/data-docs.my_org"
}
1.6.2 Configure your bucket policy to enable appropriate access
The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your situation. After you have customized the example policy to suit your situation, save it to a file called ip-policy.json in your local directory.
Your policy should provide access only to appropriate users. Data Docs sites can include critical information about raw data and should generally not be publicly accessible.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow only based on source IP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::data-docs.my_org",
        "arn:aws:s3:::data-docs.my_org/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "192.168.0.1/32",
            "2001:db8:1234:1234::/64"
          ]
        }
      }
    }
  ]
}
Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
Amazon Web Services' S3 buckets are a third-party utility. For more (and the most up-to-date) information on configuring AWS S3 bucket policies, please refer to Amazon's guide on using bucket policies.
1.6.3 Apply the access policy to your Data Docs' Amazon S3 bucket
Run the following AWS CLI command to apply the policy:
> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
1.6.4 Add a new Amazon S3 site to the data_docs_sites section of your great_expectations.yml
The example below shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. Optionally, you may remove the default local_site configuration entirely and replace it with the new s3_site configuration if you would like to maintain only a single S3 Data Docs site.
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site: # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
1.6.5 Test that your Data Docs configuration is correct by building the site
Run the following command to build and open your newly configured S3 Data Docs site:
context.build_data_docs()
Additional notes on hosting Data Docs from an Amazon S3 bucket
Optionally, you may wish to update static hosting settings for your bucket to enable AWS to automatically serve your index.html file or a custom error file:
> aws s3 website s3://data-docs.my_org/ --index-document index.html
- If you wish to host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet in step 1.6.4, immediately after the bucket property.
- If you wish to host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store. The following example will configure an S3 site with the base_public_path set to www.mydns.com. Data Docs will still be written to the configured location on S3 (for example, https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you will be able to access the pages from your DNS (http://www.mydns.com/index.html in our example):
data_docs_sites:
  s3_site: # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org # UPDATE the bucket name here to match the bucket you configured above.
      base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
Part 2: Connect to data
2.1 Instantiate your project's DataContext
The simplest way to create a new Data Context is by using the create() method.
From a Notebook or script where you want to deploy Great Expectations, run the following command. Here, full_path_to_project_directory can be an empty directory where you intend to build your Great Expectations configuration:
import great_expectations as gx
context = gx.data_context.FileDataContext.create(full_path_to_project_directory)
If you have already instantiated your DataContext in a previous step, this step can be skipped.
2.1.1 Determine your connection string
For this guide we will use a connection_string with the following format:
redshift+psycopg2://<USER_NAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?sslmode=<SSLMODE>
Note: Depending on your Redshift cluster configuration, you may or may not need the sslmode parameter. For more details, please refer to Amazon's documentation for configuring security options on Amazon Redshift.
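As an illustration only, a fully assembled connection string might look like the following (every value below is a placeholder, not a real cluster):
redshift+psycopg2://awsuser:my_password@my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev?sslmode=prefer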
We recommend that database credentials be stored in the config_variables.yml file, which is located in the uncommitted/ folder by default and is not part of source control.
For additional options on configuring the config_variables.yml file or additional environment variables, please see our guide on how to configure credentials.
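A minimal sketch of this pattern, assuming a variable name of my_redshift_connection_string (our choice, not a required name), adds the full string to uncommitted/config_variables.yml:
my_redshift_connection_string: redshift+psycopg2://<USER_NAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?sslmode=<SSLMODE>
You can then pass "${my_redshift_connection_string}" as the connection_string when creating your Datasource, and Great Expectations will substitute the stored value at runtime, keeping the credentials themselves out of great_expectations.yml.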
2.2 Add Datasource to your DataContext
Creating a Redshift Datasource is as simple as providing the add_or_update_sql(...) method a name by which to reference it in the future and the connection_string with which to access it.
datasource_name = "my_redshift_datasource"
connection_string = "redshift+psycopg2://<user_name>:<password>@<host>:<port>/<database>?sslmode=<sslmode>"
With these two values, we can create our Datasource:
datasource = context.sources.add_or_update_sql(
    name=datasource_name,
    connection_string=connection_string,
)
2.3 Connect to a specific set of data with a Data Asset
Now that our Datasource has been created, we will use it to connect to a specific set of data in the database it is configured for. This is done by defining a Data Asset in the Datasource. A Datasource may contain multiple Data Assets, each of which will serve as the interface between GX and the specific set of data it has been configured for.
With SQL databases, there are two types of Data Assets that can be used. The first is a Table Data Asset, which connects GX to the data contained in a single table in the source database. The other is a Query Data Asset, which connects GX to the data returned by a SQL query. We will demonstrate how to create both of these in the following steps.
Although there is no set maximum number of Data Assets you can define for a datasource, there is a functional minimum. In order for GX to retrieve data from your Datasource you will need to create at least one Data Asset.
We will indicate a table to connect to with a Table Data Asset. This is done by providing the add_table_asset(...) method a name by which we will reference the Data Asset in the future and a table_name to specify the table we wish the Data Asset to connect to.
table_asset = datasource.add_table_asset(name="my_table_asset", table_name="taxi_data")
To indicate the query that provides data to connect to, we will define a Query Data Asset. This is done by providing the add_query_asset(...) method a name by which we will reference the Data Asset in the future and a query which will provide the data we wish the Data Asset to connect to.
query_asset = datasource.add_query_asset(
    name="my_query_asset", query="SELECT * from taxi_data"
)
2.4 Test your new Datasource
Verify your new Datasource by loading data from it into a Validator using a Batch Request.
request = table_asset.build_batch_request()
context.add_or_update_expectation_suite(expectation_suite_name="test_suite")
validator = context.get_validator(
    batch_request=request, expectation_suite_name="test_suite"
)
print(validator.head())
Part 3: Create Expectations
3.1: Prepare a Batch Request, empty Expectation Suite, and Validator
When we tested our Datasource in step 2.4 (Test your new Datasource), we also created all of the components we need to begin creating Expectations: a Batch Request to provide sample data we can test our new Expectations against, an empty Expectation Suite to contain our new Expectations, and a Validator to create those Expectations with.
We can reuse those components now. Alternatively, you may follow the same process that we did before and define a new Batch Request, Expectation Suite, and Validator if you wish to use a different Batch of data as the reference sample when you are creating Expectations, or if you wish to use a different name than test_suite for your Expectation Suite.
3.2: Use a Validator to add Expectations to the Expectation Suite
There are many Expectations available for you to use. To demonstrate creating an Expectation through the use of the Validator we defined earlier, here are examples of the process for two of them:
validator.expect_column_values_to_not_be_null(column="passenger_count")
validator.expect_column_values_to_be_between(
    column="congestion_surcharge", min_value=0, max_value=1000
)
Each time you evaluate an Expectation (e.g. via validator.expect_*), two things will happen. First, the Expectation will immediately be Validated against your provided Batch of data. This instant feedback helps to zero in on unexpected data very quickly, taking a lot of the guesswork out of data exploration. Second, the Expectation configuration will be stored in the Expectation Suite you provided when the Validator was initialized.
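For example, the first call above returns an Expectation Validation Result whose summary looks roughly like this (the counts are illustrative and depend on your data):
{
  "success": true,
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0
  }
}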
You can also create Expectation Suites using a Data Assistant to automatically create expectations based on your data or manually using domain knowledge and without inspecting data directly.
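As a rough sketch of the Data Assistant route (the suite name here is our own choice), the Onboarding Assistant can profile the Batch we requested earlier and propose Expectations automatically:
# Profile the Batch and generate candidate Expectations automatically.
data_assistant_result = context.assistants.onboarding.run(batch_request=request)

# Collect the proposed Expectations into a Suite and save it to the Store.
suite = data_assistant_result.get_expectation_suite(
    expectation_suite_name="my_onboarding_suite"
)
context.add_or_update_expectation_suite(expectation_suite=suite)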
To find out more about the available Expectations, please see our Expectations Gallery.
3.3: Save the Expectation Suite
When you have run all of the Expectations you want for this dataset, you can call validator.save_expectation_suite() to save the Expectation Suite (all of the unique Expectation Configurations from each run of validator.expect_*) for later use in a Checkpoint.
validator.save_expectation_suite(discard_failed_expectations=False)
Part 4: Validate Data
4.1: Create and run a Checkpoint
Here we will create and store a Checkpoint for our Batch, which we can use to validate and run post-validation Actions.
Checkpoints are a robust resource that can be preconfigured with a Batch Request and Expectation Suite or take them in as parameters at runtime. They can also execute numerous Actions based on the Validation Results that are returned when the Checkpoint is run.
This guide will demonstrate using a SimpleCheckpoint that takes in a Batch Request and Expectation Suite as parameters at runtime.
For more information on pre-configuring a Checkpoint with a Batch Request and Expectation Suite, please see our guides on Checkpoints.
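As a brief sketch of that alternative (reusing the names defined above; the Checkpoint name is our own choice), a Checkpoint can be stored on the Data Context with its Validations preconfigured and then re-run by name:
# Store a Checkpoint with its Batch Request and Expectation Suite baked in.
checkpoint = context.add_or_update_checkpoint(
    name="my_preconfigured_checkpoint",
    validations=[
        {"batch_request": request, "expectation_suite_name": "test_suite"}
    ],
)

# Later, re-run it by name without re-supplying the parameters.
result = context.run_checkpoint(checkpoint_name="my_preconfigured_checkpoint")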
4.1.1 Create a Checkpoint
We create the Checkpoint using a SimpleCheckpoint:
checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="my_checkpoint",
    data_context=context,
    validations=[{"batch_request": request, "expectation_suite_name": "test_suite"}],
)
We have named the Checkpoint my_checkpoint and added one Validation, using the BatchRequest we created earlier and our ExpectationSuite containing two Expectations, test_suite.
4.1.2 Run the Checkpoint
Finally, having created our Checkpoint, we will run it:
checkpoint_result = checkpoint.run()
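The returned CheckpointResult reports whether every Validation in the run passed; a quick check looks like this:
# True only if all Validations in the Checkpoint run succeeded.
print(checkpoint_result.success)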
4.2: Build and view Data Docs
Since we used a SimpleCheckpoint, our Checkpoint already contained an UpdateDataDocsAction, which rendered our Data Docs from the Validation Results we just generated. That means our Data Docs Store will contain a new entry for the rendered Validation Result.
For more information on Actions that Checkpoints can perform and how to add them, please see our guides on Actions.
Viewing this new entry is as simple as running:
context.open_data_docs()