How to configure an Expectation Store to use Amazon S3
By default, newly Profiled Expectations are stored as Expectation Suites in JSON format in the expectations/ subdirectory of your great_expectations/ folder. This guide will help you configure Great Expectations to store them in an Amazon S3 bucket.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured a Data Context
- Configured an Expectation Suite
- The ability to install boto3 in your local environment.
- Identified the S3 bucket and prefix where Expectations will be stored.
Steps
1. Install boto3 with pip
Python interacts with AWS through the boto3 library. Great Expectations uses this library in the background when working with AWS, so although you will not need to use boto3 directly, you will need to have it installed in your virtual environment.
You can do this with the pip command:
python -m pip install boto3
or
python3 -m pip install boto3
For more detailed instructions on setting up boto3 with AWS, and for information on using boto3 from within Python, please refer to boto3's documentation site.
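To confirm that the install reached the interpreter you intend to use, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of Great Expectations or boto3.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if this interpreter can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report whether boto3 is visible to the interpreter running this script.
print("boto3 available:", is_installed("boto3"))
```

If this reports False inside your virtual environment, the pip command above was probably run with a different interpreter.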
2. Verify your AWS credentials are properly configured
If you have installed the AWS CLI, you can verify that your AWS credentials are properly configured by running the command:
aws sts get-caller-identity
If your credentials are properly configured, this will output your UserId, Account, and Arn. If they are not configured correctly, the command will return an error.
If the command returns an error, or if you were unable to use the AWS CLI to verify your credentials, you can find additional guidance in Amazon's documentation on configuring the AWS CLI.
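If the AWS CLI is not available, roughly the same check can be made from Python through boto3's STS client, whose `get_caller_identity` call returns the same three fields. The sketch below is an illustration under that assumption; `describe_caller` is a hypothetical helper, and it takes the client as an argument so it can be exercised without network access.

```python
def describe_caller(sts_client) -> dict:
    """Return the UserId, Account and Arn that STS reports for the active credentials."""
    identity = sts_client.get_caller_identity()
    # Keep only the three fields that `aws sts get-caller-identity` prints.
    return {key: identity[key] for key in ("UserId", "Account", "Arn")}
```

With boto3 installed, calling `describe_caller(boto3.client("sts"))` should return those values, or raise a botocore exception if credentials are missing or invalid.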
3. Identify your Data Context Expectations Store
You can find your Expectation Store's configuration within your Data Context.
In your great_expectations.yml file, look for the following lines:
expectations_store_name: expectations_store

stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
This configuration tells Great Expectations to look for Expectations in a store called expectations_store. The base_directory for expectations_store is set to expectations/ by default.
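The lookup described above (read expectations_store_name, then find the entry with that name under stores) can be sketched in Python. The function name `active_expectations_backend` is hypothetical, and the dictionary simply mirrors the default configuration as it would appear after parsing the YAML.

```python
def active_expectations_backend(config: dict) -> dict:
    """Return the store_backend settings of the active Expectations Store."""
    store_name = config["expectations_store_name"]
    return config["stores"][store_name]["store_backend"]


# The default configuration, written out as a parsed dictionary.
default_config = {
    "expectations_store_name": "expectations_store",
    "stores": {
        "expectations_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "expectations/",
            },
        }
    },
}

backend = active_expectations_backend(default_config)
print(backend["class_name"], backend["base_directory"])
# prints: TupleFilesystemStoreBackend expectations/
```

To run this against a real great_expectations.yml you would first parse the file, for example with PyYAML's `yaml.safe_load` if that package is installed.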
4. Update your configuration file to include a new Store for Expectations on S3
You can manually add an Expectations Store by adding the configuration shown below to the stores section of your great_expectations.yml file.
stores:
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
To make the Store work with S3, you will need to change the default store_backend settings, as has been done in the above example. The class_name should be set to TupleS3StoreBackend, bucket should be set to the name of your S3 bucket, and prefix should be set to the folder in your S3 bucket where Expectation files will be located.
Additional options are available for more fine-grained customization of the TupleS3StoreBackend:
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
In the above example, please also note that the new Store's name is set to expectations_S3_store. This value can be any name you like, as long as you also update the value of the expectations_store_name key to match the new Store's name.
expectations_store_name: expectations_S3_store
This update to the value of the expectations_store_name key will tell Great Expectations to use the new Store for Expectations.
If you are also storing Validations or Data Docs in S3, please ensure that the prefix values are disjoint and that one is not a substring of another.
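The prefix rule can be checked mechanically before you save the configuration. The following sketch flags any pair of prefixes where one is equal to, or contained in, the other; the function name `overlapping_prefixes` and the example prefix values are hypothetical.

```python
from itertools import combinations


def overlapping_prefixes(prefixes: dict) -> list:
    """Return pairs of store names whose prefixes overlap.

    Two prefixes overlap when they are equal or one is a substring of the other.
    """
    conflicts = []
    for (name_a, prefix_a), (name_b, prefix_b) in combinations(prefixes.items(), 2):
        if prefix_a in prefix_b or prefix_b in prefix_a:
            conflicts.append((name_a, name_b))
    return conflicts


# Hypothetical prefixes: "gx" is a substring of the other two, so it conflicts with both.
print(overlapping_prefixes({
    "expectations": "gx/expectations",
    "validations": "gx/validations",
    "data_docs": "gx",
}))
# prints: [('expectations', 'data_docs'), ('validations', 'data_docs')]
```

An empty list means the prefixes satisfy the rule as stated here.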
5. Confirm that the new Expectations Store has been added
You can verify that your Stores are properly configured by running the command:
great_expectations store list
This will list the currently configured Stores that Great Expectations has access to. If you added a new S3 Expectations Store, the output should include the following ExpectationsStore entry:
- name: expectations_S3_store
  class_name: ExpectationsStore
  store_backend:
    class_name: TupleS3StoreBackend
    bucket: '<your_s3_bucket_name>'
    prefix: '<your_s3_bucket_folder_name>'
Notice that the output contains only one Expectation Store. Your configuration contains both the original expectations_store on the local filesystem and the expectations_S3_store we just configured, but the great_expectations store list command only lists your active Stores. For your Expectation Store, this is the one that you set as the value of the expectations_store_name key in the configuration file: expectations_S3_store.
6. Copy existing Expectation JSON files to the S3 bucket (This step is optional)
If you are converting an existing local Great Expectations deployment to one that works in AWS you may already have Expectations saved that you wish to keep and transfer to your S3 bucket.
One way to copy Expectations into Amazon S3 is by using the aws s3 sync command. As mentioned earlier, the base_directory is set to expectations/ by default.
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
In the example below, two Expectations, exp1 and exp2, are copied to Amazon S3. This results in the following output:
upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json
If you have Expectations to copy into S3, your output should look similar.
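If you would rather script the copy in Python, the same key layout can be produced with the `upload_file` method of a boto3 S3 client. This is a rough sketch and not a replacement for aws s3 sync: it uploads only .json files, uploads them unconditionally instead of skipping unchanged ones, and assumes a non-empty prefix. The function name `upload_expectations` is hypothetical.

```python
from pathlib import Path


def upload_expectations(s3_client, base_directory: str, bucket: str, prefix: str) -> list:
    """Upload every .json file under base_directory and return the object keys written."""
    base = Path(base_directory)
    uploaded_keys = []
    for path in sorted(base.rglob("*.json")):
        # Build the object key from the prefix and the path relative to base_directory.
        relative = path.relative_to(base).as_posix()
        key = f"{prefix.rstrip('/')}/{relative}"
        s3_client.upload_file(str(path), bucket, key)
        uploaded_keys.append(key)
    return uploaded_keys
```

With boto3 installed and credentials configured, a call such as `upload_expectations(boto3.client("s3"), "expectations/", "<your_s3_bucket_name>", "<your_s3_bucket_folder_name>")` would mirror the sync command above for JSON files.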
7. Confirm that Expectations can be accessed from Amazon S3 by running great_expectations suite list
If you followed the optional step to copy your existing Expectations to the S3 bucket, you can confirm that Great Expectations can find them by running the command:
great_expectations suite list
Your output should include the Expectations you copied to Amazon S3. In the example, these Expectations were stored in Expectation Suites named exp1 and exp2. This would result in the following output from the above command:
2 Expectation Suites found:
- exp1
- exp2
Your output should look similar, with the names of your Expectation Suites replacing the names from the example.
If you did not copy Expectations to the new Store, you will see a message saying no Expectations were found.
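As an independent cross-check, you can also look at the bucket directly. The sketch below lists objects with the `list_objects_v2` method of a boto3 S3 client and derives suite names from the keys. It assumes each suite is stored as <prefix>/<suite_name>.json, as in the sync output shown earlier, and it reads only the first page of results (up to 1,000 keys). Both function names are hypothetical.

```python
def suite_names_from_keys(keys, prefix: str) -> list:
    """Derive suite names from object keys of the form <prefix>/<suite_name>.json."""
    root = prefix.rstrip("/") + "/"
    names = []
    for key in keys:
        if key.startswith(root) and key.endswith(".json"):
            # Strip the leading prefix and the trailing ".json" extension.
            names.append(key[len(root):-len(".json")])
    return sorted(names)


def list_suite_names(s3_client, bucket: str, prefix: str) -> list:
    """List suite names stored under the prefix (first page of results only)."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when no objects match the prefix.
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return suite_names_from_keys(keys, prefix)
```

If the names returned here differ from what great_expectations suite list reports, the bucket or prefix in your Store configuration is the first thing to check.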