How to get a Batch of data from a configured Datasource
This guide will help you load a Batch (a selection of records from a Data Asset) for introspection and validation using an active Data Connector (which provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets). For guides on loading batches of data from specific Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems) using a Data Connector, see the Datasource-specific guides in the "Connecting to your data" section.
What used to be called a "Batch" in the old API was replaced with the Validator (used to run an Expectation Suite against data). A Validator knows how to Validate (apply an Expectation Suite to) a particular Batch of data on a particular Execution Engine (a system capable of processing data to compute Metrics) against a particular Expectation Suite (a collection of verifiable assertions about data). In interactive mode, the Validator can store and update an Expectation Suite while conducting Data Discovery or Exploratory Data Analysis.
You can read more about the core classes that make Great Expectations run in our Core Concepts reference guide.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
Steps: Loading a Batch of data
To load a Batch, the steps you will take are the same regardless of the type of Datasource or Data Connector you have set up. To learn more about Datasources, Data Connectors, and Batches, see the Datasources Core Concepts Guide in the Core Concepts reference guide.
1. Construct a BatchRequest
import great_expectations as ge
from great_expectations.core.batch import BatchRequest

context = ge.get_context()

# Here is an example BatchRequest for all batches associated with the specified DataAsset
batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)
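If you are unsure which names to plug in, one option (a sketch using the Data Context's name-listing method) is to print the Data Assets your configured Datasources expose and copy the names from there:

# A sketch: list datasources, their data connectors, and the data assets each exposes
print(context.get_available_data_asset_names())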
Since a BatchRequest can return multiple Batches, you can optionally provide additional parameters to filter the retrieved Batches. See the Datasources Core Concepts Guide for more information on filtering options besides batch_filter_parameters and limit, including custom filter functions and sampling. The example BatchRequests below show several non-exhaustive possibilities.
# This BatchRequest adds a query and limit to retrieve only the first 5 batches from 2020
data_connector_query_2020 = {
    "batch_filter_parameters": {"param_1_from_your_data_connector_eg_year": "2020"}
}

batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
    limit=5,
)
# Here is an example `data_connector_query` filtering based on parameters from `group_names`
# previously defined in a regex pattern in your Data Connector:
data_connector_query_202001 = {
    "batch_filter_parameters": {
        "param_1_from_your_data_connector_eg_year": "2020",
        "param_2_from_your_data_connector_eg_month": "01",
    }
}

# This BatchRequest will use the above filter to retrieve only the batch from Jan 2020
batch_request_202001 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)
# Here is an example `data_connector_query` filtering based on an `index` which can be
# any valid python slice. The example here is retrieving the latest batch using `-1`:
data_connector_query_last_index = {
    "index": -1,
}

last_index_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
)
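Because `index` accepts any valid Python slice, you can also retrieve a contiguous range of Batches. Here is a minimal sketch (assuming your Data Asset has at least three Batches) that requests the last three:

# A sketch: retrieve the last three batches using a Python slice object
data_connector_query_last_three = {
    "index": slice(-3, None),
}
last_three_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_three,
)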
You may also wish to list available batches to verify that your BatchRequest is retrieving the correct Batches, or to see which are available. You can use context.get_batch_list() for this purpose, which can take a variety of flexible input types similar to a BatchRequest. Some examples are shown below:
# List all Batches associated with the DataAsset
batch_list_all_a = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)

# Alternatively you can use the previously created batch_request to achieve the same thing
batch_list_all_b = context.get_batch_list(batch_request=batch_request)

# You can use a query to filter the batch_list
batch_list_202001_query = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)

# Or limit to a specific number of batches
batch_list_all_limit_10 = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    limit=10,
)
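To inspect what came back, you can iterate over the returned list; each Batch carries a batch_definition describing how it was resolved. A minimal sketch:

# A sketch: print the identifying information for each retrieved Batch
for batch in batch_list_all_b:
    print(batch.batch_definition)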
2. Get access to your Batch via a Validator
# First create an expectation suite to use with our validator
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)

# Now create our validator
validator = context.get_validator(
    batch_request=last_index_batch_request, expectation_suite_name="test_suite"
)
3. Check your data
You can check that the first few lines of the Batch you loaded into your Validator are what you expect by running:
print(validator.head())
Now that you have a Validator, you can use it to create Expectations or validate the data.
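For example, here is a minimal sketch of adding an Expectation interactively and saving the suite; the column name is a placeholder you would replace with one from your own data:

# A sketch: add an Expectation interactively, then persist the suite
validator.expect_column_values_to_not_be_null(column="<YOUR_COLUMN_NAME>")
validator.save_expectation_suite(discard_failed_expectations=False)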
Additional Batch querying and loading examples
We will use the "Yellow Taxi" dataset example from How to configure a DataConnector to introspect and partition a file system or blob store to demonstrate the Batch querying possibilities enabled by the particular data partitioning strategy specified as part of the Data Connector configuration.
1. Partition only by file name and type
In this example, the Data Asset (a collection of records within a Datasource, usually named after the underlying data system and sliced to correspond to a desired specification) is taxi_data_flat in the Data Connector configured_data_connector_name. It represents a relatively general naming structure of files in a directory, where each file name has a certain prefix (e.g., yellow_tripdata_sample_) and the contents are of the desired type (e.g., CSV):
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\.csv
      group_names:
        - filename
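For reference, here is a hypothetical directory listing that this configuration would match; the samples_2020 directory and monthly file names follow the "Yellow Taxi" example, and your files may differ:

samples_2020/yellow_tripdata_sample_2020-01.csv
samples_2020/yellow_tripdata_sample_2020-02.csv
...
samples_2020/yellow_tripdata_sample_2020-12.csv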
To query for Batch objects, set data_asset_name to taxi_data_flat in the following BatchRequest specification. (Customize for your own data set, as appropriate.)
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
)
Then perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch. For example (be sure to adjust this code to match the specifics of your data and configuration):
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 12
assert batch_list[0].data.dataframe.shape[0] == 10000
2. Partition by year and month
Next, use the more detailed partitioning strategy represented by the Data Asset taxi_data_year_month in the Data Connector configured_data_connector_name:
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\.csv
      group_names:
        - filename
    taxi_data_year_month:
      base_directory: samples_2020
      pattern: ([\w]+)_tripdata_sample_(\d{4})-(\d{2})\.csv
      group_names:
        - name
        - year
        - month
The Data Asset taxi_data_year_month in the above example configuration identifies three parts of a file path: name (as in "company name"), year, and month. This partitioning affords a rich set of filtering capabilities, ranging from specifying the exact values of the file name structure's components to allowing Python functions for implementing custom criteria.
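To make the group mapping concrete, here is how a hypothetical file name would be decomposed into batch identifiers under this pattern:

# Hypothetical example: how the regex groups map a file name to batch identifiers
# "yellow_tripdata_sample_2020-01.csv" -> {"name": "yellow", "year": "2020", "month": "01"}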
To perform experiments supported by this configuration, set data_asset_name to taxi_data_year_month in the following BatchRequest specification (customize for your own data set, as appropriate):
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={"custom_filter_function": "<YOUR_CUSTOM_FILTER_FUNCTION>"},
)
To obtain the data for the nine months of February through October, apply the following custom filter:
batch_request.data_connector_query["custom_filter_function"] = (
    lambda batch_identifiers: batch_identifiers["name"] == "yellow"
    and 1 < int(batch_identifiers["month"]) < 11
)
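The custom filter function receives each Batch's batch_identifiers dictionary and returns True for the Batches to keep. The same filter can be written as a named function for readability (a sketch equivalent to the lambda above):

# A named equivalent of the lambda above: keep "yellow" batches for months 02 through 10
def yellow_feb_through_oct(batch_identifiers: dict) -> bool:
    return (
        batch_identifiers["name"] == "yellow"
        and 1 < int(batch_identifiers["month"]) < 11
    )

batch_request.data_connector_query["custom_filter_function"] = yellow_feb_through_oct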
Now, perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch:
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 9
assert batch_list[0].data.dataframe.shape[0] == 10000
You can then identify a particular Batch (e.g., corresponding to the year and month of interest) and retrieve it for data analysis as follows:
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={
        "batch_filter_parameters": {
            "<YOUR_BATCH_FILTER_PARAMETER_0_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_0_VALUE>",
            "<YOUR_BATCH_FILTER_PARAMETER_1_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_1_VALUE>",
            "<YOUR_BATCH_FILTER_PARAMETER_2_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_2_VALUE>",
        }
    },
)
Note that in the present example, there can be up to three BATCH_FILTER_PARAMETER key-value pairs, because the regular expression for the Data Asset taxi_data_year_month defines three groups: name, year, and month.
batch_request.data_connector_query["batch_filter_parameters"] = {
    "year": "2020",
    "month": "01",
}
(Be sure to adjust the above code snippets to match the specifics of your data and configuration.)
Now, perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch:
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 1
assert batch_list[0].data.dataframe.shape[0] == 10000
Omitting the batch_filter_parameters key from the data_connector_query is interpreted as the least restrictive (broadest) query, resulting in the largest number of Batch objects being returned.
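As a minimal sketch (reusing the twelve monthly sample files from the example above), clearing the query returns every Batch for the Data Asset:

# A sketch: an empty data_connector_query places no restriction on the returned Batches
batch_request.data_connector_query = {}
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 12  # assumes the twelve monthly sample files shown earlier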
To view the full script used in this page, see it on GitHub: