Version: 0.14.13

How to configure a DataConnector for splitting and sampling a file system or blob store

This guide will help you configure Splitting and Sampling for any file-based data store (e.g., filesystem, cloud blob storage) using a Configured Asset Data Connector (the same Splitting and Sampling configuration options can be readily applied to an Inferred Asset Data Connector).

The Splitting and Sampling mechanisms provided by Great Expectations serve as additional tools for Partitioning your data at various levels of granularity:

  • Splitting provides the means of focusing the Batch (a selection of records from a Data Asset) on the values of certain dimensions of the data of interest.
  • Sampling provides a means of reducing the amount of data in the retrieved Batch to facilitate data analysis.
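As a rough illustration of the distinction, here is what the two operations amount to on a plain pandas DataFrame (the toy data is invented for this sketch; this is not the Great Expectations implementation):

```python
import pandas as pd

# Toy data standing in for a Data Asset's records (invented for illustration).
df = pd.DataFrame({
    "passenger_count": [1, 2, 2, 3, 2, 2],
    "fare": [10.0, 7.5, 12.0, 20.0, 9.0, 5.5],
})

# Splitting: focus the Batch on one value of a dimension of interest.
split = df[df["passenger_count"] == 2]

# Sampling: reduce the retrieved Batch to a fraction of its rows.
sample = split.sample(frac=0.5, random_state=0)
```

Splitting narrows the Batch by a predicate on the data's dimensions; Sampling then shrinks whatever the split selected.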

This guide will use the Data Connector (which provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets) named configured_data_connector_name from the following example Datasource configuration:

datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name

  configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
    assets:
      taxi_data_flat:
        base_directory: samples_2020
        pattern: (yellow_tripdata_sample_.+)\\.csv
        group_names:
          - filename
      taxi_data_year_month:
        base_directory: samples_2020
        pattern: ([\\w]+)_tripdata_sample_(\\d{{4}})-(\\d{{2}})\\.csv
        group_names:
          - name
          - year
          - month
"""

Preliminary Steps

1. Instantiate your project's DataContext

Import these necessary packages and modules.

import os

from ruamel import yaml

import great_expectations as ge
from great_expectations.core.batch import BatchRequest

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

2. Configure your Datasource

Using the above example configuration, add in the path to a directory that contains your data. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Given that the glob_directive in the example configuration is *.csv, if you specified a directory containing CSV files, then you will see them listed as Available data_asset_names in the output of test_yaml_config().

Feel free to adjust your configuration and re-run test_yaml_config() as needed.

3. Save the Datasource configuration to your DataContext

Save the configuration into your DataContext by using the add_datasource() function.

context.add_datasource(**yaml.load(datasource_yaml))

Splitting and Sampling

To configure Splitting, specify a dimension (i.e., column_name or column_names), the method of Splitting, and parameters to be used by the specified Splitting method.

To configure Sampling, specify the method of Sampling and parameters to be used by the specified Sampling method.

batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={
        "batch_filter_parameters": {
            "<YOUR_BATCH_FILTER_PARAMETER_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_VALUE>",
        }
    },
    batch_spec_passthrough={
        "splitter_method": "<YOUR_SPLITTER_METHOD>",
        "splitter_kwargs": {
            "<YOUR_SPLITTER_OBJECTIVE_NAME>": "<YOUR_SPLITTER_OBJECTIVE_KEYS>",
            "batch_identifiers": {
                "<YOUR_SPLITTER_OBJECTIVE_0_KEY>": "<YOUR_SPLITTER_OBJECTIVE_0_VALUE>",
                "<YOUR_SPLITTER_OBJECTIVE_1_KEY>": "<YOUR_SPLITTER_OBJECTIVE_1_VALUE>",
                "<YOUR_SPLITTER_OBJECTIVE_2_KEY>": "<YOUR_SPLITTER_OBJECTIVE_2_VALUE>",
                # ...
            },
        },
        "sampling_method": "<YOUR_SAMPLING_METHOD>",
        "sampling_kwargs": {
            "<YOUR_SAMPLING_ARGUMENT_0_NAME>": "<YOUR_SAMPLING_ARGUMENT_0_VALUE>",
            "<YOUR_SAMPLING_ARGUMENT_1_NAME>": "<YOUR_SAMPLING_ARGUMENT_1_VALUE>",
            "<YOUR_SAMPLING_ARGUMENT_2_NAME>": "<YOUR_SAMPLING_ARGUMENT_2_VALUE>",
            # ...
        },
    },
)
:::info

Currently, the configuration of Splitting and Sampling as part of the YAML configuration is not supported; it must be done using batch_spec_passthrough as illustrated above.

:::

To customize the configuration for the present example, first, specify the data_connector_query to select the Batch at the Partitioning level of granularity.

batch_request.data_connector_query["batch_filter_parameters"] = {
    "year": "2020",
    "month": "01",
}
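To see why this query selects a single file, note how the taxi_data_year_month asset's regex decomposes a file name into batch identifiers (the file name below is hypothetical, chosen to match the asset's pattern):

```python
import re

# The taxi_data_year_month asset's regex and its group_names.
pattern = r"([\w]+)_tripdata_sample_(\d{4})-(\d{2})\.csv"
group_names = ["name", "year", "month"]

# Hypothetical file name matching the asset's pattern.
match = re.match(pattern, "yellow_tripdata_sample_2020-01.csv")
identifiers = dict(zip(group_names, match.groups()))
print(identifiers)  # {'name': 'yellow', 'year': '2020', 'month': '01'}
```

A Batch is kept only if its identifiers agree with every key given in batch_filter_parameters, so the query above retains just the files for January 2020.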

Next, specify Splitting and Sampling directives.

For the present example, we can split according to the "passenger_count" column with the focus on two-passenger rides:

batch_request.batch_spec_passthrough["splitter_method"] = "_split_on_column_value"
batch_request.batch_spec_passthrough["splitter_kwargs"] = {
    "column_name": "passenger_count",
    "batch_identifiers": {"passenger_count": 2},
}

We can then obtain a random 10% of the rows in the batch:

batch_request.batch_spec_passthrough["sampling_method"] = "_sample_using_random"
batch_request.batch_spec_passthrough["sampling_kwargs"] = {"p": 1.0e-1}
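With p=1.0e-1, roughly 10% of the rows survive. As a sketch of the semantics (an approximation of per-row random sampling, not the library's code), each row can be kept independently with probability p, so the sample size is close to, but not exactly, p times the row count:

```python
import random

import pandas as pd

random.seed(0)
df = pd.DataFrame({"x": range(1000)})

# Keep each row independently with probability p; the resulting size
# is close to p * len(df) but varies from run to run.
p = 0.1
sample = df[[random.random() < p for _ in range(len(df))]]
```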

Finally, confirm that the expected number of batches was retrieved and that Sampling reduced the size of the batch:

batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 1
assert batch_list[0].data.dataframe.shape[0] < 200

Additional Notes

Available Splitting methods and their configuration parameters:

| Method | Parameters | Returned Batch Data |
| --- | --- | --- |
| `_split_on_whole_table` | N/A | identical to original |
| `_split_on_column_value` | `column_name='col'`, `batch_identifiers={'col': value}` | rows where the value of `column_name` equals the specified `value` |
| `_split_on_converted_datetime` | `column_name='col'`, `date_format_string='%Y-%m-%d'`, `batch_identifiers={'col': matching_string}` | rows where the value of `column_name`, converted to a datetime using `date_format_string`, equals the `matching_string` provided |
| `_split_on_divided_integer` | `column_name='col'`, `divisor=<int>`, `batch_identifiers={'col': matching_divisor}` | rows where the value of `column_name`, integer-divided by `divisor`, equals `matching_divisor` |
| `_split_on_mod_integer` | `column_name='col'`, `mod=<int>`, `batch_identifiers={'col': matching_mod_value}` | rows where the value of `column_name` modulo `mod` equals `matching_mod_value` |
| `_split_on_multi_column_values` | `column_names=<list[col]>`, `batch_identifiers={'col_0': value_0, 'col_1': value_1, ...}` | rows where the values of `column_names` equal the values specified for the corresponding columns |
| `_split_on_hashed_column` | `column_name='col'`, `hash_digits=<int>`, `hash_function_name='md5'`, `batch_identifiers={'hash_value': value}` | rows where the value of `column_name`, hashed with `hash_function_name` and truncated to the stated number of `hash_digits`, equals the `hash_value` provided |
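As a sketch of the integer-based splitters' semantics (a pandas illustration with an invented `vendor_id` column, not the library's implementation), `_split_on_divided_integer` and `_split_on_mod_integer` correspond to the following row filters:

```python
import pandas as pd

# Invented column for illustration.
df = pd.DataFrame({"vendor_id": [10, 11, 12, 13, 14, 15]})

# _split_on_divided_integer semantics: keep rows where
# column // divisor == matching_divisor (all six rows here).
divided = df[df["vendor_id"] // 10 == 1]

# _split_on_mod_integer semantics: keep rows where
# column % mod == matching_mod_value (vendor_id 12 and 15 here).
modded = df[df["vendor_id"] % 3 == 0]
```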

Available Sampling methods and their configuration parameters:

| Method | Parameters | Returned Batch Data |
| --- | --- | --- |
| `_sample_using_random` | `p=fraction` | rows selected at random, whose number amounts to the selected fraction of the total number of rows in the Batch |
| `_sample_using_mod` | `column_name='col'`, `mod=<int>`, `value=<int>` | take `column_name` modulo `mod`, and keep only the rows where the result equals the given `value` |
| `_sample_using_a_list` | `column_name='col'`, `value_list=<list[val]>` | match the values in `column_name` against `value_list`, and keep only the matching rows |
| `_sample_using_hash` | `column_name='col'`, `hash_digits=<int>`, `hash_value=<str>`, `hash_function_name='md5'` | hash the values in `column_name` (using the specified `hash_function_name`), and keep only the rows whose truncated hash matches the given `hash_value` |
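The hash-based sampler keeps the rows whose truncated column hash equals `hash_value`. The following pandas/hashlib sketch mirrors that semantics on an invented `id` column (it illustrates the idea, not the library's code):

```python
import hashlib

import pandas as pd

# Invented column for illustration.
df = pd.DataFrame({"id": ["a", "b", "c", "d"]})

def truncated_md5(value, hash_digits=1):
    # Hash the value with md5 and keep the last hash_digits hex digits.
    return hashlib.md5(str(value).encode()).hexdigest()[-hash_digits:]

# Keep only the rows whose truncated hash equals the chosen hash_value.
df["h"] = df["id"].map(truncated_md5)
kept = df[df["h"] == "f"]
```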

To view the full script used in this page, see it on GitHub: