Version: 0.14.13

How to configure a DataConnector to introspect and partition a file system or blob store

This guide will help you introspect and partition any file type data store (e.g., filesystem, cloud blob storage) using an Active Data Connector. For background on connecting to different backends, please see the Datasource specific guides in the "Connecting to your data" section.

File-based introspection and partitioning are useful for:

  • Exploring the types, subdirectory location, and filepath naming structures of the files in your dataset, and
  • Organizing the discovered files into Data Assets (collections of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification) according to the identified structures.

Partitioning enables you to select the desired subsets of your dataset for Validation.

Prerequisites: This how-to guide assumes you have completed the Getting Started Tutorial and have a working installation of Great Expectations.

We will use the "Yellow Taxi" dataset to walk you through the configuration of Data Connectors. Starting with the bare-bones version of either an Inferred Asset Data Connector or a Configured Asset Data Connector, we gradually build out the configuration until your files are introspected with semantics consistent with your goals.

To learn more about Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems), Data Connectors (which provide the configuration details a Datasource needs to define its Data Assets), and Batches (selections of records from a Data Asset), please see our Datasources Core Concepts Guide in the Core Concepts reference guide.

Preliminary Steps

1. Instantiate your project's DataContext

Import Great Expectations.

import great_expectations as ge

2. Obtain DataContext

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

Configuring Inferred Asset Data Connector and Configured Asset Data Connector

1. Configure your Datasource

Start with an elementary Datasource configuration, containing just one general Inferred Asset Data Connector component:

datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
"""

Using the above example configuration, add in the path to a directory that contains your data. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Given that the glob_directive in the example configuration is *.csv, if you specified a directory containing CSV files, then you will see them listed as Available data_asset_names in the output of test_yaml_config().
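For example, with the monthly Yellow Taxi sample files in place, the relevant portion of the output looks roughly like the following; the exact counts, filenames, and formatting depend on your data and your Great Expectations version, so treat this as illustrative only:

	Available data_asset_names (3 of 36):
		yellow_tripdata_sample_2019-01.csv (1 of 1): ['yellow_tripdata_sample_2019-01.csv']
		yellow_tripdata_sample_2019-02.csv (1 of 1): ['yellow_tripdata_sample_2019-02.csv']
		yellow_tripdata_sample_2019-03.csv (1 of 1): ['yellow_tripdata_sample_2019-03.csv']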

Feel free to adjust your configuration and re-run test_yaml_config() to experiment as pertinent to your case.
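So far the example has used only the inferred variant. For the Configured Asset Data Connector named in this section's heading, you declare every Data Asset explicitly instead of inferring it from filenames. Below is a minimal sketch under the same assumptions as above; the Datasource name, the connector name, the asset name taxi_data, and the per-asset pattern are illustrative placeholders, not prescribed values:

configured_datasource_yaml = f"""
name: taxi_datasource_configured
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    assets:
      taxi_data:
        pattern: (.*)\.csv
        group_names:
          - partition_name
"""

context.test_yaml_config(configured_datasource_yaml)

Because the asset is named by you rather than inferred, test_yaml_config() should report taxi_data as the single available Data Asset, with one Batch per matching file.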

An integral part of the recommended approach, illustrated throughout this exercise, is the use of the internal Great Expectations utility

context.test_yaml_config(
    yaml_string,
    pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)

to ensure the correctness of the proposed YAML configuration prior to incorporating it and trying to use it.

For instance, try the following erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- the default_inferred_data_connector_name configuration section):

buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names: # the required "data_asset_name" reserved group name for "InferredAssetFilePathDataConnector" is absent
      - nonexistent_group_name
"""

Then add in the path to a directory that contains your data, and again run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Notice that the output reports only one data_asset_name, called DEFAULT_ASSET_NAME, signaling a misconfiguration.
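A correct pattern, by contrast, includes the reserved data_asset_name group and may add further groups to partition each asset's files. The sketch below assumes the Yellow Taxi sample files are named like yellow_tripdata_sample_2019-01.csv; note the doubled braces, which escape the regex quantifiers inside the f-string:

partitioned_data_connector_yaml = f"""
working_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (yellow_tripdata_sample)_(\d{{4}})-(\d{{2}})\.csv
    group_names:
      - data_asset_name
      - year
      - month
"""

With this pattern, all monthly files roll up into a single Data Asset named yellow_tripdata_sample, and each file becomes a separate Batch identified by its year and month, which is exactly the kind of partitioning you can later use to select subsets for Validation.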

Now try another erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- your existing DataConnector configuration sections):

another_buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_BAD_DATA_DIRECTORY_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
"""

where you would add in the path to a directory that does not exist; then run this code again to test your configuration:

context.test_yaml_config(datasource_yaml)

You will see that the list of Data Assets is empty. Feel free to experiment with the arguments to

context.test_yaml_config(
    yaml_string,
    pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)

For instance, running

context.test_yaml_config(yaml_string, return_mode="report_object")

will return the information that would otherwise appear in standard output as a Python dictionary.
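For instance, a quick way to inspect the report is to pretty-print it; the json import and the default=str fallback below are just printing conveniences, not part of the Great Expectations API:

import json

report = context.test_yaml_config(datasource_yaml, return_mode="report_object")
# Pretty-print the nested report dictionary to inspect the discovered assets.
print(json.dumps(report, indent=2, default=str))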

Any structural errors (e.g., indentation, typos in class and configuration key names, etc.) will result in an exception being raised and sent to standard error. You can shorten the resulting traceback by running

context.test_yaml_config(yaml_string, shorten_tracebacks=True)

which shows the line numbers where the exception occurred, most likely caused by a failure to instantiate the required class (in this case InferredAssetFilesystemDataConnector).

2. Save the Datasource configuration to your DataContext

Once the basic Datasource configuration is error-free and satisfies your requirements, save it into your DataContext by using the add_datasource() function. Parsing the YAML string requires the ruamel parser that ships with Great Expectations:

from ruamel import yaml

context.add_datasource(**yaml.load(datasource_yaml))
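To confirm that the Datasource was saved, you can list the Datasources known to your DataContext:

context.list_datasources()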

3. Get names of available Data Assets

Getting the names of available Data Assets using an Inferred Asset Data Connector gives you visibility into the types and naming structures of the files in your filesystem or blob storage:

available_data_asset_names = context.datasources[
    "taxi_datasource"
].get_available_data_asset_names(
    data_connector_names="default_inferred_data_connector_name"
)["default_inferred_data_connector_name"]

assert len(available_data_asset_names) == 36
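Once you can see the available asset names, a natural next step is to request data from one of them. Here is a minimal sketch that assumes one of the inferred asset names is yellow_tripdata_sample_2019-01.csv; substitute a name from your own listing:

from great_expectations.core.batch import BatchRequest

# The asset name below is a placeholder; use one returned by
# get_available_data_asset_names() for your own data.
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2019-01.csv",
)
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 1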

To view the full scripts used in this page, see them on GitHub: