Version: 0.14.13

How to configure a DataConnector to introspect and partition a file system or blob store

This guide will help you introspect and partition any file type data store (e.g., filesystem, cloud blob storage) using an Active Data Connector. For background on connecting to different backends, please see the Datasource specific guides in the "Connecting to your data" section.

File-based introspection and partitioning are useful for:

  • Exploring the types, subdirectory location, and filepath naming structures of the files in your dataset, and
  • Organizing the discovered files into Data Assets (collections of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification) according to the identified structures.

Partitioning enables you to select the desired subsets of your dataset for Validation.

Prerequisites: This how-to guide assumes you have completed the Getting Started Tutorial and have a working installation of Great Expectations.

We will use the "Yellow Taxi" dataset to walk you through the configuration of Data Connectors. Starting with the bare-bones version of either an Inferred Asset Data Connector or a Configured Asset Data Connector, we gradually build out the configuration until your files are introspected with semantics consistent with your goals.

To learn more about Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems), Data Connectors (which provide the configuration details a Datasource needs to define its Data Assets), and Batches (selections of records from a Data Asset), please see our Datasources Core Concepts Guide in the Core Concepts reference guide.

Preliminary Steps

1. Instantiate your project's DataContext

Import Great Expectations.

import great_expectations as ge

2. Obtain DataContext

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

Configuring Inferred Asset Data Connector and Configured Asset Data Connector

1. Configure your Datasource

Start with an elementary Datasource configuration, containing just one general Inferred Asset Data Connector component:

datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
"""

Using the above example configuration, add in the path to a directory that contains your data. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Given that the glob_directive in the example configuration is *.csv, if you specified a directory containing CSV files, then you will see them listed as Available data_asset_names in the output of test_yaml_config().
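For example, with the monthly Yellow Taxi sample files in place, the relevant portion of the output looks roughly like the following; the exact counts, filenames, and formatting depend on your data and your Great Expectations version, so treat this as illustrative only:

	Available data_asset_names (3 of 36):
		yellow_tripdata_sample_2019-01.csv (1 of 1): ['yellow_tripdata_sample_2019-01.csv']
		yellow_tripdata_sample_2019-02.csv (1 of 1): ['yellow_tripdata_sample_2019-02.csv']
		yellow_tripdata_sample_2019-03.csv (1 of 1): ['yellow_tripdata_sample_2019-03.csv']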

Feel free to adjust your configuration and re-run test_yaml_config() to experiment as pertinent to your case.
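So far the example has used only the inferred variant. For the Configured Asset Data Connector named in this section's heading, you declare every Data Asset explicitly instead of inferring it from filenames. Below is a minimal sketch under the same assumptions as above; the Datasource name, the connector name, the asset name taxi_data, and the per-asset pattern are illustrative placeholders, not prescribed values:

configured_datasource_yaml = f"""
name: taxi_datasource_configured
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    assets:
      taxi_data:
        pattern: (.*)\.csv
        group_names:
          - partition_name
"""

context.test_yaml_config(configured_datasource_yaml)

Because the asset is named by you rather than inferred, test_yaml_config() should report taxi_data as the single available Data Asset, with one Batch per matching file.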

An integral part of the recommended approach, illustrated throughout this exercise, is the use of the internal Great Expectations utility

context.test_yaml_config(
    yaml_string,
    pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)

to ensure the correctness of the proposed YAML configuration prior to incorporating it and trying to use it.

For instance, try the following erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- the default_inferred_data_connector_name configuration section):

buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names: # the required "data_asset_name" reserved group name for "InferredAssetFilePathDataConnector" is absent
      - nonexistent_group_name
"""

Then add in the path to a directory that contains your data, and again run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Notice that the output reports only one data_asset_name, called DEFAULT_ASSET_NAME, signaling a misconfiguration.
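A correct pattern, by contrast, includes the reserved data_asset_name group and may add further groups to partition each asset's files. The sketch below assumes the Yellow Taxi sample files are named like yellow_tripdata_sample_2019-01.csv; note the doubled braces, which escape the regex quantifiers inside the f-string:

partitioned_data_connector_yaml = f"""
working_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (yellow_tripdata_sample)_(\d{{4}})-(\d{{2}})\.csv
    group_names:
      - data_asset_name
      - year
      - month
"""

With this pattern, all monthly files roll up into a single Data Asset named yellow_tripdata_sample, and each file becomes a separate Batch identified by its year and month, which is exactly the kind of partitioning you can later use to select subsets for Validation.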

Now try another erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- your existing DataConnector configuration sections):

another_buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_BAD_DATA_DIRECTORY_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
"""

where you would add in the path to a directory that does not exist; then run this code again to test your configuration:

context.test_yaml_config(datasource_yaml)

You will see that the list of Data Assets is empty. Feel free to experiment with the arguments to

context.test_yaml_config(
    yaml_string,
    pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)

For instance, running

context.test_yaml_config(yaml_string, return_mode="report_object")

will return the information that would otherwise appear in standard output as a Python dictionary.
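For instance, a quick way to inspect the report is to pretty-print it; the json import and the default=str fallback below are just printing conveniences, not part of the Great Expectations API:

import json

report = context.test_yaml_config(datasource_yaml, return_mode="report_object")
# Pretty-print the nested report dictionary to inspect the discovered assets.
print(json.dumps(report, indent=2, default=str))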

Any structural errors (e.g., indentation, typos in class and configuration key names, etc.) will result in an exception being raised and sent to standard error. You can shorten the resulting traceback by running

context.test_yaml_config(yaml_string, shorten_tracebacks=True)

which shows the line numbers where the exception occurred, most likely caused by a failure to instantiate the required class (in this case InferredAssetFilesystemDataConnector).

2. Save the Datasource configuration to your DataContext

Once the basic Datasource configuration is error-free and satisfies your requirements, save it into your DataContext by using the add_datasource() function. Parsing the YAML string requires the ruamel parser that ships with Great Expectations:

from ruamel import yaml

context.add_datasource(**yaml.load(datasource_yaml))
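To confirm that the Datasource was saved, you can list the Datasources known to your DataContext:

context.list_datasources()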

3. Get names of available Data Assets

Getting the names of available Data Assets using an Inferred Asset Data Connector gives you visibility into the types and naming structures of the files in your filesystem or blob storage:

available_data_asset_names = context.datasources[
    "taxi_datasource"
].get_available_data_asset_names(
    data_connector_names="default_inferred_data_connector_name"
)["default_inferred_data_connector_name"]

assert len(available_data_asset_names) == 36
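Once you can see the available asset names, a natural next step is to request data from one of them. Here is a minimal sketch that assumes one of the inferred asset names is yellow_tripdata_sample_2019-01.csv; substitute a name from your own listing:

from great_expectations.core.batch import BatchRequest

# The asset name below is a placeholder; use one returned by
# get_available_data_asset_names() for your own data.
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2019-01.csv",
)
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 1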

To view the full scripts used in this page, see them on GitHub: