Version: 0.14.13

How to configure a RuntimeDataConnector

This guide demonstrates how to configure a RuntimeDataConnector and only applies to the V3 (Batch Request) API. A RuntimeDataConnector allows you to specify a Batch (a selection of records from a Data Asset) using a Runtime Batch Request (provided to a Datasource in order to create a Batch), which is used to create a Validator. A Validator (used to run an Expectation Suite against data) is the key object used to create Expectations (verifiable assertions about data) and to Validate (apply an Expectation Suite to a Batch) datasets.

Prerequisites: This how-to guide assumes you have:

A RuntimeDataConnector is a special kind of Data Connector that enables you to use a RuntimeBatchRequest to provide a Batch's data directly at runtime. The RuntimeBatchRequest can wrap an in-memory dataframe, a filepath, or a SQL query, and must include batch identifiers that uniquely identify the data (e.g. a run_id from an AirFlow DAG run). The batch identifiers that must be passed in at runtime are specified in the RuntimeDataConnector's configuration.
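For example, if your batches are produced by Airflow runs, you might declare identifiers like the following. The connector and identifier names here are hypothetical; choose names meaningful to your pipeline:

```yaml
data_connectors:
  my_runtime_data_connector:  # hypothetical connector name
    class_name: RuntimeDataConnector
    batch_identifiers:
      # Every RuntimeBatchRequest against this connector must then supply
      # a value for each of these keys in its batch_identifiers argument.
      - airflow_run_id
      - run_date
```

A request against this connector would then pass something like `batch_identifiers={"airflow_run_id": "<RUN ID>", "run_date": "<DATE>"}`.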

Steps

1. Instantiate your project's DataContext

Import the necessary packages and modules, then instantiate your Data Context:

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

2. Set up a Datasource

All of the examples below assume you’re testing configuration using something like:

datasource_yaml = """
name: taxi_datasource
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  <DATACONNECTOR NAME GOES HERE>:
    <DATACONNECTOR CONFIGURATION GOES HERE>
"""
context.test_yaml_config(yaml_config=datasource_yaml)

If you’re not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config

3. Add a RuntimeDataConnector to a Datasource configuration

This basic configuration can be used in multiple ways depending on how the RuntimeBatchRequest is configured:

datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""

Once the RuntimeDataConnector is configured, you can add your Datasource (which provides a standard API for accessing and interacting with data from a wide variety of source systems) using:

from ruamel import yaml

context.add_datasource(**yaml.load(datasource_yaml))

Example 1: RuntimeDataConnector for access to file-system data:

At runtime, you would get a Validator from the Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components) by first defining a RuntimeBatchRequest with the path to your data defined in runtime_parameters:

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<PATH TO YOUR DATA HERE>"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)

Next, you would pass that request into context.get_validator:

validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="<MY EXPECTATION SUITE NAME>",
)

Example 2: RuntimeDataConnector that uses an in-memory DataFrame

At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the DataFrame passed into batch_data in runtime_parameters:

import pandas as pd
path = "<PATH TO YOUR DATA HERE>"
df = pd.read_csv(path)

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # Pass your DataFrame here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
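Note that batch_data does not have to come from read_csv; any in-memory pandas DataFrame can be wrapped. A minimal sketch, with column names made up for illustration:

```python
import pandas as pd

# Hypothetical data; any DataFrame can be passed as batch_data.
df = pd.DataFrame(
    {
        "vendor_id": [1, 2, 1],
        "fare_amount": [7.5, 12.0, 5.25],
    }
)

# This DataFrame would be supplied via runtime_parameters={"batch_data": df}.
print(df.shape)  # (3, 2)
```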

Next, you would pass that request into context.get_validator:

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="<MY EXPECTATION SUITE NAME>",
)
print(validator.head())
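As noted above, runtime_parameters can also wrap a SQL query instead of a path or a DataFrame, provided the Datasource is configured with a SqlAlchemyExecutionEngine rather than the PandasExecutionEngine shown in these examples. A hedged sketch of the request arguments (the table name and asset name are hypothetical), which you would then pass as RuntimeBatchRequest(**batch_request_kwargs):

```python
# Sketch only: assumes a Datasource backed by a SqlAlchemyExecutionEngine.
# The table name "taxi_trips" and the asset name below are made up.
batch_request_kwargs = {
    "datasource_name": "taxi_datasource",
    "data_connector_name": "default_runtime_data_connector_name",
    "data_asset_name": "taxi_trips_sample",
    "runtime_parameters": {"query": "SELECT * FROM taxi_trips LIMIT 100"},
    "batch_identifiers": {"default_identifier_name": "sql_sample_run"},
}
```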

Additional Notes

To view the full script used in this page, see it on GitHub: