Version: 0.14.13

How to configure a DataConnector to introspect and partition tables in SQL

This guide will help you introspect and partition tables in an SQL database using SimpleSqlalchemyDatasource, which operates as a proxy to InferredAssetSqlDataConnector and ConfiguredAssetSqlDataConnector. For background, please see the Datasource specific guides in the Connecting to your data section of our documentation.

The SQL database introspection and partitioning are useful for:

Exploring the schema and column metadata of the tables in your SQL database, and
Organizing the tables into Data AssetsA collection of records within a Datasource which is usually named based on the underlying data system and sliced to correspond to a desired specification. according to the partitioning considerations informed by this exploration.

Partitioning enables you to select the desired subsets of your dataset for Validation.

Prerequisites: This how-to guide assumes you have:

Completed the Getting Started Tutorial
Have a working installation of Great Expectations
Configured and loaded a Data Context
Configured a Datasource and Data Connector

We will use the "Yellow Taxi" dataset to walk you through the configuration of SimpleSqlalchemyDatasource, where the introspection section characterizes InferredAssetSqlDataConnector objects and the tables section characterizes ConfiguredAssetSqlDataConnector objects. Starting with the bare-bones version of either the introspection section or the tables section of the SimpleSqlalchemyDatasource configuration, we gradually build out the configuration to achieve the introspection of your SQL database with the semantics consistent with your goals.

info

Only introspection and tables are the legal top-level keys in the SimpleSqlalchemyDatasource configuration.

To learn more about DatasourcesProvides a standard API for accessing and interacting with data from a wide variety of source systems., Data ConnectorsProvides the configuration details based on the source data system which are needed by a Datasource to define Data Assets., and Batch(es)A selection of records from a Data Asset., please see our Datasources Core Concepts Guide in the Core Concepts reference guide.

Preliminary Steps

1. Instantiate your project's DataContext

Import Great Expectations.

import great_expectations as ge

2. Obtain DataContext

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

Configuring Introspection and Tables

Introspection (InferredAssetSqlDataConnector)
Tables (ConfiguredAssetSqlDataConnector)

1. Configure your SimpleSqlalchemyDatasource for introspection

Start with an elementary SimpleSqlalchemyDatasource configuration, containing just a basic introspection component:

datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>

introspection:  # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).
    whole_table: {{}}  # Any alphanumeric key name is acceptable.
"""

Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Notice that the output reports the presence of exactly one InferredAssetSqlDataConnector (called whole_table, as per the configuration).

An integral part of the recommended approach, illustrated as part of this exercise, will be the use of the internal Great Expectations utility

context.test_yaml_config(
    yaml_string, pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)

to ensure the correctness of the proposed YAML configuration prior to incorporating it and trying to use it.

For instance, try the following erroneous SimpleSqlalchemyDatasource configuration (it contains an illegal top-level key):

datasource_yaml = f"""  # buggy datasource_yaml configuration
name: mis_configured_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>

introspecting:  # illegal top-level key name
    whole_table: {{}}
"""

Then specify the connection string for your database, and again run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Notice that the output reports an empty Data Connectors list, signaling a misconfiguration.

Feel free to experiment with the arguments to

context.test_yaml_config(
    yaml_string, pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)

For instance, running

context.test_yaml_config(yaml_string, return_mode="report_object")

will return the information appearing in standard output converted to the Python dictionary format.

Any structural errors (e.g., indentation, typos in class and configuration key names, etc.) will result in an exception raised and sent to standard error. This can be converted to an exception trace by running

context.test_yaml_config(yaml_string, shorten_tracebacks=True)

showing the line numbers, where the exception occurred, most likely caused by the failure of the required class (in this case InferredAssetSqlDataConnector) from being successfully instantiated.

2. Customize the introspection configuration to fit your needs

SimpleSqlalchemyDatasource supports a number of configuration options to assist you with the introspection of your SQL database:

the database views will included in the list of identified Data References (by setting the include_views flag to true)
if any exceptions occur during the introspection operation, then the process will continue (by setting the skip_inapplicable_tables flag to true)
specifying excluded_tables will have the effect of excluding only the tables on this list, while including the rest
specifying included_tables will have the effect of including only the tables on this list, while excluding the rest

The following YAML configurqation example utilizes several of these configuration directives:

datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>

introspection:  # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).
    whole_table:
        include_schema_name: true
        introspection_directives:
            include_views: true
        skip_inapplicable_tables: true  # skip and continue upon encountering introspection errors
        excluded_tables:  # a list of tables to ignore when inferring data asset_names
            - main.yellow_tripdata_sample_2019_03  # format: schema_name.table_name
"""

3. Save the Datasource configuration to your DataContext

Once the SimpleSqlalchemyDatasource configuration is error-free and satisfies your requirements, save it into your DataContext by using the add_datasource() function.

context.add_datasource(**yaml.load(datasource_yaml))

4. Get names of available Data Assets

Getting names of available data assets using an InferredAssetSqlDataConnector affords you the visibility into types and naming structures of tables in your SQL database:

available_data_asset_names = context.datasources[
    "taxi_datasource"
].get_available_data_asset_names(data_connector_names="whole_table")["whole_table"]

1. Configure your SimpleSqlalchemyDatasource to characterize tables

Start with an elementary SimpleSqlalchemyDatasource configuration, containing just a basic tables component:

datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>

tables:  # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
    yellow_tripdata_sample_2019_01:  # Must match table name exactly.
        partitioners:  # Each key in the "partitioners" sub-section the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).
            whole_table: {{}}
"""

Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Notice that the output reports the presence of exactly one ConfiguredAssetSqlDataConnector (called whole_table, as per the configuration) and that Available data_asset_names (1 of 1), the name of the single Data Asset being yellow_tripdata_sample_2019_01.

2. Enhance your SimpleSqlalchemyDatasource with ability to attribute metadata to tables

Add Data Asset Name identification attributes (data_asset_name_prefix and data_asset_name_suffix) and set the include_schema_name flag in your ConfiguredAssetSqlDataConnector (named whole_table) configuration section. These directives will result in the reported properties of your table to contain annotations, customized for your purposes:

datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>

tables:  # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
    yellow_tripdata_sample_2019_01:  # Must match table name exactly.
        partitioners:  # Each key in the "partitioners" sub-section the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).
            whole_table:
                include_schema_name: true
                schema_name: main
                data_asset_name_prefix: taxi__
                data_asset_name_suffix: __asset
"""

Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Finally, once your Data Connector configuration satisfies your requirements, save the enclosing Datasource into your DataContext using

context.add_datasource(**yaml.load(datasource_yaml))

To view the full script used in this page, see it on GitHub:

yaml_example_gradual.py

Prerequisites: This how-to guide assumes you have:

Preliminary Steps​

1. Instantiate your project's DataContext​

2. Obtain DataContext​

Configuring Introspection and Tables​

1. Configure your SimpleSqlalchemyDatasource for introspection​

2. Customize the introspection configuration to fit your needs​

3. Save the Datasource configuration to your DataContext​

4. Get names of available Data Assets​

1. Configure your SimpleSqlalchemyDatasource to characterize tables​

2. Enhance your SimpleSqlalchemyDatasource with ability to attribute metadata to tables​

Preliminary Steps

1. Instantiate your project's DataContext

2. Obtain DataContext

Configuring Introspection and Tables

1. Configure your SimpleSqlalchemyDatasource for introspection

2. Customize the introspection configuration to fit your needs

3. Save the Datasource configuration to your DataContext

4. Get names of available Data Assets

1. Configure your SimpleSqlalchemyDatasource to characterize tables

2. Enhance your SimpleSqlalchemyDatasource with ability to attribute metadata to tables