How to configure a DataConnector for splitting and sampling tables in SQL
This guide will help you configure Splitting
and Sampling
for working with tables in an SQL database using
SimpleSqlalchemyDatasource
, which operates as a proxy to InferredAssetSqlDataConnector
and
ConfiguredAssetSqlDataConnector
.
We will use the tables
section of the SimpleSqlalchemyDatasource
configuration, which exercises the
ConfiguredAssetSqlDataConnector
, to showcase Splitting
and Sampling
(the same Splitting
and Sampling
configuration options can be readily applied to the introspection
section of the SimpleSqlalchemyDatasource
configuration, which exercises the InferredAssetSqlDataConnector
).
The Splitting
and Sampling
mechanisms provided by Great Expectations serve as additional tools for Partitioning
your data at various levels of granularity:
Splitting
provides the means of focusing the BatchA selection of records from a Data Asset. data on the values of certain dimensions of the data of interest.Sampling
provides a means for reducing the amount of data in the retrieved batch to facilitate data analysis.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- Have a working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
- Reviewed How to configure a DataConnector to introspect and partition tables in SQL
This guide will use the tables
section that is part of the following SimpleSqlalchemyDatasource
configuration as an
example:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>
introspection: # Each key in the "introspection" section is an InferredAssetSqlDataConnector
whole_table:
include_schema_name: true
introspection_directives:
include_views: true
skip_inapplicable_tables: true # skip and continue upon encountering introspection errors
excluded_tables: # a list of tables to ignore when inferring data asset_names
- main.yellow_tripdata_sample_2019_03 # format: schema_name.table_name
daily:
introspection_directives:
include_views: true
skip_inapplicable_tables: true # skip and continue upon encountering introspection errors
include_schema_name: true
included_tables: # only include tables in this list when inferring data asset_names
- main.yellow_tripdata_sample_2019_01 # format: schema_name.table_name
splitter_method: _split_on_converted_datetime
splitter_kwargs:
column_name: pickup_datetime
date_format_string: "%Y-%m-%d"
hourly:
introspection_directives:
include_views: true
skip_inapplicable_tables: true
include_schema_name: true
included_tables: # only include tables in this list when inferring data asset_names
- main.yellow_tripdata_sample_2019_01 # format: schema_name.table_name
splitter_method: _split_on_converted_datetime
splitter_kwargs:
column_name: pickup_datetime
date_format_string: "%Y-%m-%d %H"
tables: # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
# data_asset_name is: concatenate(data_asset_name_prefix, schema_name, table_name, data_asset_name_suffix)
yellow_tripdata_sample_2019_01: # Must match table name exactly.
partitioners: # Each key in the "partitioners" sub-section the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).
whole_table:
include_schema_name: true
schema_name: main
data_asset_name_prefix: taxi__
data_asset_name_suffix: __asset
by_num_riders:
include_schema_name: true
schema_name: main
data_asset_name_prefix: taxi__
data_asset_name_suffix: __asset
splitter_method: _split_on_column_value
splitter_kwargs:
column_name: passenger_count
by_num_riders_random_sample:
include_schema_name: true
schema_name: main
data_asset_name_prefix: taxi__
data_asset_name_suffix: __asset
splitter_method: _split_on_column_value
splitter_kwargs:
column_name: passenger_count
sampling_method: _sample_using_random
sampling_kwargs:
p: 1.0e-1
"""
Preliminary Steps
1. Instantiate your project's DataContext
Import these necessary packages and modules.
import great_expectations as ge
Load your DataContext
into memory using the get_context()
method.
context = ge.get_context()
2. Configure your Datasource
Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Feel free to adjust your configuration and re-run test_yaml_config()
as needed.
3. Save the Datasource configuration to your DataContext
Save the configuration into your DataContext
by using the add_datasource()
function.
context.add_datasource(**yaml.load(datasource_yaml))
Splitting and Sampling
To configure Splitting
, specify a dimension (i.e., column_name
or column_names
), the method of Splitting
, and
parameters to be used by the specified Splitting
method. In the present example, the Data ConnectorsProvides the configuration details based on the source data system which are needed by a Datasource to define Data Assets. named
by_num_riders
and by_num_riders_random_sample
split the table yellow_tripdata_sample_2019_01
on the column name
passenger_count
using the _split_on_column_value
method of Splitting
.
To configure Sampling
, specify the method of Sampling
, and parameters to be used by the specified Sampling
method.
In the present example, the Data Connector
named by_num_riders_random_sample
samples the table
yellow_tripdata_sample_2019_01
using the _sample_using_random
method of Sampling
, configured to return 10%
of
the rows sampled at random, which is specified by the parameter p
(stands for "proportion") set to the value 0.1
.
Finally, confirm the expected number of batches was retrieved and the reduced size of a batch (due to Sampling
):
datasource_name="taxi_datasource",
data_connector_name="by_num_riders_random_sample",
data_asset_name="<YOUR_DATA_ASSET_NAME>",
)
(set data_asset_name
to "taxi__yellow_tripdata_sample_2019_01__asset"
for the present example).
assert len(batch_list) == 6 # ride occupancy ranges from 1 passenger to 6 passengers
batch_data = batch_list[1].data # 2-passenger sample of batch data
num_rows = batch_data.execution_engine.engine.execute(
sa.select([sa.func.count()]).select_from(batch_data.selectable)
).scalar()
assert num_rows < 200
Additional Notes
Available Splitting
methods and their configuration parameters:
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| **Method** | **Parameters** | **Returned Batch Data** |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| _split_on_whole_table | N/A | identical to original |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| _split_on_column_value | table_name='table', column_name='col' | rows where value of column_name are same |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| _split_on_converted_datetime | table_name='table', column_name='col', date_format_string=<'%Y-%m-%d'> | rows where value of column_name converted to datetime using the given date_format_string are same |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| _split_on_divided_integer | table_name='table', column_name='col', divisor=<int> | rows where value of column_name divided (using integral division) by the given divisor are same |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| _split_on_mod_integer | table_name='table', column_name='col', mod=<int> | rows where value of column_name divided (using modular division) by the given mod are same |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| _split_on_multi_column_values | table_name='table', column_names='<list[col]>' | rows where values of column_names are same |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+
| _split_on_hashed_column | column_name='col', | rows where value of column_name hashed (using "md5" hash function) are same (experimental) |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
Available Sampling
methods and their configuration parameters:
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| **Method** | **Parameters** | **Returned Batch Data** |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| _sample_using_limit | n=num_rows | first up to to n (specific limit parameter) rows of batch |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| _sample_using_random | p=fraction | rows selected at random, whose number amounts to selected fraction of total number of rows in batch |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| _sample_using_mod | column_name='col', mod=<int> | take the mod of named column, and only keep rows that match the given value |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| _sample_using_a_list | column_name='col', value_list=<list[val]> | match the values in the named column against value_list, and only keep the matches |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| _sample_using_hash | column_name='col', hash_digits=<int>, hash_value=<str> | hash the values in the named column (using "md5" hash function), and only keep rows that match the given hash_value |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
To view the full script used in this page, see it on GitHub: