Version: 0.14.13

How to create a new Expectation Suite using Rule Based Profilers

In this tutorial, you will develop hands-on experience with configuring a Rule-Based Profiler to create an Expectation Suite. You will Profile several Batches of NYC yellow taxi trip data to come up with reasonable estimates for the ranges of Expectations for several numeric columns.

Warning: The Rule-Based Profiler is under active development and is considered an experimental feature. This document reflects the current state of the feature, but the details are subject to change.

Prerequisites: This how-to guide assumes you have:

Completed the Getting Started Tutorial
A working installation of Great Expectations
A basic understanding of Metrics in Great Expectations
A basic understanding of Expectation Configurations in Great Expectations
Read the overview of Profilers, and the section on Rule-Based Profilers in particular

Steps

1. Create a new Great Expectations project

Create a new directory, called taxi_profiling_tutorial
Within this directory, create another directory called data
Navigate to the top level of taxi_profiling_tutorial in a terminal and run great_expectations init
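
If you prefer to script the directory setup, the following is a minimal sketch using only the Python standard library. The helper name create_project_layout is hypothetical, and the sketch only creates the two directories; you still need to run great_expectations init from a terminal afterwards.

```python
import tempfile
from pathlib import Path


def create_project_layout(parent: Path) -> Path:
    """Create taxi_profiling_tutorial/data under `parent` and return the data directory."""
    data_dir = parent / "taxi_profiling_tutorial" / "data"
    # parents=True also creates taxi_profiling_tutorial itself;
    # exist_ok=True makes the call safe to repeat.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


# Demonstrated in a throwaway location so nothing is written to your working directory.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project_layout(Path(tmp))
    print(created.relative_to(tmp))
```

To create the real project, pass Path.cwd() (or any directory you like) instead of the temporary directory.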

2. Download the data

Download this directory of yellow taxi trip CSV files from the Great Expectations GitHub repo. You can use a tool like DownGit to do so.
Move the unzipped directory of CSV files into the data directory that you created in Step 1.

3. Set up your Datasource

Follow the steps in the guide How to connect to data on a filesystem using Pandas. For this tutorial, we will set up the Datasource config from YAML. When you open your notebook to create, test, and save your Datasource config, replace the config docstring with the following docstring:

example_yaml = f"""
name: taxi_pandas
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  monthly:
    base_directory: ../<YOUR BASE DIR>/
    glob_directive: '*.csv'
    class_name: ConfiguredAssetFilesystemDataConnector
    assets:
      my_reports:
        base_directory: ./
        group_names:
          - name
          - year
          - month
        class_name: Asset
        pattern: (.+)_(\d.*)-(\d.*)\.csv
"""

Test your YAML config to make sure it works - you should see some of the taxi csv filenames listed
Save your Datasource config
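
To see how the pattern and group_names in the config above relate to each other, here is a small sketch that applies the same regular expression with Python's re module. The filename used is an assumption made for illustration (check the names of the files you downloaded), and the helper parse_filename is hypothetical; the data connector's own matching logic may differ in detail.

```python
import re

# Pattern and group names copied from the Datasource config above.
PATTERN = r"(.+)_(\d.*)-(\d.*)\.csv"
GROUP_NAMES = ["name", "year", "month"]


def parse_filename(filename: str) -> dict:
    """Map each regex capture group to its group name; return {} if there is no match."""
    match = re.match(PATTERN, filename)
    if match is None:
        return {}
    return dict(zip(GROUP_NAMES, match.groups()))


# Hypothetical filename, used only to illustrate the mapping.
example = parse_filename("yellow_tripdata_sample_2019-03.csv")
print(example)
```

Each capture group becomes one entry in the Batch's identifiers, which is what lets you later select Batches by position, such as the most recent month.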

4. Configure the Profiler

Now, we'll create a new script in the same top-level taxi_profiling_tutorial directory called profiler_script.py. If you prefer, you could open up a Jupyter Notebook and run this there instead.
At the top of this file, we will create a new YAML docstring assigned to a variable called profiler_config. This will look similar to the YAML docstring we used above when creating our Datasource. Over the next several steps, we will slowly add lines to this docstring by typing or pasting in the lines below:

profiler_config = """

"""

First, we'll add some relevant top level keys (name and config_version) to label our Profiler and associate it with a specific version of the feature.

name: My Profiler
config_version: 1.0

Config Versioning

Note that at the time of writing this document, 1.0 is the only supported config version.

Then, we'll add a variables key with some variables that we'll use later. Below it, we'll add a top-level rules key, followed by the name of our first rule:

variables:
  false_positive_rate: 0.01
  mostly: 1.0

rules:
  row_count_rule:

After that, we'll add our Domain Builder. In this case, we'll use a TableDomainBuilder, which will indicate that any expectations we build for this Domain will be at the Table level. Each Rule in our Profiler config can only use one Domain Builder.

domain_builder:
        class_name: TableDomainBuilder

Next, we'll use a NumericMetricRangeMultiBatchParameterBuilder to estimate the min_value and max_value of our expect_table_row_count_to_be_between Expectation. This Parameter Builder takes a Batch Request that selects the five Batches prior to our most recent Batch, and uses the row count of each of those months to compute a probable range of row counts for our ExpectationConfiguration.

parameter_builders:
      - name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 0
        truncate_values:
          lower_bound: 0

A Rule can have multiple ParameterBuilders if needed, but in our case, we'll only use the one for now.
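
To build some intuition for how false_positive_rate, round_decimals, and truncate_values could shape the resulting range, here is a simplified sketch. It is not the library's implementation: it assumes a plain quantile-based estimate over the per-Batch metric values, the function names are hypothetical, and the row counts are made up. The estimator that NumericMetricRangeMultiBatchParameterBuilder actually uses may differ.

```python
import math
from typing import List, Optional, Tuple


def quantile(sorted_values: List[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted, non-empty list."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def estimate_range(
    values: List[float],
    false_positive_rate: float,
    round_decimals: int,
    lower_bound: Optional[float] = None,
) -> Tuple[float, float]:
    """Illustrative range estimate: split the false positive rate across both tails."""
    ordered = sorted(values)
    low = quantile(ordered, false_positive_rate / 2)
    high = quantile(ordered, 1 - false_positive_rate / 2)
    if lower_bound is not None:
        # Mirrors the idea of truncate_values.lower_bound: never go below this value.
        low = max(low, lower_bound)
    return round(low, round_decimals), round(high, round_decimals)


# Made-up row counts for five monthly Batches.
row_counts = [9800, 10100, 10000, 9900, 10200]
row_count_range = estimate_range(row_counts, false_positive_rate=0.01, round_decimals=0, lower_bound=0)
print(row_count_range)
```

A smaller false_positive_rate pushes the two quantiles further into the tails, which widens the range and makes the resulting Expectation less likely to fail on ordinary data.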

Finally, we'll use an ExpectationConfigurationBuilder to build our expect_table_row_count_to_be_between Expectation, where the Domain is the Domain returned by our TableDomainBuilder (the entire table), and the min_value and max_value are Parameters returned by our NumericMetricRangeMultiBatchParameterBuilder.

expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value[0]
        max_value: $parameter.row_count_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details

Note the special $ syntax used to reference variables and parameters that were defined earlier in the config. A more thorough description of this syntax is in the docstring for ParameterContainer.
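
As a rough mental model of the $ references, the sketch below resolves strings such as $parameter.row_count_range.value[0] against a nested dictionary. This is an illustration only: resolve_reference and the namespace layout are hypothetical, the values are made up, and the real resolution logic lives in the library's ParameterContainer.

```python
import re
from typing import Any


def resolve_reference(reference: str, namespace: dict) -> Any:
    """Resolve a '$a.b.c[0]' style reference against a nested dict (illustrative only)."""
    if not reference.startswith("$"):
        return reference  # plain literals are returned unchanged
    value: Any = namespace
    for part in reference[1:].split("."):
        # Each dotted part is a key, optionally followed by list indexes such as [0].
        match = re.fullmatch(r"(\w+)((?:\[\d+\])*)", part)
        if match is None:
            raise ValueError(f"Cannot parse reference part: {part!r}")
        key, indexes = match.groups()
        value = value[key]
        for index in re.findall(r"\[(\d+)\]", indexes):
            value = value[int(index)]
    return value


# Made-up values standing in for what a Profiler run might hold.
namespace = {
    "variables": {"false_positive_rate": 0.01, "mostly": 1.0},
    "parameter": {"row_count_range": {"value": [9802, 10198], "details": {}}},
}

print(resolve_reference("$parameter.row_count_range.value[0]", namespace))
print(resolve_reference("$variables.mostly", namespace))
```

Read this way, $variables points at the top-level variables key, while $parameter points at the output of a ParameterBuilder, looked up by the name you gave it.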

When we put it all together, here is what our config with our single row_count_rule looks like:

name: My Profiler
config_version: 1.0

variables:
  false_positive_rate: 0.01
  mostly: 1.0

rules:
  row_count_rule:
    domain_builder:
        class_name: TableDomainBuilder
    parameter_builders:
      - name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 0
        truncate_values:
          lower_bound: 0
    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value[0]
        max_value: $parameter.row_count_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details

5. Run the Profiler

Now let's use our config to Profile our data and create a simple Expectation Suite!

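First, we'll do some basic setup: import what we need, create a Data Context, and parse our YAML. The import paths below are assumed for this release; adjust them if your installed version differs.

from ruamel.yaml import YAML
from great_expectations import DataContext
from great_expectations.rule_based_profiler.rule_based_profiler import RuleBasedProfiler

data_context = DataContext()
# Parse the YAML config into a dict
full_profiler_config_dict: dict = YAML(typ="safe").load(profiler_config)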

Next, we'll instantiate our Profiler, passing in our config and our Data Context

rule_based_profiler: RuleBasedProfiler = RuleBasedProfiler(
    name=full_profiler_config_dict["name"],
    config_version=full_profiler_config_dict["config_version"],
    rules=full_profiler_config_dict["rules"],
    variables=full_profiler_config_dict["variables"],
    data_context=data_context,
)

Finally, we'll call run() on the Profiler and save the resulting Expectation Suite to a variable.

suite = rule_based_profiler.run(expectation_suite_name="test_suite_name")

Then, we can print our Expectation Suite so we can see how it looks!

{
        "meta": {"great_expectations_version": "0.13.19+58.gf8a650720.dirty"},
        "data_asset_type": None,
        "expectations": [
            {
                "kwargs": {"min_value": 10000, "max_value": 10000, "mostly": 1.0},
                "expectation_type": "expect_table_row_count_to_be_between",
                "meta": {
                    "profiler_details": {
                        "metric_configuration": {
                            "metric_name": "table.row_count",
                            "metric_domain_kwargs": {},
                        }
                    }
                },
            }
        ],
        "expectation_suite_name": "tmp_suite_Profiler_e66f7cbb",
    }
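
Once the suite is available as a plain dictionary of the shape printed above, a few lines of Python are enough to summarize it. The sketch below assumes that dictionary shape and copies the values from the output above; summarize is a hypothetical helper, not part of Great Expectations.

```python
# Dictionary trimmed from the printed suite above.
suite_dict = {
    "expectations": [
        {
            "kwargs": {"min_value": 10000, "max_value": 10000, "mostly": 1.0},
            "expectation_type": "expect_table_row_count_to_be_between",
            "meta": {
                "profiler_details": {
                    "metric_configuration": {
                        "metric_name": "table.row_count",
                        "metric_domain_kwargs": {},
                    }
                }
            },
        }
    ],
    "expectation_suite_name": "tmp_suite_Profiler_e66f7cbb",
}


def summarize(suite: dict) -> list:
    """Return one human-readable line per Expectation in the suite."""
    lines = []
    for expectation in suite["expectations"]:
        kwargs = expectation["kwargs"]
        # Table-level Expectations have no "column" kwarg.
        target = kwargs.get("column", "<table>")
        lines.append(
            f'{expectation["expectation_type"]} [{target}]: '
            f'{kwargs["min_value"]} to {kwargs["max_value"]}'
        )
    return lines


for line in summarize(suite_dict):
    print(line)
```

The same helper keeps working after column-level rules are added, because it falls back to a table label only when no column is present.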

6. Add a Rule for Columns

Let's add one more rule to our Rule-Based Profiler config. This Rule will use the DomainBuilder to populate a list of all of the numeric columns in one Batch of taxi data (in this case, the most recent Batch). It will then use our NumericMetricRangeMultiBatchParameterBuilder looking at the five Batches prior to our most recent Batch to get probable ranges for the min and max values for each of those columns. Finally, it will use those ranges to add two ExpectationConfigurations for each of those columns: expect_column_min_to_be_between and expect_column_max_to_be_between. This rule will go directly below our previous rule.

As before, we will first add the name of our rule, and then specify the DomainBuilder.

column_ranges_rule:
    domain_builder:
      class_name: ColumnDomainBuilder
      include_semantic_types:
        - numeric
      # BatchRequest yielding exactly one batch (March, 2019 trip data)
      batch_request:
        datasource_name: taxi_pandas
        data_connector_name: monthly
        data_asset_name: my_reports
        data_connector_query:
          index: -1

In this case, our DomainBuilder configuration is a bit more complex. First, we are using a ColumnDomainBuilder with include_semantic_types. This takes a table and returns a list of all columns that match the semantic types specified, which is numeric in our case.
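
Conceptually, the column selection performed by the DomainBuilder in this rule resembles the filter sketched below. This is a deliberately naive illustration: numeric_columns is a hypothetical helper, the column names and rows are made up, and the library infers semantic types with its own logic rather than with isinstance checks.

```python
from numbers import Number
from typing import Any, Dict, List


def numeric_columns(rows: List[Dict[str, Any]]) -> List[str]:
    """Return the names of columns whose values are all numeric (booleans excluded)."""
    if not rows:
        return []
    return [
        column
        for column in rows[0]
        if all(
            isinstance(row[column], Number) and not isinstance(row[column], bool)
            for row in rows
        )
    ]


# Made-up rows standing in for one Batch of trip data.
sample_rows = [
    {"passenger_count": 1, "trip_distance": 2.5, "store_and_fwd_flag": "N"},
    {"passenger_count": 3, "trip_distance": 0.8, "store_and_fwd_flag": "Y"},
]

print(numeric_columns(sample_rows))
```

Each selected column becomes its own Domain, so the ParameterBuilders and ExpectationConfigurationBuilders in the rule run once per numeric column.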

Then, we need to specify a Batch Request that returns exactly one Batch of data (this is our data_connector_query with index equal to -1). This tells us which Batch to use to get the columns from which we will select our numeric columns. Though we might hope that all our Batches of data have the same columns, in actuality, there might be differences between the Batches, and so we explicitly specify the Batch we want to use here.
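
The two index values used in this tutorial, -1 and "-6:-1", can be pictured with ordinary Python list indexing and slicing. The sketch below assumes the Batches are ordered from oldest to newest; only the most recent month (March 2019) comes from this tutorial, the other month labels are made up, and parse_index is a hypothetical helper rather than the data connector's own parsing code.

```python
from typing import List, Union


def parse_index(index: Union[int, str]) -> Union[int, slice]:
    """Turn an index such as -1 or "-6:-1" into an int or a slice object."""
    if isinstance(index, int):
        return index
    # Empty parts (as in ":3") become None, matching normal slice semantics.
    parts = [int(part) if part else None for part in index.split(":")]
    return slice(*parts)


# Month labels ordered oldest to newest; only the last one is taken from the tutorial.
batches: List[str] = [
    "2018-08", "2018-09", "2018-10", "2018-11",
    "2018-12", "2019-01", "2019-02", "2019-03",
]

most_recent = batches[parse_index(-1)]
five_prior = batches[parse_index("-6:-1")]

print(most_recent)
print(five_prior)
```

Because the slice stops before the last element, the five Batches used to estimate ranges never include the Batch whose columns define the Domain.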

After this, we specify our ParameterBuilders. This is very similar to the specification in our previous rule, except we will be specifying two NumericMetricRangeMultiBatchParameterBuilders to get a probable range for the min_value and max_value of each of our numeric columns. Thus one ParameterBuilder will take the column.min metric_name, and the other will take the column.max metric_name.

parameter_builders:
      - name: min_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.min
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
      - name: max_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2

Finally, we'll put together our Domains and Parameters in our ExpectationConfigurationBuilders:

expectation_configuration_builders:
      - expectation_type: expect_column_min_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.min_range.value[0]
        max_value: $parameter.min_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.min_range.details
      - expectation_type: expect_column_max_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.max_range.value[0]
        max_value: $parameter.max_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.max_range.details

Putting together our entire config, with both of our Rules, we get:

name: My Profiler
config_version: 1.0

variables:
  false_positive_rate: 0.01
  mostly: 1.0

rules:
  row_count_rule:
    domain_builder:
        class_name: TableDomainBuilder
    parameter_builders:
      - name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 0
        truncate_values:
          lower_bound: 0
    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value[0]
        max_value: $parameter.row_count_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details
  column_ranges_rule:
    domain_builder:
      class_name: ColumnDomainBuilder
      include_semantic_types:
        - numeric
      # BatchRequest yielding exactly one batch (March, 2019 trip data)
      batch_request:
        datasource_name: taxi_pandas
        data_connector_name: monthly
        data_asset_name: my_reports
        data_connector_query:
          index: -1
    parameter_builders:
      - name: min_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.min
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
      - name: max_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
    expectation_configuration_builders:
      - expectation_type: expect_column_min_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.min_range.value[0]
        max_value: $parameter.min_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.min_range.details
      - expectation_type: expect_column_max_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.max_range.value[0]
        max_value: $parameter.max_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.max_range.details

And if we re-instantiate our Profiler with our config which now has two rules, and then we re-run the Profiler, we'll have an updated Expectation Suite with a table row count Expectation for our table, and column min and column max Expectations for each of our numeric columns!

🚀Congratulations! You have successfully Profiled multi-batch data using a Rule-Based Profiler. Now you can try adding some new Rules, or running your Profiler on some other data (remember to change the BatchRequest in your config)!🚀

Additional Notes

To view the full script used in this page, see it on GitHub:

multi_batch_rule_based_profiler_example.py
