Version: 0.14.13

Initialize a Data Context

Prerequisites
  • You need a Python environment where you can install Great Expectations and other dependencies, e.g. a virtual environment.
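If you do not yet have an isolated environment, the following is a minimal sketch of creating one with Python's standard-library venv module. The helper name create_env and the directory name .venv are illustrative assumptions, not part of Great Expectations; tools such as conda work equally well.

```python
import venv
from pathlib import Path


def create_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment at env_dir and return its path.

    create_env is a hypothetical helper for this tutorial; it only wraps
    the standard-library venv.create() call.
    """
    path = Path(env_dir)
    # with_pip=True bootstraps pip into the new environment so that
    # packages can be installed into it afterwards.
    venv.create(path, with_pip=with_pip)
    return path
```

Calling create_env(".venv") creates the environment in the current directory. You still need to activate it in your shell before running the pip command in the installation step.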

Set up your machine for the tutorial

For this tutorial, we will use a simplified version of the NYC taxi ride data.

Clone the ge_tutorials repository to download the data and directories with the final versions of the tutorial, which you can use for reference:

git clone https://github.com/superconductive/ge_tutorials
cd ge_tutorials

The repository you cloned contains several directories with final versions for our tutorials. The final version for this tutorial is located in the getting_started_tutorial_final_v3_api/ folder. You can use the final version as a reference or to explore a complete deployment of Great Expectations, but you do not need it for this tutorial.

Install Great Expectations and dependencies

Great Expectations requires Python 3 and can be installed using pip. If you haven’t already, install Great Expectations by running:

pip install great_expectations

You can confirm that the installation worked by running:

great_expectations --version

This should return something like:

great_expectations, version 0.13.43
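The same check can be made from Python, which is handy inside a notebook or script. This is a sketch that assumes Python 3.8 or later (for importlib.metadata); the helper name installed_version is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "great_expectations") -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        # Reads the version from the package's installed metadata,
        # without importing the package itself.
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

If installed_version() returns None, the package is not installed in the active environment, and you should rerun the pip command above in that environment.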

For detailed installation instructions, see How to install Great Expectations locally.

Other deployment patterns

This tutorial deploys Great Expectations locally. Note that other options (e.g. running Great Expectations on an EMR Cluster) are also available. You can find more information in the Reference Architectures section of the documentation.

Create a Data Context

In Great Expectations, your Data Context manages your project configuration, so let’s create a Data Context for our tutorial project!

When you installed Great Expectations, you also installed the Great Expectations command line interface (CLI). It provides helpful utilities for deploying and configuring Data Contexts, plus a few other convenience methods.

To initialize your Great Expectations deployment for the project, run this command in the terminal from the ge_tutorials/ directory:

great_expectations init

You should see this:

Using v3 (Batch Request) API

  ___              _     ___                  _        _   _
 / __|_ _ ___ __ _| |_  | __|_ ___ __  ___ __| |_ __ _| |_(_)___ _ _  ___
| (_ | '_/ -_) _` |  _| | _|\ \ / '_ \/ -_) _|  _/ _` |  _| / _ \ ' \(_-<
 \___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
                                |_|
             ~ Always know what to expect from your data ~

Let's create a new Data Context to hold your project configuration.

Great Expectations will create a new directory with the following structure:

great_expectations
|-- great_expectations.yml
|-- expectations
|-- checkpoints
|-- plugins
|-- .gitignore
|-- uncommitted
    |-- config_variables.yml
    |-- data_docs
    |-- validations

OK to proceed? [Y/n]: <press Enter>

About the great_expectations/ directory structure

After running the init command, your great_expectations/ directory will contain all of the important components of a local Great Expectations deployment. Here is what each component contains:

  • great_expectations.yml contains the main configuration of your deployment.
  • The expectations/ directory stores all your Expectations as JSON files. If you want to store them somewhere else, you can change that later.
  • The plugins/ directory holds code for any custom plugins you develop as part of your deployment.
  • The uncommitted/ directory contains files that shouldn’t live in version control. It has a .gitignore configured to exclude all its contents from version control. The main contents of the directory are:
    • uncommitted/config_variables.yml, which holds sensitive information, such as database credentials and other secrets.
    • uncommitted/data_docs, which contains Data Docs generated from Expectations, Validation Results, and other metadata.
    • uncommitted/validations, which holds Validation Results generated by Great Expectations.
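To confirm that init produced the layout described above, you can compare the directory against the listing printed by the CLI. The sketch below uses only the standard library; the helper name missing_entries and the EXPECTED list are assumptions taken from that listing, not a Great Expectations API.

```python
from pathlib import Path
from typing import List

# Entries taken from the directory listing printed by the init command.
EXPECTED = [
    "great_expectations.yml",
    "expectations",
    "checkpoints",
    "plugins",
    ".gitignore",
    "uncommitted",
    "uncommitted/config_variables.yml",
    "uncommitted/data_docs",
    "uncommitted/validations",
]


def missing_entries(root: str = "great_expectations") -> List[str]:
    """Return the expected files and directories that are absent under root."""
    base = Path(root)
    return [entry for entry in EXPECTED if not (base / entry).exists()]
```

Run from the ge_tutorials/ directory, missing_entries() returns an empty list when every listed entry is present. Any names it does return point to parts of the layout that were not created or have since been moved.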