Version: 0.16.16

How to connect to in-memory data using Pandas

In this guide we will demonstrate how to connect to an in-memory Pandas DataFrame. Pandas can read many types of data into its DataFrame class; in our example we will use data that originates in a Parquet file.

Prerequisites

Steps

1. Import the Great Expectations module and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

import great_expectations as gx

context = gx.get_context()

2. Create a Datasource

To access our in-memory data, we will create a Pandas Datasource:

datasource = context.sources.add_pandas(name="my_pandas_datasource")

3. Read your source data into a Pandas DataFrame

For this example, we will read a Parquet file into a Pandas DataFrame and use that DataFrame throughout the rest of this guide.

The code to create the Pandas DataFrame is:

import pandas as pd

dataframe = pd.read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-11.parquet"
)
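Because Pandas can read many types of data, the same kind of DataFrame can also be built without downloading anything. The following is a minimal sketch, not part of the original guide, that reads a small CSV string instead of the Parquet file. The column names and values are hypothetical sample data, and `sample_dataframe` is an illustrative name.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real source file.
csv_text = """vendor_id,passenger_count,fare_amount
1,2,12.5
2,1,7.0
1,3,20.25
"""

# pd.read_csv accepts any file-like object, so no file on disk is needed.
sample_dataframe = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks before handing the DataFrame to a Data Asset.
print(sample_dataframe.shape)
print(list(sample_dataframe.columns))
```

A DataFrame built this way can stand in for `dataframe` in the steps that follow, which may be convenient when experimenting offline.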

4. Add a Data Asset to the Datasource

A Pandas DataFrame Data Asset can be defined with two elements:

  • name: The name by which the Data Asset will be referenced in the future
  • dataframe: A Pandas DataFrame containing the data

We will use the dataframe from the previous step as the corresponding parameter's value. For the name parameter, we will define a name in advance by storing it in a Python variable:

name = "taxi_dataframe"

Now that we have the name for our Data Asset, we can create the Data Asset with the code:

data_asset = datasource.add_dataframe_asset(name=name)

For DataFrame Data Assets, the DataFrame itself is always passed as the argument of exactly one API method, build_batch_request:

my_batch_request = data_asset.build_batch_request(dataframe=dataframe)
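The steps in this guide can be gathered into a single helper. The following is a sketch that uses only the calls shown above. The function name `build_dataframe_batch_request` and its parameter names are hypothetical and not part of the Great Expectations API, and it assumes a Great Expectations version in which these calls behave as this guide describes.

```python
def build_dataframe_batch_request(dataframe, datasource_name, asset_name):
    """Connect an in-memory DataFrame and return a Batch Request for it.

    Hypothetical helper: it only chains the calls from the steps above.
    """
    # Imported here so the helper can be defined before Great Expectations is needed.
    import great_expectations as gx

    # Step 1: instantiate a Data Context.
    context = gx.get_context()

    # Step 2: create a Pandas Datasource.
    datasource = context.sources.add_pandas(name=datasource_name)

    # Step 4: add a DataFrame Data Asset, identified by name only.
    data_asset = datasource.add_dataframe_asset(name=asset_name)

    # The DataFrame itself is supplied only when the Batch Request is built.
    return data_asset.build_batch_request(dataframe=dataframe)
```

With the values used in this guide, it could be called as `build_dataframe_batch_request(dataframe, "my_pandas_datasource", "taxi_dataframe")`.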

Next steps

Now that you have connected to your data, you may want to look into:

Additional information

External APIs

For more information on Pandas read methods, please reference the official Pandas Input/Output documentation.