How to connect to a SQL database
In this guide we will demonstrate how to connect Great Expectations to SQL databases. GX uses SQLAlchemy to connect to SQL data, and therefore supports most SQL dialects that SQLAlchemy does. For more information on the SQL dialects supported by SQLAlchemy, please see SQLAlchemy's official documentation on dialects.
Prerequisites
- An installation of GX set up to work with SQL
- Source data stored in a SQL database
Steps
1. Import GX and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
import great_expectations as gx
context = gx.get_context()
2. Determine your connection string
GX supports a variety of different SQL source data systems. However, most SQL dialects have their own specifications for how to define a connection string. You should reference the corresponding dialect's official documentation to determine the connection string for your SQL Database.
The following are examples of connection strings for different SQL dialects:
- AWS Athena:
awsathena+rest://@athena.<REGION>.amazonaws.com/<DATABASE>?s3_staging_dir=<S3_PATH>
- BigQuery:
bigquery://<GCP_PROJECT>/<BIGQUERY_DATASET>?credentials_path=/path/to/your/credentials.json
- MSSQL:
mssql+pyodbc://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?driver=<DRIVER>&charset=utf&autocommit=true
- MySQL:
mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>
- PostGreSQL:
postgresql+psycopg2://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>
- Redshift:
postgresql+psycopg2://<USER_NAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?sslmode=<SSLMODE>
- Snowflake]:
snowflake://<USER_NAME>:<PASSWORD>@<ACCOUNT_NAME>/<DATABASE_NAME>/<SCHEMA_NAME>?warehouse=<WAREHOUSE_NAME>&role=<ROLE_NAME>&application=great_expectations_oss
- SQLite:
sqlite:///<PATH_TO_DB_FILE>
- Trino:
trino://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<CATALOG>/<SCHEMA>
For purposes of this guide's examples, we will connect to a PostGreSQL database. Here is an example of our connection string, stored in the variable connection_string
with plain text credentials:
connection_string = "postgresql+psycopg2://username:my_password@localhost/test"
You can use either environment variables or a key in config_variables.yml
to safely store any passwords needed by your connection string. After defining your password in one of those ways, you can reference it in your connection string like this:
connection_string = (
"postgresql+psycopg2://<username>:${MY_PASSWORD}@<host>:<port>/<database>"
)
In the above example MY_PASSWORD
would be the name of the environment variable or the key to the value in config_variables.yml
that corresponds to your password.
If you include a password as plain text in your connection string when you define your Datasource, GX will automatically strip it out, add it to config_variables.yml
and substitute it in the Datasource's saved configuration with a variable as was shown above.
3. Create a SQL Datasource
Creating a SQL Datasource is as simple as providing the add_sql(...)
method a name
by which to reference it in the future and the connection_string
with which to access it.
datasource = context.sources.add_sql(
name="my_datasource", connection_string=connection_string
)
Next steps
Now that you have connected to a SQL database, next you will want to: