Version: 0.16.9

How to create and edit Expectations with the User Configurable Profiler

This guide will help you create a new Expectation Suite (a collection of verifiable assertions about data) by profiling your data with the User Configurable Profiler, which generates Metrics and candidate Expectations from data.

Prerequisites: This how-to guide assumes you have a working installation of Great Expectations.

Note: The User Configurable Profiler makes it easier to produce a new Expectation Suite by building out a set of Expectations (verifiable assertions about data) for your data.

These Expectations are deliberately over-fitted to your data. For example, if your table has 10,000 rows, the Profiler will produce an Expectation with the following configuration:

{
    "expectation_type": "expect_table_row_count_to_be_between",
    "kwargs": {
        "min_value": 10000,
        "max_value": 10000
    },
    "meta": {}
}

The intention is for this Expectation Suite to be edited and updated to better suit your specific use case; it is not intended to be used as-is.
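
For example (a minimal sketch, assuming you have already built the suite in step 5 below), you could loosen the over-fitted row count bounds before saving the suite; the replacement bounds here are purely illustrative:

# Assuming `suite` is the ExpectationSuite returned by profiler.build_suite()
# in step 5 below. The 8000/12000 bounds are illustrative, not recommendations.
for expectation in suite.expectations:
    if expectation.expectation_type == "expect_table_row_count_to_be_between":
        expectation.kwargs["min_value"] = 8000
        expectation.kwargs["max_value"] = 12000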

Steps

1. Load or create your Data Context

In this guide we will use an on-disk Data Context with a pandas Datasource (which provides a standard API for accessing and interacting with data from a wide variety of source systems) and a CSV Data Asset. If you don't already have one, you can create one:

import great_expectations as gx

context = gx.data_context.FileDataContext.create(full_path_to_project_directory)

# data_directory is the full path to a directory containing csv files
datasource = context.sources.add_pandas_filesystem(
    name="my_pandas_datasource", base_directory=data_directory
)

# The batching_regex should match files in the data_directory
asset = datasource.add_csv_asset(
    name="csv_asset",
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv",
    order_by=["year", "month"],
)

If a Datasource and Data Asset already exist, you can load an on-disk Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components) via:

import great_expectations as gx
import pathlib

context = gx.get_context(
    context_root_dir=(
        pathlib.Path(full_path_to_project_directory) / "great_expectations"
    )
)
asset = context.datasources["my_pandas_datasource"].get_asset("csv_asset")

2. Set your expectation_suite_name and create your Batch Request

The Batch Request (provided to a Datasource in order to create a Batch) specifies which Batch of data you would like to Profile in order to create your Expectation Suite. We will pass it into a Validator, which is used to run an Expectation Suite against data, in the next step.

expectation_suite_name = "insert_the_name_of_your_suite_here"
expectation_suite = context.add_expectation_suite(
    expectation_suite_name=expectation_suite_name
)
batch_request = asset.build_batch_request({"year": "2019", "month": "02"})

3. Instantiate your Validator

We use a Validator to access and interact with your data. We will be passing the Validator to our Profiler in the next step.

validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name=expectation_suite_name
)

After you get your Validator, you can call validator.head() to confirm that it contains the data that you expect.
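
For instance, a quick sanity check could look like this:

# Print the first few rows of the active Batch to confirm the Validator
# is pointed at the data you expect.
print(validator.head())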

4. Instantiate a UserConfigurableProfiler

Next, we instantiate a UserConfigurableProfiler, passing in the Validator with our data:

from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

profiler = UserConfigurableProfiler(profile_dataset=validator)

5. Use the profiler to build a suite

Once we have our Profiler set up with our Batch, we call profiler.build_suite(). This prints a list of the Expectations created, organized by column, and returns the Expectation Suite object.

suite = profiler.build_suite()

6. (Optional) Running validation, saving your suite, and building Data Docs

If you'd like, you can Validate your data with the new Expectation Suite, save the Expectation Suite, and build Data Docs (human-readable documentation generated from Great Expectations metadata, detailing Expectations, Validation Results, etc.) to take a closer look at the output:

from great_expectations.checkpoint.checkpoint import SimpleCheckpoint

# Review and save our Expectation Suite
print(validator.get_expectation_suite(discard_failed_expectations=False))
validator.save_expectation_suite(discard_failed_expectations=False)

# Set up and run a Simple Checkpoint for ad hoc validation of our data
checkpoint_config = {
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
}
checkpoint = SimpleCheckpoint(
    f"{validator.active_batch_definition.data_asset_name}_{expectation_suite_name}",
    context,
    **checkpoint_config,
)
checkpoint_result = checkpoint.run()

# Build Data Docs
context.build_data_docs()

# Get the only validation_result_identifier from our SimpleCheckpoint run, and open Data Docs to that page
validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)

And you're all set!

Optional Parameters

The UserConfigurableProfiler can take a few different parameters to further hone the results. These parameters are:

  • excluded_expectations: List[str] - Specifies Expectation types that you want to exclude from the Expectation Suite
  • ignored_columns: List[str] - Columns for which you do not want to build Expectations (e.g. metadata columns which might not be the same between tables)
  • not_null_only: Bool - By default, each column is evaluated for nullity. If a column contains fewer than 50% null values, the Profiler will add expect_column_values_to_not_be_null; if more than 50%, it will add expect_column_values_to_be_null. If not_null_only is set to True, the Profiler will add a not_null Expectation regardless of the percentage of null values (and therefore will not add expect_column_values_to_be_null)
  • primary_or_compound_key: List[str] - Specifies one or more columns, in list form, as a primary or compound key, and will add expect_column_values_to_be_unique or expect_compound_columns_to_be_unique
  • table_expectations_only: Bool - If True, creates only table-level Expectations (i.e. ignores all columns). Table-level Expectations include expect_table_row_count_to_equal and expect_table_columns_to_match_ordered_list
  • value_set_threshold: str - One of the following ordered values: "none", "one", "two", "very_few", "few", "many", "very_many", "unique". When the Profiler runs, each column is profiled for cardinality. This threshold determines the greatest cardinality for which the Profiler will add expect_column_values_to_be_in_set. For example, if value_set_threshold is set to "unique", it will add a value_set Expectation for every included column; if set to "few", it will add a value_set Expectation only for columns whose cardinality is "one", "two", "very_few", or "few". The default value is "many". For the purposes of comparing whether two tables are identical, it might make the most sense to set this to "unique".
  • semantic_types_dict: Dict[str, List[str]] - Described in more detail below

If you would like to make use of these parameters, you can specify them while instantiating your Profiler.

excluded_expectations = ["expect_column_quantile_values_to_be_between"]
ignored_columns = [
    "rate_code_id",
    "pickup_location_id",
    "payment_type",
    "pickup_datetime",
]
not_null_only = True
table_expectations_only = False
value_set_threshold = "unique"

validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name=expectation_suite_name
)

profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=excluded_expectations,
    ignored_columns=ignored_columns,
    not_null_only=not_null_only,
    table_expectations_only=table_expectations_only,
    value_set_threshold=value_set_threshold,
)

suite = profiler.build_suite()

Once you have instantiated a Profiler with specific parameters, you must re-instantiate it if you wish to change any of those parameters.
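
As a minimal sketch, re-instantiating with an added primary_or_compound_key might look like the following (the "trip_id" column name is hypothetical; substitute a column that is actually unique in your data):

# Hypothetical key column; replace with a column that is unique in your data.
primary_or_compound_key = ["trip_id"]

# Re-instantiate the Profiler with the changed parameters, then rebuild the suite.
profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    ignored_columns=ignored_columns,
    value_set_threshold="few",
    primary_or_compound_key=primary_or_compound_key,
)
suite = profiler.build_suite()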

Semantic Types Dictionary Configuration

The Profiler is fairly rudimentary - if it detects that a column is numeric, it will create numeric Expectations (e.g. expect_column_mean_to_be_between). But if you are storing foreign keys or primary keys as integers, then you may not want numeric Expectations on these columns. This is where the semantic_types dictionary comes in.

The available semantic types that can be specified in the UserConfigurableProfiler are "numeric", "value_set", and "datetime". The Expectations created for each of these types are listed below. You can pass in a dictionary where the keys are the semantic types and the values are lists of columns of those semantic types.

When you pass in a semantic_types_dict, the Profiler will still create table-level expectations, and will create certain expectations for all columns (around nullity and column proportions of unique values). It will then only create semantic-type-specific Expectations for those columns specified in the semantic_types dict.

semantic_types_dict = {
    "numeric": ["fare_amount"],
    "value_set": ["rate_code_id", "pickup_location_id", "payment_type"],
    "datetime": ["pickup_datetime"],
}

validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name=expectation_suite_name
)

profiler = UserConfigurableProfiler(
    profile_dataset=validator, semantic_types_dict=semantic_types_dict
)
suite = profiler.build_suite()
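
If you want to review what was generated, one way (not part of the guide's original steps) is to iterate over the returned suite and print each Expectation type alongside the column it applies to:

# Print each generated Expectation type and the column it applies to
# (table-level Expectations have no "column" kwarg).
for expectation in suite.expectations:
    print(expectation.expectation_type, expectation.kwargs.get("column", "<table-level>"))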

These are the categories of Expectations added when using a semantic_types_dict:

  • Table Expectations
  • Expectations added for all included columns
  • Value set Expectations
  • Datetime Expectations
  • Numeric Expectations
  • Other Expectations