Anomaly tests parameters

Anomaly detection tests accept parameters at three levels:

  1. All anomaly detection tests
  2. Anomaly detection tests with a timestamp_column
  3. The all_columns_anomalies test

Example configurations

login_events.yml

version: 2

models:
  - name: login_events
    config:
      y42:
        apiVersion: v3 # the apiVersion does not impact anomaly testing
      elementary:
        timestamp_column: "loaded_at"
    tests:
      - elementary.all_columns_anomalies:
          where_expression: "event_type in ('event_1', 'event_2') and country_name != 'unwanted country'"
          time_bucket:
            period: day
            count: 1
          # optional - change global sensitivity
          anomaly_sensitivity: 3.5

src_postgres.yml

version: 2

sources:
  - name: src_postgres
    config:
      y42_source:
        type: source-postgres
        connection: Postgres data
      y42:
        apiVersion: v3 # the apiVersion does not impact anomaly testing
    tables:
      - name: orders
        config:
          y42_table:
            import: orders
            columns:
              - id
              - order_date
              # ..
            group: public
            supported_sync_modes:
              - full_refresh
          elementary:
            timestamp_column: order_date
        tests:
          - elementary.all_columns_anomalies:
              exclude_prefix: "id"


Parameters configuration

timestamp_column

timestamp_column: [column name]

Anomaly detection tests utilize a specified column to segment data into time buckets and filter the dataset. It's highly recommended to use a timestamp column such as updated_at, created_at, or loaded_at (date type is also acceptable) for optimal performance.

  • With a timestamp column: Specifying a timestamp_column enables the test to divide the data into time-based buckets using this column's timestamps. It calculates the metric for each bucket and identifies anomalies among them. This approach allows immediate test operation if the table has sufficient historical data.
  • Without a timestamp column: If a timestamp_column is not specified, the test will compute the metric for the entire table data at each run and compare it with metrics from previous runs to detect anomalies. In this case, the test requires the training_period duration to accumulate necessary metrics before it becomes effective.

If no timestamp column is defined, the test does not create time buckets.
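
To make the two modes concrete, here is a minimal Python sketch (illustrative only, not Elementary's implementation) contrasting the per-bucket metrics a timestamp_column enables with the single whole-table metric computed without one. The rows and column names are hypothetical.

timestamp_column_sketch.py

from collections import Counter
from datetime import datetime

# Hypothetical table rows with a loaded_at timestamp column.
rows = [
    {"event": "login", "loaded_at": datetime(2024, 1, 1, 8)},
    {"event": "login", "loaded_at": datetime(2024, 1, 1, 9)},
    {"event": "login", "loaded_at": datetime(2024, 1, 2, 8)},
]

# With a timestamp_column: one metric per daily bucket, so a single run
# over historical data already yields a training set.
daily_counts = Counter(row["loaded_at"].date() for row in rows)
print(daily_counts)  # two buckets: 2024-01-01 -> 2, 2024-01-02 -> 1

# Without a timestamp_column: one metric for the whole table per run;
# the training set only accumulates as the test is executed repeatedly.
print(len(rows))  # 3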

Default: none

example.yml

models:
  - name: login_events
    config:
      elementary:
        timestamp_column: loaded_at

where_expression

where_expression: [SQL expression]

Filter the tested data using a valid SQL expression.

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          where_expression: "event_type in ('event_1', 'event_2') and country_name != 'unwanted country'"

anomaly_sensitivity

anomaly_sensitivity: [int]

This configuration defines how the expected range is calculated. A sensitivity setting of 3 implies that the expected range is within three standard deviations from the average of the training set. A smaller sensitivity value will decrease this range, potentially flagging more values as anomalies. Conversely, larger values increase the expected range, likely reducing the number of detected anomalies.
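
A minimal sketch of the arithmetic described above, assuming a simple mean and sample standard deviation (Elementary's internal estimation may differ):

sensitivity_sketch.py

from statistics import mean, stdev

# Hypothetical training-set metric values (e.g., daily row counts).
training_set = [100, 102, 98, 101, 99, 103, 97]
anomaly_sensitivity = 3

avg = mean(training_set)
std = stdev(training_set)

# Expected range: average +/- sensitivity * standard deviation.
lower, upper = avg - anomaly_sensitivity * std, avg + anomaly_sensitivity * std
print(f"expected range: [{lower:.1f}, {upper:.1f}]")

# A detection-period value outside the range is flagged as anomalous.
value = 115
print(not (lower <= value <= upper))  # True -> anomaly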

Default: 3

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_sensitivity: 4

anomaly_direction

anomaly_direction: [both | spike | drop]

This setting determines which deviations from the expected range are flagged as anomalies: values above the range, below it, or both. It is useful when only one direction of deviation is problematic. For instance, freshness monitoring typically cares only about delays (data arriving later than expected), not about data arriving early. Set anomaly_direction to both to flag deviations in either direction, spike to flag only above-the-range values, or drop to flag only below-the-range values.
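
A small sketch of how the direction setting gates which deviations count; the expected-range bounds are assumed given, and the helper is hypothetical rather than Elementary's code:

direction_sketch.py

def is_anomaly(value: float, lower: float, upper: float, direction: str) -> bool:
    """Illustrative direction check, not Elementary's implementation."""
    spike = value > upper
    drop = value < lower
    if direction == "spike":
        return spike
    if direction == "drop":
        return drop
    return spike or drop  # "both"

print(is_anomaly(120, lower=90, upper=110, direction="spike"))  # True
print(is_anomaly(80, lower=90, upper=110, direction="spike"))   # False: drops are ignored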

Default: both

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_direction: spike

ignore_small_changes


ignore_small_changes:
  spike_failure_percent_threshold: [int]
  drop_failure_percent_threshold: [int]

This configuration allows an anomaly test to fail only if all the following conditions are met:

  • The z-score of the metric within the detection period is considered anomalous.
  • Additionally, at least one of the following conditions must hold:
    • The metric in the detection period is higher than the training-period mean by at least spike_failure_percent_threshold percent, if this threshold is defined.
    • The metric in the detection period is lower than the training-period mean by at least drop_failure_percent_threshold percent, if this threshold is defined.

These settings are useful for situations where metrics are stable, and minor fluctuations result in disproportionately high z-scores, leading to false positives in anomaly detection.

If these thresholds are not defined, the default behavior does not consider small changes, with both spike_failure_percent_threshold and drop_failure_percent_threshold being null.
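
A sketch of the combined condition, assuming the thresholds are interpreted as percent change relative to the training mean (hypothetical helper, not Elementary's API):

small_changes_sketch.py

def test_fails(value, train_mean, train_std, sensitivity=3,
               spike_threshold=None, drop_threshold=None):
    """Hypothetical helper mirroring the conditions described above."""
    z_score = (value - train_mean) / train_std
    if abs(z_score) <= sensitivity:
        return False  # z-score is not anomalous
    if spike_threshold is None and drop_threshold is None:
        return True  # no thresholds defined: the z-score alone decides
    pct_change = (value - train_mean) / train_mean * 100
    if spike_threshold is not None and pct_change >= spike_threshold:
        return True
    if drop_threshold is not None and -pct_change >= drop_threshold:
        return True
    return False

# Stable metric: the z-score is huge (10), but the change is only 1%,
# below the 2% spike threshold, so the test does not fail.
print(test_fails(101, train_mean=100, train_std=0.1,
                 spike_threshold=2, drop_threshold=50))  # False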

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.volume_anomalies:
          ignore_small_changes:
            spike_failure_percent_threshold: 2
            drop_failure_percent_threshold: 50

anomaly_exclude_metrics

anomaly_exclude_metrics: [SQL where expression on fields metric_date / metric_time_bucket / metric_value]

This parameter allows for the exclusion of certain metrics from the training set to enhance test accuracy. By default, all data points in the training set are used for comparison. However, specific metrics can be excluded by applying a filter based on an SQL where expression.

The filter can target the following fields:

  • metric_date - The date associated with the relevant bucket, applicable even for non-daily buckets.
  • metric_time_bucket - The precise time bucket.
  • metric_value - The metric's value.

To use this feature, specify a valid SQL where expression focusing on the columns metric_date, metric_time_bucket, and metric_value. This approach helps refine the training set by removing outliers or irrelevant data points, thereby improving the precision of anomaly detection.
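
Conceptually, the exclusion behaves like the following sketch, where a where expression such as metric_value < 10 drops matching points from the training set (illustrative data and filtering; Elementary evaluates the expression in SQL):

exclude_metrics_sketch.py

# Hypothetical training points with the three filterable fields.
training_points = [
    {"metric_date": "2023-10-01", "metric_value": 100},
    {"metric_date": "2023-10-02", "metric_value": 3},  # known bad backfill
    {"metric_date": "2023-10-03", "metric_value": 98},
]

# anomaly_exclude_metrics: "metric_value < 10" drops matching points.
kept = [p for p in training_points if not (p["metric_value"] < 10)]
print([p["metric_date"] for p in kept])  # ['2023-10-01', '2023-10-03']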

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_exclude_metrics: metric_time_bucket >= '2023-10-01 06:00:00' and metric_time_bucket <= '2023-10-01 07:00:00'

training_period


training_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

Defines the maximum timeframe for which the test collects data, encompassing both the training and detection periods. If a detection_delay is configured, the entire window is shifted back accordingly.

How it works

The training_period parameter is effective for tests configured with a timestamp_column, influencing how historical data is utilized based on table materialization:

  • Regular tables and views: Each run calculates values across the entire training_period.
  • Incremental models and sources: Initial and full refresh tests calculate the full training_period. Subsequent runs focus on the detection_period.

Changes from Default:

  • Full time buckets: To ensure complete time buckets, the training_period is adjusted as needed. For instance, with a weekly time_bucket (period: week), if a 14-day period ends on a Tuesday, the period is extended to include a full week starting from Sunday.
  • Seasonality training set: When seasonality is applied, the training_period is extended to gather sufficient data for each seasonality aspect (e.g., day_of_week) to accurately detect anomalies.

Impact of Adjusting training_period:

  • Increasing training_period: Produces a larger training set, providing a broader data range for establishing the expected range. This generally makes the test less sensitive to individual outliers, decreasing the likelihood of false positives.
  • Decreasing training_period: Produces a smaller training set, limiting the data range for the expected-range calculation. This can make the test more sensitive to individual outliers, increasing the risk of false positives.

Default: 14 days

Relevant tests: Anomaly detection tests that utilize a timestamp_column

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          training_period:
            period: day
            count: 30

detection_period


detection_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

This setting specifies the length of the detection period. For example, if set to 2 days, only data points from the last 2 days are considered for anomaly detection. Similarly, setting it to 7 days means the detection window extends to the last 7 days.

In the context of incremental models, the detection_period also determines how frequently metrics are recalculated. If metrics within this period have been previously calculated, Elementary will update them to account for any recent backfills or data updates. Adjust this configuration based on the typical delays in your data processing to ensure timely and accurate anomaly detection.

How it works

The detection_period defines the timeframe for anomaly detection, with its application varying by the table's materialization type:

  • Regular tables and views: Sets the period for analyzing data for anomalies.
  • Incremental models and sources: Besides detection, it also dictates the timeframe for recalculating metrics to reflect recent data changes or backfills.

Default: 2 days

Relevant tests: Anomaly detection tests that utilize a timestamp_column

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          detection_period:
            period: day
            count: 30

time_bucket


time_bucket:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

This parameter sets the granularity of time buckets for data analysis.

Data is segmented into time buckets to track changes and identify anomalies. For instance, with a daily time bucket (period=day, count=1), it assesses daily row count variations.

Adjust this setting based on your data's characteristics and the resolution needed for anomaly detection. For hourly volume anomaly detection, configure it as period=hour, count=1.

How it works

  • The training_period and detection_period of the test might be extended to ensure full time buckets (for example, a full week from Sunday to Saturday).
  • Weekly buckets start on the day configured as the week start in the data warehouse.
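
The sketch below shows the flooring logic for a two-period daily bucket (period: day, count: 2). The bucket anchor is an assumption for illustration; Elementary performs this bucketing in SQL.

time_bucket_sketch.py

from datetime import datetime, timedelta

def bucket_start(ts: datetime, days: int, origin: datetime) -> datetime:
    """Floor a timestamp to the start of its fixed-size bucket."""
    elapsed_buckets = (ts - origin) // timedelta(days=days)
    return origin + elapsed_buckets * timedelta(days=days)

origin = datetime(2024, 1, 1)  # assumed anchor for illustration
for ts in [datetime(2024, 1, 1, 5), datetime(2024, 1, 2, 23), datetime(2024, 1, 3, 1)]:
    print(ts, "->", bucket_start(ts, days=2, origin=origin))
# The first two timestamps share a bucket; the third starts the next one.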

Default: time_bucket: {period: day, count: 1}

Relevant tests: Anomaly detection tests that utilize a timestamp_column

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          time_bucket:
            period: day
            count: 2

seasonality


seasonality: [day_of_week | hour_of_day | hour_of_week]

The seasonality configuration is crucial for datasets with predictable, repeating patterns over time, enhancing the precision of anomaly detection by taking into account these regular patterns. This approach helps in reducing false positives and avoiding missed anomalies.

Supported seasonality configurations:

  • day_of_week: Aligns daily data buckets for comparison based on the day of the week, ensuring each day is compared with the same weekdays from the past.
  • hour_of_day: For hourly data buckets, it aligns them by the hour of the day, comparing, for example, 10:00-11:00 AM across different days.
  • hour_of_week: Combines both day and hour for a more granular weekly pattern, comparing specific hours on specific days across weeks, like 10:00-11:00 AM on Sundays to the same timeframe on previous Sundays.
How it works

  • The test compares the metric value of a current bucket not to its immediate predecessor but to previous buckets sharing the same seasonality attribute. This means, for instance, a Monday's data is compared against past Mondays, providing a more accurate basis for anomaly detection.
  • To ensure a sufficient historical data set for comparison, the training_period is automatically adjusted when seasonality is applied. For example, when seasonality: day_of_week is configured, the training_period is by default multiplied by 7, ensuring there is enough data from each day of the week to form a robust training set.

Example use case for seasonality

Different days of the week may show varying activity levels in many datasets, with weekends often seeing lower volumes compared to weekdays. Applying the day_of_week seasonality means the expected range for each day's data is based on historical data from the same weekday, accommodating the normal fluctuations seen throughout the week.
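
A sketch of the day_of_week grouping, with hypothetical data where weekends run at roughly half of weekday volume (illustrative only, not Elementary's code):

seasonality_sketch.py

from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily row counts: weekends ~50, weekdays ~100.
history = {}
for i in range(28):
    day = date(2024, 1, 1) + timedelta(days=i)
    history[day] = 50 if day.weekday() >= 5 else 100

# Group historical buckets by weekday; each new bucket is compared only
# against past buckets from the same weekday.
by_weekday = defaultdict(list)
for day, value in history.items():
    by_weekday[day.weekday()].append(value)

# A Saturday value of 48 is judged against past Saturdays (~50), so it is
# not flagged merely because weekdays average ~100.
print(by_weekday[5])  # [50, 50, 50, 50]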

Default: none

Relevant tests: Anomaly detection tests that utilize a timestamp_column and a 1-day time_bucket.

example.yml

models:
  - name: login_events
    tests:
      - elementary.volume_anomalies:
          seasonality: day_of_week

detection_delay


detection_delay:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

Specifies a span of time to exclude from the end of the detection period. This is useful when recent data may not yet be complete or reliable, for example when tests are scheduled to run before the data is fully populated. In effect, it is a buffer at the end of the detection period that is omitted from the analysis.
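
The window arithmetic reduces to the following sketch (the dates are hypothetical):

detection_delay_sketch.py

from datetime import datetime, timedelta

now = datetime(2024, 1, 10)
detection_period = timedelta(days=2)
detection_delay = timedelta(days=1)

# The last `detection_delay` is cut off the end of the detection window.
window_end = now - detection_delay            # 2024-01-09
window_start = window_end - detection_period  # 2024-01-07
print(window_start, "->", window_end)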

Default: 0

Relevant tests: Anomaly detection tests that utilize a timestamp_column.

example.yml

models:
  - name: login_events
    tests:
      - elementary.volume_anomalies:
          detection_delay:
            period: day
            count: 1

column_anomalies

column_anomalies: [list of monitors]

Select which monitors to activate as part of the test.

Default monitors by type:

  Data quality metric     Column Type
  null_count              any
  null_percent            any
  min_length              string
  max_length              string
  average_length          string
  missing_count           string
  missing_percent         string
  min                     numeric
  max                     numeric
  average                 numeric
  zero_count              numeric
  zero_percent            numeric
  standard_deviation      numeric
  variance                numeric

Opt-in monitors by type:

  Data quality metric     Column Type
  sum                     numeric

Default: default monitors

example.yml

models:
  - name: login_events
    tests:
      - elementary.column_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - average

exclude_prefix


exclude_prefix: [string]

This parameter is specific to the all_columns_anomalies test, allowing the exclusion of columns from the test based on their prefix. By specifying a prefix, any column whose name starts with this prefix will not be included in the anomaly detection process. This feature is particularly useful for selectively ignoring columns that may not be relevant or could skew the results of the anomaly detection.

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          exclude_prefix: "id_"

exclude_regexp


exclude_regexp: [regex]

This parameter is specific to the all_columns_anomalies test, allowing the exclusion of columns from the test based on a regular expression match. By providing a regular expression pattern, columns whose names match this pattern will be excluded from the anomaly detection process. This is useful for filtering out columns dynamically based on naming conventions or patterns, ensuring that only relevant data is analyzed for anomalies.
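
The sketch below illustrates how both exclude_prefix and exclude_regexp narrow the set of tested columns (the column names are hypothetical; Elementary applies the exclusion to the table's actual schema):

exclude_columns_sketch.py

import re

columns = ["id_user", "id_session", "email", "payload_SDC"]

# exclude_prefix: "id_" removes columns starting with the prefix.
kept = [c for c in columns if not c.startswith("id_")]

# exclude_regexp: ".*SDC$" removes columns matching the pattern.
kept = [c for c in kept if not re.search(r".*SDC$", c)]
print(kept)  # ['email']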

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          exclude_regexp: ".*SDC$"