Anomaly tests parameters

Anomaly detection tests accept parameters at three levels:

  1. All anomaly detection tests
  2. Anomaly detection tests with a timestamp_column
  3. The all_columns_anomalies test

Example configurations

login_events.yml

version: 2

models:
  - name: login_events
    config:
      y42:
        apiVersion: v3 # the apiVersion does not impact anomaly testing
      elementary:
        timestamp_column: "loaded_at"
    tests:
      - elementary.all_columns_anomalies:
          where_expression: "event_type in ('event_1', 'event_2') and country_name != 'unwanted country'"
          time_bucket:
            period: day
            count: 1
          # optional - change global sensitivity
          anomaly_sensitivity: 3.5

src_postgres.yml

version: 2

sources:
  - name: src_postgres
    config:
      y42_source:
        type: source-postgres
        connection: Postgres data
      y42:
        apiVersion: v3 # the apiVersion does not impact anomaly testing
    tables:
      - name: orders
        config:
          y42_table:
            import: orders
            columns:
              - id
              - order_date
              # ..
            group: public
            supported_sync_modes:
              - full_refresh
          elementary:
            timestamp_column: order_date
        tests:
          - elementary.all_columns_anomalies:
              exclude_prefix: "id"


Parameters configuration

timestamp_column

timestamp_column: [column name]

Anomaly detection tests utilize a specified column to segment data into time buckets and filter the dataset. It's highly recommended to use a timestamp column such as updated_at, created_at, or loaded_at (date type is also acceptable) for optimal performance.

  • With a timestamp column: Specifying a timestamp_column enables the test to divide the data into time-based buckets using this column's timestamps. It calculates the metric for each bucket and identifies anomalies among them. This approach allows immediate test operation if the table has sufficient historical data.
  • Without a timestamp column: If a timestamp_column is not specified, the test will compute the metric for the entire table data at each run and compare it with metrics from previous runs to detect anomalies. In this case, the test requires the training_period duration to accumulate necessary metrics before it becomes effective.

If no timestamp column is defined, the test does not create time buckets.
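
To make the two modes concrete, here is a minimal Python sketch (illustrative only, not Elementary's implementation) contrasting the per-bucket metrics a timestamp_column enables with the single whole-table metric computed without one. The rows and column names are hypothetical.

timestamp_column_sketch.py

from collections import Counter
from datetime import datetime

# Hypothetical table rows with a loaded_at timestamp column.
rows = [
    {"event": "login", "loaded_at": datetime(2024, 1, 1, 8)},
    {"event": "login", "loaded_at": datetime(2024, 1, 1, 9)},
    {"event": "login", "loaded_at": datetime(2024, 1, 2, 8)},
]

# With a timestamp_column: one metric per daily bucket, so a single run
# over historical data already yields a training set.
daily_counts = Counter(row["loaded_at"].date() for row in rows)
print(daily_counts)  # two buckets: 2024-01-01 -> 2, 2024-01-02 -> 1

# Without a timestamp_column: one metric for the whole table per run;
# the training set only accumulates as the test is executed repeatedly.
print(len(rows))  # 3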

Default: none

example.yml

models:
  - name: login_events
    config:
      elementary:
        timestamp_column: loaded_at

where_expression

where_expression: [SQL expression]

Filter the tested data using a valid SQL expression.

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          where_expression: "event_type in ('event_1', 'event_2') and country_name != 'unwanted country'"

anomaly_sensitivity

anomaly_sensitivity: [int]

This configuration defines how the expected range is calculated. A sensitivity setting of 3 implies that the expected range is within three standard deviations from the average of the training set. A smaller sensitivity value will decrease this range, potentially flagging more values as anomalies. Conversely, larger values increase the expected range, likely reducing the number of detected anomalies.
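
A minimal sketch of the arithmetic described above, assuming a simple mean and sample standard deviation (Elementary's internal estimation may differ):

sensitivity_sketch.py

from statistics import mean, stdev

# Hypothetical training-set metric values (e.g., daily row counts).
training_set = [100, 102, 98, 101, 99, 103, 97]
anomaly_sensitivity = 3

avg = mean(training_set)
std = stdev(training_set)

# Expected range: average +/- sensitivity * standard deviation.
lower, upper = avg - anomaly_sensitivity * std, avg + anomaly_sensitivity * std
print(f"expected range: [{lower:.1f}, {upper:.1f}]")

# A detection-period value outside the range is flagged as anomalous.
value = 115
print(not (lower <= value <= upper))  # True -> anomaly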

Default: 3

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_sensitivity: 4

anomaly_direction

anomaly_direction: [both | spike | drop]

This setting determines which deviations from the expected range are flagged as anomalies: values above the range, below it, or both. It is useful when only one direction of deviation is problematic. For instance, freshness monitoring typically cares only about delays (data arriving later than expected), not about data arriving early. Set anomaly_direction to both to flag deviations in either direction, spike to flag only above-the-range values, or drop to flag only below-the-range values.
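
A small sketch of how the direction setting gates which deviations count; the expected-range bounds are assumed given, and the helper is hypothetical rather than Elementary's code:

direction_sketch.py

def is_anomaly(value: float, lower: float, upper: float, direction: str) -> bool:
    """Illustrative direction check, not Elementary's implementation."""
    spike = value > upper
    drop = value < lower
    if direction == "spike":
        return spike
    if direction == "drop":
        return drop
    return spike or drop  # "both"

print(is_anomaly(120, lower=90, upper=110, direction="spike"))  # True
print(is_anomaly(80, lower=90, upper=110, direction="spike"))   # False: drops are ignored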

Default: both

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_direction: spike

ignore_small_changes


ignore_small_changes:
  spike_failure_percent_threshold: [int]
  drop_failure_percent_threshold: [int]

This configuration allows an anomaly test to fail only if all the following conditions are met:

  • The z-score of the metric within the detection period is considered anomalous.
  • Additionally, at least one of the following conditions must hold:
    • The metric in the detection period is higher than the training-period mean by at least spike_failure_percent_threshold percent, if this threshold is defined.
    • The metric in the detection period is lower than the training-period mean by at least drop_failure_percent_threshold percent, if this threshold is defined.

These settings are useful for situations where metrics are stable, and minor fluctuations result in disproportionately high z-scores, leading to false positives in anomaly detection.

If these thresholds are not defined, the default behavior does not consider small changes, with both spike_failure_percent_threshold and drop_failure_percent_threshold being null.
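
A sketch of the combined condition, assuming the thresholds are interpreted as percent change relative to the training mean (hypothetical helper, not Elementary's API):

small_changes_sketch.py

def test_fails(value, train_mean, train_std, sensitivity=3,
               spike_threshold=None, drop_threshold=None):
    """Hypothetical helper mirroring the conditions described above."""
    z_score = (value - train_mean) / train_std
    if abs(z_score) <= sensitivity:
        return False  # z-score is not anomalous
    if spike_threshold is None and drop_threshold is None:
        return True  # no thresholds defined: the z-score alone decides
    pct_change = (value - train_mean) / train_mean * 100
    if spike_threshold is not None and pct_change >= spike_threshold:
        return True
    if drop_threshold is not None and -pct_change >= drop_threshold:
        return True
    return False

# Stable metric: the z-score is huge (10), but the change is only 1%,
# below the 2% spike threshold, so the test does not fail.
print(test_fails(101, train_mean=100, train_std=0.1,
                 spike_threshold=2, drop_threshold=50))  # False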

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.volume_anomalies:
          ignore_small_changes:
            spike_failure_percent_threshold: 2
            drop_failure_percent_threshold: 50

anomaly_exclude_metrics

anomaly_exclude_metrics: [SQL where expression on fields metric_date / metric_time_bucket / metric_value]

This parameter allows for the exclusion of certain metrics from the training set to enhance test accuracy. By default, all data points in the training set are used for comparison. However, specific metrics can be excluded by applying a filter based on an SQL where expression.

The filter can target the following fields:

  • metric_date - The date associated with the relevant bucket, applicable even for non-daily buckets.
  • metric_time_bucket - The precise time bucket.
  • metric_value - The metric's value.

To use this feature, specify a valid SQL where expression focusing on the columns metric_date, metric_time_bucket, and metric_value. This approach helps refine the training set by removing outliers or irrelevant data points, thereby improving the precision of anomaly detection.
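
Conceptually, the exclusion behaves like the following sketch, where a where expression such as metric_value < 10 drops matching points from the training set (illustrative data and filtering; Elementary evaluates the expression in SQL):

exclude_metrics_sketch.py

# Hypothetical training points with the three filterable fields.
training_points = [
    {"metric_date": "2023-10-01", "metric_value": 100},
    {"metric_date": "2023-10-02", "metric_value": 3},  # known bad backfill
    {"metric_date": "2023-10-03", "metric_value": 98},
]

# anomaly_exclude_metrics: "metric_value < 10" drops matching points.
kept = [p for p in training_points if not (p["metric_value"] < 10)]
print([p["metric_date"] for p in kept])  # ['2023-10-01', '2023-10-03']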

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_exclude_metrics: metric_time_bucket >= '2023-10-01 06:00:00' and metric_time_bucket <= '2023-10-01 07:00:00'

training_period


training_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

Defines the maximum timeframe for which the test collects data, encompassing both the training and detection periods. If a detection_delay is configured, the entire window is shifted back accordingly.

How it works

The training_period parameter is effective for tests configured with a timestamp_column, influencing how historical data is utilized based on table materialization:

  • Regular tables and views: Each run calculates values across the entire training_period.
  • Incremental models and sources: Initial and full refresh tests calculate the full training_period. Subsequent runs focus on the detection_period.

Changes from Default:

  • Full time buckets: To ensure complete time buckets, the training_period is adjusted as needed. For instance, with a weekly time_bucket (period: week), if a 14-day period ends on a Tuesday, the period is extended to include a full week starting from Sunday.
  • Seasonality training set: When seasonality is applied, the training_period is extended to gather sufficient data for each seasonality aspect (e.g., day_of_week) to accurately detect anomalies.

Impact of Adjusting training_period:

  • Increasing training_period: Produces a larger training set, providing a broader data range for establishing the expected range. This generally makes the test less sensitive to individual outliers, decreasing the likelihood of false positives.
  • Decreasing training_period: Produces a smaller training set, limiting the data range for the expected-range calculation. This can make the test more sensitive to individual outliers, increasing the risk of false positives.

Default: 14 days

Relevant tests: Anomaly detection tests that utilize a timestamp_column

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          training_period:
            period: day
            count: 30

detection_period


detection_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

This setting specifies the length of the detection period. For example, if set to 2 days, only data points from the last 2 days are considered for anomaly detection. Similarly, setting it to 7 days means the detection window extends to the last 7 days.

In the context of incremental models, the detection_period also determines how frequently metrics are recalculated. If metrics within this period have been previously calculated, Elementary will update them to account for any recent backfills or data updates. Adjust this configuration based on the typical delays in your data processing to ensure timely and accurate anomaly detection.

How it works

The detection_period defines the timeframe for anomaly detection, with its application varying by the table's materialization type:

  • Regular tables and views: Sets the period for analyzing data for anomalies.
  • Incremental models and sources: Besides detection, it also dictates the timeframe for recalculating metrics to reflect recent data changes or backfills.

Default: 2 days

Relevant tests: Anomaly detection tests that utilize a timestamp_column

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          detection_period:
            period: day
            count: 30

time_bucket


time_bucket:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

This parameter sets the granularity of time buckets for data analysis.

Data is segmented into time buckets to track changes and identify anomalies. For instance, with a daily time bucket (period=day, count=1), it assesses daily row count variations.

Adjust this setting based on your data's characteristics and the resolution needed for anomaly detection. For hourly volume anomaly detection, configure it as period=hour, count=1.

How it works

  • The training_period and detection_period of the test might be extended to ensure full time buckets (for example, a full week from Sunday to Saturday).
  • Weekly buckets start on the day configured as the week start in the data warehouse.
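
The sketch below shows the flooring logic for a two-period daily bucket (period: day, count: 2). The bucket anchor is an assumption for illustration; Elementary performs this bucketing in SQL.

time_bucket_sketch.py

from datetime import datetime, timedelta

def bucket_start(ts: datetime, days: int, origin: datetime) -> datetime:
    """Floor a timestamp to the start of its fixed-size bucket."""
    elapsed_buckets = (ts - origin) // timedelta(days=days)
    return origin + elapsed_buckets * timedelta(days=days)

origin = datetime(2024, 1, 1)  # assumed anchor for illustration
for ts in [datetime(2024, 1, 1, 5), datetime(2024, 1, 2, 23), datetime(2024, 1, 3, 1)]:
    print(ts, "->", bucket_start(ts, days=2, origin=origin))
# The first two timestamps share a bucket; the third starts the next one.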

Default: time_bucket: {period: day, count: 1}

Relevant tests: Anomaly detection tests that utilize a timestamp_column

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          time_bucket:
            period: day
            count: 2

seasonality


seasonality: [day_of_week | hour_of_day | hour_of_week]

The seasonality configuration is crucial for datasets with predictable, repeating patterns over time, enhancing the precision of anomaly detection by taking into account these regular patterns. This approach helps in reducing false positives and avoiding missed anomalies.

Supported seasonality configurations:

  • day_of_week: Aligns daily data buckets for comparison based on the day of the week, ensuring each day is compared with the same weekdays from the past.
  • hour_of_day: For hourly data buckets, it aligns them by the hour of the day, comparing, for example, 10:00-11:00 AM across different days.
  • hour_of_week: Combines both day and hour for a more granular weekly pattern, comparing specific hours on specific days across weeks, like 10:00-11:00 AM on Sundays to the same timeframe on previous Sundays.
How it works

  • The test compares the metric value of a current bucket not to its immediate predecessor but to previous buckets sharing the same seasonality attribute. This means, for instance, a Monday's data is compared against past Mondays, providing a more accurate basis for anomaly detection.
  • To ensure a sufficient historical data set for comparison, the training_period is automatically adjusted when seasonality is applied. For example, when seasonality: day_of_week is configured, the training_period is by default multiplied by 7, ensuring there is enough data from each day of the week to form a robust training set.

Example use case for seasonality

Different days of the week may show varying activity levels in many datasets, with weekends often seeing lower volumes compared to weekdays. Applying the day_of_week seasonality means the expected range for each day's data is based on historical data from the same weekday, accommodating the normal fluctuations seen throughout the week.
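
A sketch of the day_of_week grouping, with hypothetical data where weekends run at roughly half of weekday volume (illustrative only, not Elementary's code):

seasonality_sketch.py

from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily row counts: weekends ~50, weekdays ~100.
history = {}
for i in range(28):
    day = date(2024, 1, 1) + timedelta(days=i)
    history[day] = 50 if day.weekday() >= 5 else 100

# Group historical buckets by weekday; each new bucket is compared only
# against past buckets from the same weekday.
by_weekday = defaultdict(list)
for day, value in history.items():
    by_weekday[day.weekday()].append(value)

# A Saturday value of 48 is judged against past Saturdays (~50), so it is
# not flagged merely because weekdays average ~100.
print(by_weekday[5])  # [50, 50, 50, 50]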

Default: none

Relevant tests: Anomaly detection tests that utilize a timestamp_column and a 1-day time_bucket.

example.yml

models:
  - name: login_events
    tests:
      - elementary.volume_anomalies:
          seasonality: day_of_week

detection_delay


detection_delay:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

Specifies a span of time to exclude from the end of the detection period. This is useful when recent data may not yet be complete or reliable, for example when tests are scheduled to run before the data is fully populated. In effect, it is a buffer at the end of the detection period that is omitted from the analysis.
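
The window arithmetic reduces to the following sketch (the dates are hypothetical):

detection_delay_sketch.py

from datetime import datetime, timedelta

now = datetime(2024, 1, 10)
detection_period = timedelta(days=2)
detection_delay = timedelta(days=1)

# The last `detection_delay` is cut off the end of the detection window.
window_end = now - detection_delay            # 2024-01-09
window_start = window_end - detection_period  # 2024-01-07
print(window_start, "->", window_end)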

Default: 0

Relevant tests: Anomaly detection tests that utilize a timestamp_column.

example.yml

models:
  - name: login_events
    tests:
      - elementary.volume_anomalies:
          detection_delay:
            period: day
            count: 1

column_anomalies

column_anomalies: [list of monitors]

Select which monitors to activate as part of the test.

Default monitors by type:

  Data quality metric     Column Type
  null_count              any
  null_percent            any
  min_length              string
  max_length              string
  average_length          string
  missing_count           string
  missing_percent         string
  min                     numeric
  max                     numeric
  average                 numeric
  zero_count              numeric
  zero_percent            numeric
  standard_deviation      numeric
  variance                numeric

Opt-in monitors by type:

  Data quality metric     Column Type
  sum                     numeric

Default: default monitors

example.yml

models:
  - name: login_events
    tests:
      - elementary.column_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - average

exclude_prefix


exclude_prefix: [string]

This parameter is specific to the all_columns_anomalies test, allowing the exclusion of columns from the test based on their prefix. By specifying a prefix, any column whose name starts with this prefix will not be included in the anomaly detection process. This feature is particularly useful for selectively ignoring columns that may not be relevant or could skew the results of the anomaly detection.

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          exclude_prefix: "id_"

exclude_regexp


exclude_regexp: [regex]

This parameter is specific to the all_columns_anomalies test, allowing the exclusion of columns from the test based on a regular expression match. By providing a regular expression pattern, columns whose names match this pattern will be excluded from the anomaly detection process. This is useful for filtering out columns dynamically based on naming conventions or patterns, ensuring that only relevant data is analyzed for anomalies.
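
The sketch below illustrates how both exclude_prefix and exclude_regexp narrow the set of tested columns (the column names are hypothetical; Elementary applies the exclusion to the table's actual schema):

exclude_columns_sketch.py

import re

columns = ["id_user", "id_session", "email", "payload_SDC"]

# exclude_prefix: "id_" removes columns starting with the prefix.
kept = [c for c in columns if not c.startswith("id_")]

# exclude_regexp: ".*SDC$" removes columns matching the pattern.
kept = [c for c in kept if not re.search(r".*SDC$", c)]
print(kept)  # ['email']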

Default: none

example.yml

models:
  - name: login_events
    tests:
      - elementary.all_columns_anomalies:
          exclude_regexp: ".*SDC$"