
Announcing the New mPulse Alert Feature: Anomaly Detection

May 6, 2020 · by Julia Yang

This blog post gives you an intro to real user monitoring, walks you through creating an anomaly detection alert, and describes the models behind the feature.

Overview

mPulse is a RUM (Real User Monitoring) engine that enables you to visually inspect the performance of your websites.  It provides tools like alerts to monitor website performance in real time.  We've just added a new capability to mPulse alerts: Anomaly Detection.

Anomaly Detection looks at the historical data gathered on a domain to establish what normal behavior looks like, and then generates a model.  As new data is received by mPulse, it is compared to this model in real time.  If the real-time measurements deviate from normal, an alert is sent to notify you.
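To make that comparison concrete, here is a minimal sketch in Python (our own illustration, not the mPulse implementation), with the model reduced to a single pair of bounds on what is normal:

    # Illustrative only: the real model varies its bounds by time of day and sensitivity.
    def is_anomalous(measurement: float, lower: float, upper: float) -> bool:
        """Flag a per-minute aggregate that falls outside the modeled normal range."""
        return measurement < lower or measurement > upper

    # Example: a 9.2s page load time against modeled bounds of 1.5s to 6.0s.
    print(is_anomalous(9.2, lower=1.5, upper=6.0))  # True -> anomalous minute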

Our anomaly detection is built to handle dimensions that can be measured and expressed numerically.  In this iteration, it targets time-series data with strong trends and strong daily or weekly seasonality.

An example of a strong trend would be a website having a 3-day sale.  This site may see a pronounced increase in traffic throughout the sales period, leading to a positive, increasing trend.  An example of a strong weekly seasonality can be seen on most news sites, where traffic increases from Monday to Friday and decreases on the weekends during normal news cycles.  Beacon count (a.k.a. web traffic) often has seasonality based on when people go to a website.  IQR (Interquartile Range) models are especially good for dimensions that have both trend and seasonality and are not too volatile.

An example of a dimension that’s not too volatile would be an online store monitoring the page load time of its home page.  If the page load time for more than half the users gets too slow, you would want to be notified by mPulse.  The notification can include the values that triggered the alert, and you can link a customized mPulse dashboard to narrow in on causes.

How To Create an Anomaly Detection Alert

To create an anomaly detection alert, go to Alerts from the mPulse side menu within Central, and open the Alert creation dialog.  Select the Anomaly Detection option under Choose a Data Event.  Creating an anomaly detection alert has been simplified to a few required fields: Name, App (e.g., mPulse Demo), Timer/Metric (e.g., Timer), and Dimension (e.g., Page Load Time).

[Screenshot: alert creation dialog]

Additional filters can be added to home in on what to monitor (e.g., page group, device type).

[Screenshot]

A default email and message are provided with each alert, but these actions are all customizable to your preferences.  

Note: some form of notification action must be chosen.  If no action is chosen, the alert cannot be saved.

When you have filled out all necessary information, make sure to save your preferences.

Choosing a Data Event

Anomaly must persist for is a value you pick, from 1 to 10 minutes.  It indicates how many consecutive minutes of anomalous behavior must occur before you receive a notification.  With the default value of 5 minutes, you'll receive a notification after 5 consecutive minutes of anomalous behavior.  Keep in mind that you will not get additional alerts until the incoming data goes back to normal for at least 1 minute.
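The timing rules above can be pictured as a small state machine.  Here is a sketch (ours, with hypothetical names) of the persists-for and back-to-normal behavior:

    class PersistenceGate:
        """Notify only after N consecutive anomalous minutes; a single normal minute re-arms."""

        def __init__(self, persist_minutes: int = 5):
            self.persist_minutes = persist_minutes
            self.anomalous_streak = 0
            self.alert_active = False

        def observe_minute(self, is_anomalous: bool) -> bool:
            """Return True if a notification should be sent for this minute of data."""
            if is_anomalous:
                self.anomalous_streak += 1
                if self.anomalous_streak >= self.persist_minutes and not self.alert_active:
                    self.alert_active = True
                    return True  # first notification for this incident
            else:
                self.anomalous_streak = 0
                self.alert_active = False  # at least 1 normal minute re-arms the alert
            return False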

Alert Sensitivity lets you control the sensitivity of the model, from least to most sensitive.  The idea is to let you choose whether you want the model to detect more anomalies (more sensitive) or fewer anomalies (less sensitive).

Filters allow you to narrow down what you want to monitor.  This way you can separate mobile from desktop, and US consumers from UK consumers, for example.

Specify Time Options for when to monitor the event

You can choose the days an alert is active, say Monday through Friday.  If you want to monitor a specific backend timer, you can specify both the days and the times the alert is active.

Choose Action(s)

This is required.  You can choose one or multiple actions.  When an alert is triggered, you can get an email notification, call a webhook, send an event to PagerDuty, receive a Slack message, or any combination of the above.
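If you choose the webhook action, your endpoint receives the alert as an HTTP request.  Below is a minimal sketch of a receiver in Python; the payload field names are hypothetical, not a documented mPulse schema:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class AlertWebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # 'alertName' and 'triggeredAt' are assumed field names for illustration.
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            print(f"Alert fired: {payload.get('alertName')} at {payload.get('triggeredAt')}")
            self.send_response(200)  # acknowledge receipt
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), AlertWebhookHandler).serve_forever()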

Models

This initial release uses IQR (interquartile range) to build a model for detecting anomalous measurements.  The model also incorporates trend and seasonality.  The IQR models are trained on either the last 14 or the last 30 days of data, depending on how much data you have available at the time of alert creation.  If you’re new and don’t have enough data, the alert will automatically build its model once the data becomes available.  The preferred window is the last 30 days of data.

Once a model is created, it remains in use for the next 30 days, after which an automated system renews it.  You can also get an updated model sooner by opening and saving the alert, provided 14+ days of new data have been gathered since the last save.  The existing model is used until a new model replaces it.
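Assuming the windows described above, the choice of training window might look like this sketch (threshold names are ours):

    from datetime import date
    from typing import Optional

    PREFERRED_DAYS = 30
    MINIMUM_DAYS = 14

    def training_window(first_beacon_day: date, today: date) -> Optional[int]:
        """Prefer 30 days of history, fall back to 14, else wait for more data."""
        history = (today - first_beacon_day).days
        if history >= PREFERRED_DAYS:
            return PREFERRED_DAYS
        if history >= MINIMUM_DAYS:
            return MINIMUM_DAYS
        return None  # not enough data yet; the alert builds its model later

    # A domain that started beaconing 16 days ago gets the 14-day window.
    print(training_window(date(2020, 4, 20), date(2020, 5, 6)))  # 14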

How data determines the quality of alerts

These models are built on web property data.  The alert looks at the last 14 or 30 days of data when a new model request goes through.  If you go to your Summary dashboard, you can gauge whether your new alert will perform well or poorly based on what the last 30 days of data look like.

There are 2 ways of thinking about what makes an alert successful:

  1. Is there enough data for what is being monitored to build a model?  And,

  2. For those 14 or 30 days, did data come in continuously, without large gaps? (See the 2nd, 3rd, and 4th screenshots below for a visual of large data gaps.)

The first question is about the overall data population, and the second about data density.  Also be aware that if the incoming data of the next 30 days is vastly different from the measurements of the last 30 days, the model built on the previous 30 days will not represent what is current.
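One way to sanity-check both questions is a quick pass over the daily beacon counts.  This sketch is our own; the population threshold is an assumption, not an mPulse value:

    def data_quality(beacons_per_day, min_total=10_000):
        """Return (enough data overall?, fraction of days that had any data)."""
        population = sum(beacons_per_day)
        density = sum(1 for n in beacons_per_day if n > 0) / len(beacons_per_day)
        return population >= min_total, density

    # 30 days where weekends (and a few other days) produced no beacons.
    week = [0, 1200, 950, 0, 0, 1100, 1300]
    ok, density = data_quality(week * 4 + [1000, 0])
    print(ok, round(density, 2))  # True 0.57 -> enough volume, but gaps on many days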

No Models
No data (which will not generate a model):

[Graph]

Not enough data (10 data points), with large gaps, for Page Load (s):

[Graph]

Bad Models
Here are some examples of data that would create models, but those models will not be very good at detecting good or bad behavior.

Enough data (1.5+ million beacons) but large gaps in the data on most days for Page Load (s):

[Graph]

The last 24 hours view is especially telling: whatever model is generated will not be very effective, as data is gathered for only a small percentage of the day:

[Graph]

What a customer is more likely to see is a bit more subtle.  This is shown with Front-End Time (s) below for the last 30 days:

[Graph]

Using the 75th percentile (the percentile used in an earlier beta), we see that there is data every day for this model.

Switch to the last 24 hours view and look at the green line.  It is sporadic with seemingly no seasonality for the front-end time and a lot of data volatility:

[Graph]

Data volatility means the data spikes sporadically, in a way that’s hard to predict mathematically.  This kind of volatility can make things difficult for the model, especially if the alert is set to high sensitivity.  We suggest using a longer persists-for value (e.g., 10 minutes) to counter high-volatility situations like this.  It allows the alert to be less reactive and to trigger only when the data shows consistently bad behavior for a period of time.
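If you want to quantify volatility before choosing settings, the coefficient of variation is one common yardstick (our suggestion; not necessarily what mPulse computes internally):

    import statistics

    def coefficient_of_variation(samples):
        """Rough volatility score: standard deviation relative to the mean."""
        mean = statistics.fmean(samples)
        return statistics.stdev(samples) / mean if mean else float("inf")

    steady = [3.0, 3.1, 2.9, 3.2, 3.0]
    spiky = [3.0, 9.5, 2.8, 0.4, 3.1]
    print(round(coefficient_of_variation(steady), 2))  # ~0.04 -> low volatility
    print(round(coefficient_of_variation(spiky), 2))   # ~0.90 -> high volatility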

OK Models
OK models come from domains with enough data volume but low seasonality and no clear trend.  This can mean that if the data is volatile, the model will not have a good handle on it.  However, these cases still use 30 days of data to define “normal,” and the alerts are updated every 30 days without user input.

There is not enough trend or seasonality for the Page Load (s) shown below:

[Graph]

This proves there’s data for every day.  There does seem to be a bit of a downward trend.

The last 24 hours view of Page Load (s):

[Graph]

As you can see, as the number of page views goes down, the page load time (at the 50th percentile) exhibits more volatility in behavior.

Good Models
Good models are built from high volumes of data with obvious trends and seasonalities on the daily or weekly level.

The last 30 days view for Page Views (K):

[Graph]

The last 24 hours for Page Views:

[Graph]

You can clearly see that there is a daily seasonality for page views above.  In general, the data follows that seasonality with low variation.  This is a perfect example of data with a specific pattern and low volatility that can be captured by our model.  If anomalies occur in the current page views, we’d be accurate in finding them.

Pre-processing

We check the existing data for trend and seasonality, which helps IQR work with the time-series data.

Trending
If the data shows a pattern of increasing or decreasing trend over a specific time frame (generally around 24 to 72 hours), we capture and store it.
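A least-squares slope over the window is one standard way to capture such a trend; the sketch below is illustrative, not the exact mPulse detector:

    def trend_slope(hourly_values):
        """Least-squares slope over a 24-72 hour window; the sign gives the trend direction."""
        n = len(hourly_values)
        x_mean = (n - 1) / 2
        y_mean = sum(hourly_values) / n
        num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(hourly_values))
        den = sum((x - x_mean) ** 2 for x in range(n))
        return num / den

    # Traffic rising steadily for 48 hours.
    rising = [100 + 2 * h for h in range(48)]
    print(trend_slope(rising))  # 2.0 -> positive, increasing trend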

Seasonality
If the data shows a daily or weekly pattern, we will capture that pattern.  However, since our models are based on the last 30 days of data, we currently ignore monthly and yearly seasonalities.

Here’s a daily seasonality that’s discernible to the eye for Page Views:

[Graph]
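One standard way to detect such a pattern (our assumption about technique, not mPulse source) is autocorrelation: a series with daily seasonality correlates strongly with itself shifted by 24 hours:

    import math

    def autocorrelation(values, lag):
        """Correlation of the series with itself shifted by `lag` samples."""
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values)
        cov = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
        return cov / var if var else 0.0

    # Hourly page views with a clean 24-hour cycle over 7 days.
    hourly = [100 + 40 * math.sin(2 * math.pi * h / 24) for h in range(24 * 7)]
    print(round(autocorrelation(hourly, 24), 2))  # ~0.86 -> strong daily seasonality
    print(round(autocorrelation(hourly, 7), 2))   # ~-0.25 -> no 7-hour pattern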

Percentile

We currently use the 50th percentile, to better capture issues with large impact and to better control data volatility.  Right now the aggregation is fixed.  For example, if we’re looking at page load time, we aggregate each minute to its 50th percentile.
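In other words, each minute’s raw beacons collapse to their median before the model ever sees them.  A minimal sketch:

    import statistics

    def aggregate_minute(measurements):
        """Collapse one minute of raw beacon values into its 50th percentile (median)."""
        return statistics.median(measurements)

    # One minute of page load times (seconds); a single slow outlier barely moves the median.
    print(aggregate_minute([2.1, 2.4, 2.2, 14.0, 2.3]))  # 2.3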

IQR

Simplistically, IQR gives a lower and upper bound.  We evaluate each minute and see where that minute of aggregated data falls within the bounds.  We have added a normal-lower bound and normal-upper bound to use after an anomaly is detected; these two values determine when the anomaly is back to normal.

IQR also incorporates seasonality and trend, which gives our models a sense of time.  This allows the models to better define normality.  For example, if every day from 11AM to 2PM the page views drop in volume, then the lower and upper bounds during this period will be lower in value than the page views during other time periods.  This allows us to improve alerting dimensions with inherent time-series traits.
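Putting the two paragraphs above together, a sketch of seasonal IQR bounds might look like this (the fence multipliers are classic textbook values and our assumption, not published mPulse parameters):

    import statistics
    from collections import defaultdict

    def iqr_bounds(values, k=1.5):
        """Classic IQR fences: [Q1 - k*IQR, Q3 + k*IQR]."""
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        return q1 - k * iqr, q3 + k * iqr

    def hourly_bounds(samples):
        """One set of fences per hour of day, so normality follows the daily cycle.

        samples: iterable of (hour_of_day, value) pairs from the training window.
        """
        by_hour = defaultdict(list)
        for hour, value in samples:
            by_hour[hour].append(value)
        return {hour: iqr_bounds(values) for hour, values in by_hour.items()}

    # Midday page load times produce fences tailored to that hour.
    noon = [(12, v) for v in (2.0, 2.3, 2.1, 2.6, 2.4, 2.2)]
    print(hourly_bounds(noon)[12])

    # A tighter k (e.g. 1.0) could serve as the normal-lower/normal-upper bounds
    # that decide when a detected anomaly is considered over.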

Below is an example of an IQR model where the data has neither trend nor seasonality:

[Graph]

Unlike the page load time dashboard above, certain metrics (e.g., Beacon Count) usually exhibit clearer seasonalities and trends.  The widgets in the dashboard below show such a metric with our new Sensitivity option included, visualizing the difference between low and high sensitivity.  You can adjust sensitivity based on the volatility of the data being monitored.

In our experience, some metrics are more volatile than others and you may want to adjust the Sensitivity to be less sensitive.  On the other hand, most timers are less volatile and the Sensitivity can be kept at normal or changed to be more sensitive, if preferred.

IQR model with trend and seasonality:

[Graph]

Sensitivity
The chart above illustrates sensitivity.

IQR has 5 sets of lower and upper boundaries, one for each of the 5 sensitivity settings.  Least sensitive means the lower value is very low and the upper value is very high, so we’re less likely to alert.  Most sensitive means the lower value is not as low and the upper value is not as high, narrowing the fenced-in region of normal.  More sensitivity causes more alerts to fire, as it’s easier to stray outside the lower and upper boundaries.
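As a rough mental model, sensitivity can be thought of as scaling the fence multiplier; the mapping below is entirely made up for illustration:

    # Hypothetical multipliers for the 5 sensitivity settings (1 = least, 5 = most sensitive).
    SENSITIVITY_K = {1: 3.0, 2: 2.25, 3: 1.5, 4: 1.0, 5: 0.5}

    def fences(q1, q3, sensitivity):
        """Wider fences at low sensitivity, narrower fences at high sensitivity."""
        k = SENSITIVITY_K[sensitivity]
        iqr = q3 - q1
        return q1 - k * iqr, q3 + k * iqr

    print(fences(2.0, 3.0, sensitivity=1))  # (-1.0, 6.0): wide region, fewer alerts
    print(fences(2.0, 3.0, sensitivity=5))  # (1.5, 3.5): narrow region, more alerts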

This gives you more control over when alerts happen, especially since the models are automatically generated from collected data.  If you find that the default setting is too sensitive or not sensitive enough, the combination of persist-for and sensitivity should allow you to better orchestrate when alerts trigger.

We hope the combination of up-to-date data, automatic updates, and modeling will make anomaly detection both accurate and easily accessible.