September 10th, 2024
In Part 1, we explored how to handle significant noise in large time-series datasets in order to extract meaningful signals. In particular, we explored how we could utilize MATCH_RECOGNIZE with Apache Flink to identify repeated patterns in time-series data and extract the relevant snapshots of data. Whilst this goes a long way to capturing the relevant areas of data, it is not a sufficient tool in its own right. We still have to manage outliers that conform to our patterns, and then engage with statistical processes in order to estimate informative signals that are present in the underlying distribution of data.
The metrics that we develop are based on live data coming from different factory systems. As such, the quality and reliability of our metrics are heavily influenced by the data itself. Although we demonstrated in Part 1 how to use MATCH_RECOGNIZE to encourage good data collection, some data points can still be erroneous, for example when the checkweigher tares incorrectly. Luckily, we can use statistics to detect these errors and reject grossly erroneous measurements. Here, we highlight two methods for statistically identifying outliers.
Outlier rejection using the Z-score method is a popular technique to identify and eliminate outliers in a dataset by measuring how far each data point deviates from the mean in terms of standard deviations. Assuming data that is approximately Normally distributed, the higher the Z-score of a point of data, the more statistically unlikely it would be to sample that value. As data is collected, a mean and standard deviation can be calculated and used to determine Z-scores.
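The Z-score itself is calculated by subtracting the mean of the dataset from the individual data point, and then dividing the result by the standard deviation:
$$z = \frac{x - \bar{x}}{s}$$
where $x$ is the individual data point, $\bar{x}$ is the mean of the dataset, and $s$ is its standard deviation.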
If the magnitude of the Z-score exceeds a predetermined threshold (commonly 3, i.e. a Z-score above +3 or below -3), the point is considered an outlier and is rejected. This method assumes that the data follows a normal distribution, where most values lie close to the mean, and only a few deviate significantly.
The Z-score method is important because it offers a standardized way to detect and remove outliers, which can otherwise skew the results of statistical analyses. This leads to more reliable and accurate conclusions. However, it is essential to keep track of the assumptions behind Z-score outlier rejection before implementing this method. An approximately Normal probability distribution is assumed for the underlying samples. A significantly non-Normal underlying distribution could lead to the rejection of valid points and to biased estimates of the population mean. For example, consider non-symmetric data; the Z-score method will bias our estimates in the opposite direction to the skew of the data.
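As a minimal sketch of the Z-score rule described above, the following Python snippet splits a batch of measurements into kept and rejected points. The function name `reject_outliers_zscore`, the default threshold of 3, and the sample weights are illustrative assumptions, not part of the production pipeline.

```python
from statistics import mean, stdev


def reject_outliers_zscore(values, threshold=3.0):
    """Split values into (kept, rejected) using a Z-score cut.

    The mean and sample standard deviation are computed from the
    batch itself, so the rule assumes roughly Normal data.
    """
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        return list(values), []  # all points identical, nothing to reject
    kept, rejected = [], []
    for x in values:
        z = (x - centre) / spread
        if abs(z) <= threshold:
            kept.append(x)
        else:
            rejected.append(x)
    return kept, rejected


# Hypothetical bag weights in grams, with one gross error (e.g. a bad tare).
weights = [500.0 + 0.1 * (i % 11 - 5) for i in range(50)] + [650.0]
kept, rejected = reject_outliers_zscore(weights)
print(len(kept), rejected)
```

Note that a gross outlier inflates the standard deviation it is judged against, so in very small batches this rule can fail to flag anything.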
Outlier rejection using the Interquartile Range (IQR) is a method that is more robust to skewed distributions. This method relies on the spread of the middle 50% of data, and thus can adjust to non-symmetric distributions of data. To calculate the IQR, first identify the 25th percentile of data ($Q_1$), and the 75th percentile ($Q_3$). Then the IQR is just
$$\mathrm{IQR} = Q_3 - Q_1$$
This range captures the spread of the central portion of the data. Outliers are typically defined as data points that fall below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$. These points are considered anomalous because they are either too far below or above the majority of the data.
The IQR method is less sensitive to skewed distributions than the Z-score, making it more reliable in datasets that do not follow a normal distribution. However, this method comes with its own disadvantages and challenges. In contexts where outlier rejection is followed by mean estimates with a confidence interval, none of the calculations used in outlier rejection can be reused. This leads to additional calculations that might not be feasible in a high-frequency setting.
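The IQR rule can be sketched in a few lines of Python using the standard library's `statistics.quantiles`. The helper name `reject_outliers_iqr`, the choice of the "inclusive" quantile method, and the example weights are assumptions made for illustration; other percentile conventions give slightly different fences.

```python
from statistics import quantiles


def reject_outliers_iqr(values, k=1.5):
    """Split values into (kept, rejected) using fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [x for x in values if lower <= x <= upper]
    rejected = [x for x in values if x < lower or x > upper]
    return kept, rejected


# Hypothetical, right-skewed bag weights in grams with one gross error.
weights = [499.0, 499.5, 500.0, 500.0, 500.5, 501.0, 501.5, 502.0, 503.0, 540.0]
kept, rejected = reject_outliers_iqr(weights)
print(kept, rejected)
```

Because the percentiles require the data to be sorted, this sketch works on a batch rather than one point at a time, which is the reuse limitation mentioned above.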
The importance of outlier rejection lies in its ability to prevent distorted interpretations and misleading conclusions. Outliers can heavily influence statistical measures such as mean, standard deviation, and correlation, leading to biased results. By excluding these anomalies, the analysis becomes more robust, improving the accuracy of predictive models and the overall quality of insights derived from the data. Different methods can be employed depending on the context, and assumptions must be carefully considered.
In the yield management system, we want to provide an estimate of the mean weight of a packaged product. We also want to quantify the reliability of our mean estimate, and therefore, we also construct confidence intervals in conjunction with our mean estimates. At Ferry, we emphasize the importance of updating metrics as data is measured. Updating confidence intervals in real-time presents its own challenges, and we cover these here.
Let us start with constructing confidence intervals for estimates of the mean of a population, given a sample of points from the population. First, we calculate the mean, $\bar{x}$, and the standard deviation, $s$, of our sampled data. The central limit theorem tells us that as the sample size approaches infinity, the distribution of $\bar{x}$ approaches a normal distribution with mean $\mu$ (the population mean) and standard deviation $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the number of points in the sample. From here, the confidence interval is constructed as
$$\bar{x} \pm z^{*} \frac{s}{\sqrt{n}}$$
where $z^{*}$ is known as the critical value, and is chosen based on the level of confidence desired. In this article, we work with 95% confidence intervals, which have $z^{*} = 1.96$.
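For a fixed batch of measurements, the interval described above can be computed directly. This is a minimal sketch; the function name `mean_confidence_interval` and the sample values are hypothetical, and the Normal approximation is only reasonable for sufficiently large samples.

```python
from math import sqrt
from statistics import mean, stdev

Z_95 = 1.96  # critical value for a 95% confidence interval


def mean_confidence_interval(sample, z=Z_95):
    """Return (lower, upper) bounds for the population mean."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    half_width = z * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical bag weights in grams.
lower, upper = mean_confidence_interval([498.0, 500.0, 502.0, 500.0])
print(lower, upper)
```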
To update our mean estimate as data is received, we keep track of our total number of samples, as well as the current estimate. Then the mean can be updated incrementally as follows:
$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$
where $x_n$ is the measured value at increment $n$, and $\bar{x}_n$ is the estimated average after $n$ measurements.
Updating the standard deviation incrementally is more involved. A proven and well-studied method for doing so is Welford’s algorithm. For numerically stable results, it is best to use Welford’s algorithm to update the sum of the squared differences from the mean, $M_{2,n}$, and then calculate the standard deviation:
$$M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n), \qquad s_n = \sqrt{\frac{M_{2,n}}{n-1}}$$
Notice that $M_{2,n}$ is divided by $n-1$ rather than $n$, which gives an unbiased estimate of the variance.
Figure 2: Welford’s algorithm applied to data with population variance 100. Note that although the data is sampled from a population with variance 100, the variance of the sample itself need not be 100.
To construct the confidence interval for the population mean, we can just substitute our estimated values at iteration $n$:
$$\bar{x}_n \pm z^{*} \frac{s_n}{\sqrt{n}}$$
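The online updates can be sketched as a small accumulator that keeps only the count, the running mean, and the running sum of squared differences. The class name `RunningStats` and its method names are hypothetical; the update rule is the standard form of Welford's algorithm.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class RunningStats:
    """Online mean, standard deviation and confidence interval (Welford)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared differences from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean            # difference from the previous mean
        self.mean += delta / self.n      # incremental mean update
        self.m2 += delta * (x - self.mean)  # uses both old and new mean

    def std(self) -> float:
        if self.n < 2:
            raise ValueError("need at least two measurements")
        return sqrt(self.m2 / (self.n - 1))

    def confidence_interval(self, z: float = 1.96):
        half_width = z * self.std() / sqrt(self.n)
        return self.mean - half_width, self.mean + half_width


stats = RunningStats()
for weight in [498.0, 500.0, 502.0, 500.0]:  # hypothetical weights in grams
    stats.update(weight)
print(stats.mean, stats.std(), stats.confidence_interval())
```

Each update touches three numbers only, so the state can be carried per product or per line without storing the raw measurements.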
The careful observer will notice that the weight measurements from Part 1 are not all measurements of the same random variable. Those measurements focus on the stability of the checkweigher, and not on the number of bags on the conveyor belt. Therefore, some measurements could be the average of 2 bags, others the average of 3 bags, etc. Given the inconsistent variables being measured by the checkweigher, how can we calculate the confidence interval for the weight of individual bags?
Without being overly pedantic, let's begin by quantitatively modelling the situation. We assume that the weight of each bag is independently sampled from some probability distribution of weights. Without making any additional assumptions about this underlying distribution, we assume that it has a finite mean ($\mu$) and variance ($\sigma^2$), and denote it as $W$. Let $S_k$ be the random variable corresponding to the total weight of $k$ bags sampled from $W$. Since the bags are assumed to be sampled independently, the covariance between bag weights is 0. Using random variable algebra, we see that:
$$E[S_k] = k\mu, \qquad \mathrm{Var}(S_k) = k\sigma^2$$
Similarly, let $A_k$ denote the random variable corresponding to the average weight of $k$ bags, i.e., $A_k = S_k / k$. Multiplying a random variable by a constant multiplies its expected value by the constant and its variance by the square of the constant. It follows that
$$E[A_k] = \mu, \qquad \mathrm{Var}(A_k) = \frac{\sigma^2}{k}$$
Now we can define the metrics we are looking for. Given several instances of $A_k$ for different values of $k$, we want to construct a confidence interval for the expected value of $A_1$, the weight of a single bag. Clearly the sample mean will correspond to our estimate for $\mu$; however, defining the standard deviation poses a more involved calculation. We would like to be able to do this in an online manner, so let us keep track of the usual parameters for Welford's online algorithm for the sum of squared differences. However, instead of the total number of samples, we keep track of the total number of instances of $A_k$ for each value of $k$, whose sum of course represents the total number of samples. Now we apply Welford's algorithm with the provided data points, from which we can calculate the sample variance and sample standard deviation. Let us denote this mixed sample variance as $s_M^2$.
Furthermore, we define $p_k$ as the probability that a measurement will be an instance of $A_k$. This can be estimated from the number of instances of $A_k$ divided by the total number of measurements.
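The problem faced is that we can calculate $s_M^2$ directly from the data points collected, but we are interested in finding the variance of individual bag weights, $\sigma^2$. Fortunately, we have everything needed to do so.
First, treating $s_M^2$ as our estimate of the variance of the mixed measurements, we write it as a weighted sum of variances (see Appendix for proof):
$$s_M^2 = \sum_k p_k \sigma_k^2$$
where $\sigma_k^2$ denotes the variance of $A_k$.
Now recall that $\sigma_k^2 = \sigma^2 / k$. Substituting, we find
$$s_M^2 = \sigma^2 \sum_k \frac{p_k}{k}$$
Solving for $\sigma^2$, our solution is
$$\sigma^2 = \frac{s_M^2}{\sum_k p_k / k}$$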
The expression derived here offers a powerful tool in estimating bag weights. Despite potentially not ever measuring any individual bag’s weight, we can still estimate the variance of single bag weights. Furthermore, our calculation can be incorporated into an online setting by modifying Welford’s algorithm.
As noted, our mean estimate is still the mean of the collected samples, but now we can modify Welford’s algorithm to accommodate mixed bag measurements. Let $n_k$ be the ticker for the number of instances of $A_k$. We can write the updates to the sample sum of squared differences exactly as above:
$$M_{2,t} = M_{2,t-1} + (x_t - \bar{x}_{t-1})(x_t - \bar{x}_t)$$
where $x_t$ is the $t$-th measurement and $\bar{x}_t$ is the running mean after $t$ measurements.
In addition to adjustments to the mean and the sum of squared differences, we calculate estimates for the mixture probabilities. Let $t$ be the total number of measurements taken; then the probability of measuring $A_k$ after $t$ total measurements is calculated via:
$$p_k^{(t)} = \frac{n_k^{(t)}}{t}$$
Finally, our sample standard deviation for single bag weights after $t$ points is now
$$s_t = \sqrt{\frac{M_{2,t}}{(t-1) \sum_k p_k^{(t)} / k}}$$
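A minimal sketch of this modified accumulator follows. It assumes each measurement arrives together with the number of bags `k` it averages over, and it divides the mixed sample variance by the sum of (estimated probability of `k`) / `k` to recover the single-bag variance. The class name `MixedBagStats` and the example values are hypothetical.

```python
from collections import Counter
from math import sqrt


class MixedBagStats:
    """Welford accumulator over measurements that average k bags each."""

    def __init__(self):
        self.t = 0               # total number of measurements
        self.mean = 0.0          # running mean of all measurements
        self.m2 = 0.0            # running sum of squared differences
        self.counts = Counter()  # number of measurements seen for each k

    def update(self, avg_weight: float, k: int) -> None:
        self.t += 1
        self.counts[k] += 1
        delta = avg_weight - self.mean
        self.mean += delta / self.t
        self.m2 += delta * (avg_weight - self.mean)

    def mixing_factor(self) -> float:
        # sum over k of (estimated probability of k) / k
        return sum((n_k / self.t) / k for k, n_k in self.counts.items())

    def single_bag_std(self) -> float:
        if self.t < 2:
            raise ValueError("need at least two measurements")
        mixed_variance = self.m2 / (self.t - 1)
        return sqrt(mixed_variance / self.mixing_factor())


stats = MixedBagStats()
# Hypothetical (average weight in grams, number of bags) pairs.
for avg_weight, k in [(498.0, 1), (502.0, 1), (499.0, 2), (501.0, 2)]:
    stats.update(avg_weight, k)
print(stats.mean, stats.single_bag_std())
```

The only extra state compared with plain Welford is one counter per distinct bag count, so the estimator stays cheap enough for a streaming job.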
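Suppose that most of the collected measurements correspond to the average weight of $m$ bags, with only extraneous measurements being the average of some other number of bags. This implies that $p_m \approx 1$. Let us separate the summation:
$$\sum_k \frac{p_k}{k} = \frac{p_m}{m} + \sum_{k \neq m} \frac{p_k}{k} = \frac{1}{m}\left(1 + \epsilon\right)$$
where $\epsilon = \sum_{k \neq m} p_k \left(\frac{m}{k} - 1\right)$, using $p_m = 1 - \sum_{k \neq m} p_k$. In this limit, $\epsilon$ is expected to be small, so let us Taylor expand the expression for $\sigma^2$ to first order around $\epsilon = 0$:
$$\sigma^2 = \frac{m\, s_M^2}{1 + \epsilon} \approx m\, s_M^2 \left[1 - \sum_{k \neq m} p_k \left(\frac{m}{k} - 1\right)\right]$$
For sufficiently small values of $p_k$ with $k \neq m$, the summation can be safely ignored, and
$$\sigma^2 \approx m\, s_M^2$$
Therefore, if a majority of points are the average weight of $m$ bags, we can estimate the variance of single bags by simply calculating the variance of the measurements using Welford’s algorithm and multiplying the result by $m$. In this limit, our expression for the sample standard deviation is
$$s_t \approx \sqrt{\frac{m\, M_{2,t}}{t - 1}}$$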
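To get a feel for how good this approximation is, the exact variance multiplier, 1 divided by the sum of (probability of `k`) / `k`, can be compared against simply multiplying by the dominant bag count. The mixture probabilities below are made up for illustration, and the helper names are hypothetical.

```python
def exact_multiplier(probabilities):
    """Exact factor that converts the mixed variance into single-bag variance."""
    return 1.0 / sum(p_k / k for k, p_k in probabilities.items())


def dominant_bag_count(probabilities):
    """Bag count k with the largest estimated probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical mixture: 96% of measurements average 2 bags.
probabilities = {2: 0.96, 1: 0.02, 3: 0.02}

exact = exact_multiplier(probabilities)
approx = float(dominant_bag_count(probabilities))
relative_error = abs(approx - exact) / exact
print(exact, approx, relative_error)
```

In this example the shortcut overstates the variance by a little over one percent, which translates into a slightly wider confidence interval.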
By being able to calculate the mean and standard deviation in real time, we are able to provide live updates on yield, within a degree of confidence that updates fluidly as more data is gathered. From this, deviations from trend can be identified from breaches of the confidence interval, as well as from mean drift towards unfavourable outcomes. In this example, we were focused on yield management, but the same statistical process control framework can be applied to a wide variety of use cases within manufacturing, from temperature control to utility monitoring and more.
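Let us denote the random variable for an aggregated measurement as $M$, which is an instance of $A_k$ with probability $p_k$. By definition, its variance is
$$\mathrm{Var}(M) = E[M^2] - E[M]^2$$
By the linearity of expectation,
$$E[M] = \sum_k p_k E[A_k]$$
and
$$E[M^2] = \sum_k p_k E[A_k^2]$$
Recall that all $A_k$ have mean $\mu$, and the sum over all $p_k$ must equal one. Therefore,
$$E[M] = \mu \sum_k p_k = \mu$$
Substituting into our calculation of variance, we have
$$\mathrm{Var}(M) = \sum_k p_k E[A_k^2] - \mu^2$$
Regrouping gives us
$$\mathrm{Var}(M) = \sum_k p_k \left(E[A_k^2] - \mu^2\right)$$
Finally, we use the definition of variance to simplify this expression to
$$\mathrm{Var}(M) = \sum_k p_k \mathrm{Var}(A_k) = \sum_k p_k \sigma_k^2$$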
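The identity can be checked exactly on a small example using rational arithmetic. The two component distributions and the mixing weights below are invented purely for the check; the only property that matters is that both components share the same mean.

```python
from fractions import Fraction as F


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mu = sum(value * prob for value, prob in dist.items())
    var = sum(prob * (value - mu) ** 2 for value, prob in dist.items())
    return mu, var


# Two hypothetical components with the same mean (500) and different spreads.
component_a = {F(498): F(1, 2), F(502): F(1, 2)}  # variance 4
component_b = {F(499): F(1, 2), F(501): F(1, 2)}  # variance 1
weight_a, weight_b = F(3, 4), F(1, 4)             # mixing probabilities

# Build the mixture distribution explicitly.
mixture = {}
for component, weight in ((component_a, weight_a), (component_b, weight_b)):
    for value, prob in component.items():
        mixture[value] = mixture.get(value, 0) + weight * prob

_, mixture_variance = mean_and_variance(mixture)
weighted_variance = (
    weight_a * mean_and_variance(component_a)[1]
    + weight_b * mean_and_variance(component_b)[1]
)
print(mixture_variance, weighted_variance)
```

If the components had different means, an extra between-component term would appear and the two quantities would no longer agree.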