logoRequest a demo
logo

Real-Time Confidence Intervals & Mean Estimation

September 10th, 2024

Author
Max Lahlou

Introduction

In Part 1, we explored how to handle significant noise in large time-series datasets in order to extract meaningful signals. In particular, we explored how we could utilize MATCH_RECOGNIZE with Apache Flink to identify repeated patterns in time-series data and extract the relevant snapshots of data. Whilst this goes a long way to capturing the relevant areas of data, it is not a sufficient tool in its own right. We still have to manage outliers that conform to our patterns, and then engage with statistical processes in order to estimate informative signals that are present in the underlying distribution of data.

Outlier Rejection

The metrics that we develop are based on live data coming from different factory systems. As such, the quality and reliability of our metrics are heavily influenced by the data itself. Although in Part 1, we demonstrated how to use MATCH_RECOGNIZE to encourage good data collection, some data points can still be erroneous, such as if the checkweigher tares incorrectly. Luckily, we can use statistics to detect these errors, and reject grossly erroneous measurements. Here, we highlight two methods for statistically identifying outliers.

Z-Score

Outlier rejection using the Z-score method is a popular technique to identify and eliminate outliers in a dataset by measuring how far each data point deviates from the mean in terms of standard deviations. Assuming data that is approximately Normally distributed, the higher the Z-score of a point of data, the more statistically unlikely it would be to sample that value. As data is collected, a mean and standard deviation can be calculated and used to determine Z-scores.

The Z-score itself is calculated by subtracting the mean of the dataset from the individual data point, and then dividing the result by the standard deviation:

Z=xXˉσ Z = \frac{x-\bar{X}}{\sigma}

If the Z-score is higher or lower than a predetermined threshold (commonly set at ±3), it is considered an outlier and is subsequently rejected. This method assumes that the data follows a normal distribution, where most values lie close to the mean, and only a few deviate significantly.

The Z-score method is important because it offers a standardized way to detect and remove outliers, which can otherwise skew the results of statistical analyses. This leads to more reliable and accurate conclusions. However, it is essential to keep track of the assumptions used in Z-score outlier rejection before implementing this method. An approximately Normal probability distribution is assumed for the underlying samples. A significantly non-Normal underlying probability distribution could lead to rejection of valid point, and biased population mean estimates. For example, consider non-symmetric data; the Z-score method will bias out estimates in the opposite direction to the skew of the data.

Median and IQR

Outlier rejection using the Interquartile Range (IQR\text{IQR}) is a method that is more robust skewed distributions. This method relies on the spread of the middle 50% of data, and thus can adjust to non-symmetric distributions of data. To calculate the IQR\text{IQR}, first identify the 25th percentile of data (Q1Q_1), and the 75th percentile (Q3Q_3). Then the IQR\text{IQR} is just

IQR=Q1Q3 \text{IQR} = Q_1-Q_3

This range captures the spread of the central portion of the data. Outliers are typically defined as data points that fall below Q11.5IQRQ_1 - 1.5 * \text{IQR} or above Q1+1.5 IQRQ_1 + 1.5 \text{ IQR}. These points are considered anomalous because they are either too far below or above the majority of the data.

The IQR\text{IQR} method is less sensitive to skewed distributions compared to the Z-score, making it more reliable in datasets that do not follow a normal distribution. However, this method comes with its own disadvantages and challenges. In contexts where outlier rejection is followed by mean estimates with a confidence interval, none of the calculations used in IQR\text{IQR} outlier rejection can be reused. This leads to additional calculations that might not be possible in a high-frequency setting.

Summary

The importance of outlier rejection lies in its ability to prevent distorted interpretations and misleading conclusions. Outliers can heavily influence statistical measures such as mean, standard deviation, and correlation, leading to biased results. By excluding these anomalies, the analysis becomes more robust, improving the accuracy of predictive models and the overall quality of insights derived from the data. Different methods can be employed depending on the context, and assumptions must be carefully considered.

Live Updating Confidence Interval and Mean Estimates

In the yield management system, we want to provide an estimate of the mean weight of a packaged product. We also want to quantify the reliability of our mean estimate, and therefore, we also construct confidence intervals in conjunction with our mean estimates. At Ferry, we emphasize the importance of updating metrics as data is measured. Updating confidence intervals in real-time presents its own challenges, and we cover these here.

Offline Construction

Let us start with constructing confidence intervals for estimates of the mean of a population, given a sample of points from the population. First, we calculate the mean, Xˉ\bar{X}, and the standard deviation, σ\sigma, of our sampled data. The central limit theorem tells us that as the sample size approaches infinity, the distribution of Xˉ\bar{X} approaches that of a normal distribution with mean μ\mu (population density) and standard deviation σn\frac{\sigma}{\sqrt{n}}, where n is the number of points in the sample. From here, the confidence interval is constructed as

Xˉ±Zσn \bar{X}\pm Z^*\frac{\sigma}{\sqrt{n}}

where ZZ^* is known as the critical value, and is chosen based on the level of confidence desired. In this article, we work with 95% confidence intervals, which have Z=1.96Z^* = 1.96.

Live Construction and Welford’s Algorithm

To update our mean estimate as data is received, we keep track of our total number of samples, as well as the current estimate. Then the mean can be updated incrementally as follows:

Xˉt=Xˉt1+xtXˉt1t, \bar{X}{t} = \bar{X}{t-1} +\frac{x_{t}-\bar{X}_{t-1}}{t},

where xtx_t is the measured value at increment tt , and Xtˉ\bar{X_t} is the estimated average after tt measurements.

Updating the standard deviation incrementally is more involved. A proven and well-studied method for doing so is Welford’s algorithm. For numerically stable results, it is best to use Welford’s algorithm to update the sum of the squared differences from the mean, M2,t=i=1t(xiXˉt)2M_{2,t} = \sum_{i=1}^t (x_i-\bar{X}_{t})^2, and then calculate the standard deviation:

M2,t=M2,t1+(xtXˉt1)(xtXˉt1)σt=M2,tt1 M_{2, t} = M_{2, t-1} + (x_t - \bar{X}{t-1})(x_t - \bar{X}{t-1}) \\ \sigma_t = \sqrt{\frac{M_{2,t}}{t-1}}

Notice that M2,tM_{2,t} is divided by t1t-1 for an unbiased calculation of the standard deviation.

Figure 2: Welford’s algorithm applied to data with population variance 100. Note that although the data is sampled from a population with variance 100, the variance of the sample itself need-not be 100.

To construct the confidence interval for population mean, we can just substitute our estimated values at iteration tt:

Xˉt±Zσtt \bar{X}_t\pm Z^*\frac{\sigma_t}{\sqrt{t}}

Calculating Standard Deviation with Mixed Sample Numbers

The careful observer will notice that the weight measurements from Part 1 are not all measurements of the same random variable. Those measurements focus on the stability of the checkweigher, and not on the number of bags on the conveyor belt. Therefore, some measurements could be the average of 2 bags, others the average of 3 bags, etc. Given the inconsistent variables being measured by the checkweigher, how can we calculate the confidence interval for weights of singular bags?

Model Setup

Without being overly pedantic, let's begin by quantitatively modelling the situation. We assume that the weight of each bag is independently sampled from some probability distribution of weights. Without making any additional assumptions about this underlying distribution, we assume that it has a numerical mean (μ\mu) and variance (σ2\sigma^2), and denote it as D(μ,σ2)D(\mu, \sigma^2). Let YlY_l be the random variable corresponding to the total weight of ll bags sampled from DD. Since the bags are assumed to be sampled independently, the covariance between bag weights is 0. Using random variable algebra, we see that:

YlD(lμ,lσ2)  Y_l \sim D(l\mu, l\sigma^2) 

Similarly, let XlX_l denote the random variable corresponding to the average weight of ll bags, i.e., Yll\frac{Y_l}{l}. Multiplying a random variable by a constant multiplies its expected value by the constant and its variance by the square of the constant. It follows that

XlD(μ,σ2l) X_l \sim D\left(\mu, \frac{\sigma^2}{l}\right)

Now we can define the metrics we are looking for. Given several instances of XlX_l for different values of ll, we want to construct a confidence interval for the expected value of X1X_1. Clearly the sample mean will correspond to our estimate for μ\mu, however, defining the standard deviation poses a more involved calculation. We would like to be able to do this in an online manner, so let us keep track of the usual parameters for Welford's online algorithm for sum of squared differences. However, instead of the total number of samples, we keep track of the total number of instances of XlX_l for each value of ll, whose sum of course represents the total number of samples. Now we apply Welford's algorithm with the provided data points, from which we can calculate the sample variance and sample standard deviation. Let us denote this mixed sample variance as σˉ2\bar{\sigma}^2.

Furthermore, we define plp_l as the probability that a measurement will be an instance of XlX_l. This can be estimated from the number of instances of XlX_l divided by the total number of measurements.

Metric Evaluation

The problem faced is that we can calculate σˉ2\bar{\sigma}^2 directly from the data points collected, but we are interested in finding the variance of individual bag weights, σ2\sigma^2. Fortunately, we have everything needed to do so.

First, we write σˉ2\bar{\sigma}^2 as a weighted sum of variances (see Appendix for proof):

σˉ2=i=1piσi2, \bar{\sigma}^2 = \sum_{i=1}^\infty p_i\sigma_i^2,

where σl2\sigma_l^2 denotes the variance of XlX_l.

Now recall that σl2=σ2/l\sigma_l^2 = \sigma^2/l. Substituting, we find

σˉ2=i=1piiσ2 \bar{\sigma}^2 = \sum_{i=1}^\infty \frac{p_i}{i}\sigma^2

Solving for σ2\sigma^2, our solution is

σ2=σˉ2i=1pii \sigma^2 = \frac{\bar{\sigma}^2}{\sum_{i=1}^\infty \frac{p_i}{i}}

The expression derived here offers a powerful tool in estimating bag weights. Despite potentially not ever measuring any individual bag’s weight, we can still estimate the variance of single bag weights. Furthermore, our calculation can be incorporated into an online setting by modifying Welford’s algorithm.

Tying Back to Welford’s Algorithm

As noted, our mean estimate is still the mean of the collected samples, but now we can modify Welford’s algorithm to accommodate for mixed bag measurements. Let tlt_l be the ticker for the number of instances of XlX_l. We can write the updates to the sample sum of squared differences exactly as above:

M2,t=M2,t1+(xtXˉt1)(xtXˉt1) M_{2, t} = M_{2, t-1} + (x_t - \bar{X}{t-1})(x_t - \bar{X}{t-1})

In addition to adjustments to the mean and the squared sum of differences, we calculate estimates for the mixture probabilities. Let tt be the total of data number of measurements taken, then the probability of measuring XlX_l after tt total measurements (pl,t)(p_{l,t}) is calculated via:

pl,t=tlt p_{l,t} = \frac{t_l}{t}

Finally, our sample standard deviation after t points is now

σt=M2,t(t1)i=1pii \sigma_t = \sqrt{\frac{M_{2,t}}{(t-1)\sum_{i=1}^\infty \frac{p_i}{i}}}

Special Case: Dominant $X_l$

Support that most of the collected measurements correspond to the average weight of ll bags, with only extraneous measurements being the average of some other number of bags. This implies that pl1p_l \approx 1. Let us separate the summation:

i=1pii1l+ϵ, \sum_{i=1}^\infty \frac{p_i}{i} \approx \frac{1}{l} + \epsilon,

where ϵ=i=1pii1l\epsilon = \sum_{i=1}^\infty \frac{p_i}{i} - \frac{1}{l}. In this limit, ϵ\epsilon is expected to be small, so let us Taylor expand the expression for σ2\sigma^2 to first order around ϵ=0\epsilon = 0:

σ2l(1lϵ)σˉ2 \sigma^2 \approx l(1-l\epsilon)\bar{\sigma}^2

For sufficiently small values of ϵ\epsilon, the summation can be safely ignored, and

σ2lσˉ2 \sigma^2 \approx l\bar{\sigma}^2

Therefore, if a majority of points are the average weight of ll bags, we can calculate the simply calculate the variance of single bags using Welford’s algorithm, and multiply our result by ll. In this limit, our expression for the sample standard deviation is

σt=lM2,tt1 \sigma_t = \sqrt{l*\frac{M_{2,t}}{t-1}}

Why this matters

By being able to calculate mean and standard deviation in real-time, we are able to provide live updates on yield, within a degree of confidence that fluidly updates as more data is gathered. From this, deviations from trend can be identified from breaches within the confidence interval, as well as mean drift towards unfavourable outcomes. In this example, we were focused on yield management, but the same statistical process control framework can be applied to a wide variety of use cases within manufacturing, from temperature control to utility monitoring and more.

Appendix

Let us denote the random variable for the mean of the aggregated measurements as X^\hat{X}. By definition, its variance is

σˉ2=E[X2^]E[X^]2 \bar{\sigma}^2 = \mathbb{E}[\hat{X^2}] - \mathbb{E}[\hat{X}]^2

By the linearity of expectation,

E[X^]2=(i=1piE[Xi])2 \mathbb{E}[\hat{X}]^2 = \left(\sum_{i=1}^\infty p_i\mathbb{E}[X_i]\right)^2

and

E[X2^]=i=1piE[X^i2] \mathbb{E}[\hat{X^2}] = \sum_{i=1}^\infty p_i\mathbb{E}[\hat{X}i^2]

Recall that all XiX_i have mean μ\mu, and the sum over all pip_i must equal one. Therefore,

(i=1piE[Xi])2=μ2=i=1piμ2=i=1piE[Xi]2 \left(\sum_{i=1}^\infty p_i\mathbb{E}[X_i]\right)^2=\mu^2=\sum_{i=1}^\infty p_i\mu^2 =\sum_{i=1}^\infty p_i\mathbb{E}[X_i]^2

Substituting into our calculation of variance, we have

σˉ2=i=1piE[X^i2]i=1piE[X^i]2 \bar{\sigma}^2 = \sum_{i=1}^\infty p_i\mathbb{E}[\hat{X}i^2]- \sum_{i=1}^\infty p_i\mathbb{E}[\hat{X}i]^2

regrouping gives us

σˉ2=i=1pi(E[X^i2]E[X^i]2) \bar{\sigma}^2 = \sum_{i=1}^\infty p_i\left(\mathbb{E}[\hat{X}_i^2] - \mathbb{E}[\hat{X}i]^2\right)

Finally, we use the definition of variance to simplify this expression to

σˉ2=i=1piσi2 \bar{\sigma}^2 = \sum_{i=1}^\infty p_i \sigma_i^2