Peter Chng

Error Bars

When reading articles that include graphs, those graphs may often include “Error Bars” that attempt to characterize either the error, uncertainty, or variability in the data that is being plotted. This is often represented as an interval surrounding a data point. Here, I’ve used “Error Bars” in a very general sense to cover a multitude of different techniques that attempt to convey different information, but may look visually similar.

It’s important to understand what each type of “Error Bar” means to better understand the data that’s being visualized, and to understand which one is being used in the article you are reading.

I’ve summarized some of the most common types of such intervals you’ll see, along with an explanation of each. These explanations are mostly taken from my own notes, and as such are not comprehensive nor authoritative. I’ve provided reference links for further reading.

Box Plot

A Box Plot is a visualization of the distribution of the values in a dataset. It provides less information than a histogram, but in exchange it is easier to plot multiple datasets side by side for comparison. This plot consists entirely of descriptive statistics, that is, a summary of the dataset. There are no parameters and no model; the box plot is purely a statement of the values in the dataset. The plot is not trying to estimate or infer any values.

Here is an example of a box plot, plotted using Matplotlib’s boxplot function with the following code, illustrating the datasets used:

import matplotlib.pyplot as plt
# Two datasets: one tightly clustered (91-100), one with low and high outliers
plt.boxplot([list(range(91, 101)), [80, 81, 92, 93, 94, 95, 96, 97, 98, 99, 100, 109, 110]])
plt.show()

Example box plots

  1. The outer edges of the box are the first ($Q_1$ or the 25th percentile) and third quartiles ($Q_3$, 75th percentile) values. The distance between the first and third quartiles is known as the Interquartile Range (IQR).
  2. The line through the box is the median (50th percentile) value.
  3. The “whiskers” (the lines extending to the smallest and largest non-outlier values) can be defined in several ways:
    • Min/Max: The most straightforward way to define the whiskers is as the minimum and maximum values of the dataset.
    • IQR method: Another common way is to have the whiskers defined in terms of $Q_1 - 1.5 \times IQR$ (lower whisker) and $Q_3 + 1.5 \times IQR$ (upper whisker).
      • If there is no value in the dataset at exactly $Q_1 - 1.5 \times IQR$, then the lowest value in the dataset greater than this is chosen as the lower whisker. Similarly, if there is no value at exactly $Q_3 + 1.5 \times IQR$, then the greatest value less than this is chosen as the upper whisker.
      • This is the approach that Matplotlib’s boxplot function uses by default; a sketch of this whisker calculation follows the list below.
      • Values beyond the whiskers are termed “outliers” or “fliers” and are plotted as individual points.
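To make the IQR method concrete, here is a minimal sketch (assuming NumPy) of how the whisker positions could be computed. The helper iqr_whiskers is hypothetical, and the quartiles use NumPy’s default linear interpolation, which may differ slightly from other quartile conventions:

import numpy as np

def iqr_whiskers(data):
    # Hypothetical helper: whisker positions via the 1.5 x IQR rule described above
    data = np.sort(np.asarray(data))
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = data[data >= q1 - 1.5 * iqr].min()  # lowest value within the lower fence
    upper = data[data <= q3 + 1.5 * iqr].max()  # highest value within the upper fence
    return lower, upper

print(iqr_whiskers([80, 81, 92, 93, 94, 95, 96, 97, 98, 99, 100, 109, 110]))  # whiskers at 92 and 100

For the second dataset above, this places the whiskers at 92 and 100, with 80, 81, 109, and 110 plotted as individual outliers.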

References

  1. NIST - Engineering Statistics Handbook 1.3.3.7 - Box Plot
  2. Matplotlib matplotlib.pyplot.boxplot function
  3. Khan Academy - Box plot review

Standard Error

The Standard Error quantifies the variability incurred when you sample from a population and calculate some statistic based on that sample. It is a measure of how much uncertainty there is in the value obtained from your sample vs. the “true” value you’d get if you could measure the entire population. In other words, the sampling itself is a random process that produces a distribution of values, and the standard deviation of this distribution is the standard error.

Usually this is done with the sample mean, in which case we are dealing with the standard error of the mean. It is defined this way:

  1. Suppose we sampled $n$ values from some population, and calculated the mean. This would be one sample mean.
  2. If we repeated this process, sampling another $n$ values, we’d get another (different) sample mean.
  3. If this sampling itself is done many times, we get many different sample means.
  4. These sample means will form their own dataset with a mean and variance.

Because the sampling process is random, it produces a sample mean that is a random variable. When we compute many such sample means, we can calculate the variance of this dataset. The variance of this dataset (or its square root, the standard deviation) is a measure of how much uncertainty there is in the sample mean.

Instead of repeating the sampling process many times, we can calculate the standard deviation of this sampling distribution directly, based on the sample size ($n$). This is known as the standard error of the mean ($\sigma_{\bar{x}}$) and is defined as follows:

$$ \sigma_{\bar{x}} = {\sigma\over{{\sqrt{n}}}} $$

  • $n$: The sample size, i.e. how many data points we have in our sample
  • $\sigma$: The population standard deviation

As $n$ increases, the standard error will approach zero, though this is a relatively slow process due to the square root in the denominator. (i.e. increasing the number of samples by 100x only reduces the standard error by 10x)

This definition can be obtained from first principles: The sampling process is defined as taking $n$ independent and identically distributed samples from the same distribution/population. Each of these samples can be represented as a random variable $X_i$. The sample mean is then:

$$ {1\over{n}}{\sum_{i=1}^n{X_i}} $$

The variance of the sample mean can then be written as: $$ \sigma_{\bar{x}}^2 = \mathrm{Var}({1\over{n}}{\sum_{i=1}^n{X_i}}) $$

We can simplify the expression by applying the properties of variance: because the $X_i$ are independent, the variance of the sum is the sum of the variances, and because they are identically distributed, each term in that sum is the same: $$ \sigma_{\bar{x}}^2 = {1\over{n^2}}\mathrm{Var}\left({\sum_{i=1}^n{X_i}}\right) = {1\over{n^2}}{\sum_{i=1}^n{\mathrm{Var}(X_i)}} = {n\over{n^2}}{\mathrm{Var}(X_i)} = {1\over{n}}{\mathrm{Var}(X_i)} $$

$\mathrm{Var}(X_i)$ is just the variance of the underlying distribution ($\sigma^2$), so we can rewrite this as:

$$ \sigma_{\bar{x}}^2 = {\sigma^2\over{n}} \longrightarrow \sigma_{\bar{x}} = {\sigma\over{{\sqrt{n}}}} $$

The square root of the variance is the standard deviation, and from this we arrive at the previous formula.

However, we typically do not know the true population standard deviation, so often we can only come up with an estimate of the standard error, $\hat{\sigma}_{\bar{x}}$:

$$ \hat{\sigma}_{{\bar{x}}} = {s\over{\sqrt{n}}} $$

  • $s$: The standard deviation of the sample

This is only an estimator of the standard error and not the true standard error itself, but in practice it is most often used since it is more readily available.

Once we have calculated the standard error of the mean (SEM) (or its estimate), the error bars are typically constructed as $\bar{x} \pm \mathrm{SEM}$, that is, one standard error above and below the sample mean.
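As a sanity check, here is a minimal simulation sketch (assuming NumPy; the normal distribution, $\sigma = 2$ and $n = 50$ are arbitrary choices for illustration) showing that the spread of many sample means matches $\sigma / \sqrt{n}$:

import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 50, 10_000

# Draw many samples of size n and compute each sample's mean
sample_means = rng.normal(loc=0.0, scale=sigma, size=(trials, n)).mean(axis=1)

print(sample_means.std(ddof=1))  # empirical standard deviation of the sample means
print(sigma / np.sqrt(n))        # analytic standard error: sigma / sqrt(n)

The two printed values should agree to about two decimal places.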

References

  1. SAS - What statistic should you use to display error bars for a mean?
  2. Standard deviations and standard errors
  3. Standard error for the mean of a sample of binomial random variables

Confidence Intervals

Confidence Intervals (CIs) are concerned with the uncertainty in trying to estimate the true parameter value of either a model or a population through the process of statistical inference. For example, when a study is done to measure the effectiveness of a drug treatment, due to randomness, the measured value of effectiveness might not be the true value.

Consider a simple coin flip example: Suppose we are given a coin, and we want to estimate the probability of it landing on heads. If we flip this coin 100 times, and it comes up heads 52 times, does that mean the coin has a probability of heads of $p = 0.52$?

It’s probably not exactly that value; instead this is only our estimate of the true value. Confidence intervals allow for a systematic way to quantify the uncertainty in our estimate due to sampling. (This is closely related to the standard error, as we’ll see shortly)

The most common confidence interval level (for better or worse) is the 95% CI. The meaning of this is not intuitive (since it is a Frequentist concept), but in simple terms it means:

A 95% CI means that if the experiment was run 100 times, we can expect that 95 of those experiments will have a 95% CI that includes the true parameter value.

In our coin flip example this would mean:

  1. Flip the coin 100 times, record down the number of heads. This is one experiment.
  2. Calculate the 95% CI based on (1) (using the Binomial Proportion Confidence Interval), and record down the bounds of this interval.
  3. Repeat the experiment 99 more times.
  4. You can expect that ~95 of the times you have run the experiment, the 95% CI would have included the true value. (But you still don’t know exactly what the true value is!)
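The procedure above can be simulated directly. Here is a minimal sketch (assuming NumPy, and using the simple normal-approximation (Wald) form of the binomial proportion confidence interval; the true $p$ and the seed are arbitrary choices) that counts how often the computed CI contains the true value:

import numpy as np

rng = np.random.default_rng(1)
p_true, n, experiments, z = 0.5, 100, 10_000, 1.96  # z-value for a 95% CI

covered = 0
for _ in range(experiments):
    heads = rng.binomial(n, p_true)              # one experiment: 100 coin flips
    p_hat = heads / n
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half_width <= p_true <= p_hat + half_width)

print(covered / experiments)  # close to 0.95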

If this seems confusing, it’s most likely because it is. But I haven’t found a better way to explain it. It is simply not an intuitive concept, but this excellent visualization may help.

Calculation of a confidence interval

The exact calculation of a confidence interval depends on the underlying distribution from which the samples were drawn, and the parameter we are trying to estimate.

For example, to calculate the confidence interval for the mean of samples taken from a normal distribution, the bounds of the confidence interval would be:

$$ \bar{x} \pm t{s\over{\sqrt{n}}} $$

  • $\bar{x}$: The sample mean.
  • $t$: The t-value from the $t$-distribution based on the sample size $n$ and the confidence level. This is typically looked up from a table of values or from a calculator. A $t$-value is used when the population standard deviation is unknown, and we only have an estimate of it based on the sample.
  • ${s\over{\sqrt{n}}}$: The estimated standard error, based on the sample standard deviation $s$ and number of samples $n$.

(This formula assumes the samples are independent and identically distributed, and that the population is approximately normal or $n$ is large enough for the Central Limit Theorem to apply.)

In this case the standard error (SE) of the mean is part of the calculation for the confidence interval. The confidence interval ranges from one SE times the $t$-value below the sample mean to one SE times the $t$-value above it.
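Here is a minimal sketch of this calculation (assuming NumPy and SciPy; the data values are hypothetical):

import numpy as np
from scipy import stats

data = np.array([9.1, 8.8, 9.4, 9.0, 8.7, 9.3, 9.2, 8.9])  # hypothetical sample

mean = data.mean()
sem = stats.sem(data)                     # estimated standard error: s / sqrt(n)
t = stats.t.ppf(0.975, df=len(data) - 1)  # two-sided 95% t-value

print(mean - t * sem, mean + t * sem)     # bounds of the 95% CI for the mean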

Interpretation of a confidence interval

Confidence intervals have an equivalence with the concept of statistical significance. For example, a 95% CI is associated with the 5% significance level. That is, suppose we are doing an A/B test to estimate the effect size (difference) between two treatments. We can then state these equivalences:

  • If the 95% CI contains 0, the result (effect size) is not statistically significant at the 5% level. This implies $p > 0.05$.
  • If the 95% CI does not contain 0, then the result (effect size) is statistically significant at the 5% level. This implies $p \leq 0.05$.

It’s important to understand what confidence intervals do not mean:

  • A 95% CI does not mean the true value has a 95% probability of being within the interval. In Frequentist inference, once you have done the experiment and computed the CI, the true value is either within the CI, or it’s not. It is improper to speak of the true value having some probability of being within a fixed interval. In Frequentist statistics, we can only speak of probabilities as long-run outcomes, as in our example above where we did the experiment 100 times.
  • A 95% CI does not mean that if we repeated the experiment, there would be a 95% chance of the result falling within this interval.

The only valid interpretation of a confidence interval is the one given above, that is, it is a statement about the long-run outcomes.

References

  1. Statistical Smorgasbord - Confidence Intervals for Means
  2. Statistics 101 - Confidence Intervals
  3. Statistics at square one - Statements of probability and confidence intervals

Credible Intervals

Like a confidence interval, a credible interval is also concerned with the estimation of an unknown model parameter, but it is arguably a bit easier to interpret (though not necessarily easier to calculate) than a confidence interval.

In this interpretation, the model parameter is a random variable (RV) with a certain probability distribution. (PDF for a continuous RV, PMF for a discrete RV)

From this interpretation, a 95% credible interval is any interval under which 95% of the area of the probability distribution lies. For example, if a model parameter were normally distributed with mean 0 and standard deviation $\sigma$, then a 95% credible interval would be approximately bounded by $-2\sigma$ and $2\sigma$, according to the 68-95-99.7 rule.

The probability distribution used is usually the posterior distribution, which is a Bayesian concept. Very roughly, how a posterior distribution is determined is:

  1. We have a prior distribution over some model parameter, which represents our current belief in what the model parameter’s value might be.
    • For example, if we have absolutely no knowledge about what the value might be, other than the value must be between $[a, b]$, we would represent this belief as a uniform distribution between $[a, b]$.
  2. After we observe some samples produced by the model (the evidence), we determine how likely it was to observe that evidence given our prior distribution, and use that in a systematic way to update our beliefs about the model parameter’s value.
  3. After updating our beliefs, we transform our prior into a posterior distribution.

As an example, suppose I am handed a coin, and want to know whether it is fair or not, that is, what the probability $p$ of landing on heads after a flip is. Without knowing anything about the coin, but also knowing that most coins tend to be roughly fair, we can assume that the most likely value for $p$ is probably $1\over{2}$, but other values of $p$ may also be possible.

Our process for systematically determining the posterior distribution on $p$ is as follows:

  1. We represent our prior belief as $p$ being beta-distributed with a mean centered on $1\over{2}$. This is our prior distribution.

Prior distribution

  2. After many coin flips, I notice that the coin is coming up heads much more often than tails. After observing such evidence, my belief has probably changed so that I believe values of $p$ closer to $1$ are much more likely than those closer to $0$.
  3. This would be represented by a posterior distribution shifted to the right compared to our prior.

Posterior distribution

The exact process of transforming a prior distribution into a posterior distribution can be quite complicated, and is beyond the scope of this article. (The process is greatly simplified when we can make assumptions allowing for the use of conjugate priors)
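For this coin-flip example, though, the conjugate update is simple enough to sketch. Assuming, for illustration only, a $Beta(2, 2)$ prior and an observation of 2 heads and 0 tails (neither is specified above), the posterior is obtained just by adding the observed counts to the prior's parameters:

from scipy import stats

# Conjugate update for a Bernoulli (coin-flip) likelihood with a Beta prior:
#   Beta(a, b) prior + (heads, tails) observed  ->  Beta(a + heads, b + tails) posterior
a, b = 2, 2          # hypothetical prior, centered on p = 0.5
heads, tails = 2, 0  # hypothetical observations: more heads than tails

posterior = stats.beta(a + heads, b + tails)  # Beta(4, 2), matching the figures above
print(posterior.mean())                       # belief has shifted above 0.5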

Once we have the posterior distribution, a 95% credible interval is any interval that covers 95% of the area under the posterior distribution. In our above example, we used a $Beta(4, 2)$ distribution as the posterior. One possible 95% credible interval is illustrated below:

A 95% credible interval for a Beta(4, 2) distribution

The area shaded under the curve represents the 95% credible interval. The interval is approximately $[0.28, 0.95]$. For our example, this would mean that the parameter $p$ has a 95% chance of being between those bounds.

Because a 95% credible interval is any interval that covers 95% of the area under the distribution, this particular credible interval is not the only solution. I have picked this interval because it goes from the 2.5% quantile to the 97.5% quantile, so that the two “tails” outside the interval have equal size.
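The bounds of this equal-tailed interval can be read directly off the posterior's quantile function; a minimal sketch (assuming SciPy):

from scipy import stats

posterior = stats.beta(4, 2)
# Equal-tailed 95% credible interval: the 2.5% and 97.5% quantiles of the posterior
lower, upper = posterior.ppf([0.025, 0.975])
print(lower, upper)  # roughly 0.28 and 0.95, as in the figure above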

Arguably, Bayesian Credible Intervals are easier to interpret than Frequentist Confidence Intervals. But they have the drawback of requiring you to have a prior distribution (and be able to justify your choice of prior), and computing the posterior distribution in order to determine the credible interval is often not as straightforward as computing a confidence interval.

References

  1. bayestestR - Credible Intervals (CI)
  2. Understanding and interpreting confidence and credible intervals around effect estimates

Prediction Intervals

A Prediction Interval is an estimate (or forecast) of what the value of some future observation might be. Unlike confidence intervals and credible intervals, which are statements about the estimation of a model parameter, a prediction interval aims to estimate the range in which the next observation of some random process will fall, given past observations. Here, we are not trying to estimate a model parameter but instead the next output of the random process that the model represents.

In other words, while confidence intervals and credible intervals are trying to estimate an unobservable model parameter (or immeasurable population parameter), prediction intervals are trying to estimate where the next observation will lie.

This is inherently tougher because you also have to account for individual variance between observations. For example, suppose we were trying to predict the next sample from a presumed normal distribution - even if we collected enough samples such that we could be very certain about the mean and variance of that normal distribution (the model’s parameters), predicting the next single output value will always incur more uncertainty due to the randomness inherent in sampling.

To better understand this, let’s look at the formulas for a confidence interval for the mean and prediction interval when sampling from a normal distribution. Suppose we have collected a sample of $n$ values from this distribution. We can then compute these intervals as:

Confidence Interval: (rewritten slightly for clarity) $$ \bar{x} \pm t{s\over{\sqrt{n}}} $$

$$ = \bar{x} \pm ts{\sqrt{1\over{n}}} $$

Prediction Interval: $$ \bar{x} \pm ts{\sqrt{1 + {1\over{n}}}} $$

For both:

  • $\bar{x}$: The sample mean
  • $t$: The t-value as previously described
  • $s$: The sample standard deviation
  • $n$: The number of samples

The prediction interval formula has an extra constant term inside the square root ($1 + {1\over{n}}$ instead of just ${1\over{n}}$), which increases its width. A further comparison of the formulas for the confidence interval and prediction interval yields the following implications:

  1. A prediction interval is always wider than a confidence interval for a given confidence level. That is, a 95% prediction interval will always be wider than the associated 95% confidence interval for the mean.
  2. A prediction interval will not converge to a single value with more samples. While a confidence interval will converge to a single value as the number of samples $n$ increases, the same is not true for a prediction interval.

The increased width of prediction intervals as compared to confidence intervals is due to extra uncertainty:

  • For a confidence interval, the uncertainty is only in the estimation of a model’s parameter, like the mean.
  • For a prediction interval, there is uncertainty in not only the model’s parameter, but also in the individual variation between sampled data points. This individual variation increases the uncertainty, and is also why the prediction interval will not converge to a single value.
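A minimal sketch (assuming NumPy and SciPy; the sample itself is simulated purely for illustration) that computes both half-widths for the same sample makes the difference concrete:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=10, scale=2, size=30)  # a hypothetical sample of n = 30

n, s = len(sample), sample.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)

ci_half = t * s * np.sqrt(1 / n)      # 95% confidence interval half-width
pi_half = t * s * np.sqrt(1 + 1 / n)  # 95% prediction interval half-width
print(ci_half, pi_half)               # the prediction interval is several times wider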

The interpretation of a prediction interval can be taken in a Frequentist sense:

  1. Collect $n$ samples of data.
  2. Calculate the 95% prediction interval from those samples.
  3. Sample the next value and see whether it lies within the prediction interval previously calculated.

If you repeat this procedure many times, you can expect that the value sampled in (3) will fall in the prediction interval 95% of the time. In practice, this might not be the case, due to other sources of uncertainty.
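This long-run interpretation can also be checked by simulation. A minimal sketch (assuming NumPy and SciPy, with a standard normal population and $n = 30$ chosen arbitrarily):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials, hits = 30, 10_000, 0

for _ in range(trials):
    sample = rng.normal(size=n)                                    # step 1: collect n samples
    mean, s = sample.mean(), sample.std(ddof=1)
    half = stats.t.ppf(0.975, df=n - 1) * s * np.sqrt(1 + 1 / n)   # step 2: 95% prediction interval
    next_value = rng.normal()                                      # step 3: the next observation
    hits += (mean - half <= next_value <= mean + half)

print(hits / trials)  # close to 0.95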

Since prediction intervals are usually used in forecasting based on previous data, you can also incorporate predictor values into the calculation, which is a form of regression.

A prediction interval provides the bounds for a single future sample. If one is looking to get bounds on multiple future samples (or essentially the entire population), a tolerance interval is probably more appropriate. With a tolerance interval, you specify both the confidence level (0-100%) and what percentage of the future samples you want to cover. Like a prediction interval, a tolerance interval covering $X$% of the population will never converge to a single point, but as the certainty in the model’s parameters increases (i.e. the number of observed samples increases), an $X$% tolerance interval will approach the actual interval that covers $X$% of the area under the actual probability distribution.

References

  1. Confidence Interval vs Prediction Interval
  2. What Is a Prediction Interval?
  3. STAT 501 - 3.3 - Prediction Interval for a New Response
  4. GraphPad - The distinction between confidence intervals, prediction intervals and tolerance intervals
  5. Statistics By Jim - Confidence Intervals vs Prediction Intervals vs Tolerance Intervals
  6. Cross Validated - Prediction and Tolerance Intervals

Conclusion

I’ve summarized some of the ways that error, uncertainty, or variability in plotted data might be visualized, and how to interpret these intervals. Whenever you are looking at data that includes intervals, it’s important to understand exactly what type of interval or error is being represented, and what the interpretation of such an interval is.