Peter Chng

Gaining a better understanding of statistical inference

Note: This is mostly an overview of what I’ve learned on my own about statistics (particularly frequentist inference), and an attempt to organize it into a guide that will be useful for future reference. As such, most of the knowledge is introductory and focuses more on concepts than the actual calculations.

I have to confess that I did not have a strong understanding of statistics coming out of university, nor for much of my career. I didn’t have much formal training; I only took one course on probability during my undergraduate studies and only vaguely remember questions about urn experiments.

So, when I encountered the world of A/B testing the terminology was rather daunting. What were p-values and confidence intervals and how did they relate to the A/B test? And just what the heck was the “null hypothesis”? I set out to learn the fundamentals and fill in this gap in my knowledge. I’ll present what I’ve learned in the hopes that it might be helpful for others in a similar situation.

Here, I’ll go over the basics of null hypothesis significance testing, an important tool in statistical inference, and frequently the method by which A/B tests are done. But first, an introduction to statistics is in order.

A two-minute overview of statistics

Statistics is a huge discipline that covers a wide range of topics, so it’s useful to categorize various statistical methods by the main goals they are trying to accomplish:

  1. Descriptive Statistics:
    This is concerned with summarizing a collection of data points (a dataset) by using a smaller set of values that describe some characteristic of that dataset, such as the mean/median/mode, the standard deviation, various percentiles, etc. This is usually done to make the information easier to understand. In purely descriptive statistics, no conclusions would be made; the work would be limited to just the calculation of those descriptive statistics. An example of this would be the five-number summary of a dataset.

  2. Exploratory Data Analysis:
    This can be seen as an extension of descriptive statistics, but here the focus is on looking for relationships in the data to aid in the formulation of hypotheses that can be tested in future experiments. For example, if one had a dataset consisting of observations of various patients' measurements (blood pressure, heart rate, etc.) over time, it may be interesting to explore relationships among those variables. Any relationships found would have to be confirmed by subsequent studies, however, since testing a hypothesis on the same dataset that suggested it would be circular reasoning.

  3. Inferential Statistics:
    This refers to the method of inferring the properties of a population by only observing a sample (subset) of that population. An example would be opinion polling: By getting the opinions of only a subset of the population (sampling), we attempt to infer something about the entire population. We can modify this method to conduct a hypothesis test: Assume a given population distribution, then take a sample and see if the sample is what you’d expect from that distribution. (I’ll go into this in more detail later on)

  4. Predictive Statistics:
    The aim here is to make predictions about future (unknown) events, based on some data you have and based on historical events. In most cases, you need to use inferential statistics in order to build some model that you will use to make these predictions.

In practice, when trying to accomplish a given task, you’ll rarely use methods from just one of these areas, but it’s helpful to separate them out so that the purpose of each set of statistical methods can be highlighted.

For the rest of this article, I’ll mostly be talking about Inferential Statistics.

Clarifying Inference

Suppose I give you a coin, and tell you that the coin is fair. How often do you expect to get exactly 5 heads when flipping the coin 10 times? This is a straightforward probability problem, which concerns calculating the probability of some future event. (The probability here can be calculated exactly using the binomial distribution)
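
If you want to check this yourself, here’s a minimal sketch (assuming scipy is available) that computes the probability directly from the binomial PMF:

```python
from scipy.stats import binom

# Probability of exactly 5 heads in 10 flips of a fair coin:
# binom.pmf(k, n, p) with k = 5 heads, n = 10 flips, p = 0.5.
p_exactly_5 = binom.pmf(5, 10, 0.5)
print(f"P(exactly 5 heads in 10 flips) = {p_exactly_5:.4f}")  # ~0.2461
```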

Now, suppose I give you a coin, but instead of telling you that it’s fair, I ask you to figure out how often the coin will come up heads. This is a statistical inference problem, and unlike our previous example, we can’t do a simple calculation to arrive at the solution. How would you go about this? One of the most obvious ways would be to flip the coin a bunch of times and determine what fraction of the time it came up heads.
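
As a rough sketch of that "flip it a bunch of times" approach (the bias and flip count below are arbitrary values chosen purely for illustration):

```python
import random

random.seed(42)
true_bias = 0.6   # hidden from the "experimenter"; made up for illustration
n_flips = 1000

# Simulate the flips and estimate the bias as the observed fraction of heads.
heads = sum(random.random() < true_bias for _ in range(n_flips))
estimated_bias = heads / n_flips
print(f"{heads} heads in {n_flips} flips -> estimated bias = {estimated_bias:.3f}")
```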

In a sense this is the inverse problem to the previous one: Instead of being asked to predict the coin’s behaviour given some characteristic of it, we’re being asked to determine the characteristic of the coin based on its behaviour.

In this example, statistical inference would provide a systematic way to help us determine how many times we would need to flip the coin in order to get a certain level of accuracy in our estimation of its bias, as well as a way to calculate that estimation from the sample data. Through this method, we aim to infer some truth about the world through a limited set of observations on a process whose randomness makes it tricky to discover that truth.

A final note: In Machine Learning (ML) inference is typically taken to mean “to make a prediction, given a model”, which is more like our straightforward probability calculation. By contrast, statistical inference is more akin to training in ML: Given a bunch of samples (training data), estimate the parameters of your model. In this article, I’ll be using inference to refer to statistical inference, not ML inference.

A coin example

Continuing with our coin example, let’s modify the task a bit: Instead of trying to determine how often the coin will come up heads, let’s try to answer a yes/no question: Is this coin fair? (i.e. does it have exactly a 50/50 chance of coming up heads or tails?) This seems like a simpler question than determining how often the coin will come up heads, but it’s not so straightforward. (I will be referring back to this coin example throughout this article, so keep it in mind)

Let’s say you have a hypothetical coin that you know is fair. If you flipped this hypothetical coin 10 times, would you expect to always get 5 heads? What if you flipped it 100 times or 1000 times? How many heads should you expect? Most folks would realize that you can’t always expect to get exactly 50% of the flips coming up heads even if the coin is fair. But how much deviation should be allowed? For example, if we did 100 flips, would 49 heads be OK? What about 48 or 52? At what point do we declare that the coin is not fair?

There’s no single “correct” answer, but we can again look at the binomial distribution to help us answer this question. If we look at the distribution (PMF) for p = 0.5 (probability of heads coming up for a single flip) and n = 100 (number of flips), we can see that most of the mass is concentrated between [41, 59], which we’ll call the lower and upper bounds respectively.

Binomial Distribution
If flipping a fair coin 100 times, most of the time you'll get between 41 and 59 heads. Values outside of this range will occur infrequently.

In fact, the outcomes (number of heads) between the lower and upper bounds cover about 94.3% of the distribution, meaning that outcomes outside of these bounds only have a 5.7% chance of occurring. Instead of picking bounds this way, we could choose them by excluding outcomes that would only occur, say, 5% of the time or less. Although picking bounds this way is arbitrary, it at least provides a framework to guide our decisions rather than making things up as we go.
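
To verify those numbers, here’s a short sketch (again assuming scipy) that sums the probability mass between the bounds using the binomial CDF:

```python
from scipy.stats import binom

n, p = 100, 0.5
# P(41 <= heads <= 59) for a fair coin flipped 100 times.
mass_inside = binom.cdf(59, n, p) - binom.cdf(40, n, p)
print(f"Inside [41, 59]:  {mass_inside:.3f}")      # ~0.943
print(f"Outside [41, 59]: {1 - mass_inside:.3f}")  # ~0.057
```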

Then, if we flip the coin 100 times and observe a number of heads outside of our bounds, we can deem the result to have been so unlikely to happen from a fair coin that we can act as if the coin is not fair. Conversely, if the number of heads lies within the bounds, we treat this as unsurprising to have come from a fair coin. This quantifies our belief that “extreme” outcomes (a lot of heads or very few) are evidence against the coin being fair.

Frequentist hypothesis testing and the p-value

The example above is a roundabout way to introduce you to the concept of hypothesis testing. The most common form of this (and the one outlined above) is known as a null hypothesis significance test (NHST), which is a fancy way of saying that the testing procedure we specify can produce some “significant” outcome that allows us to reject some assumption, i.e. the null hypothesis. (NHST is a form of frequentist inference, which is probably the most common form of statistical inference you’ll find, but there are other forms of statistical inference)

Usually, the null hypothesis is some reasonable default position, e.g. for a clinical trial of a drug, the null hypothesis would be that there is no difference (no effect) in some outcome being measured between the group that received the placebo (the control) and the group that got the real drug (the treatment). This would be a reasonable default assumption if the trial used a proper study design, such as randomizing the selection of participants into each of the groups.

In our coin example, the assumed (or null) hypothesis is that the coin is fair: We then design a test that may allow us to reject this hypothesis. For example, if we flip the coin 100 times and observe a number of heads that appears inconsistent with a fair coin, we can reject the null hypothesis (or just the null, and typically designated as \(H_0\), pronounced “H-nought”) and conclude that the coin has some bias and will not come up heads 50% of the time.

This is where the p-value comes in: It is a way to take the observed data (the number of heads that occurred) and calculate the probability of observing data at least as extreme as that, assuming that the null hypothesis was true.

The p-value is only useful when used with some decision rule. Typically this means setting some threshold (known as \(\alpha\)) and saying that any p-value below this threshold will be considered “statistically significant”, meaning you reject the null in this case.

The p-value and error control

When thinking about the p-value, always go back to the definition: It is the probability of observing data at least as extreme as that actually observed, given the null hypothesis was true. It is a conditional probability of roughly the form: (For a one-sided test)

$$ p = P(D \geq d|H_0) $$

  • \(D \geq d\): Observing a difference at least as large as the observed difference, \(d\)

Thus, a low p-value indicates that the data we observed is unlikely given the null hypothesis. In our coin example, this would be something like observing less than 41 or more than 59 heads when flipping a coin 100 times. This is a result that would have a low p-value. However, we can’t just declare the coin to be unfair because of a low p-value, because different people might have different opinions on what constitutes a “low” value. Instead, to be systematic about things, we need to have a clearly-defined decision rule, which is roughly of the form:

  • If the p-value is less than some threshold (called \(\alpha\)), then reject the null hypothesis.
    This is known as a statistically significant result.
  • Otherwise, you fail to reject the null hypothesis and either accept it or remain in doubt.

For example, if we set \(\alpha\) = 0.05, that means we will only reject the null hypothesis if the observed outcome (or one more extreme than it) had less than a 5% chance of happening assuming the null hypothesis was true. Going back to our coin example, that would roughly mean rejecting the coin as being fair (our \(H_0\)) when we get less than 41 or more than 59 heads after 100 flips.
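
Here’s a sketch of that decision rule in code. The observed count is hypothetical, and scipy’s exact binomial test is one way (an assumption on my part, not something from the original experiment setup) to get a two-sided p-value:

```python
from scipy.stats import binomtest

alpha = 0.05
observed_heads = 61  # hypothetical outcome from 100 flips

# Two-sided exact test of H0: the coin is fair (p = 0.5).
result = binomtest(k=observed_heads, n=100, p=0.5, alternative='two-sided')
print(f"p-value = {result.pvalue:.4f}")

if result.pvalue < alpha:
    print("Statistically significant: reject the null and act as if the coin is unfair.")
else:
    print("Not significant: fail to reject the null.")
```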

The purpose of combining a p-value with a decision threshold is to control our error rate, specifically what is known as the Type 1 or false positive error rate. You will note that it is entirely possible to get less than 41 or more than 59 heads when flipping a fair coin 100 times! It’s just unlikely. But when this happens, we will wrongly declare the coin to be unfair when it is indeed fair. This is a false positive because we have declared something to be there when it actually is not, because we were unlucky.

When we properly use an \(\alpha\) threshold to guide our decisions based on p-values, this ensures that in the long run (over many experiments), our false positive (Type 1) error rate will be no higher than the value we picked for \(\alpha\). (Proper Type 1 error control for real-world experiments, not just coin flips, can be much more complicated than this, and is a whole other topic.)

Effect Size and Statistical Power

In our coin example, the effect size can be seen as the difference between the coin bias we estimated from our experiment and the coin bias of the null hypothesis, i.e. the difference in bias from 0.50.

Recall that a Type 1 error is a false positive, i.e. wrongly rejecting the null hypothesis. This would mean that there is no true effect size (no difference from a fair coin), but we wrongly declared there to be a difference.

In our case, that would mean wrongly declaring a fair coin to be unfair because it happened to produce a lot of (or very few) heads on a bunch of coin tosses. As you’ll recall, we can set our \(\alpha\) level to control how often we make Type 1 errors; all else being equal, a lower \(\alpha\) will result in fewer Type 1 errors.

This raises the question: Why not set \(\alpha\) to a really low value, or even zero? Wouldn’t that reduce or even remove the possibility of a Type 1 error? The answer is yes, it would - but as with many things in life, this is a trade-off.

Besides a Type 1 error, the other type of error we can commit during hypothesis testing is a Type 2 error, or a false negative. This is where there is a true effect size, but in an experiment, it didn’t produce enough of an observed difference to be statistically significant, or in other words, the p-value was not less than \(\alpha\). The Type 2 error rate is also known as \(\beta\). All else being equal, a lower \(\alpha\) level will increase your Type 2 error rate, and so you are trading off between Type 1 and Type 2 errors.

Just like a fair coin will sometimes appear to be unfair (Type 1 error), an unfair coin can sometimes appear to be fair. An unfair coin appearing to be fair would be a Type 2 error, or a false negative.

Conversely, the statistical power (or just power) of the test is the probability of detecting a difference, given that there is a true effect size. This is simply \(1 - \beta\). Consequently, your power will go down if you make your \(\alpha\) level lower or more strict, all else being equal. You can only increase your power (while holding \(\alpha\) and the assumed true effect size constant) by increasing your sample size.

Here’s an example of power using our coin toss experiment. Suppose the coin is unfair and has a bias of 0.65; as before, when testing the coin, we’ll only consider it unfair if it turns up heads less than 41 or more than 59 times after 100 flips. (This is a two-sided test) These bounds reflect our threshold for significance and are equivalent to an \(\alpha\) level.

However, a coin with a bias of 0.65 won’t always yield 60 heads or more. It’s entirely possible that we could get fewer than 60 heads from this coin after 100 flips, and that would reflect a false negative or Type 2 error.

Power in the coin toss experiment
A coin with a bias of 0.65 will only produce 60 or more heads from 100 flips ~87.5% of the time

You can see from the above plot that there is quite an overlapping range of outcomes from the fair coin and our biased coin; at times, they can appear to act the same. However, most of the time, we are able to tell our unfair coin (bias of 0.65) apart from a fair coin after 100 flips. In fact, roughly 87.5% of the time, our unfair coin will yield a statistically significant outcome in our experiment, represented by all the outcomes with >= 60 heads. This is a pretty high level of statistical power and not all experiments have such high power.
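
The ~87.5% figure can be checked with a short sketch (assuming scipy): under an assumed true bias of 0.65, sum the probability of landing in the rejection region (fewer than 41 or more than 59 heads):

```python
from scipy.stats import binom

n, true_bias = 100, 0.65

# Power = P(outcome falls in the rejection region | the coin's bias is 0.65),
# i.e. P(heads <= 40) + P(heads >= 60).
power = binom.cdf(40, n, true_bias) + (1 - binom.cdf(59, n, true_bias))
print(f"Power: {power:.3f}")  # roughly 0.875
```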

Taken together, we have four factors that define a NHST:

  1. Sample size:
    In our coin flip experiment, this would be the number of times we flipped the coin. All else being equal, the larger the sample size, the more accurate our estimate will be and the higher our power to detect any effect size, should it exist.

  2. Type 1 error rate or \(\alpha\):
    This is our threshold on the p-value; only values less than this will be declared statistically significant. It is a bound on the long-run Type 1 (false positive) error rate, or the rate of wrongly declaring an effect to exist where it does not.

  3. Type 2 error rate or \(\beta\):
    This is the long-run Type 2 (false negative) error rate and is the fraction of times you will get a non-statistically significant result (p-value > \(\alpha\)) when there is a true effect. It depends upon the chosen \(\alpha\) level, the sample size, and the assumed true effect size.

    The lower the \(\alpha\), the higher the Type 2 error rate (\(\beta\)). This is because by lowering \(\alpha\) you are only accepting smaller and smaller p-values as being significant, which (all else being equal) also makes it harder for a true effect size to produce a statistically significant outcome.

    The greater the sample size, the higher the statistical power and the lower the Type 2 error rate. This is intuitive: If a coin has a bias of 0.75, we may not be able to tell after flipping it 10 times, but we’d be more likely to do so after flipping it 100 times.

    The greater the effect size, the higher the statistical power as well: It’s easier to tell that a coin with a bias of 0.90 is not fair than a coin with a bias of only 0.51.

  4. True effect size:
    This is a little tricky, because you typically don’t know the true effect size. (In the coin example, that would be the difference in the coin’s bias from that of a fair coin) If you did know the true effect size, you would not need to do the test!

    However, calculation of statistical power is based on an assumed true effect size. So, in most cases, it is an estimated power, assuming some effect size.

    If we can’t estimate the effect size before doing the experiment (i.e. because we have no prior information), then usually it’s better to speak of the Minimum Detectable Effect (MDE). The MDE is always quoted at a certain power, and the most frequently used power is 80%. An MDE at 80% would be the true effect size that would have an 80% chance of being detected given your chosen \(\alpha\) and sample size. You can have an effect size less than the MDE (at 80% power) and still be able to get a statistically significant result; it’s just that you’ll have less than an 80% chance of detecting it.

For a great visualization of these four factors, see this one from R Psychologist or this article about A/B testing from Twitter.

Although there are four factors (\(\alpha\), \(\beta\), sample size, assumed true effect size), there are only three degrees of freedom here. By picking any three, you have fully specified the remaining one, and it can be calculated from the first three. Consequently, if all four are defined, changing one will require a change in at least one other factor.
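
To make the interplay concrete, here’s a sketch that fixes \(\alpha\) = 0.05 and n = 100 flips, then scans a few assumed true biases to see roughly where 80% power (the MDE at 80%) is reached. The power function uses scipy’s exact binomial test, and the grid of biases is an arbitrary choice for illustration; note that the strict \(\alpha\) = 0.05 test here uses slightly different bounds than the [41, 59] example above, so the numbers won’t match exactly.

```python
from scipy.stats import binom, binomtest

alpha, n = 0.05, 100

def power(true_bias):
    # Chance, under the assumed true bias, of observing an outcome whose
    # two-sided exact p-value against H0: p = 0.5 falls below alpha.
    return sum(binom.pmf(k, n, true_bias)
               for k in range(n + 1)
               if binomtest(k, n, p=0.5).pvalue < alpha)

for bias in (0.55, 0.60, 0.65, 0.70):
    print(f"assumed true bias {bias:.2f}: power = {power(bias):.2f}")

# The smallest bias on this grid that reaches ~0.80 power approximates the
# MDE (expressed as a difference from 0.50) for this alpha and sample size.
```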

What the p-value is not

p-values have been much maligned, because it’s easy to misinterpret what they mean. This is because a p-value is a frequentist concept, and much of frequentist statistics is, in my opinion, not intuitive at all. A p-value is not any of the following:

  1. The p-value is not the probability that the null hypothesis is true, i.e. that there was no effect.
  2. A high p-value does not prove that the null is true.
  3. A p-value below your threshold (i.e. a low p-value) does not prove that there is a true effect.
  4. The p-value is not the probability of the single test that produced it being a false positive.

Let’s go into each of these in detail.

  1. The p-value is not the probability that the null hypothesis is true
    This is an easy mistake to make. Because low p-values (below the value we set for \(\alpha\)) mean that we will reject the null hypothesis, and because a p-value ranges between [0, 1], it’s easy to conflate the meaning into a direct probability on whether the null hypothesis is true. But this is not the case. A p-value makes no statement as to whether the null hypothesis is true or not.

    Instead, it is a statement on how likely the data observed (or data more extreme) would be, assuming the null hypothesis is true. Going back to our coin example: Let’s say we observed less than 41 or more than 59 heads after flipping the coin 100 times. This would happen roughly 5.7% of the time, but this does not imply that the probability of the coin being fair is 5.7%!

    The easiest way to remember this is that the p-value is the probability of the data, given a hypothesis, or \(P(D|H)\) and not the probability of the hypothesis given the data, or \(P(H|D)\). Thinking back to elementary probability theory, you’ll recall that in general, \(P(A|B) \neq P(B|A)\)

  2. A high p-value does not prove that the null is true
    A low p-value implies that the data is inconsistent with the null hypothesis, and so we reject the null hypothesis. Following that reasoning, it’s easy to assume that a high p-value (say, close to 1.0) might mean that this “proves” that the null hypothesis is true. However, this isn’t the case. A NHST is not really designed to “prove” the null hypothesis is true, because you start out assuming it is true and only reject it if the data turns out to be inconsistent with this assumption.

    Just because you got a high p-value, it does not mean that the null hypothesis is true. Your experiment may not have had enough power to detect the effect size. The only conclusion you can draw from a high p-value is that the data you observed are not surprising, assuming that the null was true. That means you can’t really use a NHST to test for no effect; instead you should use something like an equivalence test.

  3. A p-value below your threshold does not prove that the null hypothesis is false
    You’ll note the language that I’ve used here when describing the decision rule on a p-value: A value below your specified \(\alpha\) level means you should reject the null hypothesis and act as if it was false. This does not prove it was false, and in fact, as we discussed earlier, in the long run you may have a false positive rate up to your \(\alpha\) level.

    This error control is one of the aims of frequentist inference. The false positive rate, or Type 1 error rate is defined as how often we reject the null when it was actually true. The other type is the Type 2 error rate (or false negative rate).

  4. The p-value is not the probability of the single test that produced it being a false positive
    This is another thing that’s easy to misinterpret because of the frequentist interpretation of probability. Frequentist interpretations are all about long-run outcomes or the frequencies of events. By setting \(\alpha\) = 0.05, and only rejecting the null when the p-value falls below this threshold, we are ensuring that in the long run, we won’t make false positive errors more than 5% of the time. That is, if we tested many fair coins with this approach, we would wrongly declare about 5% of them to be unfair.

    This process makes no statement about whether any single experiment may be a false positive or not. In a strict frequentist interpretation, once you’ve done the experiment and gotten p < \(\alpha\), you either have a false positive (a fair coin labeled unfair) or a true positive (an unfair coin properly labeled). But you don’t know which one it is; that’s why you were doing the test! More importantly, there is no probability of an event occurring because it has already happened.

Estimating a parameter from data using MLE

While p-values are useful to quantify how surprising the observed data was, assuming the null was true, we often want to estimate some value (a model parameter) from our experiment. In the case of our coin toss experiment, this goes back to our original question: After observing n tosses of a coin, how can we estimate what the true bias of the coin is?

The frequentist approach to this is called Maximum Likelihood Estimation (MLE) and as you can tell from the name, it aims to maximize some likelihood function by picking the values for the parameters that maximize the probability of observing the actual data. The process works roughly as follows:

  1. Formulate your likelihood function as the probability of observing your data (typically a joint probability since there are multiple events) given the model parameter(s).
  2. Determine the values of the model parameters that maximize the value of this likelihood function.

For the case of a coin toss experiment, where we toss a coin n times to estimate its bias, the likelihood function is closely related to the binomial distribution, which makes sense since that is the process which produced the sequence of coin flips.

For example, let’s say we flip the coin n = 10 times, and observe 7 heads. Our likelihood function, plotted against various values of the coin’s bias, p, (the model parameter) would look like this:

Likelihood function for 7 heads from 10 flips
Flipping a coin 10 times and getting 7 heads yields a maximum likelihood estimation of 0.70

The maximum value occurs at p = 0.70, so this is our maximum likelihood estimate of the coin’s bias. For this simple example, the MLE is intuitive: It is simply the fraction of heads you got after flipping the coin a number of times. All other values of the coin’s bias have a lower likelihood of producing 7 out of 10 heads than a bias of 0.70.
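
Here’s a minimal sketch of that likelihood calculation, evaluating the binomial likelihood of 7 heads in 10 flips over a grid of candidate biases (a simple grid search stands in for a formal maximization):

```python
import numpy as np
from scipy.stats import binom

# Likelihood of observing 7 heads in 10 flips, as a function of the bias p.
p_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(7, 10, p_grid)

p_hat = p_grid[np.argmax(likelihood)]
print(f"Maximum likelihood estimate of the bias: {p_hat:.2f}")  # 0.70
```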

Though MLE is conceptually simple, it suffers from being overly sensitive to the data. We already know that a fair coin, when flipped a number of times, will not always come up heads 50% of the time. I could flip a fair coin 100 times and get 48 heads, and someone else could do the same with the same coin and get 52 heads. Not only will our estimates not match, but neither will reflect the true value. We can think up an even more extreme example: If we only flipped the coin once, an MLE estimate of its bias would either say that the coin will always come up heads, or will never come up heads at all! This is because MLE produces a point estimate, that is, just a single value.

The frequentist approach to deal with this is the concept of confidence intervals (CI), which quantify the uncertainty in an estimation derived from MLE. They work as follows:

  • A CI will specify a lower and upper bound around the estimated value. All else being equal, the width of the confidence interval defines the uncertainty in the estimation; if there is more uncertainty, the confidence interval will be wider.
  • Separately, the CI is constructed at a chosen confidence level, which determines how often the interval will contain the true value being estimated. For example, a 95% CI will contain the true value 95% of the time. (See below for the proper interpretation, since this can be easily misunderstood)

I won’t go into the details about calculating a confidence interval (because it depends on the distribution; see here for a binomial confidence interval), but here are some key takeaways for interpreting them:

  1. A confidence interval (CI) is a frequentist concept:
    This means it is tricky to interpret. When we calculate a single 95% CI, it does not mean that this single CI has a 95% chance of containing the true value. A CI is a frequentist concept, which again means it is concerned with long-run frequencies of events. The only correct interpretation is that over many experiments, 95 percent of the 95% CIs will contain the true value. According to a frequentist interpretation, once you’ve done the experiment and calculated a CI, it either contains the true value or it does not, since it is meaningless to discuss the “probability” of an event that has already occurred.

    The simplest way to remember this is: A 95% CI will contain the true value 95% of the time in the long run, but a single CI either does or does not contain the true value.

    For a good visualization of confidence intervals, see the one from R Psychologist.

  2. Confidence intervals get narrower/smaller as the sample size goes up:
    Thankfully, this is somewhat intuitive. All else being equal, as we increase our sample size, the uncertainty in our estimation will go down, which means any given CI (say, the 95% CI) will get narrower. In our coin example, we could imagine that after flipping the coin 1,000,000 times, we will likely have a very precise estimate of its bias.

  3. Relationship between a CI and a statistically significant result:
    If we have a 95% CI around our estimated value, and that CI does not contain the parameter value associated with the null hypothesis, then the result is statistically significant at the \(\alpha\) = 0.05 level. This means that the associated p-value would be < 0.05. In our coin example, the null hypothesis is that the coin is fair, i.e. the bias is 0.50. Suppose we do an experiment with a sufficient number of coin flips (samples) and come up with an estimate whose 95% confidence interval is [0.52, 0.60]. Because this CI does not include 0.50, the result is statistically significant at the 5% level and the p-value would be < 0.05.
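
Here’s a sketch of how an estimate, its CI, and the p-value fit together. The counts are hypothetical, and scipy’s binomtest (which exposes a Clopper-Pearson interval via proportion_ci) is one way to compute a binomial CI:

```python
from scipy.stats import binomtest

heads, flips = 560, 1000  # hypothetical data
result = binomtest(k=heads, n=flips, p=0.5)
ci = result.proportion_ci(confidence_level=0.95)

print(f"Estimated bias: {heads / flips:.3f}")
print(f"95% CI: [{ci.low:.3f}, {ci.high:.3f}]")          # excludes 0.5
print(f"p-value vs. H0 (p = 0.5): {result.pvalue:.4g}")  # well below 0.05
```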

When looking at an estimation produced from MLE, you should also look at the associated confidence interval. This will help you gauge how much uncertainty there is in that estimate, and thus how reliable of an inference the estimate is. There’s no use having an estimate that is precise down to the fourth decimal place if the CI is super wide.

An effect size can be statistically significant but not practically significant. For example, if you flipped the coin millions of times, you may discover that it has a very slight bias of 0.50001 and this may be statistically significant. However, this level of bias is probably not practically significant in most cases! Whether or not an effect size is practically significant depends entirely on what is being measured and the context in which it’s being used - this question can’t be answered by statistics, only by specific domain knowledge in your field.

An estimate of the effect size (along with a CI) and its associated p-value is the best way to present the results of a hypothesis test, as this gives more context than just a p-value alone. Don’t be overly reliant on p-values!

Frequentist Dilemma

If all this talk about frequentist interpretation, p-values and CIs is mind-boggling, you’re not alone. Frequentist hypothesis testing is often derided because its concepts are all too easy to misinterpret. As pointed out, it’s very easy to misinterpret what a p-value is telling you, and what a confidence interval truly means. Additionally, confidence intervals often aren’t considered at all - and since MLE produces a point estimate, without a confidence interval it may give the wrong impression about the precision of that estimate.

There are also concerns that frequentist inference doesn’t give you what you want, i.e. the p-value is a probability on the data, not on the hypothesis, and often people want to know the probability of a hypothesis being true, i.e. the probability of the model parameters taking on certain values.

I personally believe that NHST (and the associated concepts of p-values and CIs) is a useful tool, even if it is frequently misused. The solution is to help others be better informed about these concepts. To that end, I recommend reading The practical alternative to the p-value is the correctly used p-value (Lakens, 2019) and also a response to that paper.

Conclusion

Statistical inference is the tool used in many real-world applications, from clinical drug trials to A/B testing. Having a better understanding of the tools of statistical inference will help you better understand those applications. While I’m not an expert, I hope that the basics I’ve covered might be useful for someone.

In this article, I’ve gone over some key concepts in frequentist inference: the p-values, confidence intervals, and maximum likelihood estimates you’ll encounter when doing a hypothesis test. It’s important to understand exactly what these mean (since frequentist concepts are often unintuitive) as well as the limitations of each.

I also covered the four factors that define any null-hypothesis significance test (NHST): The Type 1 error rate (\(\alpha\)), the Type 2 error rate (\(\beta\)), the sample size, and the true or assumed effect size. Understanding how these interact with each other will improve your interpretation of the results of any NHST.

Although all of these methods are part of frequentist inference, and frequentist inference currently appears to be the most popular form of statistical inference, it is not the only type of statistical inference that can be done. In particular, Bayesian Inference has grown in popularity recently and is increasingly used in fields such as medical device testing. But Bayesian Inference is a topic for perhaps another time!

If you’re interested in learning more about statistical inference, I recommend the Coursera courses Improving Your Statistical Inferences and Improving Your Statistical Questions, both by Daniël Lakens.