November 29, 2020

The friendship paradox and why your friends are more popular than you

The friendship paradox states that on average your friends have more friends than you do, or put another way, your friends are more popular than you are. This seemingly paradoxical observation has implications for how we perceive the world and also for how we define what a “typical” experience is. But first, let’s explain how this paradox arises, since it’s not obvious.

Friendship as a graph

This paradox arises because it seems counterintuitive that even on average, a person’s friends would have more friends than them. Even if there are a lot of people with few friends, surely this would be balanced out by all the people who have lots of friends? To gain a better understanding, let’s think of things in terms of a graph.

If we model friendship as a graph, where each person is a node, and friendship between two people is indicated by an edge between the two respective nodes, then the number of friends a person has is indicated by the degree of (number of edges incident to) the node that represents them. Indeed, if we randomly select two nodes from this graph, there’s no reason to believe that the degree of one should be larger than the other.

But that’s not what’s happening here. The friendship paradox defines a procedure like this:

Randomly select a node (person) in the graph. (This could be you)
Iterate over all the edges (friendships) from this node (person) to discover all the nodes directly connected to it. (These are your friends)
Compute the average number of edges (friendships) incident to each of the nodes discovered in (2).

According to the friendship paradox, (for most social graphs) the average computed in (3) will be greater than the average degree of a random node, thus implying that on average, your friends have more friends than you.

This arises from the fact that in (2) above, we are not sampling nodes uniformly; we are selecting them based on their connection to the node chosen in (1). Nodes that have a higher degree (are more highly connected) will be overrepresented during this step just by virtue of being connected to more nodes.

Thus, iterating over your friends can be seen as a form of non-uniform sampling that is biased towards people that have a lot of friends. This non-uniformity is introduced because in most social graphs, edges are not distributed uniformly over all nodes, that is, there is sufficient variance in the number of friends a person has.

We can think of two extreme examples to illustrate this point. In the first, we have a star graph, where there is a single central node that all other nodes are connected to. The other nodes have no connections other than the one to the central node.

For a graph with \(N\) nodes (i.e. \(|V| = N\)), this means the central node has \(N-1\) edges while all others have only one edge.

In such a star graph, randomly selecting a single node will almost always yield one of the “leaf” nodes. All of these have only a single edge to the central node, so for all of these leaf nodes, the only node connected to it will be the central node, which has many more connections than a leaf node.

In the second example we have a fully connected (or a complete) graph where every node is connected by an edge to every other node. In such a graph consisting of \(N\) nodes, each node has exactly \(N-1\) edges connected to it; there is no variation in the degrees of various nodes.

In this graph, the friendship paradox does not exist. In fact, it can be shown that the friendship paradox only exists where the variance in the degree of nodes (number of edges each node has) is positive. Since most social graphs have varying degrees across the nodes (i.e. the number of friends each person has is not the same), the friendship paradox almost always exists.

As I noted in the beginning, this has implications for how we perceive the world. Taken quite literally, you might observe that you have fewer friends than your friends (e.g. on Facebook, and assuming Facebook friendship is a proxy for real-life friendship). This however doesn’t mean you have fewer friends than the average person! Your sample of the world has just been biased.

Unintentional Oversampling

The same principles behind the friendship paradox also show up in other scenarios. Dr. Allen Downey talks about the paradox of class size. In this paradox, you survey students about how large their class sizes are, and from the responses, you compute an average. In most cases, this average will be larger than the average of all class sizes. How could this happen?

Let’s go over a simple example: Suppose there are 150 students, who are divided into two classes, one of size 100 and one of size 50. (To keep things simple a student can only be in one class) In this example the average class size is 75; nothing is complicated about this.

But if we uniformly sample students and survey their class size, on average we will get about 2/3rds that belong to the class of 100, and only 1/3rd that belong to the class of 50. If we take the average of their responses, we will get around 83, which is simply \((2/3) \times 100 + (1/3) \times 50\). This overestimates the true value of 75, because when we sample students uniformly, we will oversample those that belong to larger class sizes.

Classes and Students as a Graph — If we randomly sample student nodes (S) then we are more likely to encounter the larger class size C1 than C2

This scenario can also be understood as a type of graph, with students as one type of node and classes as another. In this case, edges are between student nodes and class nodes, with an edge representing student membership in a class. As in the friendship paradox, classes that have more students will be overrepresented when we uniformly sample students to determine what class sizes are. Similarly, the divergence between the “true” class size average and the observed average is proportional to the variance in the class sizes.

However, we have to be careful about what we are measuring. If we are just trying to measure the arithmetic mean of class sizes, then clearly the value of 75 here is the correct one. However, if we are trying to measure the average class size that a student experiences, then arguably the value of 83 is a better representation.

We can think up an extreme example to demonstrate this. Suppose there are 101 students, each divided into two classes:

Class A: 100 students
Class B: 1 student

Here, the average class size is about 50; however, no student will actually experience this class size! Furthermore, the “biased” sampling method outlined previously would yield a value close to 100, which is indeed what almost all students would experience. In this case, giving the average across class sizes (50.5) would arguably be less informative than saying “most students experienced a class size of around 100.”

Bias is subjective

This form of sampling bias, (termed the Inspection Paradox) occurs whenever the variable being measured is correlated with the frequency of certain values being observed. In the friendship paradox, the sampling method turns up people who have more friends on average than a person chosen randomly. This bias can creep in unexpectedly, even when it appears that you are sampling uniformly.

Let’s say you wanted to measure the average tenure of an employee at some company. If you just randomly surveyed all the current employees at the company, this would again overestimate the true value because you’d be less likely to observe the employees who had an extremely short tenure and more likely to observe those with very long ones. A better approach would be to use exit surveys to track employee tenure at the time of departure, and combine this with data on the tenure of current employees.

However, this sampling bias is not necessarily a bad thing, so as long as you know that it exists and understand how it relates to the question you are trying to answer. For example, if your question is, “What is the average length of time a current employee has been with the company?”, then just surveying existing employees will give you the answer you want. You should always be specific about what you are trying to measure.

As another example, take the well-known PageRank graph traversal algorithm. In its simplest form, you essentially do a random walk along the graph and count the frequency of landing on each node during these repeated random walks. Because this sampling method is based on sampling edges, it will over-index on nodes that are highly-connected - but that is the entire point of the algorithm: To discover nodes that are more popular than others!

Correcting for the bias

When you need to conduct a survey across a population, collecting enough responses can be time-consuming. Using a method called Social Sampling, where respondents reply with a summary of their friends responses to the survey can cover a larger percentage of the population with the same number of respondents, making it attractive from a cost perspective.

However, this method obviously introduces bias as the sampling procedure is exactly the same as that underlying the friendship paradox: People with lots of friends will be overrepresented. Can this bias be corrected for? That is exactly the subject of this 2012 KDD paper by Dasgupta et al.

In that paper, the authors introduce a method to produce an unbiased estimator that uses social sampling, based on the following:

Since high-degree nodes are overrepresented proportional to their degree, the values from these nodes should be weighted by the reciprocal of their degree. This means that respondents should weigh the value of each of their friend’s responses by the number of friends that friend has.
To make the sampling more efficient, don’t use uniform sampling in selecting respondents. Instead, prefer high-degree nodes since this will allow for greater coverage of the network/graph for a given number of respondents. However, this introduces bias if there is correlation between the node degrees and the value being measured, but this can also be corrected for under certain conditions.

Since this “ideal” estimator relies on knowledge of the network structure (e.g. the degrees of every node) that may not be practically available, the authors also outline several approximate estimators that don’t rely on knowing the full network structure. These have most applicability for conducting polls and I encourage you to read the paper if you want to learn more. While the full explanation and proof of the algorithms is a little heavy on graph theory, the rest of the paper is pretty accessible.

Conclusion

The friendship paradox, or more generally, the inspection paradox, can significantly affect how we perceive the world, leading to observations that seem to contradict the aggregate data. These can be amplified by other biases (such as confirmation bias) to produce perceptions that are in conflict with reality. However, the bias introduced by this paradox is not necessarily a bad thing: Instead, it is something you should be aware of, and understand how it affects what you are specifically trying to measure.