March 20, 2020

Avoid accidental library functions

Have you ever seen a function like this?

def getResultsFromFooService(param1: str, param2: str, param3: str = "param3Default",\
  sort: bool = False, raiseException: bool = False, version: int = 1) -> List[str]:
  """Gets the results from the Foo Service and returns it as a list of strings.

  This function works with both version 1 and version 2 of the Foo Service API.

  Args:
    param1: The first API parameter
    param2: The second API parameter
    param3: The third API parameter; ONLY applies for version 2 of the API!
    sort: Whether to sort the results by their timestamp
    raiseException: Whether to raise an exception if an error is encountered; if False, then an empty List would be returned instead.
    version: The version of the Foo API to use. Currently supports either 1 or 2.

  Returns:
    List[str]: A list of strings representing the results
  """
  ...

This function suffers from trying to do too much, and trying to be too flexible in how it does it. It has multiple parameters, not all of which may be used (param3 only applies if you use version=2), and the flexibility it offers via these parameters creates confusion as to how it should be called: Should you suppress exceptions or not? Why would I want to sort the results? Can I sort by something other than timestamp?

I call these accidental library functions. They are functions that were extended piecemeal over time to become a piece of common code, but probably should have never been a common function in the first place.

Frequently, such accidental library functions are difficult to call properly, and become a source of bugs, because they are difficult to understand. They are also not easy to change, as they are called from multiple places in different ways, so it’s not straightforward to see how a change might break the calling code.

How do these happen?

The unintended consequences of reducing code duplication

We are all taught early on in our careers that code duplication is a Bad Thing: “Thou shalt not duplicate code, lest thy violate the principle of Do not Repeat Yourself”. It’s obvious what the downsides of code duplication are, so I won’t go over them.

However, I am going to posit that eliminating code duplication by making common functions is not necessary a net positive, and is in fact a trade-off. Bear with me, since I know I’ve violated the sacred orthodoxy.

In my experience, accidental library functions (like the one above) usually come from a dogmatic approach that treats code duplication as a “must avoid” without regard to the overall flow of the code, and how it might evolve. Let me give you an example.

Suppose we have two flows, labeled X and Y here. They could represent different API endpoints. For example, X could be the endpoint where you get widgets of type X, and Y could be the endpoint where you get widgets of type Y. Let us also assume that they each have a section of code that is identical. If we want to ensure there is no code duplication, we would refactor that common code into a function (let’s call it A), and now the flows look like this:

X and Y flows

Everything is fine with this, and at this point, A is truly a common piece of code.

Now suppose a change is made in Flow X that requires changes in A to accommodate this. At this point, you have two options:

Modify A to support both the new X and unchanged Y flows. This will likely require passing in a flag and branching within A, hence A is modified to A'
Break out A into A1 (the original, serving Y) and A2 (serving X)

If Flow X is going to be changing fast, I am going to posit that (2) is better because it allows you to change A2 without having to worry about affecting anything other than Flow X. However, chances are that (2) won’t be taken because it takes extra effort compared to (1), and also knowing whether Flow X is going to be changing fast is often subjective. Hence, for reasons of inertia option (1) is almost always taken, and our system becomes something like this:

X and Y flows changed

Now let’s say we have to add Flow Z, and A' almost does what it needs, but not quite - so we add another flag to A' and it becomes A", with even more branching. Repeat this process a few more times, and soon we have an accidental library function.

X, Y, and Z flows

In this accidental library function, there is actually very little commonality, and instead we just have a single function that is trying to do too much and is essentially handling multiple use cases individually with if/else branches.

The root of the problem here is that when we designed Flow X and Y, we decided to reduce code duplication and create a single function A. This single function became the default position in the code base, and led to subsequent changes trying to preserve its supposed commonality. This resulted in unrelated flows being coupled together through the use of the wrong abstraction.

Code duplication is a trade-off

When you decide to reduce code duplication by refactoring things out into a common function that is called by two or more flows, you are making several assumptions, whether you realize it or not.

The assumptions that you are making are:

The common function will always be used in the same way by all the flows.
If you need to make a change to this common function, it will apply to all flows.
If you need to make a change to only one of the flows, you won’t need to change the common function. (This follows from (2))

If any of these assumptions are violated, the benefits you attempted to get by reducing code duplication may be outweighed by the extra maintenance you will incur from trying to make your common code work for new use cases.

Whether these assumptions are correct depends on a variety of factors, and it doesn’t always come out in favour of reducing code duplication. Some factors that I’ve observed which contribute to whether the assumption is correct or not are:

If you expect the flows to change in different ways and not in lock-step, it’s likely your common function will not be common after some time.
This is especially true if one of the flows is not being actively developed (maintenance mode) and the other is being rapidly iterated on.
How long is the common function? How much logic is encapsulated there?
The longer the function, the more specific it is, and thus the less likely it is to survive being common if the flows change.
How many flows are calling this common function? If it’s only two, chances are it’s not truly common but rather just coincidental.
Attempting to produce a common function on the basis of only two examples is almost certainly overfitting and will result in a “common” function that eventually itself needs refactoring. Instead, you should follow the Rule of Three, which roughly states that two duplications are OK, but when you have three, it’s probably time to refactor.

These are some things to think about before you decide on whether to refactor and reduce code duplication.

Library functions require a higher standard of care

Another nice way to avoid accidental library functions is to think about what an ideal library function should be. An ideal library function is one that can be reused not just across the current code, but is also likely to be useful for future use cases. Obviously this is harder than it sounds, so let’s go through some examples.

An example of an ideal library function might be a function that returns the Top-K values from a list. This function would take at most three parameters: A list, a value for K, and perhaps a function that defines how to order the items in the list.

We can think of some aspects of what makes a good library function, based on this example:

It does one thing
It doesn’t have a huge parameter list, so it is straightforward to use
Once written, it’s unlikely that you’ll have to change it (unless there was a bug)

By contrast, the example I presented at the beginning violates all three of the points above. I’ve found that point (3) above is the most salient; if you find yourself frequently changing a piece of “common” code, it probably shouldn’t be common!

Conclusion

I’ve harped on this topic a little excessively, but I wanted to provide a counterpoint to the default assumption that code duplication is always a bad thing. I’ve demonstrated how a knee-jerk reaction to eliminate all code duplication can result in unintended and unnecessary code coupling and complexity down the line, but this isn’t to say that the pendulum should swing all the way over to always preferring code duplication.

For example, there are some cases where code duplication is almost certainly a Bad Thing, such as wholesale copying of entire classes and packages; this should almost always be avoided.

The point here is that we shouldn’t base our decisions on rigid rules relating to code duplication. Instead, we should consider whether the abstraction being used is the correct one, after taking into account the particular situation or scenario in your application.