As probably most of you are aware, I’m actually a software developer. I don’t talk about it here much1, as this is mostly a newsletter about the rest of my interests, but I’ve been experimenting with writing about whatever I feel like writing about, and today I feel like writing about binary search.
If you’re not a programmer, don’t worry. Most of this should be highly accessible to you anyway. There will be some code later, but it’s not the main point of this article. I just think it’s worth understanding the technique either way. If you are a programmer, this will still teach you something new, because I think you’ve probably been taught binary search wrong.
Part of why I’m thinking about binary search is that it came up in a recent issue of London Centric, in the context of bike theft. This isn’t the first time it’s come up, and it’s a story that makes programmers (and mathematicians apparently) wince in pain when they hear it.
Here’s the problem: You leave your bike somewhere at 9AM. You come back at 5PM, and discover to your horror that your bike has been stolen. Fortunately, there’s a CCTV camera. You go to the police and report the issue.
“Sorry”, they say, “that would require us to watch eight hours of CCTV footage to spot the theft. Bike theft just isn’t a high enough priority to justify that kind of effort.”
At this point you might feel outraged, but depending on the sort of person you are your outrage might take a different form. On the one hand, you might be outraged that your bike has been stolen and that you’re not getting it back. If, on the other hand, you are a Cambridge professor or otherwise similarly inclined, you will be outraged at what you see as a failure of basic common sense, whip your blackboard out of your pocket, and proceed to give the police a lecture on what you believe to be one of the fundamental rights of humankind: Replacing O(n) operations with O(log(n)) ones.
Here’s the idea: OK, sure. There are eight hours in which your bike could have been stolen. That’s a lot of footage, makes sense that you don’t want to watch that.
But… you could just check the footage for 1PM. Is the bike there? If so, great. Now you know the crime was done between 1PM and 5PM. Otherwise, you know it was done between 9AM and 1PM. Either way, that’s four hours of footage to watch. Still too long, but a lot less than eight!
Let’s say the bike is still there at 1PM. Now, you check whether the bike was there at 3PM. If it was, the crime happened between 3PM at 5PM, otherwise between 1PM and 3PM. Again, now you’ve only got two hours to check…
You can repeat this process, each time cutting the time you have to check in half. Once you’re down to a few minutes where you know the bike was there at the start and wasn’t at the end, you just watch those few minutes of footage to see the crime being committed.2
This strategy is called binary search. Search meaning, well, you’re looking for something, and binary meaning “relating to two”. It’s binary search because you’re searching by repeatedly reducing the size of the space you have to search by a factor of two.
You have to check seven times to get the time down from eight hours to under five minutes. If you exactly halve it each time (you probably won’t, but it works basically the same way if you’re close enough and pick nice round numbers), the amount of footage you have left is eight hours, then four hours, two hours, one hour, half an hour, 15 minutes, 7.5 minutes, 3.75 minutes.
If you instead had sixteen hours of footage to check, you would require one more step, because your first step would cut you down from sixteen hours to eight, and then you’d need the seven more steps as above.
This is what my crack about O(log(n)) operations means: The original strategy is linear - that is, every minute you add to the footage adds a minute3 to the time it takes you to process it.
In contrast, the binary search strategy is logarithmic. Every time you double the size of the search space, you increase the time it takes you by some fixed amount (the amount of time it takes you to check a midpoint).
For large spaces this matters a lot: If you start with eight hours, and each step takes you say 10 seconds, then it takes you a minute and ten seconds (70 seconds) to get yourself down to three minutes and 45 seconds, so the whole task takes you just under five minutes, in contrast to watching the whole eight hours. If on the other hand you started with ten minutes of footage, your binary search step cuts the space in half to five minutes, then you watch five minutes, and it only saves you just under five minutes of time watching (you could of course cut it down further - e.g. if you wanted to only watch two and a half minutes of footage, this could save you a bit over 7 minutes).
If you only had 30 seconds to watch in the first place, there’d be little point in doing any fussy midpoint finding
The point of “linear” versus “logarithmic” is not that logarithmic is faster in general, it’s that logarithmic is always going to be better for large enough problems. For small problems it might still be faster to use a linear operation.4
Now, if you’re a programmer there are two things you should probably have noticed about the above explanation:
It is blindingly obvious if you’ve encountered binary search even once.
It is probably not actually an example of binary search as you’ve been taught it.
The other prompt for writing this post is that I was running a calibration interview question on a colleague who I have a very high opinion of, and at some point he tried to write a slightly custom binary search to solve a problem, and got into a little bit of a mess (I think he’d have sorted it out fine when not under time pressure) and reverted to a linear search. I, meanwhile, was surprised, because I forgot that binary searches were a thing people got wrong despite myself having previously failed an interview by getting a binary search wrong. The reason is that in the meantime I have learned the correct way to think about binary search, and forgot that almost everyone teaches it wrong. So the next part of this post is my attempt to teach it to you right.
In order to get a stereotypical example of how binary search is normally taught, I asked Claude.5 Here’s what it said:
Binary search is an efficient algorithm for finding a target value in a sorted array or list.
How it works:
Start by examining the middle element
If the target equals the middle element, you’re done
If the target is less than the middle element, search the left half
If the target is greater than the middle element, search the right half
Repeat this process on the chosen half until you find the target or the search space is empty
I think this is a perfectly adequate example of how binary search is normally explained, but note that this is very much not what we’ve just done in the bike theft example! We don’t have a list, nothing is sorted, we’re not looking for a specific item.
And yet, they do seem quite similar. The reason for this is that they are the same thing, but Claude’s (and everyone else’s) explanation is only describing a single specific use of binary search rather than the general phenomenon, and as a result contains a lot of distracting details that make it harder to understand and also less useful.
In order to see one of the key differences, let me tell you a story.
Suppose a thief steals your bike at 12:30 PM. Riding away madly, he is suddenly struck by an epiphany: Stealing is wrong.
Guiltily, he cycles back to where he found it, and puts your bike back and locks it up with the somehow magically still intact lock (or maybe you didn’t lock it, at which point frankly this whole saga is on you), returning it to the bike rack at 1:30PM.
Now, at 3PM, some other bastard steals your bike and cycles off with it. He is in no way burdened with a sense of conscience, and keeps it for himself.
What happens now with our binary search?
Well, we check at 1PM, and the bike’s not there. Therefore it’s stolen before 1PM. Now we check at 11 AM, bike’s there, etc. until we eventually find our bike thief at 12:30. Great! We now have a bike theft to prosecute.6
As binary search is normally taught, it works on sorted arrays. In this case, that would mean that your bike has a certain stolenness that only increases over time: It can be stolen, but once it is stolen it is never unstolen. As the above example shows, this method finds a theft just fine if you lose that property and allow for the bike to be unstolen.
The next thing is that this is binary search over a continuous space. If you wanted to, you could run the binary search forever, slicing the time finer and finer.7 This would be dumb, so we bail out when we get to a small enough time to watch it ourselves, so it causes no problems, but the pure binary search as explained by Claude never stops here unless we get lucky and find the exact instant of the theft (as opposed to some point when the theft is in progress).
Neither of these are problems if you think of binary search in the right way, and the bike theft example is a nice illustration of what binary search is actually doing: You have two numbers (or points on a line if you prefer) which might be far apart, and are different in some way. Binary search lets you find two numbers between them that are close together and are still different from each other. That is, it lets you find two nearby points where something changes, where “nearby” means “within five minutes of each other”, and the something that changes is whether there is a bike there.
In the sorted list example, what you’re looking for is a point where the values change from less than your target to greater than or equal to your target, and your notion of “nearby” is “right next to each other”. So, in code, the standard binary search looks like this (if you can’t read the code, don’t worry about it too much, I’ll explain in a minute):
def find_target(elements, target):
“”“
Given a sorted list `elements`, returns the index
of the first position where all elements after that
point are >= target.
“”“
# If the first element is >= the target, all elements
# are, so we return 0.
if elements[0] >= target:
return 0
# If the last element is < the target, then no elements
# in the array are, so we return the length meaning that
# only an empty set of elements are >= the target
if elements[-1] < target:
return len(elements)
# Invariant: elements[low] < target
low = 0
# Invariant: elements[high] >= target
high = len(elements) - 1
while low + 1 < high:
# Take the midpoint
mid = (lo + hi) // 2
# Assign to whichever endpoint would preserve
# the invariant.
if elements[mid] < target:
low = mid
else:
high = mid
# By the invariants, elements[low] < target and
# elements[high] >= target
# We know that at this point low + 1 == high, so
# that means that high is necessarily the first
# point at which elements become >= target.
return highIf you’re a programmer reading this, the main take home you should have from this code is that if you find yourself implementing binary search (which hopefully you won’t too often, but sometimes you gotta), never under any circumstances skip writing those “Invariant” comments at the top. They will save you confusion every time.
If you’re not a programmer, here’s what’s going on in it:
We’re looking for the point where the list changes from elements that are less than the target to the point where they are greater than or equal to the target. We find a pair of indices next to each other where this changes, and thus the larger of the two is necessarily the first point (because it is greater than or equal to the every element before it is less than the target).
If the target is in the list, that first point where they are greater than or equal is necessarily where the target is. Otherwise, you can tell the target is not in the list by looking at that index and checking whether it’s past the end of the list or the value at it is greater than the target.
Let’s try an example to work through it: Suppose you’ve got the list of elements 1, 3, 5, 7, 9, and we’re looking for the number 7 in it. This proceeds as follows:
We start with low = 0, high = 5 (indexes start from 0, so the first element is the element at position 0).
We take the midpoint of this, which is 2 (division rounds down), so we look at the element at position 2, which is 5.
This is less than 7, so we now set low = 2.
Now, low = 2, high = 5, so mid = 3. The element at position 3 is 7, which is greater than or equal to seven, so we set high = 3.
Now, low is 2, high is 3, so lo + 1 equals high, and we stop and return that the first index greater than or equal to 7 is at position 3 (which it is, because 7 is there).
But honestly, if you’re not a programmer, this is probably the least interesting possible example of binary search, this problem just comes up in programming a lot, and it’s much more interesting to understand the general principle.
Why is it important to understand? Well, partly just because I think knowing things like this that give you an easy perspective shift on what is possible is almost always valuable, and partly because binary search is very useful.
We’ve already seen how it takes the cycle theft example from not worth it to to easy,8 by helping you figure out when something happened. Another place I’ve suggested in the past where understanding this principle is useful is estimating unknown quantities. I used the example of guessing the population of Paraguay. You first pick a number that’s obviously too small, and a number that’s obviously too large, and then you progressively narrow that range through binary search.
Except, that again isn’t quite the same thing! The problem there is that while you are doing the same thing (finding a pair of points where one is obviously too small and one is obviously too large), your stopping condition is no longer necessarily when the points are close. The problem is that there’s a large range of values in the middle where your answer to “Is this obviously too large or obviously too small?” is “No. Seems kinda plausible but I’m not sure.”, so you can get stuck on making progress.
One way to solve this is with the same way we did with the ordered list: Our condition for the lower bound is “obviously too small”, and for the upper bound is “not obviously too small”. I think you’ll still get stuck at some point when you find yourself asking “OK, but how obvious is obvious…?”, but you’ll do a lot better. You can then repeat this on the other side with “obviously too large”, and you get a range of values that you think are plausible.
Another way to solve this is that once your midpoint is non-obvious you can just try testing some other points in the range and updating your range on the basis of that. Pick randomly, or near the edges, or whatever, and narrow the range on that basis. There’s not actually anything magical about the exact midpoint - any point you try between the two ends will allow you to update the range if you get an “obviously too small” or “obviously too large” answer.
This sort of flexibility is one of the reasons why I think it’s super helpful to understand binary search as looking for changes in behaviour. It makes it much easier to reason about new variants, because you really understand what’s going on.
And in general this is one of the key purposes of understanding how things work: If you know how to make it, you know how to make something like it that works slightly differently. In the pen cap post, I talked about how “how things work is how you work with them”, and that’s important, but the other reason to understand how things work is that how things work is how other things like them work, and how things work is how you make them.
I’ve run into this principle over and over again in programming, software development, and maths.9 e.g. I did some algorithmic work I’m really proud of recently,10 and it only happened because I spent some time staring at the preliminary material going “Hmm… How on earth does that work?”, in much the same way I found out about pen caps recently. Right now I’m working on some statistics, and I think if I treated statistics the way most people do - as a set of tools to use without understanding them - the task I’m doing would be almost impossible.
I don’t think this is just for abstract subjects either. I see it a lot in other people who better at more physical skills than me: In much the same way I can just whip up a new algorithm, they can just improvise or custom build a physical object where I’d have to ad hoc adapt some existing object that was much less fit for purpose.
And, you know, often that’s fine. It’s impossible to acquire complete knowledge of everything. Some of the people I’m comparing myself to on the physical skills have literal decades of more experience in the subject,11 and I’m unlikely to put in the effort to acquire those. I expect that will be the same for most people on most of the specific things where I understand them in this way too. This isn’t an invitation to try to learn everything, it’s just me drawing your attention to what happens with the things you do decide to learn.
And part of that is that when you find yourself in the position of acquiring a new technique or skill, I think that if you can it’s worth taking the time to make sure you really understand it, and to find the right way of looking at it, so you can stop treating it as a narrow and well-defined thing you can do,12 and start treating it as a lesson in how to interact with the world in a variety of related ways.
There’s more technical posting on my notebook blog. I used to do technical blogging at my “main” blog, but that’s as the last post says “on indefinite hiatus”.
There’s also the possibility of you happening to get lucky and catching the criminal mid-act in one of your repeatedly checking points in the video.
In general, linear/O(n) actually means “some fixed multiple of a minute”. For example if you had to watch the footage twice, every minute would add two minutes of time for you. If you were able to watch the video at 10x speed, every minute would add six seconds. Both of these are still O(n) operations, because every minute of footage adds a fixed amount of time to the task.
And, indeed, you can see that this is the case in this example! Once we get under the five minute mark we switch to a linear search. This is partly because of the problem - we actually want to catch the thief in the act - but also for viewing a continuous chunk of footage, whenever you get to under the time it takes you to fiddle with your video controls, it’s always going to be easier and faster just to watch it.
Claude should of course not be taken as authoritative, but I think it’s pretty safe to rely on it being clichéd.
Of course, it’s not very useful for getting your bloody bike back, but details details.
Well you couldn’t because the video has a fixed frame rate, and maybe there’s a quantum time consideration even if that weren’t the case, but ignore this detail.
Or, at least, reduced to the problem of getting the police to follow basic instructions and care about doing their jobs.
Even with binary search! As the expression goes, if I had a nickel for every time I’ve invented a novel variant of binary search, I’d have… Well, two nickels, but it’s weird that it’s happened twice.
The first is the observation that binary search gets much faster (improved complexity even!) if you have a good guess about where the point you’re looking for is. The second is that if what you’re actually looking for is a random sample in some ordered range, you can do binary search incrementally to get the desired result efficiently.
I don’t think I’d have figured out either of these without the perspective on binary search I’m telling you about here.
The GAWRS and RAWRS variants in section H.3 of Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling if you’re interested. Sadly I only did this work quite late in the publication, so it only made it into an appendix, but these are implemented in genlm-control as the main algorithm used for AWRS.
Hi dad.


This is an elegant proof of the value of logarithmic approaches in real life, and not just abstract math examples. I'm not sure I've seen one quite so compelling before. Please let me know if you ever write about statistics for people who don't understand them at the mechanistic level, ha
What an amazing read, thanks for sharing. I will be thinking of this!