How to think about estimation strategy
This continues my series on estimation, the full index of which is available here.
If you haven’t read the rest of the series yet, I’d encourage reading them first, especially the first one, but it’s not strictly required.
In this post I want to elaborate further on how to reason about estimates - what makes a good estimate, or a bad one? How do you know when you estimated it wrong?
When is an estimate wrong?
In How to think about task estimation, I introduced the following toy model: Flip a coin. If it’s heads, you’re done. If it’s tails, you repeat the task tomorrow. You will keep doing this until you get heads. You’re asked to estimate how long it will take to complete this task.
As I talked about then, what estimate you give depends heavily on what question you’re asking. An estimate of one day on our toy task is a good estimate for the average cost of such tasks, and a bad estimate for guaranteeing that it will be done by some deadline.
But regardless of what estimate you give, it will be “wrong” at least half the time. If you estimate zero days, you’ll be right half the time when the coin flips heads on the first go, but you’ll also be wrong half the time. If you estimate one day (which is probably a better estimate for most purposes, though as we saw it’s not the only good estimate), you’ll only be “right” a quarter of the time!
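If you want to check those numbers for yourself, here’s a minimal simulation of the toy model (my own sketch in Python, not code from the series):

```python
import random

def days_to_finish():
    """One run of the toy task: flip a coin each day until it comes up heads.
    Returns how many extra days beyond today it took (0 if the first flip is heads)."""
    days = 0
    while random.random() < 0.5:  # tails: try again tomorrow
        days += 1
    return days

runs = [days_to_finish() for _ in range(100_000)]

print("average days:", sum(runs) / len(runs))                                 # close to 1
print("an estimate of 0 days is exactly right:", runs.count(0) / len(runs))   # about half the time
print("an estimate of 1 day is exactly right:", runs.count(1) / len(runs))    # about a quarter of the time
```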
The reason any single estimate keeps coming out “wrong” like this is that an estimate is always an answer to a question, and none of the questions we get to ask is “What is the true number of days to do this task?”, because answering that would require perfect knowledge of the future.
There is only one estimation procedure that will allow you to know the true number of days that a task will take, and that’s to actually do the task. It’s a very effective estimation procedure, but has the downside of being hard to predict how long it will take to estimate the task (or rather, it’s easy to predict how long it will take to estimate the task! You just do the task).
The point of making an estimate is to quickly answer some question about the future - much faster than waiting for the future to arrive on its own - and in doing so an estimate has to collapse uncertainty. But all that uncertainty is still there, it’s just not represented in your estimate.
As a result, judging an estimate as wrong when it differs from the true length of the task is an unreasonable standard. By this standard, your estimates will almost always be wrong. Suppose you estimate a task will take seven days, and you finish in six and a half. Wrong estimate! You were off by an entire half day!
This is, of course, ridiculous; nobody would give you a hard time for that error.
But often they would give you a hard time if you estimated seven days and it took you nine. You underestimated the task, and that’s something you can get blamed for.
But did you, or was this just the intrinsic uncertainty of the problem?
The problem of resulting
There is a concept from poker called “resulting”, which is the fallacy of concluding that a decision was bad because it got bad results. Poker is a game with a lot of uncertainty, where you can play perfectly and still lose, so avoiding resulting is particularly important for developing your poker strategy.
For a simpler example than poker, let’s consider another coin tossing game. Suppose you have the opportunity to bet £1 on the flip of a coin. If you guess the flip correctly, you get £100, otherwise you walk away with nothing. Do you take the bet?
Unless you have some deep-seated moral or practical prohibition on gambling (which is reasonable, but if so imagine this is some business investment or something. Finance definitely isn’t gambling, honest), the answer is obviously that you should take the bet. It’s a really good bet.
But, 50% of the time you’ll walk away with nothing. When that happens, it’s very tempting to conclude that you should have guessed heads instead of tails (or vice versa), or not taken the bet, but both of those are the wrong conclusion. If you’d guessed the other face of the coin, you’d have been equally likely to lose. If you’d walked away, you’d have lost out (on average) on £49 of profit.
It’s particularly important that you don’t conclude that you shouldn’t have taken the bet, because if you make many small bets like this you will, on average, make significant amounts of money. If you took 100 bets like this, you’re overwhelmingly likely to make at least £3000 of profit, because even though you win some and you lose some, when you win you make a lot more money than you lose, and the law of large numbers strikes again and you win on average pretty close to half the time.
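If you want to check that claim, the number of winning guesses out of 100 fair flips is binomially distributed, so the probability of clearing £3000 can be computed directly. Here’s a quick sketch using the numbers from the bet above:

```python
from math import comb

bets = 100     # one hundred £1 bets
payout = 100   # a winning guess returns £100

# A profit of at least £3000 means winning at least 31 bets: 31 * 100 - 100 = £3000.
wins_needed = 31

# Each guess is right with probability 1/2, so the number of wins follows Binomial(100, 0.5).
p = sum(comb(bets, w) for w in range(wins_needed, bets + 1)) / 2 ** bets
print(f"P(profit of at least £3000) = {p:.5f}")  # extremely close to 1
```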
So, the correct conclusion with this game is simple: You played perfectly, and still lost. It happens.
Avoiding resulting isn’t a refusal to learn from your errors though. Real life is more complicated than this toy model, and you do need to learn from your errors rather than just always concluding that they’re fine. For example, if you played a game like this and tossed the coin 10 times in a row and got it wrong each time, you should start to look suspiciously at the coin tosser and conclude that the game is probably rigged and then walk away.1
The point, though, is to not treat a discrepancy between the outcome and the ideal outcome as automatically a sign that you made the wrong call in the first place. When working with uncertainty, it’s typically impossible to guarantee a win - all you can do is improve your chances, and sometimes you’ll lose anyway. Them’s the breaks.
Estimation, too, has a problem of resulting, and these two coin games have a very similar problem: You’ll frequently get a different number than you estimated, no matter how perfectly you estimate, and this isn’t necessarily a sign that you did anything wrong.
Playing the game of estimation
The way I like to think of estimation is that you’re playing a game: Your move is to name a number, and then reality happens and tells you the true value, and how many points you get depends on the relationship between the two numbers. If they’re close you get lots of points, if they’re far apart you get few points. Also the number of points can depend on whether your estimate was too large or too small - for example you might get more points for overestimating a task slightly (you got it done faster than you expected) than for underestimating it slightly (you “failed” to get it done in the time you said it would).
Every time you estimate, you get points, and add these to a total, which is your number that you want to make go up.2
In fact, this isn’t quite right. Usually we want to think of it not in terms of getting points, but incurring penalties3. It’s not that getting close to the right answer is good, it’s that getting far from it is bad. These two views are basically equivalent if you think of it in terms of being given points for playing and losing some of those points for playing badly, but the penalty view turns out to be more convenient to work with. Points are more emotionally satisfying though, which is why I encourage thinking about the points view.
Generally speaking, when estimating you’re not going to literally be keeping score; the score is just a nice abstract way to think about how well you’re doing.
This isn’t a perfect view of estimation, because the actual consequences of a bad estimate can’t really be reduced to a simple score (they’re actual business consequences) and sometimes you’ll need to take that more seriously, but for most estimation it’s fine to just think of it in this simplified way.
Importantly, there’s no question of whether you can play this game. If someone comes to you for an estimate you can just say “4” “4 what?” “IDK, what units do you want? 4 days? 4 story points?4 Whatever it is, it’s 4.”. I don’t recommend doing this, but it’s a perfectly valid move, it’s just a bad strategy.
Your goal in learning to estimate is to improve on this strategy, so that you consistently get a higher total score over time. Of course, in order to do this, you need to actually know how to score points.5
The way to think about the penalty in a game like this is that it’s an abstract measure of badness, and the penalty score you accumulate can only be interpreted in relation to other scores. A penalty of three doesn’t mean anything, except that it corresponds to an error that is three times as bad as an error with a penalty of one.
Depending on the shape of the penalty rule, the optimal strategy is to choose a different sort of estimate (there’s a small worked check of this just after the list). For example:
The median estimate is what you get if every day you are off, in either direction, costs one unit.
The 99%-ile estimate is what you get if you pay a cost of 99 for every day you are late, and a cost of one for every day you are early. (Don’t be fooled - there’s a nice numerical coincidence going on here that makes this work out so neatly. For other percentiles it’s different: to get the P%-ile, every day late costs P / (100 - P) times as much as a day early. For example you get the 90%-ile when every day late costs nine times as much as every day early, and the median is the 50%-ile, so every day late costs 50/50 = 1 times as much as a day early.) Equivalently, if being late is T times as bad as being early, you get the 100T/(T + 1) %-ile. So e.g. if being late is three times as bad as being early, you get the 75%-ile.
You get the mean out of a slightly more complicated scoring rule: take the difference between the estimate and the true value and square it. So for example a day late or a day early costs one, but two days off costs four, three days off costs nine, etc. Don’t worry too much about why this is the case - it’s down to mathematical details that make sense but are almost never going to be relevant to you. The important thing to notice is that large errors cost disproportionately more than small ones.
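To make those three penalty rules concrete, here’s a small brute-force check (my own sketch, reusing the coin-flip task from earlier): simulate a lot of completion times, compute the average penalty each candidate estimate would incur under each rule, and see which estimate comes out cheapest.

```python
import random
import statistics

random.seed(0)

def days_to_finish():
    """The coin-flip task again: count the tails before the first heads."""
    days = 0
    while random.random() < 0.5:
        days += 1
    return days

outcomes = [days_to_finish() for _ in range(100_000)]

def average_penalty(estimate, penalty):
    return sum(penalty(estimate, actual) for actual in outcomes) / len(outcomes)

def absolute(est, act):     # every day off, either direction, costs one unit
    return abs(est - act)

def squared(est, act):      # cost is the square of the error
    return (est - act) ** 2

def nine_to_one(est, act):  # a day late costs nine times as much as a day early
    return 9 * (act - est) if act > est else est - act

for name, penalty_rule in [("absolute", absolute), ("squared", squared), ("9:1 late vs early", nine_to_one)]:
    best = min(range(15), key=lambda e: average_penalty(e, penalty_rule))
    print(f"{name}: cheapest whole-day estimate = {best}")

# For comparison, the summary statistics these should roughly match.
print("mean:", statistics.mean(outcomes))
print("median:", statistics.median(outcomes))
print("90th percentile:", statistics.quantiles(outcomes, n=10)[-1])
```

Run it and the cheapest estimates should land roughly on the median, the mean, and the 90th percentile respectively, as claimed above.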
Which penalty rule you should use is going to depend on the problem you are trying to solve.
A good starting point is to think about how bad it is to be a day late (or an hour, or a week, whatever convenient unit of time you want). Now ask:
Is it better or worse to be a day early? It’s usually better (it’s only worse when there’s a strong social pressure to underestimate, which is generally a very bad sign for your estimation process). How much better?
How much worse is it to be two days late than it is to be one day late?
How much worse is it to be two days early than one day early?
If it’s hard to reason about the last two questions, don’t worry about them for now and just assume it’s linear (i.e. it’s twice as bad to be two days late/early as it is to be one day late/early). This gives you percentile estimates.
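If you want to turn the answers to those questions into an actual number, one rough recipe (my own sketch, with made-up figures) is: decide how many times worse a day late is than a day early, convert that ratio into a percentile using the 100T/(T + 1) formula from earlier, and read that percentile off whatever collection of plausible durations you have, such as how long similar past tasks took.

```python
def target_percentile(lateness_ratio):
    """'A day late is T times as bad as a day early' -> the percentile to estimate at."""
    return 100 * lateness_ratio / (lateness_ratio + 1)

def percentile_estimate(durations, lateness_ratio):
    """Pick the duration sitting at the target percentile of a list of plausible durations."""
    durations = sorted(durations)
    fraction = target_percentile(lateness_ratio) / 100
    index = min(int(fraction * len(durations)), len(durations) - 1)
    return durations[index]

# Durations (in days) of ten vaguely similar past tasks - entirely made up for illustration.
past_tasks = [2, 3, 3, 4, 4, 5, 6, 8, 11, 15]

print(target_percentile(3))                # 75.0: late is 3x as bad -> 75th percentile
print(percentile_estimate(past_tasks, 3))  # 8 days
print(percentile_estimate(past_tasks, 9))  # 15 days: late is 9x as bad -> 90th percentile
```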
If it seems weird that it’s bad to be early, remember that this means that your estimate might have been too pessimistic. If it’s hard to imagine feeling bad about being one day early, try thinking about being ten days early, or a hundred. There is clearly a degree of overestimation where it starts to look bad.
So what strategy should I use?
Unfortunately, this is a hard question to answer. You should use a strategy that makes you win more, same as any other game, and as with almost any other game there is no straightforward description of such a strategy6.
If you don’t know the shape of the scoring - what you’re trying to do - you cannot possibly win, and the question of “What strategy should I use?” is akin to “How do I win at card games?” - you’re going to get very different answers for Poker, Bridge, and Solitaire.
Once you have a scoring rule, you can at least start to reason about strategies. You will probably not find an optimal strategy, but some strategies will be clearly better than others, and you can at least figure out some basic features of what a good strategy looks like.
For example, a reasonable strategy for most scoring functions should always pick a number in your plausible range. Why? Well, because if it picks a number larger than your plausible upper bound, picking the plausible upper bound would always have been better for any true value - it’s a smaller error in the same direction. Technically you could have a scoring function that measures that as “worse”, but this would be dumb so you probably don’t. Similarly for the lower bound.
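As a tiny sanity check of that argument (my sketch, using the absolute penalty), you can confirm that pulling an out-of-range guess back to the plausible bound never costs more, whatever the true value inside the range turns out to be:

```python
def penalty(estimate, actual):
    # Any penalty that only grows with the size of the error would do; absolute error is the simplest.
    return abs(estimate - actual)

lower, upper = 3, 10   # plausible range for the task, in days
wild_guess = 14        # an estimate above the plausible upper bound

for actual in range(lower, upper + 1):
    # Clamping to the upper bound gives an error in the same direction but smaller.
    assert penalty(upper, actual) <= penalty(wild_guess, actual)

print("clamping the guess to the plausible upper bound never did worse")
```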
Beyond that it’s hard to come up with a good general rule. In particular you want to evaluate strategies on two axes:
How much effort is this strategy?
What sort of scores does this strategy typically get?
For example, the decent first guess strategy is a low effort strategy that doesn’t get particularly good scores. Other good strategies include things like planning poker - where you get multiple people to estimate privately and then rely on the wisdom of the crowd to get the right answer by e.g. taking the median of the estimates7.
One strategy, which I will explore more in a later post, is to make a simple computer model of the situation, and pick the number which would be optimal given that model.
The effectiveness of this strategy is, I think, one of the strengths of this game view of estimation: If you think of the model as needing to accurately reflect reality, you end up in all sorts of metaphysical questions about whether your distributions are right, whether probability is real, etc.
But a model for estimation isn’t that; it’s not supposed to be a perfect representation of reality. All it is is a game strategy, and you can measure the quality of the model not by how well it represents reality, but by how many points it gets you.
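To sketch what that can look like (this is my illustration, not the model from the upcoming post): describe the work as a few subtasks with guessed best, likely, and worst durations, simulate the total many times, and then play the move that minimises your average penalty under whatever scoring rule you care about.

```python
import random

random.seed(0)

# A made-up model of a piece of work: three subtasks with guessed
# (best, most likely, worst) durations in days.
subtasks = [(1, 2, 5), (2, 3, 8), (0.5, 1, 3)]

def simulate_total():
    """One possible total duration, drawing each subtask from a triangular distribution."""
    return sum(random.triangular(best, worst, likely) for best, likely, worst in subtasks)

simulated = [simulate_total() for _ in range(20_000)]

def penalty(estimate, actual, lateness_ratio=3):
    """A day late costs `lateness_ratio` times as much as a day early."""
    if actual > estimate:
        return lateness_ratio * (actual - estimate)
    return estimate - actual

def average_penalty(estimate):
    return sum(penalty(estimate, actual) for actual in simulated) / len(simulated)

# The "move" to play: the candidate estimate (to the nearest half day) with the lowest average penalty.
candidates = [d / 2 for d in range(0, 41)]  # 0 to 20 days
print("estimate to give:", min(candidates, key=average_penalty), "days")
```

Nothing about the triangular distributions or the 3:1 lateness ratio here is special - they’re stand-ins. The point is that once the model and the scoring rule are written down, “what number should I say?” becomes a straightforward search.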
Take my courses!
Do you like learning about this sort of thing? Why not learn about it from me directly! I’ve started offering various courses on the sorts of skills you need as a software developer, including one on estimation. You can learn more about them at consulting.drmaciver.com/courses.
I’m also available for a wide variety of other consulting and coaching services for software companies. Have a read through the consulting site and/or drop me an email at david@drmaciver.com if you want to know more.
Subscribe to my newsletter!
If you liked this piece and want to read many more like it, why not subscribe if you’ve not already? Here’s a subscribe button for you to click. Go on, click the button…
Community
If you’d like to hang out with the sort of people who read this sort of piece, you can join us in the Overthinking Everything discord by clicking this invitation link. You can also read more about it in our community guide first if you like.
Cover image
The cover image is a picture of a card game from Ghana, provided by wikimedia user Benebianke.
1. Honestly in the real world if someone offers you a game that good you should probably conclude out of the gate that it’s probably rigged. But thought experiments require a certain amount of suspension of disbelief.
2. I don’t necessarily recommend tracking these points for real. Estimation leaderboards are an interesting idea but I’m a little leery of gamification at work.
3. In statistics this is called a loss function, and the position I’m arguing for is the extremely normal statistical orthodoxy that point estimation is about minimising the expected loss.
4. Story points delenda est.
5. When I play board games, which I often do, there’s usually a moment towards the end of a long explanation of how the game works where I have to say “OK, I think I get it, but I have just one question: How do I win?”
6. Some games such as tic tac toe do have an easily describable winning strategy, and that’s why they’re mostly relegated to being children’s games.
7. Or possibly aggregating in some other way! When writing this I realised that I genuinely don’t know what the best way to aggregate estimates, other than the mean or median, would be. The median is likely to be OK, but it’s possible that it ends up an underestimate. Also you may want to take into account individual predictors’ past performance.