Technical debt (a.k.a. 'tech debt') has become a popular term in recent years. Our codebases and systems tend to build up 'cruft' over time, making it harder to change them or build on them later.
Tech debt is a metaphor coined by Ward Cunningham, a co-author of the Agile Manifesto and the creator of the first wiki software. Metaphors, like models, help shape our thinking about a particular topic. So, why do we call it 'technical debt' and not simply 'cruft'? The reason is that in certain aspects it is analogous to financial debt: if we don't pay it back, it takes its toll on new features and even maintenance, just like interest accumulates on financial debt.
One source of tech debt is the conscious and unconscious tradeoffs we make when designing and building our systems, along with a lack of clean code, documentation, and best practices. Another source is that systems and their ecosystems evolve, and what was a perfect solution two years ago isn't so shiny anymore.
I will focus on techniques to 'sell the idea' and prioritize paying back tech debt in product development with multiple stakeholders at the table.
The myth of the two backlogs
Let's start with what not to do. Many teams end up with two kinds of backlogs – one product-focused and one tech-focused. It won't work for multiple reasons:
- It'll be nearly impossible to cross-prioritize items in separate backlogs against each other.
- It fuels the "us vs. them" mindset, which kills team cohesion and common goals.
- Prioritization is the job of the whole team; leaving out Product representatives won't work.
- It makes sprint planning harder unless you have strict rules on how to allocate capacity (e.g., 20% goes to tech debt consistently).
Recognize that any type of work is ... well, work. Keep one backlog and add any kind of work there. Feel free to tag tech debt if you want stats or filtered views, but make sure you prioritize the whole backlog together.
Don't make tech debt a blame game
Another common mistake is to turn discussions about tech debt into finger-pointing. I've seen some managers (engineering and product alike) say things like, "well, if you had created a better solution in the first place, we wouldn't be in this situation". This is just stupid, inhumane, and ultimately pointless. Nothing will ever be perfect. We are humans, context changes, things evolve. Just accept it. The best we can do is learn from our design mistakes, instill good practices, and be as conscious as we can about the tradeoffs we're making.
Playing the blame game takes our focus away from solving the issue at hand. It makes most folks defensive, which again shifts the discussion into an unnecessary back-and-forth without getting any closer to solutions.
Blameless retrospectives and post mortems are the way to go. The fix is almost always systemic and not personal.
How to discuss tech debt
Now that we know what to avoid, let's see some ways to argue for paying back technical debt. It usually comes down to the risk around existing tech debt, the toil (of maintenance) caused by it, and how it hinders the development of new features.
In my experience, most engineers don't need to be convinced to work on technical debt. In fact, they are the ones requesting it. On the other hand, engineering and product managers often need more context to understand the importance of paying back tech debt.
Your best strategy here is to understand how to give that context in the language your stakeholders speak. "Well, it's trivial why this is important" won't work. There are 50 other items in your backlog that are "trivially important" to them. Also, please don't make it about yourself. It's great that you're passionate about fixing this, but if it sounds like your pet project without any further argument, it likely won't get prioritized. I'll show you a few ways to frame your views. Spoiler: it's all about connecting the dots from your tech debt items to your internal or external customers and the product's performance.
A certain percentage of technical debt work has a risk profile attached. One trivial example is when your solution uses a 3rd party library or service that is approaching its end of life. The risk here is that if you don't upgrade or migrate to another supported solution, the systems depending on the 3rd party will either stop working or, for example, stop receiving security patches, risking a potential breach, which is obviously bad for your users or customers. Not every situation is equal, of course. It does matter whether you'd be going offline next week or there is a low chance of a low-impact security issue hitting you in two years. Check the 'Cost of Delay' section below.
Another type of risk is also around customer retention, but it's less direct. The subpar experience caused by poor solutions (think frequent outages and slow services) creates the risk of customer churn.
Some examples of phrasing technical debt around risk:
- We might fix a bug in 2 places but miss the 3rd due to code duplication.
- The design of the current system could lead to a slow user experience at higher usage scenarios.
- Lacking security practices could result in breaches and legal liabilities.
- Accidental introduction of new bugs into the feature is probable due to our lack of unit tests.
- The complexity and inflexibility of the codebase result in us saying no to new features due to long development times.
To make your argument stronger and more honest, gather data whenever you can and make it part of the argument. This is not always possible, of course. Remember, data can be an industry benchmark, too, e.g., your long test run times vs. what's considered acceptable.
Toil of maintenance
Most tasks take longer in a complex or inflexible codebase, or when tools are ineffective or missing. Combined with an influx of customer and technical issues, this can make the whole team grind to a halt, as even trivial fixes take an entire day to make and roll out. Now you need four hours to understand what's happening in the system, two more to make the tests green because half of them are flaky, and one more to deploy your fix because the deployment system gets stuck three times out of five.
Situations like the above can (and should!) be flagged early as a risk, but many times you only notice them once they're already in full swing (and people complain frequently and loudly).
Data here will be primarily anecdotal, but you can get a good sense of how much time you'd save if things just worked. Talk to your team about their view on the added toil and come up with a rough estimate to support your argument.
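A rough estimate can be as simple as a few lines of arithmetic. Here is a minimal sketch in Python that reuses the numbers from the example above (4 + 2 + 1 hours per fix); the "ideal" duration and the number of fixes per week are made-up assumptions you'd replace with your team's own guesses.

```python
# Numbers from the example above: understand the system (4h),
# get the flaky tests green (2h), fight the deployment system (1h).
HOURS_PER_FIX_TODAY = 4 + 2 + 1

# Assumptions for illustration only, not measured data.
HOURS_PER_FIX_IDEAL = 1   # what a trivial fix "should" take
FIXES_PER_WEEK = 5        # how many such fixes the team ships per week


def weekly_toil_hours(hours_today: float, hours_ideal: float, fixes_per_week: int) -> float:
    """Return the hours per week lost to toil for this kind of task."""
    return (hours_today - hours_ideal) * fixes_per_week


lost = weekly_toil_hours(HOURS_PER_FIX_TODAY, HOURS_PER_FIX_IDEAL, FIXES_PER_WEEK)
print(f"Roughly {lost} engineer-hours lost per week")
```

Even a back-of-the-envelope number like this gives your stakeholders something more concrete to react to than "everything is slow".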
To help your stakeholders understand its importance, paint the picture of a better world where delivering value to the customer is a relatively seamless thing to do.
The efficiency of development of new features
Reduced efficiency here is coupled with the toil above but deserves a few words on its own. While toil takes away time from your team, resulting in less capacity for delivering new features, there are additional factors at play here.
- Working in a hard-to-understand codebase reduces development speed (and might increase the number and severity of new defects introduced).
- Onboarding new team members with such codebases will require more time and effort from the team.
- Implementing a solution in a system that had been poorly designed or is simply outdated in terms of architecture is tough. Dreaded month-long refactoring projects are born this way.
- “Hacking” and “patching” your way through a system that’s not fit for the new solution you’re implementing (basically working around the existing system) will often increase the resulting system’s complexity and add even more to the pile of tech debt. Realize that there’s a vicious cycle here!
Cost of Delay
'Cost of Delay' is not a full prioritization technique but it is a very handy property when thinking about risk.
Cost of Delay is a key metric in lean management. It combines urgency and value, two things that humans are not very good at telling apart.
Because we tend to focus only on production costs and other fixed costs, we have a hard time prioritizing: some costs are unpredictable, and the potential value is not fixed over time either. A better approach is to calculate the Cost of Delay. This value represents the cost, spread over time, that the company will incur by delivering a specific feature, project, or product later than the market or a client expects.
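To make this a bit more tangible, here is a minimal Python sketch that treats Cost of Delay as a weekly cost summed over the weeks of delay. The function names, the two example items, and all the amounts are hypothetical; in real life you would plug in your own estimates (and they will be rough).

```python
from typing import Callable


def cost_of_delay(weekly_cost: Callable[[int], float], weeks_delayed: int) -> float:
    """Sum the cost incurred in each week of delay (weeks are numbered from 1)."""
    return sum(weekly_cost(week) for week in range(1, weeks_delayed + 1))


# Hypothetical item 1: a flaky deployment pipeline wastes a steady 2,000 per week.
def steady(week: int) -> float:
    return 2_000.0


# Hypothetical item 2: an end-of-life library costs nothing until support
# ends in week 9; after that, each week carries an expected cost of 10,000.
def end_of_life(week: int) -> float:
    return 10_000.0 if week >= 9 else 0.0


for weeks in (4, 12):
    print(weeks, cost_of_delay(steady, weeks), cost_of_delay(end_of_life, weeks))
```

The point is not the exact numbers. Delaying the first item by a month costs more than delaying the second, while delaying both by a quarter flips the picture, and that's the urgency part that a plain "value" estimate hides.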
Calculating how your Cost of Delay behaves over time requires fairly mature tracking and monitoring. A simpler alternative is to pick the Cost of Delay profile (the rough shape of the cost curve over time) that best matches the issue.
Your judgment on the cost of delay profile for an issue will inform your Impact rating during RICE scoring.
ICE and RICE scoring
Once you have many items of different sizes and impacts, things get chaotic. ICE can help bring order to this chaos with a methodical approach: you assess each item and create a single numerical representation of its priority, which you can then simply sort by.
ICE stands for Impact, Confidence, and Ease/Effort. The R in RICE stands for Reach.
For each of these factors, the team agrees on a set of numerical points, e.g., Massive = 3, High = 2, Medium = 1, Low = 0.5, Minimal = 0.25.
Impact - What impact would solving this issue have on the customers (remember, customers can be internal too!) - or, when thinking about risk, what impact would not solving the issue have?
Confidence - How confident are you in your estimation of the impact (and optionally the ease/effort)?
Ease/Effort - Effort is usually easier to talk about - how much effort would it take to solve the issue? Remember that it's a relative metric, comparable only across the other issues in the current batch.
Reach - How many people will this impact? 100% of your customer base? Only a specific persona?
It's important to understand that these scores are only meaningful in the local, relative context and should not be compared across domains.
Once you have the numbers, the calculation is easy:
ICE score = Impact x Confidence x Ease (or (Impact x Confidence) / Effort if you rate Effort instead of Ease)
RICE score = (Reach x Impact x Confidence) / Effort
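Here is a small Python sketch of what the scoring and sorting could look like in practice. It uses the example point scale from above, the Effort variant of ICE, and the common RICE definition that multiplies Reach in. The `BacklogItem` class, the backlog items, and expressing Reach as a share of customers between 0 and 1 are all my assumptions for illustration.

```python
from dataclasses import dataclass

# The example point scale from the text; your team agrees on its own.
SCALE = {"massive": 3, "high": 2, "medium": 1, "low": 0.5, "minimal": 0.25}


@dataclass
class BacklogItem:
    title: str
    impact: str
    confidence: str
    effort: str
    reach: float = 1.0  # assumption: share of customers affected, between 0 and 1


def ice_score(item: BacklogItem) -> float:
    """Impact x Confidence / Effort."""
    return SCALE[item.impact] * SCALE[item.confidence] / SCALE[item.effort]


def rice_score(item: BacklogItem) -> float:
    """Reach x Impact x Confidence / Effort."""
    return item.reach * ice_score(item)


# Hypothetical backlog mixing tech debt and feature work in one list.
backlog = [
    BacklogItem("Remove duplicated billing code", "medium", "high", "low", reach=0.3),
    BacklogItem("Upgrade end-of-life library", "high", "high", "medium", reach=1.0),
    BacklogItem("Add CSV export", "medium", "medium", "low", reach=1.0),
]

# Highest score first.
ranked = sorted(backlog, key=rice_score, reverse=True)
for item in ranked:
    print(f"{rice_score(item):5.2f}  {item.title}")
```

Notice that tech debt and feature items sit in the same list and get sorted by the same score, which is exactly the single-backlog idea from earlier.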
Read more about RICE and ICE in this excellent article.
Some tips on how to actually do the work
This section should (shall?) be its own blog post, but let me quickly go over a few strategies to actually chip away at tech debt. There's no perfect strategy here, and each approach has its tradeoffs.
Dedicated capacity for tech debt
Some teams dedicate a certain percentage of their sprint time to categories of work. One common setup is to have 70% for feature work, 20% for technical debt, and 10% for learning/experiments.
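As a quick sanity check on what such a split means in practice, here is a tiny Python sketch. The 70/20/10 split is the one from above; the team size and sprint length are made-up assumptions.

```python
def split_capacity(person_days: int, percentages: dict[str, int]) -> dict[str, float]:
    """Split a sprint's capacity (in person-days) by whole-number percentages."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must add up to 100")
    return {name: person_days * pct / 100 for name, pct in percentages.items()}


# Assumption for illustration: 5 engineers in a 10-working-day sprint.
capacity = 5 * 10

plan = split_capacity(capacity, {"features": 70, "tech debt": 20, "learning": 10})
for name, days in plan.items():
    print(f"{name}: {days} person-days")
```

With these assumed numbers, tech debt gets 10 person-days per sprint, which is a useful figure to keep in mind when you look at the size of the items you're hoping to fit in there.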
The challenge with this setup is that bigger tech debt issues rarely get solved in only 20% of the time. They drag on from sprint to sprint, context gets lost along the way, and restarting them becomes harder. Another challenge is that tracking time accurately is almost impossible, considering how hard estimation is. You can try timeboxing, but that requires discipline.
A commitment to take N cards each sprint
The other end of the spectrum is to stop talking about time invested and just take on a fixed number of technical debt cards from the backlog each sprint.
The obvious tradeoff here is that some cards might be big and take the majority of the sprint. How do you ensure that time is spent in a balanced way?
Treat more significant pieces of technical debt as projects
Sometimes tech debt takes the form of longer projects that actually need to be planned and executed accordingly.
The key to good tech debt project hygiene is to really treat these as regular projects: be clear about the purpose and scope, and set goals (and actually measure them!).
Treat medium-sized pieces as part of the next project that touches that system or codebase
This is a version of the so-called 'boy scout rule': you should leave codebases and systems in better shape than you found them.
One way is to always plan some extra tech debt payback into your projects.
The tradeoff here is that tech debt priorities are lost - what you pick up is driven by what your next project makes you touch.
You also need a strong trusting relationship with your product peers to do this, as this will seem like scope creep for them.
It might seem counterintuitive, but speed results in cost reduction; confidence enables speed; confidence requires quality. Rinse, repeat, and profit.