Disclaimer: This article is written in the context of software engineering at companies with multiple environments, huge codebases, multiple services, and so on. Extrapolate to other cases as you see fit.
If you’ve been tinkering with code for some time, you’ve probably had to debug at some point. And you know how hard it can be to find that one simple bug. Debugging can be stressful for a variety of reasons: the environment (production), a lack of proper tooling, or simply the time it takes to find the bug.
But, does it have to be like that always? Has it been like that always? Probably not.
Debugging is about finding out what’s going wrong with your code or system, and why. Essentially, it’s a series of questions you ask the system that lead you to the right conclusions and eventually to the bug, misconfiguration, or whatever else caused the failure.
Having the right questions is only one part of the equation; the system should also be able to answer them. So, here comes rule #1.
#1. Give the system a voice, language and a medium to speak.
Traditionally, this means one or more of the following.
- Having the right level of logging. Too much logging can bury the insights you are searching for
- Monitoring: tracking metrics and building dashboards, possibly using the USE (utilisation, saturation, errors) method
- Alerting on incidents that are about to happen and will need human intervention
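As a rough illustration of "the right level of logging", here is a minimal sketch using Python's standard `logging` module. The logger name `payments`, the `charge` function, and the `key=value` message style are assumptions made up for this example, not part of any particular system. The idea is that detailed records go to DEBUG, which can be switched on only while investigating, while INFO and ERROR stay quiet enough to read.

```python
import logging

# Hypothetical service name, chosen only for this example.
logger = logging.getLogger("payments")


def configure_logging(level: int = logging.INFO) -> None:
    """Emit records at `level` and above with a timestamp, level and logger name."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


def charge(order_id: str, amount_cents: int) -> bool:
    """Hypothetical business function that logs at three different levels."""
    # DEBUG: high-volume detail, normally hidden in production.
    logger.debug("charge called order_id=%s amount_cents=%d", order_id, amount_cents)
    if amount_cents <= 0:
        # ERROR: something a human may need to look at.
        logger.error("charge rejected order_id=%s reason=non_positive_amount", order_id)
        return False
    # INFO: one line per meaningful event.
    logger.info("charge accepted order_id=%s", order_id)
    return True
```

Calling `configure_logging(logging.DEBUG)` during an investigation shows the detailed lines; the default of `logging.INFO` hides them again.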
Sometimes, even with all the information at hand, we tend to guess what’s happening with our system based on what we already know. This is generally called forming a hypothesis. It’s good to have a hypothesis, but if you blindly change the system based on it before having a Q&A session with your system, you can prolong the MTTR (mean time to recovery). This leads us to rule #2.
#2. Verify your hypothesis before making changes to the system in the process of recovery.
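One way to picture this rule: before acting on a guess such as "the errors are database timeouts", ask the logs whether the evidence agrees. The sketch below is only an illustration under stated assumptions. It assumes log lines contain the level as a separate word and a `reason=` field, and the helper names `error_breakdown` and `hypothesis_holds` as well as the 80% threshold are invented for this example.

```python
from collections import Counter
from typing import Iterable


def error_breakdown(log_lines: Iterable[str]) -> Counter:
    """Count ERROR lines by their `reason=` field (assumed log format)."""
    counts: Counter = Counter()
    for line in log_lines:
        if " ERROR " not in line:
            continue
        reason = "unknown"
        for token in line.split():
            if token.startswith("reason="):
                reason = token[len("reason="):]
        counts[reason] += 1
    return counts


def hypothesis_holds(
    log_lines: Iterable[str], suspected_reason: str, min_share: float = 0.8
) -> bool:
    """Return True if the suspected reason explains at least `min_share` of errors."""
    counts = error_breakdown(log_lines)
    total = sum(counts.values())
    if total == 0:
        return False  # no errors at all: nothing supports the hypothesis
    return counts[suspected_reason] / total >= min_share
```

If `hypothesis_holds(lines, "db_timeout")` comes back false, that is a hint to keep asking questions rather than to start changing the database configuration.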
Imagine you are tasked with fixing a production outage. If you are new to the codebase or the system, you are unlikely to fix the outage in a reasonable amount of time, simply because you lack familiarity with the things you are dealing with. It always helps to understand the system and codebase beforehand. So, rule #3.
#3. Know the system/codebase in depth, without leaving any grey areas, so that you can debug quickly.
Many outages and bugs show up soon after a recent change. If you have a streamlined way of staying in the loop on recent changes, you can, in theory, reduce the MTTR significantly by correlating the failure with a specific change.
#4. Maintain a changelog, or be aware of changes happening to the software.
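To make the correlation idea concrete, here is a small sketch that lists the changes deployed shortly before an incident started. The `Change` record, its fields, and the six-hour default window are assumptions for illustration; a real changelog might come from a deploy tool, a version control history, or a shared document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    """Hypothetical changelog entry."""
    deployed_at: datetime
    service: str
    summary: str


def recent_changes(
    changelog: Iterable[Change],
    incident_start: datetime,
    window: timedelta = timedelta(hours=6),
) -> List[Change]:
    """Return changes deployed within `window` before the incident, newest first."""
    earliest = incident_start - window
    hits = [c for c in changelog if earliest <= c.deployed_at <= incident_start]
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)
```

The result is a list of suspects, not a verdict: each candidate change still has to be checked against what the system is actually telling you.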
What are the techniques you are using for debugging?