Making Sense Of Things That Don't Make Sense

I’ve been solving problems my entire career. That’s what working in tech is all about!

I’ve worn many different hats over my career. Each of those roles allowed me to gain insight into different problems, and how to solve them.

Solving problems isn’t always about the “solution”, but about how to get there. And let me tell you - the way to get there is not always intuitive. Sometimes we think about a problem and its solution like this:

  1. Problem
  2. ????
  3. Solution

The part that’s missing in the middle is the hard part.

It requires an open mind, curiosity, and tenacity.
It requires challenging every assumption you hold dear.
It requires a copious amount of muttering under your breath as you question the very basis of reality itself.
It requires the humility to know you don’t (and won’t) have all the answers.
It requires trust in your colleagues, and a willingness to lean on them when you need to.
It requires the maturity to know that even though you’re smart, you will always learn something new.

I wanted to teach others the lessons I’d learned over the years. The little nuggets of wisdom that only failure can teach. These lessons have helped me approach problems with a particular mindset and methodology.

To gather the material, I started writing down my thought processes while investigating problems, and continued to do this over the course of 2 years. The result was a video presentation of approximately 44 minutes in length.

If you prefer a multimedia learning approach, you can view the presentation in video form via this link. If you prefer to read at your own pace, I’ve included the written portion below.


Aim of this presentation

Solving problems is hard. Doing so without all the necessary information is even harder.

Let’s try to navigate through the uncertainty.

The goal of this presentation is not to solve specific problems, but to show how to approach problems in a repeatable way that helps you solve them quicker.

Speaker Notes
  • Personal context: I actually wrote most of this material back in 2023! I was self-conscious about sharing this material until I had a bit more tenure at GitLab. I’ve realized this might be valuable to share with the team. So here it is.
  • It started as a list of things to remind myself during an investigation, and turned into a presentation. This information is for me as much as it is for others.
  • Audience engagement: I’d prefer that most questions wait until the end, as there’s space dedicated at the end for questions. However, if you need something clarified because it impacts your understanding of the material, please feel free to ask your question during the presentation.

 

What is troubleshooting?

Wikipedia has a pretty good explanation (emphasis mine):

Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system.

It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again.

Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a process of elimination—eliminating potential causes of a problem.

Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

Speaker Notes

Before we dive into how to troubleshoot, we need to understand what troubleshooting is. Luckily, Wikipedia has an awesome description of what troubleshooting is. To be honest, I could just show this slide and stop the presentation early. It succinctly boils down the key parts of what troubleshooting is. I’ve spaced it out for each sentence, and provided my own emphasis, to make each point easier to digest.

 

Bringing order to chaos

Speaker Notes
  • Be logical… Having a methodology and structure helps to guide the investigation in a predictable and structured way
  • Identify the symptoms… We need to be strategic with our requests for information, and our processing of it.
  • Use a process of elimination… The process of elimination was alluded to in the previous slide. Eliminating potential causes or factors saves time and ensures the investigation is productive.
  • Ask specific questions… a smaller problem is an easier problem to grasp and solve

 

Where do we start?

Speaker Notes
  • Extract maximum value… Sometimes we already have all we need to start, we just need to look deeper.
  • Don’t downplay their observations… they have first-hand experience with the problem, and are impacted enough to ask us for help.
  • Trust, but verify… We need to be aware of our own assumptions and not let them get in the way. It’s ok to have assumptions, but our job is to prove or disprove them.
  • Analyze problem nature and scope… We need to understand what the impacts are, the type of problem, how widespread it is, how often it occurs, etc. These details change the way we troubleshoot.

 

Establish the facts: Think like a journalist

Speaker Notes
  • I chose the journalist analogy because troubleshooting is really investigative work. You’re trying to piece together what happened from incomplete information, just like a reporter covering a story
  • Initially I wrote an analogy of a police investigation of a crime, but after careful review and feedback, we removed it because we didn’t want to associate our customers with crime!
  • What questions would you ask… Think about how journalists interview people. They don’t just ask ‘what happened?’ They ask specific questions.
  • Evidence to support your story… You need hard evidence. This is basic journalistic integrity - you don’t publish something just because someone told you it’s true. Journalists corroborate their sources and verify claims. In troubleshooting, this means don’t just rely on someone saying ‘it’s slow’ - get metrics, timestamps, actual data to back up the story.

 

The Five Ws - Breaking down the story

 

Don’t assume consensus

Speaker Notes
  • Customer knows best… This is sometimes hard for technical people - we often think we know better. Our job is to understand and make sense of their observations, not judge them.
  • Confirm understanding early… There’s nothing worse than realising we’re investigating the wrong thing late in the ticket lifecycle. Confirm understanding early and often, as problems evolve and get more complicated.
  • Common understanding accelerates resolution… When customers feel heard and understood, they become your allies. They’ll give you better information, and have more patience. It’s not just good customer service - it’s good troubleshooting strategy.
  • Rephrase and verify… Cunningham’s Law says: “The best way to get the right answer on the Internet is not to ask a question, but to post the wrong answer.” People will often be motivated to correct wrong information, which works in our favour here. If we explain the situation in our own words and we’re wrong, they’ll let us know. If we’re right, then we know we’re on the right track.
  • Watch for XY problems… Classic example: ‘How do I restart the server?’ when what they mean is ‘The application is slow’. Always ask ‘What are you ultimately trying to achieve?’, or ‘what problem are you trying to solve?’

 

Make and record observations

Speaker Notes
  • Document what you see… This is harder than it sounds because we’re trained to filter information. I’ve learned to document everything I notice, even if it seems silly.
  • Observations build the timeline… Every observation gets a timestamp and becomes a data point in your story. These observations become the backbone of your narrative - they tell you the sequence of events and help you identify cause and effect, or correlations.
  • Don’t filter too early… We don’t want to pre-judge what’s important early in the investigation
  • Timestamp everything… This gives you a specific window of time to start investigating, and helps to determine the actual timeframe to investigate, which might be larger or longer than you first anticipated.
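
One way to keep timestamped observations is a small structure that sorts itself into a timeline. This is only a sketch: the `Observation` class, the `build_timeline` helper, and the sample notes are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(order=True)
class Observation:
    # Only `when` takes part in comparisons, so sorting yields a timeline.
    when: datetime
    note: str = field(compare=False)
    source: str = field(compare=False, default="unknown")


def build_timeline(observations):
    """Return one formatted line per observation, in chronological order."""
    return [
        f"{o.when.isoformat()} [{o.source}] {o.note}"
        for o in sorted(observations)
    ]


if __name__ == "__main__":
    # Illustrative notes, recorded in the order they were noticed.
    notes = [
        Observation(datetime(2024, 5, 1, 14, 15, tzinfo=timezone.utc),
                    "User reports an error page", "customer"),
        Observation(datetime(2024, 5, 1, 14, 12, tzinfo=timezone.utc),
                    "Queue latency starts climbing", "metrics"),
    ]
    for line in build_timeline(notes):
        print(line)
```

Recording the source next to each note makes it easier to see later which observations were first-hand and which were inferred.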

 

Build a narrative

Speaker Notes
  • Everyone likes a good story… Our brains are literally wired to understand information as stories. When you have many different error messages and many system components involved, a timeline narrative makes it digestible. Instead of overwhelming complexity, you get ‘Here’s what happened, then this happened, which caused that.’
  • Check for plot holes… When you find these plot holes, don’t ignore them - they’re clues. Maybe there were actually two separate issues happening at once. Maybe your timeline is wrong. Maybe there’s a third component you haven’t considered yet. The inconsistencies in your story point you toward what you need to investigate next.
  • Resolve inconsistencies… I actually tell the story back to customers sometimes in a timeline. The narrative helps them remember crucial details they initially left out.

 

Zoom in, zoom out

Speaker Notes
  • These terms are borrowed from photography, but they’re perfect for troubleshooting. Zoom in means diving deep into specific details - reading individual log lines, checking specific configuration values, tracing exact code paths. Zoom out means stepping back to see the bigger picture - overall system health, business impact, whether you’re even working on the right problem.
  • You can get “stuck in the weeds”… You got so focused on one tree that you missed the forest burning down.
  • Zoom back out regularly… Constantly re-evaluate whether your current lines of inquiry are fruitful and in service of the goal: solving the problem.
  • Frequent zoom-outs renew focus… The zoom out isn’t just about stopping wasted effort - it’s about getting re-energized. When you remember why you’re doing this and what success looks like, it gives you clarity on your next steps.

 

Be Curious

Speaker Notes
  • Sometimes intuition trumps methodology… You need to remain flexible in your approach, which allows for random detours or testing of wild theories.
  • Test your hunches… Your prior experience can guide you
  • Look for the unusual… What’s different about this customer? This specific request? This timing? The group membership hierarchy?
  • Channel your inner tester… This comes from my background as a software tester. Good testers don’t just follow happy paths - they ask ‘how do I break this?’ What would cause exactly this failure? I’ve used this method many times recently to reproduce weird race conditions or bugs, by breaking things in a controlled and methodical way. It’s like reverse engineering the problem.
  • Sometimes the answer is hiding… Don’t assume something is irrelevant, test whether it is first.

 

Embrace the learning opportunity

Speaker Notes
  • Problems are teachers… Don’t be intimidated when you encounter something completely outside your comfort zone.
  • Curiosity over frustration… This requires a mindset shift. Instead of saying ‘Ugh, I don’t know Kubernetes,’ try ‘Cool, I get to learn Kubernetes today.’ Instead of ‘This PostgreSQL error makes no sense,’ try ‘I’m about to understand how PostgreSQL really works.’
  • Document what you learned… Future you - and your teammates - will thank you. When we write public Knowledge Base articles, future tickets will be deflected. Everybody wins!

 

AI is your friend, not your shortcut

Speaker Notes
  • I added this section because AI has fundamentally changed how we troubleshoot. The key is understanding what AI is good at and what it’s terrible at.
  • Expand your perspective… AI excels at generating alternative hypotheses you might not think of. It’s like having a brainstorming partner. Like our fellow colleagues, it’s not infallible. Treat it as such.
  • Guide through unfamiliar territory… When I encounter something completely outside my domain - like ‘what does this Kubernetes error actually mean?’ - AI can explain it in terms I understand, suggest what to check next, or translate technical jargon into plain language. It’s like having a patient mentor.
  • Use as a rubber duck… You know the term rubber duck debugging? It involves talking to an inanimate object about your problem, and in doing so you can improve your thinking and decision-making process. AI is a much better rubber duck. Instead of only hearing your own voice, you can have a dynamic conversation.

 

AI pitfalls

AI should assist you and extend your investigation, not be the sole form of troubleshooting.

Speaker Notes
  • AI isn’t a replacement for thinking… AI isn’t a magical oracle. We are the Directly Responsible Individuals (DRI) for the problem, and we can’t outsource that.
  • AI can accelerate confirmation bias… This is subtle but crucial. If you ask AI ‘Why might this be a database problem?’, it’ll give you a dozen reasons why it could be the database - even if it’s not. You’ve primed it to confirm your existing theory. Instead, ask ‘What could cause these symptoms?’ and let it suggest multiple possibilities, including ones that contradict your current thinking.
  • AI should assist you and extend… AI can help gather information, suggest hypotheses, explain concepts - but you’re still the one making decisions, testing solutions, and understanding the bigger picture. The moment you stop thinking and just follow AI’s suggestions blindly, you’re no longer troubleshooting.

 

AI troubleshooting strategies

Four approaches to choose from:

Choose the approach based on your situation and what you need from AI.

Speaker Notes
  • Start with minimal information… This seems counter-intuitive - we usually want to give AI all the context we have. But sometimes that context includes our biases and assumptions. It might cause us to fall deeper into unproductive rabbit holes. Feed AI additional context during the discussion.
  • Full context… can be useful when you need to move fast and you’re confident in your investigation direction. Or, if you are totally stuck and you need some direction. When you have hundreds of lines of logs, multiple error messages, and a complex timeline, AI can spot correlations you might miss. It can also make summaries that help you make sense of complexity.
  • Devil’s advocate… This challenges your assumptions. This is uncomfortable but valuable. When AI challenges your theory, you have to articulate why you believe what you believe. Sometimes you realize your evidence is weaker than you thought. Other times, defending your position helps you identify what additional evidence you need to confirm or refute your hypothesis.
  • Step-by-step guidance… This is perfect when you’re dealing with something completely outside your comfort zone. Instead of fumbling around randomly, you can ask AI to give you a structured approach. It’s like having a troubleshooting playbook for domains you’re unfamiliar with. This helps to identify where you might want to focus your investigation, and start eliminating lines of inquiry that are not relevant.
  • Main message: The point isn’t to memorize these approaches, but to understand that AI is a flexible tool. Match the approach to your needs.

 

Reproducing the problem: Why it matters

Reproduction allows you to verify both the problem and your proposed solution.

Speaker Notes
  • Confirms the problem type… You can’t know which approach to take until you understand the problem’s nature.
  • Enables controlled testing… We’re not only testing the solution, but testing how the problem is reproduced
  • Builds confidence… When you can reliably make something fail and then make it work, both you and the customer know the problem and solution are both real. There’s no uncertainty. It also helps our engineers understand the problem and fix reported bugs quicker. I’ve seen many bug reports sit in backlogs indefinitely because they’re vague and not reproducible. When you can hand over exact reproduction steps, developers are more likely to identify and fix the root cause.

 

Question time!

What are the main things you think about when trying to reproduce a customer’s issue?

 

Aspects of reproducing a problem

Four main aspects to reproducing a problem:

  1. Reproduce the scenario
  2. Reproduce the environment
  3. Reproduce the data conditions
  4. Reproduce the architecture

Speaker Notes
  • Keep in mind that these are high-level aspects to reproducing a problem. We’ll dive deeper into each of these on their own slides.

 

Reproduce the scenario

Speaker Notes
  • Sequence of events/actions… The scenario’s sequence of steps might be triggering a very specific condition
  • Timing patterns and conditions… Finding a time-based pattern to the scenario can save a lot of time for some types of problems. Example: I helped with a ticket recently where a CI/CD job was failing. It used service containers. The health check of the service container was fine, but when the job started executing, it failed. My hypothesis was that the service container was being killed between the health check and the start of the job’s execution. I tested that hypothesis by building a custom Docker image that mimics normal service behaviour initially, but then self-destructs after a predetermined time using a “time-bomb” script. This approach allowed me to precisely control the timing window - the service container stayed up long enough for the health check to pass, but died before the container linking phase. The result was an error message exactly matching the customer’s error message, successfully reproducing the problem, and allowing us to test our solutions and workarounds.
  • User journey and context… This is about understanding intent versus action. Understanding what they’re ultimately trying to achieve helps you spot when they’re taking an unusual path. Their ‘problem’ might be a training issue, not a technical issue. The best kinds of problems are not problems at all.
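
The “time-bomb” idea from the CI/CD example can be sketched in a few lines. This is not the script or Docker image used on that ticket. It is a minimal Python stand-in that assumes the health check is a plain TCP connection, and `TimeBombService`, `is_healthy`, and the fuse length are hypothetical names and values.

```python
import socket
import threading
import time


class TimeBombService:
    """Listen on a TCP port like a healthy service, then die after a fuse."""

    def __init__(self, fuse_seconds, host="127.0.0.1", port=0):
        self.fuse_seconds = fuse_seconds
        # port=0 asks the OS for any free port.
        self._server = socket.create_server((host, port))
        self._server.settimeout(0.1)
        self.port = self._server.getsockname()[1]
        self._thread = threading.Thread(target=self._serve, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _serve(self):
        deadline = time.monotonic() + self.fuse_seconds
        while time.monotonic() < deadline:
            try:
                conn, _ = self._server.accept()
            except socket.timeout:
                continue
            conn.close()  # accepting is enough to satisfy a TCP health check
        self._server.close()  # fuse expired: the "service" goes away

    def join(self):
        self._thread.join()


def is_healthy(port, host="127.0.0.1"):
    """Return True if a TCP connection to the port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=0.5):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    service = TimeBombService(fuse_seconds=1.0).start()
    print("during fuse:", is_healthy(service.port))
    service.join()
    print("after fuse:", is_healthy(service.port))
```

Making the fuse a parameter is the point of the exercise: widening or narrowing the window shows whether the failure really depends on when the service disappears.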

 

Reproduce the environment

Match their software stack as closely as possible - version mismatches are a common cause of “works for me” issues.

Speaker Notes
  • The problem may be related to specific versions of components or incompatibilities between versions, to specific configuration settings or conflicts, or to feature flags that have been enabled or disabled.
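
Comparing the customer’s environment with your reproduction environment can be as simple as diffing two dictionaries. The sketch below assumes you have already collected versions, settings, and feature flags by hand. The `diff_environments` helper and every key and value in the example are made up for illustration.

```python
def diff_environments(customer, local):
    """Return {key: (customer_value, local_value)} for every mismatch."""
    keys = sorted(set(customer) | set(local))
    return {
        key: (customer.get(key), local.get(key))
        for key in keys
        if customer.get(key) != local.get(key)
    }


if __name__ == "__main__":
    # Hypothetical values; a missing key shows up as None on that side.
    customer_env = {"app_version": "1.4.2", "db_version": "14", "flag:new_ui": True}
    local_env = {"app_version": "1.5.0", "db_version": "14"}
    for key, (theirs, ours) in diff_environments(customer_env, local_env).items():
        print(f"{key}: customer={theirs!r} local={ours!r}")
```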

 

Reproduce the data conditions

Data size, content, complexity, and format can trigger specific issues - create test data that mimics the customer’s data characteristics.

Speaker Notes
  • Don’t just replicate the process… This is where a lot of troubleshooting falls short. We test the happy path with clean, simple data. But Production data is messy. It has edge cases, unusual formats, unexpected sizes. The process might work fine with your test data but break with their real-world complexity.
  • Data volume… Testing with 10 records is different from 10,000 records. Pagination breaks, queries time out, memory usage explodes, UI becomes unusable. I’ve reproduced performance issues when replicating similar data volume and complexity as the customer. I was able to provide empirical evidence of performance bugs where increases in data volume directly correlated with higher response times. You can even plot this data in a graph to better highlight the problem.
  • Data characteristics… We need to confirm whether the format or content of the data itself is causing the problem. Example: I had a ticket where a project on GitLab.com couldn’t be deleted. I wasn’t able to figure it out initially, so I raised an RFH. Even then, the backend engineers were also unsure of the problem. I went back to basics, and took a look at the project’s characteristics. I noticed that the project description was not only long, but had some special characters in it. I ran a few tests of my own with similarly sized and complex project descriptions, and it turns out that the length reproduced the problem. You could create a project description of that length, just not delete it. So removing or truncating the project description allowed for the deletion to succeed. This analysis allowed for a bug report to be quickly raised and fixed.
  • Data state… It’s not just the data itself, it’s the relationships between entities and objects. The specific state or transitions of state.
  • Respect data security… In most cases you can’t just copy Production data, but you can create realistic test data that matches their patterns. Try to determine their data requirements based on the points above, and replicate those patterns safely.
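
Synthetic data that copies the shape of the customer’s data, rather than its content, can be generated with a small helper. This is a sketch under assumptions: `make_project_records`, its parameters, and the record fields are hypothetical, and you would adapt them to whatever volume, length, and character mix you observed.

```python
import random
import string


def make_project_records(count, description_length=80, special_chars="", seed=0):
    """Generate synthetic project-like records; no Production data involved."""
    rng = random.Random(seed)  # a fixed seed keeps each reproduction repeatable
    alphabet = string.ascii_letters + " " + special_chars
    return [
        {
            "name": f"project-{i}",
            "description": "".join(
                rng.choice(alphabet) for _ in range(description_length)
            ),
        }
        for i in range(count)
    ]


if __name__ == "__main__":
    # Sweep one characteristic at a time to see which one triggers the problem.
    for length in (100, 1_000, 10_000):
        batch = make_project_records(3, description_length=length,
                                     special_chars="<>&'\"")
        print(length, len(batch), len(batch[0]["description"]))
```

Varying only one characteristic per run (volume, then length, then character set) keeps the process of elimination honest.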

 

Reproduce the architecture

Speaker Notes
  • This is often the last piece of the puzzle. You’ve matched versions, replicated the scenario, created similar data, but the problem still won’t reproduce. That’s when you need to look at the infrastructure layer. Architecture differences can create problems that are completely invisible until you match the exact setup.
  • Architecture pattern… Single-node setups have different failure modes than distributed systems. Distributed systems add complexity, and this complexity is where problems can occur. Customers may also implement architecture patterns that are not recommended.
  • Resource constraints… Problems that are invisible with abundant resources become obvious under load. Memory leaks, CPU spikes, disk I/O bottlenecks - these often only appear when resources are limited.
  • Network topology… Although we don’t provide direct support for their network setup, it’s useful to test whether this may be a contributing or causing factor.

 

Code is the best documentation

Speaker Notes
  • Caveat: This requires some programming literacy, but even basic code reading skills can be incredibly valuable for troubleshooting. You don’t need to be a developer, just curious enough to trace through logic. The more you expose yourself to reading code, the more comfortable you’ll be reading code in the long run.
  • Code shows reality… During troubleshooting, I often consult the actual codebase because our documentation represents an aspirational view of how things should work. The code, however, shows exactly how it does work. There’s always a gap between intended and actual implementation. Sometimes the truth lies in the middle ground between how things are and how they’re intended to be.
  • GitLab is open source… Having access to the code is a huge benefit to troubleshooting. This is the difference between black box and white box testing. Black box tests the functionality without consideration for its implementation. White box tests consider the internal code structure, code paths, etc. With open source, we can see inside the system instead of just guessing from external behavior. This lets us target specific code paths and conditions during reproduction.
  • Code is never out of sync… Documentation can become outdated, but the code is literally what’s being executed.

 

Use code to guide troubleshooting

Speaker Notes
  • Use code to inform reproduction… Think about the aspects of reproduction we discussed earlier. This point speaks to reproducing the scenario and the data conditions, to hopefully achieve a specific outcome, based on our analysis.
  • Code reveals edge cases… Conditional logic, error handling, state machine transitions, event hooks, validation rules - these are all explicitly defined in the code and can guide your troubleshooting.
  • Don’t assume you understand the code… This is crucial - just because you’ve read the code doesn’t mean you understand it correctly. Code reading creates assumptions about behaviour, and those assumptions need to be tested just like any other hypothesis in our systematic approach. I’ve been wrong many times about what I thought code was doing versus what it actually does.

 

In Logs We Trust

Speaker Notes
  • Logs are the currency of troubleshooting. They’re objective evidence of what actually happened, not what people think happened.
  • Correlate across components… Modern systems are distributed - GitLab connects to Redis, PostgreSQL, Sidekiq, and external services. An error in one component often cascades to others. What you initially see as a cause might be a symptom of an earlier or larger problem.
  • Match the time period… This sounds obvious but gets missed constantly. Always verify you’re looking at logs that cover the exact timeframe when the problem occurred. Make sure it’s the correct timezone, too.
  • Get all participant logs… Don’t just look at the component that’s throwing the error - look at everything in the request path. Each component can contribute to the problem, and you need the full chain to understand what happened.
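
Timezone mistakes are easy to avoid by converting the reported time once, up front. The sketch below assumes the logs are recorded in UTC and that you know the customer’s UTC offset. The `utc_window` helper and its default padding are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone


def utc_window(reported_local, utc_offset_hours, before_minutes=10, after_minutes=5):
    """Turn a customer-reported local time into a padded UTC search window."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    reported_utc = reported_local.replace(tzinfo=local_tz).astimezone(timezone.utc)
    return (
        reported_utc - timedelta(minutes=before_minutes),
        reported_utc + timedelta(minutes=after_minutes),
    )


if __name__ == "__main__":
    # "It broke at 14:15 my time", reported from a UTC+10 location.
    start, end = utc_window(datetime(2024, 5, 1, 14, 15), utc_offset_hours=10)
    print(start.isoformat(), "to", end.isoformat())
```

The window is padded further before the reported time than after it, because the cause often starts earlier than the symptom the user saw.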

 

Correlate log events

Speaker Notes
  • Search by time… Often you’ll find related events that happened slightly before or after the reported problem time. The user saw the symptom at 2:15, but the cause might have started at 2:12. You might notice a big spike at a certain time, but that doesn’t necessarily mean that the problem started then.
  • Search by context… Once you have a time window, search for specific context: user IDs, project names, IP addresses, error patterns.
  • Zoom out, zoom in… Start broad - look for overall patterns in the time window. High error rates? Unusual traffic? Resource spikes? Then zoom in on specific events. Maybe there’s a pattern
  • Look for presence or absence… Depending on the problem, we might see things we shouldn’t, or things we should see aren’t there. Ask yourself what you should be seeing, versus what you are seeing.
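
Searching by time and then by context can be sketched as a small filter over JSON log lines from several components. The field names `time` and `correlation_id`, the `correlate` helper, and the sample lines are assumptions for illustration. Check the actual field names in the logs you are reading.

```python
import json
from datetime import datetime, timezone


def parse_time(value):
    # fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def correlate(log_lines_by_component, start, end, **context):
    """Merge JSON log lines from several components into one timeline.

    Keeps events inside [start, end] whose fields equal every key/value
    given in `context`, and returns them sorted by time.
    """
    events = []
    for component, lines in log_lines_by_component.items():
        for line in lines:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines rather than abort the search
            if not isinstance(event, dict) or "time" not in event:
                continue
            when = parse_time(event["time"])
            if not (start <= when <= end):
                continue
            if all(event.get(k) == v for k, v in context.items()):
                events.append((when, component, event))
    return sorted(events, key=lambda item: item[0])


if __name__ == "__main__":
    logs = {
        "rails": ['{"time": "2024-05-01T04:15:02Z", "correlation_id": "abc123", "status": 500}'],
        "sidekiq": [
            '{"time": "2024-05-01T04:12:40Z", "correlation_id": "abc123", "message": "job failed"}',
            '{"time": "2024-05-01T04:13:00Z", "correlation_id": "zzz999", "message": "unrelated"}',
        ],
    }
    window_start = datetime(2024, 5, 1, 4, 5, tzinfo=timezone.utc)
    window_end = datetime(2024, 5, 1, 4, 20, tzinfo=timezone.utc)
    for when, component, event in correlate(logs, window_start, window_end,
                                            correlation_id="abc123"):
        print(when.isoformat(), component, event)
```

Calling `correlate` with no context arguments gives the zoomed-out view of everything in the window. Adding a filter zooms in on one request.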

 

Keep your cool

Speaker Notes
  • Breathe and reset… When you feel that panic is creeping in - when you’ve been staring at logs for 30 minutes and nothing makes sense - literally stop. Take some deep breaths. Step away from the screen for 60 seconds if you have to. Panic literally narrows your cognitive focus and makes you miss obvious things.
  • Stay curious… Panic pushes us toward assumptions, and assumptions can often lead to wrong solutions.
  • Separate urgency from panic… Urgent means important and time-sensitive. You can move quickly while still being systematic.
  • Communicate transparently… Customers prefer bad news to silence. ‘I don’t know yet, but here’s what I’m checking next’ is infinitely better than radio silence. It shows progress, demonstrates competence, and buys you time to work.

 

Know when to escalate

Escalation is a skill, not a failure!

Early escalation signals:

Speaker Notes

Escalating isn’t due to an individual failure. It’s the next logical step in the investigation process. We’re a team and we need to leverage the resources, experience, and knowledge of that team.

 

How to escalate effectively

Speaker Notes
  • Explain the customer impact… Impact and timeline help people prioritize their response appropriately.
  • Be timely with escalation… Don’t be a hero. If you’re stuck and it’s something outside your expertise, escalate. And definitely don’t wait until the customer is angry. Escalating before the customer gets frustrated looks much better than escalating because the customer demanded a manager.
  • Target the right group… The more specific you are, the faster you’ll get to the right person.
  • Have a specific ask… Ambiguous questions get ambiguous answers. People are more motivated to assist when the problem and request are understandable and seem achievable. Consider customer tickets that lack actionable information, and compare that to your ask. Would you be satisfied with receiving a ticket with that information? If not, structure your ask and ensure it has the right information.

 

Capture what you’ve learned

Speaker Notes
  • Every problem you document makes the entire team better at troubleshooting. Your hard-won knowledge becomes organizational knowledge. That’s how you build a really effective support team.
  • Write for the next person… That person could be you! Make your fellow colleagues successful, not just informed.
  • Update existing documentation… If you had to learn something the hard way because existing docs were wrong or incomplete, fix them. Don’t let the next person waste the same hours you did. Fix the absences or inconsistencies as you identify them.

 

We are our values

GitLab Values are very useful in the troubleshooting process. Let’s look at some of our operating principles and see how they help us troubleshoot.

 

Serving our customers

“Assume positive intent” \ “Kindness” \ “Say sorry” \ “No ego”

 

Knowledge isn’t everything

“It’s impossible to know everything” \ “Self-service and self-learning”

 

Keep an open mind and be prepared for failure

“Write things down” \ “Low level of shame” \ “Articulate when you change your mind”

Speaker Notes
  • Use the Scientific Method… I mention this because we shouldn’t be emotionally invested in a specific outcome, theory, or resolution. We should go where the observations and tests tell us. We should be open to new possibilities, and change our view based on new information or evidence.

 

Balance the need for problem solving and customer results

“Customer results” \ “Sense of urgency” \ “Have Ownership & Accountability” \ “Reach across company departments”

 

Own your troubleshooting journey

“Manager of one” \ “Operate with a bias for action” \ “Give agency”

 

Key takeaways

 

Making sense of things that don’t make sense

The goal isn’t to memorize every technique - it’s to develop a systematic mindset that helps you: