
Making Sense Of Things That Don't Make Sense

I’ve been solving problems my entire career. That’s what working in tech is all about!

I’ve worn many different hats over my career. Each of those roles allowed me to gain insight into different problems, and how to solve them.

Solving problems isn’t only about the “solution”, but also about how to get there. And let me tell you - the way to get there is not always intuitive. Sometimes we think about a problem and its solution like this:

Problem
????
Solution

The part that’s missing in the middle is the hard part.

It requires an open mind, curiosity, and tenacity.
It requires challenging every assumption you hold dear.
It requires a copious amount of muttering under your breath as you question the very basis of reality itself.
It requires the humility to know you don’t (and won’t) have all the answers.
It requires trust in your colleagues, and a willingness to lean on them when you need to.
It requires the maturity to know that even though you’re smart, you will always learn something new.

I wanted to teach others the lessons I’d learned over the years. The little nuggets of wisdom that only failure can teach. These lessons have helped me approach problems with a particular mindset and methodology.

To gather the material, I started writing down my thought processes while investigating problems, and continued to do this over the course of 2 years. The result was a video presentation of approximately 44 minutes in length.

If you prefer a multimedia learning approach, you can view the presentation in video form via this link. If you prefer to read at your own pace, I’ve included the written portion below.

Aim of this presentation

Solving problems is hard. Doing so without all the necessary information is even harder.

Let’s try to navigate through the uncertainty.

The goal of this presentation is not to help solve specific problems, but to show how to approach problems in a repeatable way that helps solve them more quickly.

Speaker Notes

Personal context: I actually wrote most of this material back in 2023! I was self-conscious about sharing this material until I had a bit more tenure at GitLab. I’ve realized this might be valuable to share with the team. So here it is.
It started as a list of things to remind myself during an investigation, and turned into a presentation. This information is for me as much as it is for others.
Audience engagement: I’d prefer that most questions wait until the end, as there’s space dedicated at the end for questions. However, if you need something clarified because it impacts your understanding of the material, please feel free to ask your question during the presentation.

What is troubleshooting?

Wikipedia has a pretty good explanation (emphasis mine):

Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system.

It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again.

Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a process of elimination—eliminating potential causes of a problem.

Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

Speaker Notes

Before we dive into how to troubleshoot, we need to understand what troubleshooting is. Luckily, Wikipedia has an awesome description of what troubleshooting is. To be honest, I could just show this slide and stop the presentation early. It succinctly boils down the key parts of what troubleshooting is. I’ve spaced it out for each sentence, and provided my own emphasis, to make each point easier to digest.

Bringing order to chaos

Be logical and systematic in your approach. This requires forethought and planning
Identify the symptoms by asking for information, and parsing that information
Use a process of elimination to determine the most likely cause
Ask specific questions to reduce the problem scope and guide troubleshooting
Confirm that the solution restores the product or process to its working state
- This is not always possible. In that case, we should look for a solution that is acceptable to the customer

Speaker Notes

Be logical… Having a methodology and structure helps to guide the investigation in a predictable and structured way
Identify the symptoms… We need to be strategic with our requests for information, and our processing of it.
Use a process of elimination… The process of elimination was alluded to in the previous slide. Eliminating potential causes or factors saves time and ensures the investigation is productive.
Ask specific questions… a smaller problem is an easier problem to grasp and solve

Where do we start?

Extract maximum value from the customer’s initial information, even if incomplete
Don’t downplay their observations - they know their problem best
Trust, but verify. We need to trust the customer, but verify that what they’re providing is correct
- I’ve been humbled many times when what seemed like an outrageous claim actually turned out to be true
Analyze problem nature and scope, then ask for more details

Speaker Notes

Extract maximum value… Sometimes we already have all we need to start, we just need to look deeper.
Don’t downplay their observations… they have first-hand experience with the problem, and are impacted enough to ask us for help.
Trust, but verify… We need to be aware of our own assumptions and not let them get in the way. It’s ok to have assumptions, but our job is to prove or disprove them.
Analyze problem nature and scope… We need to understand what the impacts are, the type of problem, how widespread it is, how often it occurs, etc. These details change the way we troubleshoot.

Establish the facts: Think like a journalist

Imagine you’re an investigative journalist uncovering a complex story
- What questions would you ask the key stakeholders and participants?
- What information would you seek about the circumstances, the timeline, or the relationships between parties?
- What kinds of evidence do you need to support your story?
- Does the evidence support your working narrative?
The Five Ws are useful to consider:
- Who, What, When, Where, Why

Speaker Notes

I chose the journalist analogy because troubleshooting is really investigative work. You’re trying to piece together what happened from incomplete information, just like a reporter covering a story
Initially I wrote an analogy of a police investigation of a crime, but after careful review and feedback, we removed it because we didn’t want to associate our customers with crime!
What questions would you ask… Think about how journalists interview people. They don’t just ask ‘what happened?’ They ask specific questions.
Evidence to support your story… You need hard evidence. This is basic journalistic integrity - you don’t publish something just because someone told you it’s true. Journalists corroborate their sources and verify claims. In troubleshooting, this means don’t just rely on someone saying ‘it’s slow’ - get metrics, timestamps, actual data to back up the story.

The Five Ws - Breaking down the story

Who: Identify the key stakeholders and participants involved. Relationships are key!
What: Understand what events occurred. We also need evidence that they did occur
When: Establish a timeline, with key events in chronological order
Where: The systems and locations involved in the situation. Map the connections!
Why: Was there a change, an event, a previous failure that caused the current one?

Don’t assume consensus

Customer knows best - about their problem and its impact (most of the time)
They’re doing their best - to explain the problem in the circumstances
Confirm understanding early - as soon as possible in the process
Common understanding accelerates resolution - faster troubleshooting and fixes
Rephrase and verify - explain the problem and its impact in your words, ask them to confirm
Watch for XY problems - are they describing symptoms when the issue is elsewhere?

Speaker Notes

Customer knows best… This is sometimes hard for technical people - we often think we know better. Our job is to understand and make sense of their observations, not judge them.
Confirm understanding early… There’s nothing worse than realising we’re investigating the wrong thing late in the ticket lifecycle. Confirm understanding early and often, as problems evolve and get more complicated.
Common understanding accelerates resolution… When customers feel heard and understood, they become your allies. They’ll give you better information, and have more patience. It’s not just good customer service - it’s good troubleshooting strategy.
Rephrase and verify… “Cunningham’s Law” states: “The best way to get the right answer on the Internet is not to ask a question, but to post the wrong answer.” People will often be motivated to correct wrong information, which works in our favour here. If we explain the situation in our own words and we’re wrong, they’ll let us know. If we’re right, then we know we’re on the right track.
Watch for XY problems… Classic example: ‘How do I restart the server?’ when what they mean is ‘The application is slow’. Always ask ‘What are you ultimately trying to achieve?’, or ‘what problem are you trying to solve?’

Make and record observations

Document what you see - even seemingly irrelevant details matter
Observations build the timeline - they become the foundation of your narrative
Don’t filter too early - record first, analyze relevance later
Include the mundane or obvious - these can be vital clues that could be missed
Timestamp everything - when did you observe it, when did it happen?

Speaker Notes

Document what you see… This is harder than it sounds because we’re trained to filter information. I’ve learned to document everything I notice, even if it seems silly.
Observations build the timeline… Every observation gets a timestamp and becomes a data point in your story. These observations become the backbone of your narrative - they tell you the sequence of events and help you identify cause and effect, or correlations.
Don’t filter too early… We don’t want to pre-judge what’s important early in the investigation
Timestamp everything… This gives you a specific window of time to start investigating, and helps you determine the actual timeframe involved, which might be longer than you first anticipated.
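
To make the "record first, analyze later" idea concrete, here is a minimal sketch of an observation log in Python. The `Observation` class, the `build_timeline` helper, and the sample events are all hypothetical and exist only for illustration. The one design point it tries to show is keeping two timestamps per observation: when the event happened, and when you wrote it down.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(order=True)
class Observation:
    # When the event happened (not when we noticed it); the only field used for sorting.
    occurred_at: datetime
    note: str = field(compare=False)
    # When we wrote the observation down.
    recorded_at: datetime = field(
        compare=False, default_factory=lambda: datetime.now(timezone.utc)
    )


def build_timeline(observations):
    """Return observations ordered by when they happened."""
    return sorted(observations)


# Hypothetical observations, recorded in the order they were noticed.
observations = [
    Observation(datetime(2024, 5, 1, 14, 15, tzinfo=timezone.utc), "User reports 500 errors"),
    Observation(datetime(2024, 5, 1, 14, 12, tzinfo=timezone.utc), "Disk usage alert fired"),
    Observation(datetime(2024, 5, 1, 13, 50, tzinfo=timezone.utc), "Scheduled backup started"),
]

for obs in build_timeline(observations):
    print(obs.occurred_at.isoformat(), "-", obs.note)
```

Sorting by `occurred_at` rather than by the order of discovery is what turns a pile of notes into a timeline: the "mundane" backup entry, noticed last, ends up first in the story.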

Build a narrative

Everyone likes a good story - it makes complex problems less intimidating
Stories have structure - specific time, place, and participants interacting
Build a coherent narrative - define time, place, participants, and their interactions
Check for plot holes - does your current story make sense?
Resolve inconsistencies - fix the gaps in your narrative

Speaker Notes

Everyone likes a good story… Our brains are literally wired to understand information as stories. When you have many different error messages and many system components involved, a timeline narrative makes it digestible. Instead of overwhelming complexity, you get ‘Here’s what happened, then this happened, which caused that.’
Check for plot holes… When you find these plot holes, don’t ignore them - they’re clues. Maybe there were actually two separate issues happening at once. Maybe your timeline is wrong. Maybe there’s a third component you haven’t considered yet. The inconsistencies in your story point you toward what you need to investigate next.
Resolve inconsistencies… I actually tell the story back to customers sometimes in a timeline. The narrative helps them remember crucial details they initially left out.

Zoom in, zoom out

Zoom in: Focus on details and low-level specifics when required
Zoom out: Step back to see the bigger picture and reassess direction
You can get “stuck in the weeds” if you only zoom in
Zoom back out regularly to re-evaluate if you’re:
- Spending time on the right things
- Making progress toward solving the problem
- Going down productive paths vs. dead-end side quests
Frequent zoom-outs renew focus on the main goal

Speaker Notes

These terms are borrowed from photography, but they’re perfect for troubleshooting. Zoom in means diving deep into specific details - reading individual log lines, checking specific configuration values, tracing exact code paths. Zoom out means stepping back to see the bigger picture - overall system health, business impact, whether you’re even working on the right problem.
You can get “stuck in the weeds”… You got so focused on one tree that you missed the forest burning down.
Zoom back out regularly… Constantly re-evaluate if your current lines of inquiry are fruitful and in service of the goal: solving the problem
Frequent zoom-outs renew focus… The zoom out isn’t just about stopping wasted effort - it’s about getting re-energized. When you remember why you’re doing this and what success looks like, it gives you clarity on your next steps.

Be Curious

Sometimes intuition trumps methodology - if something feels “off,” investigate it
Test your hunches - even if they seem unrelated to the main problem
Look for the unusual - what’s different about this specific case?
Channel your inner tester - what would break this? What edge cases exist?
Sometimes the answer is hiding in the details you dismissed (or missed)

Speaker Notes

Sometimes intuition trumps methodology… You need to remain flexible in your approach, which allows for random detours or testing of wild theories.
Test your hunches… Your prior experience can guide you
Look for the unusual… What’s different about this customer? This specific request? This timing? The group membership hierarchy?
Channel your inner tester… This comes from my background as a software tester. Good testers don’t just follow happy paths - they ask ‘how do I break this?’ What would cause exactly this failure? I’ve used this method many times recently to reproduce weird race conditions or bugs, by breaking things in a controlled and methodical way. It’s like reverse engineering the problem.
Sometimes the answer is hiding… Don’t assume something is irrelevant, test whether it is first.

Embrace the learning opportunity

Problems are teachers - unfamiliar territory is where you grow. Embrace what you don’t know
Research with purpose - when investigating, you’re building expertise
Curiosity over frustration - when you don’t understand something, that’s your cue to learn
Ask “why” and “how” - don’t just fix it, understand it
Document what you learned - not just the solution, but the new concepts you discovered

Speaker Notes

Problems are teachers… Don’t be intimidated when you encounter something completely outside your comfort zone.
Curiosity over frustration… This requires a mindset shift. Instead of saying ‘Ugh, I don’t know Kubernetes,’ try ‘Cool, I get to learn Kubernetes today.’ Instead of ‘This PostgreSQL error makes no sense,’ try ‘I’m about to understand how PostgreSQL really works.’
Document what you learned… Future you - and your teammates - will thank you. When we write public Knowledge Base articles, future tickets will be deflected. Everybody wins!

AI is your friend, not your shortcut

Expand your perspective - ask AI to suggest alternative hypotheses or identify blind spots
Guide through unfamiliar territory - let AI explain concepts or technologies you don’t know
Use as a rubber duck - brainstorm plans or talk through your troubleshooting approach

Speaker Notes

I added this section because AI has fundamentally changed how we troubleshoot. The key is understanding what AI is good at and what it’s terrible at.
Expand your perspective… AI excels at generating alternative hypotheses you might not think of. It’s like having a brainstorming partner. Like our fellow colleagues, it’s not infallible. Treat it as such.
Guide through unfamiliar territory… When I encounter something completely outside my domain - like ‘what does this Kubernetes error actually mean?’ - AI can explain it in terms I understand, suggest what to check next, or translate technical jargon into plain language. It’s like having a patient mentor.
Use as a rubber duck… You know the term rubber duck debugging? It involves talking to an inanimate object about your problem, and in doing so you can improve your thinking and decision making process. AI is a much better rubber duck. Instead of only hearing your own voice, you can have a dynamic conversation.

AI pitfalls

AI should assist you and extend your investigation, not be the sole form of troubleshooting.

AI isn’t a replacement for thinking - you still need to apply the systematic approach
AI can accelerate confirmation bias if you ask leading questions or only seek supporting evidence
Don’t skip the verification step - always Trust, but verify

Speaker Notes

AI isn’t a replacement for thinking… AI isn’t a magical oracle. We are the Directly Responsible Individuals (DRI) for the problem, and we can’t outsource that.
AI can accelerate confirmation bias… This is subtle but crucial. If you ask AI ‘Why might this be a database problem?’, it’ll give you a dozen reasons why it could be the database - even if it’s not. You’ve primed it to confirm your existing theory. Instead, ask ‘What could cause these symptoms?’ and let it suggest multiple possibilities, including ones that contradict your current thinking.
AI should assist you and extend… AI can help gather information, suggest hypotheses, explain concepts - but you’re still the one making decisions, testing solutions, and understanding the bigger picture. The moment you stop thinking and just follow AI’s suggestions blindly, you’re no longer troubleshooting.

AI troubleshooting strategies

Four approaches to choose from:

Start minimal - avoid confirmation bias, let AI generate fresh hypotheses
Full context - provide all relevant information and context upfront
Devil’s advocate - ask AI to argue against your current hypothesis
Step-by-step guidance - get structured checklists for unfamiliar domains

Choose the approach based on your situation and what you need from AI.

Speaker Notes

Start with minimal information… This seems counter-intuitive - we usually want to give AI all the context we have. But sometimes that context includes our biases and assumptions. It might cause us to fall deeper into unproductive rabbit holes. Feed AI additional context during the discussion.
Full context… can be useful when you need to move fast and you’re confident in your investigation direction. Or, if you are totally stuck and you need some direction. When you have hundreds of lines of logs, multiple error messages, and a complex timeline, AI can spot correlations you might miss. It can also make summaries that help you make sense of complexity.
Devil’s advocate… This challenges your assumptions. This is uncomfortable but valuable. When AI challenges your theory, you have to articulate why you believe what you believe. Sometimes you realize your evidence is weaker than you thought. Other times, defending your position helps you identify what additional evidence you need to confirm or refute your hypothesis.
Step-by-step guidance… This is perfect when you’re dealing with something completely outside your comfort zone. Instead of fumbling around randomly, you can ask AI to give you a structured approach. It’s like having a troubleshooting playbook for domains you’re unfamiliar with. This helps to identify where you might want to focus your investigation, and start eliminating lines of inquiry that are not relevant.
Main message: The point isn’t to memorize these approaches, but to understand that AI is a flexible tool. Match the approach to your needs.

Reproducing the problem: Why it matters

Reproduction allows you to verify both the problem and your proposed solution.

Confirms the problem type - is it consistent, intermittent, or truly random?
Shapes your strategy - different problem types require different approaches
Enables controlled testing - you can test potential solutions safely
Eliminates guesswork - moves from theory to observable reality
Builds confidence - both yours and the customer’s in the proposed solution

Speaker Notes

Confirms the problem type… You can’t know which approach to take until you understand the problem’s nature.
Enables controlled testing… We’re not only testing the solution, but testing how the problem is reproduced
Builds confidence… When you can reliably make something fail and then make it work, you and the customer both know that the problem and the solution are real. There’s no uncertainty. It also helps our engineers understand the problem and fix the bug more quickly. I’ve seen many bug reports sit in backlogs indefinitely because they’re vague and not reproducible. When you can hand over exact reproduction steps, developers are more likely to identify and fix the root cause.

Question time!

What are the main things you think about when trying to reproduce a customer’s issue?

Aspects of reproducing a problem

Four main aspects to reproducing a problem:

Reproduce the scenario
Reproduce the environment
Reproduce the data conditions
Reproduce the architecture

Speaker Notes

Keep in mind that these are high-level aspects to reproducing a problem. We’ll dive deeper into each of these on their own slides.

Reproduce the scenario

Sequence of events/actions - order matters
Timing patterns and conditions - load patterns, batch job schedules, race conditions, session timeouts, cache expiry, peak usage times, etc.
User journey and context - what were they trying to accomplish vs. what they are actually doing?

Speaker Notes

Sequence of events/actions… The scenario’s sequence of steps might be triggering a very specific condition
Timing patterns and conditions… Finding a time-based pattern to the scenario can save a lot of time for some types of problems. Example: I helped with a ticket recently where a CI/CD job was failing. It used service containers. The health check of the service container was fine, but when the job started executing, it failed. My hypothesis was that the service container was being killed between the health check and the start of the job’s execution. I tested that hypothesis by building a custom Docker image that mimics normal service behaviour initially, but then self-destructs after a predetermined time using a “time-bomb” script. This approach allowed me to precisely control the timing window - the service container stayed up long enough for the health check to pass, but died before the container linking phase. The result was an error message exactly matching the customer’s error message, successfully reproducing the problem, and allowing us to test our solutions and workarounds.
User journey and context… This is about understanding intent versus action. Understanding what they’re ultimately trying to achieve helps you spot when they’re taking an unusual path. Their ‘problem’ might be a training issue, not a technical issue. The best kinds of problems are not problems at all.
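
The "time-bomb" idea from the CI/CD example can be sketched in a few lines of Python. This is not the custom Docker image or script used on that ticket; it is a hypothetical stand-in showing the same technique: a service that looks healthy at first and then removes itself after a controlled delay. The names `start_time_bomb`, `is_healthy`, and `fuse_seconds` are invented for this sketch.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with 200, like a healthy service would."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def start_time_bomb(fuse_seconds, port=0):
    """Start a healthy-looking service that stops itself after fuse_seconds."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)  # port=0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()

    def detonate():
        server.shutdown()      # stop the serving loop
        server.server_close()  # release the port so new connections are refused

    timer = threading.Timer(fuse_seconds, detonate)
    timer.daemon = True
    timer.start()
    return server, timer


def is_healthy(port, timeout=1.0):
    """Return True if the service answers a GET with HTTP 200."""
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
        conn.request("GET", "/")
        status = conn.getresponse().status
        conn.close()
        return status == 200
    except OSError:
        return False


if __name__ == "__main__":
    server, timer = start_time_bomb(fuse_seconds=1.0)
    port = server.server_address[1]
    print("health check passes:", is_healthy(port))
    timer.join()  # wait for the fuse to run out
    print("still healthy afterwards:", is_healthy(port))
```

The value of the fuse is that the timing window becomes a parameter you control, rather than something you wait for by chance. Adjusting `fuse_seconds` lets you place the failure before, during, or after whichever step you suspect.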

Reproduce the environment

Match their software stack as closely as possible - version mismatches are a common cause of “works for me” issues.

Versions: GitLab, GitLab Runner, Kubernetes, Docker
Configuration: Settings, integrations, customizations
Tier and Feature Flags - what’s enabled/disabled?
Dependencies - external services, databases, third-party tools

Speaker Notes

The problem may be related to specific versions of components, or incompatibilities between versions. Specific configuration settings or conflicts. Feature flags that have been enabled or disabled.
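
One way to make "match their stack" checkable is to write both environments down and diff them. The sketch below is a minimal, hypothetical example: the `diff_environments` helper, the component keys, the version numbers, and the feature flag name are all made up for illustration, and in practice you would fill the dictionaries from whatever the customer and your test instance actually report.

```python
def diff_environments(customer, repro):
    """Return {component: (customer_value, repro_value)} for every mismatch."""
    mismatches = {}
    for key in sorted(set(customer) | set(repro)):
        if customer.get(key) != repro.get(key):
            mismatches[key] = (customer.get(key), repro.get(key))
    return mismatches


# Hypothetical values, for illustration only.
customer_env = {
    "gitlab": "16.11.2",
    "gitlab_runner": "16.9.0",
    "docker": "24.0.7",
    "feature_flag:example_flag": True,
}
repro_env = {
    "gitlab": "16.11.2",
    "gitlab_runner": "16.11.0",
    "docker": "24.0.7",
}

for component, (theirs, ours) in diff_environments(customer_env, repro_env).items():
    print(f"{component}: customer={theirs!r} repro={ours!r}")
```

A value of `None` on one side means the component or flag was never recorded for that environment, which is itself worth following up on.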

Reproduce the data conditions

Data size, content, complexity, and format can trigger specific issues - create test data that mimics the customer’s data characteristics.

Don’t just replicate the process, replicate the actual data
Data volume - small test vs. production-scale datasets
Data characteristics - special characters, encoding, formats
Data state - existing records, permissions, relationships
Sometimes the “irrelevant” details are the key
Respect data security - create realistic test data when you can’t use actual customer data

Speaker Notes

Don’t just replicate the process… This is where a lot of troubleshooting falls short. We test the happy path with clean, simple data. But Production data is messy. It has edge cases, unusual formats, unexpected sizes. The process might work fine with your test data but break with their real-world complexity.
Data volume… Testing with 10 records is different from 10,000 records. Pagination breaks, queries time out, memory usage explodes, UI becomes unusable. I’ve reproduced performance issues when replicating similar data volume and complexity as the customer. I was able to provide empirical evidence of performance bugs where increases in data volume directly correlated with higher response times. You can even plot this data in a graph to better highlight the problem.
Data characteristics… We need to confirm whether the format or content of the data itself is causing the problem. Example: I had a ticket where a project on GitLab.com couldn’t be deleted. I wasn’t able to figure it out initially, so I raised an RFH. Even then, the backend engineers were also unsure of the problem. I went back to basics, and took a look at the project’s characteristics. I noticed that the project description was not only long, but had some special characters in it. I ran a few tests of my own with similarly sized and complex project descriptions, and it turned out that the length alone reproduced the problem. You could create a project description of that length, just not delete it. So removing or truncating the project description allowed the deletion to succeed. This analysis allowed a bug report to be quickly raised and fixed.
Data state… It’s not just the data itself, it’s the relationships between entities and objects. The specific state or transitions of state.
Respect data security… In most cases you can’t just copy Production data, but you can create realistic test data that matches their patterns. Try to determine their data requirements based on the points above, and replicate those patterns safely.
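
As a rough illustration of the data volume and data characteristics points, the sketch below generates synthetic text with a controllable length and share of special characters, then times an operation at increasing volumes. Everything here is an assumption made for the example: `make_description`, `time_operation`, the `SPECIAL` character set, and the use of `sorted` as a stand-in for the real operation under test (an API call, an import, a deletion).

```python
import random
import string
import time

# A small, arbitrary mix of accented, CJK, emoji, and markup-sensitive characters.
SPECIAL = "éñ漢字🙂<>&'\"\\"


def make_description(length, special_ratio=0.1, seed=0):
    """Build a reproducible string of the given length with some special characters."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + " "
    return "".join(
        rng.choice(SPECIAL) if rng.random() < special_ratio else rng.choice(alphabet)
        for _ in range(length)
    )


def time_operation(operation, volumes, make_input):
    """Run operation on inputs of growing size; return [(volume, seconds), ...]."""
    results = []
    for n in volumes:
        data = make_input(n)
        start = time.perf_counter()
        operation(data)
        results.append((n, time.perf_counter() - start))
    return results


def make_records(n):
    """n synthetic records, each a 50-character description."""
    return [make_description(50, seed=i) for i in range(n)]


# `sorted` stands in for the real operation being investigated.
for volume, seconds in time_operation(sorted, [100, 1000, 5000], make_records):
    print(f"{volume:>6} records: {seconds:.6f}s")
```

The `(volume, seconds)` pairs are the raw material for the kind of graph mentioned above. Fixing the seed means the same "messy" data can be regenerated later, so a colleague or engineer can repeat the measurement without access to customer data.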

Reproduce the architecture

Provider: AWS, GCP, Azure - cloud-specific behaviors
Architecture pattern: Single node, distributed, cloud native, hybrid
Resource constraints - CPU, memory, disk, network limits
Network topology - firewalls, load balancers, proxies
Geographic distribution - latency, regional differences

Speaker Notes

This is often the last piece of the puzzle. You’ve matched versions, replicated the scenario, created similar data, but the problem still won’t reproduce. That’s when you need to look at the infrastructure layer. Architecture differences can create problems that are completely invisible until you match the exact setup.
Architecture pattern… Single-node setups have different failure modes than distributed systems. Distributed systems add complexity, and this complexity is where problems can occur. Customers may also implement architecture patterns that are not recommended.
Resource constraints… Problems that are invisible with abundant resources become obvious under load. Memory leaks, CPU spikes, disk I/O bottlenecks - these often only appear when resources are limited.
Network topology… Although we don’t provide direct support for their network setup, it’s useful to test whether this may be a contributing or causing factor.

Code is the best documentation

Documentation shows intent - how things should work in theory
Code shows reality - how things actually work in practice
Code is never out of sync - it IS the current implementation
GitLab is open source - embrace this advantage and use it for troubleshooting
Use Git blame and commit history - understand why code changed and when

Speaker Notes

Caveat: This requires some programming literacy, but even basic code reading skills can be incredibly valuable for troubleshooting. You don’t need to be a developer, just curious enough to trace through logic. The more you expose yourself to reading code, the more comfortable you’ll be reading code in the long run.
Code shows reality… During troubleshooting, I often consult the actual codebase because our documentation represents an aspirational view of how things should work. The code, however, shows exactly how it does work. There’s always a gap between intended and actual implementation. Sometimes the truth lies in the middle ground between how things are and how they’re intended to be.
GitLab is open source… Having access to the code is a huge benefit to troubleshooting. This is the difference between black box and white box testing. Black box tests the functionality without consideration for its implementation. White box tests consider the internal code structure, code paths, etc. With open source, we can see inside the system instead of just guessing from external behavior. This lets us target specific code paths and conditions during reproduction.
Code is never out of sync… Documentation can become outdated, but the code is literally what’s being executed.

Use code to guide troubleshooting

Use code to inform reproduction - understand what conditions trigger specific behaviors
Code reveals edge cases - conditions and validations that might not be documented
Don’t assume you understand the code - reading code creates assumptions that must be verified like any other hypothesis

Speaker Notes

Use code to inform reproduction… Think about the aspects of reproduction we discussed earlier. This point speaks to reproducing the scenario and the data conditions, to hopefully achieve a specific outcome, based on our analysis.
Code reveals edge cases… Conditional logic, error handling, state machine transitions, event hooks, validation rules - these are all explicitly defined in the code and can guide your troubleshooting.
Don’t assume you understand the code… This is crucial - just because you’ve read the code doesn’t mean you understand it correctly. Code reading creates assumptions about behaviour, and those assumptions need to be tested just like any other hypothesis in our systematic approach. I’ve been wrong many times about what I thought code was doing versus what it actually does.

In Logs We Trust

Logs are crucial - they’re the foundation of any investigation
Logs tell the story - What occurred, When, and sometimes Why
Correlate across components - when multiple systems are involved
Match the time period - ensure logs cover when the problem occurred
Get all participant logs - every component in the chain matters

Speaker Notes

Logs are the currency of troubleshooting. They’re objective evidence of what actually happened, not what people think happened.
Correlate across components… Modern systems are distributed - GitLab connects to Redis, PostgreSQL, Sidekiq, and external services. An error in one component often cascades to others. What you initially see as a cause might be a symptom of an earlier or larger problem.
Match the time period… This sounds obvious but gets missed constantly. Always verify you’re looking at logs that cover the exact timeframe when the problem occurred. Make sure it’s the correct timezone, too.
Get all participant logs… Don’t just look at the component that’s throwing the error - look at everything in the request path. Each component can contribute to the problem, and you need the full chain to understand what happened.
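
The timezone point is easy to get wrong, so here is a small sketch of the check in Python. It assumes log timestamps are ISO 8601 strings and that the customer reported the problem in local time at a known UTC offset. The helper names, the window sizes, and the sample entries are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc(timestamp, utc_offset_hours=0):
    """Parse an ISO 8601 timestamp; if it has no offset, assume utc_offset_hours."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return parsed.astimezone(timezone.utc)


def in_window(entries, reported_at,
              before=timedelta(minutes=10), after=timedelta(minutes=5)):
    """Keep entries whose timestamp falls around the reported time."""
    start, end = reported_at - before, reported_at + after
    return [entry for entry in entries if start <= entry[0] <= end]


# The customer says "2:15 PM" and is at UTC+10; the logs are written in UTC.
reported_at = to_utc("2024-05-01T14:15:00", utc_offset_hours=10)

# Hypothetical log entries: (timestamp, component, message).
entries = [
    (to_utc("2024-05-01T04:12:30+00:00"), "redis", "connection timeout"),
    (to_utc("2024-05-01T04:15:02+00:00"), "rails", "500 Internal Server Error"),
    (to_utc("2024-05-01T14:15:00+00:00"), "rails", "request completed"),
]

relevant = in_window(entries, reported_at)
for when, component, message in relevant:
    print(when.isoformat(), component, message)
```

The third entry has the same clock time the customer quoted, but in UTC it is ten hours away from the problem. Searching for "14:15" without converting first would have pointed the investigation at a perfectly healthy request.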

Correlate log events

Search by time - find errors occurring in similar timeframes as the problem
Search by context - events that match your investigation criteria
Zoom out, zoom in - look for specific things, but also look for patterns
Look for presence or absence - sometimes what’s missing is the clue
Use identifiers to correlate - across same or different log sources:
- Correlation id, Job id/Pipeline id, Sidekiq jid, Container id or pod name

Speaker Notes

Search by time… Often you’ll find related events that happened slightly before or after the reported problem time. The user saw the symptom at 2:15, but the cause might have started at 2:12. You might notice a big spike at a certain time, but that doesn’t necessarily mean that the problem started then.
Search by context… Once you have a time window, search for specific context: user IDs, project names, IP addresses, error patterns.
Zoom out, zoom in… Start broad - look for overall patterns in the time window. High error rates? Unusual traffic? Resource spikes? Then zoom in on specific events. Maybe there’s a pattern.
Look for presence or absence… Depending on the problem, we might see things we shouldn’t, or things we should see aren’t there. Ask yourself what you should be seeing, versus what you are seeing.
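Correlating by identifier, and checking for absence, can be sketched as follows. This is a minimal example, not a description of any particular tool: it assumes JSON-formatted log lines with `correlation_id`, `time`, and `message` fields, and the sample data and the helper names `group_by_correlation_id` and `missing_sources` are hypothetical.

```python
import json
from collections import defaultdict


def group_by_correlation_id(sources):
    """Group JSON log events from several sources by their correlation id."""
    grouped = defaultdict(list)
    for source, lines in sources.items():
        for line in lines:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # Skip lines that are not structured JSON.
            if not isinstance(event, dict):
                continue
            correlation_id = event.get("correlation_id")
            if correlation_id:
                grouped[correlation_id].append((source, event))
    # Order each group by timestamp so it reads as a timeline.
    for events in grouped.values():
        events.sort(key=lambda item: item[1].get("time", ""))
    return dict(grouped)


def missing_sources(events, expected):
    """Return the sources we expected to log for this request but did not."""
    seen = {source for source, _ in events}
    return expected - seen


# Hypothetical log lines from two components.
sources = {
    "rails": [
        '{"time": "2024-05-01T04:14:58Z", "correlation_id": "abc123", "message": "POST /api/v4/jobs"}',
        '{"time": "2024-05-01T04:15:01Z", "correlation_id": "def456", "message": "POST /api/v4/jobs"}',
    ],
    "sidekiq": [
        '{"time": "2024-05-01T04:14:59Z", "correlation_id": "abc123", "message": "job enqueued"}',
        "not a json line",
    ],
}

grouped = group_by_correlation_id(sources)
timeline = [(source, event["message"]) for source, event in grouped["abc123"]]
absent = missing_sources(grouped["def456"], {"rails", "sidekiq"})
print(timeline)
print(absent)
```

Here the request `abc123` shows up in both components, while `def456` never reaches the second one. That missing entry is the "absence" worth asking about.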

Keep your cool

Acknowledge the pressure exists but don’t let it drive your decisions
Breathe and reset when you feel stuck or overwhelmed
Stay curious - pressure kills curiosity, but questions lead to answers
Separate urgency from panic - urgent doesn’t mean abandon good practices
Communicate transparently - “I don’t know yet, but here’s what I’m checking”
Trust your process even when others want shortcuts

Speaker Notes

Breathe and reset… When you feel panic creeping in - when you’ve been staring at logs for 30 minutes and nothing makes sense - literally stop. Take some deep breaths. Step away from the screen for 60 seconds if you have to. Panic narrows your cognitive focus and makes you miss obvious things.
Stay curious… Pressure pushes us towards assumptions, and assumptions can often lead to wrong solutions.
Separate urgency from panic… Urgent means important and time-sensitive. You can move quickly while still being systematic.
Communicate transparently… Customers prefer bad news to silence. ‘I don’t know yet, but here’s what I’m checking next’ is infinitely better than radio silence. It shows progress, demonstrates competence, and buys you time to work.

Know when to escalate

Escalation is a skill, not a failure!

We can’t know everything
We are a team, not a collection of individuals

Early escalation signals:

You’ve exhausted your systematic approach without significant progress
The problem requires domain expertise you don’t have
Customer impact is escalating faster than your investigation

Speaker Notes

Escalating isn’t due to an individual failure. It’s the next logical step in the investigation process. We’re a team and we need to leverage the resources, experience, and knowledge of that team.

How to escalate effectively

Summarize what you’ve tried - save others from repeating your work
Explain the customer impact and timeline pressure
Be timely - don’t wait until you’re completely stuck, or the customer is frustrated
Target the right group - identify who has the specific expertise you need
Have a specific ask - not “I need help!” but “Can you help me understand why X is happening when Y occurs?”

Speaker Notes

Explain the customer impact… Impact and timeline help people prioritize their response appropriately.
Be timely with escalation… Don’t be a hero. If you’re stuck and it’s something outside your expertise, escalate. And definitely don’t wait until the customer is angry. Escalating before the customer gets frustrated looks much better than escalating because the customer demanded a manager.
Target the right group… The more specific you are, the faster you’ll get to the right person.
Have a specific ask… Ambiguous questions get ambiguous answers. People are more motivated to assist when the problem and request are understandable and seem achievable. Consider customer tickets that lack actionable information, and compare that to your ask. Would you be satisfied with receiving a ticket with that information? If not, restructure your ask and ensure it has the right information.

Capture what you’ve learned

Document your journey, not just the destination
- What you tried and why (even the dead ends)
- The key evidence that led to the solution
Write for the next person who encounters this problem
Update existing documentation if you found gaps or inaccuracies
- Don’t let the next person fall into the same traps
Create KB articles for novel problems - if this was hard to solve, someone else will face it too
Include preventive measures when possible - how to avoid this problem in the future

Speaker Notes

Every problem you document makes the entire team better at troubleshooting. Your hard-won knowledge becomes organizational knowledge. That’s how you build a really effective support team.
Write for the next person… That person could be you! Make your colleagues successful, not just informed.
Update existing documentation… If you had to learn something the hard way because existing docs were wrong or incomplete, fix them. Don’t let the next person waste the same hours you did. Fix the absences or inconsistencies as you identify them.

We are our values

GitLab Values are very useful in the troubleshooting process. Let’s look at some of our operating principles and see how they help us troubleshoot.

Serving our customers

“Assume positive intent” \ “Kindness” \ “Say sorry” \ “No ego”

Customer management is problem management - expectations and satisfaction matter
Assume good faith - they’re doing their best to help (usually)
Communicate with empathy - they may be under intense pressure to get this resolved
Separate work from person - be kind, even when frustrated. To the customer, and to yourself
Own mistakes quickly - apologize and move forward
Check your ego - focus on the problem, not yourself

Knowledge isn’t everything

“It’s impossible to know everything” \ “Self-service and self-learning”

Not knowing is the first step - to knowing something
Approach matters more than knowledge - it’s OK not to know things
Uncertainty is guaranteed - roll with it!
Perfect information doesn’t exist in our world - we never have the complete picture
Optimize information access - get better at finding information, and making sense of what you have

Keep an open mind and be prepared for failure

“Write things down” \ “Low level of shame” \ “Articulate when you change your mind”

Use the Scientific Method - observations, hypotheses, thorough testing
Explore different paths - don’t fear going down the wrong road
Expect to be wrong - you will be wrong more often than right, adapt quickly
Document everything - observations, hypotheses, and tests reduce the problem scope, and help others

Speaker Notes

Use the Scientific Method… I mention this because we shouldn’t be emotionally invested in a specific outcome, theory, or resolution. We should go where the observations and tests tell us. We should be open to new possibilities, and change our view based on new information or evidence.

Balance the need for problem solving and customer results

“Customer results” \ “Sense of urgency” \ “Have Ownership & Accountability” \ “Reach across company departments”

Customer results ≠ perfect solutions - workarounds and mitigation count
Mitigate first, solve later - reduce customer impact while investigating
Engage help early - you can’t solve everything alone
Take ownership with urgency - this drives better decision-making and gets better results
Keep communicating - even when there’s no progress to report

Own your troubleshooting journey

“Manager of one” \ “Operate with a bias for action” \ “Give agency”

Take initiative in investigations - don’t wait for someone to tell you what to check next
Own the problem - from initial customer contact through resolution and documentation
Make decisions autonomously - use your judgment on investigation paths and escalation timing
Empower yourself to learn - research unfamiliar concepts, dig into new technologies
Trust your process - you have the framework and skills to navigate uncertainty

Key takeaways

Be systematic, but stay curious - balance methodology with intuition
Trust, but verify - applies to customers, AI, and your own assumptions
Zoom in, zoom out - don’t get lost in the details
Document the journey - help the next person (including yourself)
Escalate early - it’s a skill, not a failure
Keep your cool - panic leads to missed details and rushed mistakes

Making sense of things that don’t make sense

The goal isn’t to memorize every technique - it’s to develop a systematic mindset that helps you:

Navigate uncertainty with confidence
Ask better questions
Find solutions faster
Learn from every problem you solve