Production: Error Tracking

In production, something will go wrong. It’s a matter of when, not if, and when it happens you need to:

Know when it’s gone wrong
Be able to quickly respond with a resolution

In development, both are easy. You run your environment locally, where you can watch problems happen in real time. If a button in your web app stops working, you can open your browser’s inspector, find the error in the console, and use its stack trace to fix the problem quickly.

In production, you don’t have these conveniences. You can’t expect your users to send you bug reports with a screenshot of their browser’s inspector: they have no idea what that is, and if they’re on mobile, good luck getting that stack trace from them.

How will you know when there are problems in production?

Errors and exceptions are happening in your code, and you’re handling them some way.

Maybe you just let your app crash and burn and restart it. Maybe you log unhandled exceptions. Maybe when connection failures happen, you retry forever until it works.

Do you know when these errors are happening? If not, you’re flying blind, and your users are likely bumping into problems without you knowing.

A note about silent errors

Be careful of swallowing errors silently:

fetch(url)
  .then(/* ... */)
  .catch(err => {
    console.log(err.message); // fail silently
  });

Without a system in place to aggregate these logs and act on them, messages like this aren’t much use in production.
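One way to avoid the silent failure is to hand the error to something that records it, then rethrow so the caller still knows the request failed. This is a minimal sketch, not a real SDK: `reportError`, the `reported` array, and `loadData` are hypothetical names I made up for illustration, and a real app would send the report to an error tracking service instead of keeping it in memory.

```javascript
// Hypothetical in-memory reporter. A real app would send these reports
// to an error tracking service instead of pushing them onto an array.
const reported = [];

function reportError(err, context = {}) {
  reported.push({
    name: err.name,
    message: err.message,
    stack: err.stack, // keep the stack trace, not just the message
    context,
    time: new Date().toISOString(),
  });
}

// Same shape as the snippet above, but the failure is reported and rethrown
// rather than swallowed. Assumes a runtime with a global fetch.
function loadData(url) {
  return fetch(url)
    .then((res) => res.json())
    .catch((err) => {
      reportError(err, { url });
      throw err; // let the caller decide how to recover
    });
}
```

Rethrowing is a design choice: the reporter’s job is visibility, and whoever called `loadData` is still in the best position to decide what the user should see.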

Error tracking is not the same as logging

A log is a continuous stream of real-time information about your system. This is useful for seeing how the business logic in your application is behaving and to debug anomalies you may find in your system.

But logs don’t have certain properties that are important for error tracking:

Stack traces: You don’t log a stack trace of every log statement. When an exception is thrown, you need to have as much context as possible to quickly debug why the error occurred: the stack trace is where you’ll likely find your first clues on what went wrong.
Aggregation/frequency tracking: You need to know when a new bug occurs and how often. If you don’t know the frequency of a bug, you can’t make an informed choice on whether to take action on it or not because you don’t know how severe it is.
Resolution: Errors should be bound to a version/build of your application so you can resolve them with the next release.
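To make those three properties concrete, here is a rough sketch of what a tracker might store per issue: the latest stack trace, a running count, and the releases it was seen in. Everything here is an assumption for illustration, including the `track` and `fingerprint` names and the grouping rule (error name plus the second line of the stack, which in Node is the top frame). Real services use more careful grouping.

```javascript
// Hypothetical in-memory issue store, keyed by fingerprint.
const issues = new Map();

// Assumed grouping rule: error name plus the second line of the stack.
function fingerprint(err) {
  const lines = String(err.stack || '').split('\n');
  return `${err.name}|${(lines[1] || '').trim()}`;
}

// Records one occurrence of an error against the release it happened in.
function track(err, release) {
  const key = fingerprint(err);
  let issue = issues.get(key);
  if (!issue) {
    issue = { key, count: 0, firstSeenIn: release, releases: new Set(), lastStack: null };
    issues.set(key, issue);
  }
  issue.count += 1;            // frequency tracking
  issue.releases.add(release); // bound to a version/build
  issue.lastStack = err.stack; // stack trace for debugging
  return issue;
}
```

Errors created at the same place in the code end up under one issue, so you get a single entry with a count instead of a pile of identical log lines.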

Alerts

With logs, you could set up a regex or a pattern to match and get notified when something you care about happens. For a quick and dirty solution, you could log your errors and alert on those matches. But what if the error happens 100 times/minute? If you’re sending alerts to your email: RIP your inbox.

You need a way of getting alerts in a timely manner without flooding yourself with notifications when many errors occur. Aggregation is key here, and your error tracking service should do it for you. Without it, as in the email scenario above, you’re going to be in for a world of hurt.
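If you’re curious what that aggregation might look like, here is a minimal sketch: alert immediately the first time an issue is seen, then send at most one summary per time window. `createAlerter`, `notify`, and the 60-second window are my own assumptions for illustration, not how any particular service works.

```javascript
// Returns a record(key, now) function that throttles alerts per issue.
// notify is any function that delivers a message (Slack, email, ...).
function createAlerter(notify, windowMs = 60000) {
  const state = new Map(); // issue key -> { lastAlertAt, pending }

  return function record(key, now = Date.now()) {
    const s = state.get(key);
    if (!s) {
      // First occurrence: alert right away.
      state.set(key, { lastAlertAt: now, pending: 0 });
      notify(`New issue: ${key}`);
      return;
    }
    s.pending += 1;
    if (now - s.lastAlertAt >= windowMs) {
      // The window has elapsed: send one summary instead of many alerts.
      notify(`${key} happened ${s.pending} more times`);
      s.lastAlertAt = now;
      s.pending = 0;
    }
  };
}
```

With this in place, an error firing 100 times/minute produces one alert when it first appears and one summary per window afterwards, rather than 100 emails.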

Once you’re aggregating errors with an error tracking service, here are some common ways to get alerts in a useful manner:

Slack: I like this as long as it’s what your company is already using. I say Slack, but it can be Discord, Microsoft Teams, whatever, as long as it’s part of the workflow your team already uses. Once a message is in Slack, you can reference it and start threads to discuss the error right inside the chat.
Email: If you use your email inbox as a todo list, this may be helpful for you, as you can keep important issues around until you’re ready to archive them away. Most error tracking services will keep track of issues on their website, but if you’re not spending time visiting them on a regular basis, you’re blind to them.
PagerDuty: This is for larger teams that need coordination and scheduling of who’s on call and when. Early stage companies likely don’t need this, but as a team grows, it may be worth looking into.

Lifecycle of an error

You deploy a new version of your app: v1.3.2.
You get notified of a new exception in a Slack channel for your organization: Issue #1337. A conversation between engineers starts around this exception in a thread.
You check your error tracking service and see that Issue #1337 is happening 100 times/minute. Not only that, but it seems to relate to the paid sign-up flow of your application. Every error that you’re seeing is costing your organization real money.
The problem is severe enough that your team decides it’s worthwhile to roll back your production release to v1.3.1. You roll back, and you stop seeing the errors coming in.
The team steps back for a second and takes a look at stack traces and logs to try to debug and understand what was wrong.
One of your devs, Ruby, writes a test that exercises the bug, both to confirm it can be reproduced and to document the problem in your specs.
Ruby then updates your production code to fix the bug.
Ruby marks Issue #1337 as resolved in your error tracking service.
Ruby deploys a new version of your app with a fix: v1.3.3.
The team closely monitors for errors in the v1.3.3 release to see if Issue #1337 has been successfully resolved or if Ruby’s new code has introduced new bugs.
Thankfully, the deploy was successful, and no new bugs were introduced. Good work! Time for a beer 🍻
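The “mark as resolved, then watch the next release” part of that lifecycle can be sketched in a few lines. This is a simplified model under my own assumptions: the function names, the `open`/`resolved`/`regressed` statuses, and the plain `vMAJOR.MINOR.PATCH` version format are all hypothetical, not any service’s actual API.

```javascript
// Parses "v1.3.2" into [1, 3, 2]. Assumes plain vMAJOR.MINOR.PATCH strings.
function parseVersion(v) {
  return v.replace(/^v/, '').split('.').map(Number);
}

// Returns -1, 0, or 1 depending on whether a is older, equal, or newer than b.
function compareVersions(a, b) {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

function createIssue(id) {
  return { id, status: 'open', fixedIn: null };
}

// Marks the issue as resolved, expecting the fix to ship in fixedIn.
function markResolved(issue, fixedIn) {
  issue.status = 'resolved';
  issue.fixedIn = fixedIn;
}

// Called for every new occurrence. An occurrence in the fixed release
// (or a later one) means the fix did not work, so the issue regresses.
function recordOccurrence(issue, release) {
  if (issue.status === 'resolved' && compareVersions(release, issue.fixedIn) >= 0) {
    issue.status = 'regressed';
  }
  return issue.status;
}
```

Binding the resolution to a version is what lets the tracker ignore stragglers from v1.3.2 while still raising a flag if Issue #1337 shows up again in v1.3.3.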

Error tracking services

Here are some I’ve used in the past: