Managing high availability at Intercom

Senior Director of Engineering, Intercom

Intercom is a product-led company, focused on maximizing product innovation and development velocity.

That also means we hold ourselves to high operational standards: minimizing costs, speedily addressing quality issues that arise within existing products, and mitigating security risks.

The foundation of our operational health is availability. Without rock-solid availability, nothing else matters. To achieve our mission of making internet business personal and scaling to support larger and larger customers, we have been continually, thoughtfully, and carefully investing in our people, systems, and processes to maintain Intercom’s high standards of availability.

The secret to Intercom’s success in this realm is simple: years of consistent, careful, and multi-faceted cultural, organizational, systems and software engineering work. This is why our customers – from small startups to massive, complex enterprises – put their trust in us.

Here’s how we do it.

Availability is embedded in our culture

The role each Intercom employee plays in supporting availability is embedded deep in our culture. For engineering teams, getting code to production as quickly as possible in small batches allows us to learn and iterate.

Rather than slow our engineers down, we invest in systems and build a culture around what it means to build quickly and safely. We educate our engineers in the relevant policies and processes throughout their Intercom careers, from employee onboarding to performance reviews. It’s not just isolated code changes that are shipped safely: our entire software delivery process aims to ensure that, at every stage, we are building products that are reliable and scalable by default.

Our observability toolset is world-class and empowers every engineer in the company to understand, to the most minute level of detail, how the code they ship behaves in a production environment. Despite all our best intentions, however, some code changes will inevitably cause issues. That’s why we also invest in mechanisms that allow us to recover from issues even faster than we deploy.

We build strong technical foundations

We build exclusively with a very small number of technologies, backed by a specific “core technologies” enablement team. We’ve developed deep expertise in these technologies over time, and our architecture choices and implementation patterns are simple and proven at scale. As a result, we know how to design and build for reliability.

Using these core technologies, we invest in building and maintaining shared systems and tools that underpin our ability to ship code safely, and to recover quickly in the event anything goes wrong.

This type of automation gives us the ability to deploy a change to a small percentage of customer traffic, or to a specific set of customers, in order to understand its impact. We can easily toggle any customer’s access to a feature on or off, which is a useful capability if an incident occurs. We can also recover by pushing a button to ‘roll back’ to a safe working version of the code, in less than five minutes.
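To make the idea of a percentage rollout with per-customer toggles concrete, here is a minimal Python sketch. It is an illustration only, not a description of Intercom’s actual tooling: the `FeatureFlag` class, the `rollout_bucket` helper, and the flag and customer names are all hypothetical. It assumes each customer is hashed into a stable bucket from 0 to 99, so the same customer always gets the same answer and raising the percentage only ever adds customers.

```python
import hashlib
from dataclasses import dataclass, field


def rollout_bucket(flag_name: str, customer_id: str) -> int:
    """Map a (flag, customer) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


@dataclass
class FeatureFlag:
    """Hypothetical flag: a rollout percentage plus explicit per-customer overrides."""
    name: str
    percentage: int = 0                       # share of customers enabled, 0..100
    allow: set = field(default_factory=set)   # customers explicitly switched on
    deny: set = field(default_factory=set)    # customers explicitly switched off

    def is_enabled(self, customer_id: str) -> bool:
        # Explicit overrides win, and "off" beats "on" so that a feature
        # can always be disabled for a customer during an incident.
        if customer_id in self.deny:
            return False
        if customer_id in self.allow:
            return True
        return rollout_bucket(self.name, customer_id) < self.percentage
```

With this shape, setting `percentage` to 5 exposes a change to roughly 5% of customers, adding an ID to `allow` targets a specific customer, and setting `percentage` back to 0 (or adding an ID to `deny`) switches the feature off without a deploy.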

We maintain a very close relationship with our primary cloud infrastructure vendor, Amazon Web Services (AWS), so that together we can continually assess the robustness of our infrastructure platform and identify opportunities to evolve and further strengthen our reliability.

We manage risk and respond instantly when things go wrong

Within the engineering org, we have a dedicated Availability Technical Program Manager (TPM) driving a cross-Intercom program focused on continuously strengthening and protecting our availability. Part of the program’s strategy is to identify, prioritize, and mitigate risks that would threaten that availability.

The program team works with managers across Engineering to fully understand any risks we’re facing. These risks are then prioritized as inputs to engineering roadmaps, with the TPM helping to ensure the mitigation work is carried out on schedule.

When we encounter an incident impacting our customers, our extensive monitoring and alarming platforms pick it up almost instantaneously, and our incident response process kicks into gear. Our customers are truly global, and that means we provide 24/7 on-call engineering and incident management support.

Our emergency responders are online and respond within minutes of being paged, joined by an Incident Commander. The Commander’s immediate focus is on minimizing customer impact, and they coordinate the entire effort, including issue identification, triage, communications, and resolution. This is a highly disciplined and organized process, underpinned by very well-defined roles and operating principles.

Typically we resolve such incidents in minutes, posting updates to our status page while simultaneously working to restore service. Resumption of normal service is certainly not the end point for us though. A key part of our incident management process is the incident review, where we deep dive into the causes and contributing factors of an incident and look for learnings.

In an internal open forum, we’ll reflect on where we might have done better, and propose short-term action items as well as longer-term strategic changes. This meeting is one of the most beneficial for us: a reminder that being truly great requires dedication to continuous improvement.

What our focus on availability means for our customers

Our commitment to our customers’ success means that rock-solid availability is a must. Our holistic approach has allowed us to significantly exceed our target uptime of 99.8% for a number of years, providing a platform for growth that all our customers can trust.
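For a sense of scale, an uptime target can be converted into the downtime it would permit over a given period. The short Python sketch below does that arithmetic; the function name and the 30-day and 365-day periods are illustrative assumptions, and the result is the ceiling implied by the target, not a measurement of actual downtime.

```python
def downtime_budget_minutes(uptime_target_percent: float, period_days: float) -> float:
    """Minutes of downtime a given uptime target allows over a period."""
    minutes_in_period = period_days * 24 * 60
    return (1 - uptime_target_percent / 100) * minutes_in_period


# Budget implied by a 99.8% target over two illustrative periods.
monthly_minutes = downtime_budget_minutes(99.8, 30)
yearly_hours = downtime_budget_minutes(99.8, 365) / 60
```

A 99.8% target works out to roughly 86.4 minutes per 30-day month, or about 17.5 hours per 365-day year. Exceeding the target means actual downtime stays below those figures.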

This is the second in a content series diving into Intercom’s investments in supporting enterprises. Explore other articles in the series.