Honeycomb logo

Honeycomb made alert reporting 10x faster with Multitudes

Page disruptions chart
10x faster

alert reporting

Honeycomb is an observability platform that enables engineering teams to find and solve problems. As a company, they care deeply about the quality of the on-call experience for engineers.

When PagerDuty changed its data export functionality, the Honeycomb team lost the ability to run their own mission-critical reporting on the human impact of alerts. In response, the Multitudes team co-designed new PagerDuty features with Honeycomb. This let Honeycomb keep the alert reporting that would otherwise have disappeared, get the reports done 10x faster, and provide more transparency across the org – since anyone could now jump in and see which teams and people were most impacted by pages.


Background

As an observability company, the Honeycomb team cares deeply about the quality of the on-call experience. As Jess Mink (Sr. Director of Expand Engineering) explains, they have a clear ethos:

  1. You need to trust that the system will tell you when it’s unhealthy
  2. Any alerts that come through should be actionable and important
  3. Context matters – more noise is ok during the day but not at night. Also, the “noisiness” of the alert should vary based on severity. For them, the escalating noisiness might look like:
    1. send a message in a Slack channel
    2. create a ticket
    3. send a page

A big part of a good alert system is understanding how it impacts the people who get interrupted to fix things. As Jess said: “You need everything to have resiliency, especially when it comes to people’s lives”. 

Like any good analytics company, Honeycomb uses data as one check on how they’re doing with meeting their goals. Fred Hebert, Staff Site Reliability Engineer (SRE), runs regular evaluations on the quality of their on-call experience. 

Initially, Fred started with a weekly survey for on-call engineers – with questions about how disrupted people were by alerts and how confident they felt doing on-call. That was helpful, but it couldn’t answer questions about the overall incidence of alerts or what impact they were missing in the qualitative responses. 

To fix that, Fred had started pulling in quantitative data via the PagerDuty data export feature – looking at the number of on-call pages and when they happened.

The challenge

The first challenge was that even with the script Fred had written, getting the data out of PagerDuty still took a lot of manual effort. PagerDuty only provided one CSV per month, so for quarterly analysis he had to merge the CSVs by hand. He also had to do extra filtering to remove noisy data, like test pages and low-priority pages coming through from legacy components.
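The kind of merge-and-filter work described above can be sketched in a few lines of Python. The column names (`description`, `urgency`) and the sample rows here are invented for illustration – the real PagerDuty export schema may differ.

```python
import csv
import io

# Two hypothetical monthly exports, inlined as strings for the example.
# A real script would read the downloaded CSV files instead.
monthly_csvs = [
    "id,description,urgency\n1,DB latency,high\n2,test page,high\n",
    "id,description,urgency\n3,API errors,high\n4,disk alert,low\n",
]

rows = []
for blob in monthly_csvs:
    for row in csv.DictReader(io.StringIO(blob)):
        # Drop test pages, as was done by hand each quarter.
        if "test" in row["description"].lower():
            continue
        # Drop low-priority noise from legacy components.
        if row["urgency"] == "low":
            continue
        rows.append(row)

print([r["id"] for r in rows])  # only the pages that count for the report
```

Even a small script like this has to be re-run and re-checked every quarter, which is where the manual hours went.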

Since Honeycomb cares about the human impact of on-call, Fred also wanted to estimate how often folks were disrupted – in particular, how many pages were happening outside of working hours. With a geographically distributed team, it was nontrivial to adjust for timezones and working hours across team members. Fred had a way to approximate it but knew there were gaps.
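A minimal sketch of what timezone-aware out-of-hours detection involves is below. The per-person settings dictionary, the names, and the working window are all invented for the example; they stand in for the kind of per-person configuration described later in this story.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical per-person settings: (timezone, start hour, end hour).
WORKING_HOURS = {"fred": ("America/Montreal", 9, 17)}

def is_out_of_hours(person: str, page_utc: datetime) -> bool:
    tz, start, end = WORKING_HOURS[person]
    # Convert the page time into the responder's local time.
    local = page_utc.astimezone(ZoneInfo(tz))
    # Weekends count as out of hours.
    if local.weekday() >= 5:
        return True
    # So does any hour outside the person's working window.
    return not (start <= local.hour < end)

# 02:30 UTC on a Wednesday is late Tuesday evening in Montreal.
page = datetime(2024, 3, 6, 2, 30, tzinfo=ZoneInfo("UTC"))
print(is_out_of_hours("fred", page))
```

The subtlety is that the same page timestamp can be in-hours for one responder and out-of-hours for another, so the calculation has to run per person rather than against a single company timezone.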

With all the manual effort, Fred was spending most of a day (up to 5 hours) per quarter to get the report done.

And then: PagerDuty decided to sunset their data reporting feature, including the CSV data export that Fred had been using.

So Fred was stuck, with no way to keep producing the insights that Honeycomb leadership now expected to see.

The solution: Co-designing PagerDuty analyses

Multitudes already had a feature that could show whether an event happened within or out of working hours for a specific individual: Each person could configure their own timezone, working days, and working hours to match how they worked. Plus, the Multitudes team was already planning to do a PagerDuty integration.

So Fred and the Multitudes team decided to co-design new PagerDuty features for Multitudes. 

The co-design process was iterative, based on user research calls with Fred, Jess, and others at Honeycomb plus mockups from the Multitudes team for insights we could show in the product.

“I felt listened to throughout the co-designing process. I would give feedback and then saw the Multitudes team make changes immediately and iteratively over time.”

– Fred Hebert, Staff Site Reliability Engineer (SRE), Honeycomb

As Fred pointed out, metrics can mislead or be abused without the proper context, so he appreciated the care that the Multitudes team put into building thoughtful analyses – and everyone welcomed the additional visibility over PagerDuty data.

After a few rounds of feedback, the Multitudes team launched the Page Disruptions feature. It shows the human impact of pages – how many pages are coming through, who’s getting paged, and how often pages land during working hours versus out of hours. Thanks in part to feedback from Fred, the feature automatically excludes pages with “test” in the page name and has a smart fallback for incidents that were closed without ever being acknowledged in PagerDuty (the Resolved timestamp is used as the fallback).

Page Disruptions showing Out-of-hours pages are high

This was in addition to other PagerDuty insights – including Mean Time to Recovery (MTTR) and a view of which services had the most pages.

After the feature launched, the Honeycomb and Multitudes teams continued to iterate and improve together. Every time Fred used the feature or wrote a report, he’d share his experience back with the Multitudes team via Slack Connect, and the Multitudes team would make changes accordingly. One follow-on was adding Mean Time to Acknowledge (MTTA), since responders arguably have more control over this than MTTR.

Something Fred noticed with the Page Disruptions chart was that bursts of pages could easily skew the numbers, since one incident can often be reflected across multiple services. In terms of the human impact though, if a person has already been woken up to respond to a page, then it’s not causing the same amount of disruption if several other pages go off simultaneously or shortly afterwards. 

In response, the Multitudes team launched a V2 of the Page Disruptions chart with the option to show just the distinct hours that were disrupted by pages – not a total count of all pages.
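One way to picture the distinct-hours idea: truncate each page timestamp to the hour and count the unique hours, so a burst of simultaneous pages counts once. The timestamps below are invented for the example.

```python
from datetime import datetime

# Invented page timestamps: a burst of three pages from one incident
# firing across several services, plus one unrelated page later.
pages = [
    datetime(2024, 5, 1, 2, 5),
    datetime(2024, 5, 1, 2, 20),
    datetime(2024, 5, 1, 2, 48),
    datetime(2024, 5, 1, 14, 0),
]

# Truncate each timestamp to the hour; the set keeps only distinct hours.
disrupted_hours = {p.replace(minute=0, second=0, microsecond=0) for p in pages}

print(len(pages), len(disrupted_hours))  # 4 pages, but only 2 disrupted hours
```

Counting hours rather than pages keeps one noisy incident from dominating the chart while still reflecting the time people actually lost.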

Chart showing number of hours disrupted by pages is low and trending well

Overall, the process was focused on building thoughtful analysis, then iterating with user feedback over time. 

“In Multitudes, we have more of a partner relationship than a vendor relationship. They are continuously learning along with us and will collaborate with us to improve the product.
Ultimately, this was a key reason we chose Multitudes over some of the competing products in the space.”

– Emily Nakashima, VP of Engineering, Honeycomb

The win

The first win was immediate – the time that Fred spent pulling together the alert reports each quarter dropped from 5 hours to 30 minutes. 

Alongside the time saved, the Honeycomb team also got new features:

  • Ready-made data visualization – not just data points and tables
  • Permalinks to share a specific view with others in the org
  • Drill-downs to see the detail – including what the time was in the responder’s local time when they got paged

Perhaps the biggest impact was that these insights were now available to anyone across the organization at any time. Fred can finally take a holiday even at the beginning of a quarter!

“Out-of-hours page disruptions insights from Multitudes let us catch any issues early if we’re being impacted. Before, we had to wait until the end of the quarter to take action.”

– Jess Mink, Sr. Director of Expand Engineering, Honeycomb

In seriousness, the added transparency made it easier to get organization-wide visibility into alerts and their impact. Directors look at this data at least once a month and managers look at it weekly, which means that as soon as team members get pulled into big weeks – as happened at the end of last quarter – they can have a conversation about it and figure out a remediation plan.

Ultimately, as Jess shared, Multitudes gave Honeycomb an easier way to live their values. 

“Having the data to point to, especially at the Director level, gives us the backstop to help people not suffer in silence. It also provides accountability.”

– Jess Mink, Sr. Director of Expand Engineering, Honeycomb

The future

This isn’t the end of the story, of course, since the Honeycomb team and other Multitudes users keep giving feedback and the Multitudes team keeps iterating. 

An upcoming iteration will bring other data sources into the disrupted-hours view: not just the unique hours disrupted by out-of-hours pages, but also the hours disrupted by meetings, commits, Slack messages, and more. Together, these give a collective view of how the work is impacting people on the team.


Start making data-informed decisions.