Eucalyptus

Eucalyptus maintained code quality while rolling out AI

161%

increase in Merge Frequency

8.5%

decrease in PR size

… while ensuring that AI PRs got enough reviews

Eucalyptus, which runs a trusted group of digital healthcare clinics on their telehealth platform, uses Multitudes to make sure that they get the productivity benefits of AI while maintaining code quality.

By tracking their metrics on Multitudes, they were able to spot the leading indicators of how AI was impacting quality and double down on what was working well. Ultimately, their high AI adopters reduced PR sizes by 8.5% while merging 161% more PRs.

The challenge

Eucalyptus rolled out AI coding tools to their engineering team in early 2025. Bronwyn Mercer, the Head of Security and Infrastructure, and Adrian Andreacchio, Platform Engineering Lead, were tasked with quantifying the impact of AI and ensuring that they got as much value as possible from these tools. 

Eucalyptus, like many other tech companies, had set big goals around their use of AI. A key benefit they hoped to get from AI was speeding up the onboarding time for new developers – they were going through a big period of growth which meant their engineering team was set to double over the following year. A secondary goal from using AI was to increase developer productivity across the team.

However, they were also aware of the issues with AI slop. Eucalyptus’s engineering team prides itself on having a high-quality codebase that’s quick and fun for developers to build on. Adrian and Bronwyn didn’t want AI speed to bring tech debt that would slow them down or degrade the developer experience. One concern was that AI slop could create bottlenecks in the code review process – with developers spending more time reviewing poor-quality AI code, or doing more rounds of editing to get an AI-written PR to the right quality level.

A second challenge was that the AI rollout was happening in the build-up to a major feature release – so Bronwyn and Adrian had to make sure that the team could still deliver quickly while learning AI tooling and managing AI slop. Their work was cut out for them!

Our unique insight

To help them spot leading indicators of AI impact, Eucalyptus worked with Multitudes. The telemetry data from Multitudes meant that they could get low-effort, ongoing updates on the AI rollout without interrupting the flow of work.

The Multitudes data showed positive early results for productivity but mixed results on the quality side.

Productivity wins: 62% more PRs merged, 10.5% faster onboarding

When we checked in on the rollout, Eucalyptus already had two positive productivity indicators:

1. Since rolling out AI tooling, Merge Frequency was up 62% for high AI adopters, well above the 29% increase that low AI adopters saw over this period.

The fact that high AI adopters were merging significantly more PRs than low AI adopters implies that AI did have a net positive impact on PRs merged, above and beyond changes coming from other factors.

Multitudes research backs this up – across the board, there’s typically a 27% increase in PRs merged when teams adopt AI.

Box and whisker chart showing Merge frequency increased 29% for low AI users and 62% for high AI users post-intervention.

2. Time to tenth PR decreased 10.5%.

This metric for onboarding speed comes from Spotify – time to 10th PR merged shows how long it takes people to be truly up and running in the codebase. At Eucalyptus, onboarding time was 10.5% shorter for people who started after AI tooling was available than for those who started before.

Chart showing time to 10th PR decreased from 20 days to approx 17 days post-Cursor rollout
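For teams that want to track this themselves, here’s a rough sketch of the calculation, assuming a table of merged PRs plus each engineer’s start date. The file names and columns below are illustrative, not Eucalyptus’s actual schema or Multitudes’ implementation:

```python
# A rough sketch of the "time to 10th merged PR" onboarding metric, assuming a
# table of merged PRs (author, merged_at) and each engineer's start date.
# File names and columns are illustrative, not Eucalyptus's actual schema.
import pandas as pd

prs = pd.read_csv("merged_prs.csv", parse_dates=["merged_at"])       # author, merged_at
starts = pd.read_csv("start_dates.csv", parse_dates=["start_date"])  # author, start_date

# Number each author's merged PRs in chronological order, then keep the 10th.
prs = prs.sort_values("merged_at")
prs["nth_merged_pr"] = prs.groupby("author").cumcount() + 1
tenth = prs.loc[prs["nth_merged_pr"] == 10, ["author", "merged_at"]]

onboarding = tenth.merge(starts, on="author")
onboarding["days_to_10th_pr"] = (onboarding["merged_at"] - onboarding["start_date"]).dt.days

print(onboarding[["author", "days_to_10th_pr"]])
print("Median time to 10th PR:", onboarding["days_to_10th_pr"].median(), "days")
```

Comparing the median across cohorts who started before versus after an AI rollout gives the kind of before/after number above.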

Mixed results for AI's impact on quality

The Eucalyptus team wanted leading indicators of how AI was impacting codebase quality; they weren't going to wait for an AI-caused incident to see the impact on MTTR.

To that end, they looked at two leading indicators from Multitudes for whether AI could be causing quality issues:

  • PR size: larger PRs take longer to review and let more defects through, so an increase in PR size is a warning sign for potential problems later (Google’s Engineering Practices Documentation shares more about why small code changes are important; this research with Cisco showed that 200-400 lines of code changed is the optimal size).
  • Quantity and quality of human reviews: As Artie Shevchenko lays out in his book, humans are your code health guardians. This role is more important than ever with AI-generated code coming through. As one indicator of that, Multitudes metrics look at the number of reviews on PRs authored by people using AI. (A rough sketch of how both indicators can be pulled from GitHub data follows this list.)
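This is an illustration only – the repo name and token below are placeholders, and it is not Multitudes’ actual pipeline – but it shows how both indicators map onto data GitHub already exposes:

```python
# Minimal sketch: pulling PR size and reviews-per-PR from GitHub with PyGithub.
# "org/repo" and the token are placeholders, not Eucalyptus's real setup.
from github import Github  # pip install PyGithub

gh = Github("YOUR_GITHUB_TOKEN")
repo = gh.get_repo("org/repo")

pr_sizes, review_counts = [], []
# Look at the 200 most recently updated closed PRs; tune the window as needed.
for pr in repo.get_pulls(state="closed", sort="updated", direction="desc")[:200]:
    if pr.merged_at is None:  # skip PRs that were closed without merging
        continue
    # PR size = lines added + lines deleted; smaller keeps you inside the
    # ~200-400 changed-lines sweet spot the Cisco research points to.
    pr_sizes.append(pr.additions + pr.deletions)
    # Reviews received = formal review events (approvals, change requests, comments).
    review_counts.append(pr.get_reviews().totalCount)

if pr_sizes:
    print(f"Median PR size: {sorted(pr_sizes)[len(pr_sizes) // 2]} lines changed")
    print(f"Average reviews per PR: {sum(review_counts) / len(review_counts):.1f}")
```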

The initial metrics on these for Eucalyptus were mixed but promising:

PR size

Here, the Euc team had an amazing win – PR size actually decreased for the high AI users! The question here was: How could they sustain this early sign of progress even as they increased their AI usage?

Box and whisker chart showing PR size increased for low AI users but decreased for High AI users.

Human reviews

This metric was interesting – while there was no change in the number of reviews that low AI users were getting, there was a decrease in the number of reviews going to high AI users. Given the increase in PRs merged, this was especially surprising. A key question going forward was how to make sure that people using AI were still getting enough reviews.

Box and whisker chart showing the amount of feedback received decreased post-intervention.

Actions taken

From those insights, Bronwyn and Adrian had two questions to explore:

  • What’s helping us keep our PR size low even as we use AI more for coding? And how can we sustain that as we double down on AI usage?
  • What’s different about code reviews for people using AI, and how can we make sure our high AI adopters get enough reviews on their work?

Conversations with their engineers showed several things that had gone well in the rollout:

1. Existing code review norms served them well 

Their developers cited several norms they were thinking of when they used AI:

  • All PRs require two reviewers – to make sure there’s a minimum set of reviews. As one developer said, “Eucalyptus has optimized for code review and not code output.”
  • There was a strong culture of being a good colleague – to not send your peers slop, to not put something forward for a review unless you would want to review it yourself
  • They follow the Google Style Guide for PRs, so even engineers new to Euc have guidance on what style practices to follow

2. The expectations with AI were clear – move faster, but maintain code quality 

The big example here was that when the platform team rolled out AI tooling, Adrian and others shared that they wouldn’t review PRs that were too long – so developers across the org knew it was important to keep PR size low even while using AI.

With clear cultural norms and goals for AI, Eucalyptus developers found practices that worked – some of which even enlisted AI itself to help achieve the end goal of no AI slop:

  • No vibe coding – people shared that, culturally, they knew this was a bad idea. Many spoke about a “0 faith” approach to AI – review every single line of code it gives you, because you’re still responsible for the work. 
  • Tell AI to be more concise – even put it in your rules or markdown file
  • Have AI do an initial review before requesting a review from humans, to make sure there was nothing obvious to fix in the PR before asking for feedback. As one dev shared, “I want to make sure I’m not wasting another person’s time when I ask for something – it’s better to save human input for the harder questions.”
  • On the rare occasions when someone moving quickly asked for a review on a PR that was largely written by AI, it was best to be upfront about it. That way, the reviewer knew they could be blunt and honest in their feedback about the PR (e.g., about anything the AI did poorly) without worrying about how that might land with the human author.
  • In fact, the main thing that experienced developers used AI for was learning something new and summarizing information, more than writing the code itself. 

These insights meant that the platform team knew what practices to double down on as they encouraged more AI usage across the team. Adrian also put together Cursor rules that people can add locally, to help ensure that AI-written PRs follow their style guide.
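As one concrete illustration, rules like these typically live in a repo-level Cursor rules file (for example, a .cursorrules file or a rule under .cursor/rules/). The snippet below is a hypothetical sketch of what such a file might contain, not Adrian’s actual rules:

```
# .cursorrules (hypothetical example – not Eucalyptus's actual rules)

- Keep diffs small: aim for well under ~400 changed lines per PR; if a change is
  bigger, propose splitting it into multiple PRs.
- Be concise. No speculative abstractions, no unrequested refactors, no comments
  that restate the code.
- Follow the project's existing style guide and the Google style conventions for
  code changes and PR descriptions.
- Do not claim code is tested unless tests were added or run.
- When intent is unclear, ask before generating large amounts of code.
```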

Outcomes

Over the following month, the continued AI push meant that AI adoption increased 26% across the engineering team (measured by daily active users) – so it was another good opportunity to check in on AI’s impact at the higher usage levels.

The results were resoundingly positive:

1. The productivity gains increased: 161% more PRs merged with AI

Merge Frequency improved even further, rising 161% for high AI adopters compared to the “before AI” period – roughly 100 percentage points more than the 62% increase that low AI adopters saw. That gain came even as the onboarding benefits continued, with time to 10th PR staying low.

Box and whisker chart showing Merge Frequency increased 89% overall, with high AI users having the biggest gains

2. Positive signs on all quality indicators – PR size 8.5% smaller and review volume maintained

PR size

PR sizes remained low for high AI adopters – overall, they actually saw an 8.5% decrease in PR size. This is an even bigger win given that low AI adopters saw their PR size increase over the same period, which suggests that broader org pressures would have pushed PR sizes higher even without AI. Thanks to the improvements from the high AI adopters, PR size across the organization held constant.

Maintaining a consistent PR size while rolling out AI is a huge win – it means Adrian and Bronwyn are succeeding in preventing AI slop while getting benefits from AI.

Box and whisker chart showing PR size increased overall post-intervention but decreased for High AI adopters.

Human reviews

The other win was on the reviews side. Their rollout strategies worked: the number of reviews for high AI users went back up, bringing reviews received back to pre-AI levels.

Box and whisker charts showing feedback received increased overall

“It was great to have real-time data to compare and see trends over time. In particular, it helped us to be able to do a deep dive into AI – that meant we could do the rollout quickly, knowing we would have leading indicators of AI’s impact. Those leading indicators gave us time to adjust as needed.”

- Adrian Andreacchio, Platform Engineering Lead

What next? 

Adrian and Bronwyn have even more reason to double down on their strong code review culture at Eucalyptus. Over time, they’ll be able to validate the quality indicators (like PR size and human reviews) with more lagging indicators, like Change Failure Rate and Mean Time to Recovery. 

As they try new AI interventions – like Adrian’s Cursor rules that he’s putting in a centralized repo – they can measure the impact of each one in Multitudes’s AI impact feature and see how it impacts AI adoption, productivity, and of course, code quality.


Start making data-informed decisions.