
Feature Launch: Improve your team’s code review processes with Feedback Quality

Feedback is one of the most important contributors to an engineering team’s performance. It’s not merely a matter of hurt feelings; we know that good feedback improves the quality of code, supports learning, and increases knowledge-sharing. Conversely, studies have shown that bad feedback is linked to more defects, less-maintainable code, and, in the worst cases, turnover.

With that in mind, we have wanted to build this feature for years (feedback quality is one of the focus areas of our original research). But until modern LLMs came along, we couldn’t reach a high enough level of accuracy to ship something.

That’s why we’re delighted to launch our Feedback Quality feature for Multitudes – allowing you to identify what constructive, actionable feedback looks like for your team, where rubber-stamping might be happening, and whether conversations are starting to get too heated. Ultimately, our goal is to support you and your team to write code reviews that help, not hurt.

Why feedback quality now?

Code reviews are an integral part of software development, allowing developers to improve code quality and catch issues before a PR is merged into the main codebase. Reviews are one of the top things a team can do to support the quality of their work and the growth of their people. And with the rise of AI coding tools, more pressure is being put on code reviews than ever before – because there’s a higher volume of code to review, and because LLMs can introduce hard-to-spot bugs.

Despite the importance of good reviews, bad reviews abound. Studies show that, each year, 55% of developers receive nonspecific negative feedback and 22% experience inconsiderate criticism.

Poor-quality feedback is also more common for people from marginalized groups. A seminal study by Shelley Correll showed that women tend to receive less specific, helpful feedback than men. Iris Bohnet’s research also showed that people from marginalized groups get less feedback overall – known as the “thin file problem”, this affects their eligibility for promotions. Women, Hispanic/Latino, and Black people also tend to be overrepresented among recipients of negative stereotyping in feedback, and are disproportionately affected by the increased turnover that negative feedback drives.

Despite all of that, we know that most people want to provide good feedback. 76% of developers believe that improving code quality and considering the impact on the recipient are equally important when writing code review comments.

So, what is this persistently high rate of poor-quality feedback telling us? We’ve inferred that teams need guidance on how to ensure that criticism remains constructive, so that team retention and performance don’t suffer. Understanding your team’s feedback quality helps you create more inclusive review processes and identify opportunities to improve collaboration within your team.

Graph labeled 'quality of feedback given' showing the different categories of feedback within one team's code review, ranging from Highly specific to Minimal review. The graph also shows suggested responses.

How do you even measure feedback quality?

One of the challenges with setting out to build a feature in this space is that even humans might disagree about what good-quality feedback looks like.

To address this, we dived deep into the research to look for:

  • What aspects of feedback are most linked to the outcomes we care about (e.g., better-quality code, more knowledge-sharing)?
  • What aspects of feedback most get in the way of outcomes we care about?

We then pulled out the aspects of feedback quality that stood out most:

  • On the side of positive outcomes, a clear takeaway in the research was that the more specific feedback is, the better the outcomes.
  • On the side of things that block good outcomes, negative feedback was the other standout (feedback that unnecessarily tears people down, as opposed to constructive feedback, which is framed around how to improve). Also categorized as ‘toxic feedback’, comments like these are associated with undesirable outcomes such as stress and turnover.
  • Rubber-stamping (or minimal reviews) also stood out in the research. These are the reviews that just say “Looks good to me!” or “LGTM”. Some of this is fine, since the PR might genuinely not have needed many changes. But if it happens too often, people on the team aren’t getting thoughtful feedback to help them improve. Too much rubber-stamping can also let bugs through and lead to pull requests being reopened, wasting valuable development time.

With that, we knew that our feedback quality feature needed to identify three key aspects of feedback: highly specific feedback, negative feedback, and minimal reviews.

What does this feedback quality feature show?

Our feedback quality feature examines the comments written in code reviews and identifies feedback that is likely to be more or less helpful. We analyze all code review comments (excluding the PR author's own comments) and classify them into quality categories based on how constructive and actionable the feedback is.

How do we measure feedback quality?

We analyze feedback given in code reviews using Multitudes’ AI models, which have been specifically designed to mitigate algorithmic bias and are grounded in research. We classify feedback into the following quality categories:

  • Highly Specific: Detailed, actionable feedback that clearly explains what needs to change and provides clear reasoning.
  • Neutral: Moderately detailed feedback that provides some guidance but could be more comprehensive.
  • Unspecific: Vague comments that don't provide clear direction for improvement.
  • Minimal: Short reviews like “LGTM 👍” or other “rubber-stamped” code reviews that provide minimal guidance.
  • Negative: Feedback that may come across as harsh, dismissive, or potentially harmful.

When feedback is classified in the app as “negative”, we also identify the specific reasons that the criticism was flagged as destructive, based on established research around this type of negative feedback. Together, this analysis helps your team identify specific patterns unique to your own code review culture, and gives you examples of feedback that you can use in coaching conversations with your team.
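
To make the classification step concrete, here’s a minimal sketch of how review comments might be counted per category, assuming a hypothetical classify_comment helper that would normally call an LLM. This is an illustration only, not our production implementation; it also shows the exclusion of the PR author’s own comments mentioned above.

```python
from collections import Counter

# The five quality categories described above.
CATEGORIES = {"Highly Specific", "Neutral", "Unspecific", "Minimal", "Negative"}

def classify_comment(body: str) -> str:
    """Hypothetical stand-in for an LLM-backed classifier.

    A real implementation would prompt a model with research-grounded
    definitions of each category; a trivial heuristic keeps this sketch
    runnable end to end.
    """
    if body.strip().lower() in {"lgtm", "lgtm 👍", "looks good to me!"}:
        return "Minimal"
    return "Neutral"  # placeholder default for this sketch

def feedback_quality_summary(comments: list[dict]) -> Counter:
    """Count review comments per quality category, skipping the PR author's own comments."""
    counts = Counter()
    for comment in comments:
        if comment["author"] == comment["pr_author"]:
            continue  # replies from the PR author aren't review feedback
        counts[classify_comment(comment["body"])] += 1
    return counts

# Example usage with toy data:
comments = [
    {"author": "ana", "pr_author": "sam", "body": "LGTM 👍"},
    {"author": "kai", "pr_author": "sam",
     "body": "Consider extracting this retry loop into a helper so the timeout is configurable."},
    {"author": "sam", "pr_author": "sam", "body": "Thanks, will do!"},
]
print(feedback_quality_summary(comments))  # e.g. Counter({'Minimal': 1, 'Neutral': 1})
```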

A sample code review, showing why it was rated Highly specific by the Feedback Quality feature.

What good-quality feedback looks like

We recommend that teams aim for:

  • 20%+ highly specific feedback, because this provides clear, actionable guidance.
  • Zero negative feedback, because this can be destructive for team inclusion and performance.
  • <30% minimal feedback, because some quick approvals are normal, but excessive rates suggest insufficient review depth.
  • Equitable distribution across the team, with all team members receiving similar quality of feedback and no one getting significantly less specific feedback.

This is based on benchmarks for what the balance of feedback typically looks like on teams (more here).
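
If you want a rough sense of how these targets translate into a check on your own numbers, here’s a small sketch. The category counts and the check_benchmarks function are assumptions for illustration (the thresholds match the guidance above), and the per-person equity check is left out for brevity.

```python
def check_benchmarks(counts: dict[str, int]) -> list[str]:
    """Compare a team's feedback-category counts against the recommended targets."""
    total = sum(counts.values()) or 1  # avoid division by zero

    def share(category: str) -> float:
        return counts.get(category, 0) / total

    findings = []
    if share("Highly Specific") < 0.20:
        findings.append("Less than 20% of feedback is highly specific.")
    if counts.get("Negative", 0) > 0:
        findings.append("Some feedback was classified as negative.")
    if share("Minimal") >= 0.30:
        findings.append("30% or more of reviews are minimal (rubber-stamped).")
    return findings or ["All benchmarks met."]

# Example: 50 comments, of which 12 are highly specific, 0 negative, and 10 minimal.
print(check_benchmarks({"Highly Specific": 12, "Neutral": 20, "Unspecific": 8, "Minimal": 10}))
```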

We’re actively conducting research to better understand the aspects of feedback quality that are indicative of elite teams – feedback quality is one of the focus areas of our original research. Because this feature relies on LLMs, we will continue to refine our feedback classification process after release as we gather more data and insights, and we’ll keep updating the feature so that your team can benefit from the results of our ongoing research.

This is also part one of a larger roll-out of new AI features to our app that will enable you to better assess the effectiveness of your team’s code reviews – watch this space for more!

If you’re an existing Multitudes customer, read our documentation on this feature here.

Ready to try it out? Book a demo now.

Contributor
Ginny Woo
Growth Lead