Feedback is one of the most important contributors to an engineering team’s performance. It’s not merely a matter of hurt feelings; we know that good feedback improves the quality of code, supports learning, and increases knowledge-sharing. Conversely, studies have shown that bad feedback is linked to more defects, less-maintainable code, and, in the worst cases, turnover.
With that in mind, we’ve wanted to build this feature for years (feedback quality is one of the focus areas of our original research). But until modern LLMs came along, there was no way to reach a high enough level of accuracy to ship it.
That’s why we’re delighted to launch our Feedback Quality feature for Multitudes – allowing you to identify what constructive and actionable feedback looks like for your team, where rubber-stamping might be happening, and whether conversations are starting to get too heated. Ultimately, our goal is to support you and your team in writing code reviews that help, not hurt.
Code reviews are an integral part of software development, allowing developers to improve code quality and catch issues before a PR is merged into the main codebase. Reviews are one of the top things a team can do to support the quality of their work and the growth of their people. And with the rise of AI coding tools, more pressure is being put on code reviews than ever before – because there’s a higher volume of code to review, and because LLMs can introduce hard-to-spot bugs.
Despite the importance of good reviews, bad reviews abound. Studies show that 55% of developers receive nonspecific negative feedback in a given year, while 22% experience inconsiderate criticism at least once per year.
Poor-quality feedback is even more common for people from marginalized groups. Shelley Correll’s seminal research showed that women tend to receive less specific and less helpful feedback than men. Iris Bohnet’s research also showed that people from marginalized groups get less feedback overall – known as the “thin file” problem, this affects their eligibility for promotions. Women, Hispanic/Latino, and Black people also tend to be overrepresented among recipients of negative stereotyping in feedback, and are disproportionately impacted by the increased turnover that follows negative feedback.
Despite all of that, we know that most people want to provide good feedback. 76% of developers believe that improving code quality and considering the impact on the recipient are equally important when writing code review comments.
So, what is the disproportionately high rate of poor-quality feedback telling us? Our takeaway is that teams need guidance on how to keep criticism constructive, so that retention and performance don’t suffer. Understanding your team’s feedback quality helps you create more inclusive review processes and spot opportunities to improve collaboration.
One of the challenges with setting out to build a feature in this space is that even humans might disagree about what good-quality feedback looks like.
To address this, we dived deep into the research to look for:
We then pulled out the aspects of feedback quality that stood out most:
With that, we knew that our feedback quality feature needed to identify three key aspects of feedback: highly specific feedback, negative feedback, and minimal reviews.
Our feedback quality feature examines the comments written in code reviews and identifies feedback that is likely to be more or less helpful. We analyze all code review comments (excluding the PR author's own comments) and classify them into quality categories based on how constructive and actionable the feedback is.
We analyze feedback given in code reviews using Multitudes's AI models that have been specifically designed to mitigate algorithmic biases and are grounded in research. We classify feedback into the following quality categories:
When feedback is classified in the app as “negative”, we also identify the specific reasons that the criticism was flagged as destructive, based on established research around this type of negative feedback. Together, this analysis helps your team identify specific patterns unique to your own code review culture, and gives you examples of feedback that you can use in coaching conversations with your team.
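For the technically curious, here’s a rough sketch of that flow – a minimal Python illustration with hypothetical names and a placeholder classifier, not our actual implementation:

```python
# A hypothetical sketch of the analysis described above; names are illustrative.
from dataclasses import dataclass


@dataclass
class ReviewComment:
    author: str
    body: str


def classify_comment(body: str) -> tuple[str, list[str]]:
    """Stand-in for the LLM-backed classifier. A real implementation would
    call a model; this placeholder only shows the output shape."""
    text = body.strip().lower()
    if text in {"lgtm", "+1", ""}:
        return "minimal", []       # a rubber-stamp style review
    return "unclassified", []      # real categories would be assigned here


def analyze_review(pr_author: str, comments: list[ReviewComment]) -> list[dict]:
    """Classify every review comment except the PR author's own."""
    results = []
    for comment in comments:
        if comment.author == pr_author:
            continue               # the author's own comments are excluded
        category, reasons = classify_comment(comment.body)
        results.append({
            "author": comment.author,
            "category": category,  # e.g. highly specific, negative, minimal
            "reasons": reasons,    # populated only for negative feedback
        })
    return results
```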
We recommend that teams aim for:
This is based on benchmarks for what the balance of feedback typically looks like on teams (more here).
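As a rough illustration (continuing the hypothetical sketch above), the balance of feedback is simply the share of review comments that fall into each quality category, which you can then compare against those benchmarks:

```python
# Hypothetical continuation of the sketch above: share of comments per category.
from collections import Counter


def feedback_balance(results: list[dict]) -> dict[str, float]:
    counts = Counter(r["category"] for r in results)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {category: count / total for category, count in counts.items()}
```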
We’re actively conducting research to better understand the aspects of feedback quality that are indicative of elite teams – feedback quality is one of the focus areas of our original research. Because this feature relies on LLMs, we will continue to refine our feedback classification process after release as we gather more data and insights. We’ll keep the feature actively updated so that your team can benefit from the results of our ongoing research.
This is also part one of a larger roll-out of new AI features to our app that will enable you to better assess the effectiveness of your team’s code reviews – watch this space for more!
If you’re an existing Multitudes customer, read our documentation on this feature here.
Ready to try it out? Book a demo now.