Measuring AI productivity gains
More and more companies are now using AI. I work in tech, so I know I'm on the front line of this "revolution". Other companies are being force-fed AI through the products they use daily, mostly by Microsoft and Google.
This can be great for some uninteresting tasks, and I've experienced it in my day-to-day work. Offloading tedious jobs to AI, like updating JSON files, writing simple to medium complexity code, or improving the wording of emails, is wonderful. Everything right now is largely free: I've used Claude without charges for the past year or so. My company started handing out licences to some developers and checking whether their productivity had, in fact, increased. However, measuring the productivity boost from AI is a hard task. Right now, the best we can come up with is a rough estimate of the time gained and, drumroll, the number of lines generated using AI.
I think it's a backward idea, and the people pushing it are unaware of the sheer variation in how people judge time, LoC, etc. We should not let people guess or estimate! It is the best way to get completely unreliable results.
Instead, a pragmatic approach is to benchmark, for specific tasks, the time taken by an AI-assisted human vs the same human without AI assistance. You don't guess, you extrapolate. Most of the time, tasks can be sorted into broad groups: answering mails, writing code, reading code, understanding requirements, taking meeting notes, etc.
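A minimal sketch of that benchmark, assuming made-up task groups and made-up timings (a real run would record actual durations for your own tasks):

```python
# Hypothetical example: compare the same engineer's timings on matched tasks,
# once with AI assistance and once without. All numbers below are invented.
timings = {
    # task group: (minutes with AI, minutes without AI)
    "answer_mail": (6, 10),
    "write_code": (45, 70),
    "read_code": (20, 25),
    "meeting_notes": (5, 15),
}

# Per-group speedup: how many times faster the AI-assisted run was.
for task, (with_ai, without_ai) in timings.items():
    print(f"{task}: {without_ai / with_ai:.2f}x speedup")

# Aggregate over all groups instead of guessing one global number.
total_with = sum(w for w, _ in timings.values())
total_without = sum(wo for _, wo in timings.values())
print(f"overall: {total_without / total_with:.2f}x")
```

The point of the aggregate line is that per-group speedups vary a lot, so a single extrapolated figure should be weighted by the time each group actually consumes.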
This experiment assumes a specific order: first you do the task with the AI, then you do it without AI assistance. It also largely assumes the speed gain comes from the AI's ability to generate massive quantities of code very rapidly, and doesn't account for using AI to debug or understand code.
However, it's still not a perfect solution. Truth is, there is no perfect solution; just pick the one that fits you best.
Another good way to measure AI productivity is to force the AI to auto-commit to Git. This way, you can estimate how many lines were written by an AI vs by a human. You can then go deeper by checking, for each PR, the percentage of AI-written lines that were altered by a human, and vice versa. This still relies on the assumption that developers won't edit AI code before it is committed, but despite this, it is miles better than just guessing.
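A rough sketch of the attribution step, assuming the AI agent commits under its own author name. The commit records below are invented; in practice they would come from something like `git log --numstat` parsed per author:

```python
# Hypothetical example: per-author line counts derived from commit stats.
# Author names and numbers are made up for illustration.
commits = [
    {"author": "ai-agent", "added": 320, "deleted": 12},
    {"author": "alice",    "added": 40,  "deleted": 150},
    {"author": "ai-agent", "added": 85,  "deleted": 3},
    {"author": "alice",    "added": 25,  "deleted": 60},
]

# Sum lines added per author.
added_by = {}
for c in commits:
    added_by[c["author"]] = added_by.get(c["author"], 0) + c["added"]

total = sum(added_by.values())
for author, added in added_by.items():
    print(f"{author}: {added} lines added ({100 * added / total:.0f}%)")
```

Note how the deletions matter too: here "alice" mostly deletes AI-written lines, which is exactly the human-altering-AI-code signal the PR-level check is after.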
Measuring the "throughput" itself is a whole other topic and might just be a manager's wet dream. There are, however, as many ways to measure productivity as there are stupid people on this planet (a lot).
Measuring ✨productivity✨
A good definition of productivity in a professional context is "the rate at which goods or services are produced especially output per unit of labor".
I'm going to focus on the tech industry, but this should generalise well to other areas. I mostly use Jira and estimate "issue complexity" during planning poker. This system has its strengths and weaknesses, but that is a topic for another day. The key point is to NEVER mix complexity points and productivity. If you do, you introduce biased human judgment into your measurement of productivity, which you don't want.
A better way to work around complexity is to measure the time a single issue takes from being picked up by an engineer to being marked as finished. This ensures good communication inside the team, so the issue fluidly moves from one column to another, while avoiding blaming a single person.
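The metric itself is trivial to compute once you have the two timestamps per issue. A small sketch, with invented issue keys and dates (a real version would pull the column transitions from the Jira API):

```python
# Hypothetical example: cycle time per issue, from "picked up" to "finished".
# Issue keys and timestamps are made up for illustration.
from datetime import datetime

issues = [
    {"key": "PROJ-101", "started": "2024-05-02T09:00", "finished": "2024-05-03T17:00"},
    {"key": "PROJ-102", "started": "2024-05-02T10:00", "finished": "2024-05-06T12:00"},
]

def cycle_hours(issue):
    """Elapsed hours between pickup and completion."""
    fmt = "%Y-%m-%dT%H:%M"
    start = datetime.strptime(issue["started"], fmt)
    end = datetime.strptime(issue["finished"], fmt)
    return (end - start).total_seconds() / 3600

for issue in issues:
    print(f"{issue['key']}: {cycle_hours(issue):.1f}h")

avg = sum(cycle_hours(i) for i in issues) / len(issues)
print(f"average cycle time: {avg:.1f}h")
```

Tracking the average (or better, the distribution) over time is what matters, not any single issue's number.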
Measuring this will eventually lead engineers to reduce the content of each Jira issue to the strict minimum, which is a good thing. A Jira ticket should do one thing, and do it well.
A corollary metric is the number of declared bugs (by customers and by the team). When people rush and don't take the time to review AI code, they introduce bugs. If this number rises, you've overdone it and your team is working beyond its capacity. The same goes for the number of hotfixes, patches, production outages, etc.
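This check can be as simple as comparing recent bug counts against an earlier baseline. A sketch with invented weekly counts and an arbitrary 1.5x threshold:

```python
# Hypothetical example: flag a sustained rise in declared bugs.
# Weekly counts and the 1.5x threshold are made up for illustration.
weekly_bugs = [4, 3, 5, 4, 9, 11]  # bugs declared each week, oldest first

baseline = sum(weekly_bugs[:4]) / 4  # average of the earlier weeks
recent = sum(weekly_bugs[-2:]) / 2   # average of the last two weeks

if recent > 1.5 * baseline:
    print(f"warning: bug rate jumped from {baseline:.1f} to {recent:.1f} per week")
```

The same structure works for hotfixes or production outages; only the input series changes.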
Human well-being should not be left behind when trying to increase productivity. Burnt-out people perform less, and people on sick leave don't perform at all! Unfortunately, well-being at work is especially difficult to benchmark. Pick a metric that works for your team and stick with it (it might be the number of jokes in the group chat, the number of people smiling each day, whether people leave at normal hours, etc.)
Final words
Increasing productivity is a balancing act: squeeze employees just enough to reach peak efficiency without burning them out. Right now, no one can reliably predict the real ROI of integrating AI into our work. With this article, I hope you've developed a good enough intuition of the pitfalls that come with measuring productivity to build your own productivity framework in your company.