Gen AI Coding Assistants: Productivity Is a Mixed Bag
AI coding assistants, like GitHub Copilot, promise to increase developer productivity. But recent studies show a decline in software quality, security, and reusability. And senior developers confirm it.
In late 2023, I ran a survey of business decision-makers to determine the top enterprise use cases for Generative AI, along with spending and staffing metrics for Gen AI programs. The full report shows, not surprisingly, that software development came out as the top enterprise use case across all industries, whether or not they are in the technology sector.
This makes sense. Software development languages have a formal and rigid syntax, making them a natural use case for large language models. In addition, software developers are a costly resource, so any productivity improvement will have tangible financial benefits. Moreover, developers are technically minded and likely open to embracing new technologies that help them get their jobs done faster and easier.
Quantitative Metrics Lacking
However, most of the data regarding developer productivity is coming from surveys, like mine, which measure what survey respondents, or their managers, believe to be true. There have not been many studies based on hard metrics from development environments to determine what is really happening, as opposed to what business leaders perceive is happening. That is now changing.
What are those studies and what do they show?
The bottom line is that generative AI does appear to improve developer productivity, but the results are a mixed bag. The productivity improvements are not spread evenly across all developers: there is greater adoption and greater productivity improvement among less experienced developers. But even this is offset by negative impacts on software quality, security, and reusability.
Improvement in Coding Productivity, But at What Cost?
The first findings are from Jellyfish, which provides a platform for software development organizations to collect and report metrics from their product lifecycle management and application lifecycle management systems. The firm analyzed GitHub Copilot’s impact using data from over 4,200 developers at more than 200 companies. That study reported a 2.5x increase in coding speed for mobile developers and a 4x increase for junior developers.
Although the overall tone of the research is positive, Jellyfish separately reported concerns about software quality, data privacy, and security of Copilot-assisted code.
Nevertheless, Jellyfish concluded in another post:
Our research found cycle time decreased by 5% but the time spent coding decreased by 11%, which shows a shift in the importance of coding toward code reviews. This has business and career implications for engineers who no longer need to code well but need to be able to understand and review code.
Second, an academic paper published in 2024 reports that the use of GitHub Copilot led to a 26.08% increase in the weekly number of completed tasks, a 13.55% increase in the number of code updates (commits) and a 38.38% increase in the number of times code was compiled. However, the productivity increase was greatest for new hires and those in more junior positions. The study was based on controlled trials analyzing data from nearly 5,000 developers at Microsoft, Accenture, and an anonymous Fortune 100 electronics manufacturer.
Third, a recent study by Uplevel, a provider of software development dashboards and metrics, is more downbeat. The study was based on customer data for 800 developers using GitHub Copilot, and it showed no significant overall gains in productivity. But the use of GitHub Copilot produced 41% more bugs. So whatever gains there were in productivity would be offset, in whole or in part, by the impact on software quality.
Fourth, a GitClear study from January 2024 starts by repeating GitHub’s claim that its coding assistant results in a 55% improvement in coding speed, a 46% gain in the amount of code written, and a 75% improvement in developer job satisfaction. But the study goes on to question whether all of that AI-assisted code should have been written in the first place. (In my view, this is consistent with the age-old maxim that measuring software developers based on lines of code only encourages them to be more verbose in coding.) The study also questions whether those who must maintain the code are as satisfied as those who originally wrote it.
The study then moves on to analyze data based on 153 million changed lines of code, authored between January 2020 and December 2023, where they find “disconcerting trends for maintainability.” Among measures of code quality, they use a metric called “code churn,” a proxy for software defects defined as “the percentage of lines that are reverted or updated less than two weeks after being authored.” They project that code churn will double in 2024 compared to its 2021, pre-AI baseline.
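For readers who want to see the metric concretely, here is a minimal sketch of how a churn calculation along the lines of GitClear’s definition might be computed. The AuthoredLine structure and churn_rate helper are hypothetical, invented for illustration, and are not taken from GitClear’s tooling.

```python
# Hypothetical sketch of a "code churn" calculation: the percentage of
# authored lines that are reverted or updated within two weeks of being
# written. The data structure below is an assumption for illustration.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)

@dataclass
class AuthoredLine:
    authored_at: datetime
    changed_at: Optional[datetime] = None  # when the line was later updated or reverted, if ever

def churn_rate(lines: list[AuthoredLine]) -> float:
    """Fraction of lines changed again within the churn window."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.changed_at is not None
        and line.changed_at - line.authored_at < CHURN_WINDOW
    )
    return churned / len(lines)

# Example: two of the three lines were modified within two weeks -> ~67% churn
lines = [
    AuthoredLine(datetime(2024, 1, 1), datetime(2024, 1, 5)),
    AuthoredLine(datetime(2024, 1, 1), datetime(2024, 3, 1)),
    AuthoredLine(datetime(2024, 1, 2), datetime(2024, 1, 10)),
]
print(f"code churn: {churn_rate(lines):.0%}")
```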
GitClear’s quantitative data confirms its earlier developer survey, where code quality tied as the top metric developers would want to see when actively using AI. Another measure of software quality, the number of production incidents, rose to the third spot. GitClear wrote:
While individual developers lack the data to substantiate why "Code Quality" and "Production Incidents" become more pressing concerns with AI, our data suggests a possible backstory: When developers are inundated with quick and easy suggestions that will work in the short term, it becomes a constant temptation to add more lines of code without really checking whether an existing system could be refined for reuse.
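To make that point concrete, the hypothetical Python snippet below contrasts the two paths the authors describe: accepting a quick suggestion that adds near-duplicate code versus refining the existing system for reuse. The function and field names are invented for illustration.

```python
# Path 1: accepting a quick suggestion that adds new, near-duplicate logic.
def validate_shipping_address(addr: dict) -> bool:
    return bool(addr.get("street")) and bool(addr.get("city")) and bool(addr.get("postal_code"))

def validate_billing_address(addr: dict) -> bool:  # a second, copy-pasted variant
    return bool(addr.get("street")) and bool(addr.get("city")) and bool(addr.get("postal_code"))

# Path 2: refining the existing system for reuse -- one helper serves both call sites.
REQUIRED_FIELDS = ("street", "city", "postal_code")

def validate_address(addr: dict, required: tuple[str, ...] = REQUIRED_FIELDS) -> bool:
    return all(bool(addr.get(field)) for field in required)

addr = {"street": "1 Main St", "city": "Springfield", "postal_code": "12345"}
print(validate_shipping_address(addr), validate_address(addr))  # True True
```

The first path works immediately and is easy to accept; the second takes a moment of judgment but leaves less code to read and maintain.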
The authors then list three additional challenges with AI-assisted code generation:
Being inundated with suggestions for added code, but never suggestions for updating, moving, or deleting code. This is a user interface limitation of the text-based environments where code authoring occurs.
Time required to evaluate code suggestions can become costly. Especially when the developer works in an environment with multiple, competing auto-suggest mechanisms….
Code suggestion is not optimized by the same incentives as code maintainers. Code suggestion algorithms are incentivized to propose suggestions most likely to be accepted. Code maintainers are incentivized to minimize the amount of code that needs to be read (I.e., to understand how to adapt an existing system).
Consistent with the studies we discussed earlier, the authors observe that there is a “greater tendency for junior developers to accept code suggestions compared to their more experienced counterparts.” They write:
Experienced developers have the most informed understanding of how costly code will be to maintain over time. If they are more averse to using AI suggestions, it raises questions about the extra code that Junior developers are now contributing, faster than ever?
The entire GitClear study is easy to follow and worth reading in its entirety.
Senior Developer Feedback Confirms
As I was finalizing this post, I came across a LinkedIn post from Alex Ragalie, a Principal Software Engineer and former CTO. He writes:
This morning I deactivated [GitHub] Copilot in my IDE, after almost 2 years of trying it out in various formats, tasks, languages and setups, on a daily basis.
The reason is very simple: it dumbs me down.
I’m forgetting how to write basic things, and using it at scale in a large codebase also introduces so many subtle bugs which even I can’t usually spot directly.
Call me old fashioned, but I believe in the power of writing, and thinking through, all of my code, by hand, line by line. And my belief has been solidified even more after the last 2 years.
The manual approach to writing code gives me a visible and constant increase in my mastery of the language, and a “guaranteed” improvement in the craft of software development.
And no, I don’t believe there’s any long term “productivity” to be had from AI coding assistants.
Quite the opposite actually.
For additional feedback, I reached out to Steve Scavo, CTO at digital agency Haus [1]. In reviewing this post, he writes back:
Copilot is helpful when it comes to quickly generating utility functions or simple refactors based upon established patterns.
Copilot is not helpful—or even harmful—if you expect it to make architectural or creative decisions. It can't refine and iterate based upon a nuanced set of values and principles. This type of work requires a deep mental model and a fundamental understanding of the initiative at hand.
Copilot is additive by nature and not set up to holistically optimize a codebase. It's even farther from being able to craft superior user experiences.
However, I find that leveraging AI outside of the IDE can be very helpful when it comes to non-coding-related aspects like feature planning. ChatGPT (as an example) is, in essence, a world-class research assistant available to an engineer 24/7, devoid of ego and underlying bias. AI is an irreplaceable collaborative companion and sounding board when it comes to working through *how* to think about a problem.
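By way of illustration, the kind of utility function Steve describes as a good fit for an assistant might look like the hypothetical helper below: short, mechanical, pattern-following, and easy to verify, with no architectural judgment involved.

```python
# Hypothetical example of assistant-friendly "utility function" territory.
def chunk(items: list, size: int) -> list[list]:
    """Split a list into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i : i + size] for i in range(0, len(items), size)]

print(chunk([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```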
In essence, Gen AI coding assistants are not the silver bullet that many believe, and they may not be the form of generative AI with the most strategic benefits for development organizations. This is especially true since coding makes up only a small part of the software development lifecycle.
Use and Misuse of Gen AI Coding Assistants
Writing software is a creative exercise, like authoring business reports, graphics, music, and artwork. Used properly, Gen AI can be a tool to assist creators. But in the wrong hands, it produces mediocre content: business analyses that lack insight, graphics that all look the same, and poetry, artwork, and music without artistic merit. Likewise, over-reliance on AI coding assistants will flood the application portfolio with substandard code, leading to mounting technical debt.
Considering these findings, what is the best path forward with Gen AI coding assistants, such as GitHub Copilot?
The genie is out of the bottle. It is probably not possible, or even wise, to preclude the use of coding assistants. The question, then, is how to use these tools to improve productivity while mitigating the negative effects on code quality, security, and reusability.
Lower expectations for step-level improvements in developer productivity. One way to do this is to ensure that the metrics you are using measure the overall development life cycle, not just coding, and especially not any metric based on lines of code. Metrics should include measures of functionality delivered, software quality, security, maintainability, and reusability.
Consider development tools beyond coding assistants. The greatest benefits may be in further adoption of DevOps, investments in IDEs, low-code/no-code platforms, and, as Steve mentioned, use of other types of Gen AI to support development activities outside of coding.
Quantitative analysis of Gen AI coding assistants indicates that, yes, there are productivity benefits, especially for junior developers. But the same studies also point to warning signs, which are confirmed by feedback from senior developers, those in the best position to weigh in. Business leaders would do best to temper their expectations and ensure that the right tools are used in the right way.
End Notes
[1] Steve is a relative, my son. With a staff of over 200, Haus clients include brands such as Amazon, Netflix, Uber, Twitter, Snapchat, Hyundai, Ford, and many others.