Myths and Reality of AI Writing Code

Kevin Zatloukal
12 min read · Feb 29, 2024

I’m a big fan of Jensen Huang, and it is obviously silly for me to try to give him any kind of advice, but if I were in his shoes, I would avoid telling people that AI will soon do things that Alan Turing knew, almost 100 years ago, were impossible for any computer program to accomplish.

Here is his most recent, headline-grabbing claim: thanks to AI, everybody in the world is now a programmer, because you can simply tell the computer, in English, what you want done.

Programming is Hard in Theory

Turing famously proved the impossibility of solving the Halting Problem, which asks a computer to answer a simple question about a program it is given: will that program eventually stop, or will it run forever when executed? No computer can answer correctly for all programs.

The Halting Problem might seem like a special case, but it turns out to have extremely broad implications. As most computer scientists are aware, almost any interesting problem involving computer programs is provably impossible to solve. (This is best formalized in Rice’s Theorem.) As a simple example, if you’d like a program that can tell you “Is there a bug in this program?” (or “Where is the bug in this program?”), you are out of luck! There is no program that will always answer correctly.
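For readers who enjoy the details, here is a minimal sketch of Turing's argument in Python. The names are mine, and `halts` is the hypothetical perfect checker that the argument shows cannot exist:

```python
def halts(program, program_input):
    """Hypothetical: always correctly reports whether program(program_input)
    eventually stops. Turing's argument shows this cannot be written."""
    ...

def paradox(program):
    # Do the opposite of whatever halts() predicts.
    if halts(program, program):
        while True:  # loop forever if we were predicted to halt
            pass
    else:
        return       # halt immediately if we were predicted to loop

# Does paradox(paradox) halt?
#  - If halts(paradox, paradox) says yes, paradox loops forever.
#  - If it says no, paradox halts immediately.
# Either way, halts() answered incorrectly: a contradiction.
```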

For non-programmers, here is some simple intuition for why writing correct code is, in some sense, "maximally hard" among the problems we might like AI to solve. If an AI could write any program correctly, then it could solve your favorite problem X: you could simply ask it to write a program that solves X. So it cannot be the case, for example, that full self-driving is too hard for AI but writing code is not, because we could just ask the AI to give us the code for full self-driving.

Programming is Hard in Practice

One of the dead giveaways, to me, that these discussions about having AI program for us are not serious is the complete absence of any discussion of correctness: how important it is and how AI could possibly achieve it.

I assume most readers are aware of the many ways AI has already been caught producing incorrect output: making up untrue facts, citing nonexistent cases in legal briefs, making illegal moves while playing chess, and even driving a car into wet cement.


I imagine that readers who are not worried about these examples think of them as rare cases: if most programming tasks do not have consequences as dire as full self-driving or legal briefs, then it would not be necessary to worry so much about correctness. Sadly, that is not the case.

I think that most people who hold that view do so because they live in a world surrounded by software that works properly 99+% of the time. They simply haven’t given any thought to what it would be like if the software only worked properly, say, 90% of the time.

If Excel computed the wrong answer 10% of the time, huge amounts of money could be lost. If a database missed some matching records 10% of the time, we could reach incorrect conclusions on all manner of important questions. In both cases, there would likely be lawsuits. People depend on enterprise software to always be correct.

Even for low-stakes software, users demand correctness. If Spotify randomly played the wrong song 10% of the time or jumped back and forth between songs in the middle of playing them, people would stop using it. If clicking on a menu item resulted in the wrong action being taken 10% of the time, the software would be too frustrating to use.

Games probably have the lowest stakes for correctness, and even there we know it is hugely important. CD Projekt saw its market cap drop 75% after releasing Cyberpunk 2077 in a state players found too buggy. Watching ChatGPT make illegal moves makes for a fun YouTube reaction video, but it is simply not fun to play chess against an AI that fails to make a legal move 10% of the time.

Contrast this with AlphaZero, which was the best chess player in history when it was released. It also uses AI at its core, but only to evaluate potential moves. Regular ol’ code, written by humans, generates the set of legal moves for the AI to consider. This ensures that illegal moves are never considered, much less made.

That same structure is used in AlphaGo, AlphaFold, and AlphaGeometry. AlphaGeometry never generates incorrect proofs because it includes regular code, written by humans, that produces only valid inferences. While ChatGPT can conceivably tell us that the amino acid sequence we need to generate a protein that will fight off some disease is A-R-N-D-cat-clown-rainbow, AlphaFold will never even consider such a thing because its regular code restricts it to valid sequences of amino acids.

In these systems, AI is used but only as a substitute for intuition, and the rest of the code is written with the understanding that intuition can be wrong. The regular code in these systems is what guarantees correctness.
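To make the structure concrete, here is a minimal sketch of the AlphaZero-style division of labor in Python, using the python-chess library's move generator as the "regular code" and a stand-in `evaluate` function in place of the neural network. The structure is my illustration, not DeepMind's actual implementation:

```python
import chess  # pip install python-chess

def best_move(board: chess.Board, evaluate) -> chess.Move:
    """Pick the legal move whose resulting position the AI scores highest.

    `evaluate` stands in for the neural network: it maps a position to a
    score. The move generator is ordinary human-written code, so illegal
    moves are never even considered, much less played.
    """
    best, best_score = None, float("-inf")
    for move in board.legal_moves:   # regular code enumerates candidates
        board.push(move)
        score = evaluate(board)      # the AI is only asked to judge them
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best
```

Even a crude `evaluate` (say, counting material) yields a program that never makes an illegal move; the strength of the network determines how well it plays, but the legality guarantee comes entirely from the regular code.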

I have read that full self-driving systems are being engineered similarly, with over a million lines of regular code trying to keep the AI from doing anything too stupid (and even that doesn’t seem to be enough)!

Another approach along these lines, called Retrieval-Augmented Generation (RAG), is currently being used to attack the problem of inaccuracies in generated text. Here, regular code is used to try to ensure the accuracy of claims made by the AI. While it remains to be seen whether RAG will work (there are some skeptics), the key point for me is that researchers seem to have largely given up on the idea that hallucinations will disappear if you just train on enough data. Instead, they are turning to regular code (in the form of RAG) to ensure correctness.
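For the curious, here is a toy sketch of the RAG pattern. The tiny corpus and word-overlap "retrieval" are deliberately simplistic stand-ins for a real search index, and `llm` is whatever text-generation function you supply; the point is only the shape of the system: regular code chooses the sources, and the AI is asked to answer from them alone:

```python
CORPUS = [
    "AlphaZero was introduced by DeepMind in 2017.",
    "Cyberpunk 2077 was released by CD Projekt in 2020.",
    "Rice's theorem is named after Henry Gordon Rice.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    # Regular code: rank passages by word overlap with the question.
    words = set(question.lower().split())
    return sorted(CORPUS,
                  key=lambda p: -len(words & set(p.lower().split())))[:top_k]

def answer(question: str, llm) -> str:
    passages = retrieve(question)
    prompt = ("Answer using only these sources:\n"
              + "\n".join(passages)
              + "\nQuestion: " + question)
    return llm(prompt)  # the AI is constrained to sources chosen by regular code
```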

The evidence so far says, to me, that correctness is a key weakness of AI, whereas creativity, pattern matching, and other skills brought to bear by intuition are key strengths. That weakness is particularly at issue in the problem of writing code, where correctness is paramount. It doesn't have to be driving a car or writing a legal brief: almost all software would fail to be useful if it did not operate properly 99.9% of the time.

The evidence so far says to me that we will continue to need human programmers to write correct code for usable programs. Making sure that the code will operate correctly in all cases is what programmers are paid for and what they spend most of their time thinking about while coding.

English is a Bad Programming Language

A key reason that programming is hard is that computers actually do what you tell them to do. While a human driver would not follow your instructions to drive into wet cement, the computer will do it. To make sure the program always works properly, you have to foresee and think through every case that might happen. That difficult mental task is, in my opinion, the core skill of programming.
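Here is a tiny example of what failing at that task looks like; the function and its oversight are mine, chosen for illustration:

```python
def average(values):
    # Works fine on the cases the author had in mind...
    return sum(values) / len(values)

print(average([3, 4, 5]))  # 4.0, as expected
print(average([]))         # ZeroDivisionError: the case nobody thought through
```

A human assistant asked for the average of no numbers would push back; the computer happily divides by zero.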

Thankfully, one problem programmers do not face is ambiguity about what words mean. Programming languages have precise specifications so that the programmer and the computer always agree on what is being asked. Human languages, on the other hand, are often ambiguous. Phrases like “I saw her duck” and “slow children at play” have multiple valid meanings. Using human languages to try to describe what you want the computer to do would make correctness harder, not easier, in my opinion.

In addition to being imprecise, human languages are also not especially concise, particularly when it comes to mathematical operations. Sure, I could write “subtract b from a, then add c to d, and then multiply those two numbers together”, but I prefer to write just “(a − b) × (c + d)”.
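The same computation as code makes the point: one short line, with exactly one possible reading:

```python
def f(a, b, c, d):
    # "subtract b from a, then add c to d, and then multiply those together"
    return (a - b) * (c + d)

print(f(10, 4, 1, 2))  # (10 - 4) * (1 + 2) == 18
```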

Being able to use English might feel like a win, initially, because you could program without learning any new syntax, but the loss of productivity due to imprecision and verbosity would eventually catch up to you.

That said, I suspect that the appeal of English as a programming language is not primarily about avoiding some learning. Rather, I think the appeal is in imagining an AI to which we can explain in English only what we want done, leaving it up to the AI to figure out how to do it. Sadly, that is another problem that computer scientists have long known to be impossible. Alan Perlis once quipped, “When someone says, ‘I want a programming language in which I need only say what I wish done,’ give him a lollipop.”

To remind you how large the gap is between what and how, remember that the only things a CPU can do out of the box are read, write, compare, and perform arithmetic on numbers. The colors of each pixel on your monitor come from reading the numbers written down in video memory, where each color is encoded as a specific number. Each time a key is pressed, the program is allowed to run and is handed a number indicating which key was pressed. Likewise for when the mouse is clicked.

Ultimately, it is up to the programmer to figure out how to update the numbers in video memory, each time a key is pressed, so that the user sees the right image on the screen. The only tools at their disposal initially are the abilities to (1) read and write numbers from memory, (2) to perform arithmetic on numbers, and (3) to compare numbers to each other. Everything else, they have to figure out on their own.
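As a toy illustration that "drawing" is just writing numbers, here is a flat framebuffer in Python. The layout (three numbers per pixel, rows laid end to end) is a common convention, not any particular machine's:

```python
WIDTH, HEIGHT = 640, 480
video_memory = [0] * (WIDTH * HEIGHT * 3)  # 3 numbers (R, G, B) per pixel

def set_pixel(x: int, y: int, r: int, g: int, b: int) -> None:
    offset = (y * WIDTH + x) * 3                 # arithmetic locates the pixel
    video_memory[offset:offset + 3] = [r, g, b]  # writing numbers = drawing

set_pixel(10, 20, 255, 255, 255)  # one white dot at (10, 20)
```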

Excel, for example, uses some of the space in memory to write down all of the formulas that the user has typed into various cells as well as the numbers being shown in those cells. Formulas are text, but just like key presses, they can be encoded as numbers. When someone enters a number into a cell, Excel goes through each of the formulas that might have changed (it arranges them cleverly in memory so that it can quickly figure out which ones those are) and re-calculates the values of those formulas. Then, for each cell that changed, it finds the part of video memory that is displaying the pixels that trace out the number in the cell and updates them. Each step of that process is a thousand times more complex than I have just described, but that is the basic idea.
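A toy version of that idea, with the "clever arrangement" reduced to a hard-coded dependency list and the display step omitted (real Excel is, as noted, vastly more elaborate):

```python
cells = {"A1": 3, "A2": 4}                      # numbers typed by the user
formulas = {"B1": lambda c: c["A1"] + c["A2"],  # "=A1+A2"
            "C1": lambda c: c["B1"] * 2}        # "=B1*2"
recompute_order = ["B1", "C1"]                  # B1 must be updated before C1

def set_cell(name, value):
    cells[name] = value
    for cell in recompute_order:                # re-calculate dependent formulas
        cells[cell] = formulas[cell](cells)

set_cell("A1", 10)
print(cells["B1"], cells["C1"])                 # 14 28
```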

If Jensen is suggesting you can just tell the computer, in English, what you want the program to do and it will write a program to do that, then he is imagining not only that the AI will perfectly understand what you mean but also that it will figure out on its own what information should be kept in memory; how that information should be encoded as numbers; how those numbers should be arranged so it can quickly figure out, after each key press, which ones to update; and finally, how to translate those new numbers into a new picture in video memory. There is no fixed formula for how to do these things. Each requires ingenuity and careful planning.

Let me note that that description also assumes we are writing a simple, desktop application that has no communication with any other computers. If we want to talk to the internet or, even worse, write a program that runs on a server on the internet, then it becomes vastly more complicated.

Hopefully, that description helps explain what an extraordinary claim it would be to suggest that AI could translate an English description of what you want done into a working program. Once again, this is something that is not only impossible in theory (provably impossible to do correctly in all cases) but also likely impossible in practice.

AI and the Senior Programmer

When writing non-trivial programs, other properties beyond correctness become important. If we are thinking about using AI to write programs, we would need to consider those other properties as well.

One such property is changeability. Most programs in use are constantly being changed, hopefully to improve them. Senior programmers know this, so they write their code in such a way that the most likely kinds of changes will be easy to make.

So far, code generated by AI in response to prompts scores very poorly on this metric. Rather than being changeable, it appears to be very hard to change.

Another important property is understandability. It may not matter how the code works as long as it works, but sometimes it doesn’t work. Finding the bug in those cases requires a holistic understanding of the program. Senior programmers know this so they write code that is simple and easy to understand.

I think I can speak for all senior programmers when I say that the idea of debugging a large program made up of code that no human understands sounds like one of the circles of hell from Dante’s Inferno. While liars hang by their tongues in the eighth circle, the ninth circle houses all the bad programmers who must spend eternity debugging a program that is an alien amalgamation of all the buggy functions they wrote during their lives. (Shudders)

AI as a Junior Programmer

While AI seems unlikely to write large programs on its own or to operate at the level of a senior programmer any time soon, I don’t mean to say that AI can’t be of any use in writing programs. AI could plausibly operate as a junior programmer to whom you give small programming tasks and whose work you need to double-check. (Indeed, I’m working on a research project right now that fits within this framework.)

In an article I wrote last year, I included a tweet from Ben Hunt (EpsilonTheory on X) who said that ChatGPT makes junior knowledge workers “obsolete”. That is a nice generalization of what I’m seeing with regard to programming: AI is useful in cases where it saves you time even after accounting for having to carefully double-check its work.

As I said in the article, however, this raises concerns over how we train the next generation of senior programmers. The current generation became senior by first working as junior programmers. How are new entrants able to gain the experience of a junior programmer if we replace them with AI?

I previously thought the solution might simply be more time in school, but quite frankly, the problems ChatGPT is causing for schools are equally if not more difficult. Nowadays, I would summarize the situation as follows:

Homework is the Killer Application of ChatGPT

Above, we discussed the problems AI has with correctness. That’s a big problem in most areas of life, but school is the one area where you can submit work filled with errors and still get most of the credit! Teachers are used to reading answers that lack an understanding of key ideas, and yet, they still give lots of credit because making mistakes is part of learning.

Most of the problems we need humans to solve are novel. AI, on the other hand, works best on problems that are similar to lots of solved examples on the internet. Most of the problems we need humans to solve also require understanding of their unique context, while internet examples tend to be problems that are easy to state. The main category of problems that are both easy to state and have lots of solved examples on the internet is homework problems.

Last summer, we saw evidence that a notable amount of ChatGPT’s usage is for homework: its traffic reportedly dipped when the school year ended and recovered when classes resumed.

Looking beyond the sheer number of searches, the text of the search queries for ChatGPT also suggests homework as a top use case.

It feels to me that, outside of teachers, most of society has not fully grasped the impact that ChatGPT is having on education and training. To help with that, try imagining a world where all students have access to a tool that can adequately solve any homework problem but is largely useless otherwise. Education would obviously be hugely impacted by that, and it’s not far off from our current reality. AI for writing code, for example, will likely be able to perfectly solve most homework problems long before it will be of use for helping non-programmers write complete applications.

Conclusion

According to some, the days when we no longer need programmers are just around the corner. Soon, you’ll be able to describe in English what you want the program to do, and the AI will magically figure out how to do it. As I discussed above, we are unlikely to ever see such days. In fact, creating such an AI is likely impossible, not only in theory but also in practice.

In contrast, the days when we still need human programmers to solve most problems, but AI can perfectly solve any homework problem, are probably just around the corner. Those days will be a significant challenge for the education and training of new programmers, and at present, it is unclear how we will cope with that challenge.
