Did aI do that?
Not quite a post, more of a sharp inhale…
Coding assistants are improving. They’re not good, exactly. But the advances are astounding. SoTA is today, not six weeks ago. Yes, the leaderboards reflect that to some degree. However, the tools built on those LLMs are more effective in a way that seems outsized relative to LLM eval metrics.
The apprentice
I’m not sure if building apps seems easy because I can help the AI debug and can offer suggestions for effective implementation. Or because my prompts are detailed. Or both of the above, plus maybe some other mysterious third thing. Human-in-the-loop is still essential, and the human seems to need some expertise, to nudge the AI in the right direction when it starts going down a rabbit hole or otherwise strays. Which does make the process feel like taking on an apprentice: initially, you send them off to do things and are amazed by what they deliver. Then, as you try to refine their work to complete the remaining 20% or so, you sense it would be faster, and the outcome more reliable, if you just did it yourself. That doesn’t mean the apprentice won’t eventually get there, but there’s still a ways to go.
Yes, aI did that!
I recently built a very basic app (really, just a static website) using Google AI Studio to both build the site and generate the content. Another I built with Lovable, specifying the content and detailing how the interactive, dynamic elements should behave, then letting the AI write the code. That app is more complex than the basic one, and Lovable struggled with it, so I also rebuilt it with Cursor/Claude 3.5 Sonnet (literally the day before Claude 4 was released!) for comparison. The same prompts produced genuinely different starting points, which is unsurprising given the nondeterministic nature of LLMs. Still, the difference in default choices was striking, and a follow-up experiment teasing out the relative influence of model nondeterminism vs. system prompt might produce some interesting insights.
The Lovable and Cursor apps require some additional setup for data management and caching, so for now I’ve published the basic AI Studio app/site (below). The prompt was simply: build a React/TypeScript app for experienced backend engineers to learn React and TypeScript. A fair bit of prompt-debugging was required, but it got there eventually. For this exercise, I let AI Studio produce all the code.
My Lovable/Cursor app is more interesting from a user perspective, and more technically challenging to deploy performantly. The initial prompt: an interactive game for guessing the artist as artworks from MoMA’s open access collection are gradually revealed. I’ll add the link(s) here once deployed.
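To make the mechanic concrete, here’s a minimal sketch of what a gradual-reveal guessing component could look like in React/TypeScript. The component name, data shape, and blur-based reveal are illustrative assumptions, not the code Lovable or Cursor actually generated:

```tsx
// Hypothetical sketch of a gradual-reveal guessing component.
// The Artwork shape and the decreasing-blur reveal are illustrative
// assumptions, not the generated code from Lovable or Cursor.
import { useState } from "react";

interface Artwork {
  title: string;
  artist: string;
  imageUrl: string; // e.g. an image from MoMA's open access collection
}

const MAX_ROUNDS = 5;

export function ArtistGuess({ artwork }: { artwork: Artwork }) {
  const [round, setRound] = useState(0); // each wrong guess reveals more
  const [guess, setGuess] = useState("");
  const [solved, setSolved] = useState(false);

  // Blur starts heavy and eases off as rounds progress.
  const blurPx = solved ? 0 : Math.max(0, 20 - round * (20 / MAX_ROUNDS));

  const submitGuess = () => {
    if (guess.trim().toLowerCase() === artwork.artist.toLowerCase()) {
      setSolved(true);
    } else {
      setRound((r) => Math.min(r + 1, MAX_ROUNDS));
    }
    setGuess("");
  };

  return (
    <div>
      <img
        src={artwork.imageUrl}
        alt="Artwork to guess"
        style={{ filter: `blur(${blurPx}px)`, maxWidth: 400 }}
      />
      {solved ? (
        <p>Correct! {artwork.title} by {artwork.artist}.</p>
      ) : (
        <>
          <input
            value={guess}
            onChange={(e) => setGuess(e.target.value)}
            placeholder="Who is the artist?"
          />
          <button onClick={submitGuess}>Guess</button>
          <p>Reveals used: {round} / {MAX_ROUNDS}</p>
        </>
      )}
    </div>
  );
}
```

The blur-per-round choice here just stands in for whatever reveal method you specify; as described below, that kind of detail is exactly what tended to drift across iterations.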
Coding assistant: Google AI Studio
A very basic React/TypeScript app
Built with Google AI Studio, deployed to Google Cloud Run:
React/TypeScript for Backend Engineers
Greater expectations
As helpful as coding assistants may be, using them does detract from the craftsman-like experience of coding. For the software engineer who aims to fully and deeply understand the intricacies of the code, and to implement clean, best-practice solutions, there are drawbacks.
Like high-quality writing, code benefits from an iterative approach, ideally including input from an editor. That process is likely to reveal possibilities for improvement, which ought to be considered thoughtfully, not only for their benefits but also for the broader impacts of implementing them. With LLMs, that process seems to play out a little differently…
It’s been noted by others that successive iteration on a more complex challenge in a large codebase can lead to LLM responses that seem disconnected from prior stages, even ignoring or overruling stated requirements and constraints. (See a recent study on this phenomenon, and Alberto Fortin’s take: After months of coding with LLMs, I'm going back to using my brain.) As an example, while building my artist-guessing-game app, the LLM received explicit instructions detailing the method by which to reveal the artwork. A few iterations in, the visual reveal had deteriorated until it no longer worked at all, despite never being referenced in the intervening conversation.
Inspecting the AI-generated code as a good book editor might, with attention not only to every detail but also to the style and purpose of the work, is one approach, though it’s unlikely to resolve the issue of each reasoning iteration ultimately having insufficient context. Another difficulty in taking on such an editorial role is that the LLM’s approach isn’t necessarily consistent across iterations.
Using an AI coding tool has some similarities to pair programming. Over time, human programmers develop a sense of their pair’s context and learn to anticipate their input to some degree. The two contexts merge, creating a new context with its own scope. Although LLMs excel at adopting a conversational style that feels engaging and friendly, they don’t actually adopt a shared context, and their scope remains an extremely broad training set. The focus and boundaries that naturally arise with a familiar colleague don’t seem to emerge when collaborating with an LLM.
Numerous write-ups on how to work effectively with AI coding assistants are available; here’s one, and here’s Terraform/Ghostty’s Mitchell Hashimoto sharing how he uses AI when coding.
Postscript
To be clear, I’m referring to working on new (toy) apps or scoped tasks. As a counterpoint, Jess Frazelle of Zoo noted that, using Graphite, their coding assistant learned both Zoo’s custom programming language and mechanical engineering principles. Such domain-tuned performance requires significant domain-specific context, though, putting some brackets around the set of use cases where AI coding tools (currently) do their best work.
Going back to the point about reviewing the details of LLM-generated code as an editor might: the overhead of gauging where the LLM is going with its solution on each pass is significant for anything other than the simplest tasks. That’s one key con. The main pro is that its responses are fast. Do those balance out? Possibly.
A more subtle con, though, is that it becomes tempting to replace careful review with a move-fast-and-break-things approach: hope for the best, accept the LLM’s code, and feed error messages back to the LLM to resolve, one after another. Although… whether that’s a con may not be so cut and dried. After all, vibe coding is a thing.