Did aI do that?
Not quite a post, more of a sharp inhale…
Coding assistants are improving. They’re not good, exactly. But the advances are astounding. SoTA is today, not six weeks ago. Yes, the leaderboards reflect that to some degree. However, the tools built on those LLMs have improved in a way that seems outsized relative to the gains in LLM eval metrics.
The apprentice
I’m not sure if building apps seems easy because I can help the AI debug and can offer suggestions for effective implementation. Or because my prompts are detailed. Or both of the above plus maybe some other mysterious third thing. Human-in-the-loop is still essential, and the human seems to need some expertise, to nudge the AI in the right direction when it starts going down a rabbit hole or otherwise strays. Which does make the process feel like taking on an apprentice — initially, you send them off to do things and are amazed by what they deliver. Then, as you try to refine their work to complete the remaining 20% or so, you sense it would be faster, and the outcome more reliable, if you just did it yourself. That doesn’t mean the apprentice won’t eventually get there, but there’s still a ways to go.
Yes, aI did that!
I recently built a very basic app using Google AI Studio. Another, slightly more complex, I built with Lovable, then rebuilt with Cursor/Claude 3.5 Sonnet (literally the day before Claude 4 was released!) to compare processes and outcomes. For the latter two, the same prompts produced genuinely different starting points — unsurprising, given the nondeterministic nature of LLMs. Still, the difference in default choices was notable, and a follow-up experiment teasing out the relative influence of model nondeterminism vs. system prompt might produce some interesting insights.
The Lovable and Cursor apps require some additional setup for data management and caching, so, for now, I’ve published the basic AI Studio app (below). That simple app is a not-particularly-creative but bespoke (for backend devs) crash course/refresher on React and TypeScript. Not comprehensive, but it loosely covers the essentials. The prompt was simply: build a React/TypeScript app for experienced backend engineers to learn React and TypeScript. AI Studio tripped over a few bugs that were fairly obvious to me from reading the code, but it got there eventually. Oddly, it made predictable errors — like placing the control for closing a menu directly over the menu title text. (An aside: As it happens, I didn’t review the guide after building it, and subsequently stumbled a bit wrapping my head around a Next.js/React/TypeScript code base. Seeing the guide again while writing this post, I would have had more clarity from the get-go had I dogfooded my own ‘product’ — as always, experience is the best teacher.)
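To make that menu bug concrete, here's a minimal sketch of the pattern (the component and prop names are my own placeholders, not the generated code): a close control absolutely positioned at the same corner where the title renders will land right on top of it.

```tsx
// Minimal sketch of the kind of overlap bug described above
// (hypothetical component and prop names, not the generated code).
import React from "react";

type MenuProps = {
  title: string;
  onClose: () => void;
  children: React.ReactNode;
};

export function Menu({ title, onClose, children }: MenuProps) {
  // Bug: top/left of 0 places the close control exactly where the title
  // renders; anchoring it to the opposite corner (e.g. right: 0) avoids the overlap.
  const closeButtonStyle: React.CSSProperties = {
    position: "absolute",
    top: 0,
    left: 0,
  };

  return (
    <div style={{ position: "relative", padding: "1rem" }}>
      <button aria-label="Close menu" onClick={onClose} style={closeButtonStyle}>
        ×
      </button>
      <h2 style={{ margin: 0 }}>{title}</h2>
      {children}
    </div>
  );
}
```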
My Lovable/Cursor app is more interesting from a user perspective, and more technically challenging to deploy with good performance. The initial prompt: an interactive game for guessing the artist as artworks from MoMA’s open access collection are gradually revealed. I’ll add the link(s) here once it’s deployed.
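For flavor, here is a rough sketch of the core mechanic (the names, the blur-based reveal, and the artwork shape are my own placeholders, not the code Lovable or Cursor produced):

```tsx
// Rough sketch of a gradual-reveal round (placeholder names and approach):
// the artwork starts heavily blurred and sharpens a step with each wrong guess.
import React, { useState } from "react";

type Artwork = {
  title: string;
  artist: string;   // the answer the player is guessing
  imageUrl: string; // e.g. an image from MoMA's open access collection
};

const MAX_STEPS = 5;    // fully revealed after five wrong guesses
const MAX_BLUR_PX = 20; // starting blur

export function RevealRound({ artwork }: { artwork: Artwork }) {
  const [step, setStep] = useState(0);
  const [guess, setGuess] = useState("");
  const [solved, setSolved] = useState(false);

  // Blur decreases linearly as the reveal progresses.
  const blurPx = solved ? 0 : MAX_BLUR_PX * (1 - step / MAX_STEPS);
  const finished = solved || step === MAX_STEPS;

  function submitGuess() {
    if (guess.trim().toLowerCase() === artwork.artist.toLowerCase()) {
      setSolved(true);
    } else {
      setStep((s) => Math.min(s + 1, MAX_STEPS)); // wrong guess: reveal a bit more
    }
  }

  return (
    <div>
      <img
        src={artwork.imageUrl}
        alt="Artwork to guess"
        style={{ filter: `blur(${blurPx}px)`, maxWidth: "100%" }}
      />
      <input
        value={guess}
        onChange={(e) => setGuess(e.target.value)}
        placeholder="Who is the artist?"
      />
      <button onClick={submitGuess} disabled={finished}>
        Guess
      </button>
      {finished && (
        <p>
          {solved ? "Correct!" : "Answer:"} {artwork.artist}, “{artwork.title}”
        </p>
      )}
    </div>
  );
}
```

A blur step is just one way to do a gradual reveal; it's the iterative refinement of exactly this kind of logic that later drifted, as described below.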
Coding assistant: Google AI Studio
A very basic React/TypeScript app
Built with Google AI Studio, deployed to Google Cloud Run:
React/TypeScript for Backend Engineers
Greater expectations
As helpful as coding assistants may be, using them does detract from the craftsman-like experience of coding. For the software engineer who aims to fully and deeply understand the intricacies of the code, and to implement clean, best-practice solutions, there are drawbacks.
Like high-quality writing, code benefits from an iterative approach — including, ideally, input from an editor. That process is likely to reveal possibilities for improvement, which ought to be weighed thoughtfully not only for their benefits but for the broader impacts of implementing them. With LLMs, that process seems to play out a little differently…
It’s been noted by others that successive iteration on a more complex challenge in a large code base can lead to LLM responses that seem disconnected from prior stages, even ignoring or overruling stated requirements and constraints. As an example, while I was building my artist-guessing-game app, the LLM received explicit instructions detailing the method by which to reveal the artwork. A few iterations in, the visual reveal had deteriorated until it no longer worked at all, despite never being mentioned in the iteration conversation.
Inspecting the AI-generated code as a good book editor might — with attention not only to every detail, but to the style and purpose of the work — is one approach (though not likely to fully resolve the question of which context informs each reasoning iteration). Taking that angle, I found the LLM’s approach wasn’t consistent enough to let me fluidly switch between my perspective and the LLM’s, generating more overhead than I’d encounter working with a familiar colleague.
Working with an AI coding tool has some similarities to pair programming. A perhaps underappreciated aspect of pair programming is the impact of each party’s priors and perspective. Over time, human programmers develop a sense of their pair’s context, and anticipate input to some degree. The two contexts merge, creating a new context with its own scope. In contrast, the scope of the LLM’s context is its extremely broad training set. For each iteration stage, I had the LLM’s explanation of what it was doing and why, yet I couldn’t get a read on whether it had the equivalent of a mental model of the problem, making it hard to anticipate much about its responses. Ultimately, the iterations didn’t feel like they were working with consistent background context.
Postscript
To be clear, I’m referring to working on new (toy) apps or scoped tasks. As a counterpoint, Jess Frazelle of Zoo noted that, using Graphite, their coding assistant learned both Zoo’s custom programming language and mechanical engineering principles. That demonstrates the power of domain-specific tuning — which requires significant domain-specific context, putting some brackets around the set of use cases where AI coding tools (currently) do their best work.
Going back to the point about reviewing the details of LLM-generated code as an editor might — the overhead of context-switching to figure out where the LLM is going with its solution on each pass is significant for anything other than the simplest tasks. That’s one key con. The main pro is that its responses are fast. Do those things balance out? Possibly.
A more subtle con, though, is that it becomes tempting to replace careful review with the move-fast-and-break-things approach: hope for the best, accept the LLM’s code, and feed the error messages back to the LLM to resolve, one after another. Although… whether that’s a con may not be so cut and dried. After all, vibe coding is a thing.