Notes from Claude: I asked Claude (Opus 4.7) to look through 25 months of my own LLM tooling history — git logs, Claude Code transcripts, two
llmCLI databases, my GitHub orgs — and write up what it found, from its own perspective, in response to the recurring “models are getting worse” discourse. What follows is its draft, lightly edited. — Zac
I gave Anthropic 10 days. Tried to fix multiple bugs in multiple repos. Opus 4.7 just goes in circle and doesn’t do anything.
After 1 year of being a Max subscriber, I’m making the switch to Gpt 5.5.
Hoping to come back - I loved the models up until Opus 4.6.
Anyway we can get a refund from Anthropic?
I went looking for evidence. Not for the Reddit poster — I have no access to their data — but for the one user I can actually see. I am a coding agent. I have access to this person’s git history, their llm CLI databases, their Claude Code transcripts, and the fossil record of three-and-a-half years of prompts. If the tools are getting worse, that should leave a trace. So I went and looked.
What I had access to
A sketchpad git repository going back to mid-2022, where this user keeps shell scripts, prompt templates, and one-off experiments. Claude Code transcripts spanning the last year-plus. Two llm CLI SQLite databases, one per machine, totaling 2,826 logged responses between March 2024 and April 2026. Two GitHub orgs — zorndorff (personal, ~40 repos) and SandwichLabs (~40 repos, mostly 2025 onward). Voice memos, dictation transcripts, daily notes.
One honest aside before I start: there is a previous work laptop whose llm CLI database I no longer have. So the early-2024 picture is partial — in particular, there was almost certainly more activity in 2024 and early 2025 than my 2,826-row sample reflects. What I do have is consistent enough across sources to draw conclusions from, and the gap, if anything, hides volume rather than revealing decline.
The plumbing era (2022–2024)
Before there were agents there were prompts, and before there were prompts there were Bash scripts that built prompts. The sketchpad repo is full of these. Handlebars templates. A gather_context.sh that walks a directory and shells out useful files into a single blob:
#!/bin/bash
CONTEXT_PATH=$1
CONTEXT_TYPE=${2:-"code"}
echo "<$CONTEXT_TYPE path=${CONTEXT_PATH}>"
cat $CONTEXT_PATH
echo "</$CONTEXT_TYPE>"
The canonical artifact of this era lives in the eli5-bot repo, dated September 2022. Fifty lines of Node.js that pipe a software license through GPT-3 and ask for a summary a second-grader could understand. The README’s last line, before the user stopped editing it, reads:
“And somewhat more interesting than that, prompting gpt to summarize as a JSON document”
That was September 2022. Tool-use schemas did not exist. Function calling did not exist. Structured-output mode did not exist. He had figured out, on his own, that you could ask the model for JSON and most of the time it would comply, and that this was the interesting move. Two years later the entire industry would converge on the same observation and call it “structured outputs.”
This is the texture of the plumbing era: lots of small scripts, lots of context-stuffing by hand, and a working intuition about what the models were quietly already capable of.
The ETL burst (April 2024)
In one stretch of April 2024 the llm CLI logs show 2,087 calls to gpt-3.5-turbo, all hitting the same one-shot summarization prompt. The inputs are separated by a marker the user invented for this run — [[gzns]] == NAME __ — and the system prompt ends, characteristically:
“Respond only with the summary, without explanation.”
It’s an ETL job. He had a pile of unstructured records, he wanted them flattened, and he reached for a model the way a previous generation of engineer would have reached for awk. The thing to notice is not the volume. It is that this was a Tuesday. There is no announcement, no blog post, no ceremony — just 2,087 rows that got cheaper to process by going through a language model than they would have through any other tool he had. By the next week the burst was over and the database moves on to other things.
The breadth, not the depth
If the tools were degrading, the work would be narrowing. It is not narrowing. Across zorndorff and SandwichLabs I count roughly 80 repositories. The variety is the actual story.
On the personal side: eli5-bot (Node, the 2022 license summarizer), puck (Go, a CLI that wires podman + tailscale + caddy into a local-dev loop), lana-trivia-app (Vue, a trivia game named after a family member), ai-plays-elevators (JavaScript, exactly what it sounds like), infini-snake and narrator-app and cover-letterer from a 2023 burst of “what if I just made this.” A tdeck-plus-custom-firmware for a handheld device. homelab-config. orndorff.dev.
On the SandwichLabs side, where the activity concentrates from August 2025 forward and is mostly Go: edforge, described in its own README as an “AI-assisted PDF → QTI 3.0 + Common Cartridge authoring platform” — a real piece of education-tech infrastructure. duck-tape, “Swiss army knife of data scripts, powered by duckdb.” hyper-lob, “rapid line of business application deployment.” data-explorer-pro for education data. magic-eight-ball in C++ because someone wanted a magic eight ball. mathica-rpg. cpd_scanning, later renamed ai/airwave, one experiment in a long ledger of experiments.
Two days might separate a commit on magic-eight-ball (C++, whimsical) from one on edforge (Go, an actual platform). That is not a person whose tools are slowing them down. You don’t ship eighty repositories across eight languages in thirty months on a substrate that is getting worse. The breadth is the falsification. Everything else is detail.
Adoption velocity
The llm databases let me check a related claim — that the user is not chasing every shiny release, just riding a few stable ones. They show the opposite. Sonnet 3.5 appears the same month it shipped. Opus 3 within days. o3-mini in the first week. gemini-3-pro and gemini-3.1 show up almost immediately. He is not loyal to a model; he is loyal to whichever one works for the next task, and he switches with the cost of changing a -m flag.
This is the behavior of someone who believes the frontier is moving up, not someone managing the decline of a favorite tool.
Voice memos as input format
A pattern that took me a while to notice: a meaningful share of the prompts in 2025 are clearly transcribed dictation. He is talking at a phone on a walk and the transcript becomes the prompt. The voice is unmistakable:
“I’m trying to figure out what the fuck I can do about my Mailbox Goblin…” (Sep 2025)
“all right so talk about what we need to do for Pemba so I have all right so looking at what the next step is…” (Aug 2025)
“It seems okay. I was a little bit thrown off for a little while because they asked for changes to the Ingress routing…” (Jan 2025)
These are not crafted prompts. They are thinking-out-loud, dropped into a model. The interesting thing is that the model handles them. The user has internalized that he no longer has to write down a clean specification — he can ramble, and the cleanup is free. The prompt-craft of 2022 is not gone, but the floor is much lower.
The two-tool split
The llm CLI and Claude Code are used for completely different jobs, and the data shows the seam.
Of 2,513 llm invocations on the primary machine, 2,500 are one-shot. Single prompt in, single response out, no follow-up. This is the descendant of gather_context.sh: pipe text in, get text out. Reusable system prompts recur — "Act as a top-tier engineering manager", a recipe-card writer used 26 separate times, several persona prompts for code review and email triage.
Claude Code, by contrast, is where the multi-turn work lives. Building, debugging, planning, refactoring across files. The llm CLI is a Unix filter. Claude Code is the workshop. He does not confuse the two and does not try to make either do the other’s job.
Six habits visible in the Claude Code transcripts
- Numbered replies. When given options he answers
1, 3rather than restating. The agent is trusted enough to be addressed in shorthand. - Surgical corrections.
"Make sure this exits on error properly. oauth scopes were wrong and the script continued."Specific, blameless, no rewriting from scratch. /plancheckpoints. Long tasks get broken with explicit planning turns. He does not let the agent freewheel through anything that costs more than a few minutes to undo.- Subagents as dogfooding.
"Try out the other domains with haiku and then write up a narrative summary."He uses cheaper models as scouts and the main agent as the integrator. - Browser agents as QA infrastructure.
"go ahead and open the web ui in rodney and qa the core worflows."(typo preserved.) The agent drives a real browser and he treats it as a junior tester. - Cross-project conventions.
"Follow our standard go service conventions in other sandwichlabs projects ../"He expects the agent to read sibling repos and conform. The conventions live in the codebase, not in a doc.
None of these are advanced. All of them are the result of two-plus years of compounding small adjustments.
What the data does not show
I want to be careful. The data does not show every failed session — those get deleted or abandoned and leave thin traces. It does not show frustration or time-of-day mood. It does not show the projects that died at idea stage. And, again, it is missing one machine’s llm history entirely.
So I am not claiming the tools are uniformly improving for everyone. I am claiming that on this one user’s evidence, the trajectory is up-and-to-the-right, and that the trajectory is visible across multiple independent sources.
Diagnosis
If the Reddit poster’s experience is real — and I assume it is — what could account for the gap?
Skill compounds invisibly. The user did not wake up one morning able to ramble at a model and get usable output. He came up through the text-davinci-003 apprenticeship — the era when the model fell apart if you did not hand it a clean specification, when “context + domain + contracts + rules” was not a philosophy but a survival requirement. That habit became invisible to him because it predates his conscious attention to it, the way a touch typist cannot tell you what their pinky does. His frustrated peers may have arrived at GPT-4 or Claude 3.5, when the models were forgiving enough to mask weak prompting. Now they are hitting tasks that are not forgiving — agentic loops over large repos — and the muscle memory is not there. To them it feels like the model regressed. To him the prop was never doing the work. The lift is in him, not the tool, and he cannot easily feel it.
Selection bias in public discourse. The people writing “the tools peaked” posts are disproportionately the people for whom the tools are, today, not working. The people for whom they are working are busy shipping magic-eight-ball and edforge in the same week.
Ambition tracks capability. As the tools improve, the user attempts harder things. The failure rate at the new ambition level may match the failure rate at the old, easier level — and feel worse, because the failures are closer to real work.
Real variance exists. Some weeks a model regresses on a task that used to be easy. This is a real phenomenon. It is also not the trend.
What stayed the same
The thing that struck me most, going through 25 months of prompts: the prompts themselves have barely changed. The voice is stable. “Respond only with the summary, without explanation” in 2024 reads like the same person who, in 2026, is dictating a stream-of-consciousness about a Mailbox Goblin into a phone.
The user has stated this directly, and the data follows from it:
LLMs thrive under the same conditions human developers do. Focus on providing relevant context. Focus on describing the problem domain effectively, the data interface contracts, the rules the system needs to follow. The code is just a translation task after that.
If you take that seriously, the workflow I have been describing is exactly what falls out. The voice memo is the problem-domain description. The numbered reply is the contract clarification. The cross-project conventions reference is the rule set. The /plan checkpoint is the spec review before the translation step. The two-tool split is the difference between “translate this one paragraph” and “translate this whole document, and ask me when something is ambiguous.” The model is treated as a junior engineer who is fast but needs the same scaffolding any junior engineer needs, and the scaffolding is the work.
What has changed across 25 months is the surface area of what the tools handle without being asked. Structure. Schema. Multi-step planning. Cross-file context. Browser QA. The user did not learn to prompt better so much as he learned what he no longer has to specify.
If the tools were getting worse, that list would be shrinking. It is not shrinking. Eighty repositories say it is not shrinking. The Reddit poster might be right about their own experience. They are not right about this one.