I have tried to involve LLMs in two projects. In both cases, I gave up in frustration, for similar reasons. They put too much weight on their own background knowledge and too little weight on what I want them to do.
The first project was grading op-eds, according to my criteria. I wanted the LLM to give credit to an author who shows an understanding of the other side’s point of view by:
playing devil’s Advocate, asking questions that the other side might ask
thinking in Bets (speaking in probability rather than certainty)
spelling out Caveats, meaning potential problems with what the author was advocating
Debating the best points on the other side rather than poking at the other side’s weak points
showing an Open mind by saying what might lead the author to change his mind
being as skeptical of Research that supports the author’s point of view as he would be of research that undermines it
Steel-manning the opposing point of view
Using ChatGPT, I created a GPT that was supposed to do this. What ChatGPT kept wanting to do instead was grade essays on “balance,” meaning not being one-sided. That is not what I meant. You can have a strong point of view, as long as you are writing rigorously, arguing against the best case for the other side and avoiding using insults as a debating tactic. I kept trying to explain this to ChatGPT, and it kept doing things differently.
More recently, I tried to create a “clone” using LLMs. For example, Delphi let me give it dozens of my essays as background information. The test I would give for Delphi, or for generic chatbots like ChatGPT or Claude, was to ask it how Arnold Kling would explain the financial crisis of 2008. I was frustrated that what came back included only some things I would say, plus a lot of things that other economists or journalists have said, some of which I completely disagree with. Claude was the best of the ones I tried at sticking close to my interpretation of the crisis.
Think of the LLM as giving answers that are a weighted average of two sources: its general background knowledge, and the specific instructions and past writing that I give it. The weight the LLM puts on its background knowledge is too high, and the weight it puts on my instructions and my own past writing is too low.
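The weighted-average metaphor can be made concrete with a toy sketch. This is purely an illustration of the complaint, not a claim about how an LLM actually works internally; the "opinion distributions" and numbers below are made up:

```python
# Toy illustration of the weighted-average metaphor: blend two
# "opinion" distributions over competing claims. alpha is the weight
# the model puts on its background knowledge; (1 - alpha) is the
# weight on my own writing. All numbers here are invented.

def blend(background: dict, mine: dict, alpha: float) -> dict:
    """Weighted average of two claim->credence dictionaries."""
    claims = set(background) | set(mine)
    return {c: alpha * background.get(c, 0.0) + (1 - alpha) * mine.get(c, 0.0)
            for c in claims}

background = {"It was a liquidity crisis": 0.7, "It was housing policy": 0.3}
mine       = {"It was a liquidity crisis": 0.1, "It was housing policy": 0.9}

# With alpha = 0.8, the blend mostly echoes the background view,
# even though my own writing says nearly the opposite:
print(blend(background, mine, alpha=0.8))
```

The frustration, in these terms: alpha is fixed too close to 1, and nothing in the prompt reliably turns it down.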
LLMs are not adapted to my management style, so I am frustrated with them as employees. I appreciate the fact that their background knowledge allows me to communicate with them in plain English. But I wish that the LLM could empty its vessel of domain-specific background knowledge, so that I could fill that vessel with what I want.
Back when I was a manager at Freddie Mac, I had good luck selecting “empty-vessel” employees. To work on financial models, I would pick someone with a computer programming background, not someone with an MBA.
I should note that I only managed a small team of people, never rising to the level of managing other managers, which is what you have to do to be a high-level manager.
“Arnold, you don’t know how to manage people. You just tell them what you want and then they go do it.” A very astute colleague, Mary Cadagin, told me this once, and it was true. I had a strong view of what I wanted, but I could never articulate exactly how to get there.
What I needed were employees who would hear me sketch out a vision, press me for clarification, and then deliver what I wanted. Coming in with zero knowledge of finance was fine. The key was being able to draw out my vision and to keep coming back to me with questions to clarify what I wanted.
An LLM is not the employee that I want. I don’t want any of your background knowledge in my domain. I want all of the domain knowledge to come from me. I want you to turn my knowledge into computer code.
I think of the power of an LLM as its ability to act as a natural-language computer interface. In order to do that, it has to read a lot of text to learn natural language. But then in order to execute my vision, I need it to forget the content of what it read about economics or social science or grading essays and just find out what I want. That may be a hard needle to thread, especially because most people value the domain knowledge of the LLM. They want it to be encyclopedic.
I would love an LLM tool that did what you describe. I don't want zero general domain knowledge (it would be good if it had a sense of the style of economics articles, for example, or general principles). Maybe what we really need is a tunable LLM, where we can select the weight on our inputs versus its training, from 0 to 100.
In the meantime, ChatGPT is an awesome generator of R code, a terrific explainer of problems with programs, an endlessly patient answerer of questions, and a really good summarizer of long documents. Its writing has improved a lot and, if you invest some time in the prompt, it can produce pretty good first drafts of routine things.
There is a market for what you describe, and I am confident we will get there eventually.
My best luck with LLMs so far is to upload documents and then talk about the documents.
I can also upload almost six months' worth of a newsletter I have written, which fits in the context window, and then I can get Claude to write in my style. I can basically tell it what I want to write and it will do a pretty good job at an initial draft. Then I'll say, I don't like this section, it should say this instead. It never gets frustrated at continued edits, so I can keep chipping away, and it's often still faster than writing it from scratch myself. And many of these co-written articles end up the most popular of a weekly newsletter, so the actual audience likes them too.
So maybe you could create a document that describes each of your grading methods, then upload what you're grading along with the documents that specify those methods. It may stick closer to the task, since those are context-window documents and not just prompts, which it doesn't reliably follow.
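One way to picture the commenter's suggestion: keep the rubric in a document that travels with each essay, rather than only in the GPT's instructions. The rubric keywords below come from the essay above; the prompt wording and the function are a hypothetical sketch, not a tested recipe:

```python
# Sketch of the suggested workflow: combine a rubric document and the
# essay being graded into one block of context. The rubric criteria are
# from the post above; the delimiters and wording are illustrative.

RUBRIC = """Grade the op-ed ONLY on these criteria (not on 'balance'):
- devil's Advocate: asks questions the other side might ask
- Bets: speaks in probabilities rather than certainties
- Caveats: spells out potential problems with the author's position
- Debating: engages the other side's best points, not its weak ones
- Open mind: says what might lead the author to change his mind
- Research skepticism: equally skeptical of supporting and opposing research
- Steel-manning: states the opposing view in its strongest form
"""

def build_grading_prompt(essay_text: str) -> str:
    """Assemble rubric document plus essay into a single context."""
    return (f"<rubric>\n{RUBRIC}</rubric>\n\n"
            f"<essay>\n{essay_text}\n</essay>\n\n"
            "Grade the essay against the rubric only.")

prompt = build_grading_prompt("Sample op-ed text...")
```

Whether this actually keeps the model from drifting back to grading on "balance" is an empirical question, but putting the criteria in the per-request context is a cheap experiment.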
I also go back and forth between Claude, Google Notebook and ChatGPT regularly because they are all three always changing and getting better so suddenly one might start working better than another. Claude was doing best for a while and still writes best, but ChatGPT is better for getting information out of a lot of documents, so sometimes I’ll write an outline version of an article with ChatGPT and then write the actual article with Claude from the outline.
I was even able to get Claude to write my stuff in the style of a better writer by giving it a bunch of samples and saying, make it sound more like this writer, while working from my outline of the article and documents with the facts, so it doesn't just make things up or get too generic.