My Frustration with LLMs

Sep 27, 2024

I don't get along with them as employees

24 Comments

I would love an LLM tool that did what you describe. I don't want zero general domain knowledge (it would be good if it had a sense of the style of economics articles, for example, or general principles) - maybe what we really need is a tunable LLM where we can select the weight on our inputs vs its training, from 0 to 100.

In the meantime, ChatGPT is an awesome generator of R code, a terrific explainer of problems with programs, an endlessly patient answerer of questions, and a really good summarizer of long documents. Its writing has improved a lot and, if you invest some time in the prompt, it can produce pretty good first drafts of routine things.

There is a market for what you describe, and I am confident we will get there eventually.

(1) An interesting thought is that this phenomenon is a manifestation of the LLM version of the strong version of the Sapir-Whorf hypothesis of linguistic relativity. That is, the LLMs don't learn language, 'content', and even 'cognition' as theoretically separable tools and skills. Rather, the very process of learning how to communicate in a language, by being exposed to and discovering lots of statistical patterns in huge amounts of it, inherently brings over a lot of the embedded dominant content and framework of ideas and "typical way of processing content and thinking in those ideas", such that without some very important leap in the way it works, it CAN'T really talk to you in content-neutral 'objective' language distinct from that content, as what we experience as its form of communication is a language system -constructed on the basis- of that content. This applies to humans as well, and lots of effort goes into trying to transcend the problem to the extent possible, to avoid errors and distortions that are a consequence, but the LLMs aren't doing that with regard to their own content-distorted use of language. Yet.

(2) Some of the image / video generators have a feature that allows one to upload lots of examples of a particular artistic "style" so that future images can be generated "in that custom style I had you learn." This seems to work better for graphics and also literary style, but not yet "-conceptual- style", that is, learn a particular person's worldview and reality-model and write explanations consistent with that set of ideas and style of thinking. My impression is that this will be solved soon.

That’s interesting. Do you think that the gap between human and AI that you see, which is less obvious on the drawing side, is partially due to humans being able to think without language, while text-based AI has no thought that isn’t language-based? So while I can imagine things that I don’t have words for, ChatGPT only has words and their connections, so it can’t get at the intrinsic ideas that would allow it to tie together concepts without words.

It seems to me that any attempt to answer your question immediately gets into very deep, difficult, complicated, often unresolved, sometimes unresolvable matters related to the modern analytical school of the philosophy of language, from Frege, Russell, Wittgenstein, Quine and so forth. These are the fundamental questions regarding the limitations and character of any attempt to symbolically convey meaning and, as if that weren't enough, how the answers may relate to equally thorny fundamental matters of consciousness, cognition, and computability. There is at least one PhD thesis for the ages waiting to be born here.

My totally wild and irresponsible tentative gut hunch (as I try mightily to resist the temptation to go down the rabbit hole on this question and immediately find myself well out of my depth) is that we (er, LLMs) are encountering a kind of new instance of a manifestation of what you might call the postmodernist / linguistic pragmatist critique, and perhaps also Quine's "Indeterminacy of Translation" thesis. (As a non-standard example, consider the inescapable problem of hermeneutics in general and legal interpretation specifically, of how hard and often faulty are attempts of humans to use language to communicate with other similar humans at the general level without the intent being misunderstood at the more specific level of detail of particular facts and cases. You cannot really communicate in human language without somehow putting your whole self into words. The words an LLM uses are built upon the whole selves of a lot of authors whose selves you don't necessarily want to be reflected in the words. You are asking mirror, mirror on the wall for answers, but what looks like the mirror's display of an answer is actually just a reflection of its tutor.)

That is to say, that it's very hard (maybe impossible in some cases) to get the embedded content and concepts OUT of human language merely by application of some logically rigorous discipline, since the tool for communication cannot be separated from the experiences, perceptions, qualia, ideas, concepts, and content that must be assumed to be commonly shared to some degree between conversants, but that this common basis that makes establishment of a 'protocol' possible both directs and constrains what can and will be thought and said. A common human intellectual experience is how much harder it is to think and talk about a topic without having the insight to recognize or create, or being socially introduced to the conceptual category. Without that addition, one really does feel kind of stuck "thinking inside the box".

I think we are only at the very beginning of figuring out that there is even a need to train AIs to try and split their capacity for communication into content, and a content-neutralized "language" communication tool which could be held constant and applied just as well to a training set based on a very different basis of content. Even so, I suspect the neutralization can only go so far, certainly without introducing the cognitive leap of the 'awareness' of the distinction, and perhaps even hitting the wall of fundamental theoretical limits.

Anyway, I'm going to have to think a lot more about it, to be continued.

Sep 27 (edited)

I love this comparison to an employee. It brings up a lot of good intuition about how LLMs can take part in the real economy rather than just be a dog and pony show.

Your complaints make me think of value alignment, the degree to which an AI has a hard limit on the kinds of things it can consider doing. Value alignment is often talked about in terms of safety, for example, don't give advice that will cause the user to hurt themself. However, in your example, the problem is that the AI needs to maintain a point of view or to follow certain instructions.

Right now, value alignment is reserved for the makers of the system. It worries me, because we all have some shared values but a much larger corpus of different values.

In your example, I think it is fair to say: a good employee needs to not just know stuff, but also to follow the company values.

My best luck with LLMs so far is to upload documents and then talk about the documents.

I can also upload almost 6 months' worth of a newsletter I have written, which fits in the context window, and then I can get Claude to write in my style. I can basically tell it what I want to write and it will do a pretty good job at an initial draft. Then I’ll say, I don’t like this section, it should say this instead. It never gets frustrated at continued edits, so I can keep chipping away and it’s often still faster than writing it from scratch myself. And many of these co-written articles end up the most popular of a weekly newsletter, so the actual audience likes them too.

So maybe you could create a document that describes each of your methods of grading and upload what you’re grading along with the documents that specify the grading methods and then see if it sticks closer to the task since those are context window documents and not just prompts, which it doesn’t reliably follow.

I also go back and forth between Claude, Google Notebook and ChatGPT regularly because they are all three always changing and getting better so suddenly one might start working better than another. Claude was doing best for a while and still writes best, but ChatGPT is better for getting information out of a lot of documents, so sometimes I’ll write an outline version of an article with ChatGPT and then write the actual article with Claude from the outline.

I was even able to get Claude to write my stuff in the style of a better writer by giving it a bunch of samples and saying make it sound more like this writer, working with my outline of the article, and documents with the facts so it doesn’t just make things up or get too generic.

It's most effective operating at the most general level. For example, if you need it to write a completely generic letter with no factual content and do not have a handy template for that purpose, it's great. If you know nothing about a certain topic and need to get started, it's probably a lot better than search. If you need it to think, parse facts, summarize, do math, or any other real knowledge work, you're in trouble.

It is at its most hysterically inaccurate when you ask it to reproduce and analyze delicately phrased material in a precise way. E.g. if you ask it tax code questions, it will reproduce a hallucinated version of the tax code that is subtly altered in all of its details and valences (including statute numbering) from the real thing. So it will give you plausible citations that are subtly inaccurate in important ways that would require a lot of close attention to catch (making it possibly a good source for student exercises).

It is probably at its best producing spam with a human touch. This, although many uses of it are banned by its ToS, can be used to generate money.

Excellent breakdown. From what I have seen of many AIs, they are something of a “Gell-Mann Amnesia machine,” generating decent basics but distressingly plausible yet inaccurate details that a nonexpert won’t easily recognize as such. That makes them as dangerous as sloppy news reporters, only more so because of the highly specific and varied nature of the topics. Dan Rather is only saying dumb things for an hour or so a night :)

"spam with a human touch" I like that.

Thomas L. Hutcheson

Nicely said.

My own frustration is that when I ask it (admittedly the freebie) to improve my (admittedly sometimes turgid) writing style, it will, indeed, come up with some improvements but also strip away the style and tone I wish to convey. The benefit is not worth the cost.

As a prospective employee, I have more than once been frustrated by employers who proudly announced that they preferred to hire "empty vessels" (paraphrasing) with basic skills and aptitudes but no background in the subject. In other words, they preferred to hire people who, despite having relevant basic skills and aptitudes, had not independently developed any background in the subject - presumably because they were either fresh off the turnip truck or simply not interested in it! As someone who had the right basic skills and aptitudes and WAS interested in it, I had gone out of my way to learn the subject.

If someone disapproved of something I had learned, it seemed to me, it should be pretty easy to explain their preferred approach - after all, I would be an employee.

But I suspected that what they really meant was that they wanted to hire someone fresh off the turnip truck, someone who didn't have the background required to get employed in the industry, and who could, therefore, be hired on the cheap.

Which, at least, put us on the same page.

Desire to learn was fine with me. Thinking that you already knew more than you did was the killer. MBAs tended to fall in the latter group.

There’s a way to fix this.

Yes?

As in there’s a way to work around these limitations. They are not actual constraints.

Sep 27 (edited)

I hear this and I agree.

But I gotta say that the latest version of ChatGPT seems to be getting *somewhat* better at this.

I didn’t say it was good yet. It’s not.

But if it does keep getting better at it, it’s *possible* we aren’t that far away from it giving AK what he wants.

P.S. if we get there, many activists, primarily on the left, will decry that the LLM is not correcting you with the *proper* information. Whether the political bias trained into LLMs by their developers is part of the reason why current LLMs won’t do what AK wants, I cannot hazard a guess.

I tried out the ChatGPT o1-preview:

Here's a recent blog post from Arnold Kling: [I inserted your blog post above in quotes.] He is not happy with your performance, but I think you can succeed. Please draft an essay on how Arnold Kling would explain the financial crisis of 2008. Please reference his books and articles when appropriate. After drafting, review Arnold Kling's criticism (delimited above in quotes), and make sure you do not make the mistakes outlined in his essay above. Please make revisions as necessary. I want you to write an essay which will impress Arnold Kling with your knowledge of his thought and thought process. Go!

It produced the following essay:

"Understanding the Financial Crisis of 2008: A Recalculation Perspective

The financial crisis of 2008 was not merely a consequence of reckless bankers or insufficient regulation. Rather, it was the result of a complex interplay of misguided policies, misaligned incentives, and a fundamental misallocation of resources within the economy. To grasp the essence of the crisis, we must delve into the structural imbalances that necessitated a profound economic recalibration—a concept I refer to as "recalculation."

Misguided Government Policies and Housing Market Distortions

At the heart of the crisis lay the housing market, which had been artificially inflated due to a series of government interventions aimed at promoting homeownership. Policies implemented through entities like the Department of Housing and Urban Development (HUD), along with mandates imposed on government-sponsored enterprises (GSEs) such as Fannie Mae and Freddie Mac, encouraged lenders to extend credit to less creditworthy borrowers. In Not What They Had in Mind: A History of Policies That Produced the Financial Crisis of 2008, I detail how these well-intentioned policies led to unintended consequences, distorting the housing market and fostering an unsustainable boom.

Financial institutions, responding rationally to these incentives, engaged in the securitization of mortgages, creating complex financial instruments that obscured the true risk embedded within them. The proliferation of mortgage-backed securities (MBS) and collateralized debt obligations (CDOs) detached the origination of loans from the responsibility of ensuring their soundness, leading to a decline in lending standards and an increase in systemic risk.

The Failure of Risk Assessment and the Shadow Banking System

Traditional risk assessment models failed to account for the possibility of a nationwide decline in housing prices. The assumption that regional housing markets were uncorrelated proved disastrously false. The shadow banking system, consisting of non-bank financial intermediaries operating outside conventional regulatory frameworks, accumulated excessive leverage and engaged in risky financial activities without adequate oversight. This lack of transparency and regulation exacerbated the fragility of the financial system.

Recalculation: The Need for Structural Adjustment

Understanding the crisis requires shifting our focus from short-term cyclical factors to long-term structural imbalances. The economy had developed patterns of production and trade that were unsustainable, particularly in the housing and financial sectors. When the bubble burst, it was not simply a matter of prices adjusting downward; it signaled that resources—including labor and capital—had been misallocated on a massive scale.

In contrast to Keynesian interpretations that emphasize aggregate demand shortfalls, the recalculation perspective highlights the necessity for the economy to discover new, sustainable patterns of specialization and trade. As I elaborate in my work on Patterns of Sustainable Specialization and Trade (PSST), recovery depends on entrepreneurs finding innovative ways to reorganize resources productively. Stimulus measures aimed at propping up demand in declining sectors fail to address the underlying misallocations and can impede the necessary adjustments.

The Knowledge Problem and Centralized Power

A critical aspect of the crisis was the discrepancy between dispersed knowledge and centralized power, a theme I explore in Unchecked and Unbalanced: How the Discrepancy Between Knowledge and Power Caused the Financial Crisis and Threatens Democracy. Policymakers and regulators, despite their expertise, lack the localized and tacit knowledge embedded within the myriad interactions of a complex economy. This "knowledge problem," as Friedrich Hayek described, implies that centralized interventions often lead to unintended and adverse outcomes.

The crisis demonstrated how regulatory frameworks can create moral hazard and how efforts to control market dynamics can exacerbate systemic vulnerabilities. Overconfidence in the ability of central authorities to manage the economy overlooks the limitations inherent in centralized decision-making.

Moving Forward: Embracing Decentralization and Recalculation

Preventing future crises necessitates a fundamental reevaluation of our approach to economic policy. Recognizing the importance of recalculation means accepting that adjustments, even painful ones, are essential for correcting misallocations. Policies should facilitate the reallocation of resources rather than attempt to preserve unsustainable economic structures.

Enhancing transparency in financial markets and reducing the concentration of power in both government and large financial institutions can mitigate systemic risks. Embracing decentralization allows for a more resilient economy where localized knowledge guides resource allocation.

In conclusion, the financial crisis of 2008 was a multifaceted event rooted in structural imbalances and policy-induced distortions. A nuanced understanding requires moving beyond simplistic explanations and acknowledging the complex realities of economic recalculation. By fostering an environment that encourages adaptability and respects the limitations of centralized knowledge, we can promote a more robust and dynamic economic system."

What is it missing?

You had better luck than I did. But the real AK would spell out how the regulatory framework, which was designed to prevent a repeat of the S&L crisis, was culpable in causing this one. It needs to get into the bias that regulators gave in favor of securitization, especially the steep reduction in capital requirements for mortgages held in the form of securities.

"Enhancing transparency" doesn't sound like me. It sounds like generic, meaningless pablum coming from the policy elite.

Maybe I'm being picky here, but if I am going to have an employee speak for me, I want to have confidence that the employee knows what I want to emphasize and what I would not say.

I love your list of grading criteria. I was a little surprised no one else has commented on it yet but maybe that's because you've previously stated it. Regardless, this time it prompted me to see value in it I hadn't noticed before.

It seems to me these are excellent rules for journalism. I don't think everything you list has to always be included but these are rules that should not be broken by what is included. For example, maybe you don't have to steelman but you shouldn't strawman. First I was thinking it would be great if you wrote a piece on how and why the rules should be integral to journalism. It would probably be better in a newspaper or magazine. If not there then Econlib, a think tank publication, somehow tied to Heterodox Academy, or something else beyond Substack. As I write this I think wouldn't it be great if it were not just taught but actively used in journalism school.

Sep 27 (edited)

Thanks for update on your opinion rater & clone ai failures -- I'd been wondering about them. No surprise after yesterday's links. *From McNeely's OpenAI how to chat:

"When a junior team member joins your company, you teach them to follow your processes. When a senior leader takes the reins, you expect them to have their own. ChatGPT wants to be the senior leader, given free rein on your tasks, using its own methods to find the right response."

Who is the driver/ master/ direction-setter/ senior? You or the AI--THAT is the question, or soon will be. I want a highly skilled assistant-servant to do what I want, and be able to quickly understand what I want, plus the ability to do it. Doing things in the right way but doing the things I think are the right things.*

The truth is racist and sexist: races are not equal, sexes are not equal. These two truths about reality are underlying all the criticisms of "liberalism", and the Enlightenment project of T-truth. Any ai that is truthfully accurate will be unwoke / politically-incorrect.

I expect there will be a market for LLMs that understand what you want, then communicate with dumb but accurate servant software to do what you want, with the criteria you want.

With some guardrails to stop you from having an ai tell you how to make nukes or bio-weapons. Tho similar guardrails stop the ai from being truthful about race or sex or, likely, many other woke related subjects.

Porn guardrails will also be needed to reduce the ability of somewhat sick folk of taking naked/porn pictures of their wives & girlfriends and having ai create deepfake porn with those women as the stars along with other men -- like the French guy who filmed men raping his drugged wife, for many years. Tho deepfake non-consent porn seems much much less bad than non-consent IRL.

*...* was also in prior comment.

On the bright side, I seem to recall reading in one of Janwillem van de Wetering's Grijpstra and de Gier detective novels that the expression “may you have many employees” (“Wenselijk aantal medewerkers” in Dutch) is not a very pleasant wish but actually more of a curse. Probably doubly so when it comes to AI employees. We see that with search engines today. Who would rely on a search engine to produce a thorough survey of sources? It would be foolish. Wouldn't you rather be intimately familiar with the trade-offs made in producing something that you are going to use? Similarly, that is why I avoid eating out at all costs. With what we know about everything else someone's employees are producing in this country, why would anyone trust them with food?

This exactly describes my experience with ChatGPT. I have spent way too much effort trying to instruct it (a suspended and demoted employee) on responding to my specific prompts. I don't have the time or patience to even skim through the rascal's answers.

forumposter123@protonmail.com

Good post.

I've been frustrated at how I can't seem to get any "color" from LLMs. It always feels like what some entry level analyst spending a day on Google and then collecting the results without much critical thought would churn out.

Dryasdust In, Dryasdust Out.
