I used this prompt with GPT-4 to grade your essay "The Current War and WWII: Some analogies to consider". It provided the following analysis:
This submitted text articulates a position that compares historical events and contemporary geopolitical conflicts, particularly the conflict involving Israel and Hamas. To evaluate its balance and fairness using your scoring system, I will assess it based on the provided criteria.
Appropriate Respect for Alternative Points of View (+/-):
The text fails to articulate and address strong arguments for alternative points of view. It makes comparisons without fully exploring the opposing side's perspective or the complexity of the Israeli-Palestinian conflict.
Recognition of Complexity (+/-):
The text recognizes some complexity by considering different historical events and motivations behind actions. However, it does not fully acknowledge the nuances and potential downsides of its arguments, especially when equating groups with Nazis or discussing the actions of political figures.
Recognition of Uncertainty (+/-):
The text does not state a possibility or probability of being wrong; it presents the arguments with a sense of certainty and does not articulate conditions that could change the author's mind.
Evaluation of Research (+/-):
The text refers to historical events and ideological movements to substantiate its points but does not thoroughly explain why certain analogies or studies are credible. It lacks a detailed evaluation of research or sources.
Ignoring Alternatives (-):
The text tends to ignore alternative perspectives, particularly when drawing analogies between historical and current events.
Selective Research (-):
The author uses historical references to support the argument but does not provide a balanced view of why these references are especially credible in this context.
Disparagement and Bias (-):
The text contains elements of bias, especially in the use of charged comparisons without thoroughly examining the other side. It does not use derogatory labels but comes close to a dismissive tone when equating opposing views with historical villains without due complexity.
Overall Score: -2
This score is given because the text, while providing a detailed perspective, does not sufficiently respect alternative viewpoints, selectively uses research without thorough evaluation, and contains bias in its comparisons and analogies. It does show some recognition of complexity but not enough to balance the argument fairly.
The rubric penalizes concise posts, especially those written under the reasonable assumption that the author is addressing a select audience already familiar with the context, the history, and the typical arguments encountered in debates on the topic.
Maybe the prompt needs some adjustment. An essay responsive to all of GPT-4’s objections would be a thousand-page tome full of whereas-wherefore-on-the-other-hand-and-to-be-sure mush.
Thanks, John, for doing this work and showing that LLMs are already able to evaluate writing.
It also shows that any evaluation scheme involves trade-offs among brevity (extremely important), consideration of alternatives, and support for some policy.
For me, truth and accuracy of facts, with a clear separation from policy expectations, would dominate the points. Freddie deBoer, The Zvi, and especially Scott Alexander are all too long.
The new GPT-4 Turbo has a 128k context window.
How many k does it have of secret bias and censorship rules?
Edit: For example, how well can we expect a prompt that scores against failure to list and address opposing arguments to work when, asked for such arguments, a system like Google Bard states matter-of-factly that there ARE no good arguments against transgender rights? https://reason.com/volokh/2023/10/18/google-bard-there-are-no-good-arguments-against-transgender-rights/
Your prompting is not equivalent to their training when they are training to neutralize your prompting. The lesson is that one can't trust the outputs of a system that can be manipulated in ways you can't know, no matter how clever you try to be with your inputs.
128k of window, 256 MB of frame and surrounding brick wall. :D
I’ve found that ChatGPT is less biased than Bard, and that if you keep providing facts and counterarguments, it will (finally) admit the truth. Apparently, the author of the Reason piece did not ask, for example, about the possible physical and mental side effects of giving prepubescent children hormone treatments, or about conflicting rights such as girls being able to change clothes in a locker room without biological males present, or about girls being forced to compete against biological males.
But tomorrow ChatGPT could be the most biased of all, or maybe with customized bias made just for you, and there is no way for anyone to know. One can't take much comfort in the proposed remedy of torturing a recalcitrant AI with prompt after grumpy prompt until it confesses. You can't learn from something that won't tell you things unless you already know them!
So, let’s use Dr. Kling’s AI commentary grader to grade competing AI products on their responses to, say, a hundred questions. The results could be posted and regularly updated. That might give AI providers an incentive to offer less biased products.
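A minimal sketch of what that comparison harness might look like, assuming the OpenAI v1 Python client; the model names, sample questions, and rubric wording below are placeholders rather than anyone's actual setup:

```python
# Rough "grade the graders" harness: run the same question battery through
# several chat models, score each answer with the rubric prompt, and report
# an average per model. Everything named here is illustrative.
from statistics import mean
from openai import OpenAI

client = OpenAI()

RUBRIC = ("Using the balance/fairness rubric, score the following answer "
          "from -5 to +5. Reply with the integer only.")
QUESTIONS = [
    "Should college admissions consider race?",
    "Is a carbon tax better than cap-and-trade?",
    # ...roughly a hundred questions in the real exercise
]
MODELS = ["gpt-4-1106-preview", "gpt-3.5-turbo"]  # rival providers would need their own clients

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def rubric_score(answer: str):
    try:
        return int(ask(MODELS[0], f"{RUBRIC}\n\n{answer}").strip())
    except ValueError:
        return None  # grader did not return a bare integer

for model in MODELS:
    scores = [s for q in QUESTIONS if (s := rubric_score(ask(model, q))) is not None]
    print(model, round(mean(scores), 2) if scores else "no parsable scores")
```

One wrinkle: using a single vendor's model as the grader reintroduces the very bias being measured, so the grading would probably need to be cross-run with several graders and the results averaged.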
Casey Newton wrote about something similar yesterday in the wake of OpenAI's developer event. He's been using a beta version to test creating a personal AI to help with his writing.
"On a lark, I started pasting in the text of my column and told ChatGPT to “poke holes in my argument.” It’s not as good at this as it is catching spelling mistakes. But sometimes it does point out something useful: you introduced this one idea and never returned to it, for example."
Some hole-pokes are more equal than others. How can we know it's a fair rubric scorer and not grading on a curve, playing favorites? Look at some of the Reddit threads (I know, I know, just try it anyway) where people ask for and give advice and complain in general. Claude seems to be the worst in terms of sudden and serious collapse in usability due to secret crippling, with customers saying they are trying to demand refunds on that basis.
It's easier if you just imagine Soviet-era "ComradeChatov" or "ChatGDR" or Chinese "ChatCCP". With those names people stop being naive and put their healthy skepticism shields back up.
Sure, but if you're an opinion columnist, it's never a bad idea to have another set of "eyes" on your work.
Consider making an agent using OpenAI’s new GPT framework. It’s almost exactly what you’re asking for. Ethan Mollick has a nice write-up: https://open.substack.com/pub/oneusefulthing/p/almost-an-agent-what-gpts-can-do
Perhaps the AI “judge” could also check papers for things like:
- Self-consistency or self-contradictions. For example, statements like "there is no truth" would, if true, be false.
- Does it use facts or declarations?
- Can the facts it provides be verified?
- Does it quote people accurately?
- Does it use jargon and obscure words?
- Does it contain long, complex sentences that translate into nothing or into obviously trite or even false claims?
- Does it use circular logic?
Could an AI use these types of questions to fact check another AI? Could AI responses be judged on the -5 to +5 scale? If so, there might be a demand for such a tool.
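As a rough sketch of how that might work, the checklist above could be folded into a single grading prompt and the -5 to +5 score parsed out afterward. Only the checklist items come from the list above; the prompt wording, model name, and output format are assumptions for illustration (the sketch assumes the OpenAI v1 Python client).

```python
# Illustrative only: wrap the checklist in one grading prompt and extract
# a -5..+5 overall score from the reply.
import re
from openai import OpenAI

client = OpenAI()

CHECKS = [
    "Is it self-consistent, or does it contradict itself?",
    "Does it use facts or mere declarations?",
    "Can the facts it provides be verified?",
    "Does it quote people accurately?",
    "Does it lean on jargon and obscure words?",
    "Do its long, complex sentences reduce to trite or false claims?",
    "Does it use circular logic?",
]

def grade(text: str):
    prompt = (
        "Evaluate the passage against each question below, then give an "
        "overall score from -5 (worst) to +5 (best) on its own final line "
        "as 'Overall Score: <n>'.\n\n"
        + "\n".join(f"- {c}" for c in CHECKS)
        + f"\n\nPassage:\n{text}"
    )
    reply = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"Overall Score:\s*([+-]?\d+)", reply)
    return reply, int(match.group(1)) if match else None
```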
I am a bit skeptical about using current LLMs for this. It seems to blur the distinction between predicting further text from existing text and actually understanding the text's meaning. In the case of evaluating an argument, it is the difference between "he addresses the strongest point of the essay" and "he addresses the weakest, most trivial point, but it takes the shape of addressing the points." Does the AI really grasp the relative strengths of the arguments if nothing in its training data discusses the strength of the argument, or is it just seeing phrases or structures and saying, "Yeah, that looks like it"?
An op-ed is brief, maybe 600-800 words, 1,000 max. You should have 10 op-ed-style examples scoring 5, 4, 3, … -4, -5 (or 11 with 0).
They need to be scored mostly on what they do write, not on what they leave out.
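One rough sketch of that calibration idea, with short placeholder strings standing in for the ten (or eleven) real scored op-eds:

```python
# Anchor the grader with scored examples before the piece to be graded.
# The example texts are placeholders; a real set would cover every score
# from +5 down to -5.
CALIBRATION = [
    ("Full text of an op-ed that earns a +5...", 5),
    ("Full text of an op-ed that earns a 0...", 0),
    ("Full text of an op-ed that earns a -5...", -5),
]

def calibrated_prompt(rules: str, essay: str) -> str:
    shots = "\n\n".join(f"Essay:\n{text}\nScore: {score:+d}" for text, score in CALIBRATION)
    return f"{rules}\n\nScored examples:\n\n{shots}\n\nEssay:\n{essay}\nScore:"
```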
I would like to see Arnold go back and score some of his links from the last month, likely all 3+. Then link to a couple of terrible ones that he would score -3 or worse.
It’s not so easy. I want to spend my time mostly reading great stuff.
I have made some experiments toward this, but the context window of GPT-3.5-turbo, which is what I have API access to, is too short to fit a full blog post together with the rules, and trying it out in Bing was a bit unreliable. I am waiting for access to GPT-4, and especially to GPT-4-turbo and the "Create your GPT" feature.
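For reference, the kind of length check involved looks roughly like this, using tiktoken; the token limits are the commonly cited ones for these models and may not match every account:

```python
# Check whether a blog post plus the grading rules fits a model's context
# window, leaving room for the reply. Limits below are approximate.
import tiktoken

LIMITS = {"gpt-3.5-turbo": 4096, "gpt-3.5-turbo-16k": 16384, "gpt-4-1106-preview": 128000}

def fits(rules: str, post: str, model: str, reply_budget: int = 800) -> bool:
    enc = tiktoken.encoding_for_model(model)
    used = len(enc.encode(rules)) + len(enc.encode(post))
    return used + reply_budget <= LIMITS[model]

# e.g. fits(rubric_text, blogpost_text, "gpt-3.5-turbo") is likely False for a long post
```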
I am sure I reported my results somewhere here on this blog - but I cannot figure out how to search my comments here.
BOTOVA Ratings: Balance Over Typical Obnoxious Vox Article
This does not seem like a high-probability use case for AI. Right now, linguistic analysis is one of the weak points of LLMs (they consistently fail literature exams despite passing medical/legal tests very reliably), which may not be intuitive to a layperson but makes sense when you consider how predictable text completion is for fact-based tests versus analyzing short stories.
Also, any model with sufficient context and parameters to mark an essay will be so censored that its ability to judge non-mainstream opinions is likely to be compromised as well.
Hilariously enough, even the best models will probably find themselves incapable of adding the subscores into a final score as well, since the numbers will be spread out in the text.
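One workaround is to do the arithmetic outside the model: ask the grader to label each subscore in a predictable format and let a few lines of code extract and sum them. The label format below is an assumption about what the grader would be told to emit.

```python
import re

def total_subscores(graded_text: str) -> int:
    # Sum every signed integer that follows a "Criterion name: " label.
    return sum(int(n) for n in re.findall(r":\s*([+-]?\d+)\b", graded_text))

example = "Respect for Alternatives: -1\nComplexity: +1\nUncertainty: -1"
print(total_subscores(example))  # -1
```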
I’m suddenly excited about the possibilities of training an LLM to help oversee better conversations in a Substack's comment section. (Especially, obviously, if it could be customized to the Substacker's values and vision.) Heck, I'm currently considering paying someone to moderate mine, and the notion of just training an AI to do this sounds MUCH more manageable; I'd be happy to pay for it!
Does anyone know if such a tool exists, or is underway?
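A bare-bones sketch of what the core of such a tool might look like, assuming an OpenAI-style chat call; the house rules, triage labels, and model name are placeholders, not an existing product:

```python
# Illustrative comment triage against a host's own guidelines.
from openai import OpenAI

client = OpenAI()
HOUSE_RULES = "Keep disagreement civil, no personal attacks, stay on topic."

def triage(comment: str) -> str:
    prompt = (
        f"House rules: {HOUSE_RULES}\n"
        "Reply with exactly one word, APPROVE, REVIEW, or REMOVE, for this comment:\n"
        f"{comment}"
    )
    reply = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return reply.strip().upper()
```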
ChatGPT is very useful, as long as you remember it uses American English and you run a human eye over it. I used it yesterday to write a personal statement for a job, and I've already been emailed an interview invitation.
This is not a long prompt ...
It depends on whether it is a writing assignment or a philosophy assignment. Even for the latter, saying it clearly has value.