All the responses seem to include a lot of "add this and this." The grader seems to assume that the writer has an unlimited number of words available. But many essays have an effective word limit, and op-eds generally have an explicit one. It would be interesting to see what the grader says when it is told the writer has a limit of 800 words, the usual op-ed limit, or perhaps 2,000 words, about 8 double-spaced pages, beyond which most people won't be willing to read.
I have a GPT for feedback on lecture designs and have built in a timing element, so the GPT will prompt you to think about timings and will suggest areas that can be shortened or removed. It does a decent job in my use so far, especially when combined with clarity about what you want the main points to be.
I also complain about the “add more to be better” advice. Bullet points alone are not quite enough, but they are closer to optimal than most of Scott Alexander or Freddie DeBoer or Rod Dreher, all of whom often express thoughtful ideas, but usually with far too many words.
Arnold’s conciseness is my current best practice for reading, though I can’t match it in my own writing or comments.
I would like Arnold to have some 100%, or 10/10 essays as ideals, including some length consideration. A shorter 800 words on one side is usually better than 2000 words covering both sides.
Has the grader output any grades that weren't a B+ or so? I haven't been taking notes on the grade distribution, but I seem to recall a lot of B+ turning up, or 82-85's. Maybe that is just the normal sort of grade actually published work gets, but it strikes me as unusually consistent. Makes me wonder if it would ever assign an A+ or a D, and what that would take.
I can definitely get it to assign lower grades if I give it an essay that is bad by my criteria. Feel free to suggest one. Also, if you can find an essay that is exceptionally good by my criteria (really steel-manning the opposing point of view), suggest it and I will see if the grader gives it an A.
I have now found an essay that it graded at 4/10 and also an essay it gave an A (a book review that I wrote). I'll post on these in the next several days.
Oh, good! I was struggling a little to think of a good A essay I had read lately and was going to dig through Caplan, Zvi, or Scott's backlog. For a C essay I was thinking Krugman https://www.nytimes.com/2023/12/12/opinion/argentina-dollar-milei.html . I can't read it, but Krugman op-eds are pretty reliably poor while not being overtly so to the average NYT reader (apparently).
I have a technical question. Are you just using one ChatGPT window to grade all these editorials, or are you starting from scratch with a new window and the same prompt each time?
What I’ve found is that the context window of ChatGPT holds only a limited number of tokens, so if you keep working in the same window, it gets worse over time as it forgets what you previously prompted it with.
When I use it, I start a new window each time and begin with my latest and best revised version of my prompt, and I get more consistent results.
But even with this, they are constantly changing ChatGPT itself and tuning it, so some prompts become no longer very effective and you have to rewrite or significantly revise them anyway.
As you mentioned before, even prompting it twice in a row with the exact same prompt and essay, you will not get the same answer.
So I am curious how you are specifically using it for those reasons.
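The fresh-window approach described above can be sketched in code. This is a hypothetical illustration (the prompt wording and function names are invented, not from the post): each essay gets its own two-message conversation, so earlier essays can never crowd the context window.

```python
# Sketch of grading each essay in a fresh context rather than one
# long-running chat. GRADER_PROMPT and build_request are illustrative
# names; the actual grader prompt is not shown in the post.

GRADER_PROMPT = (
    "You are an essay grader. Evaluate the essay for fairness to opposing "
    "views, clarity, and concision, then assign a letter grade."
)

def build_request(essay_text):
    """Return a chat-completion payload that starts from scratch:
    only the fixed grading prompt plus this one essay, so nothing
    from earlier gradings leaks into the context."""
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system", "content": GRADER_PROMPT},
            {"role": "user", "content": essay_text},
        ],
    }

# Each call yields an independent two-message conversation.
req_a = build_request("Essay A ...")
req_b = build_request("Essay B ...")
```

Sending each payload as its own API call (rather than appending to one thread) is what keeps the results consistent across essays.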
I am using the "create a GPT" function, which I assume is more like keeping the same prompt.
Would you mind sharing some more about your implementation for using ChatGPT? Meaning specifically what websites, companies, apps, features, background articles etc, or perhaps even some details on your workflow you have used to create the functionality illustrated in this series of articles. You've made remarkable progress with this technology in a relatively short amount of time. I'd like to get up to speed on this and any advice would be appreciated. Thanks in advance.
From experience asking GPT to grade my answers to hypothetical exam questions, I don’t place too much stock in the letter/percentage. The feedback is usually quite helpful for improvement, but it’s hard to budge the grade very far from B+ territory in either direction.
Change the prompt to demand that it provide the grade at the end of the review rather than the beginning. That will allow it to think over many tokens instead of using one token to output a grade. It'll then use its review to determine the grade. This is known as the chain-of-thought prompting technique.
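The review-first suggestion above might look something like this in practice. The prompt wording is a hypothetical example, not the author's actual grader prompt; putting the grade on a fixed final line also makes it easy to parse out.

```python
# Illustrative review-first prompt: the model must write its full
# analysis before committing to a grade on the last line.
REVIEW_FIRST_PROMPT = (
    "Review the essay below in detail: summarize its thesis, assess how "
    "well it steel-mans opposing views, and note strengths and weaknesses. "
    "Only after the full review, on the final line, output the grade in "
    "the form 'Grade: <letter>'."
)

def extract_grade(response_text):
    """Pull the grade from the last line of a review-first response;
    return None if the model didn't follow the format."""
    last_line = response_text.strip().splitlines()[-1]
    if last_line.startswith("Grade:"):
        return last_line.split(":", 1)[1].strip()
    return None
```

The fixed `Grade:` marker is an invented convention; any unambiguous final-line format would serve the same purpose.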
I’m now thinking the FIT auto-grader should give points up to 100 for each of the various categories, with an additional length category where less length is better.
Then the essay/post gets a total score.
Sort of like gymnastics scoring.
An open Monday night talk about game rules for aggregating team points would be fun.
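The gymnastics-style scoring idea above could be sketched as follows. The point caps, the 800-word target, and the penalty rate are all invented for illustration, not part of the actual proposal.

```python
def fit_score(category_points, word_count, target_words=800):
    """Sum the per-category scores (each 0-100), then add a length
    category where shorter essays score higher. The penalty schedule
    (1 point lost per 20 words over target, floored at 0) is a
    hypothetical choice."""
    base = sum(category_points)
    over = max(0, word_count - target_words)
    length_points = max(0, 100 - over // 20)
    return base + length_points

# An 800-word essay keeps the full length bonus; a 1,000-word essay
# on the same categories loses 10 length points.
```

Other penalty curves (or a hard cutoff at 2,000 words) would fit the same framework.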
Whenever "nuanced" enters the dialogue, I know it is going off-track.
I think the grader was overly generous to the first essay (the Honig piece on ethnic studies in California)—and I write as someone who probably agrees fairly strongly with Honig's position. But the essay didn't address arguments for the 'liberationist' position at all: it imputed sinister motives to the position's supporters and pointed out the downsides, but never brought up any reasons why an intelligent and well-meaning person might favor it.
Even in an editorial, I want to see the writer grapple with the best arguments that the other side has to offer. Honig didn't do that at all, and I'd give his essay a low C at best.
I think "Aaron Renn" should be "Larry Arnn", but the rest of the article was behind a paywall, so I'm not sure if it's a quote from Renn in the hidden text.