Working on the AI Tutor
Showing how I converse with Claude.
The latest and greatest version, which incorporates changes from the conversation below, is at https://arnoldkling.com/apps/pst.skill. To run it, copy the downloaded folder and paste it into a Claude conversation.
Below is a chat I had with Claude when I wanted it to change the AI tutor so that it would work more crisply. Late in the conversation, Claude reveals one of the challenges: “the tutor is likely to err on the side of pursuing too long rather than moving on too quickly, for a structural reason: extending the conversation is the AI’s default behavior.”
What you can see below is how I use a back-and-forth approach to suggest changes in its behavior and get its reaction. This conversation starts with me talking.
I am trying to come up with ways to speed up the process, so that the student does not lose patience or feel badgered. For example, although we want students to be able to put concepts into their own words, I am not sure that we need to make that an explicit step. Maybe if we ask the student to apply a concept, we will get there less painfully. Applying a concept could mean using the concept to explain some sort of behavior. For example, a question about chimp politics could be: what are some examples of human political behavior that are analogous to chimp grooming? A second generic question is how an author provides evidence for a concept. For example, a question about Benenson could be: how do her observations about gender differences on playgrounds provide evidence for her Warriors and Worriers model? What do you think of these sorts of questions? Are they useful for Socratic teaching, or are they too much like quiz questions?
Claude responded:
These are very good questions, and they’re meaningfully better than asking a student to “explain X in your own words” — for a specific reason worth being clear about.
Why “explain in your own words” feels like badgering
It’s a process demand without a destination. The student has to perform comprehension as a separate step, knowing the tutor is checking whether they’ve done it. There’s no problem they’re solving — just a comprehension audit. That’s why it feels like school in the worst sense: the question exists to test, not to do work.
Why your two question types avoid this
Both of yours give the student something to do with the concept, not just something to say about it. The work of applying or finding evidence forces comprehension as a byproduct, the way a good homework problem in a math class makes you use a theorem without ever asking you to recite it.
Application (“what human behaviors are analogous to chimp grooming?”) makes the student instantiate the concept against the world. You can’t generate a plausible analogy without having grasped what’s essential about the source concept. And you can’t fake it — vague answers reveal themselves immediately.
Evidence-tracing (“how do Benenson’s playground observations support her Warriors and Worriers model?”) makes the student reconstruct the inferential chain from data to claim. This is harder than restating the claim, but it’s the actually-important intellectual skill — and it teaches students to read for argument structure rather than for theses.
Are they too quiz-like?
I don’t think so, but there’s a real distinction worth drawing. A quiz question has a single right answer the teacher already knows. A Socratic question opens onto something the student and teacher might genuinely think about together. Your two question types can go either way depending on how they’re handled.
The application question becomes Socratic if the tutor engages with whatever the student offers — “interesting, why grooming specifically and not, say, gift-giving?” — rather than checking it against a mental answer key. The evidence question becomes Socratic if the tutor is willing to follow the student into “well, does that evidence really support the claim, or could it support something narrower?” — rather than confirming the textbook account.
So the question type is good. The Socratic character comes from what the tutor does next.
One refinement worth considering
You might pair them: ask an application question first (low-friction, generates engagement), and use the student’s own example as the material for the evidence question. “You said political donors are like chimp grooming partners — what would count as evidence that the analogy holds? What would count against it?”
That sequence — apply, then test — keeps the student doing intellectual work the whole time without ever being asked to perform comprehension as such.
My recommendation: yes, build the skill around these two question types as the default mode of the comprehension phase, and possibly rename that phase from “Comprehension” to something like “Application” to match what’s actually happening. Want me to draft the change to the skill, or do you have more flow modifications you want to talk through first?
You said:
The other thing to talk through is making sure that the interaction does not go on too long. For each concept, try to get the discussion down to at most 2 or 3 questions. When the first student response is not spot on, quickly sketch how to make it better, and then give the student just one or two more chances to sharpen it. After that, spell out in a couple of short paragraphs what you are looking for, and then move on. What do you think are the pros and cons of that approach?
Claude responded:
This is a sound instinct, and the underlying judgment is right: the cost of dragging a student through five rounds of refinement is higher than the cost of just showing them the target. Let me walk through what works and what to watch.
What works about the cap
A hard ceiling of two or three rounds protects the student from a failure mode that’s specific to AI tutors and worse than human-tutor analogues: the model never gets tired, never feels social pressure to move on, and can patiently re-prompt forever. A patient human tutor reads the room; an AI tutor without an explicit limit will keep pulling on the thread past the point of usefulness. Writing the cap into the skill is the fix.
The “quickly sketch how to make it better” move is also right. It changes the second attempt from “guess again” to “here’s the shape of a better answer — now try.” That’s much less frustrating because the student knows what they’re aiming for.
And the “spell it out and move on” exit is genuinely important. Without it, the conversation has nowhere to go when the student is stuck — the tutor either keeps badgering or quietly gives up. An explicit, dignified exit (“here’s what I was after — let’s keep going”) preserves momentum.
The main con to watch
There’s a real risk: with only two or three rounds, the tutor will sometimes spell out the answer for a student who would have gotten there on their fourth or fifth attempt. The deepest learning happens precisely when a student wrestles with something and then has the insight themselves. Capping rounds means trading some of that depth for pace.
I think the trade is worth it for this skill, because the alternative — students dropping out of sessions out of frustration — costs more than the marginal depth gained from extended Socratic pursuit. But it’s worth being honest that this is a trade, not a free improvement.
A subtler con: the cap can produce false signals about student understanding
If the tutor moves on after three rounds, the student may leave thinking they “got” a concept they actually didn’t. The cap is good for pace; it’s less good for diagnosis. One way to mitigate this: when the tutor invokes the exit and spells out the answer, it should also note (“we’ll come back to this”) and then actually come back to it later in the session, perhaps in the integration phase, to see whether the explanation took. That preserves the pace move without losing the diagnostic signal.
One concrete refinement to consider
Make the cap content-sensitive rather than uniform. Some concepts are central enough to the lesson that the tutor should be willing to go a round or two longer; others are minor enough that one attempt and a quick sketch is plenty. Rather than “two or three questions per concept always,” something like: “at most three rounds per concept; for minor concepts, often just one.” This avoids the tutor mechanically grinding through three rounds on something that doesn’t merit it.
Combined with the application-question shift
The two changes reinforce each other in a useful way. Application questions tend to produce partial answers more often than “explain in your own words” does — the student offers an analogy or example that’s directionally right but imprecise. That’s exactly the situation where “quickly sketch how to sharpen it, give one or two more tries, then spell out the target and move on” fits perfectly. A vague “in your own words” restatement is harder to refine in one round; a specific bad analogy is easy to sharpen.
So my view: do both changes together, and the flow will be noticeably better than either alone. Want me to draft the SKILL.md changes now, or are there more flow modifications to discuss first?
You said:
How well do you think that the tutor could do in making a judgment as to when the student would benefit from pursuing beyond some specific number of questions?
Claude responded:
Honestly? Mixed. I can tell you where it’ll do well, where it’ll do poorly, and where the failure mode is most worth guarding against.
Where the tutor will do well
There are some signals that are reasonably legible from text alone, and the tutor should be able to read them:
Trajectory. If the student’s first answer is wrong but their second is meaningfully closer, they’re learning in real time and pushing further is justified. If the second answer is no closer than the first — or is closer in wording but not in substance — they’re guessing or echoing, and further rounds won’t help.
Engagement signals. A student who responds “hmm, but isn’t it more like X?” is leaning in. A student who responds with a shorter, flatter answer than their previous one is disengaging. The tutor can pick up on length, specificity, and whether the student is generating new material or just rearranging.
Whether the student is asking back. “Is this what you mean?” is a different signal from “Here’s my answer.” The first invites another round; the second resists one.
Where the tutor will do poorly
The hard cases are exactly the ones that matter most:
Distinguishing productive struggle from confusion. These look very similar from outside. A student who’s silent for a moment and then offers a partial answer might be on the verge of insight, or might be guessing. Human tutors use micro-cues — tone, hesitation, eye contact — that aren’t available here.
Judging whether the student is the kind of person who wants to be pushed. Some students are energized by being pressed further; others are humiliated by it. Without long-running context about a specific student, the tutor has to guess. It will sometimes guess wrong.
Recognizing when a wrong answer is wrong in an interesting way. The single richest pedagogical moment is when a student’s wrong answer reveals a deep but fixable misconception. Pursuing that is high-value. But “wrong in an interesting way” requires real subject-matter judgment, and the tutor will sometimes miss it — treating a productive wrong answer as just a wrong answer.
The failure mode worth guarding against
The asymmetry is that the tutor is likely to err on the side of pursuing too long rather than moving on too quickly, for a structural reason: extending the conversation is the AI’s default behavior. It’s trained to be responsive, and the path of least resistance when a student gives a partial answer is to ask another question. Cutting things off requires actively overriding that pull.
So the practical fix is to bias the instruction the other way: rather than “use judgment about when to keep going,” something more like “default to moving on after the cap unless there is a clear, specific reason to extend.” Specific reasons might be: the student explicitly asked for another shot, or the student’s second answer was meaningfully closer than the first and a third round seems likely to land. Otherwise, exit.
A small mechanical aid
One thing that would help: write into the skill that the tutor should consider, before each follow-up question, whether the previous round produced movement. Not just “did the student answer,” but “is this answer closer to the target than the last one?” If no, the value of another round is low and the tutor should exit. That gives the model a concrete check rather than asking it to make a global judgment about student readiness.
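To make that concrete, here is a rough sketch of how the pacing rules might read in SKILL.md; the heading and exact wording are placeholders to be settled in the draft:

```markdown
## Pacing (placeholder heading; wording to be finalized)

- Ask at most three rounds of questions per concept; for minor concepts, often just one.
- If the first answer misses, briefly sketch the shape of a better answer, then give the student one or two more attempts.
- Before each follow-up, check for movement: is this answer closer to the target than the last one? If not, exit.
- Default to moving on at the cap. Extend only for a clear, specific reason, such as the student asking for another try, or the last answer landing meaningfully closer than the one before it.
- On exit, spell out in a couple of short paragraphs what you were looking for, note that the concept will come up again later in the session, and move on.
```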
Bottom line
The tutor will make reasonable judgments most of the time, poor judgments some of the time, and the poor judgments will skew toward over-pursuit. Setting the default to “exit at the cap unless there’s a specific reason to extend” plus the movement-check between rounds will get most of the benefit without requiring the model to be wiser than it actually is.
Want to keep talking through flow changes, or are you ready for me to draft the skill update?
You said:
That’s good. Go ahead and draft the update.

