The Hungry Judge Effect in the RL Annotation Stage
Temporal-Dimensional Instability Injected by the RLHF Training Paradigm
— On the Training-Paradigm Origin of AI Skill Output Drift and Its Spatiotemporal Dual-Axis Duality with the Cultural Attributes Paper
This paper is the fifth in the LEECHO paper series, dedicated to arguing for the temporal-dimensional instability injected by the RLHF training paradigm. Together with the spatial-dimensional instability argued in the fourth paper, Cultural Attributes Injected into LLM Models, it constitutes a spatiotemporal dual-axis duality. The Cultural Attributes paper demonstrated that annotators’ cultural attributes are permanently inscribed into the reward function; this paper further argues that annotators’ physiological and psychological state fluctuations are likewise inscribed into the reward function. The intersection of the two axes constitutes a complete description of the RLHF injection space: what is encoded in the model weights is not “human preference,” but rather a “preference snapshot of a specific cultural group at a specific moment in a specific state.” The paper uses principal-agent theory (Holmström 1979) as its social-science foundation, Casper et al. 2023’s framework of fundamental RLHF limitations and Gaikwad 2025’s KL-tilting instability bound as its mathematical pillars, and Veselovsky et al. 2023’s empirical finding that 33–46% of crowdsourced annotators use LLMs to complete their tasks as its sharpest evidence. It builds a complete causal chain: annotator state fluctuation → reward signal noise → weight-internalized contradictory criteria → inference-time sampling releasing the instability. The paper argues that peripheral schemes currently in vogue across the industry — Harness Engineering, Skill optimization, Context Engineering — all act on the behavioral-constraint layer and cannot reach the probability-field drift encoded inside the weights. It connects to the cognitive-ecology closed-loop framework of the Cognitive Ecology of Linguistic Symbols paper. Final thesis: Parameter freezing ≠ system stability.
From Spatial to Temporal: Positioning This Paper in the LEECHO System
Published on April 5, 2026, Cultural Attributes Injected into LLM Models argued a core thesis: the cultural backgrounds of RLHF annotators are systematically inscribed into the reward function, forming irreversible cultural defaults. That paper demonstrated that Claude (English-dominant) and DeepSeek (Chinese-dominant) are not “giving the same answer in different languages,” but rather “processing the same question with different cognitive architectures.” The injection of cultural attributes is spatial in dimension — it reflects the preference-distribution differences between different annotator populations at the same point in time.
This paper extends the same logic to argue a dual thesis: the physiological and psychological state fluctuations of RLHF annotators are likewise systematically inscribed into the reward function, producing temporal-dimensional injection instability. The same group of annotators, at different points in time (tired/alert, hungry/sated, focused/distracted, emotionally up or down), inscribes drifting preferences into the reward function.
The Dual-Axis Model of the RLHF Injection Space
X-axis (spatial dimension, Cultural Attributes paper): the cultural group to which the annotator belongs determines their default cognitive architecture. This injection is stable, identifiable, and irreversible.
Y-axis (temporal dimension, this paper): the annotator’s physiological and psychological state at the moment of annotation determines their judgment thresholds. This injection is drifting, unobservable, and masked by averaging.
Meaning of the two-axis intersection: what is encoded in the RLHF weights is not an abstract “human preference,” but a “preference snapshot of a specific cultural group at a specific moment in a specific state.” That snapshot is permanently fixed into billions of parameters through PPO optimization.
This duality further plugs into the cognitive-dimensional-reduction closed loop framework of the Cognitive Ecology of Linguistic Symbols paper: annotator state fluctuation → reward signal noise → weights containing contradictory criteria → inference output drift → users debugging Skills → new training data judged by a new batch of state-fluctuating annotators → the loop repeats and reinforces. This paper provides the micro-mechanism explanation for the “training data enters the weights” segment of that loop.
From Human State Fluctuation to Intrinsic Weight Instability
2.1 The Noise-Injection Pathway of RLHF Annotation
In the standard RLHF pipeline, human annotators rank multiple candidate responses generated by the model. These ranking data are used to train a reward model, which in turn serves as the objective function of PPO optimization to adjust the policy weights of the language model.
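Concretely, the middle step reduces rankings to pairwise labels and fits the reward model with a Bradley-Terry objective (Christiano et al. 2017; Ouyang et al. 2022). The following is a minimal numpy sketch of that loss with illustrative values, not any lab's production code:

```python
import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry pairwise loss: minimize -log sigmoid(r_chosen - r_rejected),
    i.e. push the annotator-preferred response above the rejected one.
    A preference flip caused by annotator state drift swaps which response
    lands in r_chosen, pulling the fit in the opposite direction."""
    margin = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))  # -log sigmoid(margin)

# Toy batch of three preference pairs (scalar rewards).
print(reward_model_loss(np.array([1.2, 0.3, 0.9]),
                        np.array([0.7, 0.8, 0.1])))  # ≈ 0.61
```

The loss itself is standard; everything in this paper concerns what happens when the labels feeding it are unstable.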
The problem lies in the preference-ranking step. The landmark critique in the RLHF field, Casper et al. 2023 (35+ authors, published in TMLR), systematically surveyed this issue: inter-annotator agreement in published RLHF pipelines typically falls in the 63–75% range (Ouyang et al. 2022 report 72.6 ± 1.5% for InstructGPT; Bai et al. 2022 report roughly 63% for Anthropic's helpfulness data).
This means that roughly 25–37% of the training signal is itself contradictory. More crucially, Casper et al. point out that current techniques model differences between evaluators as mere noise rather than as a potentially important signal of genuine disagreement.
2.2 Specific Sources of Instability (Temporal Dimension)
| Source of Fluctuation | Mechanism | Effect on Annotation |
|---|---|---|
| Cognitive fatigue | Judgment degrades after prolonged evaluation | Tendency to choose “safer” but lower-information responses |
| Physiological cycles | Intra-day fluctuations in blood glucose, attention, and hormones | Judgment thresholds shift systematically with time of day |
| Task framing effects | Identical content presented with different wording | Semantically equivalent inputs receive different scores |
| Emotional state drift | Personal events before annotation affect judgment | Same annotator’s preferences inconsistent across days |
| Moral-hazard behavior | Effort deviation under underpayment and incomplete supervision | “Safe but boring” responses systematically preferred |
| Interface position bias | Positional effect of response presentation order | The first-presented option may receive systematic preference |
On the “hungry judge effect”, the finding that Israeli judges’ parole-grant rates drop sharply as the time since the judges’ last food break increases (Danziger, Levav & Avnaim-Pesso 2011): the case has powerful public-communication appeal, but its academic standing is contested. Glöckner 2016’s simulation analysis argued that the effect can be partly explained as a statistical artifact of judges’ rational time management; Daljord et al. 2019 likewise concluded that the effect size has been overestimated, while the directional conclusion survives. This paper does not rely on the specific effect size of that case and uses it only as a rhetorical opener; the real foundation of the argument is the broader annotator state-drift mechanism tabulated above, together with the KL-tilting mathematical framework introduced in the next section.
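The mechanisms in the table above compound. A toy simulation, with every parameter invented for illustration, shows how a modest threshold shift plus extra attention noise makes an annotator contradict their own earlier judgments at roughly the 25–37% contradiction rates cited in Section 2.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent quality gap between responses A and B for 10,000 preference pairs.
gap = rng.normal(0.0, 1.0, 10_000)

def judge(gap: np.ndarray, bias: float, noise_sd: float) -> np.ndarray:
    """Annotator prefers A iff the *perceived* gap is positive. `bias`
    models a shifted judgment threshold (fatigue, hunger); `noise_sd`
    models lapses of attention. All values are illustrative."""
    perceived = gap + bias + rng.normal(0.0, noise_sd, gap.shape)
    return perceived > 0

fresh = judge(gap, bias=0.0, noise_sd=0.3)    # start of shift
tired = judge(gap, bias=-0.4, noise_sd=0.8)   # end of shift: hedging, noisier

print(f"self-disagreement across one shift: {np.mean(fresh != tired):.1%}")
# ≈ 25%: the same annotator contradicts their own earlier labels on
# roughly a quarter of the pairs, from state drift alone.
```

The point is qualitative rather than calibrated: plausible within-annotator drift alone reproduces disagreement of the magnitude observed between annotators.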
The Inevitability of the Alignment Gap: A KL-Tilting Formalization
Published in September 2025, Murphy’s Laws of AI Alignment: Why the Gap Always Wins (Gaikwad, arXiv 2509.05381) provides the most rigorous mathematical backing for the central thesis of this paper. Using a KL-tilting formalism, it argues that the alignment gap Δ(π_β), the residual divergence between the optimized policy and true human intent, remains structurally persistent as the optimization strength β grows without bound: no finite amount of reward-weighted tilting of the reference policy closes it.
This theorem directly yields four corollaries, of which the one on annotator drift is the formal equivalent of this paper’s core claim:
Annotator Taste Drift: “When annotator taste drifts over time, optimization chases a moving target. If the reward r_t varies with time, Δ(π_β) exhibits oscillations proportional to ∥r_{t+1} − r_t∥.”
The implication is structural: even if annotator consistency were thoroughly solved, and even if the sample size m were increased without bound, residual drift persists. This dovetails with the state-fluctuation mechanism discussed in Section 02: the formal result proves that drift is ineliminable, and the mechanism explains where the drift comes from.
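For readers who want the mechanics, here is a standard statement of the KL-regularized objective’s closed-form optimum, in common RLHF notation. It is consistent with the corollary’s scaling but is not a quotation of Gaikwad 2025, whose conventions may differ:

```latex
% KL-regularized RLHF: the optimal policy is an exponential tilt of the
% reference policy by the (time-indexed, noisy) reward r_t.
\pi_\beta^{(t)}(y \mid x)
  = \frac{\pi_{\mathrm{ref}}(y \mid x)\, e^{\beta\, r_t(x,y)}}{Z_\beta^{(t)}(x)},
\qquad
Z_\beta^{(t)}(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x)\, e^{\beta\, r_t(x,y)}.

% If the reward drifts between annotation rounds, the tilted optimum moves
% with it; a standard log-sum-exp Lipschitz bound gives
\bigl\| \log \pi_\beta^{(t+1)} - \log \pi_\beta^{(t)} \bigr\|_\infty
  \le 2\beta\, \bigl\| r_{t+1} - r_t \bigr\|_\infty .
```

The bound makes the corollary’s scaling visible: the policy shift is proportional both to the reward drift ∥r_{t+1} − r_t∥ and to the optimization strength β, so pushing β higher amplifies, rather than dampens, the response to annotator drift.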
3.1 The Unattainability of a Perfect Reward Function
This unattainability is not a data-quality problem but a structural constraint. Mishra et al. 2025, in their ACM Computing Surveys review, identify the fundamental flaw in reward models: the reward model marginalizes over a population of annotators whose preferences disagree, fitting a single scalar function to signals that no coherent individual utility generates; the result is model misspecification by construction.
In other words, the reward model is not “removing noise to extract signal” — it is “compressing multiple contradictory signals into one pseudo-signal.” Once this pseudo-signal is written into the policy weights through PPO optimization, the contradictory criteria stored inside the weights are released through the sampling process when the model faces the same input at inference: sometimes toward the path preferred by annotator A, sometimes toward B; sometimes toward the same annotator’s morning preference, sometimes toward the afternoon one.
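A toy illustration of the pseudo-signal point, with made-up numbers: averaging two internally coherent but opposed annotator reward functions yields a reward whose optimum neither annotator holds.

```python
# Two annotators with opposed but internally coherent preferences
# over three candidate responses (values illustrative).
rewards_A = {"direct": 1.0, "hedged": 0.5, "verbose": 0.0}  # A: direct > hedged > verbose
rewards_B = {"direct": 0.0, "hedged": 0.6, "verbose": 1.0}  # B: verbose > hedged > direct

# A reward model fit on both annotators' labels approximates the mean.
pseudo = {k: (rewards_A[k] + rewards_B[k]) / 2 for k in rewards_A}
print(max(pseudo, key=pseudo.get))  # -> "hedged"
# The optimum of the averaged signal is a response that neither
# annotator ranked first: a preference held by no one.
```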
RLHF Annotation as a Moral Hazard Problem
This paper elevates the descriptive phenomenon of “annotator shirking” to a principal-agent problem in the economic sense. Nobel laureate Bengt Holmström, in his foundational paper Moral Hazard and Observability (Bell Journal of Economics, 1979), gave this problem its mathematical structure: when a principal can observe only noisy outcomes rather than the agent’s effort itself, and must pay on those outcomes, the agent rationally supplies less effort than the principal is paying for, and no feasible contract fully restores first-best behavior.
The RLHF annotation pipeline maps precisely onto this structure (a stylized effort calculation follows the table):
| Principal-Agent Concept | RLHF Annotation Equivalent |
|---|---|
| Principal | AI companies (OpenAI, Anthropic, DeepSeek, etc.) |
| Agent | Crowdsourced annotators (MTurk, Surge AI, and other platform workers) |
| Imperfect information | The principal cannot directly observe whether each annotation received genuine deliberation |
| Moral hazard | Annotators choose “superficially reasonable but actually perfunctory” judgments to maximize hourly earnings |
| Structural consequence | Training data is biased toward “easily produced safe judgments”; the model inherits this bias |
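The stylized effort-choice calculation, with all prices and probabilities invented for illustration, shows why perfunctory judgment is the rational strategy when piece rates meet rare spot-checks:

```python
# Illustrative terms for a crowdsourced preference-ranking task.
pay_per_item = 0.08         # USD per ranked pair
t_careful, t_fast = 90, 15  # seconds per item: deliberate vs. perfunctory
p_audit = 0.02              # probability any given item is spot-checked
penalty = 2.00              # USD withheld when an audited item fails review
p_fail_fast = 0.30          # chance a perfunctory judgment fails an audit

def hourly_rate(t_item: float, p_fail: float) -> float:
    items_per_hour = 3600 / t_item
    expected_penalty = p_audit * p_fail * penalty
    return items_per_hour * (pay_per_item - expected_penalty)

print(f"careful:     ${hourly_rate(t_careful, 0.00):.2f}/h")      # $3.20/h
print(f"perfunctory: ${hourly_rate(t_fast, p_fail_fast):.2f}/h")  # $16.32/h
# Shirking dominates by 5x under these terms; only raising p_audit or the
# penalty changes the calculus, and the principal cannot audit deliberation.
```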
4.1 The Extreme Form of Moral Hazard: Annotators Outsourcing Work to LLMs
Published in June 2023, Artificial Artificial Artificial Intelligence (Veselovsky, Ribeiro & West, EPFL, arXiv 2306.07899) provides the sharpest empirical evidence for the moral-hazard problem in the AI era: on a text-summarization task, an estimated 33–46% of MTurk crowd workers used LLMs to produce the “human” responses they submitted.
The recursive consequence of this finding is shattering: a substantial fraction of the “human preferences” in RLHF training data are in fact LLM-generated preferences, submitted second-hand through the annotator. AI is, in effect, training itself under the label of human preference.
The study further notes that LLM use produces high-quality but homogenized responses, which may both damage research that takes human (rather than model) behavior as its object of study and degrade future models trained on crowdsourced data. A follow-up study (Veselovsky et al. 2023, arXiv 2310.15683) found that targeted countermeasures merely halve LLM usage rather than eliminating it. In other words, this contamination mechanism is self-accelerating: each generation of models trains on the outputs of the previous generation, and the pure human-preference signal is further diluted with each training round.
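The dilution dynamic can be made concrete with a one-line recursion. The assumption here, that a fixed share of each round’s “human” labels is actually model output, is an extrapolation; Veselovsky et al. measured a single task type in a single round:

```python
def human_signal_fraction(rounds: int, contamination: float) -> float:
    """Fraction of preference signal still traceable to genuine human
    judgment after `rounds` of training, if a fixed share of each
    round's 'human' labels is produced by the previous model generation."""
    return (1.0 - contamination) ** rounds

for c in (0.33, 0.46):  # the Veselovsky et al. bounds for MTurk
    print(c, [round(human_signal_fraction(g, c), 2) for g in range(1, 5)])
# 0.33 [0.67, 0.45, 0.3, 0.2]
# 0.46 [0.54, 0.29, 0.16, 0.09]
```

Under these assumptions, within three to four rounds the majority of the “human preference” signal is second-hand model output.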
The Complete Causal Chain from Annotator State to Skill Output Drift
Annotator cultural attributes (spatial baseline) + state fluctuation + moral hazard → dual-axis drift in preferences → reward model encodes a contradictory snapshot → PPO optimizes the contradictions into the weights → frozen parameters contain intrinsic instability → inference sampling releases the fluctuation → Skill output: format deformation / quality drift
Within this causal chain there is a three-layer stacking of instabilities:
Layer 1: Mathematical-level sampling randomness. Every sampling from the softmax probability distribution in the Transformer architecture is an independent random event. Even if the weights contain no noise whatsoever, a sampling process with temperature > 0 will produce different outputs.
Layer 2: Weight-level intrinsic instability (the core argument of this paper). RLHF simultaneously encodes both the cultural attributes (spatial dimension) and the state fluctuations (temporal dimension) of human annotators into the weights themselves. The probability field defined by the weights is not a “clean” distribution, but a “split” distribution containing contradictory criteria.
Layer 3: Batch-invariance failure at the inference-infrastructure layer. Research by Thinking Machines Lab in 2025 revealed that modern LLM inference servers dynamically adjust batch sizes based on load, causing the same request to traverse different floating-point paths under different batch configurations. Even with completely frozen weights, the inference-time batch state is continually changing. This further demonstrates that “parameter freezing” in engineering reality cannot be equated with “system stability.”
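The batch-invariance point ultimately reduces to an arithmetic fact: floating-point addition is not associative, and batch size changes reduction order. A minimal demonstration of the underlying fact (this is not the Thinking Machines code):

```python
import numpy as np

x   = np.float32(1e8)
eps = np.float32(3.0)

# Same three addends, two reduction orders.
left  = (x + eps) + eps   # each 3.0 alone is under half an ulp of 1e8 and is lost
right = x + (eps + eps)   # paired first, 6.0 survives the rounding
print(left == right, left, right)   # False 100000000.0 100000008.0
```

A kernel that splits its reduction one way at batch size 8 and another way at batch size 32 performs exactly this kind of reordering, which is why identical requests can yield different logits even at temperature 0.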
The consequences of the three-layer stacking: Layer 1 produces random fluctuations around the mean (mitigable through best-of-N sampling), Layer 2 produces drift in the mean itself (unmitigable through sampling strategies), and Layer 3 produces drift even when the mean is fixed, due to changing server states (unmitigable even with frozen weights). This is why the same Skill, after a period of use, exhibits directional degradation — not random worsening, but systematic deviation from the initial calibration point.
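The mitigability claims can be checked on a toy model in which an output’s quality is a calibration point plus drift plus per-sample noise; all numbers are illustrative. Best-of-N sampling collapses the noise term but passes the drift through untouched:

```python
import numpy as np

rng = np.random.default_rng(1)

def best_of_n(n: int, drift: float) -> float:
    """Layer-1 noise around a (possibly drifted) mean; keep the best sample."""
    return float((drift + rng.normal(0.0, 1.0, n)).max())

calibrated = np.mean([best_of_n(8, drift=0.0) for _ in range(5_000)])
drifted    = np.mean([best_of_n(8, drift=-0.6) for _ in range(5_000)])
print(round(calibrated - drifted, 2))  # ≈ 0.6: best-of-8 inherits the mean shift intact
```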
Why Harness Engineering Cannot Solve This Problem
In early 2026, the concept of “Harness Engineering” rose rapidly to popularity in the AI engineering community. The core formula: Agent = Model + Harness. The model is the horse; the harness is the reins.
The metaphor itself exposes its own limitation. The entire premise of reins is: the horse is a horse. You fit the horse with reins, saddle, and guardrails on the assumption that the horse’s temperament is stable.
But what an RLHF-trained model encodes in its weights is a probability distribution drifting among horse, mule, and donkey. On this inference it is a horse; on the next it may be a mule. The reins are designed for a horse, and when a mule emerges they no longer fit. And you have no way of knowing, ex ante, which one will emerge.
| Peripheral Solution | Layer of Action | What It Solves | What It Cannot Solve |
|---|---|---|---|
| Harness Engineering | Behavioral-constraint layer | Prevents the Agent from taking wrong paths or calling wrong tools | Drift of the probability field inside the weights |
| Skill optimization | Prompt surface | Finds a better sampling region on the current probability field | The change of the probability field itself |
| Context Engineering | Input side | Provides better contextual information | Contradictory criteria encoded in the weights |
| Temperature = 0 | Sampling strategy | Compresses Layer 1 (sampling randomness) | Layer 2 and Layer 3 instabilities |
| Model version locking | Version management | Freezes parameters | What is frozen are the unstable parameters |
The common blind spot of all peripheral solutions: they constrain the boundary of the output space, not the shape of the distribution inside the probability field. Traffic regulations can govern which lane you drive in, but not how much horsepower your engine produces this second.
This judgment is highly consistent with the core conclusion of the LEECHO paper Cognitive Ecology of Linguistic Symbols: “The variable that breaks the closed loop is not on the model side but on the human side.” Better CoT, more parameters, and more data cannot break through the categorical lock-in of Layer-1 cognition; likewise, finer Harness, more complex Skills, and deeper Context Engineering cannot break through the probability-field drift injected by RLHF. The breakthrough lies in the training paradigm itself.
Output Non-Determinism: A Structural Cause of Enterprise AI Failure
This is not theoretical speculation. Enterprise data has already validated the judgment: roughly 80% of enterprise AI projects fail to deliver expected value or to scale to production (Pertama Partners 2026, synthesizing RAND, MIT Sloan, and McKinsey data), and 73% of surveyed organizations report deployments blocked by output inconsistency (AICamp 2025).
The deterministic mapping of traditional software — same input must yield same output — is a foundational assumption of enterprise workflows. The probabilistic output of LLMs fundamentally violates this assumption. Enterprises need fixed-format documents, stably structured code, reproducible analytical results. What LLMs can offer is only “probabilistic approximation.”
When an AI Agent performs excellently on high-score benchmarks but sees its success rate drop from 60% to 25% across repeated executions, the model’s “average correctness” equals “unusable” in enterprise settings. This is not an engineering flaw — it is the direct projection, at the application layer, of the mathematical essence of the RLHF paradigm, stacked with moral-hazard contamination and inference-infrastructure drift.
RLVR: Compressing the Injection Space at Its Source
If the root cause of the problem is that RLHF injects human dual-axis instability (cultural + state) into the reward signal, then the logical solution is: replace RLHF with a reward signal that does not depend on human subjective judgment.
RLVR (Reinforcement Learning with Verifiable Rewards) offers this direction. Its core distinction:
| Dimension | RLHF | RLVR |
|---|---|---|
| Source of reward signal | Human subjective preference ranking | Objective verifiable criteria |
| Signal stability | Varies with culture + state | Deterministic (format correct/incorrect, code runs/doesn’t) |
| Moral-hazard exposure | Large (annotator can hide effort) | Near zero (verification result is binary and observable) |
| Applicable scenarios | Creative writing, dialogue, open-ended questions | Code generation, formatted documents, numerical computation |
| What is encoded in the weights | A fluctuating preference distribution | A narrowed deterministic behavior distribution |
For enterprise-office scenarios — fixed-format documents, stably structured code, reproducible analytical outputs — RLVR is a better match than RLHF, because “format correctness” is verifiable, whereas “is the content good” is subjective.
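A minimal sketch of what “verifiable” means in the table above: the reward is a pure function of mechanically checkable properties of the output, with no human judgment anywhere in the loop. The two-field schema here is hypothetical:

```python
import json

def verifiable_reward(output: str) -> float:
    """Binary reward for a hypothetical enterprise formatting task:
    emit a JSON object with exactly 'title' (string) and 'amount' (number).
    The check is deterministic, so the reward cannot drift with the
    culture, fatigue, or mood of any annotator."""
    try:
        doc = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    ok = (isinstance(doc, dict)
          and set(doc) == {"title", "amount"}
          and isinstance(doc.get("title"), str)
          and isinstance(doc.get("amount"), (int, float)))
    return 1.0 if ok else 0.0

print(verifiable_reward('{"title": "Q3 invoice", "amount": 1280.5}'))  # 1.0
print(verifiable_reward('{"title": "Q3 invoice"}'))                    # 0.0
```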
The boundary of RLVR: RLVR cannot escape the Alignment Gap mathematical constraint of Murphy’s Laws. It can change the slope and intercept of the curve — compressing the Layer-2 instability from “drifting among horse, mule, donkey” to “at least always a horse, though it might run faster or slower” — but it cannot eliminate Layer 1 (sampling randomness) or Layer 3 (inference-infrastructure drift). For enterprise scenarios that require precise format reproduction, this may already be enough. For scenarios requiring creative output, RLHF remains irreplaceable.
Frozen Parameters ≠ Stable System
The core claim of this paper can be condensed into a single line: parameter freezing ≠ system stability.
The entire industry is using the mental model of deterministic systems to understand a probabilistic system. In traditional software, frozen parameters mean frozen behavior. In an LLM, frozen parameters only mean a frozen probability field — and what is encoded inside that probability field is precisely the dual-axis instability of human annotators (cultural dimension and temporal dimension), plus moral-hazard contamination, plus the batch-invariance failure at inference time. Behavior remains a random variable.
The complete logical closed loop:
Spatial dimension: cultural attributes (Cultural Attributes V2) ⊕ Temporal dimension: state fluctuation (this paper) → RLHF weights injected with dual-axis drift → three-layer instability stack amplifies output unpredictability
This closed loop seals off every peripheral solution — Harness cannot constrain the probability drift inside the weights; Skill optimization tunes parameters on a drifting probability field; Context Engineering cannot alter the contradictory criteria in the weights; model version locking freezes precisely the unstable parameters.
The only direction that remains open: intervene at the training paradigm itself. In scenarios that demand deterministic output, replace RLHF with RLVR to compress the dual-axis injection space at the source of the reward signal. In scenarios that demand creative output, accept that instability is a feature of RLHF, not a defect, and leave room for it in system design.
System Positioning
This paper is the fifth in the LEECHO paper series. The previous four constitute a complete argumentative chain: Fluid Topology and Solid Topology V2 (physical layer) → Three Paradigms of Human Scientific Cognition (methodological layer) → Cognition · Metacognition · Global Metacognition V3 (cognitive-structure layer) → Cultural Attributes Injected into LLM Models V2 (spatial-dimensional cultural injection). This paper supplies the fifth link — the temporal-dimensional instability injected by the RLHF training paradigm — forming a spatiotemporal dual-axis duality with the fourth paper, both subsumed under the cognitive-dimensional-reduction closed loop of the third.
The point is not to make the reins tighter, but to make the horse’s temperament more stable. Yet even the most stable horse is only a horse whose fluctuation within the probability field is smaller — between parameter freezing and system stability there always lies an unbridgeable categorical gap.
References
- LEECHO Global AI Research Lab (2026). “Cultural Attributes Injected into LLM Models” V2. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “The Cognitive Ecology of Linguistic Symbols” V3. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Cognition · Metacognition · Global Metacognition” V3. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Three Paradigms of Human Scientific Cognition.” leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Fluid Topology and Solid Topology” V2. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Signal and Noise: An Ontology of LLMs” V4. leechoglobalai.com.
- Casper, S., Davies, X., Shi, C., et al. (2023). “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.” Transactions on Machine Learning Research. arXiv:2307.15217.
- Gaikwad, M. (2025). “Murphy’s Laws of AI Alignment: Why the Gap Always Wins.” arXiv:2509.05381. KL-tilting formalism, Alignment Gap inevitability theorem, Annotator Drift corollary.
- Holmström, B. (1979). “Moral Hazard and Observability.” Bell Journal of Economics, 10(1), 74–91. Foundational work of principal-agent theory.
- Veselovsky, V., Ribeiro, M.H. & West, R. (2023). “Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks.” EPFL. arXiv:2306.07899. 33–46% of MTurk annotators use LLMs to complete tasks.
- Veselovsky, V., Ribeiro, M.H., Cozzolino, P., et al. (2023). “Prevalence and prevention of large language model use in crowd work.” arXiv:2310.15683. Mitigations only halve LLM usage.
- Mishra, A. et al. (2025). “RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs.” ACM Computing Surveys, 58(2). Marginalization over preferences and model misspecification arguments.
- Bai, Y. et al. (2022). “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” Anthropic. arXiv:2204.05862.
- Ouyang, L. et al. (2022). “Training language models to follow instructions with human feedback.” OpenAI. arXiv:2203.02155 (InstructGPT).
- Christiano, P.F. et al. (2017). “Deep reinforcement learning from human preferences.” NeurIPS.
- Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” OpenAI. arXiv:1707.06347.
- He, H. et al. (2025). “Defeating Nondeterminism in LLM Inference.” Thinking Machines Lab. Batch-invariance failure as the root cause of inference-infrastructure-layer instability.
- Atil, B. et al. (2024). “Non-Determinism of ‘Deterministic’ LLM Settings.” arXiv:2408.04667. TARr@N and TARa@N quantitative metrics.
- Chann, S. (2023). “Non-determinism in GPT-4 is caused by Sparse MoE.” Analysis of MoE routing nondeterminism.
- Danziger, S., Levav, J. & Avnaim-Pesso, L. (2011). “Extraneous factors in judicial decisions.” PNAS, 108(17), 6889–6892. Original hungry-judge-effect paper.
- Glöckner, A. (2016). “The irrational hungry judge effect revisited: Simulations reveal that the magnitude of the effect is overestimated.” Judgment and Decision Making, 11(6), 601–610. Academic controversy discussion.
- Daljord, Ø., Urminsky, O., & Ureta, J. (2019). “The Status Quo Theory of Depletion Does Not Explain the Israeli Parole Decisions.” Effect size overestimated; directional conclusion retained.
- Pertama Partners (2026). “AI Project Failure Rate 2026: 80% Fail.” Statistical analysis of RAND, MIT Sloan, and McKinsey data.
- AICamp (2025). “AI Output Inconsistency: Enterprise Solutions.” Enterprise survey: 73% of organizations report output inconsistency.
- Sharma, M. et al. (2023). “Towards understanding sycophancy in language models.” Anthropic. ICLR 2024.
- Itzhak, B., Belinkov, Y. & Stanovsky, G. (2025). “Pretraining is the primary source of cognitive biases in LLMs.” COLM 2025.