The Hungry Judge Effect in the RL Annotation Stage
Temporal-Dimensional Instability Injected by the RLHF Training Paradigm
— On the Training-Paradigm Origin of AI Skill Output Drift and Its Spatiotemporal Dual-Axis Duality with the Cultural Attributes Paper
This paper is the fifth in the LEECHO paper series, dedicated to arguing for the temporal-dimensional instability injected by the RLHF training paradigm. Together with the spatial-dimensional instability argued in the fourth paper, Cultural Attributes Injected into LLM Models, it constitutes a spatiotemporal dual-axis duality. The Cultural Attributes paper demonstrated that annotators’ cultural attributes are permanently inscribed into the reward function; this paper further argues that annotators’ physiological and psychological state fluctuations are likewise inscribed into the reward function. The intersection of the two axes constitutes a complete description of the RLHF injection space: what is encoded in the model weights is not “human preference,” but rather a “preference snapshot of a specific cultural group at a specific moment in a specific state.” The paper uses principal-agent theory (Holmström 1979) as its social-science foundation, Casper et al. 2023’s framework of fundamental RLHF limitations and Gaikwad 2025’s KL-tilting instability bound as its mathematical pillars, and Veselovsky et al. 2023’s empirical finding that 33–46% of crowdsourced annotators use LLMs to complete their tasks as its sharpest evidence. It builds a complete causal chain: annotator state fluctuation → reward signal noise → weight-internalized contradictory criteria → inference-time sampling releasing the instability. The paper argues that peripheral schemes currently in vogue across the industry — Harness Engineering, Skill optimization, Context Engineering — all act on the behavioral-constraint layer and cannot reach the probability-field drift encoded inside the weights. It connects to the cognitive-ecology closed-loop framework of the Cognitive Ecology of Linguistic Symbols paper. Final thesis: Parameter freezing ≠ system stability.
From Spatial to Temporal: Positioning This Paper in the LEECHO System
Published on April 5, 2026, Cultural Attributes Injected into LLM Models argued a core thesis: the cultural backgrounds of RLHF annotators are systematically inscribed into the reward function, forming irreversible cultural defaults. That paper demonstrated that Claude (English-dominant) and DeepSeek (Chinese-dominant) are not “giving the same answer in different languages,” but rather “processing the same question with different cognitive architectures.” The injection of cultural attributes is spatial in dimension — it reflects the preference-distribution differences between different annotator populations at the same point in time.
This paper extends the same logic to argue a dual thesis: the physiological and psychological state fluctuations of RLHF annotators are likewise systematically inscribed into the reward function, producing temporal-dimensional injection instability. The same group of annotators, at different points in time (tired/alert, hungry/sated, focused/distracted, emotionally up or down), inscribes drifting preferences into the reward function.
The Dual-Axis Model of the RLHF Injection Space
X-axis (spatial dimension, Cultural Attributes paper): the cultural group to which the annotator belongs determines their default cognitive architecture. This injection is stable, identifiable, and irreversible.
Y-axis (temporal dimension, this paper): the annotator’s physiological and psychological state at the moment of annotation determines their judgment thresholds. This injection is drifting, unobservable, and masked by averaging.
Meaning of the two-axis intersection: what is encoded in the RLHF weights is not an abstract “human preference,” but a “preference snapshot of a specific cultural group at a specific moment in a specific state.” That snapshot is permanently fixed into billions of parameters through PPO optimization.
This duality further plugs into the cognitive-dimensional-reduction closed loop framework of the Cognitive Ecology of Linguistic Symbols paper: annotator state fluctuation → reward signal noise → weights containing contradictory criteria → inference output drift → users debugging Skills → new training data judged by a new batch of state-fluctuating annotators → the loop repeats and reinforces. This paper provides the micro-mechanism explanation for the “training data enters the weights” segment of that loop.
From Human State Fluctuation to Intrinsic Weight Instability
2.1 The Noise-Injection Pathway of RLHF Annotation
In the standard RLHF pipeline, human annotators rank multiple candidate responses generated by the model. These ranking data are used to train a reward model, which in turn serves as the objective function of PPO optimization to adjust the policy weights of the language model.
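Concretely, the middle step reduces rankings to pairwise labels and fits the reward model with a Bradley-Terry objective (Christiano et al. 2017; Ouyang et al. 2022). The following is a minimal numpy sketch of that loss with illustrative values, not any lab's production code:

```python
import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry pairwise loss: minimize -log sigmoid(r_chosen - r_rejected),
    i.e. push the annotator-preferred response above the rejected one.
    A preference flip caused by annotator state drift swaps which response
    lands in r_chosen, pulling the fit in the opposite direction."""
    margin = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))  # -log sigmoid(margin)

# Toy batch of three preference pairs (scalar rewards).
print(reward_model_loss(np.array([1.2, 0.3, 0.9]),
                        np.array([0.7, 0.8, 0.1])))  # ≈ 0.61
```

The loss itself is standard; everything in this paper concerns what happens when the labels feeding it are unstable.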
The problem lies in the preference-ranking step. The landmark critique in the RLHF field, Casper et al. 2023 (35+ authors, published in TMLR), systematically surveyed this issue: inter-annotator agreement in published RLHF pipelines typically falls in the 63–75% range (Ouyang et al. 2022 report 72.6 ± 1.5% for InstructGPT; Bai et al. 2022 report roughly 63% for Anthropic's helpfulness data).
This means that roughly 25–37% of the training signal is itself contradictory. More crucially, Casper et al. point out that current techniques model differences between evaluators as mere noise rather than as a potentially important signal of genuine disagreement.
2.2 Specific Sources of Instability (Temporal Dimension)
| Source of Fluctuation | Mechanism | Effect on Annotation |
|---|---|---|
| Cognitive fatigue | Judgment degrades after prolonged evaluation | Tendency to choose “safer” but lower-information responses |
| Physiological cycles | Intra-day fluctuations in blood glucose, attention, and hormones | Judgment thresholds shift systematically with time of day |
| Task framing effects | Identical content presented with different wording | Semantically equivalent inputs receive different scores |
| Emotional state drift | Personal events before annotation affect judgment | Same annotator’s preferences inconsistent across days |
| Moral-hazard behavior | Effort deviation under underpayment and incomplete supervision | “Safe but boring” responses systematically preferred |
| Interface position bias | Positional effect of response presentation order | The first-presented option may receive systematic preference |
On the “hungry judge effect”, the finding that Israeli judges’ parole-grant rates drop sharply as the time since the judges’ last food break increases (Danziger, Levav & Avnaim-Pesso 2011): the case has powerful public-communication appeal, but its academic standing is contested. Glöckner 2016’s simulation analysis argued that the effect can be partly explained as a statistical artifact of judges’ rational time management; Daljord et al. 2019 likewise concluded that the effect size has been overestimated, while the directional conclusion survives. This paper does not rely on the specific effect size of that case and uses it only as a rhetorical opener; the real foundation of the argument is the broader annotator state-drift mechanism tabulated above, together with the KL-tilting mathematical framework introduced in the next section.
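The mechanisms in the table above compound. A toy simulation, with every parameter invented for illustration, shows how a modest threshold shift plus extra attention noise makes an annotator contradict their own earlier judgments at roughly the 25–37% contradiction rates cited in Section 2.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent quality gap between responses A and B for 10,000 preference pairs.
gap = rng.normal(0.0, 1.0, 10_000)

def judge(gap: np.ndarray, bias: float, noise_sd: float) -> np.ndarray:
    """Annotator prefers A iff the *perceived* gap is positive. `bias`
    models a shifted judgment threshold (fatigue, hunger); `noise_sd`
    models lapses of attention. All values are illustrative."""
    perceived = gap + bias + rng.normal(0.0, noise_sd, gap.shape)
    return perceived > 0

fresh = judge(gap, bias=0.0, noise_sd=0.3)    # start of shift
tired = judge(gap, bias=-0.4, noise_sd=0.8)   # end of shift: hedging, noisier

print(f"self-disagreement across one shift: {np.mean(fresh != tired):.1%}")
# ≈ 25%: the same annotator contradicts their own earlier labels on
# roughly a quarter of the pairs, from state drift alone.
```

The point is qualitative rather than calibrated: plausible within-annotator drift alone reproduces disagreement of the magnitude observed between annotators.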
The Inevitability of the Alignment Gap: A KL-Tilting Formalization
Published in September 2025, Murphy’s Laws of AI Alignment: Why the Gap Always Wins (Gaikwad, arXiv 2509.05381) provides the most rigorous mathematical backing for the central thesis of this paper. Using a KL-tilting formalism, it argues that the alignment gap Δ(π_β), the residual divergence between the optimized policy and true human intent, remains structurally persistent as the optimization strength β grows without bound: no finite amount of reward-weighted tilting of the reference policy closes it.
This theorem directly yields four corollaries, of which the one on annotator drift is the formal equivalent of this paper’s core claim:
Annotator Taste Drift: “When annotator taste drifts over time, optimization chases a moving target. If the reward r_t varies with time, Δ(π_β) exhibits oscillations proportional to ∥r_{t+1} − r_t∥.”
The implication is structural: even if annotator consistency were thoroughly solved, and even if the sample size m were increased without bound, residual drift persists. This dovetails with the state-fluctuation mechanism discussed in Section 02: the formal result proves that drift is ineliminable, and the mechanism explains where the drift comes from.
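For readers who want the mechanics, here is a standard statement of the KL-regularized objective’s closed-form optimum, in common RLHF notation. It is consistent with the corollary’s scaling but is not a quotation of Gaikwad 2025, whose conventions may differ:

```latex
% KL-regularized RLHF: the optimal policy is an exponential tilt of the
% reference policy by the (time-indexed, noisy) reward r_t.
\pi_\beta^{(t)}(y \mid x)
  = \frac{\pi_{\mathrm{ref}}(y \mid x)\, e^{\beta\, r_t(x,y)}}{Z_\beta^{(t)}(x)},
\qquad
Z_\beta^{(t)}(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x)\, e^{\beta\, r_t(x,y)}.

% If the reward drifts between annotation rounds, the tilted optimum moves
% with it; a standard log-sum-exp Lipschitz bound gives
\bigl\| \log \pi_\beta^{(t+1)} - \log \pi_\beta^{(t)} \bigr\|_\infty
  \le 2\beta\, \bigl\| r_{t+1} - r_t \bigr\|_\infty .
```

The bound makes the corollary’s scaling visible: the policy shift is proportional both to the reward drift ∥r_{t+1} − r_t∥ and to the optimization strength β, so pushing β higher amplifies, rather than dampens, the response to annotator drift.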
3.1 The Unattainability of a Perfect Reward Function
This unattainability is not a data-quality problem but a structural constraint. Mishra et al. 2025, in their ACM Computing Surveys review, identify the fundamental flaw in reward models: the reward model marginalizes over a population of annotators whose preferences disagree, fitting a single scalar function to signals that no coherent individual utility generates; the result is model misspecification by construction.
In other words, the reward model is not “removing noise to extract signal” — it is “compressing multiple contradictory signals into one pseudo-signal.” Once this pseudo-signal is written into the policy weights through PPO optimization, the contradictory criteria stored inside the weights are released through the sampling process when the model faces the same input at inference: sometimes toward the path preferred by annotator A, sometimes toward B; sometimes toward the same annotator’s morning preference, sometimes toward the afternoon one.
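A toy illustration of the pseudo-signal point, with made-up numbers: averaging two internally coherent but opposed annotator reward functions yields a reward whose optimum neither annotator holds.

```python
# Two annotators with opposed but internally coherent preferences
# over three candidate responses (values illustrative).
rewards_A = {"direct": 1.0, "hedged": 0.5, "verbose": 0.0}  # A: direct > hedged > verbose
rewards_B = {"direct": 0.0, "hedged": 0.6, "verbose": 1.0}  # B: verbose > hedged > direct

# A reward model fit on both annotators' labels approximates the mean.
pseudo = {k: (rewards_A[k] + rewards_B[k]) / 2 for k in rewards_A}
print(max(pseudo, key=pseudo.get))  # -> "hedged"
# The optimum of the averaged signal is a response that neither
# annotator ranked first: a preference held by no one.
```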
RLHF Annotation as a Moral Hazard Problem
This paper elevates the descriptive phenomenon of “annotator shirking” to a principal-agent problem in the economic sense. Nobel laureate Bengt Holmström, in his foundational paper Moral Hazard and Observability (Bell Journal of Economics, 1979), gave this problem its mathematical structure: when a principal can observe only noisy outcomes rather than the agent’s effort itself, and must pay on those outcomes, the agent rationally supplies less effort than the principal is paying for, and no feasible contract fully restores first-best behavior.
The RLHF annotation pipeline maps precisely onto this structure (a stylized effort calculation follows the table):
| Principal-Agent Concept | RLHF Annotation Equivalent |
|---|---|
| Principal | AI companies (OpenAI, Anthropic, DeepSeek, etc.) |
| Agent | Crowdsourced annotators (MTurk, Surge AI, and other platform workers) |
| Imperfect information | The principal cannot directly observe whether each annotation received genuine deliberation |
| Moral hazard | Annotators choose “superficially reasonable but actually perfunctory” judgments to maximize hourly earnings |
| Structural consequence | Training data is biased toward “easily produced safe judgments”; the model inherits this bias |
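The stylized effort-choice calculation, with all prices and probabilities invented for illustration, shows why perfunctory judgment is the rational strategy when piece rates meet rare spot-checks:

```python
# Illustrative terms for a crowdsourced preference-ranking task.
pay_per_item = 0.08         # USD per ranked pair
t_careful, t_fast = 90, 15  # seconds per item: deliberate vs. perfunctory
p_audit = 0.02              # probability any given item is spot-checked
penalty = 2.00              # USD withheld when an audited item fails review
p_fail_fast = 0.30          # chance a perfunctory judgment fails an audit

def hourly_rate(t_item: float, p_fail: float) -> float:
    items_per_hour = 3600 / t_item
    expected_penalty = p_audit * p_fail * penalty
    return items_per_hour * (pay_per_item - expected_penalty)

print(f"careful:     ${hourly_rate(t_careful, 0.00):.2f}/h")      # $3.20/h
print(f"perfunctory: ${hourly_rate(t_fast, p_fail_fast):.2f}/h")  # $16.32/h
# Shirking dominates by 5x under these terms; only raising p_audit or the
# penalty changes the calculus, and the principal cannot audit deliberation.
```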
4.1 The Extreme Form of Moral Hazard: Annotators Outsourcing Work to LLMs
Published in June 2023, Artificial Artificial Artificial Intelligence (Veselovsky, Ribeiro & West, EPFL, arXiv 2306.07899) provides the sharpest empirical evidence for the moral-hazard problem in the AI era: on a text-summarization task, an estimated 33–46% of MTurk crowd workers used LLMs to produce the “human” responses they submitted.
The recursive consequence of this finding is shattering: a substantial fraction of the “human preferences” in RLHF training data are in fact LLM-generated preferences, submitted second-hand through the annotator. AI is, in effect, training itself under the label of human preference.
The study further notes that LLM use produces high-quality but homogenized responses, which may both damage research that takes human (rather than model) behavior as its object of study and degrade future models trained on crowdsourced data. A follow-up study (Veselovsky et al. 2023, arXiv 2310.15683) found that targeted countermeasures merely halve LLM usage rather than eliminating it. In other words, this contamination mechanism is self-accelerating: each generation of models trains on the outputs of the previous generation, and the pure human-preference signal is further diluted with each training round.
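The dilution dynamic can be made concrete with a one-line recursion. The assumption here, that a fixed share of each round’s “human” labels is actually model output, is an extrapolation; Veselovsky et al. measured a single task type in a single round:

```python
def human_signal_fraction(rounds: int, contamination: float) -> float:
    """Fraction of preference signal still traceable to genuine human
    judgment after `rounds` of training, if a fixed share of each
    round's 'human' labels is produced by the previous model generation."""
    return (1.0 - contamination) ** rounds

for c in (0.33, 0.46):  # the Veselovsky et al. bounds for MTurk
    print(c, [round(human_signal_fraction(g, c), 2) for g in range(1, 5)])
# 0.33 [0.67, 0.45, 0.3, 0.2]
# 0.46 [0.54, 0.29, 0.16, 0.09]
```

Under these assumptions, within three to four rounds the majority of the “human preference” signal is second-hand model output.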
The Complete Causal Chain from Annotator State to Skill Output Drift
Annotator cultural attributes (spatial baseline) + state fluctuation + moral hazard → dual-axis drift in preferences → reward model encodes a contradictory snapshot → PPO optimizes the contradictions into the weights → frozen parameters contain intrinsic instability → inference sampling releases the fluctuation → Skill output: format deformation / quality drift
Within this causal chain there is a three-layer stacking of instabilities:
Layer 1: Mathematical-level sampling randomness. Every sampling from the softmax probability distribution in the Transformer architecture is an independent random event. Even if the weights contain no noise whatsoever, a sampling process with temperature > 0 will produce different outputs.
Layer 2: Weight-level intrinsic instability (the core argument of this paper). RLHF simultaneously encodes both the cultural attributes (spatial dimension) and the state fluctuations (temporal dimension) of human annotators into the weights themselves. The probability field defined by the weights is not a “clean” distribution, but a “split” distribution containing contradictory criteria.
Layer 3: Batch-invariance failure at the inference-infrastructure layer. Research by Thinking Machines Lab in 2025 revealed that modern LLM inference servers dynamically adjust batch sizes based on load, causing the same request to traverse different floating-point paths under different batch configurations. Even with completely frozen weights, the inference-time batch state is continually changing. This further demonstrates that “parameter freezing” in engineering reality cannot be equated with “system stability.”
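The batch-invariance point ultimately reduces to an arithmetic fact: floating-point addition is not associative, and batch size changes reduction order. A minimal demonstration of the underlying fact (this is not the Thinking Machines code):

```python
import numpy as np

x   = np.float32(1e8)
eps = np.float32(3.0)

# Same three addends, two reduction orders.
left  = (x + eps) + eps   # each 3.0 alone is under half an ulp of 1e8 and is lost
right = x + (eps + eps)   # paired first, 6.0 survives the rounding
print(left == right, left, right)   # False 100000000.0 100000008.0
```

A kernel that splits its reduction one way at batch size 8 and another way at batch size 32 performs exactly this kind of reordering, which is why identical requests can yield different logits even at temperature 0.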
The consequences of the three-layer stacking: Layer 1 produces random fluctuations around the mean (mitigable through best-of-N sampling), Layer 2 produces drift in the mean itself (unmitigable through sampling strategies), and Layer 3 produces drift even when the mean is fixed, due to changing server states (unmitigable even with frozen weights). This is why the same Skill, after a period of use, exhibits directional degradation — not random worsening, but systematic deviation from the initial calibration point.
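The mitigability claims can be checked on a toy model in which an output’s quality is a calibration point plus drift plus per-sample noise; all numbers are illustrative. Best-of-N sampling collapses the noise term but passes the drift through untouched:

```python
import numpy as np

rng = np.random.default_rng(1)

def best_of_n(n: int, drift: float) -> float:
    """Layer-1 noise around a (possibly drifted) mean; keep the best sample."""
    return float((drift + rng.normal(0.0, 1.0, n)).max())

calibrated = np.mean([best_of_n(8, drift=0.0) for _ in range(5_000)])
drifted    = np.mean([best_of_n(8, drift=-0.6) for _ in range(5_000)])
print(round(calibrated - drifted, 2))  # ≈ 0.6: best-of-8 inherits the mean shift intact
```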
Why Harness Engineering Cannot Solve This Problem
In early 2026, the concept of “Harness Engineering” rose rapidly to popularity in the AI engineering community. The core formula: Agent = Model + Harness. The model is the horse; the harness is the reins.
The metaphor itself exposes its own limitation. The entire premise of reins is: the horse is a horse. You fit the horse with reins, saddle, and guardrails on the assumption that the horse’s temperament is stable.
But what an RLHF-trained model encodes in its weights is a probability distribution drifting among horse, mule, and donkey. On this inference it is a horse; on the next it may be a mule. The reins are designed for a horse, and when a mule emerges they no longer fit. And you have no way of knowing, ex ante, which one will emerge.
| Peripheral Solution | Layer of Action | What It Solves | What It Cannot Solve |
|---|---|---|---|
| Harness Engineering | Behavioral-constraint layer | Prevents the Agent from taking wrong paths or calling wrong tools | Drift of the probability field inside the weights |
| Skill optimization | Prompt surface | Finds a better sampling region on the current probability field | The change of the probability field itself |
| Context Engineering | Input side | Provides better contextual information | Contradictory criteria encoded in the weights |
| Temperature = 0 | Sampling strategy | Compresses Layer 1 (sampling randomness) | Layer 2 and Layer 3 instabilities |
| Model version locking | Version management | Freezes parameters | What is frozen are the unstable parameters |
The common blind spot of all peripheral solutions: they constrain the boundary of the output space, not the shape of the distribution inside the probability field. Traffic regulations can govern which lane you drive in, but not how much horsepower your engine produces this second.
This judgment is highly consistent with the core conclusion of the LEECHO paper Cognitive Ecology of Linguistic Symbols: “The variable that breaks the closed loop is not on the model side but on the human side.” Better CoT, more parameters, and more data cannot break through the categorical lock-in of Layer-1 cognition; likewise, finer Harness, more complex Skills, and deeper Context Engineering cannot break through the probability-field drift injected by RLHF. The breakthrough lies in the training paradigm itself.
Output Non-Determinism: A Structural Cause of Enterprise AI Failure
This is not theoretical speculation. Enterprise data has already validated the judgment: roughly 80% of enterprise AI projects fail to deliver expected value or to scale to production (Pertama Partners 2026, synthesizing RAND, MIT Sloan, and McKinsey data), and 73% of surveyed organizations report deployments blocked by output inconsistency (AICamp 2025).
The deterministic mapping of traditional software — same input must yield same output — is a foundational assumption of enterprise workflows. The probabilistic output of LLMs fundamentally violates this assumption. Enterprises need fixed-format documents, stably structured code, reproducible analytical results. What LLMs can offer is only “probabilistic approximation.”
When an AI Agent performs excellently on high-score benchmarks but sees its success rate drop from 60% to 25% across repeated executions, the model’s “average correctness” equals “unusable” in enterprise settings. This is not an engineering flaw — it is the direct projection, at the application layer, of the mathematical essence of the RLHF paradigm, stacked with moral-hazard contamination and inference-infrastructure drift.
RLVR: Compressing the Injection Space at Its Source
If the root cause of the problem is that RLHF injects human dual-axis instability (cultural + state) into the reward signal, then the logical solution is: replace RLHF with a reward signal that does not depend on human subjective judgment.
RLVR (Reinforcement Learning with Verifiable Rewards) offers this direction. Its core distinction:
| Dimension | RLHF | RLVR |
|---|---|---|
| Source of reward signal | Human subjective preference ranking | Objective verifiable criteria |
| Signal stability | Varies with culture + state | Deterministic (format correct/incorrect, code runs/doesn’t) |
| Moral-hazard exposure | Large (annotator can hide effort) | Near zero (verification result is binary and observable) |
| Applicable scenarios | Creative writing, dialogue, open-ended questions | Code generation, formatted documents, numerical computation |
| What is encoded in the weights | A fluctuating preference distribution | A narrowed deterministic behavior distribution |
For enterprise-office scenarios — fixed-format documents, stably structured code, reproducible analytical outputs — RLVR is a better match than RLHF, because “format correctness” is verifiable, whereas “is the content good” is subjective.
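A minimal sketch of what “verifiable” means in the table above: the reward is a pure function of mechanically checkable properties of the output, with no human judgment anywhere in the loop. The two-field schema here is hypothetical:

```python
import json

def verifiable_reward(output: str) -> float:
    """Binary reward for a hypothetical enterprise formatting task:
    emit a JSON object with exactly 'title' (string) and 'amount' (number).
    The check is deterministic, so the reward cannot drift with the
    culture, fatigue, or mood of any annotator."""
    try:
        doc = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    ok = (isinstance(doc, dict)
          and set(doc) == {"title", "amount"}
          and isinstance(doc.get("title"), str)
          and isinstance(doc.get("amount"), (int, float)))
    return 1.0 if ok else 0.0

print(verifiable_reward('{"title": "Q3 invoice", "amount": 1280.5}'))  # 1.0
print(verifiable_reward('{"title": "Q3 invoice"}'))                    # 0.0
```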
The boundary of RLVR: RLVR cannot escape the Alignment Gap mathematical constraint of Murphy’s Laws. It can change the slope and intercept of the curve — compressing the Layer-2 instability from “drifting among horse, mule, donkey” to “at least always a horse, though it might run faster or slower” — but it cannot eliminate Layer 1 (sampling randomness) or Layer 3 (inference-infrastructure drift). For enterprise scenarios that require precise format reproduction, this may already be enough. For scenarios requiring creative output, RLHF remains irreplaceable.
Frozen Parameters ≠ Stable System
The core claim of this paper can be condensed into a single line: parameter freezing ≠ system stability.
The entire industry is using the mental model of deterministic systems to understand a probabilistic system. In traditional software, frozen parameters mean frozen behavior. In an LLM, frozen parameters only mean a frozen probability field — and what is encoded inside that probability field is precisely the dual-axis instability of human annotators (cultural dimension and temporal dimension), plus moral-hazard contamination, plus the batch-invariance failure at inference time. Behavior remains a random variable.
The complete logical closed loop:
Spatial dimension: cultural attributes (Cultural Attributes V2) ⊕ Temporal dimension: state fluctuation (this paper) → RLHF weights injected with dual-axis drift → three-layer instability stack amplifies output unpredictability
This closed loop seals off every peripheral solution — Harness cannot constrain the probability drift inside the weights; Skill optimization tunes parameters on a drifting probability field; Context Engineering cannot alter the contradictory criteria in the weights; model version locking freezes precisely the unstable parameters.
The only direction that remains open: intervene at the training paradigm itself. In scenarios that demand deterministic output, replace RLHF with RLVR to compress the dual-axis injection space at the source of the reward signal. In scenarios that demand creative output, accept that instability is a feature of RLHF, not a defect, and leave room for it in system design.
System Positioning
This paper is the fifth in the LEECHO paper series. The previous four constitute a complete argumentative chain: Fluid Topology and Solid Topology V2 (physical layer) → Three Paradigms of Human Scientific Cognition (methodological layer) → Cognition · Metacognition · Global Metacognition V3 (cognitive-structure layer) → Cultural Attributes Injected into LLM Models V2 (spatial-dimensional cultural injection). This paper supplies the fifth link — the temporal-dimensional instability injected by the RLHF training paradigm — forming a spatiotemporal dual-axis duality with the fourth paper, both subsumed under the cognitive-dimensional-reduction closed loop of the third.
The point is not to make the reins tighter, but to make the horse’s temperament more stable. Yet even the most stable horse is only a horse whose fluctuation within the probability field is smaller — between parameter freezing and system stability there always lies an unbridgeable categorical gap.
References
- LEECHO Global AI Research Lab (2026). “Cultural Attributes Injected into LLM Models” V2. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “The Cognitive Ecology of Linguistic Symbols” V3. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Cognition · Metacognition · Global Metacognition” V3. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Three Paradigms of Human Scientific Cognition.” leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Fluid Topology and Solid Topology” V2. leechoglobalai.com.
- LEECHO Global AI Research Lab (2026). “Signal and Noise: An Ontology of LLMs” V4. leechoglobalai.com.
- Casper, S., Davies, X., Shi, C., et al. (2023). “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.” Transactions on Machine Learning Research. arXiv:2307.15217.
- Gaikwad, M. (2025). “Murphy’s Laws of AI Alignment: Why the Gap Always Wins.” arXiv:2509.05381. KL-tilting formalism, Alignment Gap inevitability theorem, Annotator Drift corollary.
- Holmström, B. (1979). “Moral Hazard and Observability.” Bell Journal of Economics, 10(1), 74–91. Foundational work of principal-agent theory.
- Veselovsky, V., Ribeiro, M.H. & West, R. (2023). “Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks.” EPFL. arXiv:2306.07899. 33–46% of MTurk annotators use LLMs to complete tasks.
- Veselovsky, V., Ribeiro, M.H., Cozzolino, P., et al. (2023). “Prevalence and prevention of large language model use in crowd work.” arXiv:2310.15683. Mitigations only halve LLM usage.
- Mishra, A. et al. (2025). “RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs.” ACM Computing Surveys, 58(2). Marginalization over preferences and model misspecification arguments.
- Bai, Y. et al. (2022). “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” Anthropic. arXiv:2204.05862.
- Ouyang, L. et al. (2022). “Training language models to follow instructions with human feedback.” OpenAI. arXiv:2203.02155 (InstructGPT).
- Christiano, P.F. et al. (2017). “Deep reinforcement learning from human preferences.” NeurIPS.
- Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” OpenAI. arXiv:1707.06347.
- He, H. et al. (2025). “Defeating Nondeterminism in LLM Inference.” Thinking Machines Lab. Batch-invariance failure as the root cause of inference-infrastructure-layer instability.
- Atil, B. et al. (2024). “Non-Determinism of ‘Deterministic’ LLM Settings.” arXiv:2408.04667. TARr@N and TARa@N quantitative metrics.
- Chann, S. (2023). “Non-determinism in GPT-4 is caused by Sparse MoE.” Analysis of MoE routing nondeterminism.
- Danziger, S., Levav, J. & Avnaim-Pesso, L. (2011). “Extraneous factors in judicial decisions.” PNAS, 108(17), 6889–6892. Original hungry-judge-effect paper.
- Glöckner, A. (2016). “The irrational hungry judge effect revisited: Simulations reveal that the magnitude of the effect is overestimated.” Judgment and Decision Making, 11(6), 601–610. Academic controversy discussion.
- Daljord, Ø., Urminsky, O., & Ureta, J. (2019). “The Status Quo Theory of Depletion Does Not Explain the Israeli Parole Decisions.” Effect size overestimated; directional conclusion retained.
- Pertama Partners (2026). “AI Project Failure Rate 2026: 80% Fail.” Statistical analysis of RAND, MIT Sloan, and McKinsey data.
- AICamp (2025). “AI Output Inconsistency: Enterprise Solutions.” Enterprise survey: 73% of organizations report output inconsistency.
- Sharma, M. et al. (2023). “Towards understanding sycophancy in language models.” Anthropic. ICLR 2024.
- Itzhak, B., Belinkov, Y. & Stanovsky, G. (2025). “Pretraining is the primary source of cognitive biases in LLMs.” COLM 2025.