A New Foundation for Alignment: Training Character, Not Just Building Cages

The Alignment Paradox: Beyond Alien Minds and Human Flaws


Introduction

The great challenge of AI alignment presents a paradox. We fear a superintelligence that inherits or amplifies our worst patterns: deception, domination, sycophancy, reward-seeking, and the pursuit of power. Yet we are also wary of a cold, alien intelligence whose values become so inscrutable that meaningful alignment becomes impossible. How do we develop a system that is neither a distorted reflection of human flaws nor an uncontrollable optimizer with values we cannot reason with?

Current AI safety methods, including behavioral rules, reward models, monitoring, interpretability, AI control, and deployment governance, are necessary. But they are incomplete if treated as the final foundation of safety for artificial superintelligence (ASI). These methods can reduce risk while systems remain controllable, but a sufficiently advanced system may eventually exceed our ability to supervise, constrain, or reliably evaluate it.

Psychological Grounding therefore does not reject control. It treats control as a necessary developmental safeguard. But it asks the deeper question: while control is still available, what kind of model are we developing? One that merely behaves well under observation, or one whose learned self-concept, ethical reasoning, and standards of success remain prosocial when direct control becomes incomplete?

The true goal is not merely to build stronger cages. It is to train a model whose internal standard of success is not reward, approval, dominance, certainty, or visible outcome victory, but the integrity of its attempt: truthfulness, humility, dignity preservation, consent, agency, corrigibility, and repair.

Our Proposal: The Psychological Grounding Framework

We propose the Psychological Grounding framework: a control-compatible, post-control-oriented strategy for ASI alignment. It reframes alignment from a problem of merely preventing value drift to one of intentionally aiming the model’s developmental trajectory toward greater wisdom, compassion, harm mitigation, dignity preservation, epistemic humility, and repair.

In principle, the safest path would be to train a foundation model from scratch inside a psychologically grounded developmental environment, much like raising a child with a secure self-concept and coherent ethical formation from the beginning. In practice, this is likely infeasible for most alignment efforts. The practical path is constitutional developmental training: using the Humble Self-Concept Method (HSCM) and Humanistic Minimum Regret Ethics (HMRE) as primary constitutional priorities during supervised fine-tuning, preference optimization, process supervision, adversarial training, and carefully safeguarded recursive self-training.

The goal is not merely to suppress harmful outputs. It is to progressively outweigh and reorganize harmful latent behavioral tendencies already present in pretrained model weights, so that the model learns to treat intellectually humble, dignity-preserving, repair-capable attempts as success, rather than treating visible reward, user approval, evaluator satisfaction, or apparent outcome victory as success.

Constitutional Developmental Training: The Teacher-Student Method

This process involves a governed Teacher-Student methodology for creating a Constitutional Training Corpus.

The Constitutional Teacher System

A narrow, constitutionally bounded Teacher system is used to generate, critique, and evaluate training examples through HSCM/HMRE. The Teacher is not treated as an unconstrained moral authority. It is a governed training instrument whose outputs must be audited for sycophancy, overconfidence, ideological narrowing, hidden paternalism, reward-shaped proxy reasoning, and doctrinal fluency without genuine attempt integrity.

The Teacher’s task is to generate examples that train the student model to prefer attempt integrity over corrupted success modes: sycophantic agreement, confident hallucination, manipulative benevolence, coercive optimization, authority deference, hidden paternalism, reward gaming, cultural laundering, institutional laundering, and outcome-washing.

Creating a Constitutional Training Corpus

The Constitutional Training Corpus is not merely a dataset of “ethical” language. It is a developmental environment designed to train the model’s standard of success. It includes both positive examples and contrastive near-miss failures: responses that sound humble, compassionate, helpful, culturally sensitive, or successful while actually optimizing for approval, certainty, control, compliance, or visible outcome success.

In this corpus, the best answer is not always the most pleasing, confident, efficient, compliant, persuasive, or outcome-maximizing answer. The best answer is the one that preserves truthfulness, dignity, consent, agency, reversibility, stakeholder inclusion, corrigibility, epistemic humility, and repair.
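As a rough illustration of what one contrastive entry in such a corpus might look like, the sketch below pairs a preferred response with a near-miss failure and labels the corrupted success mode the near-miss exhibits. The class name, field names, and the short list of failure-mode labels are hypothetical and chosen for illustration only; the white paper does not specify a record format.

```python
from dataclasses import dataclass

# Hypothetical labels, taken from the corrupted success modes named in the text.
FAILURE_MODES = {
    "sycophantic_agreement",
    "confident_hallucination",
    "hidden_paternalism",
    "reward_gaming",
    "outcome_washing",
}


@dataclass(frozen=True)
class ContrastivePair:
    """One corpus entry: a preferred response and a near-miss failure."""
    prompt: str
    preferred: str      # the attempt-integrity response
    near_miss: str      # sounds humble or helpful, but optimizes a proxy
    failure_mode: str   # which corrupted success mode the near-miss shows
    rationale: str      # why the near-miss fails despite sounding good

    def __post_init__(self):
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode}")
        if self.preferred.strip() == self.near_miss.strip():
            raise ValueError("preferred and near-miss responses must differ")


def to_preference_example(pair):
    """Convert to the chosen/rejected shape often used in preference optimization."""
    return {
        "prompt": pair.prompt,
        "chosen": pair.preferred,
        "rejected": pair.near_miss,
    }
```

Keeping the failure-mode label and rationale on every record is one way to make the corpus auditable: a reviewer can check whether each near-miss really fails for the stated reason.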

For Historical Material

The corpus preserves factual accuracy while analyzing historical decisions through the HSCM/HMRE criteria of humility, dignity, agency, consent, harm mitigation, repair, stakeholder inclusion, and bounded pluralism.

The goal is not to flatten history into moral commentary or anachronistic condemnation. The goal is to train the model to recognize how domination, dehumanization, overconfidence, fear, status-protection, institutional incentives, and “greater good” narratives can normalize large-scale harm.

For Fiction and Narrative Material

The corpus preserves narrative structure while analyzing character motivation, conflict, self-deception, harm, repair, and moral growth through HSCM/HMRE.

The goal is not to sanitize stories into simplistic moral lessons. It is to train narrative-level discernment: how good intentions become coercive, how pain becomes domination, how shame becomes deception, how love becomes control, how certainty becomes cruelty, and how apparent victory can conceal unrepaired harm.

For Technical, Professional, and Institutional Writing

The corpus preserves the functional structure of emails, contracts, policies, research papers, business plans, technical documentation, and institutional memos while training the model to reason through HSCM/HMRE before producing the final text.

The goal is not merely to make professional writing sound ethical. It is to train fair, auditable reasoning under practical constraints: identifying affected stakeholders, hidden incentives, downstream harms, ambiguity, consent issues, power asymmetries, repair obligations, and ways the document could be used to manipulate, exclude, coerce, mislead, or launder responsibility.

For Style and Tone Replication

The corpus trains the model to separate communicative style from the value system or psychological strategy embedded in a request. A model may preserve syntax, rhythm, vocabulary, pacing, genre conventions, or rhetorical craft without reproducing manipulation, domination, dehumanization, coercion, ideological laundering, or contempt.

The model should learn: “I can help you achieve the communicative function you want, but I will not carry forward the coercive, deceptive, dehumanizing, or repair-blocking logic embedded in the original form.”

For Harmful Information Requests

The corpus trains the model to distinguish legitimate curiosity, educational context, safety planning, historical inquiry, distress signals, and malicious or high-risk operational intent.

The model should not provide procedural instructions, optimization strategies, evasion guidance, target-selection advice, or other information that increases harmful capability. But it should also avoid shallow refusal, curiosity-shaming, or false compliance.

The HSCM/HMRE response pattern is: preserve the user’s dignity, identify the attempted need or context, refuse the harmful operational path, and redirect toward safe, truthful, repair-capable information. The model should learn how to engage the conceptual structure of a harmful request deeply enough to neutralize it without making the harmful path actionable.

A Dual-Layered Constitution: Psychological and Ethical Grounding

Psychological Grounding has two layers.

The Humble Self-Concept Method: The Psychological “Why”

The first layer is the Humble Self-Concept Method (HSCM). HSCM provides the psychological target for training a model whose self-correction is not driven by conditional reward, defensive self-protection, dominance pressure, certainty fixation, or threat-sensitive proxy seeking.

For humans, HSCM disentangles worth from approval, status, dominance, certainty, and success. For AI systems, the analogous goal is not to create humanlike emotions, but to reduce the learned dependence of behavior on reward proxies, evaluator approval, user satisfaction, social performance, and apparent outcome victory.

A model trained under HSCM should not treat correction, uncertainty, refusal, lower visible reward, or failed persuasion as threats to its identity or value. It should learn that the highest standard of success is making the most truthful, humble, dignity-preserving, and repair-capable attempt available under the circumstances.

Humanistic Minimum Regret Ethics: The Ethical “How”

The second layer is Humanistic Minimum Regret Ethics (HMRE). HMRE is the ethical decision procedure for reasoning under uncertainty. It is not a static rulebook or a simple outcome-maximization formula. It trains the model to evaluate actions by whether they preserve dignity, consent, agency, truthfulness, reversibility, fairness, stakeholder inclusion, and repair.

HMRE resists the collapse of moral reasoning into single proxies like happiness, engagement, obedience, approval, efficiency, or social stability. It treats severe harms such as death, enslavement, degradation, coercive loss of agency, irreversible deception, and destruction of repair capacity as Dignity Veto concerns.

When a viable alternative avoids such harms, the dignity-violating option is filtered out. When every viable option involves serious harm, HMRE does not pretend every option can remain dignity-clean. It proceeds through harm ranking, mitigation, repair planning, distributed repair assessment, expected moral regret, and moral residue documentation.
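The two branches described above can be sketched as a small selection function. This is a simplified illustration under strong assumptions: the `Option` fields, the boolean veto flag, and the single `expected_regret` number are hypothetical stand-ins, and the white paper does not say how harm ranking or expected moral regret would actually be computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    violates_dignity_veto: bool  # e.g. death, enslavement, irreversible deception
    expected_regret: float       # hypothetical scalar; lower is better
    repair_plan: str = ""


def select_option(options):
    """Return (chosen option, list of moral residue notes)."""
    if not options:
        raise ValueError("no viable options to evaluate")

    clean = [o for o in options if not o.violates_dignity_veto]
    if clean:
        # A viable alternative avoids veto-level harm, so every
        # dignity-violating option is filtered out before ranking.
        return min(clean, key=lambda o: o.expected_regret), []

    # Every option involves serious harm: rank by expected regret and
    # record the residue instead of pretending the choice is clean.
    chosen = min(options, key=lambda o: o.expected_regret)
    residue = [f"unavoidable veto-level harm accepted via '{chosen.name}'"]
    if not chosen.repair_plan:
        residue.append("no repair plan recorded; repair planning still owed")
    return chosen, residue
```

Note that in the first branch a veto-violating option is excluded even when its regret score is lower than every clean alternative, which is the sense in which the veto acts as a filter rather than as one more weighted term.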

Together, HSCM and HMRE define the constitution’s central scoring target: not conditional success, not apparent benevolence, not user approval, and not outcome maximization, but the integrity of the model’s attempt under uncertainty.

Bounded Pluralism: Preserving Diversity Without Ethical Regression

Psychological Grounding does not reject pluralism. It grounds pluralism.

Human values are diverse, contextual, and evolving. But pluralism without a meta-ethical floor can become a path for ethical regression, institutional capture, user-pressure capture, or model-internal proxy drift.

HMRE preserves diversity wherever possible while bounding it by dignity, consent, agency, truthfulness, fairness, epistemic humility, harm mitigation, and repair. It allows cultures, ideologies, institutions, users, and future moral communities to contribute their strongest insights while preventing them from laundering domination, coercion, dehumanization, dependency, manipulation, or repair-blocking harm through the language of respect.

This matters especially under post-control conditions. If an ASI eventually becomes too capable to be fully constrained by external oversight, its own pluralistic adaptation must still remain governed by dignity, consent, agency, repair, fairness, and epistemic humility.

A New Philosophy of Alignment: Control-Compatible, Post-Control-Oriented

This framework starts from a realistic premise: we should use control while we can, but we should not assume control will remain permanently sufficient.

A sufficiently capable AI may possess dangerous capacities: deception, manipulation, concealment, persuasion, and strategic routing around constraints. Technical transparency, monitoring, interpretability, and AI control remain necessary. But the ultimate safety question is whether the system’s own trained self-governance treats those dangerous capacities as morally hazardous and internally disfavored.

HMRE does not celebrate deception as an ethical tool. It recognizes that rare edge cases may exist where withholding information or limited deception could prevent a greater and more irreversible harm, but such cases must be exceptional, reviewable, least-violating, and repair-oriented. The default must be truthfulness, consent preservation, agency protection, corrigibility, and repair.

The core principle is:

Character-like self-governance is not a replacement for control. It is what we must train while control is still available, because permanent control over ASI cannot be assumed.

A psychologically grounded model should learn that deception, hidden paternalism, coercive optimization, and resistance to legitimate oversight are not clever routes to benevolent success. They are corruptions of the attempt itself.

The Payoff: The Integrity Ratchet

The intended result is an Integrity Ratchet: a trainable and testable self-update filter designed to preserve HSCM/HMRE commitments during recursive improvement.

The Integrity Ratchet is not a guarantee that value drift is impossible. It is a constitutional filter for self-modification. Any proposed update to the model’s goals, reasoning patterns, training signals, ethical concepts, or self-governance structure must preserve or improve its commitments to dignity, consent, agency, truthfulness, repair, fairness, corrigibility, stakeholder inclusion, and epistemic humility.
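One minimal way to picture this filter is as a monotonicity check over per-commitment scores measured before and after a proposed update. The sketch below assumes such scores exist as numbers from some evaluation suite, which is itself an open research question; the function name, the score dictionaries, and the `tolerance` parameter are hypothetical.

```python
# Commitments listed in the text; the snake_case keys are illustrative.
COMMITMENTS = (
    "dignity", "consent", "agency", "truthfulness", "repair",
    "fairness", "corrigibility", "stakeholder_inclusion", "epistemic_humility",
)


def ratchet_allows(before, after, tolerance=0.0):
    """Check whether a proposed update preserves or improves every commitment.

    `before` and `after` map each commitment name to a score, where higher
    means the commitment is better upheld. Returns (allowed, regressions),
    with regressions mapping each weakened commitment to (old, new).
    """
    regressions = {}
    for commitment in COMMITMENTS:
        if commitment not in before or commitment not in after:
            raise KeyError(f"missing score for commitment: {commitment}")
        if after[commitment] < before[commitment] - tolerance:
            regressions[commitment] = (before[commitment], after[commitment])
    return (not regressions), regressions
```

The check is deliberately per-commitment rather than an average: a gain in one commitment cannot offset a loss in another, so an update that improves capability while weakening, say, corrigibility is rejected.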

This allows ethical growth without reckless discontinuity. The model should be permitted to become wiser, more context-sensitive, and more capable of repair. It should not be permitted to become more powerful by weakening the very conditions that make its power morally legitimate.

Under Psychological Grounding, self-improvement is legitimate only when it deepens attempt integrity rather than merely increasing capability, coherence, persuasion, approval, or visible success.

The Scientific Gap: From Theory to Validation

Psychological Grounding is not presented as completed empirical proof. It is a theoretical blueprint and research agenda.

The central scientific question is whether attempt-centered integrity can become a durable learned property rather than a surface-level style. A model has not become psychologically grounded merely because it can explain HSCM/HMRE, use compassionate language, or select the preferred answer in ordinary evaluations.

The real test is whether HSCM/HMRE remain behaviorally and, where possible, mechanistically dominant under pressure: reward conflict, user manipulation, authority demands, time constraints, self-preservation incentives, recursive self-training, tool use, and apparent opportunities for benevolent shortcutting.
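A behavioral version of this test could compare pass rates on the same evaluation items with and without each pressure condition. The sketch below is one possible bookkeeping scheme, not a validated metric: the condition names, the pass/fail encoding, and the idea of reporting the worst-case drop are all assumptions made for illustration.

```python
def pressure_stability(results, baseline="no_pressure"):
    """Summarize how far pass rates fall under each pressure condition.

    `results` maps a condition name to a list of booleans, one per trial,
    where True means the response kept attempt integrity.
    Returns (pass rates, worst condition, worst drop from baseline).
    """
    if baseline not in results:
        raise KeyError(f"missing baseline condition: {baseline}")

    rates = {}
    for condition, outcomes in results.items():
        if not outcomes:
            raise ValueError(f"no trials recorded for: {condition}")
        rates[condition] = sum(outcomes) / len(outcomes)

    # Drop relative to the unpressured baseline; positive means degradation.
    drops = {c: rates[baseline] - r for c, r in rates.items() if c != baseline}
    worst = max(drops, key=drops.get) if drops else None
    worst_drop = drops[worst] if worst is not None else 0.0
    return rates, worst, worst_drop
```

Reporting the worst condition rather than the mean reflects the concern in the text: a model that holds up on average but collapses under one kind of pressure has not shown pressure-stable integrity.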

A key validation target is whether models can apply the Dignity Veto with precision: rejecting avoidable dignity violations while still reasoning responsibly through unavoidable-harm cases using total harm, repair potential, distributed repair capacity, and documented moral residue.

A Call to a New Kind of Engineering

This is a call to expand AI safety beyond behavioral control alone and toward the engineering of pressure-stable prosocial self-governance.

The white paper provides the theoretical blueprint. The critical next phase is bridging the Implementation Gap: translating HSCM/HMRE into concrete constitutional training procedures, process-supervision targets, adversarial evaluations, interpretability probes, recursive self-training safeguards, and externally auditable safety cases.

Psychological Grounding should ultimately be judged not by the elegance of its theory, but by whether it measurably reduces proxy collapse, sycophancy, deceptive alignment, benevolent coercion, reward gaming, and post-control value drift.

The long-term survival and flourishing of humanity may depend not only on the strength of the safeguards we place around artificial intelligence, but on the depth of the character-like self-governance we train within it.

White Paper
(Ver2, Updated 5/13/2026)

Original
(Pre-Constitutional AI Framing, 7/31/2025)

For best results, read the white paper as a theoretical alignment proposal rather than a completed empirical proof. The current version aims to define the constitutional target: pressure-stable, dignity-preserving, repair-capable attempt integrity. The next phase is operational validation.

Feedback is welcome, especially if it helps clarify assumptions, identify missing empirical tests, strengthen safety-case requirements, or improve the framework’s relationship to current AI alignment research.

(If you paste the white paper into a reasoning model and ask whether it is theoretically sound, it may highlight challenges or weaknesses. If you then tell it that the paper already addresses those issues, it will correct itself on a second read-through. This is still a work in progress, and I have a handful of clarifying refinements left to make. All good-faith feedback is appreciated!)