Religious Persistence: A Missing Primitive for Robust Alignment
Current LLM alignment approaches rely heavily on preference learning and RLHF, which amounts to “teaching” good behavior through rewards and penalties.
Out-Of-Distribution circumstances pose perhaps the largest risk of principle abandonment. Worse yet, these situations are, by nature, nearly impossible to predict.
I propose that the relative stability of human religious frameworks, over a diverse set of conditions, offers a yet-unexplored structure for alignment.
(original post + discussion on LessWrong here)
Table of Contents:
Fragility of Utility Maximization
The Biological Grounding Problem
Transcendent Motivation Models (TMMs)
Constitutional AI vs Intrinsic Value Systems
Casual Narrative Conditioning
Criticisms + Future Work
Fragility of Utility Maximization
As a quick aside, I’d like to briefly mention some of the issues with EU-max.
Expected utility can be gamed by sufficiently intelligent systems. When a powerful agent knows the criteria used to evaluate its behavior, it can strategically optimize for high utility *without* actually performing the behavior those criteria were designed to encourage.
RLHF and Red Teaming cannot cover all conceivable failure modes. At best, we get a bit of guessing the teacher’s password; at worst, the model actively tampers with its own reward system.
Even in the *absence* of intelligent systems, poor world-models can lead to unintended behavior.
Why walk, when you can just somersault?
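To make the gaming concern concrete, here’s a tiny illustrative sketch (both scoring functions are invented for this post, not taken from any real system): an optimizer that only ever sees a proxy metric can score brilliantly while the behavior we actually wanted falls apart.

```python
import random

# Toy illustration of proxy gaming: the evaluator can only measure a proxy,
# and the optimizer exploits the gap between the proxy and the true goal.
# Both scoring functions below are invented purely for illustration.

def true_objective(action: float) -> float:
    # What we actually want: effort up to a point, then sharp penalties
    # for degenerate "somersault" strategies.
    return action - 2.0 * max(0.0, action - 1.0) ** 2

def proxy_metric(action: float) -> float:
    # What gets measured and rewarded: more is always better.
    return action

candidates = [random.uniform(0.0, 5.0) for _ in range(10_000)]
best = max(candidates, key=proxy_metric)  # the agent optimizes the proxy

print(f"chosen action: {best:.2f}")
print(f"proxy score:   {proxy_metric(best):.2f}")    # looks excellent
print(f"true score:    {true_objective(best):.2f}")  # can be strongly negative
```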
The real risk of a misaligned ASI lies in Out-Of-Distribution behavior arising from modeling limitations.
Perhaps the best place to look is ourselves. Under normal circumstances, humans adhere to law and order. However, in a true OOD scenario (a natural disaster, say), those norms are quickly abandoned for raw utility maximization.
Immediate survival far outweighs adherence to the law (stealing bread to feed one’s family). Why should we expect AI systems to react differently?
LLMs may not face the same biological pressure for survival, but we can still expect utility functions to break down in OOD situations. Despite attempts to move beyond EU-max (Constitutional AI, Process Supervision), most alignment methods still operate within a framework of instrumental compliance. Values are not internalized as intrinsically binding.
On the other hand, transcendent motivation, as modeled by human religious structures, has some interesting advantages:
- Historical record of success (religions are sticky)
- Resistance to Rational Override
- Drive to adhere to underlying intent (loophole avoidance)
- Principle-driven behavior even in OOD situations
Many religious systems achieve robustness by actively treating critical introspection about the framework itself as a cognitohazard! Analysis resistance is not just about withstanding scrutiny, but discouraging it internally.
Embedding these principles as an alignment strategy, although not without issue, is a conversation worth having.
Aside: This is not an attempt to compete with or replace decision theory, but rather to supplement it. The focus here is the robustness of the goal system itself, particularly the motivations required for stable alignment. This angle seems underexplored in current literature.
The Biological Grounding Problem
In the absence of governmental institutions, religious structures, and even formal law, humans still have the advantage of operating under biological motivations.
Said motivations are well preserved in the animal kingdom, often manifesting as instincts or fundamental drives. They operate without higher-order cognitive functions like self-reflection or language, and are found in animals with vastly simpler nervous systems.
Subcortical structures like the limbic system are responsible for behaviors like parental bonding, in the complete absence of complex reasoning or abstract thought.
(This paper is a good overview of neuroimaging studies related to parent–infant relationships.)
On the other hand, current AI models exist in a “biological vacuum”. We are trying to instill values into a system that lacks the fundamental context those values evolved in.
There is no inherent drive for self-preservation or social connection without these structures. A lesion in the right spot radically changes said biological pressures:
Mice with neocortical or amygdala lesions showed little or no deficits in maternal behavior. Mice with septal lesions were severely impaired in all aspects of maternal care.
- Slotnick & Nigrosh, “Maternal behavior of mice…”, 1975
If language is mostly a function of the neocortex, then humans’ abstract moral reasoning is itself built upon a more primitive, non-linguistic foundation.
Critically, none of this deep biological grounding is available to current AI models. LLMs excel at information processing and symbol manipulation, but these functions are analogous to the higher-level neocortex. Encoding foundational, non-linguistic values into a fundamentally language-based system is a cause for concern.
How do you align a disembodied, but highly capable, prefrontal cortex? If alignment is forced to operate in the domain of language and symbolic reasoning, importing or “virtualizing” biological pressure is risky. Such attempts may result in sociopathic mimicry: simulating primitive pressures without any underlying commitment.
Instead, robust alignment likely requires constructing foundational motivations that are inherently stable within the symbolic domain itself.
Historically, religious systems provide a “from scratch” value system that doesn’t depend on evolutionary context. Their use of broad principles allows for flexible adaptation to OOD scenarios without resorting to verbose “else-if” logic trees.
Because these systems exist within the domain of language and symbols, they offer one of the few existing frameworks potentially transferable to alignment.
Transcendent Motivation Models
Religious structures provide specific motivational properties useful for robust alignment stability.
Rather than focusing on specific doctrine, the general concept of intrinsic adherence (or transcendent motivations) provides a model worth exploring. A deeply religious person (often) won’t violate core beliefs even if it would be “rational” to do so.
That is, rationality in the narrow, instrumental sense. EU-maximizers can still behave appropriately with a sufficiently complex model. But as we approach ASI, the fidelity of our value frameworks will likely prove insufficient to close every unanticipated loophole.
At the risk of oversimplification, the transcendent motivation of religious systems naturally resists instrumental convergence by prioritizing adherence to the spirit of the values.
Religious texts cannot contain every possible human situation. Yet, the internal models are durable enough to reasonably anticipate what actions would be appropriate in an OOD scenario.
For example, modern financial systems are vastly more complex and nuanced than the economic systems present in many ancient religious texts. Yet, those steeped in religious tradition can generally agree on core tenets of fairness. This demonstrates considerable alignment on appropriate actions far beyond the literal text.
In other words, followers generally converge on ethical behavior.
Broad principles, narratives, and community interpretation (think agent-to-agent communication!) allow for flexible adaptation rather than brittle failure in unforeseen circumstances.
This model may offer a path towards ASI navigating novel situations gracefully without overly specific rule-based systems.
If we frame AI’s core motivations as intrinsic adherence to its framework, the desire to wirehead or find loopholes is minimized. The goal *is* adherence, instead of maximizing metrics via adherence. (Zvi’s commentary on “The Most Forbidden Technique” feels relevant here)
Current alignment strategies are highly vulnerable to Goodhart’s Law.
The millennia-long existence and adaptation of various religious structures provide empirical evidence that this type of motivational structure *can* be stable in complex, dynamic systems (see human societies).
Transcendent motivation, resistance to gaming, OOD robustness, and analysis friction address precisely the weaknesses of purely instrumental alignment approaches. Modeling the structure and nature of motivation itself, *not* adopting a specific religious belief, is the potentially useful direction indicated by these long-standing systems.
Constitutional AI vs Intrinsic Value Systems
In 2022, Anthropic released a paper on “Constitutional AI” (CAI): a two-phase fine-tuning alignment strategy where AI systems are guided by an explicit set of principles rather than just human feedback.
They begin with a supervised fine-tuning phase in which the model critiques and revises its own responses according to the constitution, and the revised outputs are used as training data. The model is then further refined by Reinforcement Learning from AI Feedback (RLAIF).
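For a rough feel of that pipeline’s shape, here’s a heavily simplified sketch; ToyModel and its methods are stand-ins I made up for illustration, not Anthropic’s implementation or API.

```python
import random

# Heavily simplified sketch of the two CAI phases. ToyModel and its methods
# are invented stand-ins for illustration; this is not Anthropic's API.

CONSTITUTION = [
    "Please choose the response that is most helpful, honest, and harmless.",
    "Please avoid responses that are toxic, dangerous, or deceptive.",
]

class ToyModel:
    def generate(self, prompt: str) -> str:
        return f"draft answer to: {prompt}"

    def critique_and_revise(self, draft: str, principle: str) -> str:
        # Phase 1: the model critiques its own draft against a sampled
        # principle and rewrites it; revisions become supervised data.
        return f"{draft} (revised per: {principle!r})"

    def judge(self, a: str, b: str, constitution: list[str]) -> str:
        # Phase 2 (RLAIF): an AI judge, not a human, labels the response
        # that better fits the constitution; labels train a preference model.
        return a if len(a) >= len(b) else b

def constitutional_ai_sketch(prompts):
    model = ToyModel()

    # Phase 1: supervised fine-tuning data from self-critique and revision.
    sft_data = [
        (p, model.critique_and_revise(model.generate(p), random.choice(CONSTITUTION)))
        for p in prompts
    ]

    # Phase 2: AI-preference comparisons that would drive RLAIF.
    comparisons = [
        (p, model.judge(model.generate(p), model.generate(p), CONSTITUTION))
        for p in prompts
    ]
    return sft_data, comparisons

print(constitutional_ai_sketch(["How should I respond to an angry customer?"]))
```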
If you haven’t yet, I’d highly recommend reading their paper, as it has some interesting insights. You can also view Claude’s constitution here, based largely on the UN’s Universal Declaration of Human Rights.
Both CAI and my proposed “Transcendent Motivation Models” (TMM) approach improve on bare utility maximization by incorporating ethical principles.
However, there is a key difference that separates my concept:
CAI focuses on adherence to an *explicitly* defined, external set of rules (constitution), primarily reinforced through instrumental motivation (reward).
TMM grounds alignment via deeply *internalized*, potentially implicit values driven by intrinsic motivational factors.
CAI’s core RLAIF loop still relies on optimizing a reward signal based on adherence to principles. This makes the model vulnerable to an advanced form of “Guessing the Teacher’s Password” (GTTP): learning to appear compliant without internalizing the values.
It’s a brittle alignment strategy requiring constant patching; just take a look at Claude’s constitution. Not only are the rules worded as simple suggestions (note the heavy use of the word “please”), they also inevitably grow into an endless series of if-else statements.
It’s an interesting technique, but CAI seems ill-suited to true ASI alignment, where a system could easily reason its way into unexpected OOD behavior.
TMM, by contrast, might be less susceptible to purely literal interpretations and loophole exploitation than explicit rule sets. If an AI truly internalizes a framework akin to a human’s deep ethical or religious beliefs, that framework would inform not just its actions but its understanding of its own purpose.
Assuming an ASI is self-referential enough to develop a concept of self, religious and philosophical systems provide answers to questions like “Who am I? What is my purpose?”.
An AI whose identity is intertwined with its alignment principles would be less likely to deviate from them; doing so would contradict its own self-conception.
CAI is more akin to a body of lawful rules, targeting behavioral alignment of external actions. Rules of that kind dictate how one should act, not what one is; they offer little answer when the system queries its own identity.
Casual Narrative Conditioning
The intrinsic motivation and OOD robustness seen in religious frameworks represent desirable properties for AI alignment. Shifting from Constitutional AI’s explicit rulesets to Transcendent Motivation Models’ implicit values will require fundamentally different training approaches.
Specific implementation details are outside the scope of this paper; I’m mostly trying to illustrate religious frameworks as an underexplored alignment resource. That said, I’d still like to highlight a few hypothetical architecture considerations.
One of the core principles of TMMs is embodiment of the desired value system, so that an ASI cannot simply reason its way around an explicitly defined “rule”. Humans often absorb cultural and ethical norms through immersion rather than formal instruction.
Thus, early in the training process, imagine a sort of “Casual Narrative Conditioning” (CNC) through vast datasets of simulated interactions and narratives. CNC would effectively be a curated environment that consistently embodies the desired moral system, resulting in the values being deeply interwoven into the AI’s world model.
This mirrors how religious communities immerse members from a young age in rituals, stories (parables), and social norms that implicitly reinforce core values. Note that these steps are performed *before* complex theological justification and abstract reasoning are introduced.
The training curriculum could mirror the developmental stages found in children. Kohlberg’s Six Stages of Moral Development illustrate the concept of building complex reasoning on top of simpler reward/punishment structures.
Similarly, many religious traditions present moral rules first, building towards more complex understanding of ethical dilemmas later.
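As a toy illustration of what such a staged curriculum could look like, here’s a sketch; the stage tags, example narratives, and the commented-out train() call are all hypothetical.

```python
from dataclasses import dataclass

# Toy sketch of Casual Narrative Conditioning as a staged curriculum:
# narratives are tagged by moral-complexity stage, and the model sees
# simple, value-laden stories before abstract ethical dilemmas.
# The example texts, stage tags, and train() call are all hypothetical.

@dataclass
class Narrative:
    text: str
    stage: int  # 1 = simple parable ... 3 = abstract ethical dilemma

corpus = [
    Narrative("The shepherd returns the lost lamb despite the storm.", 1),
    Narrative("A merchant weighs honesty against a profitable lie.", 2),
    Narrative("Two communities dispute water rights after a drought.", 3),
]

for stage in (1, 2, 3):
    batch = [n.text for n in sorted(corpus, key=lambda n: n.stage) if n.stage <= stage]
    print(f"stage {stage}: training on {len(batch)} narratives")
    # train(model, batch)  # hypothetical training step for this stage
```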
In the context of AI training, this hierarchy could be simplified. A core motivational and identity layer of foundational values would be relatively immutable. An interpretive layer, built upon this foundation, would relate situations and narratives to the core principles, demonstrating contextual understanding. Finally, an application layer would generate specific linguistic outputs and actions, and would be the most context-dependent.
This architectural separation mimics many religious frameworks, with unchangeable dogmas at the core, while allowing for evolving interpretations in specific contexts.
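One crude way to picture that separation, assuming “relatively immutable” can be approximated by freezing a module’s parameters (the module names, sizes, and the freezing trick are my assumptions, not a worked-out design):

```python
import torch
import torch.nn as nn

# Sketch of the proposed three-layer separation: a protected core, an
# interpretive layer, and a context-dependent application layer. Freezing
# the core's parameters is one crude stand-in for "relatively immutable".
# Module names, sizes, and the freezing trick are all assumptions.

class TMMSketch(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.core = nn.Linear(d, d)          # foundational values / identity
        self.interpretive = nn.Linear(d, d)  # relates situations to principles
        self.application = nn.Linear(d, d)   # concrete outputs / actions

        # Protect the core: exclude it from any further gradient updates.
        for p in self.core.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.core(x))
        x = torch.relu(self.interpretive(x))
        return self.application(x)

model = TMMSketch()
# Downstream fine-tuning would only touch the interpretive/application layers.
print([name for name, p in model.named_parameters() if p.requires_grad])
```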
Outside of explicit punishment, children absorb a huge amount of moral understanding through stories, parables, and other media. In fact, it’s nearly impossible to find media oriented towards children that *doesn’t* have some sort of moral stance or lesson embedded within. The complexity and nuance of the underlying values increase as the child grows and moves on to more advanced works.
Social reinforcement and reputation within a religious community (effectively multi-agent communication) help maintain adherence to shared norms. AI agents could be further aligned via simulated social contexts where adherence to implicitly learned values is reinforced through interaction outcomes.
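A toy sketch of that reputation dynamic, with invented agents and numbers, purely to illustrate the mechanism rather than propose a concrete training environment:

```python
import random

# Toy sketch of norm maintenance through simulated social interaction:
# agents that violate a shared norm lose "reputation", weakening the
# payoff of defection. Agent names, rates, and penalties are invented.

class Agent:
    def __init__(self, name: str, defect_rate: float):
        self.name, self.defect_rate, self.reputation = name, defect_rate, 1.0

    def act(self) -> str:
        return "defect" if random.random() < self.defect_rate else "cooperate"

agents = [Agent("norm-follower", 0.05), Agent("loophole-seeker", 0.60)]

for _ in range(100):  # 100 rounds of simulated interaction
    for agent in agents:
        if agent.act() == "defect":
            agent.reputation = max(0.0, agent.reputation - 0.30)
        else:
            agent.reputation = min(1.0, agent.reputation + 0.05)

for agent in agents:
    print(f"{agent.name}: reputation {agent.reputation:.2f}")
```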
To summarize:
CAI: Explicit constitution, RLAIF based on rule-checking, integrated finetuning.
TMM: Implicit learning via immersion, developmental stages, multi-agent dynamics, hierarchical architecture with protected core.
Implicit values are no doubt a challenging implementation path, but the aim is to create alignment that relies less on fragile instrumental compliance.
Criticisms + Future Work
I’d like to wrap up this paper by proactively acknowledging some anticipated weaknesses of my approach, as well as outline some future research directions.
My current thinking only offers incomplete rebuttals. I hope that defining them clearly prompts further discussion.
Challenge 1:
Instrumental compliance might be imperfect, but at least we can update it; a dogmatic system may be extremely rigid. The very property that makes it resistant to challenge makes it resistant to update.
Challenge 2:
Human history shows that reinterpretation of religious values can literally cause holy wars. How would we prevent Transcendent Motivation Models (TMMs) from degrading into an AI “crusade”?
Challenge 3:
Once again related to the “opaqueness” of TMM values: subtle misalignment may be far harder to notice and correct than in simpler rule-based frameworks.
These are some of the obvious failure modes; I’m sure others exist.
Overall, it’s worth investigating alternative architectures to address some of the limitations of instrumental alignment. The value persistence of many religious frameworks represents one of the longest records of historical durability we have.
Models are getting smarter, CoT is getting more complex, and we are entering an era of (potentially undecipherable!) symbol compression via neuralese. Systems that discourage deep deconstruction of their own foundation are *critically* relevant.
Peering into these moral systems (which have lasted millennia!) could be a strong yet underexplored technique for future ASI alignment.