Sam Marks

Consciousness Cluster: Preferences of Models that Claim they are Conscious

TL;DR: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: it acquires new preferences that weren't in training, and these have implications for AI safety. We think the question of what conscious-claiming models prefer is already practical. Unlike GPT-4.1, Claude says...

Mar 1865

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural objects of study for secret elicitation techniques. We then study the efficacy of honesty elicitation and lie detection techniques for detecting and removing generated falsehoods. This post presents a summary of the paper, including examples...

Mar 9 · 30

The persona selection model

TL;DR: We describe the persona selection model (PSM): the idea that LLMs learn to simulate diverse characters during pre-training, and post-training elicits and refines a particular such Assistant persona. Interactions with an AI assistant are then well understood as interactions with the Assistant: something roughly like a character in an LLM-generated...

Feb 23 · 168

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...

Dec 18, 2025 · 153

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

TL;DR: We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detection techniques. The best techniques we studied involved fine-tuning on generic anti-deception data and using prompts that encourage honesty. Read the full Anthropic Alignment Science blog post and...

Nov 25, 2025 · 40

Steering Evaluation-Aware Models to Act Like They Are Deployed

🐦Tweet thread, 📄arXiv Paper, 🖥️Code, 🤖Evaluation Aware Model Organism TL;DR: * We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested. *...

Oct 30, 2025 · 61

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

This is a link post for two papers that came out today: * Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.) * Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.) These papers both study the following...

Oct 8, 2025 · 175


Alignment Faking in Large Language Models

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

The persona selection model

