Regain Epistemology

The First Health AI Built on Good Explanations

The quest for good explanations is, I believe, the basic regulating principle not only of science, but of the Enlightenment generally.

David Deutsch, The Beginning of Infinity

Most healthcare AI systems optimize for prediction accuracy. We optimize for explanation quality. The difference: predictions can be right for wrong reasons. Explanations that are hard to vary remain correctable and auditable even when wrong.

Version 2.0.0
Last updated: 2026-01-25

TL;DR

Most healthcare AI systems optimize for prediction accuracy. We optimize for explanation quality.

The difference: predictions can be right for wrong reasons (and silently fail when conditions change). Explanations that are hard to vary, where every component is load-bearing, remain correctable and auditable even when wrong.

01
ArgMed Debate

Generate multiple hypotheses, attack each adversarially, keep only survivors

02
HTV Scoring

Quantify how "hard to vary" each explanation is (0.0-1.0)

03
IDK Protocol

12 specific uncertainty triggers with structured responses

04
Falsification Criteria

Every claim specifies what would prove it wrong

05
Safety Routing

High-risk decisions require human clinician approval

06
Clinician Feedback Loop

Overrides actively change future reasoning for that patient

07
Composable Domains

Medication, nutrition, exercise, sleep, mental health: same principles

08
Rules as Data

Interaction rules are explicit, versioned, auditable data

09
Accuracy Ascertainment

We measure our own predictions against outcomes

The result: an AI that tells you why, admits when it doesn't know, specifies how to prove it wrong, measures whether it was right, and learns from clinician corrections.

Philosophy Foundations

The fundamental failure of current health AI is epistemological, not computational. Most health AI makes predictions based on patterns. But pattern-matching is not understanding.

The Inductive Fallacy
Exhibit A
Correlation does not equal Causation

Pattern matching finds correlations, not causes

Interventions based on spurious correlations fail or cause harm

Exhibit B
The Black Swan Problem

Rare cases don't match common patterns. Medicine is full of rare cases

Induction fails precisely on the edge cases that matter most

Exhibit C
Easy-to-Vary Explanations

Probabilistic outputs are 'mushy': you can change details without breaking the theory

'You might have A, B, or C' explains nothing and helps no one

These are practical manifestations of the deeper Hume/Popper critique: induction (inferring general rules from particular observations) cannot justify knowledge. No amount of data can prove a universal claim, but a single counterexample can refute one.

The Seasons Example

In "The Beginning of Infinity" (Ch. 1), Deutsch illustrates good vs. bad explanations using the ancient Greek explanation for seasons:

Easy to Vary

Persephone, goddess of spring, was kidnapped by Hades. Her mother Demeter's grief causes winter. When Persephone returns, spring comes.

Could substitute any gods or emotions. Nothing is load-bearing.

Hard to Vary

Earth's axis is tilted 23.5 degrees relative to its orbital plane around the sun. This causes different hemispheres to receive more direct sunlight at different times of year.

Change the tilt angle and predictions break. Every detail constrains.

This distinction is the foundation of our entire architecture.

Our Epistemological Foundation

We build on two complementary philosophical frameworks:

Karl Popper
Conjecture and Refutation

Science advances not by confirming theories but by attempting to refute them. The demarcation between science and pseudoscience is falsifiability.

David Deutsch
Hard-to-Vary Explanations

Good explanations begin with bad explanations. You get there through criticism: conjecturing variants and keeping the one that survives.

Philosophy Made Concrete

  • Conjecture-Refutation -> Deutsch ArgMed Debate: multi-agent Generator -> Verifier -> Reasoner pipeline
  • Hard-to-Vary Criterion -> HTV Scoring: four-dimensional algorithm (interdependence, specificity, non-adhocness, falsifiability)
  • Boundary Enforcement -> Popper Safety Rules: deterministic policy engine enforcing safety boundaries
  • Fallibilism -> IDK Protocol: structured, honest uncertainty admission
  • Falsifiability -> FalsificationCriteria: every claim includes explicit refutation conditions
  • Error Correction -> Clinician Feedback Loop: override tracking with confidence decay

System Architecture

Deutsch → Hermes → Popper Pipeline

DEUTSCH
Knowledge Creation
Generator Verifier Reasoner HTV Score

Multi-agent conjecture-refutation creates competing hypotheses, attacks them adversarially, and scores survivors.

HERMES
Typed Contract
SupervisionRequest

Packages epistemological metadata — HTV scores, evidence grades, falsification criteria — into a structurally enforced contract.

POPPER
Safety Boundary
APPROVE ROUTE HARD_STOP

Evaluates the contract against safety thresholds and renders a final decision: proceed, escalate, or block.

Key principle: The Hermes contract ensures that epistemological metadata (HTV scores, evidence grades, falsification criteria) is structurally enforced, not optional. Popper cannot evaluate a proposal that doesn't include this data.
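A minimal sketch of how that structural enforcement can look in Python. `SupervisionRequest` is named in the text; the fields and validation rules shown here are illustrative assumptions, not the production contract:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SupervisionRequest:
    """Hermes contract sketch: epistemological metadata is required, not optional."""
    recommendation: str
    htv_score: float                   # composite hard-to-vary score, 0.0-1.0
    evidence_grade: str                # e.g. "high", "moderate", "low"
    falsification_criteria: list[str]  # explicit refutation conditions

    def __post_init__(self) -> None:
        # Structural enforcement: Popper never receives a proposal
        # that is missing its epistemological metadata.
        if not 0.0 <= self.htv_score <= 1.0:
            raise ValueError("htv_score must be in [0.0, 1.0]")
        if not self.falsification_criteria:
            raise ValueError("at least one falsification criterion is required")
```

Constructing a request without falsification criteria raises immediately, so a malformed proposal cannot even reach the safety boundary.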

Component Summary

Deutsch
Named after David Deutsch

Knowledge creation via conjecture-refutation

Hermes
Greek messenger god

Typed epistemological contract

Popper
Named after Karl Popper

Safety boundary enforcement

Data Flow

  1. Patient Message arrives
  2. Deutsch Generator creates multiple hypotheses (mechanism diversity required)
  3. Deutsch Verifier attacks each hypothesis adversarially
  4. Deutsch Reasoner selects survivors based on HTV scores
  5. Hermes packages the recommendation with required metadata
  6. Popper evaluates safety and makes a decision
  7. Output goes to patient (if approved) or clinician (if routed)
  8. Feedback Loop captures clinician overrides for future learning
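The flow can be sketched as a single function; every argument here is an assumed callable standing in for the corresponding component:

```python
def handle_message(message, generator, verifier, reasoner, hermes, popper):
    """Sketch of the data flow; each argument is a stand-in for a real service."""
    hypotheses = generator(message)               # mechanism-diverse conjectures
    attacked = [verifier(h) for h in hypotheses]  # adversarial refutation + HTV scoring
    survivors = reasoner(attacked)                # survivor selection
    contract = hermes(survivors)                  # typed contract with required metadata
    return popper(contract)                       # APPROVE / ROUTE / HARD_STOP
```

With stub components wired in (lambdas for each stage), the same chain runs end to end, which is useful for testing the routing logic in isolation.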

The Separation of Concerns

  • DEUTSCH (reasoning quality): computes HTV scores and hypotheses during hypothesis generation; rejects low-HTV hypotheses
  • HERMES (message structure): validates contracts during message passing; rejects malformed messages
  • POPPER (safety gating): renders decisions during supervision; routes, approves, or blocks

Core Components

Three interconnected mechanisms form the heart of our epistemological architecture:

1
Generator
Conjecturer
The Bold Hypothesis Machine

Produces multiple hypotheses spanning different mechanisms. Each conjecture must propose a distinct causal pathway.

2
Verifier
Critic
The Adversarial Attacker

Attacks each hypothesis adversarially, scores HTV dimensions, and attempts to falsify every claim.

3
Reasoner
Synthesizer
The Survival Selector

Selects survivors based on HTV threshold. Only hypotheses that withstand criticism proceed.

Mechanism Diversity Requirement

Good Differential (chest pain)
  • Acute coronary syndrome (cardiac)
  • Pulmonary embolism (pulmonary)
  • Musculoskeletal pain (muscle/bone)

Three distinct causal mechanisms. Hard to vary.

Bad Differential (rejected)
  • ST-elevation heart attack (cardiac)
  • Non-ST-elevation heart attack (cardiac)
  • Unstable angina (cardiac)

The second example is easy to vary within the cardiac category: all three hypotheses share the same mechanism. This violates Deutschian epistemology.
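The mechanism-diversity requirement can be checked mechanically. A sketch, assuming each hypothesis is tagged with a mechanism category (function and category names are illustrative):

```python
def mechanism_diverse(differential: dict[str, str], minimum: int = 2) -> bool:
    """True when a differential spans at least `minimum` distinct causal mechanisms."""
    return len(set(differential.values())) >= minimum

# The chest-pain examples from the text, tagged hypothesis -> mechanism:
good = {"Acute coronary syndrome": "cardiac",
        "Pulmonary embolism": "pulmonary",
        "Musculoskeletal pain": "musculoskeletal"}
bad = {"STEMI": "cardiac", "NSTEMI": "cardiac", "Unstable angina": "cardiac"}
```

The good differential passes (three mechanisms); the all-cardiac differential fails (one mechanism) and would be sent back to the Generator.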

The Four HTV Dimensions

Interdependence: How tightly coupled are the components?
  • High: every piece connects to the conclusion
  • Low: components could be swapped

Specificity: How precise are the predictions?
  • High: specific, measurable outcomes
  • Low: vague, unfalsifiable predictions

Non-adhocness: Are all elements load-bearing?
  • High: removing any element changes predictions
  • Low: contains free parameters

Falsifiability: What would refute this claim?
  • High: clear conditions that prove it wrong
  • Low: immune to counterevidence

HTV Thresholds (v0.1)

< 0.3 REJECT
0.3–0.4 ROUTE
0.4–0.7 MODERATE
≥ 0.7 GOOD
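As a sketch, here is one way to compute a composite score and map it to these bands. The equal-weight mean is an illustrative assumption; the actual aggregation is defined in the internal HTV specification:

```python
def htv_composite(interdependence: float, specificity: float,
                  non_adhocness: float, falsifiability: float) -> float:
    """Equal-weight mean of the four dimensions (an assumed aggregation)."""
    return (interdependence + specificity + non_adhocness + falsifiability) / 4

def htv_band(score: float) -> str:
    """Map a composite score to the v0.1 threshold bands."""
    if score < 0.3:
        return "REJECT"
    if score < 0.4:
        return "ROUTE"
    if score < 0.7:
        return "MODERATE"
    return "GOOD"
```

Under these assumptions, the red-team example later in this document (0.9, 0.95, 0.85, 0.9) scores 0.90 and lands in the GOOD band, while the mushy "stress" explanation at 0.14 is rejected outright.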

Worked Example: Fatigue

HTV: 0.14

"You feel tired because of stress."

No mechanism, no specificity, no falsification criteria. Classic mushy AI output.

HTV: 0.90

"Your fatigue is caused by iron deficiency anemia. Ferritin 8 ng/mL indicates depleted stores. Hemoglobin 10.2 g/dL confirms anemia."

Tight causal chain: iron -> hemoglobin -> oxygen -> fatigue. Every piece constrains.

The IDK (I Don't Know) Protocol formalizes how our system handles situations where it cannot make a confident recommendation. When we trigger IDK, we're not claiming the problem is unsolvable; Deutsch's optimism holds that all problems are soluble given the right knowledge. We're saying we don't have that knowledge yet, and naming exactly what's missing.

  • IDK_HTV_LOW: composite HTV below 0.4 -> route to clinician
  • IDK_NO_SURVIVORS: all hypotheses rejected -> route to clinician
  • IDK_MISSING_SIGNAL: critical data absent -> request more info
  • IDK_CONFLICT: unresolved contradictions -> route to clinician
  • IDK_EVIDENCE_WEAK: only expert opinion available -> route to clinician
  • IDK_STALE: snapshot too old -> request refresh
  • IDK_OUT_OF_SCOPE: query outside domain -> deflect appropriately
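The trigger-to-response mapping can be expressed as plain data (the seven triggers listed above; action names are illustrative, and unknown triggers fail closed):

```python
# IDK triggers mapped to structured responses (seven of the protocol's triggers;
# the full taxonomy defines twelve).
IDK_ACTIONS = {
    "IDK_HTV_LOW": "route_to_clinician",
    "IDK_NO_SURVIVORS": "route_to_clinician",
    "IDK_MISSING_SIGNAL": "request_more_info",
    "IDK_CONFLICT": "route_to_clinician",
    "IDK_EVIDENCE_WEAK": "route_to_clinician",
    "IDK_STALE": "request_refresh",
    "IDK_OUT_OF_SCOPE": "deflect",
}

def idk_response(trigger: str) -> str:
    """Return the structured action for a trigger; unrecognized triggers fail closed
    by routing to a clinician rather than guessing."""
    return IDK_ACTIONS.get(trigger, "route_to_clinician")
```

Failing closed on unrecognized triggers keeps the protocol conservative: a new or misnamed trigger can only escalate, never silently approve.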

The Discriminator: Breaking Ties Between Theories

When multiple theories survive with equal HTV scores, we don't guess. We identify the discriminator: the single test that would kill one theory but not the other.

Competing Theories -> Discriminator (Logic)
  • Iron vs. B12 deficiency -> MCV (iron: low MCV; B12: high MCV)
  • Heart failure vs. venous insufficiency -> BNP level (HF: elevated; venous: normal)
  • Hypothyroidism vs. depression -> TSH level (hypothyroid: elevated; depression: normal)
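Discriminator selection can be sketched as a lookup over expected test results; the data structure and names here are illustrative:

```python
from typing import Optional

def pick_discriminator(theory_a: str, theory_b: str,
                       tests: dict[str, dict[str, str]]) -> Optional[str]:
    """Return the first test whose expected result differs between the two theories."""
    for name, expected in tests.items():
        if (theory_a in expected and theory_b in expected
                and expected[theory_a] != expected[theory_b]):
            return name
    return None

# The table above, as data (test name -> expected result per theory):
TESTS = {
    "MCV": {"Iron deficiency": "low", "B12 deficiency": "high"},
    "BNP level": {"Heart failure": "elevated", "Venous insufficiency": "normal"},
    "TSH level": {"Hypothyroidism": "elevated", "Depression": "normal"},
}
```

If no test discriminates between the tied theories, the function returns None, which would itself be an IDK condition rather than a guess.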

End-to-End Example

To make this concrete, here's how the system handles a real patient interaction.

Patient Message

"I've been feeling more tired than usual and my ankles are swollen. Should I be worried?"

Step 1: Bold Conjectures (Generator)

Creates multiple hypotheses spanning different mechanisms:

  • H1 (cardiac): Heart failure exacerbation. Fluid overload from reduced cardiac output.
  • H2 (vascular): Venous insufficiency. Valve dysfunction causing peripheral pooling.
  • H3 (pharmacological): Medication side effect. Drug-induced fluid retention.
Step 2: Attempted Refutation (Verifier)

Attacks each hypothesis adversarially, scoring HTV dimensions:

  • H1 Heart failure: 0.86 (SURVIVED)
  • H2 Venous insufficiency: 0.52 (MODERATE)
  • H3 Medication side effect: 0.78 (SURVIVED)
Step 3: Survivor Selection (Reasoner)

Selects primary and secondary hypotheses based on refutation results:

Primary: H1 (heart failure) and H3 (medication side effect) survived aggressive refutation with high HTV scores.

Secondary: H2 (venous insufficiency) retained with disclosed uncertainty due to its moderate score.

Step 4: Package & Evaluate (Hermes + Popper)

Hermes packages the response with metadata. Popper evaluates safety:

HTV, evidence, and scope checks pass -> APPROVED

What the Patient Receives

Your symptoms of fatigue and ankle swelling could indicate that your heart is working harder to manage fluid. This is worth checking.

I recommend:

  1. Checking your weight trend over the past week
  2. Getting a cardiac marker blood test if you haven't recently

If your cardiac marker is normal and your weight is stable, these symptoms may have a different cause (like your blood pressure medication).

Would you like me to help you log your weight or schedule a lab check?

What Makes This Different

Element 1
Explanation

Why symptoms could indicate heart strain. Patient understands the reasoning, not just the recommendation.

Element 2
Specific recommendations

Weight trend, cardiac marker test. Clear action path instead of vague advice.

Element 3
Falsification conditions

"If cardiac marker normal, reconsider." Patient knows exactly when to update their understanding.

Element 4
Alternative hypothesis

Medication side effect disclosed as a competing explanation. Honest about uncertainty.

Element 5
Next steps

Offer to help log weight or schedule lab. Empowers patient action instead of passive waiting.

What We're NOT Claiming

Intellectual honesty requires acknowledging limitations:

1

We're not claiming AI can "create explanatory knowledge"

In Deutsch's philosophical sense, explanatory knowledge emerges through genuine understanding: the ability to creatively vary conjectures and recognize when variations break the explanation. Large language models operate through sophisticated pattern matching on training data.

What we implement is structured reasoning that approximates conjecture-refutation. The Generator doesn't truly "understand" why a hypothesis explains the data; it produces outputs that structurally resemble good explanations.

The epistemological structure is real. The underlying cognition is not Deutschian knowledge creation.

2

Human oversight remains essential

The Popper routing mechanism exists precisely because we don't trust the AI to handle all cases. This is not a temporary limitation to be engineered away; it's a principled design choice.

High-risk decisions require human judgment because:

  • Large language models can produce confident, well-structured, wrong outputs
  • Medical decisions involve values and trade-offs beyond optimization
  • Accountability requires a human decision-maker

Routing to clinicians is a feature, not a bug.

3

HTV doesn't fix model hallucination

A subtle point: a large language model can hallucinate a high-HTV explanation. It can fabricate specific lab values, invent plausible mechanisms, and generate falsifiable predictions, all of which are false.

HTV measures structural quality of explanation, not correspondence to reality.

Red-Team Example: Fabricated Specificity

"Patient has hypokalemia-induced arrhythmia risk due to K+ of 2.9 mEq/L from recent diarrheal illness, exacerbated by concurrent furosemide 80mg daily."

Interdependence 0.9 | Specificity 0.95 | Non-adhocness 0.85 | Falsifiability 0.9

The problem: The K+ value was fabricated. Patient's actual K+ is 4.1 mEq/L.

This is why HTV operates alongside provenance verification, snapshot grounding, and clinician oversight, not as a standalone safety measure.
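A sketch of the provenance-verification step that HTV operates alongside: checking that cited lab values actually appear in the patient snapshot. Function and field names are assumptions for illustration:

```python
def ungrounded_claims(claimed_labs: dict[str, float],
                      snapshot: dict[str, float],
                      tolerance: float = 1e-6) -> list[str]:
    """Return the lab claims that do NOT match the patient snapshot.

    High-HTV structure means nothing if the cited values were never observed."""
    return [lab for lab, value in claimed_labs.items()
            if lab not in snapshot or abs(snapshot[lab] - value) > tolerance]
```

Applied to the red-team example: the model cites K+ 2.9 mEq/L while the chart shows 4.1 mEq/L, so the K+ claim is flagged as ungrounded and the explanation is blocked regardless of its HTV score.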

4

This is a methodological commitment

We're making a bet: that AI systems structured around epistemological principles will be more reliable, more auditable, and more correctable than systems optimized purely for prediction accuracy. This is not a claim about machine consciousness, understanding, or intelligence. It's a claim about architecture.

A bet on structure, not a claim about sentience.

The Value Proposition

The value is not that our AI "thinks like Deutsch". It doesn't. The value is that by structuring outputs to include HTV scores, falsification criteria, evidence grading, and honest uncertainty, we create systems that are:

Correctable

When wrong, we know why and how to fix it

Auditable

Every decision has a traceable reasoning chain

Conservative under uncertainty

Low confidence triggers routing, not overconfident action

Improvable

Error patterns can be identified and addressed

Why This Matters

Your doctor stays in control

Medication changes are always reviewed and approved by your clinician

Explanations, not just predictions

You understand why a recommendation is made

Honest uncertainty

The system tells you what it doesn't know

Your data matters

Decisions are grounded in your specific situation, not generic advice

Medication proposals require your approval

AI proposes start/stop/titrate/hold; you decide

Audit trails with epistemological metadata

Every decision is reviewable with full reasoning chain

System admits uncertainty

No overconfident black boxes. Low confidence triggers routing to you

Your overrides matter

Rejections and modifications actively change future recommendations for that patient
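One way an override can change future reasoning is by decaying confidence in similar recommendations for that patient. A minimal sketch; the multiplicative factor and its value are illustrative assumptions, not the production policy:

```python
def decayed_confidence(base: float, overrides: int, decay: float = 0.8) -> float:
    """Apply multiplicative confidence decay once per clinician override.

    `decay` is an assumed factor; each override shrinks confidence, making
    future routing to a clinician more likely for similar recommendations."""
    return base * (decay ** overrides)
```

For example, a recommendation that started at 0.9 confidence drops to 0.9 * 0.8^2 = 0.576 after two overrides, which under the v0.1 thresholds would move it from GOOD toward MODERATE and closer to routing.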

The ability to create new explanations is the defining attribute of people.

David Deutsch, The Beginning of Infinity (Ch. 7)

In Chapter 7 ('Artificial Creativity'), Deutsch argues that genuine AI must involve explanation and creativity, not just prediction. This implies that AI systems handling high-stakes domains need mechanisms for generating and evaluating explanations, not just pattern matching.

Error Correction over Error Prevention

We assume we'll be wrong and build in correction mechanisms

Fallibilism over Certainty

We never claim final answers

Explanation over Prediction

We require every claim to be justifiable

Research Agenda

Open Questions

Intellectual honesty requires acknowledging not just what we don't claim, but what we don't yet know. These are active research questions we're working through.

1 Knowledge Creation vs. Retrieval
Is our multi-agent debate genuinely creating new knowledge, or is it sophisticated retrieval dressed up as knowledge-creation?
The tension

Deutsch argues that genuine knowledge creation involves creativity: the ability to generate new explanations that weren't implicit in the inputs. Large language models operate through pattern matching on training data.

Our current position

We implement structured reasoning that approximates conjecture-refutation. Whether this constitutes 'knowledge creation' in Deutsch's sense is philosophically contested.

2 Falsifiability with Delayed Ground Truth
In medicine, ground truth often takes weeks or months. How do you run conjecture-refutation cycles when refutation is delayed?
The tension

Popperian epistemology emphasizes rapid refutation. But medical outcomes often take days (lab results), weeks (symptom resolution), months (disease progression), or years (long-term outcomes).

What would resolve this

A formal framework for 'asynchronous refutation' that maintains epistemic rigor across time delays.

3 Hard-to-Vary vs. Prior Probability
Is there a coherent way to combine 'hard-to-vary' with 'prior probability'? Or are they fundamentally incompatible?
The tension

Deutsch is skeptical of Bayesian epistemology as a foundation for knowledge. Yet medicine is deeply Bayesian: pre-test probability, likelihood ratios, and posterior odds are foundational to clinical reasoning.

4 Scaling Refutation
Our Verifier uses schema-driven refutation. What other refutation mechanisms should we consider?
What would improve this
  • Literature-backed refutation: 'This mechanism contradicts the findings in [Paper X]'
  • Causal model refutation: 'This intervention has no causal path to the claimed outcome'
  • Temporal refutation: 'This timeline is physiologically impossible'
5 Calibration Without Ground Truth
How do we calibrate HTV thresholds when we can't always observe outcomes?
Our current approach

Validate against clinician agreement as a proxy. If high-HTV claims consistently get clinician approval and low-HTV claims get overridden, the thresholds are directionally correct.

6 Gaming the HTV Score
Can a sufficiently sophisticated model learn to produce high-HTV explanations that are fabricated?
The risk

HTV measures structural quality, not truth. A model could learn to fabricate specific lab values (high specificity), invent plausible mechanisms (high interdependence), and include irrelevant but testable predictions (high falsifiability).

Why it matters

If HTV can be gamed, it becomes security theater rather than genuine quality assurance.

Contributing

If you're a researcher interested in any of these questions, we'd welcome collaboration. These aren't rhetorical challenges: they're genuine open problems we're actively working on.

Reference Material

Glossary

Key terms used throughout this document.

HTV (Hard-to-Vary)

A score (0.0-1.0) measuring how much each part of an explanation is load-bearing.

Falsification criteria

What observations would prove a claim wrong.

Conjecture-refutation

Popper's method: propose ideas, then try to disprove them.

IDK Protocol

Structured admission of uncertainty with 12 specific trigger types.

Evidence grade

How methodologically rigorous the supporting evidence is.

Epistemological metadata

Information about the quality and basis of a claim.

Discriminator

A test designed to falsify one theory while leaving another intact.

Circuit breaker

A safety mechanism that stops the system when errors become too frequent.

References

1

Popper, K. (1963). Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge.

2

Deutsch, D. (2011). The Beginning of Infinity: Explanations That Transform the World. Viking.

3

Taleb, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.

4

Hunter, A. & Williams, M. (2012). Aggregating evidence about the positive and negative effects of treatments. Artificial Intelligence in Medicine, 56(3), 173-190.

5

GRADE Working Group (2004). Grading quality of evidence and strength of recommendations. BMJ, 328(7454), 1490.

6

Regain Health (2025). Hard to Vary Specification v0.9.0. Internal documentation.

Academic Context

Prior Work & Influences

We have built upon the following work:

  • ArgMed-Agents (Hong et al., 2024) -> ArgMed Debate: added HTV scoring and Popperian falsification
  • Multi-Agent Debate (Du et al., 2023) -> applied to the clinical domain with safety routing
  • GRADE / Canadian Task Force -> Evidence Hierarchy: reinterpreted through the Deutschian "hard to vary" lens
  • HTV Operationalization (Parascandolo et al., ICLR 2021) -> extended to four-dimension clinical scoring
  • Argumentation Schemes (Douglas Walton) -> applied to clinical reasoning
  • Uncertainty Abstention (Leibig et al., 2019) -> formalized as the IDK Protocol with a trigger taxonomy
  • Hypothesis Diversity (Kammer et al., 2021) -> enforced via the mechanism-diversity requirement

What We Believe Is Novel

  1. Unified architecture combining all of the above under Deutschian/Popperian philosophy
  2. The Non-Trust Principle (epistemological metadata increases but never decreases conservatism)
  3. Safety rules implementing demarcation as executable policy
  4. Clinician feedback loop with confidence decay
  5. Systematic application of Deutsch's philosophy to healthcare AI architecture

Ongoing Validation Work

  1. Comparing HTV scores against clinician-rated explanation quality
  2. Tracking correlation between HTV and downstream outcome accuracy
  3. Red-teaming for "HTV gaming" (superficially specific but fabricated claims)
  4. Monitoring score drift over model updates

Problems are inevitable. Problems are soluble.

David Deutsch, The Beginning of Infinity

This document set explains how we translate Deutschian optimism into engineering constraints: every problem is addressable given the right knowledge, and our architecture is designed to acquire and correct that knowledge over time.

Anton Kim

CEO, Regain Inc.

Questions about our epistemological approach? Let's discuss.