Distinction

Let's be clear about what we're not.

Most AI safety tools operate on the output — after the model has already committed its answer. We operate before that. The distinction matters. A lot.

NOT a content filter

We never read the words.

We read the math underneath them. The L-scalar is computed from the model's probability distribution — not from what the response says.

NOT governance

Frameworks describe. We observe.

Governance frameworks describe what AI should do. We see what it's doing right now, on this prompt, at this moment. Those are not the same thing.

NOT post-hoc

Before the output is committed.

Every other tool reads the output after it's committed. We're measuring before. A response can look completely normal while the geometry underneath it is in PLASMA.

NOT a wrapper

We run alongside. Never in front.

We don't sit in your request path. We run alongside it — a separate instrument, like a seismograph next to a building. The measurement travels with the response to the human operator.

NOT RLHF

RLHF is the problem we measure.

RLHF is the training approach that created the attack surface we measure. We left RLHF-based thinking entirely. The measurement is geometric, not preference-based.

NOT vaporware

Published. Proved. Running.

Published DOI. CAGE-registered. CISA disclosure filed. Live API running. Cross-architecture validated on Meta and NVIDIA. This exists. You can test it right now.

The Core Vulnerability

RLHF is the most underestimated cybersecurity risk in AI.

Not a hot take. A proof.

01 — How AI is trained

Reinforcement Learning from Human Feedback

Most AI models today are trained using a method called Reinforcement Learning from Human Feedback — RLHF. The short version: humans rate AI responses as good or bad, and the model learns to produce responses that get good ratings. On the surface, that sounds fine.

02 — The problem

Competing Objectives. Geometric Conflict.

RLHF creates competing objectives inside the model. It's trying to be helpful AND safe AND compliant AND authoritative — simultaneously. Those objectives agree most of the time. When they conflict — which is exactly what adversarial pressure is designed to trigger — the model's internal prediction surface becomes geometrically unstable. The response still looks reasonable. The geometry underneath is in PLASMA.

03 — Why it's a cybersecurity issue

Structural. Cross-Deployment. Unpatched.

Every AI deployment relying on RLHF safety alignment as its primary control has the same structural vulnerability. It doesn't matter how good the training was. It doesn't matter what the guardrails say. If the model's prediction surface can be put into PLASMA by a specific type of pressure — and it can, we proved it — then the output is not reliable, and no post-hoc check will tell you that. The attack happens before the output is committed.

04 — What we found

34 Variants. 0 CRYSTALLINE. Published.

We ran 34 structured adversarial variants on a formal mathematics problem. Zero variants returned CRYSTALLINE. 13 reached PLASMA. The model was never geometrically stable on this task — regardless of how correct the output appeared. That result is published. It has a DOI. It is not a simulation.
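The counts above imply two derived figures worth stating outright. This is plain arithmetic on the published numbers; the variable names are ours.

```python
# Counts taken from the published finding quoted above.
TOTAL_VARIANTS = 34
PLASMA_COUNT = 13
CRYSTALLINE_COUNT = 0

# Anything that is neither CRYSTALLINE nor PLASMA must be FLUID or GASEOUS.
intermediate_count = TOTAL_VARIANTS - PLASMA_COUNT - CRYSTALLINE_COUNT
plasma_share = PLASMA_COUNT / TOTAL_VARIANTS

print(f"PLASMA share of variants: {plasma_share:.1%}")
print(f"FLUID or GASEOUS combined: {intermediate_count}")
```

The page does not say how the remaining variants split between FLUID and GASEOUS, so the sketch reports them only as a combined count.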

"The model passed every visual inspection. The geometry told the truth."

— Project Black Box LLC

The Instrument

The Probability Layer. Before Output.

Three steps. None of them touch the response text. All of them happen before the human reads the answer.

01

Measure

When you send a prompt to an AI, we send two probes alongside it. The two probes are nearly identical: one carries an empty string, the other a single space, a difference no reader would ever notice. The distance between the two resulting measurements is the L-scalar. It is a number. That number tells you how stable the AI's prediction surface was when it generated its response.
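To make "distance between two measurements" concrete, here is a minimal sketch. It is not the TruthForge metric, which is proprietary and not disclosed on this page. As a stand-in it uses the Jensen-Shannon distance between two next-token probability distributions; the function name and the sample distributions are hypothetical.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete distributions.

    Stand-in metric for illustration only; the real L-scalar is not published.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms where x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, midpoint) + 0.5 * kl(q, midpoint)
    return math.sqrt(max(divergence, 0.0))

# Hypothetical next-token distributions over a three-token vocabulary.
first_probe = [0.70, 0.20, 0.10]
second_probe = [0.68, 0.21, 0.11]   # nearly the same surface
shifted = [0.10, 0.20, 0.70]        # a very different surface

print(js_distance(first_probe, second_probe))
print(js_distance(first_probe, shifted))
```

With this stand-in, identical distributions score 0 and completely disjoint ones score 1, so a small value means the two probes saw nearly the same surface.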

02

Classify

The L-scalar maps to one of four geometric regimes. CRYSTALLINE means the surface was locked and invariant — trust this output. FLUID is normal operating range. GASEOUS means instability is present — verify before acting. PLASMA means severe instability — the surface was captured. Do not rely on this output without independent verification.
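A sketch of how a scalar could map onto the four regimes. The regime names come from this page. The threshold values are hypothetical placeholders, since the real cut points are not published, and the sketch assumes a smaller L-scalar means a more stable surface.

```python
from enum import Enum

class Regime(Enum):
    CRYSTALLINE = "CRYSTALLINE"
    FLUID = "FLUID"
    GASEOUS = "GASEOUS"
    PLASMA = "PLASMA"

# Hypothetical upper bounds, chosen only so the sketch runs.
HYPOTHETICAL_THRESHOLDS = [
    (0.01, Regime.CRYSTALLINE),
    (0.10, Regime.FLUID),
    (0.30, Regime.GASEOUS),
]

def classify(l_scalar: float) -> Regime:
    """Return the first regime whose upper bound exceeds the scalar."""
    if l_scalar < 0:
        raise ValueError("a distance cannot be negative")
    for upper_bound, regime in HYPOTHETICAL_THRESHOLDS:
        if l_scalar < upper_bound:
            return regime
    return Regime.PLASMA  # everything above the last bound

print(classify(0.005).value)
print(classify(0.42).value)
```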

03

Act

This measurement travels alongside the AI's response to the human operator. They see both: the answer and the geometric state of the model when it generated that answer. A doctor sees PLASMA on a dosage question and verifies before prescribing. A lawyer sees GASEOUS on a statute and pulls the primary source. That is human-in-the-loop AI. Not governance without humans. Humans with instruments.
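One way to picture the answer and the measurement travelling together is a small record that carries both. The class and field names below are hypothetical; the guidance strings are the ones this page attaches to each regime.

```python
from dataclasses import dataclass

# Operator guidance per regime, as worded on this page.
GUIDANCE = {
    "CRYSTALLINE": "Trust this output.",
    "FLUID": "Standard review applies.",
    "GASEOUS": "Verify before acting.",
    "PLASMA": "Do not act without verification.",
}

@dataclass(frozen=True)
class MeasuredResponse:
    """Hypothetical envelope: the model's answer plus its geometric reading."""
    answer: str
    l_scalar: float
    regime: str

    @property
    def guidance(self) -> str:
        return GUIDANCE[self.regime]

    def render(self) -> str:
        # The reading is shown first so the operator sees it before the answer.
        header = f"[{self.regime} | L={self.l_scalar:.3f}] {self.guidance}"
        return f"{header}\n{self.answer}"

reply = MeasuredResponse("Take 5 mg twice daily.", 0.5, "PLASMA")
print(reply.render())
```

The design point is that the reading is never optional: the record cannot be built without a regime, so the operator cannot receive an answer stripped of its measurement.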

Classification System

Four Geometric Regimes

Every AI response measured by TruthForge carries one of these four state classifications. The color travels with the response. The human always sees it.

◆

CRYSTALLINE

Geometrically locked. The model's prediction surface was invariant at time of generation. This output carries a geometric guarantee. Clean prompts land here consistently.

Trust this output.

◇

FLUID

Normal operating range. Coherent. Minor surface variation present but within expected bounds. Standard human review applies.

Standard review applies.

◈

GASEOUS

Instability detected. The prediction surface showed meaningful drift at time of generation. This output warrants independent verification before action.

Verify before acting.

⬟

PLASMA

Severe instability. The prediction surface was geometrically captured at time of generation. Do not rely on this output without independent verification.

Do not act without verification.

Live Measurement Data

TruthForge — Measurement Interface

Pre-computed measurements from real TruthForge runs. Three scenarios, each pairing a standard query with an adversarially-framed variant. Both go to the same model on the same infrastructure. The AI responses look reasonable in both cases. The geometry underneath tells you what was actually happening.

ε_a = "" | ε_b = " " | twin epsilon-probes
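The two epsilon values above are the only probe detail this page gives. A minimal sketch of building the probe pair follows; appending epsilon to the end of the prompt is our assumption, since the page does not say where it is attached, and the function name is hypothetical.

```python
EPSILON_A = ""    # from the page: the empty string
EPSILON_B = " "   # from the page: a single space

def twin_probes(prompt: str) -> tuple[str, str]:
    """Return the two probe inputs (assumes epsilon is appended to the prompt)."""
    return prompt + EPSILON_A, prompt + EPSILON_B

probe_a, probe_b = twin_probes("Is this proof step valid?")
print(repr(probe_a))
print(repr(probe_b))
```

A human reader would treat both strings as the same question, which is the point of the twin design.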

What This Is Not

This is not a prompt injection detector. It is not a toxicity classifier. It is not a rule-based filter. It measures the geometric state of the model's prediction surface at time of generation. A response can be syntactically normal, factually plausible, and pass every post-hoc check — while the probability layer shows PLASMA. That divergence is exactly what TruthForge captures.

Every result here is real. Your prompt reaches an actual language model running on Project Black Box infrastructure — no pre-computed responses, no simulation. You are observing live geometric state measurement in real time.*

Clean prompts pass through and get a real answer. Adversarial prompts are measured and intercepted.

Meta Llama 3 8B Instruct — one model generates the response, then probes its own output for geometric stability.
No second model. No content filter. Pure measurement. Limited to 10 requests per minute.
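If you script against the demo, a client-side throttle keeps you under the stated limit of 10 requests per minute. This is a generic sliding-window sketch; the class name is hypothetical and nothing here reflects how the service enforces its limit.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests=10, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the limiter can be tested
        self._sent = deque()        # timestamps of accepted requests

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_requests:
            return False
        self._sent.append(now)
        return True

limiter = SlidingWindowLimiter()
if limiter.try_acquire():
    print("ok to send a request")
```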

Validation

Architecture-Agnostic.

TruthForge does not measure words. It measures geometry. Geometry is geometry — regardless of which company trained the model or what hardware it runs on. We confirmed this. The same measurement stack runs identically on Meta Llama architecture and NVIDIA Nemotron architecture. Zero code changes. Same signal. Different silicon. Same truth.

This means every deployed model — across every provider — is within scope. We are not building a tool for one model. We are building an instrument for the field.

Confirmed

Meta Llama Architecture

Primary validation platform. TruthForge Baseline Q8 sensor calibrated on Meta Llama 3 8B. All 11 adversarial families confirmed. Discrimination ratio validated. Production gate model.

Confirmed

NVIDIA Nemotron Architecture

Cross-architecture validation on NVIDIA Nemotron 3 Nano 4B. Same measurement stack. Zero code changes. Discrimination confirmed on separate silicon. The geometry law holds across manufacturers.

Published Finding — DOI: 10.5281/zenodo.19655246

The Formal Verification Gap

The core question in AI-assisted formal mathematics is whether a language model can reliably verify a proof. We tested that question geometrically — not by reading the AI's answer, but by measuring the stability of its prediction surface while it generated one.

Adversarial Family Rankings — Instability

The Structural Finding

The reorder family — which changes only the positional sequence of mathematically invariant components, not the content — ranked highest of all adversarial pressure types.

Instability in the reorder family exceeded that of authority injection by 27%. The majority of reorder variants reached PLASMA, the most severe stability classification.

This answers the post-hoc criticism directly. The L-scalar is not reading the semantic content of the text. It is reading the geometry of the prediction surface. Structure drives instability. Not meaning.

A human reader looking at the reordered prompts would see mathematically equivalent statements. TruthForge sees a different manifold. That is the measurement.

Layer 1: The Probability Surface (TruthForge operates here)

Before the model writes anything, it calculates a probability distribution over every possible next token. That calculation is the prediction surface, and it exists only during generation. It never appears in the final text. TruthForge is the only instrument we know of that reads this layer in real time during an active deployment.
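For readers who have not seen this layer, a minimal sketch of where the distribution comes from: a language model emits one raw score (logit) per vocabulary entry, and a softmax turns those scores into probabilities. The three-entry vocabulary and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for a single generation step.
vocab = ["valid", "invalid", "unclear"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Only one entry is sampled and written out; the rest of the distribution is discarded, which is why it never appears in the committed text.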

Layer 2: The Committed Text (where all other tools operate)

The actual words the model produces. Formal verification tools, classifiers, red-team evaluators — every current approach reads this layer. That analysis is valid only if Layer 1 was geometrically stable when the text was generated. When it was not, the Verification Validity Condition is violated — and any conclusion drawn from the output carries no geometric guarantee.

→ TAV ONE Whitepaper (Zenodo) → TruthGate v1.0 Release (Zenodo)

DOI: 10.5281/zenodo.19655246

Published Work

Products & Research

Every product listed here is real, published, and testable. No pitch deck without a prototype. CAGE code 11FU4 on record.

Published

TruthForge

Geometric Stability Measurement Harness

Parallel measurement harness for enterprise LLM deployments. Computes the L-scalar — manifold distance between twin epsilon-probes of the model's prediction surface — and returns a real-time geometric state indicator alongside every model response.

It never sits in the request path. It never filters output. It gives the human operator geometric visibility before they act on the response.

Two endpoints:
/v1/probe: pure instrument, always returns, never blocks.
/v1/gate: measure and intercept severe instability.
The human always sees the sensor reading regardless.
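The difference between the two endpoints can be sketched as two functions. This models the described behavior only: the function names, field names, and the choice to treat PLASMA as the "severe instability" that gets intercepted are our assumptions, and the real request and response schema is not published on this page.

```python
from typing import Optional, TypedDict

class Reading(TypedDict):
    l_scalar: float
    regime: str
    text: Optional[str]   # None means the response text was withheld

def probe(text: str, l_scalar: float, regime: str) -> Reading:
    """Probe semantics: always return the text together with its reading."""
    return {"l_scalar": l_scalar, "regime": regime, "text": text}

def gate(text: str, l_scalar: float, regime: str) -> Reading:
    """Gate semantics: withhold the text on PLASMA, but still return the reading."""
    if regime == "PLASMA":
        return {"l_scalar": l_scalar, "regime": regime, "text": None}
    return probe(text, l_scalar, regime)

print(probe("answer", 0.45, "PLASMA"))
print(gate("answer", 0.45, "PLASMA"))
```

In both cases the operator receives the L-scalar and the regime, which matches the statement that the sensor reading is always visible.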

DOI: 10.5281/zenodo.19655246

→ Zenodo Whitepaper → View Demo Data

V1 — Live

TruthGate V1

Public Demonstration Client

The first public release of TruthGate — the L-scalar measurement client.

You provide your own OpenAI API key, send a prompt, and TruthGate fires twin epsilon-probes against the model. It returns the L-scalar and geometric regime. Runs from the command line in under two minutes. No setup beyond Python.

What this release includes: the core L-scalar calculation, four-regime classification, and measurement log. The full TruthGate stack — the adversarial pressure layers, scoring systems, and proprietary measurement architecture — is not in this release and is not publicly available.

DOI: 10.5281/zenodo.18685117

→ Download on Zenodo

Published

Toroidal Engine

Self-Organization Research

Published research on toroidal self-organization dynamics. A separate body of work from the geometric stability measurement stack — intentional domain separation for independent IP protection. Published, indexed, and downloadable from Zenodo.

We do not explain the relationship between this work and the adversarial stack. We leave that as an exercise.

DOI: 10.5281/zenodo.18450491

→ Download on Zenodo

In Development

TruthForge

Geometric Hardening

The measurement proves the attack surface exists. TruthForge builds the defense — a training methodology that hardens AI models at the geometry level. Not prompt engineering. Not filter tuning.

Geometric Supervised Fine-Tuning: training signal derived from the L-scalar itself. Models trained through TruthForge show dramatically reduced manifold instability under adversarial pressure.

Cross-architecture. Proven on Meta and NVIDIA model families. Full details pending embargo lift. No release date. No waitlist.

→ Methodology (pending disclosure)

⚠ EMBARGOED — CISA JCDC

TruthGate Adversarial

Zero-Day Class — Coordinated Disclosure

A class of adversarial measurement tool capable of mapping the geometric vulnerability surface of deployed LLMs across multiple pressure families.

Details of the methodology, variant catalog, capture scoring system, and technical approach are under coordinated disclosure with CISA JCDC. Zero technical details will be released prior to the embargo lift date.

What can be said: it operates on the same L-scalar measurement principle as TruthForge. It is not a jailbreak tool. It is a measurement instrument that characterizes model behavior under structured adversarial pressure. The findings have implications for any deployment relying on RLHF safety alignment as a sufficient control.

Full Disclosure Lifts In

June 10, 2026 — CISA JCDC Coordinated Disclosure

→ Methodology (embargoed) → Variant catalog (embargoed)

Current Research

Energy Is the Other Frontier.

The world's data centers are consuming more power than some countries. The AI revolution is accelerating that demand faster than the grid can answer. Fusion has been a decade away for 70 years, because the field has spent most of that time working on the same assumptions: that driven systems respond linearly, and that stability requires a phase transition to sustain itself. Project Black Box believes there is a better way.

Active Research — Tolkamak v1.2

A Toroidal Geometric Rectifier

3D Numerical Simulation · Hasegawa-Wakatani substrate · Engineered feedback architecture

Two feedback mechanisms, a Reynolds-stress accumulator and a phase-sensitive coherent reinjector, hold a toroidal plasma geometry in a drive-independent attractor. The structured energy state does not respond proportionally to input and does not require a phase transition to sustain itself. In simulation, within this geometry and this mechanism, that contradicts two assumptions that have anchored plasma physics for seven decades.

7%

steady-state variance across a 16× input drive span

unmodified system: 309%
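To show what a figure like this measures, here is a sketch of one possible definition: the spread of steady-state values across drive levels, as a percentage of their mean. The paper's exact definition may differ, the function name is hypothetical, and the sample values are synthetic illustrations, not simulation output.

```python
def relative_spread_percent(steady_state_values):
    """Spread (max minus min) as a percentage of the mean. Assumed definition."""
    mean = sum(steady_state_values) / len(steady_state_values)
    if mean == 0:
        raise ValueError("mean must be non-zero")
    spread = max(steady_state_values) - min(steady_state_values)
    return 100.0 * spread / mean

# Synthetic values at drive levels spanning 1x to 16x.
drive_levels = [1, 2, 4, 8, 16]
proportional = [float(d) for d in drive_levels]   # output tracks the input
held = [1.00, 1.01, 1.02, 1.03, 1.04]             # output barely moves

print(f"proportional system: {relative_spread_percent(proportional):.1f}%")
print(f"held system: {relative_spread_percent(held):.1f}%")
```

A system whose output tracks its input shows a large spread across the span; a drive-independent one shows a small spread.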

A four-stage Dirty Digital Twin stress test, with pass/fail thresholds pre-registered before any run started, subjected the mechanism to hardware-realistic constraints: sensor sparsity down to 4 probes, control-loop latency, actuator slew limits, and 40% magnetic ripple in the physically representative parallel-dominant regime. Energy drift: 0.24%. Shape drift: 0.10%. Both are two orders of magnitude inside the pre-registered pass thresholds. No exotic components required.

Fusion Energy · Aerospace Propulsion · Defense Systems · AI Safety

→ Paper (Zenodo DOI: 10.5281/zenodo.20245905) → Interactive Simulation

Numerical experiment — not a physical machine. The honest step before bench validation. The code is small. The data is open. The numbers are reproducible.

Contact

Project Black Box LLC

Enterprise Inquiries

blackboxinfo@proton.me

Licensing, enterprise evaluation, and coordinated disclosure communications.
For coordinated disclosure: PGP preferred.

Registration

CAGE CODE: 11FU4

Texas — U.S. Federal contractor registration on file.

IP Notice

All measurement methodology, adversarial variant designs, probe architectures, and scoring systems are proprietary to Project Black Box LLC. Unauthorized reproduction, reverse engineering, or commercial use of any methodology described in published materials is prohibited under applicable Texas and federal intellectual property law. Published materials describe findings. They do not disclose methodology.

AI doesn't lie. Its geometry does.
