Diogenes Vault: Verified File Store
Status: Concept · Pre-implementation
Companion issues: diogenes#307 (Phase 0), diogenes#308 (DSSE export adapter), lollards#88 (consumer integration)
Diogenes Vault is a proposed standalone service that adds verified file storage on top of Diogenes — pluggable backends (S3, local filesystem, IPFS), optional per-file encryption, and end-to-end verification grounded in the same attestation log used for everything else. This document explains the gap it closes, the principles it commits to, and a sketch of the technical shape.
The gap
Diogenes today is a verified-source-trust system: it signs content_hash and stores the signature on a hash-chained, append-only log. That works beautifully for bytes. It does not work for claims about bytes.
A registered key on Diogenes today carries a pseudonym, an expiration_date, an algorithm choice. The signature on a key registration covers the public-key bytes — but not the pseudonym. A registered key with a swapped pseudonym still verifies. The same is true for endorsement offers (the role description and institutional metadata sit outside the signature), attestation requests (the requested type and intent are informational), trust configurations, and document attestations themselves: the signature locks the content hash but not the document's title, license, edition, or publisher identity.
This is the actual gap. Anywhere Diogenes signs something narrow while informational metadata rides alongside, the metadata is mutable in a way the bytes are not. For the verified-source story to hold up under cross-domain scrutiny — medical research, legal contracts, embargoed journalism, theological scholarship — that gap has to close.
The gap is not specific to file storage. But it is most visible there: a "verified file" with a re-writable license string is not a verified file.
The opportunity
Closing the gap once, at the right layer, lets several things become possible at once.
- Files become first-class verified artifacts. Anyone with a Diogenes-registered key can publish a file whose locked metadata (publisher, license, edition, language, retrieval URLs, encryption envelope, even a Merkle root over chunks) is part of what gets signed. Tampering with any of it invalidates the signature in exactly the way tampering with the bytes does.
- Storage stops being someone else's problem. Today Diogenes assumes content lives elsewhere — IPFS, Arweave, S3, Git, package registries. That works, but every consumer reinvents how to host their bytes, mirror them, encrypt them, or expire them. A standalone Vault solves that once.
- Multiple domains can share the same trust ledger. A theology vault (Lollards), a medical-research vault, a journalism vault, and a contracts vault can each run their own deployment with their own storage policy and credentials, but anchor every signature to the same Diogenes log. Verification looks identical across domains.
- Diogenes itself becomes more honest. The same change that closes the gap for files closes it for keys, endorsements, attestation requests, and trust configs. Every signed log event becomes signature = sign(SHA-256(canonical_json(payload))): one universal rule, no special cases, no informational-metadata loophole.
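The difference between the narrow rule and the universal rule can be illustrated with a small sketch. It is not Diogenes code: HMAC-SHA256 with a throwaway secret stands in for a real signature scheme, canonical_json is a simplified stand-in for RFC 8785, and the payload field names are illustrative assumptions.

```python
import hashlib
import hmac
import json

SECRET = b"demo-only-secret"  # stand-in for a private signing key (assumption)


def canonical_json(payload: dict) -> bytes:
    # Simplified canonical form: sorted keys, no insignificant whitespace.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def sign_narrow(payload: dict) -> str:
    # The behaviour described in "The gap": only content_hash is covered.
    return hmac.new(SECRET, payload["content_hash"].encode(), hashlib.sha256).hexdigest()


def sign_universal(payload: dict) -> str:
    # Proposed rule: sign the SHA-256 of the canonical JSON of the whole payload.
    digest = hashlib.sha256(canonical_json(payload)).digest()
    return hmac.new(SECRET, digest, hashlib.sha256).hexdigest()


original = {
    "content_hash": hashlib.sha256(b"example bytes").hexdigest(),
    "license": "CC-BY-4.0",
}
tampered = {**original, "license": "All rights reserved"}

# Narrow signing cannot tell the two payloads apart; universal signing can.
narrow_still_matches = sign_narrow(original) == sign_narrow(tampered)
universal_still_matches = sign_universal(original) == sign_universal(tampered)
```

Under the narrow rule the swapped license goes unnoticed (narrow_still_matches is True); under the universal rule the two signatures differ.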
Principles
Five commitments shape the design.
- Diogenes stays a pure trust ledger. No blob storage in Diogenes itself. Vault is separate.
- Vault is tenantless, like Diogenes. Identity is the publisher's Diogenes key fingerprint. There is no per-deployment "tenant" concept, no tenant field anywhere. Operators who want isolation run separate Vault deployments; that is an ops choice, not an architectural one.
- Vault holds no signing keys and no plaintext data-encryption keys. Clients sign locally with their already-registered Diogenes keys. Encryption uses companion wrap-keys registered alongside signing keys; Vault stores ciphertext and wrapped DEKs, never plaintext DEKs.
- Metadata-locking lives in Diogenes core, not in Vault. Vault contributes a few namespaced fields to the document metadata (storage descriptors, encryption envelope, byte size); the universal-signing mechanism inside Diogenes locks those fields into the signature exactly the same way it locks the title or license. Vault does not reinvent canonicalization or signing.
- The Diogenes attestation IS the receipt. There is no separate "VaultRecord" signed artifact. Vault stores a denormalized cache of attestation IDs to file locations; the source of truth is the attestation on the Diogenes log.
Shape of the system
graph LR
Client["Publisher (signs locally)"] --> Vault["Diogenes Vault<br/>(storage + relay)"]
Vault --> Diogenes["Diogenes<br/>(trust ledger)"]
Vault --> S3[(S3 / Local / IPFS)]
Lollards["Lollards (consumer)"] --> Vault
Lollards --> Diogenes
A typical publish:
- Publisher computes the file's SHA-256 locally.
- Publisher uploads bytes to a Vault deployment; Vault writes them via the configured storage backend and returns a storage descriptor.
- Publisher assembles a Diogenes Manifest with full metadata (title, license, edition, publisher identity) plus Vault-namespaced extensions (storage descriptors, byte size, optional encryption envelope).
- Publisher signs the manifest locally using their Diogenes-registered signing key — under the new universal mechanism, the signature covers the canonical JSON of the entire metadata block.
- Vault relays the signed manifest to Diogenes; Diogenes records the attestation on the log.
- Vault returns a receipt containing the Diogenes attestation ID. That ID is the verifiable proof, anchored on the public log.
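The client-side part of these steps (hashing the file and assembling the manifest) might look roughly like the sketch below. The manifest layout and the build_manifest helper are assumptions made for illustration; signing and relaying to Diogenes are left out.

```python
import hashlib
from typing import Any


def build_manifest(
    data: bytes,
    metadata: dict[str, Any],
    storage: list[dict[str, str]],
) -> dict[str, Any]:
    """Assemble an unsigned manifest (hypothetical shape) for the given bytes."""
    return {
        # Step 1: the publisher hashes the file locally.
        "content_hash": hashlib.sha256(data).hexdigest(),
        # Step 3: publisher metadata plus Vault-namespaced extensions.
        "metadata": {
            **metadata,
            "extensions": {
                "x-vault.diogenes.com:storage": storage,
                "x-vault.diogenes.com:encryption": None,
                "x-vault.diogenes.com:byte_size": len(data),
            },
        },
    }


manifest = build_manifest(
    b"<osis/>",
    {"title": "Example Bible", "license": "CC-BY-4.0"},
    # In the real flow this descriptor would come back from the Vault upload.
    [{"backend": "local", "address": "sha256/example"}],
)
```

Everything in the returned dictionary would then be covered by the publisher's signature in step 4.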
Verification is the inverse: given a Vault record ID, anyone can re-fetch bytes from any signed storage descriptor, recompute the hash, fetch the attestation from Diogenes, recompute the canonical-JSON signature input, and confirm both the bytes and every locked metadata field. An offline bundle (bytes + manifest + attestation + log proof) supports the same check with no network.
First consumer: Lollards
Lollards is a theology and Bible RAG service that already integrates Diogenes for source-level verification. It ingests source files (OSIS Bibles, confession YAMLs, PDFs) defined in sources/*.yaml, chunks them, embeds them with sentence-transformers, and stores chunks in a vec_chunks table with their content hashes. The database has reserved-but-unpopulated columns for chunk-level signatures.
Vault gives Lollards two things in v1:
- Source-file verified storage. The whole canonical artifact (the OSIS XML, the confession PDF) gets uploaded to Vault, signed under Diogenes' new universal mechanism with the publisher's identity, license, edition, and language locked into the signature. The sources row gains a vault_record_id and a diogenes_attestation_id. The source-detail popup gains a Vault badge alongside the existing Diogenes badge.
- A path to chunk-level verification in v2. Today Lollards' reserved chunk-signature columns sit empty. v2 ships a single Diogenes attestation per source whose signed metadata includes a Merkle root over all chunk hashes. Each chunk gets a cheap inclusion proof at retrieval time, without needing millions of separate attestations.
Other use cases
Vault is intentionally generic. Sketches of how other domains would consume it:
- Medical research. Private Vault, on-prem S3 with no public read, encryption-required-by-default. Patient records uploaded with the DEK wrapped to the publisher and a designated custodian. Audit log records every access. Erasure via key-shredding (delete wrapped DEKs from the recipients table) when legally required.
- Investigative journalism. Public Vault for published pieces, encrypted Vault for embargoed work; encryption recipients are the editor and lead reporter until the embargo lifts. A reader can verify offline that today's article is byte-for-byte the published version with the original byline, without trusting the publication's own infrastructure.
- Contracts. Multi-party signing via Diogenes' attestation DAG: each party's signature is its own attestation; the contract bytes live in Vault; renegotiations create successor attestations linked via predecessors.
- LLM source corpora with vector search. Same shape as Lollards: corpus files in Vault with locked publisher and license metadata; chunk indexes in the consumer's own database; provenance signals surface in the retrieval UI.
In every case the trust pattern is identical because the underlying mechanism is identical. The differences are operational: which storage backend, which encryption defaults, who runs the Vault.
Technical sketch
This section captures the load-bearing technical decisions for someone implementing. The full plan including phased delivery and file-by-file changes lives in the personal plan file i-need-to-design-stateful-elephant.md.
Universal payload signing (Diogenes core change)
One new rule, applied to every signed log event:
sig_input = sha256_canonical_json(event.payload) # RFC 8785, no special cases
signature = signer.sign(sig_input)
A new module src/diogenes/common/canonical.py provides sha256_canonical_json (RFC 8785 / JCS canonicalization, then SHA-256, lowercase hex). A new helper event_signature_input(payload: BaseModel) -> bytes wraps it for use by every signer. Existing compute_signed_payload becomes a thin shim that builds an AttestationPayload model and calls the helper.
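A minimal standard-library sketch of what sha256_canonical_json might look like follows. It only approximates RFC 8785: json.dumps with sorted keys and compact separators agrees with JCS for payloads made of strings, integers, booleans, null, and nested objects and arrays with ordinary keys, but it does not implement JCS number serialization or UTF-16 key ordering, so floats are rejected outright. It also takes plain Python values rather than a Pydantic model and returns the hex digest as a string; the real helper's signature may differ.

```python
import hashlib
import json
from typing import Any


def _reject_floats(value: Any) -> None:
    # JCS number formatting is not implemented here, so refuse floats.
    if isinstance(value, float):
        raise TypeError("floats are not supported by this sketch")
    if isinstance(value, dict):
        for item in value.values():
            _reject_floats(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            _reject_floats(item)


def canonical_json(value: Any) -> bytes:
    """Approximate RFC 8785 output for float-free JSON values."""
    _reject_floats(value)
    text = json.dumps(
        value,
        sort_keys=True,          # deterministic member order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # emit UTF-8 rather than \u escapes
        allow_nan=False,
    )
    return text.encode("utf-8")


def sha256_canonical_json(value: Any) -> str:
    """SHA-256 over the canonical bytes, as lowercase hex."""
    return hashlib.sha256(canonical_json(value)).hexdigest()
```

The property that matters for signing is that two payloads with the same content always produce the same bytes, regardless of key order or whitespace in whatever form they arrived in.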
For attestations specifically, the scope becomes:
scope = {"content_hash": "...", "metadata_hash": metadata_hash(metadata)}
scope_hash = sha256_canonical_json(scope)
DocumentMetadata becomes a frozen Pydantic model with an open extensions: dict[str, Any] whose keys must be reverse-domain-namespaced and follow the same governance as attestation type extensions.
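The scope computation and the namespacing rule for extension keys could be sketched as below. Two things are assumptions: the key grammar (inferred only from examples such as x-vault.diogenes.com:storage) and the definition of metadata_hash as the canonical-JSON hash of the metadata block. The canonicalization helper is a simplified stand-in for the RFC 8785 one.

```python
import hashlib
import json
import re
from typing import Any

# Assumed key grammar, inferred from examples like "x-vault.diogenes.com:storage".
EXTENSION_KEY = re.compile(r"^x-[a-z0-9]+(?:[.-][a-z0-9]+)*:[a-z0-9_]+$")


def _sha256_canonical_json(value: Any) -> str:
    # Simplified stand-in for the RFC 8785 helper.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def validate_extensions(extensions: dict[str, Any]) -> None:
    """Reject extension keys that are not namespaced."""
    bad = [key for key in extensions if not EXTENSION_KEY.fullmatch(key)]
    if bad:
        raise ValueError(f"extension keys must be namespaced: {bad}")


def scope_hash(content_hash: str, metadata: dict[str, Any]) -> str:
    validate_extensions(metadata.get("extensions", {}))
    # Assumption: metadata_hash is the canonical-JSON hash of the metadata block.
    scope = {
        "content_hash": content_hash,
        "metadata_hash": _sha256_canonical_json(metadata),
    }
    return _sha256_canonical_json(scope)
```

Because metadata_hash feeds into the scope, changing any metadata field, including anything under extensions, changes scope_hash even when the file bytes are untouched.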
Decision (issue #361: generic domain-extension framework). Diogenes core carries no domain-specific fields. The publication fields once proposed here (publisher_id, publisher_pseudonym, license, edition, language) do not become first-class fields on core DocumentMetadata. They ship as the registered x-vault.diogenes.com:publication metadata extension (a VaultPublicationMetadata Pydantic model owned by src/diogenes/domains/vault/), validated through the generic MetadataExtensionRegistry. Likewise doi is no longer a core field; it ships as the registered x-vault.diogenes.com:doi scholarly extension. Diogenes stays a pure trust ledger; vault, media, OSS, and contracts each contribute their own payloads and schemas through the one generic DomainExtension registry described in #361, without editing Diogenes core. This supersedes the earlier "first-class cross-domain fields" framing and resolves the self-contradiction with the "Vault contributes through metadata.extensions" section below.
The same payload-canonicalization rule is applied to KeyRegistrationPayload, KeySuccessionPayload, KeyRevocationPayload, CredentialEnrollmentPayload, EndorsementOfferPayload, EndorsementAcceptPayload, AttestationRequestPayload, and TrustConfigPayload. Every one becomes frozen and gets its signature definition rewritten as event_signature_input(payload).
Because Diogenes is pre-release, this is a clean break — existing log entries are wiped via Alembic migration and scripts/seed_demo.py regenerates production-shape demo data. No backward-compat shim.
Predicate type naming (in-toto convention)
Phase 0 also moves attestation type identifiers from short strings (authorship, x-vault.diogenes.com:publication) to URIs that double as in-toto predicateType values, served from https://schemas.trustdiogenes.com/predicates/<name>/v1. The URLs resolve to JSON schema documents for the predicate, hosted via a CNAME from the Diogenes repo's docs/schemas/ tree — no new infrastructure beyond a subdomain.
This earns interop with DSSE + in-toto + Sigstore tooling (see #308 for the export adapter that uses these URIs as predicateType) without committing Diogenes to the Sigstore identity model. Sample translations:
| Old type | New URI |
|---|---|
| authorship | https://schemas.trustdiogenes.com/predicates/authorship/v1 |
| publication | https://schemas.trustdiogenes.com/predicates/publication/v1 |
| custodial_transcription | https://schemas.trustdiogenes.com/predicates/custodial-transcription/v1 |
| x-oss:release | https://schemas.trustdiogenes.com/predicates/oss-release/v1 |
| x-vault.diogenes.com:publication | https://schemas.trustdiogenes.com/predicates/vault-publication/v1 |
Third-party namespacing rule changes from "reverse-domain colon-delimited" to "any URL the registrant controls," with the same first-fingerprint-wins governance via the transparency log.
Important scope distinction: only the attestation type / predicateType becomes a URI. The DocumentMetadata.extensions dictionary keys stay in the x-namespace:keyname reverse-domain form (x-vault.diogenes.com:storage, x-lollards.org:chunks_merkle_root). Two governance pools, two conventions: URIs for types, reverse-domain colons for metadata extension keys.
This change lands in Phase 0 because Phase 0 is already a breaking schema change — doing both renames together costs one migration and one fixture sweep instead of two.
Attestation-envelope unification (Phase 11 Decision 10)
Decision (Phase 11 Decision 10, landed). The attestation log event has no dedicated outer event_type discriminator and no inner attestation_type payload field. Instead, an attestation event's event_type is the predicate URL itself: for example, an authorship attestation sets event_type = https://schemas.trustdiogenes.com/predicates/authorship/v1. The formerly inner attestation_type field is dropped as redundant: its value is now carried by the envelope's event_type.
Rationale. Before this decision the attestation envelope was a special case: every other event type already carried a dereferenceable URL in event_type, but attestation used the bare string "attestation" and stashed the real type one level down in the payload. Unifying removes that special case: verifier dispatch becomes a single uniform rule for every event type, "resolve the JSON Schema at event_type, validate the payload against it." The Phase 7 predicate schema documents (docs/schemas/predicates/<slug>/v1.json) therefore do double duty as the attestation envelope schema; no separate meta-schema is needed. This is a breaking wire-format change, but acceptable because Diogenes is pre-release (no production log entries) and the Phase 11 log-wipe Alembic migration already absorbs the canonical-bytes change at no extra migration cost.
Status. This decision is already implemented in code. Phase 11 PR 1 (#338) and follow-ups (#332, #342, #345) dropped the EventType.ATTESTATION member, added the is_attestation_event_type() helper (a predicate-URL namespace check that replaces the old event_type == EventType.ATTESTATION comparisons), mapped every predicate URL to AttestationLogPayload in PAYLOAD_MODELS for uniform dispatch, and shipped the wipe migration alembic/versions/c9e0f1a2b3d4_phase11_event_type_url_migration.py.
Consequence for Vault. A vault publication is just an attestation whose predicate is vault-publication. Under Decision 10 it therefore sets event_type = https://schemas.trustdiogenes.com/predicates/vault-publication/v1 directly on the log event; there is no inner attestation_type field to set. Vault Phase 0 builds against this unified envelope: emit the predicate URL as event_type, put the manifest/scope data in the payload, and rely on the same universal event_signature_input(payload) signing rule described above. No attestation-specific envelope handling is required on either the writer or the verifier side.
Vault contributes through metadata.extensions
There is no separate signed VaultRecord. Vault adds a few namespaced fields to the Diogenes manifest's DocumentMetadata.extensions:
{
"x-vault.diogenes.com:storage": [
{"backend": "s3", "address": "s3://.../sha256/9f/86/9f86d08188...xml"},
{"backend": "ipfs", "address": "ipfs://bafybe..."}
],
"x-vault.diogenes.com:encryption": null,
"x-vault.diogenes.com:byte_size": 4823195,
"x-lollards.org:chunks_merkle_root": "sha256:..." // v2 only
}
Because extensions is part of DocumentMetadata, every entry above is signature-locked via the universal mechanism. Tampering with a storage URL or the encryption envelope invalidates the Diogenes attestation just as tampering with content_hash would.
Storage adapters
from typing import Iterator, Protocol
# StorageDescriptor is Vault's descriptor type, defined elsewhere.
class StorageAdapter(Protocol):
    backend_name: str
    def put(self, content_hash: str, data: bytes, *, media_type: str | None) -> StorageDescriptor: ...
    def get(self, descriptor: StorageDescriptor) -> bytes: ...
    def stream(self, descriptor: StorageDescriptor) -> Iterator[bytes]: ...
    def exists(self, descriptor: StorageDescriptor) -> bool: ...
    def delete(self, descriptor: StorageDescriptor) -> None: ...
v1 ships local-filesystem, S3, and IPFS adapters. S3 and local use a content-hash-sharded key layout (sha256/<2>/<2>/<full>); IPFS uses the CID as the address and verifies on put that the CID's embedded multihash matches the declared content hash (reusing Diogenes' existing core/resource_hints.py:verify_cid_preflight).
Adding storage after signing produces a new attestation whose predecessors references the original — the linked-attestation chain matches Diogenes' DAG model.
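For a sense of how small an adapter can be, here is a sketch of a local-filesystem adapter using the sha256/<2>/<2>/<full> layout. The StorageDescriptor dataclass is a hypothetical shape modelled on the backend/address pairs used in the extensions example, the address is stored as a key relative to the adapter root, and the hash check in put is this sketch's own choice. A real adapter would also need to validate addresses before touching the filesystem.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional


@dataclass(frozen=True)
class StorageDescriptor:
    # Hypothetical shape: which backend holds the bytes, and where.
    backend: str
    address: str


class LocalStorageAdapter:
    backend_name = "local"

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _key(self, content_hash: str) -> str:
        # Content-hash-sharded layout: sha256/<2>/<2>/<full>
        return f"sha256/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"

    def put(self, content_hash: str, data: bytes, *,
            media_type: Optional[str] = None) -> StorageDescriptor:
        if hashlib.sha256(data).hexdigest() != content_hash:
            raise ValueError("data does not match the declared content_hash")
        key = self._key(content_hash)
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return StorageDescriptor(backend=self.backend_name, address=key)

    def get(self, descriptor: StorageDescriptor) -> bytes:
        return (self.root / descriptor.address).read_bytes()

    def stream(self, descriptor: StorageDescriptor,
               chunk_size: int = 65536) -> Iterator[bytes]:
        with (self.root / descriptor.address).open("rb") as handle:
            while chunk := handle.read(chunk_size):
                yield chunk

    def exists(self, descriptor: StorageDescriptor) -> bool:
        return (self.root / descriptor.address).is_file()

    def delete(self, descriptor: StorageDescriptor) -> None:
        (self.root / descriptor.address).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    adapter = LocalStorageAdapter(Path(tmp))
    data = b"<osis/>"
    digest = hashlib.sha256(data).hexdigest()
    descriptor = adapter.put(digest, data)
    round_trip = adapter.get(descriptor)
    streamed = b"".join(adapter.stream(descriptor))
    existed = adapter.exists(descriptor)
    adapter.delete(descriptor)
    exists_after_delete = adapter.exists(descriptor)
```

Sharding by the first two byte pairs of the hash keeps any single directory from accumulating every object, and because the key is derived from the content hash, writing the same bytes twice is idempotent.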
Encryption envelope
AES-256-GCM data key, wrapped per recipient. Diogenes signing keys (Ed25519 / ECDSA-P256 / RSA) cannot directly wrap a symmetric key, so each Diogenes key registration gains a companion wrap-public-key and wrap-algorithm:
- Ed25519 signer → X25519 wrap key (X25519+HKDF-SHA256+AES-KW-256)
- ECDSA-P256 signer → ECDH-P256 (ECDH-P256+HKDF-SHA256+AES-KW-256)
- RSA signer → RSA-OAEP-SHA256
The wrapped DEK list is part of the canonical metadata, so the recipient set is signature-bound — a recipient cannot be silently swapped post-signing. Vault stores the ciphertext and the wrapped DEKs; clients fetch ciphertext and unwrap their DEK locally.
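The pairing of signing algorithm to wrap algorithm, and the shape of a recipient list, could be sketched as follows. No cryptography happens here: the wrapped DEKs are treated as opaque bytes produced elsewhere. The envelope field names, the algorithm identifiers used as dictionary keys, and the choice to sort recipients by fingerprint are all assumptions made for illustration.

```python
import base64
from typing import Any

# Signing algorithm -> companion wrap algorithm, as listed above.
WRAP_ALGORITHMS = {
    "Ed25519": "X25519+HKDF-SHA256+AES-KW-256",
    "ECDSA-P256": "ECDH-P256+HKDF-SHA256+AES-KW-256",
    "RSA": "RSA-OAEP-SHA256",
}


def wrap_algorithm_for(signing_algorithm: str) -> str:
    try:
        return WRAP_ALGORITHMS[signing_algorithm]
    except KeyError:
        raise ValueError(f"no wrap algorithm defined for {signing_algorithm!r}") from None


def build_envelope(recipients: dict[str, tuple[str, bytes]]) -> dict[str, Any]:
    """recipients maps key fingerprint -> (signing algorithm, already-wrapped DEK)."""
    return {
        "cipher": "AES-256-GCM",
        "recipients": [
            {
                "fingerprint": fingerprint,
                "wrap_algorithm": wrap_algorithm_for(algorithm),
                "wrapped_dek": base64.b64encode(wrapped).decode("ascii"),
            }
            # Sorted so the same recipient set always yields the same envelope.
            for fingerprint, (algorithm, wrapped) in sorted(recipients.items())
        ],
    }


envelope = build_envelope({
    "fp-b": ("RSA", b"\x01\x02"),
    "fp-a": ("Ed25519", b"\x03\x04"),
})
```

Because the envelope sits inside the signed metadata, adding, removing, or replacing an entry in recipients changes the canonical JSON and therefore invalidates the signature.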
Verification flow
Online verification (GET /api/v1/records/{vault_id}/verify):
- Vault loads the signed manifest from its denormalized cache.
- Vault recomputes the universal signature input from the manifest and confirms the signature verifies against the publisher's Diogenes-registered key.
- Vault checks the key isn't revoked and the attestation type is https://schemas.trustdiogenes.com/predicates/vault-publication/v1.
- Vault streams bytes from any signed storage descriptor, recomputes SHA-256 (or ciphertext_hash for encrypted records), and confirms it matches content_hash in the signed metadata.
- Returns {byte_integrity, attestation_status, key_status, trust_status, metadata, diogenes_attestation_id}.
Offline verification ships an export bundle (bytes + manifest.json + attestation.json + log_proof.json) and a diogenes-vault verify command that runs the entire chain with no network access.
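The core of either verification path can be sketched in a few lines. This is an illustration under stated assumptions, not the Vault verifier: HMAC-SHA256 stands in for the publisher's signature scheme, the event layout and the status strings are invented for the example, revocation is passed in as a flag, and the trust and log-proof checks are omitted.

```python
import hashlib
import hmac
import json
from typing import Any, Callable

VAULT_PUBLICATION = "https://schemas.trustdiogenes.com/predicates/vault-publication/v1"


def signature_input(payload: dict[str, Any]) -> bytes:
    # Simplified stand-in for the RFC 8785 based helper.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(text.encode("utf-8")).digest()


def verify_record(
    event: dict[str, Any],
    signature: bytes,
    data: bytes,
    verify_sig: Callable[[bytes, bytes], bool],
    key_revoked: bool,
) -> dict[str, str]:
    payload = event["payload"]
    sig_ok = verify_sig(signature_input(payload), signature)
    type_ok = event["event_type"] == VAULT_PUBLICATION
    bytes_ok = hashlib.sha256(data).hexdigest() == payload["content_hash"]
    return {
        "byte_integrity": "ok" if bytes_ok else "mismatch",
        "attestation_status": "valid" if (sig_ok and type_ok) else "invalid",
        "key_status": "revoked" if key_revoked else "active",
    }


# Demo wiring: an HMAC with a throwaway secret plays the role of the signer.
SECRET = b"demo-only-secret"


def demo_sign(message: bytes) -> bytes:
    return hmac.new(SECRET, message, hashlib.sha256).digest()


def demo_verify(message: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(demo_sign(message), signature)


data = b"example bytes"
event = {
    "event_type": VAULT_PUBLICATION,
    "payload": {
        "content_hash": hashlib.sha256(data).hexdigest(),
        "metadata": {"license": "CC-BY-4.0"},
    },
}
signature = demo_sign(signature_input(event["payload"]))
result = verify_record(event, signature, data, demo_verify, key_revoked=False)
```

Byte integrity and attestation validity are reported separately on purpose: altered bytes with an intact manifest fail only the first check, while an edited license string fails only the second.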
Phased delivery
- Phase 0 — Diogenes universal metadata-locking + wrap keys. Lands first because Vault depends on it.
- Phase 1 — Vault MVP: local storage + plaintext only.
- Phase 2 — S3 backend + encryption envelope.
- Phase 3 — Lollards integration: source-level verified storage.
- Phase 4 — IPFS backend + chunk Merkle (v2 chunk-level proofs for Lollards).
Status and next steps
This is a concept, not an implementation. The next step is the work tracked in the companion GitHub issues — Phase 0 in Diogenes (universal metadata-locking + wrap keys), then a new diogenes-vault repo for the service itself, then Lollards integration.
Discussion, pushback, and use-case suggestions belong on the companion issues.