Voynich Lucidity — Research Summary

Computational Typology of the Voynich Manuscript · voynichlucidity.com · 2026

What We Did

We applied BPE VMML (Byte-Pair Encoding Mean Vocabulary Morpheme Length), a metric of the average structural complexity of word-internal segmentation, to 63+ corpora from 35+ language families. The goal was to determine where the Voynich Manuscript sits in the typological space of known writing systems.

Key Metric: BPE VMML

BPE VMML measures the mean length of the morpheme-like segments produced by unsupervised Byte-Pair Encoding. Higher values indicate denser morpheme structure. An "alphabetic ceiling" of 5.748 was derived empirically (Nietzsche, Also sprach Zarathustra, n=82,802 tokens, Paper 7 v2.1). The Voynich Manuscript scores VMML = 5.918, above that ceiling, with Boundary Concentration (BC) = 0.361.
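To make the idea concrete, here is a minimal sketch of a BPE-based mean segment length. It is not the Paper 7 protocol: the merge budget `num_merges`, the choice to average over merged symbols only (single characters excluded), and the function names `learn_bpe` and `mean_vocab_segment_length` are all assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges; return the merged symbols in order."""
    # Each word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words if w)
    merged_symbols = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        (a, b), _count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        merged = a + b
        merged_symbols.append(merged)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merged_symbols


def mean_vocab_segment_length(words, num_merges):
    """Mean character length of the learned (merged) BPE symbols.

    Assumption: this stands in for VMML; the published definition may differ.
    """
    merged = learn_bpe(words, num_merges)
    if not merged:
        return 0.0
    return sum(len(s) for s in merged) / len(merged)
```

Under these assumptions, a corpus whose frequent character sequences fuse into long units scores higher than one whose merges stay short, which is the sense in which higher values indicate denser structure.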

Selected Results — Austronesian Expansion

| Corpus | Family / Branch | n tokens | VMML | BC | Above ceiling? |
| --- | --- | --- | --- | --- | --- |
| Voynich MS (Paper 7 canonical) | Unknown | 33,803 | 5.918 | 0.361 | ✓ Reference |
| Tagalog: Noli Me Tangere (full) | Philippine | 173,885 | 5.914 | 0.202 | ✓ Yes |
| Tagalog: El filibusterismo (full) | Philippine | 114,777 | 5.578 | 0.213 | ✗ No |
| Ilocano (Wikipedia, n=12.8k) | Philippine | 12,807 | 5.785 | 0.248 | ⚠ Provisional |
| Cebuano NLLB (n=60k) | Philippine | 60,000 | 5.609 | 0.294 | ✗ No |
| Malay (Wikipedia) | Malayo-Polynesian | 49,742 | 5.293 | 0.260 | ✗ No |
| Indonesian (Wikipedia) | Malayo-Polynesian | 50,350 | 4.838 | 0.272 | ✗ No |
| Basque (Wikipedia natural text) | Isolate (control) | 48,621 | 4.475 | 0.309 | ✗ No |
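The "Above ceiling?" column can be reproduced from the VMML values by comparing each one with the 5.748 ceiling. The sketch below does only that; the names `ALPHABETIC_CEILING`, `ROWS`, and `above_ceiling` are illustrative, and the strict greater-than comparison is an assumption. The "Provisional" label for Ilocano reflects a judgment about its small corpus that this check does not model.

```python
ALPHABETIC_CEILING = 5.748  # empirical ceiling quoted from Paper 7 v2.1

# (corpus, VMML, BC) copied from the table above.
ROWS = [
    ("Voynich MS", 5.918, 0.361),
    ("Tagalog: Noli Me Tangere", 5.914, 0.202),
    ("Tagalog: El filibusterismo", 5.578, 0.213),
    ("Ilocano (Wikipedia)", 5.785, 0.248),
    ("Cebuano NLLB", 5.609, 0.294),
    ("Malay (Wikipedia)", 5.293, 0.260),
    ("Indonesian (Wikipedia)", 4.838, 0.272),
    ("Basque (Wikipedia)", 4.475, 0.309),
]


def above_ceiling(vmml, ceiling=ALPHABETIC_CEILING):
    """True when a VMML value exceeds the ceiling (strict comparison assumed)."""
    return vmml > ceiling


supra_ceiling = [name for name, vmml, _bc in ROWS if above_ceiling(vmml)]
print(supra_ceiling)
```

Only the Voynich Manuscript, Noli Me Tangere, and the Ilocano sample pass this check, matching the table.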

Key Finding — Paper 8

Tagalog cross-text instability: Full-corpus computation of El filibusterismo (n=114,777, identical protocol) yields VMML = 5.578, below the alphabetic ceiling of 5.748. The cross-text Δ is 0.336 units (Noli 5.914 − Filibusterismo 5.578), larger than the width of the Voynich discriminant zone (0.28 units). Supra-ceiling Tagalog VMML is therefore text-specific, not a language property.
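The arithmetic behind this comparison is short enough to check directly. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
NOLI_VMML = 5.914            # Tagalog: Noli Me Tangere (full)
FILIBUSTERISMO_VMML = 5.578  # Tagalog: El filibusterismo (full)
ZONE_WIDTH = 0.28            # width of the Voynich discriminant zone, as quoted

# Rounded to three decimals to match the precision of the reported values.
cross_text_delta = round(NOLI_VMML - FILIBUSTERISMO_VMML, 3)
exceeds_zone = cross_text_delta > ZONE_WIDTH

print(cross_text_delta, exceeds_zone)
```

Two texts in the same language differ by more than the full width of the zone, so a single Tagalog text landing near the Voynich value cannot be read as a property of Tagalog.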

Character permutation test: A corpus-size-matched character shuffle (50 iterations, all corpora at n=19,968) shows that Voynich VMML drops by 22.0% (CI95: [5.07, 5.82]), whereas Tagalog VMML drops by 12–14% (CI95: [4.77, 5.04]). The CI95 ranges do not overlap. Despite similar VMML, Voynich and Tagalog reach it through structurally different mechanisms.
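A test of this kind can be sketched as follows. The shuffle scheme (pooling all characters and refilling words at their original lengths), the simple percentile interval, and every function name here are assumptions for illustration; the Paper 8 procedure may differ. The `metric` argument is any function that maps a list of words to a score, such as a VMML implementation.

```python
import random


def shuffle_characters(words, rng):
    """Permute characters across the corpus while keeping every word's length."""
    chars = [c for w in words for c in w]
    rng.shuffle(chars)
    shuffled, i = [], 0
    for w in words:
        shuffled.append("".join(chars[i:i + len(w)]))
        i += len(w)
    return shuffled


def percentile_interval(values, lower=2.5, upper=97.5):
    """Approximate percentile interval by picking the nearest sorted value."""
    ordered = sorted(values)

    def pick(p):
        return ordered[round(p / 100 * (len(ordered) - 1))]

    return pick(lower), pick(upper)


def permutation_interval(words, metric, iterations=50, seed=0):
    """Score `iterations` shuffled copies of the corpus and return a 95% interval."""
    rng = random.Random(seed)
    scores = [metric(shuffle_characters(words, rng)) for _ in range(iterations)]
    return percentile_interval(scores)


def intervals_overlap(a, b):
    """True when closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Intervals as reported above.
VOYNICH_CI95 = (5.07, 5.82)
TAGALOG_CI95 = (4.77, 5.04)
print(intervals_overlap(VOYNICH_CI95, TAGALOG_CI95))
```

Because the shuffle preserves the character inventory and the word-length distribution, any change in the score reflects the loss of character ordering rather than a change in what the corpus contains.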

What It Isn't (summary)

No tested corpus from any of 35+ language families simultaneously occupies the Voynich discriminant zone (VMML, BC, CBMI). Tagalog is the closest match on VMML, but it fails BC decisively (0.202 for Noli Me Tangere vs 0.361 for Voynich) and shows cross-text VMML instability. Within this sample, no natural language replicates the Voynich structural profile.

Published Preprints

Paper 7 (typological validation, 55+ corpora): doi.org/10.5281/zenodo.20386119

Paper 8 (Tagalog cross-text instability + Austronesian expansion): doi.org/10.5281/zenodo.20467972