What is BPE VMML?
BPE (Byte-Pair Encoding) is an unsupervised algorithm originally developed for text compression. Applied to a language corpus, it repeatedly merges the most frequent pair of adjacent symbols (initially single characters) into a single token, iteratively building a vocabulary of morpheme-like units from raw character sequences.
VMML (Vocabulary Morpheme Mean Length) measures the average length of the tokens BPE learns from a corpus. A higher VMML indicates that the algorithm is merging characters into longer, more complex units — a signature of richer internal word structure. A lower VMML suggests simpler, shorter morphological patterns.
Crucially, BPE is trained on each corpus independently. It learns the structure that exists in the data without any prior linguistic knowledge — no dictionaries, no grammars, no labeled morpheme boundaries.
| Parameter | Value | Purpose |
|---|---|---|
| max_merges | 200 | Vocabulary ceiling; caps BPE learning depth |
| min_freq | 3 | Suppresses hapax noise; requires pattern recurrence |
| Token cap | 80,000 | Computational parity across corpora |
| Transcription | EVA | Standard Voynich character encoding |
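The procedure and parameters above can be sketched in Python. This is a minimal illustration, not the pipeline used in the papers. The function names `learn_bpe` and `vmml` are hypothetical, `min_freq` is assumed to be the minimum pair frequency a merge must reach, and VMML is assumed to be the unweighted mean character length of the merged tokens.

```python
from collections import Counter


def learn_bpe(words, max_merges=200, min_freq=3):
    """Learn BPE merges from a list of words; return the merged tokens in order."""
    # Each distinct word is a tuple of symbols (single characters at first).
    vocab = Counter(tuple(w) for w in words if w)
    learned = []
    for _ in range(max_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties are broken deterministically by the pair itself.
        (a, b), freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_freq:  # assumed meaning of min_freq: stop on rare pairs
            break
        learned.append(a + b)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return learned


def vmml(learned):
    """Mean character length of the learned tokens (assumed VMML definition)."""
    return sum(len(t) for t in learned) / len(learned) if learned else 0.0
```

On a toy input such as `["abab"] * 3 + ["ab"] * 2`, the first merge produces `ab` and the second produces `abab`, after which no pairs remain. Real corpora would first be transcribed (EVA for Voynich) and capped before training.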
The Alphabetic Ceiling
One of our most robust empirical findings is what we call the Alphabetic Ceiling: an upper bound on BPE VMML observed across all tested alphabetic natural languages. No alphabetic natural language we have tested exceeds this value under our standard parameters.
The ceiling was empirically derived from Also sprach Zarathustra (Nietzsche), the highest-scoring alphabetic text identified in our initial 35-corpus scan. German was selected as the ceiling anchor because its compound-heavy morphology naturally produces high VMML — and still it falls below Voynich.
The ceiling is not a theoretical threshold. It is a descriptive upper bound observed across tested corpora. Its stability across additional corpora and language families is itself a result — not an assumption.
Why 63 Corpora?
To draw any meaningful conclusion about where the Voynich Manuscript sits typologically, we needed a baseline that spans the known diversity of human language structure — not just a handful of European languages.
The corpus set was designed to cover all major morphological types:
- Agglutinative: Finnish, Basque, Tagalog, Ilocano, Turkish. These languages add discrete, separable morphemes to roots and were expected to produce higher VMML.
- Analytic: English, Mandarin Chinese, Vietnamese. These languages rely on word order rather than morphology and were expected to produce lower VMML.
- Fusional: Latin, Ancient Greek, Russian. Morphemes fuse and alter each other; intermediate VMML expected.
- Isolating: Cantonese, Thai. Minimal inflection; low VMML expected.
- Unknown scripts: Voynich Manuscript (EVA), Rohonc Codex. The subjects of comparison.
Boundary Concentration (BC)
VMML alone is insufficient to characterise morphological structure — two languages can share similar VMML while reaching it through entirely different mechanisms. Boundary Concentration was introduced to resolve this ambiguity.
BC measures what fraction of BPE morpheme boundaries fall at edge positions — the first or last 20% of the token. A high BC indicates that most morphological work happens at word edges: prefixes and suffixes. A low BC indicates that morphological boundaries are distributed internally — a signature of infix-heavy systems like Philippine focus languages.
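A minimal sketch of this calculation, under assumptions the definition does not spell out: each word is supplied as its list of BPE pieces, a boundary's position is its character offset divided by the word length, and positions exactly on the window limit count as edge. The name `boundary_concentration` is hypothetical.

```python
def boundary_concentration(segmented_words, window=0.20):
    """Fraction of internal boundaries lying in the first or last `window` of a word.

    `segmented_words` is an iterable of piece lists, e.g. [["un", "lock", "ed"], ...].
    """
    edge = 0
    total = 0
    for pieces in segmented_words:
        length = sum(len(p) for p in pieces)
        offset = 0
        # Every piece except the last one ends at an internal boundary.
        for piece in pieces[:-1]:
            offset += len(piece)
            relative = offset / length
            total += 1
            if relative <= window or relative >= 1 - window:
                edge += 1
    # No boundaries at all (only unsegmented words): report 0.0 by convention.
    return edge / total if total else 0.0
```

Because the window is a parameter, the same function can be rerun with other values to see how sensitive the absolute BC figures are to the 20% choice.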
This metric was the decisive discriminator in Paper 8. Tagalog's VMML can, under certain corpus conditions, approach or enter the Voynich zone. But Tagalog's BC of approximately 0.20 is irreconcilable with Voynich's 0.361: the gap of roughly 0.16 reflects fundamentally different morphological architectures.
High BC in Voynich is consistent with prefix/suffix-dominant morphology. Low BC in Tagalog reflects its focus-morphology system, which encodes grammatical functions via infixes and circumfixes distributed through the interior of words.
Honest Limitations
We hold ourselves to the standard we apply to others. Every methodological limitation is disclosed in our preprints. Here are the most significant ones.
- 01 BC and CBMI edge windows (20%) are operationally defined, not statistically optimised thresholds. Different window choices would yield different absolute values; relative ordering across languages may still hold.
- 02 The 80,000-token cap introduces a negative VMML bias relative to full-corpus computation. Nietzsche was run at full corpus (n=82,802) to set the ceiling, while all other corpora were capped. This could artificially inflate Voynich's relative position if the Voynich text is shorter than the cap and is therefore analysed in full while the comparison corpora are truncated.
- 03 Establishing cross-text stability requires multiple texts per language, to show that a given VMML value is a property of the language rather than of the specific text. Paper 8 demonstrated this risk concretely: Tagalog VMML varies by 0.336 units across two Rizal novels.
- 04 All Voynich analysis depends on the EVA (European Voynich Alphabet) transcription. EVA encoding choices directly affect BPE learning. Results are valid under EVA; other transcription schemes would need independent testing.
- 05 Correlation is not causation. Similar VMML values do not imply linguistic relationship, shared origin, or any semantic claim about what the Voynich Manuscript means. VMML characterises a structural fingerprint, not content.
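One way to make the cap issue in limitation 02 visible is to apply the same cap to every corpus, including the ceiling anchor, and record which corpora were actually truncated. The sketch below assumes each corpus is a list of tokens and that capping means keeping the first `cap` tokens; the source does not state how the cap was applied, and the name `cap_corpora` is hypothetical.

```python
def cap_corpora(corpora, cap=80_000):
    """Truncate every corpus to at most `cap` tokens.

    Returns the capped corpora and the set of names that were truncated,
    so texts shorter than the cap (analysed in full) can be told apart.
    """
    capped = {name: tokens[:cap] for name, tokens in corpora.items()}
    truncated = {name for name, tokens in corpora.items() if len(tokens) > cap}
    return capped, truncated
```

Any corpus missing from the returned set was shorter than the cap and so was compared at full length against truncated texts.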