What is BPE VMML?

BPE (Byte-Pair Encoding) is an unsupervised algorithm originally developed for data compression. Applied to a language corpus, it starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a single token, iteratively building up a vocabulary of morpheme-like units from raw character sequences.
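
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not our production pipeline: the function name `learn_bpe` is hypothetical, and treating `min_freq` as a stopping threshold on the best pair's count is an assumption about how that parameter is applied.

```python
from collections import Counter


def learn_bpe(words, max_merges=200, min_freq=3):
    """Learn BPE merges from an iterable of word strings (illustrative sketch)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words if w)
    merges = []
    for _ in range(max_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best, count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if count < min_freq:  # assumption: stop once no pair recurs often enough
            break
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```

The function returns the ordered list of merges and the corpus re-segmented with those merges, so the learned tokens can be inspected directly.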

VMML (Vocabulary Morpheme Mean Length) measures the average length of the tokens BPE learns from a corpus. A higher VMML indicates that the algorithm is merging characters into longer, more complex units — a signature of richer internal word structure. A lower VMML suggests simpler, shorter morphological patterns.
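
As a rough illustration, VMML can be computed as a mean of token lengths. The sketch below assumes length is counted in transcription characters and that the default is an unweighted average over learned tokens; the optional `weights` argument (for example, corpus frequencies) is a hypothetical variant, not a statement of how our published figures are weighted.

```python
def vmml(tokens, weights=None):
    """Mean character length of learned tokens (illustrative sketch)."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("no tokens to average")
    if weights is None:
        # Assumption: every learned token counts equally.
        weights = [1] * len(tokens)
    total = sum(weights)
    return sum(len(t) * w for t, w in zip(tokens, weights)) / total
```

For example, the tokens `ab` and `abcd` give an unweighted mean of 3.0; weighting `ab` three times as heavily pulls the mean down to 2.5.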

Crucially, BPE is trained on each corpus independently. It learns the structure that exists in the data without any prior linguistic knowledge — no dictionaries, no grammars, no labeled morpheme boundaries.

| Parameter | Value | Purpose |
| --- | --- | --- |
| max_merges | 200 | Vocabulary ceiling; caps BPE learning depth |
| min_freq | 3 | Suppresses hapax noise; requires pattern recurrence |
| Token cap | 80,000 | Computational parity across corpora |
| Transcription | EVA | Standard Voynich character encoding |

"Think of BPE as teaching a child to read by finding recurring patterns in words. The 'morpheme length' measures how complex those patterns are across an entire language: not just one word, but tens of thousands at once."

The Alphabetic Ceiling

One of our most robust empirical findings is what we call the Alphabetic Ceiling: the highest BPE VMML observed in any alphabetic natural language we have tested under our standard parameters.

5.748 Ceiling (Nietzsche, n=82,802)
5.918 Voynich Manuscript VMML
+0.170 Amount by which Voynich exceeds the ceiling

The ceiling was empirically derived from Also sprach Zarathustra (Nietzsche), the highest-scoring alphabetic text identified in our initial 35-corpus scan. German makes a demanding ceiling anchor because its compound-heavy morphology naturally produces high VMML, and it still falls below Voynich.

The ceiling is not a theoretical threshold. It is a descriptive upper bound observed across tested corpora. Its stability across additional corpora and language families is itself a result — not an assumption.
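
Because the ceiling is a descriptive maximum rather than a fitted threshold, it reduces to simple arithmetic. The sketch below uses only the two scores reported here; the helper names `observed_ceiling` and `exceedance` are hypothetical, and the other tested corpora would be added to the dictionary in the same way.

```python
def observed_ceiling(scores):
    """Return (corpus_name, vmml) for the highest-scoring corpus in `scores`."""
    if not scores:
        raise ValueError("no corpora supplied")
    name = max(scores, key=scores.get)
    return name, scores[name]


def exceedance(candidate, ceiling):
    """Signed distance of a candidate VMML from the ceiling, to three decimals."""
    return round(candidate - ceiling, 3)


# Only the score reported in the text is listed; other alphabetic corpora
# would be further entries in this dictionary.
alphabetic = {"Nietzsche, Also sprach Zarathustra": 5.748}
anchor, ceiling = observed_ceiling(alphabetic)
gap = exceedance(5.918, ceiling)  # Voynich VMML minus the ceiling
```

Adding a corpus that scores above the current maximum would move the ceiling, which is why its stability across new corpora is reported as a result.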

Why 63 Corpora?

To draw any meaningful conclusion about where the Voynich Manuscript sits typologically, we needed a baseline that spans the known diversity of human language structure — not just a handful of European languages.

35+ Language families
30+ Languages tested
63 Total corpora (incl. multi-text)

The corpus set was designed to cover all major morphological types:

Agglutinative — Finnish, Basque, Tagalog, Ilocano, Turkish. These languages add discrete, separable morphemes to roots and were expected to produce higher VMML.

Analytic — English, Mandarin Chinese, Vietnamese. These languages rely on word order rather than morphology and were expected to produce lower VMML.

Fusional — Latin, Ancient Greek, Russian. Morphemes fuse and alter each other; intermediate VMML expected.

Isolating — Cantonese, Thai. Minimal inflection; low VMML expected.

Unknown scripts — Voynich Manuscript (EVA), Rohonc Codex. The subjects of comparison.

Boundary Concentration (BC)

VMML alone is insufficient to characterise morphological structure — two languages can share similar VMML while reaching it through entirely different mechanisms. Boundary Concentration was introduced to resolve this ambiguity.

BC measures what fraction of BPE morpheme boundaries fall at edge positions — the first or last 20% of the token. A high BC indicates that most morphological work happens at word edges: prefixes and suffixes. A low BC indicates that morphological boundaries are distributed internally — a signature of infix-heavy systems like Philippine focus languages.
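
A minimal sketch of the BC calculation follows. It assumes each word has already been segmented into BPE tokens, that boundary positions are measured relative to the full length of the word, and that a boundary exactly on the 20% mark counts as an edge; the function name `boundary_concentration` and the `edge` parameter are hypothetical.

```python
def boundary_concentration(segmented_words, edge=0.20):
    """Fraction of internal boundaries lying in the first or last `edge` share of a word."""
    edge_hits = 0
    total = 0
    for tokens in segmented_words:
        length = sum(len(t) for t in tokens)
        offset = 0
        # Each token except the last ends at an internal boundary.
        for tok in tokens[:-1]:
            offset += len(tok)
            rel = offset / length  # relative position of the boundary in the word
            total += 1
            if rel <= edge or rel >= 1 - edge:
                edge_hits += 1
    if total == 0:
        raise ValueError("no internal boundaries found")
    return edge_hits / total
```

Words that BPE leaves as a single token contribute no boundaries and therefore do not affect the score.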

0.361 Voynich BC (edge-concentrated)
~0.20 Tagalog BC (infix-dominated)

This metric was the decisive discriminator in Paper 8. Tagalog's VMML can, under certain corpus conditions, approach or enter the Voynich zone. But Tagalog's BC of approximately 0.20 is irreconcilable with Voynich's 0.361: the gap of roughly 0.16 reflects fundamentally different morphological architectures.

High BC in Voynich is consistent with prefix/suffix-dominant morphology. Low BC in Tagalog reflects its focus-morphology system, which encodes grammatical functions in part via infixes placed in the interior of words rather than only at their edges.

Honest Limitations

We hold ourselves to the standard we apply to others. Every methodological limitation is disclosed in our preprints. Here are the most significant ones.

Peer review standard. Every result published on this site has been through multiple rounds of adversarial review — including deliberate attempts to falsify our own claims — before publication. We report limitations explicitly because partial disclosure is a form of deception.