We applied BPE VMML (Byte-Pair Encoding Mean Vocabulary Morpheme Length), a metric of the average structural complexity of word-internal segmentation, to 63+ corpora from 35+ language families. The goal was to determine where the Voynich Manuscript sits in the typological space of known writing systems.
BPE VMML measures the mean length of morpheme-like segments produced by unsupervised Byte-Pair Encoding. Higher values indicate denser morpheme structure. An "alphabetic ceiling" of 5.748 was derived empirically (Nietzsche, Also sprach Zarathustra, n=82,802 tokens, Paper 7 v2.1). The Voynich Manuscript scores VMML = 5.918, which is 0.170 above that ceiling, with Boundary Concentration (BC) = 0.361.
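To make the definition concrete, here is a minimal sketch of how a metric of this kind could be computed. It assumes that VMML is the mean character length of the symbols learned by a BPE merge procedure; that reading, the toy learner `learn_bpe`, the function name `vmml`, and the default number of merges are all illustrative assumptions, not the protocol of Paper 7.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE learner (hypothetical helper): returns the merged symbols in order."""
    # Each word becomes a tuple of single-character symbols, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        merges.append(merged)
        # Rewrite every word, replacing the chosen pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def vmml(words, num_merges=50):
    """Mean length of learned BPE symbols (assumed reading of the metric)."""
    merges = learn_bpe(words, num_merges)
    if not merges:
        return 0.0
    return sum(len(m) for m in merges) / len(merges)
```

Under this reading, a corpus whose frequent merges grow into long units yields a higher value, which matches the statement that higher values indicate denser morpheme structure.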
| Corpus | Family / Branch | n tokens | VMML | BC | Above ceiling? |
|---|---|---|---|---|---|
| Voynich MS (Paper 7 canonical) | Unknown | 33,803 | 5.918 | 0.361 | ✓ Reference |
| Tagalog: Noli Me Tangere (full) | Philippine | 173,885 | 5.914 | 0.202 | ✓ Yes |
| Tagalog: El filibusterismo (full) | Philippine | 114,777 | 5.578 | 0.213 | ✗ No |
| Ilocano (Wikipedia, n=12.8k) | Philippine | 12,807 | 5.785 | 0.248 | ⚠ Provisional |
| Cebuano NLLB (n=60k) | Philippine | 60,000 | 5.609 | 0.294 | ✗ No |
| Malay (Wikipedia) | Malayo-Polynesian | 49,742 | 5.293 | 0.260 | ✗ No |
| Indonesian (Wikipedia) | Malayo-Polynesian | 50,350 | 4.838 | 0.272 | ✗ No |
| Basque (Wikipedia natural text) | Isolate (control) | 48,621 | 4.475 | 0.309 | ✗ No |
Tagalog cross-text instability: Full-corpus computation of El filibusterismo (n=114,777, identical protocol) yields VMML = 5.578, below the alphabetic ceiling of 5.748. The cross-text difference is Δ = 0.336 units (Noli 5.914 − Filibusterismo 5.578), larger than the width of the Voynich discriminant zone (0.28 units). Supra-ceiling Tagalog VMML is therefore a property of a specific text, not of the language.
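The arithmetic behind this comparison can be checked directly. The snippet below uses only the values reported in the text and table; the variable names are illustrative.

```python
# Values as reported in the text and table.
ALPHABETIC_CEILING = 5.748
ZONE_WIDTH = 0.28          # width of the Voynich discriminant zone

vmml_noli = 5.914          # Tagalog: Noli Me Tangere (full)
vmml_fili = 5.578          # Tagalog: El filibusterismo (full)

# Cross-text difference, rounded to the precision used in the text.
delta = round(vmml_noli - vmml_fili, 3)

noli_above = vmml_noli > ALPHABETIC_CEILING   # Noli sits above the ceiling
fili_above = vmml_fili > ALPHABETIC_CEILING   # Filibusterismo does not
delta_exceeds_zone = delta > ZONE_WIDTH       # the gap is wider than the zone

print(delta, noli_above, fili_above, delta_exceeds_zone)
```

Two texts in the same language land on opposite sides of the ceiling, and the gap between them is wider than the zone itself.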
Character permutation test: A corpus-size-matched character shuffle (50 iterations, all corpora at n=19,968 tokens) shows that Voynich VMML drops by 22.0% (CI95: [5.07, 5.82]), whereas Tagalog drops by 12–14% (CI95: [4.77, 5.04]). The CI95 ranges do not overlap. Despite similar VMML, Voynich and Tagalog reach it through structurally different mechanisms.
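A character permutation test of this kind might be sketched as follows. The shuffle scheme (pooling all characters across the corpus and redistributing them while keeping word lengths), the percentile interval, and the names `shuffle_characters`, `permutation_test`, and `intervals_overlap` are assumptions made for illustration. The `metric` argument stands in for whatever VMML implementation is used.

```python
import random
import statistics

def shuffle_characters(words, rng):
    """Pool all characters, shuffle them, and refill the original word lengths."""
    chars = [c for w in words for c in w]
    rng.shuffle(chars)
    out, pos = [], 0
    for w in words:
        out.append("".join(chars[pos:pos + len(w)]))
        pos += len(w)
    return out

def permutation_test(words, metric, iterations=50, seed=0):
    """Compare the observed metric with its distribution under character shuffling."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    observed = metric(words)
    shuffled = sorted(
        metric(shuffle_characters(words, rng)) for _ in range(iterations)
    )
    # Simple percentile interval over the shuffled values (assumed CI method).
    lo = shuffled[int(0.025 * (iterations - 1))]
    hi = shuffled[int(0.975 * (iterations - 1))]
    mean_shuffled = statistics.mean(shuffled)
    drop_pct = 100.0 * (observed - mean_shuffled) / observed if observed else 0.0
    return {
        "observed": observed,
        "mean_shuffled": mean_shuffled,
        "drop_pct": drop_pct,
        "ci95": (lo, hi),
    }

def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Applied to the reported intervals, `intervals_overlap((5.07, 5.82), (4.77, 5.04))` returns `False`, consistent with the statement that the CI95 ranges do not overlap.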
No tested corpus from any of the 35+ language families falls inside the Voynich discriminant zone on all three metrics (VMML, BC, CBMI) simultaneously. Tagalog is the closest match on VMML, but it fails decisively on BC (0.202 for Noli Me Tangere vs 0.361 for Voynich) and shows cross-text VMML instability. No natural language tested here replicates the Voynich structural profile.
Paper 7 (typological validation, 55+ corpora): doi.org/10.5281/zenodo.20386119
Paper 8 (Tagalog cross-text instability + Austronesian expansion): doi.org/10.5281/zenodo.20467972