Complete operational definitions, parameters, limitations, and reproducibility information.
BPE (Byte-Pair Encoding) is an unsupervised subword segmentation algorithm, adapted for neural machine translation by Sennrich et al. (2016). Applied here: each corpus is tokenized by whitespace; BPE then learns merge rules by iteratively combining the most frequent pair of adjacent symbols (single characters at first, previously merged units in later iterations). VMML (Vocabulary Morpheme Mean Length) is the frequency-weighted mean length of all BPE segments in the resulting vocabulary.
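The definition above can be illustrated with a minimal pure-Python sketch. It is not the published pipeline: the function names (`learn_bpe`, `merge_pair`, `vmml`), the tie-breaking rule, and the toy merge count are all assumptions made for illustration, and "frequency-weighted" is read here as weighting each segment by how often it occurs in the segmented corpus.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged unit."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-tokenized words.

    Returns the merge list and the segmented vocabulary (segmentation -> frequency).
    """
    # Start with every word split into single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Assumption: ties are broken by the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges, vocab


def vmml(segmented_vocab):
    """Frequency-weighted mean character length of all segments."""
    seg_freq = Counter()
    for symbols, freq in segmented_vocab.items():
        for seg in symbols:
            seg_freq[seg] += freq
    total = sum(seg_freq.values())
    return sum(len(seg) * f for seg, f in seg_freq.items()) / total


words = "low low low lower lowest".split()
merges, vocab = learn_bpe(words, num_merges=2)
print(merges, vmml(vocab))
```

With zero merges every segment is a single character and the value is 1.0; each merge that fires on frequent material pushes the mean upward.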
These parameters were fixed before any corpus analysis began. They were not adjusted post-hoc to optimize results for any individual language or for the Voynich Manuscript.
The alphabetic ceiling (5.748) was derived from the highest-VMML alphabetic corpus in the Paper 7 comparison set: Friedrich Nietzsche, Also sprach Zarathustra (German), n = 82,802 tokens. This ceiling was established before the Austronesian expansion in Paper 8 and is used as a reference marker, not a threshold with independent statistical derivation.
Inclusion criteria: Minimum 10,000 tokens. Natural text only — no paradigm lists, no bot-generated content. Language verified by native-speaker inspection of the first 500 words plus automated language detection.
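Of these criteria, only the token minimum is mechanically checkable; the natural-text and language checks require human judgement or an external detector. A minimal sketch of the size check, assuming whitespace tokenization as in the BPE step (the function name `meets_token_minimum` is hypothetical):

```python
MIN_TOKENS = 10_000  # inclusion threshold stated in the criteria above


def meets_token_minimum(text, min_tokens=MIN_TOKENS):
    """Return True if the whitespace-tokenized text has at least `min_tokens` tokens."""
    return len(text.split()) >= min_tokens


print(meets_token_minimum("word " * 10_000))
```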
Excluded corpora and reasons:
All corpora are documented with: source URL or ISBN, token count (n), language family, and quality flags. This documentation is included in the Zenodo deposit for each paper.
BC measures the fraction of BPE morpheme boundaries located at edge positions — defined as the first or last 20% of a token's character positions. Values approaching 1 indicate prefix/suffix-dominant morphology (agglutinative or inflectional patterns visible at word edges). Values approaching 0 indicate infix-dominant or root-internal morphology.
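A small sketch of how such a fraction could be computed from already-segmented tokens follows. The mapping from a boundary to a relative position is an assumption: a boundary after the k-th character of an n-character token is placed at k/n and counted as an edge boundary when k/n ≤ 0.20 or k/n ≥ 0.80. The function names are hypothetical and the published scripts may define positions differently.

```python
def boundary_offsets(segments):
    """Character offsets of the internal boundaries of one segmented token."""
    offsets, pos = [], 0
    for seg in segments[:-1]:  # no boundary after the final segment
        pos += len(seg)
        offsets.append(pos)
    return offsets


def edge_boundary_fraction(segmented_tokens, edge=0.20):
    """Fraction of internal boundaries in the first or last `edge` share of a token.

    Assumption: a boundary after character k of an n-character token
    sits at relative position k / n.
    """
    edge_count = total = 0
    for segments in segmented_tokens:
        n = sum(len(seg) for seg in segments)
        for k in boundary_offsets(segments):
            total += 1
            rel = k / n
            if rel <= edge or rel >= 1 - edge:
                edge_count += 1
    # Tokens left unsegmented contribute no boundaries.
    return edge_count / total if total else float("nan")


tokens = [
    ("un", "believ", "able"),  # boundaries at 2/12 (edge) and 8/12 (interior)
    ("walk", "ing"),           # boundary at 4/7 (interior)
    ("s", "ubstantial"),       # boundary at 1/11 (edge)
]
print(edge_boundary_fraction(tokens))
```

Under this convention short tokens rarely contribute edge boundaries (a 4-character token has none inside the 20% window), so the token-length distribution of a corpus affects the value.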
The Voynich BC value of 0.361 falls between the strongly prefix/suffix-dominant alphabetic languages and the strongly infix-dominant Austronesian languages. This metric is descriptive, not a statistical threshold: the observation window of ±10% around the Voynich value is defined by the Voynich value itself and has no independently derived significance boundary.
The divergence between Voynich VMML and Tagalog BC is one of the two failure conditions that prevent Tagalog from being a complete structural match. The other is cross-text VMML instability (see Section 4, Limitation 1).
These are genuine constraints on the conclusions that can be drawn from this research. They are not caveats added for rhetorical balance — they reflect real methodological boundaries.
All data, scripts, and intermediate results are published on Zenodo under open licenses. The analysis pipeline is written in Python 3.10+ with no proprietary dependencies. All required packages are standard scientific Python (sentencepiece, numpy, pandas, matplotlib).
The Zenodo deposits include: raw corpora (or download scripts for corpora with license restrictions), BPE tokenizer configurations, pre-computed VMML/BC/CBMI tables, and all figures from both papers in editable format.
Independent replication attempts are welcome. Discrepancies in results should be reported to contact@voynichlucidity.com — they will be investigated and, if confirmed, documented as errata in the Zenodo record.
If you use this methodology or data in your own research, please cite the relevant paper. Both APA and BibTeX formats are provided below.
L. (2026). Structural fingerprinting of the Voynich Manuscript: A BPE-based typological comparison across 35+ language families. Zenodo. https://doi.org/10.5281/zenodo.20386119
L. (2026). Austronesian structural comparison with the Voynich Manuscript: Tagalog cross-text instability and discriminant zone analysis. Zenodo. https://doi.org/10.5281/zenodo.20467972
@misc{voynich_lucidity_p7_2026,
  author    = {L.},
  title     = {Structural fingerprinting of the Voynich Manuscript:
               A BPE-based typological comparison across 35+ language families},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20386119},
  url       = {https://doi.org/10.5281/zenodo.20386119}
}
@misc{voynich_lucidity_p8_2026,
  author    = {L.},
  title     = {Austronesian structural comparison with the Voynich Manuscript:
               Tagalog cross-text instability and discriminant zone analysis},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20467972},
  url       = {https://doi.org/10.5281/zenodo.20467972}
}