
1. BPE VMML — Operational Definition

BPE (Byte-Pair Encoding) is an unsupervised subword segmentation algorithm, adapted from a data-compression technique for use in neural machine translation (Sennrich et al., 2016). Applied here: each corpus is tokenized by whitespace, and BPE learns merge rules by iteratively merging the most frequent adjacent pair of symbols, starting from single characters. VMML (Vocabulary Morpheme Mean Length) is the frequency-weighted mean length of all BPE segments in the resulting vocabulary.

Fixed parameters (held constant across all 63+ corpora):
max_merges = 200
min_freq = 3
tokenizer = whitespace split (no language-specific normalization)
weighting = frequency-weighted mean over vocabulary segments

These parameters were fixed before any corpus analysis began. They were not adjusted post-hoc to optimize results for any individual language or for the Voynich Manuscript.
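The procedure and fixed parameters above can be illustrated with a minimal pure-Python sketch. It is not the published pipeline, which lists sentencepiece among its dependencies. Several details are assumptions made for illustration: `min_freq` is read as a stopping rule (no merge is learned once the best pair occurs fewer than 3 times), ties between equally frequent pairs are broken lexicographically, segment length is counted in characters, and each segment is weighted by its number of occurrences in the segmented corpus. The function names `learn_bpe`, `apply_merge`, and `vmml` are hypothetical.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(text, max_merges=200, min_freq=3):
    """Learn BPE merges from whitespace tokens.

    Returns (merges, words), where `words` maps each segmented word type
    (a tuple of segments) to its corpus frequency.
    """
    tokens = text.split()  # whitespace split, no language-specific normalization
    words = {tuple(tok): n for tok, n in Counter(tokens).items()}
    merges = []
    for _ in range(max_merges):
        pair_counts = Counter()
        for symbols, n in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += n
        if not pair_counts:
            break  # every word is already a single segment
        # Assumption: ties are broken lexicographically for determinism.
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_freq:
            break  # assumption: min_freq acts as a stopping rule
        merges.append(best)
        words = {apply_merge(symbols, best): n for symbols, n in words.items()}
    return merges, words


def vmml(words):
    """Frequency-weighted mean segment length (in characters) over the vocabulary."""
    seg_freq = Counter()
    for symbols, n in words.items():
        for seg in symbols:
            seg_freq[seg] += n
    total = sum(seg_freq.values())
    if total == 0:
        return float("nan")
    return sum(len(seg) * f for seg, f in seg_freq.items()) / total
```

On the toy input `"abab abab abab c"`, the sketch learns two merges and leaves the segments `abab` (3 occurrences) and `c` (1 occurrence), so the weighted mean length is (4·3 + 1·1) / 4 = 3.25.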

The alphabetic ceiling (5.748) was derived from the highest-VMML alphabetic corpus in the Paper 7 comparison set: Friedrich Nietzsche, Also sprach Zarathustra (German), n = 82,802 tokens. This ceiling was established before the Austronesian expansion in Paper 8 and is used as a reference marker, not a threshold with independent statistical derivation.

2. Corpus Standards

Minimum tokens required for inclusion: 10,000
Corpora analyzed across both papers: 63+
Language families represented: 35+

Inclusion criteria: Minimum 10,000 tokens. Natural text only — no paradigm lists, no bot-generated content. Language verified by native-speaker inspection of the first 500 words plus automated language detection.

Excluded corpora and reasons:

All corpora are documented with: source URL or ISBN, token count (n), language family, and quality flags. This documentation is included in the Zenodo deposit for each paper.
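The documentation fields and the token minimum described above could be represented along the following lines. This is only a sketch: the class name `CorpusRecord`, its field names, and the helper functions are hypothetical and are not taken from the Zenodo deposit. Only the size criterion is checked automatically here; the natural-text and language-verification criteria depend on inspection and are not modeled.

```python
from dataclasses import dataclass, field

MIN_TOKENS = 10_000  # inclusion threshold stated in the corpus standards


@dataclass
class CorpusRecord:
    """Hypothetical metadata record mirroring the documented fields."""
    name: str
    source: str            # source URL or ISBN
    language_family: str
    n_tokens: int          # whitespace token count (n)
    quality_flags: list[str] = field(default_factory=list)


def count_tokens(text: str) -> int:
    """Count tokens with the same whitespace split used for BPE input."""
    return len(text.split())


def meets_minimum(record: CorpusRecord) -> bool:
    """True if the corpus satisfies the minimum-token criterion."""
    return record.n_tokens >= MIN_TOKENS
```

A record would typically be filled by passing the corpus text through `count_tokens` and then filtering with `meets_minimum` before any BPE training.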

3. Boundary Concentration (BC)

BC measures the fraction of BPE morpheme boundaries located at edge positions — defined as the first or last 20% of a token's character positions. Values approaching 1 indicate prefix/suffix-dominant morphology (agglutinative or inflectional patterns visible at word edges). Values approaching 0 indicate infix-dominant or root-internal morphology.

BC computation:
edge_zone = first 20% and last 20% of token length (character positions)
BC = boundaries_in_edge_zone / total_boundaries
Voynich BC = 0.361
Tagalog BC ≈ 0.20 (infix-dominant; CV reduplication + infix -um-, -in-)

The Voynich BC value of 0.361 places it in a zone that is intermediate between strongly prefix/suffix-dominant alphabetic languages and strongly infix-dominant Austronesian languages. This metric is descriptive, not a statistical threshold — the observation window of ±10% around the Voynich value is defined by the Voynich value itself and has no independently derived significance boundary.
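The BC ratio defined above can be sketched in a few lines of Python. The text does not say how a boundary that falls between two characters is mapped to a character position, so this sketch assumes a boundary after the k-th character of an n-character token counts as an edge boundary when k ≤ 0.2·n or k ≥ 0.8·n. The function names are hypothetical, and tokens that BPE leaves unsegmented contribute no boundaries.

```python
def boundary_offsets(segments):
    """Character offsets of the internal boundaries of one segmented token."""
    offsets, pos = [], 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.append(pos)
    return offsets


def boundary_concentration(segmented_tokens, edge=0.20):
    """Fraction of BPE boundaries that fall in the first or last `edge` share of a token."""
    in_edge = total = 0
    for segments in segmented_tokens:
        n = sum(len(seg) for seg in segments)  # token length in characters
        for k in boundary_offsets(segments):
            total += 1
            # Assumption: offset k is in the edge zone if it lies within
            # the first 20% or the last 20% of the token length.
            if k <= edge * n or k >= (1 - edge) * n:
                in_edge += 1
    return in_edge / total if total else float("nan")
```

For the illustrative segmentation `["un", "believ", "able"]` (12 characters, boundaries at offsets 2 and 8), only the first boundary lies in an edge zone, giving BC = 0.5 for that token alone.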

The divergence between the Voynich BC (0.361) and the Tagalog BC (≈ 0.20) is one of the two failure conditions that prevent Tagalog from being a complete structural match. The other is cross-text VMML instability (see Section 4, Limitation 1).

4. Honest Limitations

These are genuine constraints on the conclusions that can be drawn from this research. They are not caveats added for rhetorical balance — they reflect real methodological boundaries.

5. Reproducibility

All data, scripts, and intermediate results are published on Zenodo under open licenses. The analysis pipeline is written in Python 3.10+ with no proprietary dependencies. All required packages are standard scientific Python (sentencepiece, numpy, pandas, matplotlib).

Replication environment:
Language: Python 3.10+
Platform: macOS / Linux (Windows untested)
Runtime: 2–3 hours for full replication on a standard laptop
Memory: ~4 GB RAM sufficient for all corpora
License: CC BY 4.0 (data) · MIT (code)
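Before starting a replication run, the environment listed above can be checked with a short script such as the following. It is a convenience sketch, not part of the published deposit; the function names are hypothetical, and it only tests that the interpreter version and the four named packages are available, not which package versions were used in the papers.

```python
import sys
from importlib.util import find_spec

# Packages named in the reproducibility section.
REQUIRED = ("sentencepiece", "numpy", "pandas", "matplotlib")


def python_ok(min_version=(3, 10)):
    """True if the running interpreter is at least `min_version`."""
    return sys.version_info >= min_version


def missing_packages(names=REQUIRED):
    """Return the names of packages that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("Missing packages:", missing_packages() or "none")
```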

The Zenodo deposits include: raw corpora (or download scripts for corpora with license restrictions), BPE tokenizer configurations, pre-computed VMML/BC/CBMI tables, and all figures from both papers in editable format.

Independent replication attempts are welcome. Discrepancies in results should be reported to contact@voynichlucidity.com — they will be investigated and, if confirmed, documented as errata in the Zenodo record.

6. Citation

If you use this methodology or data in your own research, please cite the relevant paper. Both APA and BibTeX formats are provided below.

APA — Paper 7 (Typological Comparison)

L. (2026). Structural fingerprinting of the Voynich Manuscript: A BPE-based typological comparison across 35+ language families. Zenodo. https://doi.org/10.5281/zenodo.20386119

APA — Paper 8 (Austronesian Expansion)

L. (2026). Austronesian structural comparison with the Voynich Manuscript: Tagalog cross-text instability and discriminant zone analysis. Zenodo. https://doi.org/10.5281/zenodo.20467972

BibTeX — Paper 7
@misc{voynich_lucidity_p7_2026,
  author    = {L.},
  title     = {Structural fingerprinting of the Voynich Manuscript:
               A BPE-based typological comparison across 35+ language families},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20386119},
  url       = {https://doi.org/10.5281/zenodo.20386119}
}

BibTeX — Paper 8
@misc{voynich_lucidity_p8_2026,
  author    = {L.},
  title     = {Austronesian structural comparison with the Voynich Manuscript:
               Tagalog cross-text instability and discriminant zone analysis},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20467972},
  url       = {https://doi.org/10.5281/zenodo.20467972}
}