Methodology · Voynich Lucidity

1. BPE VMML — Operational Definition

BPE (Byte-Pair Encoding) is an unsupervised subword segmentation algorithm, applied to neural machine translation by Sennrich et al. (2016). Applied here, each corpus is tokenized by whitespace, and BPE learns merge rules by repeatedly merging the most frequent adjacent pair of symbols (initially single characters) within tokens. VMML (Vocabulary Morpheme Mean Length) is the frequency-weighted mean length, in characters, of all BPE segments in the resulting vocabulary.

Fixed parameters (held constant across all 63+ corpora):
max_merges = 200
min_freq = 3
tokenizer = whitespace split (no language-specific normalization)
weighting = frequency-weighted mean over vocabulary segments

These parameters were fixed before any corpus analysis began. They were not adjusted post-hoc to optimize results for any individual language or for the Voynich Manuscript.
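The procedure and fixed parameters above can be sketched in plain Python. This is a minimal illustration, not the published pipeline: the function names `learn_bpe` and `vmml` are hypothetical, `min_freq` is assumed to act as a stopping threshold on the frequency of the best pair, and VMML is assumed to weight each segment's character length by how often that segment occurs in the segmented corpus.

```python
from collections import Counter


def learn_bpe(text, max_merges=200, min_freq=3):
    """Run BPE merges on whitespace tokens.

    Returns a Counter mapping each token, as a tuple of segments, to its count.
    """
    # Each whitespace token starts as a tuple of single characters.
    words = Counter(tuple(tok) for tok in text.split())
    for _ in range(max_merges):
        # Count adjacent symbol pairs, weighted by token frequency.
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < min_freq:  # assumption: min_freq is a stopping threshold
            break
        # Rewrite every token with the best pair merged into one symbol.
        merged = Counter()
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += count
        words = merged
    return words


def vmml(words):
    """Frequency-weighted mean segment length, in characters."""
    seg_freq = Counter()
    for symbols, count in words.items():
        for seg in symbols:
            seg_freq[seg] += count
    total = sum(seg_freq.values())
    return sum(len(seg) * f for seg, f in seg_freq.items()) / total
```

On the toy input `"aaaa aaaa aaaa"`, two merges collapse every token into a single four-character segment, so `vmml(learn_bpe("aaaa aaaa aaaa"))` is 4.0. With `max_merges=0` the same input gives 1.0.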

Fig. 4. BPE process applied to the Voynich EVA corpus. Four stages (raw corpus → character pair counting → iterative merging (200×) → subword vocabulary), after which VMML is computed as the frequency-weighted mean segment length. Result: 5.918 (95% CI [5.77–6.05]).

The alphabetic ceiling (5.748) was derived from the highest-VMML alphabetic corpus in the Paper 7 comparison set: Friedrich Nietzsche, Also sprach Zarathustra (German), n = 82,802 tokens. This ceiling was established before the Austronesian expansion in Paper 8 and is used as a reference marker, not a threshold with independent statistical derivation.

v2.3 update (2026-06-08): Section 5.9 added — per-folio Currier A/B reanalysis. Cross-boundary mutual information (CBMI) identified as primary A/B discriminant: CBMI_A = 1.97 bits vs CBMI_B = 1.51 bits, Cohen's d = −1.01, permutation p < 0.001. CBMI survives within-quire control (p = 0.0008), ruling out manuscript section as confound. Finding is orthogonal to Parisel (2026) vowel-selection model.
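The effect size and permutation p-value reported for the A/B comparison are standard two-sample statistics. The sketch below shows one common way to compute them from per-folio values. It is illustrative only: the CBMI computation itself is not shown, the function names are hypothetical, and the sign convention (mean of B minus mean of A, which yields a negative d when A is higher) is an assumption chosen to match the sign of the reported value.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d with pooled standard deviation, signed as mean(b) - mean(a)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)


def permutation_p(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Shuffle group labels by permuting the pooled values.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[: len(a)].mean() - shuffled[len(a):].mean())
        hits += int(diff >= observed)
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A within-quire control of the kind mentioned above would restrict the label shuffling to folios of the same quire; that variant is not shown here.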

Paper 7 v2.6: full code (Zenodo)
Paper 8: full code (Zenodo)

2. Corpus Standards

Minimum tokens required for inclusion: 10,000
Corpora analyzed across both papers: 63+
Language families represented: 35+

Inclusion criteria: Minimum 10,000 tokens. Natural text only — no paradigm lists, no bot-generated content. Language verified by native-speaker inspection of the first 500 words plus automated language detection.

Excluded corpora and reasons:

EX1 Cebuano Wikipedia — approximately 85% of articles are Lsjbot stubs. Machine-generated content structurally different from natural language; excluded to avoid systematic VMML distortion.
EX2 Basque morphological paradigm list — paradigm tables, not running text. Inclusion would produce artificially elevated BC values by design rather than natural morphological behavior.

All corpora are documented with: source URL or ISBN, token count (n), language family, and quality flags. This documentation is included in the Zenodo deposit for each paper.
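The documentation fields and the token minimum described in this section could be represented as a small record type. This is a sketch under stated assumptions: the class name, field names, and the two exclusion flag strings are hypothetical and are not taken from the Zenodo deposits. Only the 10,000-token minimum and the two exclusion reasons come from the text above.

```python
from dataclasses import dataclass, field

MIN_TOKENS = 10_000  # inclusion minimum stated in this section

# Hypothetical flag names for the two exclusion reasons listed above.
EXCLUSION_FLAGS = {"bot_generated", "paradigm_list"}


@dataclass
class CorpusRecord:
    name: str
    source: str            # source URL or ISBN
    n_tokens: int          # whitespace token count (n)
    language_family: str
    quality_flags: list[str] = field(default_factory=list)

    def is_eligible(self) -> bool:
        """True if the corpus meets the size minimum and has no exclusion flag."""
        flagged = EXCLUSION_FLAGS.intersection(self.quality_flags)
        return self.n_tokens >= MIN_TOKENS and not flagged


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens, matching the whitespace tokenizer."""
    return len(text.split())
```

For example, a record with `n_tokens=82_802` and no flags is eligible, while a record carrying the `"bot_generated"` flag is rejected regardless of size.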

3. Boundary Concentration (BC)

BC measures the fraction of BPE morpheme boundaries located at edge positions — defined as the first or last 20% of a token's character positions. Values approaching 1 indicate prefix/suffix-dominant morphology (agglutinative or inflectional patterns visible at word edges). Values approaching 0 indicate infix-dominant or root-internal morphology.

BC computation
edge_zone = first 20% and last 20% of token length (character positions)
BC = boundaries_in_edge_zone / total_boundaries
Voynich BC = 0.361
Tagalog BC ≈ 0.20 (infix-dominant; CV reduplication + infix -um-, -in-)

The Voynich BC value of 0.361 places it in a zone that is intermediate between strongly prefix/suffix-dominant alphabetic languages and strongly infix-dominant Austronesian languages. This metric is descriptive, not a statistical threshold — the observation window of ±10% around the Voynich value is defined by the Voynich value itself and has no independently derived significance boundary.
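The BC definition can be sketched as follows. The exact edge-zone convention is an assumption: here a boundary at character offset k in a token of length n counts as an edge boundary when k/n ≤ 0.2 or k/n ≥ 0.8. The function name and the example segmentations are hypothetical and are not output of the published pipeline.

```python
def boundary_concentration(segmented_tokens, edge_fraction=0.2):
    """Fraction of internal segment boundaries that fall in the edge zones.

    segmented_tokens: iterable of tokens, each a sequence of segment strings.
    """
    in_edge = total = 0
    for segments in segmented_tokens:
        length = sum(len(seg) for seg in segments)
        offset = 0
        # Every segment except the last ends at an internal boundary.
        for seg in segments[:-1]:
            offset += len(seg)
            position = offset / length  # relative position in (0, 1)
            total += 1
            if position <= edge_fraction or position >= 1 - edge_fraction:
                in_edge += 1
    # Undefined when no token has an internal boundary.
    return in_edge / total if total else float("nan")
```

With the illustrative segmentations `("q", "okeed", "y")` and `("che", "dy")`, two of the three boundaries fall in an edge zone, so the result is 2/3.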

The divergence between the Voynich BC (0.361) and the Tagalog BC (≈ 0.20) is one of the two failure conditions that prevent Tagalog from being a complete structural match. The other is cross-text VMML instability (see Section 4, limitation L1).

4. Honest Limitations

These are genuine constraints on the conclusions that can be drawn from this research. They are not caveats added for rhetorical balance — they reflect real methodological boundaries.

L1 Cross-text VMML stability is not confirmed for most tested languages. Only Tagalog has been tested on multiple texts under identical conditions. That test revealed instability: VMML(Noli me Tangere) vs. VMML(El Filibusterismo) shows a delta of 0.336 units — a difference large enough to shift a language between structural zones. This instability means single-text measurements for other languages may not represent stable language-level properties. Further multi-text validation is needed across the full corpus.
L2 EVA transcription dependency. All Voynich analysis is based on the EVA (European Voynich Alphabet) transcription standard. If the Voynich script contains glyph distinctions that EVA collapses, or if different transcribers apply EVA differently, results may vary. Replication with independent transcriptions (e.g., the Landini or Frogguy systems) has not been conducted.
L3 80k token cap in Paper 8 introduces systematic negative VMML bias. To ensure cross-study comparability, Paper 8 truncates all corpora at 80,000 tokens. Full-corpus computation in Paper 7 (no cap) produces VMML values approximately 0.180 units higher on average. This means Paper 8 values are not directly comparable to Paper 7 full-corpus values for the same language. The direction of bias is consistent and downward.
L4 BPE is sensitive to training corpus size. Corpora with n < 10,000 tokens show substantially higher VMML variance. The 10,000-token minimum was chosen to reduce this variance, but it does not eliminate it. Corpora near the minimum threshold should be interpreted with more caution than large corpora (>50,000 tokens).
L5 BC and CBMI observation windows are descriptive, not independently derived. The ±10% window around the Voynich observed values for BC and CBMI is used to frame the comparison space, not to establish statistical significance. No null distribution has been derived for these metrics under a hypothesis of random structural variation.
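Limitations L3 and L4 both concern how a metric responds to corpus size. One way to probe that sensitivity is to score random contiguous windows of several sizes and compare the spread. The sketch below is not part of the published pipeline: the function names are hypothetical, and `metric` stands for any callable that maps a token list to a number (for example, a VMML computation).

```python
import random
import statistics


def truncate(tokens, cap=80_000):
    """Keep the first `cap` tokens, mirroring the Paper 8 cap described in L3."""
    return tokens[:cap]


def size_sensitivity(tokens, metric, sizes, n_rep=20, seed=0):
    """Score random contiguous windows of each size.

    Returns {size: (mean, stdev)}; sizes larger than the corpus are skipped.
    n_rep must be at least 2 for the standard deviation to be defined.
    """
    rng = random.Random(seed)
    result = {}
    for size in sizes:
        if size > len(tokens):
            continue
        scores = []
        for _ in range(n_rep):
            start = rng.randint(0, len(tokens) - size)
            scores.append(metric(tokens[start:start + size]))
        result[size] = (statistics.mean(scores), statistics.stdev(scores))
    return result
```

A standard deviation that shrinks as the window size grows would be consistent with the variance pattern described in L4. A mean that drifts with size would point to the kind of bias described in L3.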

5. Reproducibility

All data, scripts, and intermediate results are published on Zenodo under open licenses. The analysis pipeline is written in Python 3.10+ with no proprietary dependencies. All required packages are standard scientific Python (sentencepiece, numpy, pandas, matplotlib).

Replication environment
Language: Python 3.10+
Platform: macOS / Linux (Windows untested)
Runtime: 2–3 hours for full replication on a standard laptop
Memory: ~4 GB RAM sufficient for all corpora
License: CC BY 4.0 (data) · MIT (code)

The Zenodo deposits include: raw corpora (or download scripts for corpora with license restrictions), BPE tokenizer configurations, pre-computed VMML/BC/CBMI tables, and all figures from both papers in editable format.

Paper 7: doi.org/10.5281/zenodo.20668229
Paper 8: doi.org/10.5281/zenodo.20668970

Independent replication attempts are welcome. Discrepancies in results should be reported to contact@voynichlucidity.com — they will be investigated and, if confirmed, documented as errata in the Zenodo record.

6. Citation

If you use this methodology or data in your own research, please cite the relevant paper. Both APA and BibTeX formats are provided below.

APA — Paper 7 (Typological Comparison)

L. (2026). Structural fingerprinting of the Voynich Manuscript: A BPE-based typological comparison across 35+ language families. Zenodo. https://doi.org/10.5281/zenodo.20668229

APA — Paper 8 (Austronesian Expansion)

L. (2026). Austronesian structural comparison with the Voynich Manuscript: Tagalog cross-text instability and discriminant zone analysis. Zenodo. https://doi.org/10.5281/zenodo.20668970

BibTeX — Paper 7

@misc{voynich_lucidity_p7_2026,
  author    = {L.},
  title     = {Structural fingerprinting of the Voynich Manuscript:
               A BPE-based typological comparison across 35+ language families},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20668229},
  url       = {https://doi.org/10.5281/zenodo.20668229}
}

BibTeX — Paper 8

@misc{voynich_lucidity_p8_2026,
  author    = {L.},
  title     = {Austronesian structural comparison with the Voynich Manuscript:
               Tagalog cross-text instability and discriminant zone analysis},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20668970},
  url       = {https://doi.org/10.5281/zenodo.20668970}
}