Parametric Text Tessellation

Abbreviation	PTT
Proposed by	Andrew Lehti
Field	Text classification; structural text analysis
Method type	Deterministic feature-based classifier
Primary signal	Character patterns, punctuation, word edges, capitalization, and token distributions
Scoring methods	Cosine similarity; softmax normalization
Model format	Probability arrays; scaled integer matrices
Semantic model required	No
Related areas	Stylometry, computational linguistics, forensic text analysis, machine-generated text detection

Parametric Text Tessellation (PTT) is a proposed deterministic method for classifying text by its structural pattern rather than its semantic meaning. The method treats writing as a measurable arrangement of characters, spaces, punctuation marks, word boundaries, capitalization patterns, and other surface features. These features are converted into statistical profiles that can be compared against trained reference datasets.

PTT is intended as a compact alternative or supplement to large semantic classifiers. Instead of attempting to understand what a text says, it measures how the text is built. Its central premise is that texts have surface-level structural habits that can be modeled, compared, and scored. These habits may reflect writing style, platform conventions, dataset origin, machine generation, formatting behavior, or other non-semantic regularities.

The method has been proposed for use in text classification, dataset comparison, moderation triage, synthetic-text detection, style comparison, and as a preprocessing layer for heavier semantic systems.

Overview

PTT belongs to a family of text-analysis approaches that measure form rather than meaning. It is related in broad purpose to stylometry and computational text analysis, but it differs in its emphasis on compact, bounded feature groups and deterministic scoring.

In PTT, a body of text is converted into several feature distributions. These may include raw keyboard characters, adjacent lowercase letter-space pairs, punctuation-adjacent letters, word-edge patterns, capitalization transitions, and optional Unicode-mapped character signals. Each feature group produces counts. The counts are normalized into probabilities. The resulting profile is treated as a structural imprint of the text.

A trained dataset is represented by one or more such imprints. A new input text is processed with the same feature rules and compared against those reference imprints. The comparison produces a ranked distribution showing which trained datasets the new text most resembles structurally.

PTT does not require a vocabulary list, language model, syntax parser, embedding model, or semantic label system. It requires only stable feature extraction rules and a trained reference bundle.

Conceptual basis

The guiding assumption of PTT is that text has measurable surface geometry. A sentence is not only a sequence of meanings; it is also a sequence of marks. Letter transitions, spaces, punctuation placement, word beginnings, word endings, capitalization, and symbol use form patterns that can be counted.

This approach separates structural resemblance from semantic interpretation. Two texts can discuss different topics while retaining similar structural behavior. Conversely, two texts can discuss the same topic while differing sharply in punctuation rhythm, capitalization behavior, word-edge structure, or character distribution.

The method can be summarized as a shape comparison model:

A dataset is represented as a shape.
A new text is represented as a shape.
Classification is performed by comparing those shapes.

This makes PTT useful where semantic content may be unreliable, disguised, noisy, or secondary to the classification task.

Difference from semantic classification

Modern text classifiers often rely on semantic features, including words, subwords, embeddings, contextual relationships, and transformer-based representations. These systems are powerful but can be expensive to run and difficult to inspect. They may also be affected by paraphrase, topic changes, vocabulary substitution, and prompt-level disguise.

PTT avoids these dependencies by using bounded structural features. This gives it several practical traits:

models can be small;
scoring can be fast;
outputs are repeatable;
features can be inspected directly;
scoring can run locally or in a browser;
vocabulary changes do not erase all signal.

The method does not claim that semantic analysis is unnecessary. Instead, it treats structure as a separate signal. In some workflows, PTT can act as a first-pass classifier before a larger semantic system is used.

Feature groups

PTT uses feature processors, each of which views the same text through a different structural lens. The exact processor set can vary by implementation, but the proposed system includes several recurring groups.

Raw character distribution

The keyboardRaw group counts characters as they appear at the keyboard or text-stream level. It preserves uppercase letters, lowercase letters, digits, punctuation, whitespace, and symbols. This group captures the mechanical surface of writing.

Lowercase letter-space pairs

The lowerSpacePairs group lowercases text and counts overlapping two-character sequences made from letters and spaces. For example, hello world contains the pairs he, el, ll, lo, "o " (letter then space), " w" (space then letter), wo, or, rl, and ld. This group captures local flow across letters and word boundaries without storing full vocabulary.
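The pair extraction described above can be sketched in Python. This is a minimal illustration rather than the reference implementation: the function name lower_space_pairs is hypothetical, the allowed alphabet is assumed to be ASCII lowercase letters plus the space, and pairs that touch any other character are assumed to be skipped.

```python
import string
from collections import Counter

# Assumed alphabet: ASCII lowercase letters plus the space character.
ALLOWED = set(string.ascii_lowercase + " ")

def lower_space_pairs(text: str) -> Counter:
    """Count overlapping two-character sequences of letters and spaces."""
    lowered = text.lower()
    pairs = Counter()
    for a, b in zip(lowered, lowered[1:]):
        # Assumption: a pair is counted only if both characters are allowed.
        if a in ALLOWED and b in ALLOWED:
            pairs[a + b] += 1
    return pairs

pairs = lower_space_pairs("hello world")
# Keys include "he", "el", "ll", "lo", "o ", " w", "wo", "or", "rl", "ld".
```

Because only adjacent pairs are stored, the number of possible tokens is bounded by the square of the alphabet size, regardless of vocabulary.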

Punctuation anchors

Punctuation-based groups count letters near punctuation. One processor may count the last letter before punctuation, such as e., t?, or s!. Another may count the first letter of a word attached to punctuation, such as h! from hello!. These features measure sentence endings, clause endings, and punctuation habits.
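A sketch of the two punctuation processors follows. The function names, the punctuation set, and the decision to lowercase the text before matching are assumptions made for illustration; the source does not specify them.

```python
import re
from collections import Counter

PUNCT = ".,;:!?"  # assumed punctuation set
_CLASS = re.escape(PUNCT)

def last_letter_before_punct(text: str) -> Counter:
    """Count tokens such as "e." or "t?": a letter directly followed by punctuation."""
    pattern = re.compile(rf"([a-z])([{_CLASS}])")
    return Counter(m.group(1) + m.group(2) for m in pattern.finditer(text.lower()))

def first_letter_with_punct(text: str) -> Counter:
    """Count tokens such as "h!" from "hello!": first letter of a word plus its trailing punctuation."""
    pattern = re.compile(rf"\b([a-z])[a-z]*([{_CLASS}])")
    return Counter(m.group(1) + m.group(2) for m in pattern.finditer(text.lower()))

endings = last_letter_before_punct("Wait. What? Yes!")
starts = first_letter_with_punct("hello!")
```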

Word-edge patterns

Word-edge processors reduce words to boundary structure. A processor may count the first and last letter of each word, so middle becomes me. Another may count the first letter of one word paired with the first letter of the next word, so hello world again becomes hw and wa. These features preserve skeletal word shape and local word rhythm without storing complete words.
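The two word-edge processors described above can be sketched as follows. The function names are hypothetical, and words are assumed to be runs of ASCII letters after lowercasing; a real implementation may tokenize differently.

```python
import re
from collections import Counter

def words_of(text: str) -> list[str]:
    """Assumed tokenization: runs of ASCII letters, lowercased."""
    return re.findall(r"[a-z]+", text.lower())

def first_last(text: str) -> Counter:
    """First and last letter of each word, so "middle" becomes "me"."""
    return Counter(w[0] + w[-1] for w in words_of(text))

def first_first(text: str) -> Counter:
    """First letter of each word paired with the first letter of the next word."""
    ws = words_of(text)
    return Counter(a[0] + b[0] for a, b in zip(ws, ws[1:]))

edges = first_last("middle")
rhythm = first_first("hello world again")
```

Under this sketch a one-letter word contributes its single letter twice (for example, "a" becomes "aa"), which is one possible convention among several.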

Segment and word skeletons

Some processors capture broader boundaries, such as the first and last character of punctuation-bounded segments. Others record the first, middle, and final character of each word. These features act as compressed word or phrase skeletons.

Capitalization patterns

The capPairs group removes non-uppercase characters and counts adjacent uppercase transitions. This can capture abbreviations, acronyms, title-case behavior, and capitalized sequences.
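A minimal sketch of this processor is shown below, assuming that "uppercase" means the ASCII letters A to Z; the function name cap_pairs is illustrative. Because non-uppercase characters are removed first, adjacent pairs can span word boundaries.

```python
from collections import Counter

def cap_pairs(text: str) -> Counter:
    """Keep only uppercase ASCII letters, then count adjacent pairs."""
    caps = [ch for ch in text if "A" <= ch <= "Z"]
    return Counter(a + b for a, b in zip(caps, caps[1:]))

caps = cap_pairs("The NASA Report")
# The uppercase stream is T, N, A, S, A, R.
```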

Unicode handling

PTT can handle characters outside a configured token set in more than one way. A simple fallback approach sends unknown characters into an OTHER bucket. This keeps the model compact but collapses many distinct characters into the same token.

A second approach uses deterministic modulo mapping. In this mode, an unknown character is mapped into the existing key space by using its character code and the length of the allowed key string:

mappedChar = keyString[ord(char) % keyLength]

This does not translate the character and does not preserve linguistic meaning. It acts as a bounded structural hashing method. The same unknown character maps to the same replacement each time, allowing the system to measure symbol streams without expanding the model into the full Unicode range.
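Both strategies can be sketched in a few lines. The key string below is a hypothetical 37-character key space chosen for illustration, and the function names are not taken from the source.

```python
import string

# Hypothetical key space: lowercase letters, digits, and the space (37 keys).
KEY_STRING = string.ascii_lowercase + string.digits + " "

def map_other(ch: str, key_string: str = KEY_STRING) -> str:
    """Fallback approach: every unknown character collapses into one bucket."""
    return ch if ch in key_string else "OTHER"

def map_modulo(ch: str, key_string: str = KEY_STRING) -> str:
    """Modulo approach: mappedChar = keyString[ord(char) % keyLength]."""
    if ch in key_string:
        return ch
    return key_string[ord(ch) % len(key_string)]

mapped = map_modulo("é")  # ord("é") is 233, and 233 % 37 is 11, so this is "l"
```

The mapping is deterministic but not injective: any two unknown characters whose code points differ by a multiple of the key length land on the same key.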

Training process

Training begins with one or more labeled text collections. In a directory-based implementation, each folder represents a category, and the folder name becomes the dataset label.

Each training sample is processed through the enabled feature groups. Empty strings and duplicate-marker strings may be ignored. The feature processors produce token counts, and the counts are accumulated across the dataset.

The accumulated counts are then normalized:

probability = tokenCount / totalTokenCount

Normalization allows datasets of different sizes to be compared. A small dataset and a large dataset can both be represented as ratios. The model compares distribution shape rather than raw text volume.

The trained result stores the dataset label, feature groups, token lists, total counts, and probability arrays.
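The accumulation and normalization steps for a single feature group might look like the following sketch. The function train_group and the result keys are hypothetical names, samples are passed in memory rather than read from folders, and the processor is any callable returning token counts (here, plain character counting).

```python
from collections import Counter

def train_group(samples, processor):
    """Accumulate token counts over samples, then normalize to probabilities."""
    counts = Counter()
    for text in samples:
        if not text.strip():
            continue  # empty strings are ignored
        counts.update(processor(text))
    total = sum(counts.values())
    tokens = sorted(counts)  # assumed: a fixed, sorted token order
    probabilities = [counts[t] / total for t in tokens] if total else []
    return {"tokens": tokens, "total": total, "probabilities": probabilities}

# Counter(text) counts individual characters, standing in for a real processor.
model = train_group(["abca", "", "bc"], Counter)
```

In this example the three tokens a, b, and c each occur twice out of six tokens, so each receives a probability of one third.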

Model storage and compaction

A readable PTT model stores each feature group as a token list and a corresponding probability array. The token list defines the meaning of each position. The probability array stores the value for that token in the same order.

For example:

tokens = ["a", "b", "c"]
probabilityMatrix["DatasetA"] = [0.10, 0.25, 0.65]

This means that, in the relevant feature group, DatasetA has a distribution of 10% a, 25% b, and 65% c.

The compact representation removes repeated keys and labels. All candidates are aligned into one shared token order for each feature group. Missing values are stored as zero. Probabilities are then multiplied by a fixed scale and rounded into integers:

storedValue = round(probability × scale)

A scale such as 100000 can preserve small differences while reducing storage size. Since cosine similarity measures vector direction, multiplying all values in a vector by the same constant does not alter the comparison. The scale therefore cancels during cosine scoring, and the only change introduced is the small rounding error from converting to integers.

A compact bundle may store:

candidates
groupOrder
tokens
matrix
totals

The candidate list defines row order. The token list defines column order. The matrix stores the scaled values.
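The alignment step for one feature group can be sketched as below. This is an illustration under assumptions: build_group is a hypothetical name, candidates and tokens are placed in sorted order, and the groupOrder and totals fields of a full bundle are omitted.

```python
def build_group(models, scale=100_000):
    """Align candidates to a shared token order and store scaled integers.

    models maps each candidate label to a {token: probability} dict.
    """
    candidates = sorted(models)                       # row order
    tokens = sorted({t for m in models.values() for t in m})  # column order
    matrix = [
        [round(models[c].get(t, 0.0) * scale) for t in tokens]  # missing -> 0
        for c in candidates
    ]
    return {"candidates": candidates, "tokens": tokens, "matrix": matrix}

group = build_group({
    "DatasetA": {"a": 0.10, "b": 0.25, "c": 0.65},
    "DatasetB": {"a": 0.50, "d": 0.50},
})
```

Token d never appears in DatasetA and tokens b and c never appear in DatasetB, so those cells are stored as zero.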

Scoring

Scoring applies the same feature processors used during training to a new input text. The input becomes a query vector for each feature group. Candidate vectors already exist in the trained bundle.

The main comparison uses cosine similarity:

cosine(q, c) = dot(q, c) / (||q|| × ||c||)

Here, q is the query vector and c is a candidate vector. Cosine similarity compares direction rather than size, which allows short and long texts to be compared by proportional structure.
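A small worked example, written as a sketch: the query is a vector of raw counts aligned to the token order, and it is scored against the same candidate row in both readable (probability) and compact (scaled integer) form. Returning 0.0 for an all-zero vector is an assumed convention, not something the source specifies.

```python
import math

def cosine(q, c):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(q, c))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_c = math.sqrt(sum(y * y for y in c))
    if norm_q == 0 or norm_c == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_q * norm_c)

tokens = ["a", "b", "c"]
query_counts = {"a": 2, "b": 5, "c": 13}          # raw counts from an input text
q = [query_counts.get(t, 0) for t in tokens]

probabilities = [0.10, 0.25, 0.65]                # readable candidate row
scaled_row = [10000, 25000, 65000]                # same row at scale 100000

score_probs = cosine(q, probabilities)
score_scaled = cosine(q, scaled_row)
```

The query counts here are exactly proportional to the candidate row, so both scores are 1 up to floating-point error, and the two representations agree because the scale factor cancels.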

The similarity scores are then converted through softmax:

pᵢ = exp((sᵢ - maxScore) × sharpness) / Σ exp((sⱼ - maxScore) × sharpness)

The sharpness parameter controls contrast. Higher values make the strongest match stand out more. Lower values keep the distribution softer.
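The formula can be transcribed directly, as in the sketch below. Subtracting the maximum score does not change the result, since the same factor appears in the numerator and denominator; it only keeps the exponentials in a safe numeric range. The example scores and sharpness values are illustrative.

```python
import math

def softmax(scores, sharpness=1.0):
    """p_i = exp((s_i - max) * sharpness) / sum_j exp((s_j - max) * sharpness)."""
    top = max(scores)
    exps = [math.exp((s - top) * sharpness) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.9, 0.8, 0.5]           # cosine similarities for three candidates
soft = softmax(scores, sharpness=1.0)
sharp = softmax(scores, sharpness=20.0)
```

With the higher sharpness, the top candidate takes a much larger share of the distribution, while the ranking of candidates stays the same.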

Each feature group produces its own distribution. The final score is produced by averaging or weighting those group outputs and normalizing the result so all candidate scores sum to one.
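One way to combine the group outputs is a weighted sum followed by renormalization, sketched below. The function name combine, the default of equal weights, and the example group names are assumptions for illustration.

```python
def combine(group_dists, weights=None):
    """Weighted average of per-group distributions, renormalized to sum to one.

    group_dists maps each feature-group name to a list of candidate
    probabilities, all in the same candidate order.
    """
    names = list(group_dists)
    if weights is None:
        weights = {g: 1.0 for g in names}  # assumed default: equal weights
    n = len(group_dists[names[0]])
    combined = [
        sum(weights[g] * group_dists[g][i] for g in names) for i in range(n)
    ]
    total = sum(combined)
    return [v / total for v in combined]

dists = {"lowerSpacePairs": [0.6, 0.4], "capPairs": [0.2, 0.8]}
equal = combine(dists)
weighted = combine(dists, {"lowerSpacePairs": 3.0, "capPairs": 1.0})
```

With equal weights the result is the plain average of the two groups; tripling the weight of the first group pulls the final distribution toward that group's view.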

Flat and nested scoring

PTT supports flat scoring and nested scoring.

Flat scoring compares all candidate rows in one shared competition. This is suitable when the model is a single flat bundle.

Nested scoring is used when a model is built from other models. In that case, a parent model with more candidate rows could dominate a flat comparison. Nested scoring first scores candidates inside each parent model and then averages the parent outputs. This gives each parent model a more balanced influence.

Flat scoring asks which candidate row wins. Nested scoring asks which label wins after each parent model has contributed.
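The difference can be illustrated with a sketch in which each parent model is reduced to a dictionary of label-to-similarity scores. The function names and the choice to sum probabilities for labels shared across parents are assumptions; the example uses one parent with four rows and another with a single row, all with identical similarity.

```python
import math

def softmax(scores, sharpness=1.0):
    top = max(scores)
    exps = [math.exp((s - top) * sharpness) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def flat_score(parents, sharpness=1.0):
    """All candidate rows compete in a single softmax."""
    rows = [(label, s) for parent in parents for label, s in parent.items()]
    probs = softmax([s for _, s in rows], sharpness)
    result = {}
    for (label, _), p in zip(rows, probs):
        result[label] = result.get(label, 0.0) + p
    return result

def nested_score(parents, sharpness=1.0):
    """Softmax inside each parent, then average the parent outputs."""
    result = {}
    for parent in parents:
        labels = list(parent)
        probs = softmax([parent[lab] for lab in labels], sharpness)
        for label, p in zip(labels, probs):
            result[label] = result.get(label, 0.0) + p / len(parents)
    return result

parents = [
    {"A1": 0.7, "A2": 0.7, "A3": 0.7, "A4": 0.7},  # parent with four rows
    {"B": 0.7},                                     # parent with one row
]
flat = flat_score(parents)
nested = nested_score(parents)
```

In the flat result the four-row parent holds four fifths of the probability simply because it has more rows; in the nested result each parent contributes one half.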

Candidate exclusion

Candidate exclusion allows selected candidates to be removed before scoring. This can be done by name or by position. For example, if candidates are arranged as:

A, B, C, D, E, F, G, H, I, J, K

and the exclusion positions are:

[0, 2, 4, 6, 8, 10]

then the scorer removes A, C, E, G, I, and K, leaving B, D, F, H, and J.

This is useful for paired datasets, alternate versions, or experimental conditions. Since softmax is relative, exclusion changes the comparison field and should be treated as part of the experimental setup rather than as a display-only filter.
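The exclusion example above can be reproduced with a short sketch. The function name exclude and its keyword arguments are hypothetical; positions are assumed to be zero-based, as in the example.

```python
def exclude(candidates, names=(), positions=()):
    """Return the candidates that remain after exclusion by name or position."""
    drop_names = set(names)
    drop_positions = set(positions)
    return [
        c for i, c in enumerate(candidates)
        if i not in drop_positions and c not in drop_names
    ]

candidates = list("ABCDEFGHIJK")
kept_by_position = exclude(candidates, positions=[0, 2, 4, 6, 8, 10])
kept_by_name = exclude(candidates, names=["A", "K"])
```

In a full scorer the corresponding matrix rows would be dropped before the softmax step, so the remaining candidates share the whole probability mass.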

Balance and scale

PTT scores are shaped by the number and balance of reference sets. A two-category model produces a forced comparison. This may be useful for narrow tests, but it can exaggerate differences because every input must lean toward one side.

As the number of categories increases, the output becomes more map-like. A text can resemble one category strongly, fall between several categories, or fail to align clearly with any of them. This is useful for dataset analysis and mixed-signal interpretation.

Large models require balance. If one class contains many near-duplicate categories and another class contains only one, the larger class can pull probability toward itself. Nested scoring and parent-model weighting are proposed ways to reduce this problem.

A balanced PTT model should consider:

category coverage;
contrast sets;
feature-group influence;
parent-model weight;
duplicate or near-duplicate categories;
sample-size differences.

Interpretation of results

PTT output should be interpreted as structural resemblance, not as proof of identity, motive, authorship, or truth.

A result such as 52% Scholarly and 33% Bibliographic means the text’s measured structure most resembles those reference sets under the current feature rules. It does not mean the text is objectively scholarly, and it does not prove that it came from a bibliography.

The model is only as useful as its reference data. If the training sets are biased, narrow, mislabeled, duplicated, or contaminated by platform artifacts, the output may reflect those defects.

Applications

PTT has been proposed for several uses.

Moderation triage

PTT can identify text that structurally resembles known low-quality, hostile, spam-like, or manipulative datasets. Because it does not depend on prohibited words, it may detect some disguised patterns that keyword filters miss.

Synthetic-text detection

The method may help detect cadence, punctuation regularity, capitalization habits, and other structural features associated with machine-generated or templated text. It does not prove machine authorship by itself.

Dataset comparison

PTT can compare datasets to determine whether one set structurally resembles another. This may be useful for detecting overlap, contamination, formatting artifacts, or mixed-category behavior.

Style comparison

The method can compare texts against known writing sets without requiring a full authorship model. This places it near stylometric work, although PTT emphasizes compact structural token spaces rather than broader linguistic interpretation.

Client-side classification

Because compact PTT bundles can be small, they may run locally in a browser or script. This reduces reliance on server-side inference and can lower cost, latency, and logging exposure.

Preprocessing for semantic systems

PTT can act as a first-pass structural filter before larger semantic models. Text with clear structural alignment can be routed automatically, while ambiguous or high-risk cases can be escalated to a larger classifier.

PTT can also supply metadata to a semantic model. For example, a pipeline could send the raw text together with structural scores such as Scholarly 52%, Bibliographic 33%, Toxic 0.6%, and Spam 0.0%. The semantic model reads meaning while PTT supplies form.
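A routing rule of this kind might be sketched as follows. Everything here is hypothetical: the function name route, the 0.6 threshold, and the set of high-risk labels are placeholders that a real pipeline would choose through validation.

```python
def route(scores, threshold=0.6, high_risk=("Toxic", "Spam")):
    """Route automatically on a clear, low-risk match; otherwise escalate."""
    label, prob = max(scores.items(), key=lambda kv: kv[1])
    if prob < threshold or label in high_risk:
        return "escalate"
    return "auto:" + label

ambiguous = route(
    {"Scholarly": 0.52, "Bibliographic": 0.33, "Toxic": 0.006, "Spam": 0.0}
)
clear = route({"Scholarly": 0.9, "Toxic": 0.1})
risky = route({"Scholarly": 0.1, "Toxic": 0.9})
```

The same score dictionary could be attached to the raw text as metadata when a case is escalated to a semantic model.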

Development testing

Development testing reportedly showed that PTT performs best when enough text is available for the structural signal to stabilize. Very short samples can be scored, but the results may shift as more text is added.

In tests involving MBTI-style datasets, PTT was evaluated against noisy self-labeled writing sets. One design reduced sixteen MBTI types into eight families: NFJ, NFP, NTJ, NTP, SFJ, SFP, STJ, and STP. This was done because the introversion/extraversion axis may be weaker in text structure than the N/S, T/F, and J/P axes.

Reported test observations included:

a known INTJ sample returning NTJ across tested model sets;
a known ENFP sample returning NFP under a no-I/E family model;
some INTP-related samples showing NTP signals internally even when the flat final output differed;
longer samples producing more stable results than shorter samples;
some claimed or self-reported types producing mixed neighboring-region distributions.

These observations are not a substitute for formal benchmarking. They are development findings and should be interpreted as exploratory. A controlled validation would require separated training and test sets, clear ground truth, duplicate removal, source separation, and comparison against baseline models.

Proposed validation methods

Future validation of PTT can be separated into three areas.

First, speed and size can be measured directly by testing model size, load time, memory use, and rows scored per second.

Second, classification accuracy can be tested with held-out labeled datasets. Training and test data must be separated. Duplicate text, repeated authors, or shared source material should be controlled.

Third, robustness against semantic disguise can be tested through adversarial rewriting. Samples can be paraphrased, cleaned, padded, shortened, slang-shifted, or translated. PTT and semantic classifiers can then be compared to see which signal changes more.

For MBTI-style testing, validation can be separated into exact sixteen-type matching, eight-family matching without I/E, and dimension-level matching. A model that misses exact type but recovers NTJ, NTP, or another family may still contain useful structural signal.
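The three matching levels can be expressed compactly, as in this sketch. It assumes four-letter type codes with the I/E letter in the first position, so that the eight-family label is the type with its first letter removed; the function name match_levels is illustrative.

```python
def match_levels(predicted: str, actual: str) -> dict:
    """Compare two four-letter type codes at three levels of strictness."""
    return {
        "exact": predicted == actual,                    # sixteen-type match
        "family": predicted[1:] == actual[1:],           # match ignoring I/E
        "dimensions": sum(p == a for p, a in zip(predicted, actual)),
    }

result = match_levels("ENTJ", "INTJ")
```

Here the exact match fails, but the prediction still falls in the NTJ family and agrees on three of the four dimensions.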

Useful reports should include final winners, top-three scores, entropy, top-two margin, feature-group winners, flat-versus-nested comparison, and exclusion-condition results.
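Several of these report fields can be computed directly from a final distribution, as sketched below. The function name report and the field names are hypothetical, and entropy is measured in bits (base-2 logarithm) as an assumed convention.

```python
import math

def report(dist):
    """Summarize a label-to-probability distribution."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    entropy = -sum(p * math.log2(p) for _, p in ranked if p > 0)
    margin = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return {
        "winner": ranked[0][0],
        "top_three": ranked[:3],
        "entropy_bits": entropy,
        "top_two_margin": margin,
    }

summary = report({"A": 0.5, "B": 0.25, "C": 0.25})
```

Low entropy and a wide top-two margin indicate a concentrated result; high entropy and a narrow margin indicate that the text falls between several reference sets.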

Limitations

PTT has several limitations.

Short text is unstable. A short sentence may not contain enough structure to support a confident result.

Training-data quality is central. If a category contains repeated formatting, copied text, platform artifacts, narrow topics, or mislabeled examples, the model may learn those artifacts.

Feature choice affects results. A model trained only on punctuation may behave differently from one trained on punctuation, letter pairs, capitalization, and word-edge features.

Unicode modulo mapping preserves deterministic measurement but does not preserve linguistic meaning.

PTT measures surface geometry. It does not replace semantic review, human judgment, domain knowledge, legal analysis, or controlled validation.

Significance

PTT is significant as a proposed lightweight approach to text classification. It reframes text as a measurable structural object rather than a semantic message. This gives it possible value in areas where meaning is expensive, misleading, intentionally disguised, or unnecessary for the first stage of classification.

Its main strengths are compactness, repeatability, speed, and inspectability. Its main weakness is dependence on the quality and balance of reference datasets. The method is therefore best understood as a structural classifier, preprocessing layer, or secondary signal rather than a universal replacement for semantic models.
