
The mathematical foundations
of dataset integrity.

Fifteen forensic categories. Ninety-five fingerprints. Five decades of mathematics — signal processing, information theory, topological data analysis, statistical process control — unified into a single cryptographically sealed verdict. No prior system has implemented all 95. This page documents the science behind each one.

95
Forensic fingerprints
across 15 categories
12
Prior world record
(our own earlier work)
95
Statistical tests
already built
23M
Parameter architecture
no GPU required
Abstract

We present ResEthiq, a comprehensive dataset integrity infrastructure for high-stakes AI. The system implements 95 forensic fingerprints across 15 mathematical categories — the largest such collection in any academic, commercial, or open-source system. Fingerprints span frequency-domain analysis, information-theoretic measures, topological data analysis, distributional geometry, temporal dynamics, inter-record geometry, cross-column dependency, precision forensics, missing-data forensics, generative model detection, human fabrication detection, domain physical constraints, graph-theoretic properties, statistical process control, and cryptographic integrity. All fingerprints feed a Bayesian evidence accumulation framework with Benjamini-Hochberg FDR correction, producing a single posterior probability of dataset integrity. The verdict is cryptographically sealed via a Merkle tree root and Ed25519 signature, producing a Signed Policy Object (SPO) that is independently verifiable by any party holding the public key, offline, without any ResEthiq dependency. We demonstrate that this architecture satisfies four invariants: A (Determinism), B (Canonical Encoding), C (Zero Server Trust), and D (Version Discipline).

Authors: ResEthiq Research Team
Status: Preprint — available on request
Version: v1.4.2 · Trust Kernel
Domain: Data Integrity · AI Safety · Forensic Statistics
CATEGORY 01 / 15
Phase A

Frequency Domain Analysis

Synthetic data generation — particularly GANs and diffusion models — leaves systematic artifacts in the frequency domain that are invisible to the human eye but detectable through spectral analysis. The 1/f power law governs natural signals; deviations from this law indicate non-natural generation processes.

Core Hypothesis

Real-world datasets exhibit 1/f (pink) noise structure in their power spectra. GAN-generated data exhibits characteristic peaks at GAN training frequencies. VAE-generated data shows over-smoothing in high-frequency components. These signatures are detectable via Fourier, Wavelet, and Hilbert-Huang decomposition at sub-column granularity.

// F01: Fourier Power Spectrum — 1/f noise test
S(f) ∝ 1/f^α, where α ≈ 1.0 for natural data
log S(f) = -α · log(f) + C
ResEthiq tests: |α - 1.0| < threshold via OLS on log-log plot

// F04: GAN checkerboard artifact detection
peak_detect(|FFT(X)|²) at f = N/stride
Periodogram peaks at generator stride frequencies → synthetic flag

// F05: Hilbert-Huang EMD — non-stationary decomposition
X(t) = Σₖ IMFₖ(t) + r(t)
Instantaneous frequency distribution tested against empirical naturality priors
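A minimal runnable sketch of the F01 slope test in NumPy. The helper name `spectral_slope`, the low-frequency fit window, and the seed are illustrative assumptions, not ResEthiq's implementation:

```python
import numpy as np

def spectral_slope(x):
    """F01 sketch: estimate alpha in S(f) ~ 1/f**alpha by OLS on the
    log-log periodogram, fitted over low frequencies where the power
    law holds for a discretely sampled signal."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    psd = np.abs(np.fft.rfft(x)) ** 2              # power spectrum S(f)
    freqs = np.fft.rfftfreq(x.size)
    keep = (freqs > 0) & (freqs <= 0.1)            # drop DC, keep scaling range
    slope, _ = np.polyfit(np.log(freqs[keep]), np.log(psd[keep]), 1)
    return -slope                                  # log S = -alpha*log f + C

rng = np.random.default_rng(0)
white = rng.normal(size=4096)                      # alpha near 0
brown = np.cumsum(white)                           # integrated walk: alpha near 2
print(spectral_slope(white), spectral_slope(brown))
```

White noise should fit with α near 0 and an integrated random walk near 2; the production test would compare the fitted α against a domain-calibrated threshold rather than these toy signals.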
Code | Fingerprint | What it catches | Phase
F01 | Fourier Power Spectrum | 1/f noise structure, log-log slope deviation | A
F02 | Wavelet Decomposition | Multi-scale energy distribution anomalies | A
F03 | Spectral Entropy | Randomness concentration in frequency domain | A
F04 | Periodogram Peaks | GAN training frequency artifacts, grid patterns | A
F05 | Hilbert-Huang EMD | Non-linear, non-stationary signal fabrication | A
References: Durall et al. (2019) "Watch your Up-Convolution" · Dzanic et al. (2020) "Fourier Spectrum Discrepancies in Deep Network Generated Images" · Huang et al. (1998) "The empirical mode decomposition"

CATEGORY 02 / 15
Phase A

Information Theory

Information-theoretic measures expose the fundamental complexity structure of a dataset. Real-world data has characteristic entropy gradients across columns and scales. Synthetic data exhibits entropy signatures that are either too uniform (over-regularised generators) or too structured (mode collapse). Human fabrication produces entropy that is distinctly sub-natural — humans cannot generate true randomness.

Core Hypothesis

The joint entropy structure of real tabular data is governed by the underlying data-generating process (physical, biological, economic). Fabricated data — whether machine-generated or human-created — fails to reproduce this structure. Shannon entropy per column, Kolmogorov complexity via compression ratio, and transfer entropy between columns together form a three-layer information-theoretic screen that no fabrication method has been shown to defeat simultaneously.

// I01: Shannon Entropy per column
H(X) = -Σ p(x) · log₂ p(x)
Flag: H(X) deviates > 2σ from domain-calibrated prior

// I02: Kolmogorov Complexity proxy (LZ77 compression)
K(X) ≈ |LZ77(X)| / |X|
Synthetic data: K(X) consistently low (over-smooth generators)
Human fab: K(X) intermediate but with characteristic patterns

// I05: Transfer Entropy (directed information flow)
TE(X→Y) = H(Yₜ | Yₜ₋₁) - H(Yₜ | Yₜ₋₁, Xₜ₋₁)
Inter-column causal structure compared to expected domain graph

// I07: Permutation Entropy
H_perm(X, m) = -Σ p(π) · ln p(π), over all ordinal patterns π of length m
Forbidden patterns test: certain ordinal patterns cannot appear in natural data
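The I01 and I02 screens can be sketched as follows. This is an illustrative stand-in: `zlib`'s DEFLATE (an LZ77 derivative) substitutes for the LZ77 coder named above, and the function names and toy columns are hypothetical:

```python
import math
import random
import zlib
from collections import Counter

def shannon_entropy(values):
    """I01 sketch: Shannon entropy (bits) of a column's value distribution."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def complexity_ratio(values):
    """I02 sketch: Kolmogorov-complexity proxy as a compression ratio;
    DEFLATE stands in for the LZ77 coder."""
    raw = ",".join(map(str, values)).encode()
    return len(zlib.compress(raw, 9)) / len(raw)

rng = random.Random(0)
natural = [round(rng.uniform(0, 100), 2) for _ in range(256)]
fabricated = [10.0, 10.5, 11.0, 11.5] * 64       # copy-increment style entry
print(shannon_entropy(natural), shannon_entropy(fabricated))
print(complexity_ratio(natural), complexity_ratio(fabricated))
```

The fabricated column collapses to 2 bits of entropy and compresses far below the natural one — the "too structured" signature described above.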
Code | Fingerprint | What it catches | Phase
I01 | Shannon Entropy | Too-uniform distributions — over-regularised generators | A
I02 | Kolmogorov Complexity | Compression ratio deviation from natural complexity | A
I03 | Sample Entropy | Over-regular sequences — synthetic temporal data | A
I04 | Approximate Entropy | Predictability measure — human fabrication signature | A
I05 | Transfer Entropy | Directed causal flow breakdown between columns | A
I06 | Multi-scale Entropy | Complexity collapse at coarser time scales | A
I07 | Permutation Entropy | Forbidden ordinal patterns in natural data | A
References: Shannon (1948) "A Mathematical Theory of Communication" · Kolmogorov (1968) · Schreiber (2000) "Measuring Information Transfer" · Bandt & Pompe (2002) "Permutation Entropy"

CATEGORY 03 / 15
Phase A

Geometric / Manifold Analysis

High-dimensional data lies on a low-dimensional manifold whose geometric properties are determined by the underlying data-generating process. Topological Data Analysis (TDA) — specifically persistent homology — provides a summary of this geometry that is invariant to rotation, translation, and smooth deformation. Real datasets have characteristic topological signatures. Synthetic generation or manipulation deforms these signatures.

Core Hypothesis

Persistent homology (Betti numbers across filtration scales) captures the topological fingerprint of a dataset. Real tabular data manifolds exhibit consistent Betti-0 (connected components), Betti-1 (loops), and Betti-2 (voids) profiles across filtration scales. Synthetic generators fail to reproduce these profiles — generative models produce manifolds that are either too smooth (VAEs, diffusion) or fragmented (mode-collapsed GANs).

// G02: Persistent Homology (Betti numbers)
β₀ = connected components, β₁ = independent loops / cycles, β₂ = enclosed voids
Persistence diagram = {(birth, death) : feature born at ε₁, dies at ε₂}
Wasserstein distance between empirical and reference persistence diagrams

// G04: Fractal Dimension (box-counting)
D_f = lim_{ε→0} log N(ε) / log(1/ε)
Natural datasets: D_f consistent with known domain priors
Synthetic: D_f typically integer or near-integer (generator manifold)

// G01: Local Intrinsic Dimensionality variance
LID(x) = -[E(log R(x, k))]⁻¹
High LID variance → data lies on multiple disconnected manifolds
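The G04 box-counting estimator fits in a few lines of NumPy. The scales, helper names, and toy point clouds are illustrative assumptions, not the production estimator:

```python
import numpy as np

def box_counting_dimension(points, scales=(1, 2, 4, 8, 16)):
    """G04 sketch: box-counting estimate of the fractal dimension D_f.
    points: (n, d) array, rescaled here into the unit hypercube."""
    pts = np.asarray(points, dtype=float)
    pts = (pts - pts.min(axis=0)) / (np.ptp(pts, axis=0) + 1e-12)
    counts = []
    for s in scales:
        boxes = np.clip(np.floor(pts * s).astype(int), 0, s - 1)
        counts.append(len({tuple(b) for b in boxes}))
    # D_f = slope of log N(eps) vs log(1/eps), with eps = 1/s
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

rng = np.random.default_rng(1)
diagonal = np.column_stack([np.linspace(0, 1, 2000)] * 2)   # a line: D_f = 1
cloud = rng.random((2000, 2))                               # fills the plane: D_f = 2
print(box_counting_dimension(diagonal), box_counting_dimension(cloud))
```

A one-dimensional manifold embedded in 2D recovers D_f ≈ 1, a space-filling cloud D_f ≈ 2 — the near-integer values that flag generator manifolds.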
Code | Fingerprint | What it catches | Phase
G01 | Local Intrinsic Dimensionality | Multiple disconnected sub-manifolds — mode collapse | A
G02 | Persistent Homology | Topological deformation from natural Betti profile | A
G03 | Manifold Smoothness | Over-smooth neighborhoods from diffusion models | A
G04 | Fractal Dimension | Integer-dimensional generator manifolds | A
G05 | Lacunarity | Texture and gap structure abnormalities | A
G06 | Multifractal Spectrum | Singularity spectrum width — natural vs synthetic | A
G07 | Topological Data Depth | Tukey depth distribution anomalies | A
References: Edelsbrunner & Harer (2010) "Computational Topology" · Carlsson (2009) "Topology and Data" · Chazal et al. (2014) "Stochastic Convergence of Persistence Landscapes"

CATEGORY 04 / 15
Phase B

Distributional Geometry

Covariance structure, copula dependencies, and tail behaviour encode the joint distributional fingerprint of a dataset. The Marchenko-Pastur law governs the eigenspectrum of random covariance matrices — deviations indicate structured dependencies that should not exist in the declared data-generating process. Wasserstein distance provides an optimal-transport metric between empirical and reference distributions.
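The Marchenko-Pastur screen behind D01 can be sketched as follows, using a simple one-factor model as the planted structure (helper names and the factor loading are illustrative assumptions):

```python
import numpy as np

def mp_bulk_edges(n_features, n_samples):
    """Marchenko-Pastur bulk edges for a white-noise correlation
    eigenspectrum, aspect ratio q = n_features / n_samples."""
    q = n_features / n_samples
    return (1 - q ** 0.5) ** 2, (1 + q ** 0.5) ** 2

def excess_structure(X):
    """D01 sketch: count eigenvalues above the MP upper edge —
    hidden correlation that pure noise cannot produce."""
    n_samples, n_features = X.shape
    evals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    _, upper = mp_bulk_edges(n_features, n_samples)
    return int((evals > upper).sum())

rng = np.random.default_rng(7)
noise = rng.normal(size=(2000, 50))        # unstructured baseline
factor = rng.normal(size=(2000, 1))        # one hidden common factor
structured = noise + 0.8 * factor
print(excess_structure(noise), excess_structure(structured))
```

The planted factor pushes one eigenvalue far outside the bulk, while the noise-only spectrum stays essentially inside the Marchenko-Pastur edges.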

Code | Fingerprint | What it catches | Phase
D01 | Covariance Eigenspectrum | Marchenko-Pastur bulk edge deviation — hidden structure | B
D02 | Correlation Frobenius Norm | Deviation from noise floor — unexpected correlation | B
D03 | Copula Structure | Joint distribution beyond marginals — dependency injection | B
D04 | Tail Index (Hill Estimator) | Heavy tail truncation — synthetic generators clip extremes | B
D05 | GEV Fit Quality | Extreme value distribution mismatch | B
D06 | Q-Q Deviation Score | Systematic quantile deviations from reference | B
D07 | Skewness/Kurtosis Surface | 3D moment landscape — synthetic data over-normalised | B
D08 | Wasserstein Distance | Optimal transport distance to reference distribution | B
References: Marchenko & Pastur (1967) · Sklar (1959) "Fonctions de répartition" · Hill (1975) "A Simple General Approach to Inference About the Tail"

CATEGORY 05 / 15
Phase B

Temporal / Sequential Analysis

The Hurst exponent (H) characterises long-range dependence. H = 0.5 indicates a random walk; H ≠ 0.5 indicates persistent (H > 0.5) or anti-persistent (H < 0.5) behaviour. Real-world time-series in regulated domains have well-characterised H values. Synthetic generators frequently produce H values inconsistent with the declared domain. The Lyapunov exponent measures chaos — fabricated sequences often fail to reproduce the characteristic chaos structure of real processes.
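The T01 Hurst estimate can be sketched via classical rescaled-range (R/S) analysis. Window sizes and helper names are assumptions, and the raw R/S estimator is biased upward on short windows (production code would apply an Anis-Lloyd style correction), so treat the numbers as illustrative:

```python
import numpy as np

def hurst_rs(x, windows=(16, 32, 64, 128, 256)):
    """T01 sketch: Hurst exponent as the log-log slope of the
    rescaled range R/S against window size."""
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in windows:
        ratios = []
        for start in range(0, x.size - n + 1, n):
            w = x[start:start + n]
            dev = np.cumsum(w - w.mean())      # cumulative deviation profile
            s = w.std()
            if s > 0:
                ratios.append((dev.max() - dev.min()) / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(ratios)))
    slope, _ = np.polyfit(log_n, log_rs, 1)
    return slope

rng = np.random.default_rng(3)
white = rng.normal(size=8192)                     # uncorrelated: H near 0.5
smoothed = np.convolve(white, np.ones(20) / 20)   # persistent at these scales
print(hurst_rs(white), hurst_rs(smoothed))
```

The smoothed (persistent) series fits a visibly higher H than the uncorrelated one — the kind of mismatch against a domain-expected H that T01 flags.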

Code | Fingerprint | What it catches | Phase
T01 | Hurst Exponent | Long-range dependence mismatch (H = 0.5 uncorrelated; mismatch with domain-expected H is suspect) | B
T02 | Lyapunov Exponent | Chaos measure — fabricated sequences lack natural chaos | B
T03 | Recurrence Quantification | Diagonal line structure in recurrence plot | B
T04 | Detrended Fluctuation Analysis | Scaling behaviour inconsistency across time scales | B
T05 | Symbolic Dynamics | Word frequency deviations in symbolic encoding | B
T06 | Ordinal Pattern Distribution | Forbidden patterns — impossible in natural sequences | B
T07 | Visibility Graph Degree | Network properties of time series — synthetic smoothing | B
T08 | Hjorth Parameters | Activity, mobility, complexity — biomedical signal fabrication | B
References: Hurst (1951) "Long-term storage capacity of reservoirs" · Peng et al. (1994) "Mosaic organisation of DNA nucleotides" · Lacasa et al. (2008) "Visibility graphs"

CATEGORIES 06 – 15 / 15

The remaining ten categories complete the 95-fingerprint matrix, covering inter-record relationships, cross-column dependencies, value precision forensics, missing-data patterns, generative model fingerprints, human fabrication patterns, domain physical constraints, graph-theoretic structure, full statistical process control, and cryptographic integrity.

CATEGORY 06 — Inter-Record Geometry (Phase C · 7 fingerprints)

The Hopkins statistic tests for spatial randomness in k-NN distance distributions. Near-duplicate detection via Jaccard and edit-distance similarity exposes copy-paste fabrication. Mahalanobis distance distribution shape reveals whether multivariate outlier density matches domain priors.

R01 k-NN Distance · R02 Duplicate Detection · R03 Record Spacing · R04 Mahalanobis · R05 UMAP Density · R06 MST Properties · R07 LOF Distribution
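A sketch of the Hopkins-style spatial-randomness test mentioned above. Sample sizes, seeds, and helper names are illustrative assumptions:

```python
import numpy as np

def hopkins(X, m=50, seed=0):
    """Hopkins statistic sketch: near 0.5 for spatially random data,
    near 1.0 for strongly clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sample = X[rng.choice(len(X), size=m, replace=False)]
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))

    def nearest(points, exclude_self=False):
        d = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            d[d == 0] = np.inf         # a sampled point is its own neighbour
        return d.min(axis=1)

    u = nearest(probes).sum()          # probe-to-data NN distances
    w = nearest(sample, exclude_self=True).sum()
    return u / (u + w)

rng = np.random.default_rng(4)
random_pts = rng.uniform(size=(400, 2))
clusters = rng.normal(scale=0.02, size=(400, 2)) + rng.integers(0, 2, size=(400, 1))
print(hopkins(random_pts), hopkins(clusters))
```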
CATEGORY 07 — Cross-Column Dependency (Phase C · 7 fingerprints)

The NMI matrix eigenspectrum reveals the information structure across columns. Granger causality testing exposes unexpected directed causal relationships. Partial Information Decomposition separates synergistic from redundant information — ratios inconsistent with the declared data-generating process indicate manipulation.

X01 NMI Eigenspectrum · X02 Granger Causality · X03 Distance Correlation · X04 Conditional Entropy · X05 Partial Info Decomp · X06 CCA · X07 Copula Tail
CATEGORY 08 — Precision & Representation (Phase A · 7 fingerprints)

Significant figure analysis and round-number density testing operationalise Benford's Law extensions into the precision domain. Real measurement instruments produce characteristic precision distributions. Human fabrication exhibits pile-up at powers of 10, integer values, and "memorable" reference points. Floating point bit pattern analysis detects imputation and synthetic generation at the binary level.

P01 Decimal Precision · P02 Significant Figures · P03 Round Number Density · P04 Float Bit Pattern · P05 Value Vocabulary · P06 String Length Dist · P07 Precision Drift
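The round-number density test (P03) reduces to counting values that land exactly on round anchors. The step size, helper name, and toy columns are illustrative:

```python
def round_number_density(values, step=0.5):
    """P03 sketch: fraction of values sitting exactly on 'round' anchors
    (multiples of step) — human fabrication piles up on these."""
    hits = sum(1 for v in values if abs(v / step - round(v / step)) < 1e-9)
    return hits / len(values)

instrument = [72.31, 68.94, 81.07, 75.62, 69.48, 77.13, 73.85, 70.26]
fabricated = [70.0, 75.0, 72.5, 80.0, 68.5, 75.0, 70.0, 72.5]
print(round_number_density(instrument), round_number_density(fabricated))
```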
CATEGORY 09 — Missing Data Forensics (Phase A · 6 fingerprints)

Little's MCAR test distinguishes Missing Completely At Random from structured absence. Logistic regression of missingness on observed values detects predictable gaps — a strong indicator of selective deletion or imputation. The Imputation Fingerprint detects prior imputation by identifying statistical signatures left by common imputation algorithms (mean, median, KNN, MICE).

M01 Little's MCAR · M02 Missing Pattern Matrix · M03 Logistic Regression · M04 Gap Clustering · M05 Imputation Fingerprint · M06 Boundary Missingness
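The mean-imputation case of M05 can be sketched by exploiting the fact that mean imputation leaves the column mean unchanged, so imputed cells appear as exact repeats of the mean. Helper name and tolerance are illustrative; real imputers such as KNN or MICE need richer signatures:

```python
import statistics

def mean_imputation_spike(column):
    """M05 sketch: fraction of cells that exactly repeat the column mean —
    the pile-up that mean imputation leaves behind."""
    mu = statistics.fmean(column)
    repeats = sum(1 for v in column if abs(v - mu) < 1e-12)
    return repeats / len(column)

observed = [1.0, 2.0, 3.0, 6.0]            # mean = 3.0
imputed = observed + [3.0, 3.0, 3.0]       # three gaps filled with the mean
print(mean_imputation_spike(observed), mean_imputation_spike(imputed))
```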
CATEGORY 10 — Generative Model Fingerprints (Phase A · 7 fingerprints)

GAN mode collapse produces an abnormally small number of unique value clusters relative to dataset size. GAN checkerboard artifacts appear as periodic peaks in the Fourier periodogram at strides matching the generator's upsampling layers. VAE posterior collapse is detected via latent dimension utilisation analysis. Memorisation detection searches for near-exact matches to known public datasets.

V01 Mode Collapse · V02 GAN Checkerboard · V03 VAE Posterior Collapse · V04 Diffusion Smoothness · V05 Memorisation · V06 Latent Geometry · V07 Generator Periodicity
CATEGORY 11 — Human Fabrication Fingerprints (Phase A · 7 fingerprints)

Cognitive psychology has established seven reliable signatures of human data generation. Anchor bias produces clustering around memorable reference points (Tversky & Kahneman 1974). Fatigue patterns manifest as degrading randomness over record sequence — early records are more random than late records. Copy-increment detection identifies value[n] ≈ value[n-1] ± small_delta, a common manual entry shortcut.

H01 Anchor Bias · H02 Avoidance Patterns · H03 Narrative Coherence · H04 Fatigue Pattern · H05 Copy-Increment · H06 Symmetric Bias · H07 Temporal Regularity
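The copy-increment shortcut (H05) can be detected by scanning consecutive deltas. The delta bound, helper name, and toy sequences are illustrative:

```python
def copy_increment_rate(values, max_delta=1.0):
    """H05 sketch: fraction of consecutive records where
    value[n] = value[n-1] +/- a small delta (a manual-entry shortcut)."""
    pairs = list(zip(values, values[1:]))
    hits = sum(1 for a, b in pairs if 0 < abs(b - a) <= max_delta)
    return hits / len(pairs)

organic = [14.2, 87.6, 3.9, 55.1, 41.7, 90.3, 22.8, 67.4]
manual = [50.0, 50.5, 51.0, 51.5, 52.0, 52.5, 53.0, 53.5]
print(copy_increment_rate(organic), copy_increment_rate(manual))
```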
CATEGORY 12 — Domain Physical Constraints (Phase B · 7 fingerprints)

Physical law violations are strong falsification events — a dataset where blood pressure increases monotonically with patient age, without exception, is not a real clinical dataset. Biological plausibility testing maps age-lab value co-occurrence matrices against known physiology. Temporal impossibility detection flags events that violate time ordering. Unit coherence testing detects mixed-unit columns.

C01 Biological Plausibility · C02 Physical Law Violations · C03 Temporal Impossibility · C04 Logical Consistency · C05 Unit Coherence · C06 Reference Range Clustering · C07 Inter-measurement Correlation
CATEGORY 13 — Graph / Network Properties (Phase C · 5 fingerprints)

Re-identification risk quantifies the uniqueness of quasi-identifier combinations — a necessary test for datasets declared to be de-identified. Record linkage graph analysis tests whether records in the dataset can be linked to external public datasets, exposing potential data provenance issues.

N01 Re-identification Risk · N02 Record Linkage Graph · N03 Social Network Fingerprint · N04 Attribute Correlation Graph · N05 Clique Structure
CATEGORY 14 — Statistical Process Control (Phase C · 5 fingerprints)

The full Western Electric rule battery (all 8 rules) and all 10 Nelson patterns are applied to every numeric column. Originally developed for manufacturing quality control in the 1950s, these rules detect systematic patterns that should not appear in random data. Applied to dataset columns, they expose systematic manipulation, data drift, and structural breaks. Hotelling T² extends SPC to multivariate contexts.

Q01 Western Electric (8 rules) · Q02 Nelson (10 patterns) · Q03 EWMA · Q04 Hotelling T² · Q05 CUSUM Battery
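Two of the eight Western Electric rules, sketched from their textbook definitions (the full battery and ResEthiq's exact rule numbering are not shown here; helper names are illustrative):

```python
def western_electric_flags(points, mean, sigma):
    """Q01 sketch of two classic Western Electric rules:
    Rule 1: any single point beyond 3 sigma.
    Rule 4: eight consecutive points on the same side of the mean."""
    rule1 = any(abs(p - mean) > 3 * sigma for p in points)
    rule4 = any(
        all(p > mean for p in points[i:i + 8]) or
        all(p < mean for p in points[i:i + 8])
        for i in range(len(points) - 7)
    )
    return {"rule1_3sigma": rule1, "rule4_run_of_8": rule4}

in_control = [0.2, -1.1, 0.7, -0.4, 1.3, -0.9, 0.1, -0.3, 0.8, -1.5]
drifted = [0.4, 0.9, 0.2, 1.1, 0.6, 0.3, 0.8, 0.5]   # 8 points above the mean
print(western_electric_flags(in_control, 0.0, 1.0))
print(western_electric_flags(drifted, 0.0, 1.0))
```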
CATEGORY 15 — Cryptographic Integrity (Phase A · 5 fingerprints)

Every cell is hashed after canonical encoding. Row hashes are chained sequentially. The Merkle tree root commits to the entire dataset. The Ed25519 signature seals the root along with the policy verdict. RFC 3161 timestamping anchors the sealed object to an external time reference — producing a Signed Policy Object (SPO) that is independently verifiable, offline, by anyone holding the public key.

CR01 Cell Hash · CR02 Row Hash Chain · CR03 Schema Fingerprint · CR04 Merkle Root · CR05 RFC 3161 Temporal Seal
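A sketch of the CR04 Merkle commitment using SHA-256. The odd-node promotion convention and row encoding are illustrative assumptions, and the Ed25519 signing and RFC 3161 steps are omitted:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """CR04 sketch: binary Merkle tree over row hashes. An odd node is
    promoted to the next level (one of several common conventions)."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        pairs = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            pairs.append(level[-1])
        level = pairs
    return level[0]

rows = [b"1,alice,72.31", b"2,bob,68.94", b"3,carol,81.07"]
root = merkle_root(rows)
assert merkle_root(rows) == root                             # deterministic
assert merkle_root([rows[0], rows[1], b"3,carol,81.08"]) != root
print(root.hex())
```

Any single-cell change flips the root, which is what the signature ultimately seals.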

SYNTHESIS

Bayesian Evidence Accumulation

Ninety-five independent signals must be combined into a single, court-defensible verdict. A naive conjunction of p-values would be dominated by the multiple comparisons problem. ResEthiq uses a Bayesian evidence accumulation framework with Benjamini-Hochberg FDR correction — producing a posterior probability of dataset integrity with calibrated credible intervals.

Methodology

Each fingerprint i produces a test statistic t_i and p-value p_i. The FDR-adjusted p-values q_i (Benjamini-Hochberg) control the expected proportion of false discoveries. These are converted to Bayes factors via the Sellke-Bayarri-Berger calibration and accumulated via Bayes' theorem against a domain-calibrated prior P(integrity | domain). The posterior P(integrity | data, policy) is the final verdict score. Credible intervals are computed via MCMC.

// Step 1: FDR correction (Benjamini-Hochberg 1995)
q(i) = p(i) · m / i, where p(i) are the ordered p-values and m = 95

// Step 2: Bayes factor from p-value (Sellke-Bayarri-Berger)
BFᵢ = -e · pᵢ · ln(pᵢ)  (minimum Bayes factor bound, for pᵢ < 1/e)

// Step 3: Evidence accumulation
log P(integrity | data) = log P(integrity) + Σᵢ log BFᵢ

// Step 4: Posterior probability
P(integrity | data, policy) = sigmoid(log-odds)
Output: posterior ∈ [0, 1] + 95% credible interval + FDR q-value

// Step 5: Binary verdict
APPROVED if P(integrity) > policy.threshold AND all REJECT rules pass
REJECTED otherwise → flagged fingerprints listed with evidence weights
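Steps 1 and 2 can be sketched directly from the formulas above (function names are hypothetical, and the production pipeline's full calibration is not shown):

```python
import math

def bh_qvalues(pvals):
    """Step 1 sketch: Benjamini-Hochberg step-up adjusted q-values,
    with the usual monotonicity enforcement."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):        # walk from largest p to smallest
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q

def min_bayes_factor(p):
    """Step 2 sketch: Sellke-Bayarri-Berger minimum Bayes factor bound,
    valid for p < 1/e."""
    return -math.e * p * math.log(p) if p < 1 / math.e else 1.0

print(bh_qvalues([0.001, 0.02, 0.04, 0.8]))
print(min_bayes_factor(0.05))
```

A p-value of 0.05 maps to a minimum Bayes factor of roughly 0.41 — far weaker evidence than the raw p-value suggests, which is why the accumulation works on Bayes factors rather than p-values.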
References: Benjamini & Hochberg (1995) "Controlling the False Discovery Rate" · Sellke, Bayarri & Berger (2001) "Calibration of p values" · Gelman et al. (2013) "Bayesian Data Analysis"

PRIOR ART

Comparison to prior systems

System | Fingerprints | Categories | Cryptographic seal | Offline verify | Bayesian synthesis | GAN detection | Human fab detection
ResEthiq v1 | 95 | 15 | Yes | Yes | Yes | Yes | Yes
ResEthiq v0 (prior plan) | 12 | 4 | No | No | No | No | No
Great Expectations | ~40 | 3 | No | No | No | No | No
Deepchecks | ~35 | 4 | No | No | No | No | No
Evidently AI | ~25 | 3 | No | No | No | No | No
Academic literature (best) | ~15 | 2 | No | No | No | Partial | Partial
Manual audit (Big 4 firms) | ~8 | 2 | No | No | No | No | No

ARCHITECTURE

Four invariants. Zero ambiguity.

The ResEthiq architecture satisfies four invariants that together guarantee a dataset's integrity certificate is reproducible, portable, and independently verifiable — properties that no prior system has formalised simultaneously.

INVARIANT A — Determinism

Given the same dataset, policy specification, and ResEthiq version, the system must produce an identical Merkle root and identical verdict on any platform, any OS, any hardware. All fingerprint computations are seeded deterministically. No stochastic components without fixed seeds.

INVARIANT B — Canonical Encoding

All values are serialised via a canonical encoding before hashing — floating point values via IEEE 754 round-trip string, categorical values via NFC Unicode normalisation, null values via explicit sentinel bytes. No encoding ambiguity is tolerated.
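A sketch of canonical cell encoding under Invariant B. The tag prefixes and null sentinel bytes here are illustrative, not ResEthiq's actual wire format:

```python
import math
import unicodedata

def canonical_cell(value) -> bytes:
    """Invariant B sketch: one unambiguous byte encoding per value,
    computed before hashing."""
    if value is None:
        return b"\x00null\x00"                 # explicit null sentinel
    if isinstance(value, float):
        if math.isnan(value):
            return b"f:nan"
        text = repr(value)                     # shortest round-trip form
        assert float(text) == value            # IEEE 754 round-trip check
        return b"f:" + text.encode("ascii")
    if isinstance(value, str):
        nfc = unicodedata.normalize("NFC", value)
        return b"s:" + nfc.encode("utf-8")
    raise TypeError(f"no canonical form for {type(value).__name__}")

# 'é' composed and 'e' + combining accent encode (and hash) identically:
assert canonical_cell("\u00e9") == canonical_cell("e\u0301")
print(canonical_cell(0.1), canonical_cell(None))
```

Without NFC normalisation the two spellings of "é" would hash differently and break Invariant A across platforms.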

INVARIANT C — Zero Server Trust

The Signed Policy Object (SPO) is independently verifiable using only the ResEthiq public key and the open-source Verifier CLI. No ResEthiq server, account, or network access is required. An SPO produced today remains verifiable in thirty years.

INVARIANT D — Version Discipline

Every SPO embeds the exact ResEthiq version that produced it. Policy specifications are versioned and immutable once deployed. A dataset certified under policy v2.1 cannot be silently re-evaluated under policy v3.0 — version mismatches are hard errors.

Full methodology paper available on request.
Complete mathematical derivations, benchmark datasets, and reproducibility instructions for all 95 fingerprints.
Request preprint