
The mathematical foundations
of dataset integrity.

Fifteen forensic categories. Ninety-five fingerprints. Five decades of mathematics — signal processing, information theory, topological data analysis, statistical process control — unified into a single cryptographically sealed verdict. No prior system has implemented all 95. This page documents the science behind each one.

95
Forensic fingerprints
across 15 categories
12
Prior world record
(our own earlier work)
95
Statistical tests
already built
23M
Parameter architecture
no GPU required
Abstract

We present ResEthiq, a comprehensive dataset integrity infrastructure for high-stakes AI. The system implements 95 forensic fingerprints across 15 mathematical categories — the largest such collection in any academic, commercial, or open-source system. Fingerprints span frequency-domain analysis, information-theoretic measures, topological data analysis, distributional geometry, temporal dynamics, inter-record geometry, cross-column dependency, precision forensics, missing-data forensics, generative model detection, human fabrication detection, domain physical constraints, graph-theoretic properties, statistical process control, and cryptographic integrity. All fingerprints feed a Bayesian evidence accumulation framework with Benjamini-Hochberg FDR correction, producing a single posterior probability of dataset integrity. The verdict is cryptographically sealed via a Merkle tree root and Ed25519 signature, producing a Signed Policy Object (SPO) that is independently verifiable by any party holding the public key, offline, without any ResEthiq dependency. We demonstrate that this architecture satisfies four invariants: A (Determinism), B (Canonical Encoding), C (Zero Server Trust), and D (Version Discipline).

Authors: ResEthiq Research Team
Status: Preprint — available on request
Version: v1.4.2 · Trust Kernel
Domain: Data Integrity · AI Safety · Forensic Statistics
CATEGORY 01 / 15
Phase A

Frequency Domain Analysis

Synthetic data generation — particularly GANs and diffusion models — leaves systematic artifacts in the frequency domain that are invisible to the human eye but detectable through spectral analysis. The 1/f power law governs natural signals; deviations from this law indicate non-natural generation processes.

Core Hypothesis

Real-world datasets exhibit 1/f (pink) noise structure in their power spectra. GAN-generated data exhibits characteristic peaks at GAN training frequencies. VAE-generated data shows over-smoothing in high-frequency components. These signatures are detectable via Fourier, Wavelet, and Hilbert-Huang decomposition at sub-column granularity.

// F01: Fourier Power Spectrum — 1/f noise test
S(f) ∝ 1/f^α, where α ≈ 1.0 for natural data
log S(f) = -α · log(f) + C
ResEthiq tests: |α - 1.0| < threshold via OLS on log-log plot

// F04: GAN checkerboard artifact detection
peak_detect(|FFT(X)|²) at f = N/stride
Periodogram peaks at generator stride frequencies → synthetic flag

// F05: Hilbert-Huang EMD — non-stationary decomposition
X(t) = Σₖ IMFₖ(t) + r(t)
Instantaneous frequency distribution tested against empirical naturality priors
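A minimal runnable sketch of the F01 slope test in NumPy. The helper name `spectral_slope`, the low-frequency fit window, and the seed are illustrative assumptions, not ResEthiq's implementation:

```python
import numpy as np

def spectral_slope(x):
    """F01 sketch: estimate alpha in S(f) ~ 1/f**alpha by OLS on the
    log-log periodogram, fitted over low frequencies where the power
    law holds for a discretely sampled signal."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    psd = np.abs(np.fft.rfft(x)) ** 2              # power spectrum S(f)
    freqs = np.fft.rfftfreq(x.size)
    keep = (freqs > 0) & (freqs <= 0.1)            # drop DC, keep scaling range
    slope, _ = np.polyfit(np.log(freqs[keep]), np.log(psd[keep]), 1)
    return -slope                                  # log S = -alpha*log f + C

rng = np.random.default_rng(0)
white = rng.normal(size=4096)                      # alpha near 0
brown = np.cumsum(white)                           # integrated walk: alpha near 2
print(spectral_slope(white), spectral_slope(brown))
```

White noise should fit with α near 0 and an integrated random walk near 2; the production test would compare the fitted α against a domain-calibrated threshold rather than these toy signals.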
Code | Fingerprint | What it catches | Phase
F01 | Fourier Power Spectrum | 1/f noise structure, log-log slope deviation | A
F02 | Wavelet Decomposition | Multi-scale energy distribution anomalies | A
F03 | Spectral Entropy | Randomness concentration in frequency domain | A
F04 | Periodogram Peaks | GAN training frequency artifacts, grid patterns | A
F05 | Hilbert-Huang EMD | Non-linear, non-stationary signal fabrication | A
References: Durall et al. (2019) "Watch your Up-Convolution" · Dzanic et al. (2020) "Fourier Spectrum Discrepancies in Deep Network Generated Images" · Huang et al. (1998) "The empirical mode decomposition"

CATEGORY 02 / 15
Phase A

Information Theory

Information-theoretic measures expose the fundamental complexity structure of a dataset. Real-world data has characteristic entropy gradients across columns and scales. Synthetic data exhibits entropy signatures that are either too uniform (over-regularised generators) or too structured (mode collapse). Human fabrication produces entropy that is distinctly sub-natural — humans cannot generate true randomness.

Core Hypothesis

The joint entropy structure of real tabular data is governed by the underlying data-generating process (physical, biological, economic). Fabricated data — whether machine-generated or human-created — fails to reproduce this structure. Shannon entropy per column, Kolmogorov complexity via compression ratio, and transfer entropy between columns together form a three-layer information-theoretic screen that no fabrication method has been shown to defeat simultaneously.

// I01: Shannon Entropy per column
H(X) = -Σ p(x) · log₂ p(x)
Flag: H(X) deviates > 2σ from domain-calibrated prior

// I02: Kolmogorov Complexity proxy (LZ77 compression)
K(X) ≈ |LZ77(X)| / |X|
Synthetic data: K(X) consistently low (over-smooth generators)
Human fab: K(X) intermediate but with characteristic patterns

// I05: Transfer Entropy (directed information flow)
TE(X→Y) = H(Yₜ | Yₜ₋₁) - H(Yₜ | Yₜ₋₁, Xₜ₋₁)
Inter-column causal structure compared to expected domain graph

// I07: Permutation Entropy
H_perm(X, m) = -Σ p(π) · ln p(π), over all ordinal patterns π of length m
Forbidden patterns test: certain ordinal patterns cannot appear in natural data
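The I01 and I02 screens can be sketched as follows. This is an illustrative stand-in: `zlib`'s DEFLATE (an LZ77 derivative) substitutes for the LZ77 coder named above, and the function names and toy columns are hypothetical:

```python
import math
import random
import zlib
from collections import Counter

def shannon_entropy(values):
    """I01 sketch: Shannon entropy (bits) of a column's value distribution."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def complexity_ratio(values):
    """I02 sketch: Kolmogorov-complexity proxy as a compression ratio;
    DEFLATE stands in for the LZ77 coder."""
    raw = ",".join(map(str, values)).encode()
    return len(zlib.compress(raw, 9)) / len(raw)

rng = random.Random(0)
natural = [round(rng.uniform(0, 100), 2) for _ in range(256)]
fabricated = [10.0, 10.5, 11.0, 11.5] * 64       # copy-increment style entry
print(shannon_entropy(natural), shannon_entropy(fabricated))
print(complexity_ratio(natural), complexity_ratio(fabricated))
```

The fabricated column collapses to 2 bits of entropy and compresses far below the natural one — the "too structured" signature described above.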
Code | Fingerprint | What it catches | Phase
I01 | Shannon Entropy | Too-uniform distributions — over-regularised generators | A
I02 | Kolmogorov Complexity | Compression ratio deviation from natural complexity | A
I03 | Sample Entropy | Over-regular sequences — synthetic temporal data | A
I04 | Approximate Entropy | Predictability measure — human fabrication signature | A
I05 | Transfer Entropy | Directed causal flow breakdown between columns | A
I06 | Multi-scale Entropy | Complexity collapse at coarser time scales | A
I07 | Permutation Entropy | Forbidden ordinal patterns in natural data | A
References: Shannon (1948) "A Mathematical Theory of Communication" · Kolmogorov (1968) · Schreiber (2000) "Measuring Information Transfer" · Bandt & Pompe (2002) "Permutation Entropy"

CATEGORY 03 / 15
Phase A

Geometric / Manifold Analysis

High-dimensional data lies on a low-dimensional manifold whose geometric properties are determined by the underlying data-generating process. Topological Data Analysis (TDA) — specifically persistent homology — provides a summary of this geometry that is invariant to rotation, translation, and smooth deformation. Real datasets have characteristic topological signatures. Synthetic generation or manipulation deforms these signatures.

Core Hypothesis

Persistent homology (Betti numbers across filtration scales) captures the topological fingerprint of a dataset. Real tabular data manifolds exhibit consistent Betti-0 (connected components), Betti-1 (loops), and Betti-2 (voids) profiles across filtration scales. Synthetic generators fail to reproduce these profiles — generative models produce manifolds that are either too smooth (VAEs, diffusion) or fragmented (mode-collapsed GANs).

// G02: Persistent Homology (Betti numbers)
β₀ = connected components, β₁ = independent loops / cycles, β₂ = enclosed voids
Persistence diagram = {(birth, death) : feature born at ε₁, dies at ε₂}
Wasserstein distance between empirical and reference persistence diagrams

// G04: Fractal Dimension (box-counting)
D_f = lim_{ε→0} log N(ε) / log(1/ε)
Natural datasets: D_f consistent with known domain priors
Synthetic: D_f typically integer or near-integer (generator manifold)

// G01: Local Intrinsic Dimensionality variance
LID(x) = -[E(log R(x, k))]⁻¹
High LID variance → data lies on multiple disconnected manifolds
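The G04 box-counting estimator fits in a few lines of NumPy. The scales, helper names, and toy point clouds are illustrative assumptions, not the production estimator:

```python
import numpy as np

def box_counting_dimension(points, scales=(1, 2, 4, 8, 16)):
    """G04 sketch: box-counting estimate of the fractal dimension D_f.
    points: (n, d) array, rescaled here into the unit hypercube."""
    pts = np.asarray(points, dtype=float)
    pts = (pts - pts.min(axis=0)) / (np.ptp(pts, axis=0) + 1e-12)
    counts = []
    for s in scales:
        boxes = np.clip(np.floor(pts * s).astype(int), 0, s - 1)
        counts.append(len({tuple(b) for b in boxes}))
    # D_f = slope of log N(eps) vs log(1/eps), with eps = 1/s
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

rng = np.random.default_rng(1)
diagonal = np.column_stack([np.linspace(0, 1, 2000)] * 2)   # a line: D_f = 1
cloud = rng.random((2000, 2))                               # fills the plane: D_f = 2
print(box_counting_dimension(diagonal), box_counting_dimension(cloud))
```

A one-dimensional manifold embedded in 2D recovers D_f ≈ 1, a space-filling cloud D_f ≈ 2 — the near-integer values that flag generator manifolds.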
Code | Fingerprint | What it catches | Phase
G01 | Local Intrinsic Dimensionality | Multiple disconnected sub-manifolds — mode collapse | A
G02 | Persistent Homology | Topological deformation from natural Betti profile | A
G03 | Manifold Smoothness | Over-smooth neighborhoods from diffusion models | A
G04 | Fractal Dimension | Integer-dimensional generator manifolds | A
G05 | Lacunarity | Texture and gap structure abnormalities | A
G06 | Multifractal Spectrum | Singularity spectrum width — natural vs synthetic | A
G07 | Topological Data Depth | Tukey depth distribution anomalies | A
References: Edelsbrunner & Harer (2010) "Computational Topology" · Carlsson (2009) "Topology and Data" · Chazal et al. (2014) "Stochastic Convergence of Persistence Landscapes"

CATEGORY 04 / 15
Phase B

Distributional Geometry

Covariance structure, copula dependencies, and tail behaviour encode the joint distributional fingerprint of a dataset. The Marchenko-Pastur law governs the eigenspectrum of random covariance matrices — deviations indicate structured dependencies that should not exist in the declared data-generating process. Wasserstein distance provides an optimal-transport metric between empirical and reference distributions.
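The Marchenko-Pastur screen behind D01 can be sketched as follows, using a simple one-factor model as the planted structure (helper names and the factor loading are illustrative assumptions):

```python
import numpy as np

def mp_bulk_edges(n_features, n_samples):
    """Marchenko-Pastur bulk edges for a white-noise correlation
    eigenspectrum, aspect ratio q = n_features / n_samples."""
    q = n_features / n_samples
    return (1 - q ** 0.5) ** 2, (1 + q ** 0.5) ** 2

def excess_structure(X):
    """D01 sketch: count eigenvalues above the MP upper edge —
    hidden correlation that pure noise cannot produce."""
    n_samples, n_features = X.shape
    evals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    _, upper = mp_bulk_edges(n_features, n_samples)
    return int((evals > upper).sum())

rng = np.random.default_rng(7)
noise = rng.normal(size=(2000, 50))        # unstructured baseline
factor = rng.normal(size=(2000, 1))        # one hidden common factor
structured = noise + 0.8 * factor
print(excess_structure(noise), excess_structure(structured))
```

The planted factor pushes one eigenvalue far outside the bulk, while the noise-only spectrum stays essentially inside the Marchenko-Pastur edges.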

Code | Fingerprint | What it catches | Phase
D01 | Covariance Eigenspectrum | Marchenko-Pastur bulk edge deviation — hidden structure | B
D02 | Correlation Frobenius Norm | Deviation from noise floor — unexpected correlation | B
D03 | Copula Structure | Joint distribution beyond marginals — dependency injection | B
D04 | Tail Index (Hill Estimator) | Heavy tail truncation — synthetic generators clip extremes | B
D05 | GEV Fit Quality | Extreme value distribution mismatch | B
D06 | Q-Q Deviation Score | Systematic quantile deviations from reference | B
D07 | Skewness/Kurtosis Surface | 3D moment landscape — synthetic data over-normalised | B
D08 | Wasserstein Distance | Optimal transport distance to reference distribution | B
References: Marchenko & Pastur (1967) · Sklar (1959) "Fonctions de répartition" · Hill (1975) "A Simple General Approach to Inference About the Tail"

CATEGORY 05 / 15
Phase B

Temporal / Sequential Analysis

The Hurst exponent (H) characterises long-range dependence. H = 0.5 indicates a random walk; H ≠ 0.5 indicates persistent (H > 0.5) or anti-persistent (H < 0.5) behaviour. Real-world time-series in regulated domains have well-characterised H values. Synthetic generators frequently produce H values inconsistent with the declared domain. The Lyapunov exponent measures chaos — fabricated sequences often fail to reproduce the characteristic chaos structure of real processes.
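The T01 Hurst estimate can be sketched via classical rescaled-range (R/S) analysis. Window sizes and helper names are assumptions, and the raw R/S estimator is biased upward on short windows (production code would apply an Anis-Lloyd style correction), so treat the numbers as illustrative:

```python
import numpy as np

def hurst_rs(x, windows=(16, 32, 64, 128, 256)):
    """T01 sketch: Hurst exponent as the log-log slope of the
    rescaled range R/S against window size."""
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in windows:
        ratios = []
        for start in range(0, x.size - n + 1, n):
            w = x[start:start + n]
            dev = np.cumsum(w - w.mean())      # cumulative deviation profile
            s = w.std()
            if s > 0:
                ratios.append((dev.max() - dev.min()) / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(ratios)))
    slope, _ = np.polyfit(log_n, log_rs, 1)
    return slope

rng = np.random.default_rng(3)
white = rng.normal(size=8192)                     # uncorrelated: H near 0.5
smoothed = np.convolve(white, np.ones(20) / 20)   # persistent at these scales
print(hurst_rs(white), hurst_rs(smoothed))
```

The smoothed (persistent) series fits a visibly higher H than the uncorrelated one — the kind of mismatch against a domain-expected H that T01 flags.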

Code | Fingerprint | What it catches | Phase
T01 | Hurst Exponent | Long-range dependence mismatch (H = 0.5 uncorrelated; mismatch with domain-expected H is suspect) | B
T02 | Lyapunov Exponent | Chaos measure — fabricated sequences lack natural chaos | B
T03 | Recurrence Quantification | Diagonal line structure in recurrence plot | B
T04 | Detrended Fluctuation Analysis | Scaling behaviour inconsistency across time scales | B
T05 | Symbolic Dynamics | Word frequency deviations in symbolic encoding | B
T06 | Ordinal Pattern Distribution | Forbidden patterns — impossible in natural sequences | B
T07 | Visibility Graph Degree | Network properties of time series — synthetic smoothing | B
T08 | Hjorth Parameters | Activity, mobility, complexity — biomedical signal fabrication | B
References: Hurst (1951) "Long-term storage capacity of reservoirs" · Peng et al. (1994) "Mosaic organisation of DNA nucleotides" · Lacasa et al. (2008) "Visibility graphs"

CATEGORIES 06 – 15 / 15

The remaining ten categories complete the 95-fingerprint matrix, covering inter-record relationships, cross-column dependencies, value precision forensics, missing-data patterns, generative model fingerprints, human fabrication patterns, domain physical constraints, graph-theoretic structure, full statistical process control, and cryptographic integrity.

CATEGORY 06 — Inter-Record Geometry (Phase C · 7 fingerprints)

The Hopkins statistic tests for spatial randomness in k-NN distance distributions. Near-duplicate detection via Jaccard and edit-distance similarity exposes copy-paste fabrication. Mahalanobis distance distribution shape reveals whether multivariate outlier density matches domain priors.

R01 k-NN Distance · R02 Duplicate Detection · R03 Record Spacing · R04 Mahalanobis · R05 UMAP Density · R06 MST Properties · R07 LOF Distribution
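A sketch of the Hopkins-style spatial-randomness test mentioned above. Sample sizes, seeds, and helper names are illustrative assumptions:

```python
import numpy as np

def hopkins(X, m=50, seed=0):
    """Hopkins statistic sketch: near 0.5 for spatially random data,
    near 1.0 for strongly clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sample = X[rng.choice(len(X), size=m, replace=False)]
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))

    def nearest(points, exclude_self=False):
        d = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            d[d == 0] = np.inf         # a sampled point is its own neighbour
        return d.min(axis=1)

    u = nearest(probes).sum()          # probe-to-data NN distances
    w = nearest(sample, exclude_self=True).sum()
    return u / (u + w)

rng = np.random.default_rng(4)
random_pts = rng.uniform(size=(400, 2))
clusters = rng.normal(scale=0.02, size=(400, 2)) + rng.integers(0, 2, size=(400, 1))
print(hopkins(random_pts), hopkins(clusters))
```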
CATEGORY 07 — Cross-Column Dependency (Phase C · 7 fingerprints)

The NMI matrix eigenspectrum reveals the information structure across columns. Granger causality testing exposes unexpected directed causal relationships. Partial Information Decomposition separates synergistic from redundant information — ratios inconsistent with the declared data-generating process indicate manipulation.

X01 NMI Eigenspectrum · X02 Granger Causality · X03 Distance Correlation · X04 Conditional Entropy · X05 Partial Info Decomp · X06 CCA · X07 Copula Tail
CATEGORY 08 — Precision & Representation (Phase A · 7 fingerprints)

Significant figure analysis and round-number density testing operationalise Benford's Law extensions into the precision domain. Real measurement instruments produce characteristic precision distributions. Human fabrication exhibits pile-up at powers of 10, integer values, and "memorable" reference points. Floating point bit pattern analysis detects imputation and synthetic generation at the binary level.

P01 Decimal Precision · P02 Significant Figures · P03 Round Number Density · P04 Float Bit Pattern · P05 Value Vocabulary · P06 String Length Dist · P07 Precision Drift
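The round-number density test (P03) reduces to counting values that land exactly on round anchors. The step size, helper name, and toy columns are illustrative:

```python
def round_number_density(values, step=0.5):
    """P03 sketch: fraction of values sitting exactly on 'round' anchors
    (multiples of step) — human fabrication piles up on these."""
    hits = sum(1 for v in values if abs(v / step - round(v / step)) < 1e-9)
    return hits / len(values)

instrument = [72.31, 68.94, 81.07, 75.62, 69.48, 77.13, 73.85, 70.26]
fabricated = [70.0, 75.0, 72.5, 80.0, 68.5, 75.0, 70.0, 72.5]
print(round_number_density(instrument), round_number_density(fabricated))
```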
CATEGORY 09 — Missing Data Forensics (Phase A · 6 fingerprints)

Little's MCAR test distinguishes Missing Completely At Random from structured absence. Logistic regression of missingness on observed values detects predictable gaps — a strong indicator of selective deletion or imputation. The Imputation Fingerprint detects prior imputation by identifying statistical signatures left by common imputation algorithms (mean, median, KNN, MICE).

M01 Little's MCAR · M02 Missing Pattern Matrix · M03 Logistic Regression · M04 Gap Clustering · M05 Imputation Fingerprint · M06 Boundary Missingness
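The mean-imputation case of M05 can be sketched by exploiting the fact that mean imputation leaves the column mean unchanged, so imputed cells appear as exact repeats of the mean. Helper name and tolerance are illustrative; real imputers such as KNN or MICE need richer signatures:

```python
import statistics

def mean_imputation_spike(column):
    """M05 sketch: fraction of cells that exactly repeat the column mean —
    the pile-up that mean imputation leaves behind."""
    mu = statistics.fmean(column)
    repeats = sum(1 for v in column if abs(v - mu) < 1e-12)
    return repeats / len(column)

observed = [1.0, 2.0, 3.0, 6.0]            # mean = 3.0
imputed = observed + [3.0, 3.0, 3.0]       # three gaps filled with the mean
print(mean_imputation_spike(observed), mean_imputation_spike(imputed))
```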
CATEGORY 10 — Generative Model Fingerprints (Phase A · 7 fingerprints)

GAN mode collapse produces an abnormally small number of unique value clusters relative to dataset size. GAN checkerboard artifacts appear as periodic peaks in the Fourier periodogram at strides matching the generator's upsampling layers. VAE posterior collapse is detected via latent dimension utilisation analysis. Memorisation detection searches for near-exact matches to known public datasets.

V01 Mode Collapse · V02 GAN Checkerboard · V03 VAE Posterior Collapse · V04 Diffusion Smoothness · V05 Memorisation · V06 Latent Geometry · V07 Generator Periodicity
CATEGORY 11 — Human Fabrication Fingerprints (Phase A · 7 fingerprints)

Cognitive psychology has established seven reliable signatures of human data generation. Anchor bias produces clustering around memorable reference points (Tversky & Kahneman 1974). Fatigue patterns manifest as degrading randomness over record sequence — early records are more random than late records. Copy-increment detection identifies value[n] ≈ value[n-1] ± small_delta, a common manual entry shortcut.

H01 Anchor Bias · H02 Avoidance Patterns · H03 Narrative Coherence · H04 Fatigue Pattern · H05 Copy-Increment · H06 Symmetric Bias · H07 Temporal Regularity
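The copy-increment shortcut (H05) can be detected by scanning consecutive deltas. The delta bound, helper name, and toy sequences are illustrative:

```python
def copy_increment_rate(values, max_delta=1.0):
    """H05 sketch: fraction of consecutive records where
    value[n] = value[n-1] +/- a small delta (a manual-entry shortcut)."""
    pairs = list(zip(values, values[1:]))
    hits = sum(1 for a, b in pairs if 0 < abs(b - a) <= max_delta)
    return hits / len(pairs)

organic = [14.2, 87.6, 3.9, 55.1, 41.7, 90.3, 22.8, 67.4]
manual = [50.0, 50.5, 51.0, 51.5, 52.0, 52.5, 53.0, 53.5]
print(copy_increment_rate(organic), copy_increment_rate(manual))
```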
CATEGORY 12 — Domain Physical Constraints (Phase B · 7 fingerprints)

Physical law violations are strong falsification events — a dataset where blood pressure increases monotonically with patient age, without exception, is not a real clinical dataset. Biological plausibility testing maps age-lab value co-occurrence matrices against known physiology. Temporal impossibility detection flags events that violate time ordering. Unit coherence testing detects mixed-unit columns.

C01 Biological Plausibility · C02 Physical Law Violations · C03 Temporal Impossibility · C04 Logical Consistency · C05 Unit Coherence · C06 Reference Range Clustering · C07 Inter-measurement Correlation
CATEGORY 13 — Graph / Network Properties (Phase C · 5 fingerprints)

Re-identification risk quantifies the uniqueness of quasi-identifier combinations — a necessary test for datasets declared to be de-identified. Record linkage graph analysis tests whether records in the dataset can be linked to external public datasets, exposing potential data provenance issues.

N01 Re-identification Risk · N02 Record Linkage Graph · N03 Social Network Fingerprint · N04 Attribute Correlation Graph · N05 Clique Structure
CATEGORY 14 — Statistical Process Control (Phase C · 5 fingerprints)

The full Western Electric rule battery (all 8 rules) and all 10 Nelson patterns are applied to every numeric column. Originally developed for manufacturing quality control in the 1950s, these rules detect systematic patterns that should not appear in random data. Applied to dataset columns, they expose systematic manipulation, data drift, and structural breaks. Hotelling T² extends SPC to multivariate contexts.

Q01 Western Electric (8 rules) · Q02 Nelson (10 patterns) · Q03 EWMA · Q04 Hotelling T² · Q05 CUSUM Battery
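Two of the eight Western Electric rules, sketched from their textbook definitions (the full battery and ResEthiq's exact rule numbering are not shown here; helper names are illustrative):

```python
def western_electric_flags(points, mean, sigma):
    """Q01 sketch of two classic Western Electric rules:
    Rule 1: any single point beyond 3 sigma.
    Rule 4: eight consecutive points on the same side of the mean."""
    rule1 = any(abs(p - mean) > 3 * sigma for p in points)
    rule4 = any(
        all(p > mean for p in points[i:i + 8]) or
        all(p < mean for p in points[i:i + 8])
        for i in range(len(points) - 7)
    )
    return {"rule1_3sigma": rule1, "rule4_run_of_8": rule4}

in_control = [0.2, -1.1, 0.7, -0.4, 1.3, -0.9, 0.1, -0.3, 0.8, -1.5]
drifted = [0.4, 0.9, 0.2, 1.1, 0.6, 0.3, 0.8, 0.5]   # 8 points above the mean
print(western_electric_flags(in_control, 0.0, 1.0))
print(western_electric_flags(drifted, 0.0, 1.0))
```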
CATEGORY 15 — Cryptographic Integrity (Phase A · 5 fingerprints)

Every cell is hashed after canonical encoding. Row hashes are chained sequentially. The Merkle tree root commits to the entire dataset. The Ed25519 signature seals the root along with the policy verdict. RFC 3161 timestamping anchors the sealed object to an external time reference — producing a Signed Policy Object (SPO) that is independently verifiable, offline, by anyone holding the public key.

CR01 Cell Hash · CR02 Row Hash Chain · CR03 Schema Fingerprint · CR04 Merkle Root · CR05 RFC 3161 Temporal Seal
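A sketch of the CR04 Merkle commitment using SHA-256. The odd-node promotion convention and row encoding are illustrative assumptions, and the Ed25519 signing and RFC 3161 steps are omitted:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """CR04 sketch: binary Merkle tree over row hashes. An odd node is
    promoted to the next level (one of several common conventions)."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        pairs = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            pairs.append(level[-1])
        level = pairs
    return level[0]

rows = [b"1,alice,72.31", b"2,bob,68.94", b"3,carol,81.07"]
root = merkle_root(rows)
assert merkle_root(rows) == root                             # deterministic
assert merkle_root([rows[0], rows[1], b"3,carol,81.08"]) != root
print(root.hex())
```

Any single-cell change flips the root, which is what the signature ultimately seals.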

SYNTHESIS

Bayesian Evidence Accumulation

Ninety-five independent signals must be combined into a single, court-defensible verdict. A naive conjunction of p-values would be dominated by the multiple comparisons problem. ResEthiq uses a Bayesian evidence accumulation framework with Benjamini-Hochberg FDR correction — producing a posterior probability of dataset integrity with calibrated credible intervals.

Methodology

Each fingerprint i produces a test statistic t_i and p-value p_i. The FDR-adjusted p-values q_i (Benjamini-Hochberg) control the expected proportion of false discoveries. These are converted to Bayes factors via the Sellke-Bayarri-Berger calibration and accumulated via Bayes' theorem against a domain-calibrated prior P(integrity | domain). The posterior P(integrity | data, policy) is the final verdict score. Credible intervals are computed via MCMC.

// Step 1: FDR correction (Benjamini-Hochberg 1995)
q(i) = p(i) · m / i, where p(i) are the ordered p-values and m = 95

// Step 2: Bayes factor from p-value (Sellke-Bayarri-Berger)
BFᵢ = -e · pᵢ · ln(pᵢ)  (minimum Bayes factor bound, for pᵢ < 1/e)

// Step 3: Evidence accumulation
log P(integrity | data) = log P(integrity) + Σᵢ log BFᵢ

// Step 4: Posterior probability
P(integrity | data, policy) = sigmoid(log-odds)
Output: posterior ∈ [0, 1] + 95% credible interval + FDR q-value

// Step 5: Binary verdict
APPROVED if P(integrity) > policy.threshold AND all REJECT rules pass
REJECTED otherwise → flagged fingerprints listed with evidence weights
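Steps 1 and 2 can be sketched directly from the formulas above (function names are hypothetical, and the production pipeline's full calibration is not shown):

```python
import math

def bh_qvalues(pvals):
    """Step 1 sketch: Benjamini-Hochberg step-up adjusted q-values,
    with the usual monotonicity enforcement."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):        # walk from largest p to smallest
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q

def min_bayes_factor(p):
    """Step 2 sketch: Sellke-Bayarri-Berger minimum Bayes factor bound,
    valid for p < 1/e."""
    return -math.e * p * math.log(p) if p < 1 / math.e else 1.0

print(bh_qvalues([0.001, 0.02, 0.04, 0.8]))
print(min_bayes_factor(0.05))
```

A p-value of 0.05 maps to a minimum Bayes factor of roughly 0.41 — far weaker evidence than the raw p-value suggests, which is why the accumulation works on Bayes factors rather than p-values.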
References: Benjamini & Hochberg (1995) "Controlling the False Discovery Rate" · Sellke, Bayarri & Berger (2001) "Calibration of p values" · Gelman et al. (2013) "Bayesian Data Analysis"

PRIOR ART

Comparison to prior systems

System | Fingerprints | Categories | Cryptographic seal | Offline verify | Bayesian synthesis | GAN detection | Human fab detection
ResEthiq v1 | 95 | 15 | Yes | Yes | Yes | Yes | Yes
ResEthiq v0 (prior plan) | 12 | 4 | No | No | No | No | No
Great Expectations | ~40 | 3 | No | No | No | No | No
Deepchecks | ~35 | 4 | No | No | No | No | No
Evidently AI | ~25 | 3 | No | No | No | No | No
Academic literature (best) | ~15 | 2 | No | No | No | Partial | Partial
Manual audit (Big 4 firms) | ~8 | 2 | No | No | No | No | No

ARCHITECTURE

Four invariants. Zero ambiguity.

The ResEthiq architecture satisfies four invariants that together guarantee a dataset's integrity certificate is reproducible, portable, and independently verifiable — properties that no prior system has formalised simultaneously.

INVARIANT A — Determinism

Given the same dataset, policy specification, and ResEthiq version, the system must produce an identical Merkle root and identical verdict on any platform, any OS, any hardware. All fingerprint computations are seeded deterministically. No stochastic components without fixed seeds.

INVARIANT B — Canonical Encoding

All values are serialised via a canonical encoding before hashing — floating point values via IEEE 754 round-trip string, categorical values via NFC Unicode normalisation, null values via explicit sentinel bytes. No encoding ambiguity is tolerated.
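A sketch of canonical cell encoding under Invariant B. The tag prefixes and null sentinel bytes here are illustrative, not ResEthiq's actual wire format:

```python
import math
import unicodedata

def canonical_cell(value) -> bytes:
    """Invariant B sketch: one unambiguous byte encoding per value,
    computed before hashing."""
    if value is None:
        return b"\x00null\x00"                 # explicit null sentinel
    if isinstance(value, float):
        if math.isnan(value):
            return b"f:nan"
        text = repr(value)                     # shortest round-trip form
        assert float(text) == value            # IEEE 754 round-trip check
        return b"f:" + text.encode("ascii")
    if isinstance(value, str):
        nfc = unicodedata.normalize("NFC", value)
        return b"s:" + nfc.encode("utf-8")
    raise TypeError(f"no canonical form for {type(value).__name__}")

# 'é' composed and 'e' + combining accent encode (and hash) identically:
assert canonical_cell("\u00e9") == canonical_cell("e\u0301")
print(canonical_cell(0.1), canonical_cell(None))
```

Without NFC normalisation the two spellings of "é" would hash differently and break Invariant A across platforms.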

INVARIANT C — Zero Server Trust

The Signed Policy Object (SPO) is independently verifiable using only the ResEthiq public key and the open-source Verifier CLI. No ResEthiq server, account, or network access is required. An SPO produced today remains verifiable in thirty years.

INVARIANT D — Version Discipline

Every SPO embeds the exact ResEthiq version that produced it. Policy specifications are versioned and immutable once deployed. A dataset certified under policy v2.1 cannot be silently re-evaluated under policy v3.0 — version mismatches are hard errors.

Full methodology paper available on request.
Complete mathematical derivations, benchmark datasets, and reproducibility instructions for all 95 fingerprints.
Request preprint