Scientific Benchmarks
Most fraud detection vendors won't publish these numbers. We do.
Transparent, peer-reviewable methodology with 95% confidence intervals, adversarial red-team testing against real-world evasion tools, and every limitation explicitly disclosed. All benchmark code is open-source.
Benchmark Dashboard
Illustrative performance metrics from controlled test environments:
- Detection rate: true positive rate on the labeled dataset (n=50,000)
- False positive rate: legitimate users incorrectly flagged
- Decision latency: edge runtime, Vercel PoP, controlled load
Detection by Attack Category
Controlled test environments with adversarial red-team scenarios
Latency Pipeline Breakdown
End-to-end timing from signal collection to decision delivery:
1. Signal collection: canvas, WebGL, audio, hardware probes
2. Client-side hashing: all fingerprints hashed before transmission
3. Network transit: nearest edge PoP, TLS 1.3
4. Scoring: 26-layer pipeline, deterministic scoring
5. Decision delivery: JSON with evidence trail and decision
Test Framework & Dataset
Benchmarks are conducted against standardized fraud detection test suites derived from anonymized production traffic patterns. Datasets include ground-truth labels for supervised evaluation; each sample is independently verified by at least two human analysts.
Test corpus: 50,000+ labeled sessions spanning bot traffic, credential stuffing, trial abuse, account takeover, and legitimate user flows. Class distribution mirrors real-world attack-to-legitimate ratios (approximately 3:97).
All datasets are version-controlled and immutable once published. Dataset provenance and labeling methodology are documented in the accompanying whitepaper.
Reproducibility & Statistical Rigor
All tests use deterministic seeds and fixed datasets. Results are averaged across 10,000+ iterations with 95% confidence intervals reported. The Bayesian fusion pipeline is fully deterministic — identical inputs produce identical outputs.
Confidence intervals computed via bootstrap resampling (n=1,000 resamples per metric). We report both mean and median values where distributions are non-Gaussian.
Benchmark code is open-source and auditable. Third parties can reproduce results using the published test harness and dataset specifications.
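The bootstrap procedure described above can be sketched as follows. This is a minimal percentile-bootstrap illustration, not the published harness: the resample count matches the stated n=1,000, but the seeded RNG, the metric, and the wiring are illustrative assumptions.

```typescript
// Percentile bootstrap for a 95% confidence interval on a benchmark metric.
// A sketch under stated assumptions; the real test harness may differ.

function seededRandom(seed: number): () => number {
  // Small deterministic LCG so runs are reproducible (fixed-seed policy).
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 0x100000000;
  };
}

function bootstrapCI(
  samples: number[],
  statistic: (xs: number[]) => number,
  resamples = 1000,
  seed = 42,
): { estimate: number; lower: number; upper: number } {
  const rand = seededRandom(seed);
  const stats: number[] = [];
  for (let i = 0; i < resamples; i++) {
    // Resample with replacement, then recompute the metric.
    const resample = samples.map(
      () => samples[Math.floor(rand() * samples.length)],
    );
    stats.push(statistic(resample));
  }
  stats.sort((a, b) => a - b);
  return {
    estimate: statistic(samples),                // point estimate on the full sample
    lower: stats[Math.floor(0.025 * resamples)], // 2.5th percentile
    upper: stats[Math.floor(0.975 * resamples)], // 97.5th percentile
  };
}
```

With detection outcomes encoded as 0/1 and the mean as the statistic, this yields a detection rate together with its 95% interval; a median statistic handles the non-Gaussian cases noted above.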
Runtime Environment
Latency benchmarks run on Vercel Edge Functions (global network, 50+ PoPs) under controlled load conditions. Client-side signal collection measured on Chrome 120+ / Firefox 121+ / Safari 17+ across desktop and mobile form factors.
Server-side decision latency measured at the edge runtime boundary — excludes DNS resolution and TCP handshake. Client-side collection timing includes all probe initialization, execution, and SHA-256 hashing.
Network conditions: benchmarks conducted from multiple geographic regions (US-East, EU-West, APAC) to capture real-world latency distribution. Results aggregated across all PoPs.
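Aggregating per-PoP latency samples into the reported summary statistics can be sketched as below. The percentile method and the sample values are illustrative assumptions, not the production reporting code.

```typescript
// Summarize decision-latency samples (milliseconds) into p50 / p95 / mean.
// A hedged sketch: a simple floor-index percentile, not a specific
// interpolation method the real harness may use.

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

function summarizeLatency(samplesMs: number[]): {
  p50: number;
  p95: number;
  mean: number;
} {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 0.5),
    p95: percentile(sorted, 0.95),
    mean: sorted.reduce((a, b) => a + b, 0) / sorted.length,
  };
}
```

Reporting p95 alongside the mean matters here: edge latency distributions are typically right-skewed, so a mean alone understates tail behavior.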
Adversarial Red-Team Testing
Red-team scenarios include: headless browser evasion (Puppeteer stealth, Playwright, undetected-chromedriver), residential proxy rotation (Bright Data, Oxylabs), browser fingerprint spoofing (Canvas Defender, WebGL fingerprint randomization), and multi-layer attack chains.
Anti-detect browser testing covers Multilogin, GoLogin, Dolphin Anty, and Incogniton profiles with hardware fingerprint randomization enabled.
State-sponsored or novel zero-day evasion techniques are explicitly excluded from these benchmarks. We test against publicly available tools and techniques only — this is a disclosed limitation.
Scoring Pipeline Architecture
The detection engine uses a 26-layer Bayesian fusion pipeline. Each layer evaluates an independent signal dimension and produces a posterior probability estimate. Layer outputs are combined using weighted log-odds fusion with configurable weight capping to mitigate conditional independence violations.
The pipeline is fully deterministic — no machine learning models, no neural networks, no stochastic components. This means: identical inputs always produce identical outputs, decisions are fully explainable down to individual signal contributions, and there is no model drift or retraining requirement.
Signal categories: device fingerprint entropy (canvas, WebGL, audio context, font enumeration), hardware timing analysis (crystal oscillator drift, GPU render timing, memory latency), behavioral biometrics (keystroke entropy, mouse micro-tremor Hurst exponent, click hesitation patterns), network characteristics (TLS JA3/JA4 fingerprinting, IP reputation, ASN analysis), and cross-session correlation (device linkage, velocity patterns, belief state propagation).
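The weighted log-odds fusion with per-layer capping described above can be sketched in miniature. This is not the 26-layer production pipeline: the layer posteriors, weights, prior, and cap value below are illustrative assumptions; only the fusion shape follows the text.

```typescript
// Weighted log-odds fusion with per-layer contribution capping.
// Deterministic and explainable: each layer's capped delta is exactly
// the "signal contribution" an evidence trail could record.

interface LayerOutput {
  name: string;
  posterior: number; // P(fraud | this layer's signal), in (0, 1)
  weight: number;    // relative trust in this layer
}

const logit = (p: number) => Math.log(p / (1 - p));
const sigmoid = (x: number) => 1 / (1 + Math.exp(-x));

function fuseLayers(
  layers: LayerOutput[],
  prior = 0.03,          // ~3:97 attack-to-legitimate base rate from the corpus
  maxContribution = 2.0, // cap on one layer's log-odds shift (assumed value)
): number {
  let logOdds = logit(prior);
  for (const layer of layers) {
    // Each layer contributes its evidence as a weighted log-odds delta...
    const delta = layer.weight * (logit(layer.posterior) - logit(prior));
    // ...capped so one correlated or overconfident signal cannot dominate,
    // mitigating conditional-independence violations.
    logOdds += Math.max(-maxContribution, Math.min(maxContribution, delta));
  }
  return sigmoid(logOdds); // fused posterior P(fraud)
}
```

Because every step is a pure arithmetic function of its inputs, identical inputs always yield identical outputs, which is the determinism property the text claims.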
Compliance & Certifications
- Trust Service Criteria: Security, Availability, Confidentiality
- Privacy by Design (Art. 25), data minimization, right to erasure
- Information Security Management System certification
- Client-side hashing, no raw PII storage, audit-ready evidence trails
Data Governance & Privacy
Data Collected & Stored
- SHA-256 hashed device fingerprints
- Bayesian belief state (α, β parameters)
- Risk scores and decision outcomes
- Aggregated signal statistics
- API request metadata (defined retention period)
Data NOT Collected
- Raw canvas/audio/font data
- Keystroke content or form inputs
- Browsing history or page content
- Personal identifiers in plaintext (emails and IPs are hashed before storage)
- Cookies or local storage contents (the detection engine does not access these)
Privacy by Design: the detection engine operates on derived signals only. Raw sensor data is hashed client-side before any transmission, and all stored fingerprints are one-way hashed, making recovery of the original signal data computationally infeasible.
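The client-side hashing step can be sketched as follows. This uses Node's `crypto` module for a runnable example; in a browser the same digest would come from `crypto.subtle.digest("SHA-256", ...)`. The fingerprint fields are illustrative, not the actual probe set.

```typescript
// Hash a derived fingerprint before it ever leaves the device.
// Only this one-way digest is transmitted; raw values are discarded.

import { createHash } from "node:crypto";

interface RawFingerprint {
  canvasSummary: string;   // derived canvas render summary (hypothetical field)
  webglRenderer: string;   // hypothetical field
  audioContextSum: string; // hypothetical field
}

function hashFingerprint(fp: RawFingerprint): string {
  // Canonical key ordering so identical devices produce identical hashes.
  const canonical = JSON.stringify(fp, Object.keys(fp).sort());
  // One-way SHA-256: the server only ever sees this 64-char hex digest.
  return createHash("sha256").update(canonical).digest("hex");
}
```

The canonicalization step is the detail that matters: without a stable key order, two identical devices could serialize the same signals differently and produce different hashes.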
Methodology Notes & Disclosed Limitations
- Detection accuracy metrics are derived from controlled experiments on labeled datasets (n=50,000+, 95% CI via bootstrap resampling). Real-world performance will vary with traffic composition, attack sophistication, and signal availability across client environments.
- Latency figures represent server-side decision time measured at the edge runtime boundary under controlled load. End-to-end latency (including client-side signal collection) ranges from 145 ms to 265 ms depending on browser, device, and probe configuration.
- Adversarial testing uses publicly available tools and techniques only. State-sponsored actors, novel zero-day evasion methods, or attacks targeting specific hardware configurations may achieve different evasion rates. This is an inherent limitation.
- Bayesian fusion assumes conditional independence between signal layers, an assumption that is frequently violated in practice. We mitigate this through weight capping (a maximum log-odds contribution per layer), layer decorrelation, and posterior calibration checks.
- The false positive rate (0.03%) is measured against a labeled legitimate-user corpus. In production, the effective FPR depends on policy configuration, threshold tuning, and whether shadow mode is enabled during initial deployment.
- All benchmark code is open-source and auditable at github.com/verifystack. We welcome independent reproduction and peer review.
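For readers reproducing the headline metrics, the relationship between the labeled corpus and the reported rates reduces to a confusion matrix. The counts below are illustrative, not the published results.

```typescript
// True positive rate and false positive rate from labeled evaluation counts.

interface Confusion {
  tp: number; // attacks correctly flagged
  fn: number; // attacks missed
  fp: number; // legitimate users incorrectly flagged
  tn: number; // legitimate users correctly passed
}

// TPR = TP / (TP + FN): detection rate over the labeled attack corpus.
const truePositiveRate = (c: Confusion): number => c.tp / (c.tp + c.fn);

// FPR = FP / (FP + TN): flag rate over the labeled legitimate-user corpus.
const falsePositiveRate = (c: Confusion): number => c.fp / (c.fp + c.tn);
```

Note that FPR is defined against the legitimate-user corpus only, which is why an FPR measured offline can differ from the production flag rate once policy thresholds and shadow mode enter the picture.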
Last updated: February 2026 | Benchmark version: 2.1.0 | Methodology revision: 4.2 | Dataset version: 3.0.1