ATO Device Embeddings

Overview

A controlled empirical study of FastText-based device-attribute embeddings for account-takeover (ATO) anomaly detection. Four pre-specified comparisons run across five independent seeds isolate one positive design choice and three implementation traps that standard aggregate evaluation conceals.

The question

ATO detection works under a hard asymmetry: attackers are rare (roughly 1:100), but each one is high-value. Device fingerprints are an appealing signal because they’re observed passively at login and add no user friction. Embedding-based scoring is doubly appealing because it needs no labels; the signal is anomaly proximity to a per-account behavioral centroid.

But how to embed device attributes for label-free cosine-distance scoring is underspecified. No published work compares per-feature versus concatenated-string strategies on this task, characterizes corpus-construction failure modes, or quantifies how per-user score normalization interacts with realistic class imbalance.

Method

Each comparison is a pre-specified, five-seed experiment on synthetic data with controlled ground truth: which accounts have fleet injection, which features were spoofed, at what imbalance ratio. Mechanism isolation requires that controlled ground truth; production data doesn’t expose it. The C1 direction is then replicated on the public RBA dataset for distributional realism, with the caveat that the RBA ATO population at the chosen temporal split is small (n=9), so the public-dataset evidence is directional rather than strict.

The architecture under test: FastText skip-gram (vector_size=64, window=6, char n-grams 3–6) trained on per-account corpora, scoring each login by cosine distance from the per-account centroid.

The four findings

Mean-pooled per-feature embeddings beat concatenated-string encoding on spoof attacks. +0.131 ROC-AUC (0.868 vs 0.737), with the PR-AUC gap nearly 2× larger (+0.248). Concatenated-string spoof PR-AUC (0.542) is barely above the trivial baseline (0.500), exposing it as essentially non-functional at the task, a severity that ROC-AUC masks. The mechanism: cross-boundary character n-grams in the concatenated string create subword features spanning multiple device attributes, so when one attribute changes, its signal is diluted across positional n-grams shared with unchanged neighbors. Mean-pooling avoids this; one feature changes, 1/6 of the total distance shifts directly. Both encoders converge on novel (all features differ) and fleet (known-device proximity); the gap is concentrated on the case where exactly one feature differs and its signal must not be diluted. Direction holds on the RBA public dataset on 5/5 seeds.

Embedding collapse comes from corpus construction, not training objective. A 2×2 factorial (skip-gram or CBOW, per-account or per-event corpus) shows that per-event corpora collapse within-feature cosine similarity to ~0.99 on every seed and every training objective, while per-account corpora never collapse on either. The mechanism is structural: in a per-event corpus, every sentence is [os, browser, tz, lang, network, screen] in fixed positional order, so tokens of the same feature always occupy the same position with the same neighbors. Their context distributions become structurally identical, and gradient descent converges to identical embeddings. The downstream consequence is invisible to aggregate AUC: under the collapsed configuration, novel and fleet AUC remain ~0.99 while spoof AUC degrades to ~0.38. The within/cross-feature cosine similarity ratio is proposed as an operational pre-deployment health check; a ratio approaching 1.0 signals impending collapse before model retraining commits the damage.

The standard known-device gate produces zero true positives on the fleet population it’s designed to catch. Zero top-1% true positives on fleet residuals (cold-start fleet accounts) on all 5 seeds. Raw centroid-cosine scoring recovers detection at 0.491 top-1% precision on the same events. The mechanism is structural: fleet devices appear in training by construction (simulating unconfirmed prior attack events), so the gate, which fires on any training appearance, suppresses exactly the population it’s meant to detect. The two-stage gate’s ROC-AUC is not merely close to the trivial baseline; it’s identical (delta = 0.000 on all 5 seeds), a logical consequence rather than a noisy near-miss. The cosine signal itself isn’t degraded by training contamination; the centroid averages over ~40 training events, so a single contaminating event shifts it minimally and the anomaly score still reflects bulk-of-history fit. In any deployment where device reuse across accounts is possible (botnets, credential stuffing), a binary known-device gate systematically suppresses the highest-value alerts.

Per-user CDF rank-normalization collapses PR-AUC under realistic imbalance. PR-AUC drops from 0.888 to 0.224 at 1:100 imbalance (a 4× collapse), while ROC-AUC declines only 0.021. The 31× ratio between the two declines means a practitioner monitoring ROC-AUC would not detect the problem. The mechanism is score-margin compression: rank-normalization maps each user’s scores into [0,1] by their own distribution, so ~5% of every user’s benign events receive rank > 0.95 regardless of absolute anomaly score, a structural floor. At 1:100 imbalance, this injects thousands of high-rank benign events into the operating region where attacks should dominate, destroying precision at any useful recall point. ROC marginalizes over the majority class and barely registers the collapse; PR does not. The finding generalizes beyond ATO to any per-entity score normalization under class imbalance, and is dangerous precisely because ROC-AUC is the metric most commonly monitored in production fraud systems.

Common thread

The four findings share a structural property: standard aggregate evaluation conceals each failure mode.

Spoof gap is invisible without attack-subtype-stratified AUC.
Corpus collapse preserves novel and fleet AUC while destroying spoof.
Gate failure is invisible without fleet-population-stratified top-k analysis.
PR-AUC collapse is concealed by ROC-AUC.

A team monitoring overall held-out AUROC would ship a system containing all four. The contributions are concrete design guidance for the lightweight tier of ATO defense: a specific architecture that outperforms the baseline on the hardest attack subtype, three specific choices that break it, and the evaluation strategy needed to tell the difference.

Scope

A controlled PoC study. The primary evaluation is on synthetic data with a closed 30-token vocabulary and controlled ground truth; the RBA replication confirms C1’s direction on independently-structured data but is underpowered for strict encoder ordering. The work does not benchmark against transformer or GNN frontier methods (they target a different deployment tier of offline batch scoring with labeled data), and does not study adversarial drift.

Built with ML Lab

The investigation was run end-to-end inside ML Lab, with hypothesis sharpening, pre-registered verdicts evaluated per seed, adversarial review of the experimental design, and an append-only audit log of each step. The per-seed boolean verdicts, the locked-before-data analysis plan, and the structured comparison protocol aren’t decoration; they’re what the framework requires.