Machine learning fraud detection uses supervised classifiers (XGBoost, LightGBM), unsupervised anomaly detection (Isolation Forest, autoencoders), and graph neural networks on transaction graphs to score events in under 100 ms — flagging payment, account, and identity fraud with precision–recall AUC well above static rule engines. In 2026 the working stack is Kafka or Pulsar streaming events through a feature store (Feast, Tecton) into a model server, with EU PSD3/PSR rules, the new AMLA authority, and the EU AI Act setting the compliance perimeter.
Fraud is a real-time engineering problem. A user taps “Pay” on a checkout page; somewhere between Stripe, Adyen, Visa, the issuing bank, and a sanctions screening service, ten to thirty different scoring functions run in parallel and a decision lands in roughly 80–250 milliseconds. That decision is increasingly produced by machine learning, not by rules written in 2014 by a risk analyst who has since changed jobs.
This article is a working developer’s guide to ML fraud detection in 2026: which models actually ship, why class imbalance breaks naïve metrics, how graph neural networks moved from research to production at Mastercard and Stripe, and where EU rules — the EU AI Act, PSD3/PSR, and the brand-new AMLA authority — draw the lines.
What is machine learning fraud detection?
Machine learning fraud detection is the practice of training statistical models on labeled or partially labeled transaction data to predict the probability that a given event — a card payment, a bank transfer, a new account, an insurance claim — is fraudulent. The model output is usually a score in [0, 1] or a calibrated probability, fed into a downstream policy that decides whether to approve, decline, or step up to friction (3-D Secure, OTP, manual review).
The shift from rules to ML happened because fraud is adversarial and high-dimensional. A static rule like “decline if amount > €500 AND country mismatch” is trivially gamed; an XGBoost model evaluating 800 engineered features over a 30-day rolling window is harder to enumerate. Stripe’s Radar publicly documents using “thousands of features” per request, and Visa reports its AI-based fraud platform prevented an estimated $40 billion in attempted fraud in 2023 across roughly 192 billion transactions (Visa investor release, March 2024).
Where ML actually beats rules — and where it doesn’t
- ML wins: high-volume, high-cardinality decisions where patterns drift weekly — card-not-present (CNP), BNPL approvals, account-takeover detection, marketplace seller risk.
- Rules win: hard regulatory constraints (sanctions screening, KYC threshold checks), explainability-mandatory steps, brand-new attack vectors with zero training data.
- Hybrid is the norm: production fraud stacks chain a rule engine (deterministic blocks), an ML score, and a policy layer that turns score → action. Pure-ML systems exist mostly in pitch decks.
How does the data look? The class-imbalance problem
Card fraud rates in the SEPA area sit near 0.026% of total card payment value according to the most recent ECB Report on Card Fraud (May 2024, covering 2022 data), with €4.3 billion in absolute losses in 2022 across the SEPA area. By transaction count, the fraud-to-legitimate ratio is roughly 1:1,000 to 1:10,000 depending on the channel. That is the central technical fact of the field — and the reason most off-the-shelf ML tutorials produce garbage models when applied directly to fraud data.
With 1 fraud per 1,000 transactions, a model that always predicts “legit” has 99.9% accuracy and an AUC-ROC near 0.5 with very low variance. Worse, a merely mediocre model posts a deceptively high AUC-ROC: because the negative class is enormous, even thousands of false positives barely move the false-positive rate, so the ROC curve looks excellent while precision stays terrible. Use AUC-PR (precision–recall) and explicit cost matrices (false positive cost = friction + lost approval; false negative cost = chargeback + ops review) for any production decision.
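The gap is easy to demonstrate. Below is a minimal sketch on synthetic data (the prevalence and the scoring rule are invented purely for illustration) showing how a mediocre scorer posts a flattering AUC-ROC while AUC-PR exposes it:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 200_000
y = (rng.random(n) < 0.001).astype(int)  # ~1 fraud per 1,000 events

# A mediocre scorer: fraud gets a modest +1.5 shift over background noise
scores = rng.normal(size=n) + 1.5 * y

print(f"AUC-ROC: {roc_auc_score(y, scores):.3f}")            # ~0.86, looks strong
print(f"AUC-PR:  {average_precision_score(y, scores):.3f}")  # ~0.02, the honest number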
Resampling: SMOTE is usually the wrong answer
The reflexive fix — SMOTE, random oversampling, undersampling — almost always degrades performance in production. Synthetic minorities created by interpolation in a high-dimensional categorical feature space generate fraud examples that don’t correspond to real attack patterns, and models trained on them generalize poorly when adversaries shift. Practitioners at large processors typically prefer:
- Cost-sensitive learning: weight the loss function by the asymmetric cost matrix (e.g., XGBoost’s scale_pos_weight).
- Focal loss: down-weights easy negatives so gradient signal concentrates on hard examples — the original Lin et al., 2017 paper for object detection generalizes well to fraud (a minimal sketch follows this list).
- Stratified time-based splits: never random splits — fraud distributions drift, so train/test must respect time order to avoid leakage.
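For concreteness, here is focal loss as a plain NumPy function: a sketch using the gamma = 2, alpha = 0.25 defaults from the Lin et al. paper. Wiring it into a booster as a custom objective takes the gradient and Hessian of this same expression.

import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    # (1 - p_t)^gamma is the modulating factor: near zero for easy examples
    pos = -alpha * (1.0 - p) ** gamma * np.log(p)        # loss on fraud (y=1)
    neg = -(1.0 - alpha) * p ** gamma * np.log(1.0 - p)  # loss on legit (y=0)
    return np.where(y_true == 1, pos, neg).mean()

# An easy negative (p = 0.01) contributes roughly six orders of magnitude
# less loss than a hard negative (p = 0.9); that is the whole point
print(focal_loss(np.array([0]), np.array([0.01])))  # ~7.5e-07
print(focal_loss(np.array([0]), np.array([0.9])))   # ~1.4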
Which models ship in production fraud systems?
The 2026 inventory across publicly documented systems (Stripe, Adyen, Mastercard, Visa, Sift, Feedzai, plus disclosed EU bank stacks) converges on four families:
| Model family | Best for | Production examples | Latency (p95) |
|---|---|---|---|
| Gradient-boosted trees (XGBoost, LightGBM, CatBoost) | Tabular features, supervised, drift-resistant baseline | Stripe Radar core score; most EU bank decision layers | 2–10 ms |
| Autoencoders / Isolation Forest | Unsupervised novelty detection — new fraud rings, no labels yet | AML transaction monitoring, account anomaly detection | 5–20 ms |
| Graph Neural Networks (GraphSAGE, GAT, PinSAGE) | Entity-level: shared devices, IBANs, addresses across accounts | Mastercard Decision Intelligence Pro (2024), Visa Account Attack Intelligence, Stripe link-level features | 20–80 ms (subgraph sampled) |
| Sequence models (LSTM, Transformers) | Spending pattern drift, session-based attack detection | Card issuer behavioral profiling; e-commerce session risk | 10–40 ms |
Why graph neural networks matter in 2026
The breakthrough that moved GNNs from research to production was the realization that fraud is overwhelmingly a second-order signal. A single account looks normal; the same account sharing a device fingerprint, an IP /24, a beneficiary IBAN, and a billing-address postcode with fourteen other accounts that have all charged back in the past 30 days is obvious — but only on the graph.
Mastercard’s Decision Intelligence Pro, launched February 2024, uses generative AI atop a transaction graph of around 125 billion annual transactions, and Mastercard’s own disclosure cites a 20% average uplift in fraud detection versus the prior model. Stripe shipped graph-based features — including a “consumer link” graph identifying repeat-fraud rings across merchants — and discloses on its engineering blog that this single feature class lifted recall by double-digit percentages on certain fraud types.
Practical caveats: GNNs are operationally heavy. You need a real-time graph store (Neptune, Neo4j, or a custom shard), subgraph sampling (because you cannot run message passing over a 200M-node graph at 50 ms p95), and careful handling of leakage — a node’s neighborhood in production cannot include the future.
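A sketch of what “the neighborhood cannot include the future” means mechanically; the adjacency layout, fanout numbers, and IDs below are invented for illustration, not any vendor’s schema:

import random
from collections import defaultdict

# graph[node] = list of (neighbor, edge_timestamp) pairs; a stand-in for a
# real-time graph store such as Neptune or Neo4j
graph = defaultdict(list)
graph["acct_1"] = [("device_9", 1_700_000_000), ("iban_4", 1_800_000_000)]
graph["device_9"] = [("acct_7", 1_690_000_000)]

def sample_subgraph(node, event_ts, fanout=(10, 5), rng=random.Random(42)):
    """2-hop neighborhood, keeping only edges that existed before the event."""
    frontier, visited = [node], {node}
    for k in fanout:                      # one pass per hop
        nxt = []
        for u in frontier:
            past = [v for v, ts in graph[u] if ts < event_ts]  # leakage guard
            for v in rng.sample(past, min(k, len(past))):      # bound the fanout
                if v not in visited:
                    visited.add(v)
                    nxt.append(v)
        frontier = nxt
    return visited

# iban_4 is excluded: that edge only appears after the event being scored
print(sample_subgraph("acct_1", event_ts=1_750_000_000))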
What does a real-time fraud detection pipeline look like?
A simplified but representative architecture used by mid-to-large processors and EU neobanks in 2026:
- Event ingest: a payment authorization request hits the gateway and emits an event onto Kafka (or Pulsar / Kinesis).
- Feature computation: a stream processor (Flink, Spark Structured Streaming) joins the event with a feature store (Feast, Tecton) — historical aggregates like “txn count last 1h”, “merchant velocity 24h”, “geo-IP / device entropy 7d”.
- Model inference: the assembled feature vector is sent to a model server (TF Serving, Triton, ONNX Runtime, or Ray Serve). Multiple models in parallel: tabular GBM + sequence model + GNN subgraph score.
- Decision policy: scores are combined under an explicit cost-sensitive policy. Output is one of {approve, decline, step_up, manual_review}; a minimal sketch of such a policy follows this list.
- Feedback loop: chargebacks (typically arriving 7–60 days later) and analyst dispositions flow back into the training set; a champion–challenger framework retrains weekly or daily.
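A minimal sketch of the policy layer in step 4. The blend weights, thresholds, and the single deterministic rule are illustrative assumptions, not anyone’s production values:

def decide(gbm_score: float, gnn_score: float, on_sanctions_list: bool) -> str:
    if on_sanctions_list:                        # deterministic rules always win
        return "decline"
    score = 0.7 * gbm_score + 0.3 * gnn_score    # blend weights tuned offline
    if score >= 0.90:
        return "decline"
    if score >= 0.60:
        return "manual_review"
    if score >= 0.30:
        return "step_up"                         # e.g., trigger 3-D Secure
    return "approve"

print(decide(0.55, 0.80, on_sanctions_list=False))  # -> "manual_review"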
Minimal Python example: gradient boosting + isolation forest baseline
The example below is the kind of starting baseline you would actually write before adding anything fancier. Synthetic data is for illustration; the structure — time-based split, AUC-PR, cost matrix — is identical to what you would deploy.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, precision_recall_curve
import xgboost as xgb
# df has columns: ts, amount, merchant_cat, geo_mismatch, device_age_days,
# txn_count_1h, txn_count_24h, ..., is_fraud (0/1)
df = pd.read_parquet("transactions_2025.parquet").sort_values("ts")
# Time-based split — never random on fraud data
cutoff = df["ts"].quantile(0.8)
train = df[df["ts"] <= cutoff]
test = df[df["ts"] > cutoff]
features = [c for c in df.columns if c not in ("ts", "is_fraud")]
X_train, y_train = train[features], train["is_fraud"]
X_test, y_test = test[features], test["is_fraud"]
# Cost-sensitive XGBoost — DO NOT SMOTE
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
clf = xgb.XGBClassifier(
    n_estimators=600,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=neg / pos,  # asymmetric cost
    tree_method="hist",
    eval_metric="aucpr",
)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
scores = clf.predict_proba(X_test)[:, 1]
print(f"AUC-PR: {average_precision_score(y_test, scores):.4f}")
# Optional: unsupervised novelty score for fraud not seen in training
iso = IsolationForest(n_estimators=300, contamination="auto", random_state=42)
iso.fit(X_train)
novelty = -iso.score_samples(X_test) # higher = more anomalous
# Combine: weighted blend, calibrated downstream
combined = 0.85 * scores + 0.15 * (novelty - novelty.mean()) / novelty.std()
# Cost-sensitive threshold: FP_cost = €0.50 (friction), FN_cost = €60 (chargeback)
prec, rec, thr = precision_recall_curve(y_test, combined)
fp_cost, fn_cost = 0.50, 60.0
expected_loss = []
for t in thr:
    pred = combined >= t
    fp = ((pred == 1) & (y_test == 0)).sum()
    fn = ((pred == 0) & (y_test == 1)).sum()
    expected_loss.append(fp * fp_cost + fn * fn_cost)
best_t = thr[int(np.argmin(expected_loss))]
print(f"Optimal threshold: {best_t:.4f}")
Three things in this snippet are not negotiable in production: time-based split (the comment is there for a reason — randomly splitting fraud data leaks future information into training), scale_pos_weight instead of resampling, and a threshold chosen against an explicit cost matrix rather than the default 0.5 (in production you would tune that threshold on a held-out validation window, not on the test set as this compressed example does).
What metrics should you actually report?
Approval rate, fraud rate (basis points), and net economic loss — in that order. Pure ML metrics (AUC-PR, F1) tell you whether the model is improving; business metrics tell you whether the system is improving.
- Approval rate = approvals / total auth attempts. The CFO cares about this.
- Fraud rate (bps of GMV) = (chargebacks + confirmed fraud) / total approved volume × 10,000. Card-network programs (Visa VAMP from April 2025, Mastercard Excessive Fraud Merchant) penalize merchants above ~0.9% chargeback ratio.
- FRR (false rejection rate) = good transactions declined / total good. Easy to underestimate the cost — Mastercard’s 2023 customer survey reported that 33% of consumers stop using a card after a single false decline.
- Population stability index (PSI) on input features and on score distributions, monitored daily. PSI > 0.25 typically triggers retraining or investigation.
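PSI is short enough to write from scratch. A minimal sketch (the ten-bucket convention and the 0.25 alert level are industry habits, not a formal standard):

import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """PSI between a training-time (expected) and a live (actual) distribution."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_frac = np.histogram(expected, edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(1)
train_scores = rng.beta(2, 8, 50_000)   # score distribution at training time
live_scores = rng.beta(2, 6, 50_000)    # drifted live distribution
print(f"PSI: {psi(train_scores, live_scores):.3f}")  # above 0.25 would page someone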
Which fraud types is ML used for?
Card-not-present (CNP) payment fraud
The largest single category — roughly 84% of SEPA card fraud value in 2022 according to the ECB. CNP is the home turf of supervised tabular models because labels (chargebacks) eventually arrive, even if delayed.
Authorized push payment (APP) fraud and SEPA Instant
The fastest-growing fraud category in Europe, driven by social engineering and “safe account” scams. The October 2025 SEPA Instant rules under Regulation (EU) 2024/886 made Verification of Payee (VoP) mandatory for all Euro-area PSPs — every credit transfer must pass an IBAN-name match before the customer confirms. ML is used both for VoP fuzzy-match scoring (typos, transliteration) and for behavioral risk scoring on the payer side.
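As a toy illustration of the fuzzy-match half, stdlib difflib already captures the typo case; real VoP matchers add phonetic encoding, transliteration tables, and legal-entity normalization on top:

from difflib import SequenceMatcher

def name_match_score(provided: str, registered: str) -> float:
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(provided), norm(registered)).ratio()

print(name_match_score("Jon Smiht", "John Smith"))    # typo: high but not exact
print(name_match_score("ACME Ltd.", "Acme Limited"))  # legal-form mismatch: lower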
Account opening / synthetic identity
Synthetic-ID fraud — fabricated identities stitched from real and fake PII — is roughly a $40B+ annual loss problem in the US. Detection leans heavily on graph signals: shared phone numbers across accounts, address clustering, device fingerprint reuse. This is where GNNs earn their compute budget.
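The simplest of those signals needs no GNN at all. Here is a toy sketch (field names and IDs invented) of an “accounts per device fingerprint” feature, the kind that usually ships long before any graph model:

from collections import defaultdict

signups = [
    {"account": "a1", "device": "fp_77"},
    {"account": "a2", "device": "fp_77"},
    {"account": "a3", "device": "fp_77"},
    {"account": "a4", "device": "fp_12"},
]

accounts_per_device = defaultdict(set)
for s in signups:
    accounts_per_device[s["device"]].add(s["account"])

# Feature: size of the device-sharing cluster minus self; 0 is normal,
# double digits is a strong fraud-ring / synthetic-identity signal
for s in signups:
    s["device_shared_accounts"] = len(accounts_per_device[s["device"]]) - 1

print([(s["account"], s["device_shared_accounts"]) for s in signups])
# [('a1', 2), ('a2', 2), ('a3', 2), ('a4', 0)]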
Anti-money laundering (AML)
Distinct from “fraud” in regulatory taxonomy but shares the toolchain. The EU AML Authority (AMLA), established under Regulation (EU) 2024/1620, became operational on 1 July 2025 in Frankfurt and will directly supervise selected high-risk obliged entities from 2028. AMLA’s mandate explicitly covers technology used for transaction monitoring — meaning ML AML systems are now subject to direct EU-level supervision, not just national FIUs.
What does the EU AI Act say about fraud detection?
This is the question I get asked most by EU developers, and the answer is more nuanced than the headlines suggest. The EU AI Act’s Annex III lists “high-risk” use cases. Credit scoring is explicitly Annex III point 5(b). Fraud detection used by financial institutions for their own risk management is, by the Commission’s own guidance, generally not high-risk — it sits outside Annex III scope.
Recital 58 of the AI Act explicitly notes that AI used “for the purpose of detecting fraud in the offering of financial services” should not be classified as high-risk. The reasoning: these systems serve fraud prevention and prudential safety rather than consumer scoring, and deterring their use would ultimately expose consumers to more harm. This carve-out does not extend to credit scoring or insurance risk classification.
That said, the Act’s general-purpose AI obligations, transparency rules (Art. 50), and prohibitions (Art. 5 — including bans on social scoring and certain biometric categorization) still apply to fraud systems where relevant. And if your “fraud” model is also doing credit decisioning under the hood — for example, a BNPL underwriting model styled as “fraud check” — regulators will look through the label and apply Annex III obligations.
Layered on top: PSD3 and the Payment Services Regulation (PSR), currently in trilogue with adoption expected late 2026 and application 18 months later. The PSR proposal includes mandatory transaction-monitoring obligations, expanded liability for APP fraud, and a duty for PSPs to participate in fraud-data sharing schemes. ML systems will be the de-facto compliance technology — but the transparency and explainability requirements will tighten.
What about explainability and adversarial robustness?
Two practical concerns dominate post-deployment engineering work.
Explainability: when you decline a customer’s transaction, EU consumer law (and increasingly PSR Article 83 in the proposed text) requires a meaningful reason. SHAP values on tree models give per-decision feature attributions — workable for individual reason codes. But Bussmann et al. (2021) showed that different surrogate explainers can produce contradictory attributions for the same model output, which becomes a problem when a customer disputes a decision. Best practice in 2026: bake the reason-code mapping into a fixed lookup over feature buckets, validate it offline against SHAP medians, and use SHAP only as a monitoring sanity check.
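A sketch of that pattern, reusing clf, X_test, and features from the baseline example above. The bucket edges and reason-code strings are invented, and the shap package is an assumed dependency:

import numpy as np
import shap  # assumed installed

REASON_CODES = {
    "velocity": "R01: unusual transaction velocity",
    "geo": "R02: location inconsistent with history",
    "device": "R03: new or unrecognized device",
}

def reason_for(row) -> str:
    # Deterministic buckets, so the customer-facing reason never flips
    # between model versions the way raw SHAP attributions can
    if row["txn_count_1h"] >= 10:        # bucket edge is invented
        return REASON_CODES["velocity"]
    if row["geo_mismatch"] == 1:
        return REASON_CODES["geo"]
    if row["device_age_days"] <= 1:      # bucket edge is invented
        return REASON_CODES["device"]
    return "R99: combination of risk factors"

print(reason_for(X_test.iloc[0]))

# Offline validation: median |SHAP| per feature should broadly agree with
# how often each feature drives the lookup above
shap_vals = shap.TreeExplainer(clf).shap_values(X_test)
print(dict(zip(features, np.median(np.abs(shap_vals), axis=0))))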
Adversarial robustness: fraud models live in an adversarial setting. Attackers probe the system continuously and adapt. The single most underrated practice is champion–challenger with shadow scoring — every production decision is also scored by a challenger model whose decisions are logged but not enforced. When the challenger’s policy outperforms the champion on the cost function over a fixed window, you promote it. This is how processors keep up with monthly drift without committing to a new model in a single switch.
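In code, the promotion check itself is almost trivial. A sketch, where the costs and the 5% promotion margin are illustrative assumptions:

import numpy as np

FP_COST, FN_COST = 0.50, 60.0  # same illustrative cost matrix as earlier

def realized_cost(decisions, outcomes):
    """decisions: 1 = declined; outcomes: 1 = confirmed fraud (labels arrive late)."""
    fp = ((decisions == 1) & (outcomes == 0)).sum()  # good customer declined
    fn = ((decisions == 0) & (outcomes == 1)).sum()  # fraud approved
    return fp * FP_COST + fn * FN_COST

def should_promote(champ_dec, chall_dec, outcomes, margin=0.05):
    # Challenger scored every event in shadow; only the champion was enforced.
    # Promote when the challenger wins the cost function by a clear margin.
    return realized_cost(chall_dec, outcomes) < (1 - margin) * realized_cost(champ_dec, outcomes)

champ = np.array([1, 0, 0, 1])
chall = np.array([1, 0, 1, 1])
truth = np.array([1, 0, 1, 0])
print(should_promote(champ, chall, truth))  # True: challenger caught the missed fraud

One wrinkle the sketch ignores: declined transactions never generate chargeback labels, so shadow comparisons inherit a selective-label bias that real deployments must correct for.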
Personal note: what fraud feels like from the trader side
I’ve been trading CFDs since 2024 — Plus500, gold and indices, SMC methodology — and the fraud system you interact with as a retail trader is mostly account-level rather than transaction-level. KYC re-verification triggers, withdrawal holds when the destination IBAN doesn’t match a previous deposit, sudden 3-D Secure step-ups when a deposit comes from a different IP class than usual. These are all features of an ML behavioral model that has decided your session deviates from your historical baseline. As an end user it is sometimes infuriating; from the broker’s compliance side it is what keeps Plus500 and similar regulated under ESMA leverage rules instead of being shut down for laundering exposure.
The lesson worth carrying back: every false-positive friction event is a real customer cost. The cost matrix in the example code earlier — €0.50 per false positive — understates the true cost on most consumer-facing flows. For a high-LTV trading account, the implicit cost of one wrong decline can run into hundreds of euros of churn. Tune accordingly.
What does this mean in practice?
- Start with a cost-sensitive XGBoost on tabular features with rolling-window aggregates. This is the boring 80%-of-the-value baseline most production fraud systems still rely on.
- Add an unsupervised layer (Isolation Forest or a small autoencoder) for novelty — fraud rings that emerge before any labels arrive.
- Invest in a feature store and time-correct training pipelines before investing in fancier models. Most fraud systems fail because of leakage and drift, not because the model class was too weak.
- Graph neural networks pay off only once your transaction graph is clean and your latency budget allows ~50 ms of subgraph sampling. For most teams in 2026 that is a year-two project, not week-one.
- Treat every metric in business units — basis points of fraud, declined-customer churn, net loss avoided — and have an explicit cost matrix written down before you tune thresholds.
- For EU deployments: the AI Act fraud carve-out is real, but PSR, AMLA, and PSD3 transparency obligations will tighten the window. Build explainability in from day one.
FAQ
Is XGBoost still the best model for fraud detection in 2026?
For tabular features, gradient-boosted trees (XGBoost, LightGBM, CatBoost) remain the strongest baseline and the most common production model — they are fast, robust to missing values, and easy to explain with SHAP. The state of the art now layers GNNs on top for entity-level signals, but pure tabular GBMs still drive the majority of in-the-loop scoring at processors like Stripe and inside most EU banks.
Should I use SMOTE on imbalanced fraud datasets?
Generally no. SMOTE creates synthetic minority samples by interpolating in feature space, which on high-dimensional fraud data produces unrealistic examples and degrades production performance. Use cost-sensitive learning (scale_pos_weight), focal loss, or class weights instead, and evaluate on AUC-PR with an explicit cost matrix.
Is fraud detection considered high-risk under the EU AI Act?
Generally no. Recital 58 of the EU AI Act explicitly excludes AI used for fraud detection in financial services from the high-risk classification, distinguishing it from credit scoring (which is high-risk under Annex III point 5(b)). However, transparency, prohibited-practice, and general-purpose model rules still apply, and a system labeled “fraud” that effectively does credit decisioning will be reclassified by regulators.
What is the difference between AML and fraud detection in ML terms?
Technically the toolchains overlap — both use anomaly detection, graph analytics, and supervised classifiers — but the regulatory framing differs. Fraud detection sits under PSD2/PSD3 and consumer-protection law; AML is a separate regulatory regime under the EU AMLD/AMLR framework, now centrally supervised by AMLA in Frankfurt since 1 July 2025. AML systems must produce SARs (Suspicious Activity Reports) and meet specific record-keeping and explainability obligations that consumer fraud systems do not.
How fast does a fraud model need to score in production?
For card payments, end-to-end authorization budget is typically 200–500 ms, of which the ML scoring layer must consume under 100 ms p99 (often under 30 ms p50). For SEPA Instant credit transfers under Regulation 2024/886, the entire transfer must complete in 10 seconds, leaving room for richer scoring including subgraph-sampled GNNs.
What is a graph neural network and why does it help with fraud?
A graph neural network learns node and edge representations on a graph by passing messages between connected nodes. For fraud, the graph nodes are accounts, devices, IBANs, and addresses; edges are shared usage. GNNs can detect coordinated fraud rings — e.g., dozens of accounts sharing a single device fingerprint — that single-node tabular models miss. Mastercard, Visa, and Stripe have all publicly described production GNN-based features as of 2024–2025.
Do I need labeled fraud data to start?
Not strictly — unsupervised methods (Isolation Forest, autoencoders, clustering) can produce useful anomaly scores from unlabeled transactions. But supervised models always outperform once labels start arriving, even with a few thousand confirmed cases, so the practical path is to deploy unsupervised first, instrument analyst-disposition feedback, and transition to a supervised primary model after 2–4 months of operation.
Bibliography and further reading
- European Central Bank. Report on Card Fraud — Eighth Report (May 2024). ecb.europa.eu
- European Parliament & Council. Regulation (EU) 2024/1689 — Artificial Intelligence Act. eur-lex.europa.eu — Recital 58 (fraud-detection carve-out), Annex III (credit scoring as high-risk).
- European Parliament & Council. Regulation (EU) 2024/1620 establishing the Authority for Anti-Money Laundering and Countering the Financing of Terrorism (AMLA). eur-lex.europa.eu
- European Parliament & Council. Regulation (EU) 2024/886 on instant credit transfers in euro. eur-lex.europa.eu — VoP requirement effective 9 October 2025.
- European Commission. Proposal for a Payment Services Regulation (PSR) and Payment Services Directive 3 (PSD3), COM(2023) 367 / 368. finance.ec.europa.eu
- Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. arXiv:1708.02002. arxiv.org/abs/1708.02002
- Hamilton, W., Ying, R., & Leskovec, J. (2017). Inductive Representation Learning on Large Graphs (GraphSAGE). NeurIPS. arxiv.org/abs/1706.02216
- Bussmann, N., Giudici, P., Marinelli, D., & Papenbrock, J. (2021). Explainable Machine Learning in Credit Risk Management. Computational Economics, 57, 203–216.
- Mastercard (2024). Mastercard supercharges AI tools with Decision Intelligence Pro, press release, 1 February 2024. mastercard.com
- Visa Inc. (2024). Visa Spent $10 Billion on Technology to Combat Fraud, Reduce Scams, investor release, 12 March 2024. investor.visa.com
- Stripe. Radar — How it works. stripe.com/radar
- Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions (SHAP). NeurIPS. arxiv.org/abs/1705.07874
- FATF. Opportunities and Challenges of New Technologies for AML/CFT (2021, updated 2024). fatf-gafi.org