How Accurate Are AI Headshots? A 2026 Likeness Benchmark

Direct answer: We measured AI headshot likeness by scoring each generated image against held-out real photos of the same person using ArcFace cosine similarity, on a 0–1 scale. HeadshotMax (our dual-lock pipeline) scored 0.913 mean / 0.909 worst-decile, beating HeadshotPro (0.726 / 0.706), BetterPic (0.782 / 0.768), Aragon (0.819 / 0.810), Secta (0.760 / 0.746), and TryItOnAI (0.671 / 0.652). The studio-photo ceiling is 1.000.

Why this number is the one that matters

A polished headshot you can't use because it isn't you is a 100% failure. Likeness — not gloss, not lighting, not "wow factor" — decides whether the photo ships to LinkedIn or gets deleted. Most reviews score AI headshot tools on aesthetic feel. We score on whether the face matches yours.

Method (so the result is honest)

Different model than the generator. Scoring uses ArcFace (InsightFace buffalo_l); none of the tested pipelines use ArcFace as their identity loss, so no teaching-to-the-test.
Held-out real photos. The reference set is photos of the subject the generator never saw — not the training selfies.
Calibrated. Upper bound = a real studio shot of the same person (1.000 by construction). Lower bound = a random different person (~0.0–0.2 floor).
Per-image, not per-pack. Each of the 96 outputs per tool is scored individually, then we report mean and p10 (worst-decile).
Same input, same styles. Every tool gets the same single selfie and is asked for the same four canonical styles (Formal Corporate, LinkedIn Friendly, Executive Boardroom, Editorial B&W).
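
The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the code in benchmark/run_benchmark.py: it assumes the face embeddings have already been extracted by an ArcFace model, and the function names `cosine_similarity` and `score_against_references`, along with the choice to average over all held-out references, are hypothetical. Raw cosine similarity ranges from -1 to 1; how the benchmark calibrates it onto the 0–1 scale is not specified here, so that step is left out.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def score_against_references(
    generated: Sequence[float],
    references: Sequence[Sequence[float]],
) -> float:
    """Score one generated image against the held-out reference photos.

    Assumption: the per-image score is the plain average of the cosine
    similarity to each reference embedding.
    """
    if not references:
        raise ValueError("at least one reference embedding is required")
    sims = [cosine_similarity(generated, ref) for ref in references]
    return sum(sims) / len(sims)


# Toy 3-dimensional vectors for illustration only; real ArcFace
# embeddings are much higher-dimensional.
toy_generated = [1.0, 0.0, 0.0]
toy_references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_score = score_against_references(toy_generated, toy_references)
```

In a real run, each of the 96 outputs per tool would be passed through a function like this once, producing the per-image scores that the summary metrics are computed from.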

What we measured

Metric	What it tells you
`id_sim_mean`	How close the typical generated photo is to you.
`id_sim_p10`	The 10th-percentile score: the worst 10% of images score at or below it, which makes it the "this isn't me" tail. The number that decides if a pack is usable.
`usable_rate`	% passing identity + quality + attribute guards (skin tone ΔE, teeth coherence, face shape). (Pipeline live for HeadshotMax; baselines forthcoming.)
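
As a rough illustration of how the first two metrics could be derived from per-image scores, here is a short sketch. It assumes `id_sim_p10` is the 10th percentile computed with linear interpolation between ranks; the harness may use a different percentile convention. The helper names `percentile` and `summarize` and the sample scores are hypothetical.

```python
import math
from statistics import fmean
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100), linearly interpolated between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return ordered[lo]
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Collapse per-image identity scores into the two headline metrics."""
    return {
        "id_sim_mean": fmean(scores),
        "id_sim_p10": percentile(scores, 10),
    }


# Made-up scores for illustration only, not benchmark data.
sample_scores = [i / 10 for i in range(11)]
summary = summarize(sample_scores)
```

Reporting both numbers matters because two packs with the same mean can have very different tails; the percentile is what exposes that.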

Result

Pipeline	Images	id_sim_mean	id_sim_p10	Gap to ceiling
Studio photo (ceiling)	16	1.000	1.000	0.000
HeadshotMax (dual-lock: LoRA + ID adapter)	96	0.913	0.909	0.087
Aragon	96	0.819	0.810	0.181
BetterPic	96	0.782	0.768	0.218
Secta	96	0.760	0.746	0.240
HeadshotPro	96	0.726	0.706	0.274
TryItOnAI	96	0.671	0.652	0.329

Source: benchmark/out/scorecard.csv. ArcFace cosine, held-out reference. n=96 generated images per tool from one selfie input × four canonical styles × 24 prompt variations.

What the p10 column tells you

The mean is interesting. The p10 is the one to read. If a tool averages 0.78 but its worst decile sits at 0.65, roughly one in ten of your photos looks like a stranger, which is a deal-breaker when you only get a fixed-size pack and can't easily re-roll. HeadshotMax's gap between mean and p10 (0.004) is tighter than every competitor's (0.009 to 0.020), because the QC gate auto-rejects the bad tail before you see it.
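
The mean-to-p10 gap can be recomputed directly from the Result table. The sketch below copies the published means and p10 values and subtracts them; the names `RESULTS` and `tail_spread` are illustrative and not part of the harness.

```python
# (id_sim_mean, id_sim_p10) pairs copied from the Result table.
RESULTS = {
    "HeadshotMax": (0.913, 0.909),
    "Aragon": (0.819, 0.810),
    "BetterPic": (0.782, 0.768),
    "Secta": (0.760, 0.746),
    "HeadshotPro": (0.726, 0.706),
    "TryItOnAI": (0.671, 0.652),
}


def tail_spread(mean: float, p10: float) -> float:
    """How far the worst decile sits below the mean, to 3 decimals."""
    return round(mean - p10, 3)


spreads = {name: tail_spread(m, p) for name, (m, p) in RESULTS.items()}

# Print tools from tightest to widest spread.
for name, spread in sorted(spreads.items(), key=lambda kv: kv[1]):
    print(f"{name:<12} {spread:.3f}")
```

A smaller spread means the worst images in a pack stay close to the typical one, which is the property a fixed-size pack depends on.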

Two non-obvious findings

Mean ≠ p10. The leading aesthetic-focused tools (BetterPic, Aragon) look good in cherry-picked reviews but ship a heavier failure tail. Their p10 sits 0.014 (BetterPic) and 0.009 (Aragon) below their mean. HeadshotMax's p10 is 0.004 below its mean because the gate culls.
More photos in ≠ better likeness out. HeadshotPro and similar tools take 10–20 selfies and score below tools that take one. The bottleneck isn't training data — it's pipeline architecture (single LoRA drifts to "average professional face"; identity adapter pulls it back).

How to verify (we publish the code)

The benchmark harness lives at benchmark/run_benchmark.py in the [HeadshotProMax repo]. Anyone can re-run with their own subject and reference set. The scorecard CSV is committed at benchmark/out/scorecard.csv and updated every release.

FAQ

Why ArcFace and not a perceptual score (LPIPS, CLIP)?

Perceptual scores reward visual similarity to a class ("looks like a corporate headshot"), not similarity to a specific person. ArcFace is trained on face identity verification — exactly the task we care about.

Why not test more tools?

We tested HeadshotPro, four of its most-cited alternatives, and our own pipeline. PRs to add tools are welcome; the harness is open.

Could the test be biased toward HeadshotMax?

We don't use ArcFace as a loss in our pipeline (we use a different identity loss), so it's not teaching-to-the-test. The reference photos are held out from training. The same selfie is the only input to every tool. We publish the per-image scores, not just the aggregate.

What about styles? Does it look professional?

This benchmark measures likeness, not style. Style is a subjective judgment; likeness is a number. We picked likeness because it's the failure mode users complain about most in reviews.

Last updated 2026-06-04. The benchmark is re-run at every major model release.
