How Accurate Are AI Headshots? A 2026 Likeness Benchmark
Direct answer: We measured AI headshot likeness by scoring each generated image against held-out real photos of the same person using ArcFace cosine similarity, on a 0–1 scale. HeadshotMax (our dual-lock pipeline) scored 0.913 mean / 0.909 worst-decile, beating HeadshotPro (0.726 / 0.706), BetterPic (0.782 / 0.768), Aragon (0.819 / 0.810), Secta (0.760 / 0.746), and TryItOnAI (0.671 / 0.652). The studio-photo ceiling is 1.000.
Why this number is the one that matters
A polished headshot you can't use because it isn't you is a 100% failure. Likeness — not gloss, not lighting, not "wow factor" — decides whether the photo ships to LinkedIn or gets deleted. Most reviews score AI headshot tools on aesthetic feel. We score on whether the face matches yours.
Method (so the result is honest)
- Different model than the generator. Scoring uses ArcFace (InsightFace buffalo_l); none of the tested pipelines use ArcFace as their identity loss, so no teaching-to-the-test.
- Held-out real photos. The reference set is photos of the subject the generator never saw, not the training selfies.
- Calibrated. Upper bound = a real studio shot of the same person (1.000 by construction). Lower bound = a random different person (~0.0–0.2 floor).
- Per-image, not per-pack. Each of the 96 outputs per tool is scored individually, then we report mean and p10 (worst-decile).
- Same input, same styles. Every tool gets the same one selfie and is asked for the same four canonical styles (Formal Corporate, LinkedIn Friendly, Executive Boardroom, Editorial B&W).
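A minimal sketch of the per-image scoring step described above, using only the standard library. It assumes the face embeddings have already been extracted (for example with InsightFace buffalo_l) and are passed in as plain lists of floats. The function names and the choice to average over all held-out references are assumptions of this sketch, not the harness's confirmed behavior. It returns raw cosine similarity and does not reproduce the calibration against the studio-photo ceiling.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_score(
    generated: Sequence[float],
    references: Sequence[Sequence[float]],
) -> float:
    """Score one generated image against held-out reference embeddings.

    Assumption: the score is the mean cosine similarity over all
    references. A real harness might use the max or a centroid instead.
    """
    if not references:
        raise ValueError("at least one reference embedding is required")
    sims = [cosine_similarity(generated, ref) for ref in references]
    return sum(sims) / len(sims)
```

Calling `identity_score(embedding, held_out_embeddings)` once per generated image yields the per-image scores that the mean and p10 are later computed from.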
What we measured
| Metric | What it tells you |
|---|---|
| id_sim_mean | How close the typical generated photo is to you. |
| id_sim_p10 | The 10th-percentile score: one image in ten scores at or below it. This is the "this isn't me" tail, and the number that decides if a pack is usable. |
| usable_rate | Share of images passing the identity, quality, and attribute guards (skin tone ΔE, teeth coherence, face shape). Pipeline live for HeadshotMax; baselines forthcoming. |
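The two identity metrics can be sketched as a small aggregation over per-image scores. This is an illustration, not the harness's code: the function names are hypothetical, and the nearest-rank percentile rule is an assumption, since the exact percentile method behind id_sim_p10 is not stated.

```python
from statistics import fmean
from typing import Dict, Sequence


def p10_nearest_rank(scores: Sequence[float]) -> float:
    """10th percentile by the nearest-rank rule (an assumed convention).

    Returns the smallest score such that at least 10% of all scores
    are at or below it.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # ceil(n / 10) in integer arithmetic, as a 1-based rank
    rank = (len(ordered) + 9) // 10
    return ordered[rank - 1]


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-image identity scores into the two reported metrics."""
    return {
        "id_sim_mean": fmean(scores),
        "id_sim_p10": p10_nearest_rank(scores),
    }


if __name__ == "__main__":
    # Synthetic scores for illustration only, not benchmark data.
    demo_scores = [0.5, 0.5] + [0.9] * 18
    print(summarize(demo_scores))
```

With 96 images per tool, the nearest-rank rule picks the 10th-lowest score, so a handful of bad outputs pulls p10 down even when the mean barely moves.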
Result
| Pipeline | Images | id_sim_mean | id_sim_p10 | Gap to ceiling |
|---|---|---|---|---|
| Studio photo (ceiling) | 16 | 1.000 | 1.000 | 0.000 |
| HeadshotMax (dual-lock: LoRA + ID adapter) | 96 | 0.913 | 0.909 | 0.087 |
| Aragon | 96 | 0.819 | 0.810 | 0.181 |
| BetterPic | 96 | 0.782 | 0.768 | 0.218 |
| Secta | 96 | 0.760 | 0.746 | 0.240 |
| HeadshotPro | 96 | 0.726 | 0.706 | 0.274 |
| TryItOnAI | 96 | 0.671 | 0.652 | 0.329 |
Source: benchmark/out/scorecard.csv. ArcFace cosine, held-out reference. n=96 generated images per tool from one selfie input × four canonical styles × 24 prompt variations.
What the p10 column tells you
The mean is interesting. The p10 is the one to read. If a tool averages 0.78 but its worst decile sits at 0.65, roughly one in ten of your photos looks like a stranger, which is a deal-breaker when you only get a fixed-size pack and can't easily re-roll. HeadshotMax's gap between mean and p10 (0.004) is tighter than every competitor's, because the QC gate auto-rejects the bad tail before you see it.
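The mean-to-p10 spread can be recomputed directly from the Result table. The sketch below hard-codes the published (mean, p10) pairs; the dictionary and function names are illustrative and are not taken from the harness.

```python
from typing import Dict, Tuple

# (id_sim_mean, id_sim_p10) copied from the Result table.
RESULTS: Dict[str, Tuple[float, float]] = {
    "HeadshotMax": (0.913, 0.909),
    "Aragon": (0.819, 0.810),
    "BetterPic": (0.782, 0.768),
    "Secta": (0.760, 0.746),
    "HeadshotPro": (0.726, 0.706),
    "TryItOnAI": (0.671, 0.652),
}


def tail_spread(results: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Mean minus p10 per pipeline, rounded to the table's 3 decimals."""
    return {name: round(mean - p10, 3) for name, (mean, p10) in results.items()}


def gap_to_ceiling(
    results: Dict[str, Tuple[float, float]], ceiling: float = 1.000
) -> Dict[str, float]:
    """Ceiling minus mean per pipeline, rounded to the table's 3 decimals."""
    return {name: round(ceiling - mean, 3) for name, (mean, _) in results.items()}


if __name__ == "__main__":
    spreads = tail_spread(RESULTS)
    # Print pipelines from tightest to widest spread.
    for name, spread in sorted(spreads.items(), key=lambda item: item[1]):
        print(f"{name:12s} spread={spread:.3f}")
```

Sorting by spread puts HeadshotMax first at 0.004 and HeadshotPro last at 0.020.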
Two non-obvious findings
- Mean ≠ p10. The leading aesthetic-focused tools (BetterPic, Aragon) look good in cherry-picked reviews but ship a heavier failure tail: their p10 sits 0.014 and 0.009 below their mean, respectively. HeadshotMax's p10 is 0.004 below its mean because the gate culls.
- More photos in ≠ better likeness out. HeadshotPro and similar tools take 10–20 selfies and score below tools that take one. The bottleneck isn't training data — it's pipeline architecture (single LoRA drifts to "average professional face"; identity adapter pulls it back).
How to verify (we publish the code)
The benchmark harness lives at benchmark/run_benchmark.py in the [HeadshotProMax repo]. Anyone can re-run with their own subject and reference set. The scorecard CSV is committed at benchmark/out/scorecard.csv and updated every release.
FAQ
Why ArcFace and not a perceptual score (LPIPS, CLIP)?
Perceptual scores reward visual similarity to a class ("looks like a corporate headshot"), not similarity to a specific person. ArcFace is trained on face identity verification — exactly the task we care about.
Why not test more tools?
We tested HeadshotPro, four of its most-cited alternatives, and our own pipeline. PRs to add tools are welcome; the harness is open.
Could the test be biased toward HeadshotMax?
We don't use ArcFace as a loss in our pipeline (we use a different identity loss), so it's not teaching-to-the-test. The reference photos are held out from training. The same selfie is the only input to every tool. We publish the per-image scores, not just the aggregate.
What about styles? Does it look professional?
This benchmark measures likeness, not style. Style is a subjective judgment; likeness is a number. We picked likeness because it's the failure mode users complain about most in reviews.
Last updated 2026-06-04. Next run: every major model release.