Wires the X-SES-CONFIGURATION-SET header on outbound auth emails so SES bounce/delivery events flow through dev-ses-config to the dev-ses-events SNS topic. Replaces the fire-and-forget "EMAIL MAGIC LINK SENT" log (which fired before the SMTP transaction and swallowed downstream errors) with structured EMAIL_MAGIC_LINK_SUCCESS/_FAILURE logs carrying the Nodemailer messageId, so app-side sends are now correlatable with SES events. Motivated by the Atkins / Sustainable Building UK silent-quarantine incidents where we couldn't tell whether SES had even tried to send. Plan doc at docs/wip/auth-email-code-fallback-plan.md tracks the broader email-code-fallback design that PR 2 will implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Email-auth code-fallback — implementation plan
Status: Ready to implement (2026-05-27)
Author: Khalim (with Claude, via /grill-me)
Trigger: Multiple corporate-domain users (Atkins, Sustainable Building UK,
Arup pre-fix) reporting magic-link emails never arriving. One IT department
confirmed direct sender-policy block.
Problem
Magic-link sign-in
emails are being silently quarantined by some corporate email gateways
(Microsoft 365 Defender / Mimecast / Proofpoint family). The mail reaches the
recipient's MX, is accepted at SMTP, then disappears post-acceptance — never
hits the inbox or junk folder. SES sees a clean Delivery event; the user
sees nothing.
This is invisible to the sender by design. We confirmed it via the Atkins-
adjacent case (Craig Williams, sustainablebuildinguk.com) where the
recipient's IT department reached out, confirmed the block, and manually
allow-listed noreply@domna.homes.
What's been ruled out
- SPF, DKIM are correctly configured (verified via dig)
- SES suppression list does not contain the failing users
- Bounce rate is 0.03%, complaint rate 0.01% — reputation is fine
- Link pre-fetching is already defended by /verify/[token]/page.tsx (two-step consent before token consumption)
What's contributing but not the root cause
- noreply@domna.homes is not a real mailbox → small reputation hit and broken List-Unsubscribe (magic_link.ts:54)
- DMARC is at p=none with no rua= reporting, so there is zero visibility into alignment failures
- .homes is a 2014-era new gTLD with thin aggregate reputation at some receivers (structural, hard to fully mitigate without changing sender domain)
- App's EMAIL_MAGIC_LINK_SENT log fires before the SMTP transaction and isn't wrapped in try/catch, so it can't distinguish "tried" from "sent" (authOptions.ts:86-94)
- SES configuration set exists but the app never sets the X-SES-CONFIGURATION-SET header, so events don't flow through it (magic_link.ts:53-55)
Decisions (grilled and locked)
| # | Decision | Rationale |
|---|---|---|
| D1 | Code-first email auth with link as fast-path fallback (option b2) — both delivered in the same email | Code is more filter-friendly than a long opaque URL; SafeLinks can't break a 6-digit code; cross-device UX is naturally same-device |
| D2 | 6-digit numeric code | Industry standard (AWS/Stripe/GitHub/Anthropic Console); easy to read aloud and type on mobile |
| D3 | 10-minute expiry for both code and link (single artifact) | Code's lower entropy demands shorter window; unifies behaviour |
| D4 | Single DB row per verification: extend verificationToken with code_hash and attempts columns | No new table; both paths consume the same row; deletion on first success enforces single-use |
| D5 | Rate limits: 5 attempts/code, 5 codes/email/hour, 20 verify-attempts/IP/10min — stored in Postgres | Postgres avoids Redis dependency; numbers give attacker ~1-in-40k/hour worst case |
| D6 | NextAuth v4 CredentialsProvider alongside existing EmailProvider | Cleanest extension point; works with existing JWT session strategy at authOptions.ts:269; reuses existing user / signIn callback |
| D7 | Ship in two PRs: observability first, code-flow second behind feature flag | Lets us measure the improvement; isolates risk |
| D8 | Testing: Mailpit + Cypress in CI, mail-tester.com manual on template changes, SES dashboard for prod | Lean: no paid SaaS at current scale |
| D9 | Skip the SES event-consumer Lambda for now; ticket on backlog | Volume doesn't justify it; SES dashboard + suppression list + customer-success channel are enough for forensics today |
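The "~1-in-40k/hour" figure in D5 follows from multiplying the two per-email limits against the size of the code space. A small sketch of that arithmetic, with constant names that are illustrative rather than taken from the codebase:

```typescript
// Illustrative constants mirroring D2 and D5; the names are assumptions.
const CODE_SPACE = 10 ** 6; // 6-digit numeric code: 000000 to 999999
const ATTEMPTS_PER_CODE = 5;
const CODES_PER_EMAIL_PER_HOUR = 5;

// Upper bound on guesses an attacker gets against one email in one hour.
const guessesPerHour = ATTEMPTS_PER_CODE * CODES_PER_EMAIL_PER_HOUR;

// Upper bound on the chance of a hit in that hour, expressed as "1 in N".
const oneIn = CODE_SPACE / guessesPerHour;

console.log(`${guessesPerHour} guesses/hour, about 1 in ${oneIn} per email per hour`);
```

The per-IP limit (20 verify-attempts per 10 minutes) is a separate cap across emails and does not change this per-email bound.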
Out of scope / deferred
- Microsoft work-account OAuth (Azure AD multi-tenant): would skip email entirely for M365 corporate users but locks them to MS SSO; revisit if enterprise customers demand it
- Switching sending domain off .homes: measurable deliverability lift but requires re-warmup and brand thinking; revisit if domna.homes reputation stays a bottleneck after the DMARC + MAIL FROM work
- SES event-consumer Lambda: ticketed for the backlog (see planning board: "Wire up SES event consumer for email auth forensics")
- BIMI / VMC, dedicated SES IP, GlockApps placement testing: overkill at current scale
- Case-sensitivity bug in portfolio invitations (separate issue): user Craig.Williams@… invited with mixed case, then signed in lowercase, creating two user records. The lookup at authOptions.ts:117 normalises but the invitation path doesn't. Track separately; not in this plan.
PR 1 — Observability + logging fixes
Goal: when the next Atkins-shaped incident happens, we know within minutes whether SES tried, whether SES accepted, and whether the recipient SMTP accepted. Today we know none of these.
Risk: near-zero. Additive only.
Effort: ~half a day.
Changes
- Add X-SES-CONFIGURATION-SET header to outbound mail.

  headers: {
    ...(process.env.SES_CONFIGURATION_SET && {
      "X-SES-CONFIGURATION-SET": process.env.SES_CONFIGURATION_SET,
    }),
    "List-Unsubscribe": `<mailto:${provider.from}>`,
  },

  Conditional so CI (Mailpit) doesn't need the env var.

- Capture and return Nodemailer messageId from MagicLinksEmail.

  In magic_link.ts:39-62, change the return type to Promise<{ messageId: string }> and return { messageId: result.messageId } after the result.rejected check.

- Fix the misleading log in authOptions.ts:86-94:

  sendVerificationRequest: async ({ identifier, url, provider }) => {
    try {
      const { messageId } = await MagicLinksEmail({ identifier, url, provider });
      console.log("EMAIL_MAGIC_LINK_SUCCESS", {
        email: identifier,
        messageId,
        timestamp: new Date().toISOString(),
      });
    } catch (err) {
      console.error("EMAIL_MAGIC_LINK_FAILURE", {
        email: identifier,
        error: err instanceof Error ? err.message : String(err),
        timestamp: new Date().toISOString(),
      });
      throw err;
    }
  },

  Now log presence proves "we tried", EMAIL_MAGIC_LINK_SUCCESS proves "SES accepted at SMTP", and messageId correlates to SES events.

- Add SES_CONFIGURATION_SET env var to the deployment platform (Vercel / wherever). Value: dev-ses-config.

- Terraform: add renderingFailure event type and output the cfg-set name. In the SES module:

  resource "aws_ses_event_destination" "sns" {
    # ...
    matching_types = [
      "send",
      "bounce",
      "reject",
      "complaint",
      "delivery",
      "renderingFailure", # ← add
    ]
    # ...
  }

  output "configuration_set_name" {
    value = aws_ses_configuration_set.this.name
  }
Acceptance criteria
- Sending a magic link produces exactly one of: EMAIL_MAGIC_LINK_SUCCESS (with messageId) or EMAIL_MAGIC_LINK_FAILURE (with error). No bare EMAIL MAGIC LINK SENT lines.
- The messageId in app logs matches the mail.messageId SES uses (verify once via SES console for one test send).
- Terraform plan is clean after the two SES module changes.
IT track (parallel, out-of-band)
These happen in parallel to PR 1; no engineer dependency.
- Create noreply@domna.homes as a real shared mailbox in M365. Configure an auto-reply pointing to a monitored support address. Set Reply-To: on outgoing mail in PR 2 once the mailbox exists, OR change EMAIL_FROM to something like accounts@domna.homes and make that monitored. Update EMAIL_FROM env var when ready.

- Update the DMARC TXT record at _dmarc.domna.homes from:

  v=DMARC1; p=none;

  to:

  v=DMARC1; p=none; rua=mailto:<reporting-address>; fo=1; adkim=r; aspf=r; pct=100;

  Recommend Postmark DMARC Monitor (free) for the reporting address: they provide a unique mailbox like <uuid>@inbound-smtp.dmarc.postmarkapp.com and parse the XML into readable reports.

- Progression (after ~4 weeks of clean reports): bump to p=quarantine; pct=10;, then p=quarantine; pct=100;, then p=reject; pct=100;. Don't skip the ramp: HubSpot / Outlook sends need to be confirmed aligned first.
Nice-to-have (do when convenient, not blocking)
- Custom MAIL FROM domain (mail.domna.homes): adds SPF alignment via our domain instead of *.amazonses.com. In Terraform:

  resource "aws_ses_domain_mail_from" "this" {
    domain           = aws_ses_domain_identity.this.domain
    mail_from_domain = "mail.${var.domain_name}"
  }

  Plus the corresponding MX 10 feedback-smtp.eu-west-2.amazonses.com and TXT v=spf1 include:amazonses.com -all at mail.domna.homes. Modest but real deliverability win.
PR 2 — Code fallback flow (behind feature flag)
Goal: every magic-link email now also contains a 6-digit code. The post-submit UX leads with code entry; the link still works as a fast path.
Risk: medium — changes the user-facing auth flow. Feature flag mitigates.
Effort: ~2-3 days.
Schema migration
Add to the existing verificationToken table in
src/app/db/schema/users.ts:
export const verificationTokens = pgTable("verificationToken", {
identifier: text("identifier").notNull(),
token: text("token").notNull(),
expires: timestamp("expires", { mode: "date" }).notNull(),
codeHash: text("code_hash"), // ← new (nullable)
attempts: integer("attempts").notNull().default(0), // ← new
});
Also new table for rate-limit state:
export const authRateLimits = pgTable("authRateLimits", {
scope: text("scope").notNull(), // "email-send" | "ip-verify"
key: text("key").notNull(), // email or IP
count: integer("count").notNull().default(0),
windowStart: timestamp("window_start").notNull(),
}, (t) => ({
pk: primaryKey({ columns: [t.scope, t.key] }),
}));
Drizzle migration is additive — existing rows get code_hash = NULL,
attempts = 0. Pre-deploy magic-link emails (those without codes) continue
to work via the link path; users on the code-entry page will fail and need
to resend. Acceptable transient.
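The authRateLimits table above stores one counter and one window start per (scope, key). The plan does not spell out the windowing algorithm; the sketch below assumes a simple fixed window and models a row in memory so the logic can be tested without Postgres. The function and type names are hypothetical.

```typescript
// In-memory stand-in for one authRateLimits row (count + windowStart).
interface RateLimitRow {
  count: number;
  windowStart: Date;
}

interface RateLimitResult {
  allowed: boolean;
  row: RateLimitRow; // the state the caller would persist
}

// Fixed-window check (an assumption): start a fresh window once the old one
// has elapsed, otherwise allow up to `limit` hits inside the window.
function hitRateLimit(
  row: RateLimitRow | undefined,
  limit: number,
  windowMs: number,
  now: Date = new Date(),
): RateLimitResult {
  if (!row || now.getTime() - row.windowStart.getTime() >= windowMs) {
    return { allowed: true, row: { count: 1, windowStart: now } };
  }
  if (row.count >= limit) {
    return { allowed: false, row };
  }
  return { allowed: true, row: { ...row, count: row.count + 1 } };
}

// Example: the "5 codes/email/hour" limit from D5.
const HOUR_MS = 60 * 60 * 1000;
const start = new Date("2026-01-01T00:00:00Z");
let state: RateLimitRow | undefined;
const results: boolean[] = [];
for (let i = 0; i < 6; i++) {
  const r = hitRateLimit(state, 5, HOUR_MS, start);
  state = r.row;
  results.push(r.allowed);
}
console.log(results);
```

In the real limiter the read-check-write would need to be a single atomic statement (for example an upsert) so that concurrent requests cannot both slip under the limit.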
Code utility — src/lib/auth/code.ts
import crypto from "crypto";
export function generateCode(): string {
return crypto.randomInt(0, 1_000_000).toString().padStart(6, "0");
}
export function hashCode(code: string, secret: string): string {
return crypto.createHash("sha256").update(code + secret).digest("hex");
}
Use crypto.randomInt, not Math.random: the code must come from a CSPRNG.
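The properties the unit tests are meant to cover (six digits including leading zeros, deterministic hashing, dependence on the secret) can be sketched as follows. The two helpers are repeated so the example runs on its own; "test-secret" and "other-secret" are placeholders, not real NEXTAUTH_SECRET values.

```typescript
import { createHash, randomInt } from "node:crypto";

// Same logic as the utility above, repeated so this sketch is self-contained.
function generateCode(): string {
  // randomInt's upper bound is exclusive, so this yields 0..999999,
  // left-padded so small values such as 42 become "000042".
  return randomInt(0, 1_000_000).toString().padStart(6, "0");
}

function hashCode(code: string, secret: string): string {
  return createHash("sha256").update(code + secret).digest("hex");
}

// Property checks over a batch of generated codes.
const samples = Array.from({ length: 1000 }, () => generateCode());
const allSixDigits = samples.every((c) => /^\d{6}$/.test(c));
const deterministic =
  hashCode("482911", "test-secret") === hashCode("482911", "test-secret");
const secretMatters =
  hashCode("482911", "test-secret") !== hashCode("482911", "other-secret");

console.log({ allSixDigits, deterministic, secretMatters });
```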
sendVerificationRequest update
- Generate code via generateCode()
- Hash with NEXTAUTH_SECRET
- UPDATE verificationToken SET code_hash = ?, attempts = 0 WHERE identifier = ? AND token = ?
- Pass code (plaintext) into MagicLinksEmail
- Existing logging unchanged
Email template update
magic_link.ts takes a new code argument. The HTML and plain-text bodies lead with:
Your sign-in code: 482 911
Enter this at ara.domna.homes/auth/verify-code
Or click here to skip the code step: [Sign in to Ara] →
Code rendered large and monospace; the link is a smaller secondary button.
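Because the template shows the code with a space in the middle ("482 911"), the stored code and the typed code need to agree on a canonical form. A minimal sketch of the two small helpers this implies; both names are hypothetical, and stripping whitespace from user input is an assumption the plan does not state.

```typescript
// Hypothetical helper: renders a 6-digit code as two groups of three.
function formatCodeForDisplay(code: string): string {
  if (!/^\d{6}$/.test(code)) {
    throw new Error("expected a 6-digit numeric code");
  }
  return `${code.slice(0, 3)} ${code.slice(3)}`;
}

// Hypothetical helper: removes whitespace so a pasted "482 911" hashes
// to the same value as the stored "482911".
function normaliseCodeInput(input: string): string {
  return input.replace(/\s+/g, "");
}

console.log(formatCodeForDisplay("482911"));
console.log(normaliseCodeInput(" 482 911 "));
```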
NextAuth providers — add CredentialsProvider
In authOptions.ts:
import CredentialsProvider from "next-auth/providers/credentials";
CredentialsProvider({
id: "email-code",
name: "Email Code",
credentials: {
email: { type: "text" },
code: { type: "text" },
},
async authorize(credentials) {
if (!credentials?.email || !credentials?.code) return null;
const email = credentials.email.toLowerCase();
const hashed = hashCode(credentials.code, process.env.NEXTAUTH_SECRET!);
// Look up the row by (identifier, code_hash, not-expired)
// If found: check attempts < 5, delete row, return user
// If found but attempts >= 5: delete row, return null
// If not found: increment attempts on the row matching identifier (if any), return null
// Rate-limit check before any of this; on exceed, return null silently
},
}),
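The decision rules in the authorize comments can be isolated into a pure function that says what to do with the row, leaving the database calls to the caller. This is a sketch under assumptions: it looks the row up by identifier first and compares hashes in code (rather than querying by code_hash), it assumes at most one live row per identifier, and all type and function names are hypothetical.

```typescript
interface TokenRow {
  identifier: string;
  codeHash: string | null; // null for pre-deploy rows that carry no code
  attempts: number;
  expires: Date;
}

type VerifyOutcome =
  | { ok: true; action: "delete" }
  | { ok: false; action: "delete" | "increment" | "none" };

const MAX_ATTEMPTS = 5; // from D5: 5 attempts per code

function decideCodeVerification(
  row: TokenRow | undefined,
  submittedHash: string,
  now: Date = new Date(),
): VerifyOutcome {
  // No row, or a link-only row: nothing to verify against.
  if (!row || row.codeHash === null) return { ok: false, action: "none" };
  // Expired rows are treated as not found.
  if (row.expires.getTime() <= now.getTime()) return { ok: false, action: "none" };
  // Defensive: a row that is already out of attempts is removed.
  if (row.attempts >= MAX_ATTEMPTS) return { ok: false, action: "delete" };
  // Correct code: consume the row so neither code nor link can be reused.
  if (row.codeHash === submittedHash) return { ok: true, action: "delete" };
  // Wrong code: count it, and delete the row on the fifth miss.
  return row.attempts + 1 >= MAX_ATTEMPTS
    ? { ok: false, action: "delete" }
    : { ok: false, action: "increment" };
}

const sampleRow: TokenRow = {
  identifier: "user@example.com",
  codeHash: "abc",
  attempts: 4,
  expires: new Date("2026-01-01T00:10:00Z"),
};
console.log(decideCodeVerification(sampleRow, "wrong", new Date("2026-01-01T00:05:00Z")));
```

With these rules the fifth wrong attempt deletes the row, so a sixth attempt with the correct code finds nothing and fails.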
Verify the signIn callback at
authOptions.ts:114-206
handles account.type === "credentials" cleanly (it should fall through the
OAuth branches at lines 142 and 180, but worth a unit test).
Drop magic-link maxAge
Change authOptions.ts:84
from 60 * 60 to 60 * 10. Code and link now share the 10-min window.
New page — src/app/auth/verify-code/page.tsx
Code-input form. POSTs {email, code} to the NextAuth credentials endpoint via
signIn("email-code", { email, code, redirect: false }). Includes a
"Resend code" button (rate-limited via the new authRateLimits table).
Redirect change
authOptions.ts:103:
change verifyRequest: "/auth/verify-request" to
verifyRequest: "/auth/verify-code?email=<encoded>" so post-submit lands on
the code-entry page with the email pre-filled.
Existing /verify/[token] page
Unchanged behaviour — the link path still works. Optionally add a small "Type the code instead" link to the code-entry page for the rare user who prefers it.
Feature flag
Env var AUTH_CODE_FALLBACK_ENABLED=true|false. When false, skip the code
generation in sendVerificationRequest and redirect to the old
/auth/verify-request page. Lets us canary and roll back via env-var flip.
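The flag check and the resulting post-submit destination can be kept in one small pure function. A minimal sketch, assuming only the exact string "true" enables the flow and that the email is URL-encoded into the query string; both helper names are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: anything other than the exact string "true"
// (missing, "false", a typo) keeps the existing link-only flow.
function isCodeFallbackEnabled(env: Env): boolean {
  return env.AUTH_CODE_FALLBACK_ENABLED === "true";
}

// Hypothetical helper: picks the post-submit page described above.
function verifyRequestPath(env: Env, email: string): string {
  return isCodeFallbackEnabled(env)
    ? `/auth/verify-code?email=${encodeURIComponent(email)}`
    : "/auth/verify-request";
}

console.log(verifyRequestPath({ AUTH_CODE_FALLBACK_ENABLED: "true" }, "a+b@x.com"));
console.log(verifyRequestPath({}, "a+b@x.com"));
```

Defaulting to the old path on any unexpected value means a misconfigured environment fails safe.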
Acceptance criteria
- Submitting email at / lands on /auth/verify-code (when flag enabled)
- The email body contains a clear 6-digit code AND a working link
- Typing the code on the verify-code page signs the user in
- Clicking the link still signs the user in (single artifact: using the link invalidates the code, and vice versa)
- 5 wrong code attempts delete the row; a 6th attempt with the correct code fails
- 6th code request within an hour to the same email is silently no-op'd
- After successful sign-in, behaviour identical to today (lands on /home, user record + lastLogin updated, etc.)
Testing strategy
CI (every PR)
- Vitest for pure-function unit tests (existing pattern; see src/app/email_templates/buildMailHeaders.test.ts added in PR 1).
- Cypress for E2E (existing harness at cypress/e2e/) covering:
  - Code path: submit → poll Mailpit → extract code → enter → land on /home
  - Link path: submit → poll Mailpit → extract URL → visit → confirm → land on /home
  - Wrong code: 5 wrong attempts invalidate; 6th with correct code fails
  - Expired code: code submitted >10 min after generation fails
- Mailpit as a docker-compose service. SMTP at :1025, JSON API at :8025. Cypress task helpers (cy.task) call the Mailpit JSON API to extract codes/links from captured emails.
- Unit tests for generateCode() distribution, hashCode() determinism, rate-limit math (Vitest).
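The "extract code" and "extract URL" steps in the E2E flows come down to parsing the captured plain-text body. A sketch of the parsing half only, assuming the "Your sign-in code: 482 911" wording from the template section and a link whose path follows the existing /verify/[token] page; fetching the message from the Mailpit JSON API is left out, and both function names are hypothetical.

```typescript
// Pulls the sign-in code out of a plain-text body and strips the display space.
function extractCode(text: string): string | null {
  const match = text.match(/sign-in code:\s*(\d{3})\s?(\d{3})/i);
  return match ? match[1] + match[2] : null;
}

// Pulls the first absolute URL containing /verify/<token>.
function extractVerifyLink(text: string): string | null {
  const match = text.match(/https?:\/\/\S+\/verify\/[^\s>\])"']+/);
  return match ? match[0] : null;
}

// Sample body; the token and wording are made up for illustration.
const sampleBody = [
  "Your sign-in code: 482 911",
  "Enter this at ara.domna.homes/auth/verify-code",
  "Or click here to skip the code step: https://ara.domna.homes/verify/abc123",
].join("\n");

console.log(extractCode(sampleBody), extractVerifyLink(sampleBody));
```

A cy.task helper would wrap these around whatever call retrieves the latest message for the test address.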
Pre-release (manual gate on template changes)
- Run the template through mail-tester.com before shipping. Target >9/10. Free, 60 seconds.
Production
- Eyeball the SES Account dashboard weekly: bounce rate, complaint rate, reputation status.
- One CloudWatch alarm: bounce rate >2% sustained → SNS → email the team.
- When investigating an incident, temporarily subscribe an SQS queue or email endpoint to the dev-ses-events SNS topic to capture events for the duration. Unsubscribe after.
When volume grows past ~50 sign-ups/day, pick up the backlog ticket for the SES event-consumer Lambda.
Files touched (summary)
| Phase | File | Change |
|---|---|---|
| PR 1 | src/app/email_templates/magic_link.ts | Add X-SES-CONFIGURATION-SET header; return messageId |
| PR 1 | src/app/api/auth/[...nextauth]/authOptions.ts | Replace sendVerificationRequest logging with try/catch + messageId |
| PR 1 | Terraform modules/ses/ | Add renderingFailure event type + cfg-set name output |
| PR 1 | Deployment env vars | Add SES_CONFIGURATION_SET=dev-ses-config |
| IT | DNS (_dmarc.domna.homes) | Update DMARC TXT value |
| IT | M365 admin | Create noreply@domna.homes shared mailbox |
| PR 2 | src/app/db/schema/users.ts | Add code_hash, attempts columns; new authRateLimits table |
| PR 2 | src/lib/auth/code.ts (new) | generateCode(), hashCode() |
| PR 2 | src/app/email_templates/magic_link.ts | Accept code arg; render code prominently in HTML + plaintext |
| PR 2 | src/app/api/auth/[...nextauth]/authOptions.ts | Add CredentialsProvider; generate+persist code in sendVerificationRequest; drop maxAge to 600; change verifyRequest redirect |
| PR 2 | src/app/auth/verify-code/page.tsx (new) | Code-input form + resend button |
| PR 2 | src/lib/auth/rate-limit.ts (new) | Per-email-send, per-code-attempt, per-IP-verify limiters backed by authRateLimits table |
| PR 2 | .github/workflows/test.yml | Mailpit service + Cypress E2E |
| PR 2 | docker-compose.yml | Mailpit local-dev service |
| PR 2 | Deployment env vars | Add AUTH_CODE_FALLBACK_ENABLED flag |
Open questions
None blocking. Possibilities to revisit after PR 2 ships:
- Do enterprise customers (paid relationships, formal SLAs) want Microsoft work-account OAuth as their primary sign-in?
- Does the .homes TLD continue to be a deliverability bottleneck after MAIL FROM + DMARC p=reject are in place? If yes, evaluate moving sends to a .com / .co.uk we own.
- Does the SES event volume justify the Lambda from the backlog ticket?