# Email-auth code-fallback — implementation plan **Status:** Ready to implement (2026-05-27) **Author:** Khalim (with Claude, via `/grill-me`) **Trigger:** Multiple corporate-domain users (Atkins, Sustainable Building UK, Arup pre-fix) reporting magic-link emails never arriving. One IT department confirmed direct sender-policy block. ## Problem [Magic-link sign-in](../../src/app/api/auth/[...nextauth]/authOptions.ts#L74-L95) emails are being **silently quarantined** by some corporate email gateways (Microsoft 365 Defender / Mimecast / Proofpoint family). The mail reaches the recipient's MX, is accepted at SMTP, then disappears post-acceptance — never hits the inbox or junk folder. SES sees a clean `Delivery` event; the user sees nothing. This is invisible to the sender by design. We confirmed it via the Atkins- adjacent case (Craig Williams, sustainablebuildinguk.com) where the recipient's IT department reached out, confirmed the block, and manually allow-listed `noreply@domna.homes`. ### What's been ruled out - SPF, DKIM are correctly configured (verified via `dig`) - SES suppression list does not contain the failing users - Bounce rate is 0.03%, complaint rate 0.01% — reputation is fine - Link pre-fetching is already defended by [/verify/[token]/page.tsx](../../src/app/verify/[token]/page.tsx) (two-step consent before token consumption) ### What's contributing but not the root cause - `noreply@domna.homes` is not a real mailbox → small reputation hit and broken `List-Unsubscribe` ([magic_link.ts:54](../../src/app/email_templates/magic_link.ts#L54)) - DMARC is at `p=none` with no `rua=` reporting — zero visibility into alignment failures - `.homes` is a 2014-era new gTLD with thin aggregate reputation at some receivers (structural, hard to fully mitigate without changing sender domain) - App's `EMAIL_MAGIC_LINK_SENT` log fires *before* the SMTP transaction and isn't wrapped in try/catch, so it can't distinguish "tried" from "sent" ([authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94)) - SES configuration set exists but the app never sets the `X-SES-CONFIGURATION-SET` header, so events don't flow through it ([magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55)) ## Decisions (grilled and locked) | # | Decision | Rationale | |---|---|---| | D1 | Code-first email auth with link as fast-path fallback (option b2) — both delivered in the same email | Code is more filter-friendly than a long opaque URL; SafeLinks can't break a 6-digit code; cross-device UX is naturally same-device | | D2 | 6-digit numeric code | Industry standard (AWS/Stripe/GitHub/Anthropic Console); easy to read aloud and type on mobile | | D3 | 10-minute expiry for both code and link (single artifact) | Code's lower entropy demands shorter window; unifies behaviour | | D4 | Single DB row per verification — extend `verificationToken` with `code_hash` and `attempts` columns | No new table; both paths consume the same row; deletion on first success enforces single-use | | D5 | Rate limits: 5 attempts/code, 5 codes/email/hour, 20 verify-attempts/IP/10min — stored in Postgres | Postgres avoids Redis dependency; numbers give attacker ~1-in-40k/hour worst case | | D6 | NextAuth v4 `CredentialsProvider` alongside existing `EmailProvider` | Cleanest extension point; works with existing JWT session strategy at [authOptions.ts:269](../../src/app/api/auth/[...nextauth]/authOptions.ts#L269); reuses existing user / `signIn` callback | | D7 | Ship in two PRs: observability first, code-flow second behind feature flag | Lets us measure the improvement; isolates risk | | D8 | Testing: Mailpit + Playwright in CI, mail-tester.com manual on template changes, SES dashboard for prod | Lean — no paid SaaS at current scale | | D9 | Skip the SES event-consumer Lambda for now; ticket on backlog | Volume doesn't justify it; SES dashboard + suppression list + customer-success channel are enough for forensics today | ## Out of scope / deferred - **Microsoft work-account OAuth** (Azure AD multi-tenant) — would skip email entirely for M365 corporate users but locks them to MS SSO; revisit if enterprise customers demand it - **Switching sending domain off `.homes`** — measurable deliverability lift but requires re-warmup and brand thinking; revisit if `domna.homes` reputation stays a bottleneck after the DMARC + MAIL FROM work - **SES event-consumer Lambda** — ticketed for the backlog (see planning board: "Wire up SES event consumer for email auth forensics") - **BIMI / VMC, dedicated SES IP, GlockApps placement testing** — overkill at current scale - **Case-sensitivity bug in portfolio invitations** (separate issue) — user `Craig.Williams@…` invited with mixed case, then signed in lowercase, creating two user records. The lookup at [authOptions.ts:117](../../src/app/api/auth/[...nextauth]/authOptions.ts#L117) normalises but the *invitation* path doesn't. Track separately — not in this plan. --- # PR 1 — Observability + logging fixes **Goal:** when the next Atkins-shaped incident happens, we know within minutes whether SES tried, whether SES accepted, and whether the recipient SMTP accepted. Today we know none of these. **Risk:** near-zero. Additive only. **Effort:** ~half a day. ### Changes 1. **Add `X-SES-CONFIGURATION-SET` header** to outbound mail. In [magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55): ```ts headers: { ...(process.env.SES_CONFIGURATION_SET && { "X-SES-CONFIGURATION-SET": process.env.SES_CONFIGURATION_SET, }), "List-Unsubscribe": ``, }, ``` Conditional so CI (Mailpit) doesn't need the env var. 2. **Capture and return Nodemailer messageId** from `MagicLinksEmail`. In [magic_link.ts:39-62](../../src/app/email_templates/magic_link.ts#L39-L62), change the return type to `Promise<{ messageId: string }>` and return `{ messageId: result.messageId }` after the `result.rejected` check. 3. **Fix the misleading log** in [authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94): ```ts sendVerificationRequest: async ({ identifier, url, provider }) => { try { const { messageId } = await MagicLinksEmail({ identifier, url, provider }); console.log("EMAIL_MAGIC_LINK_SUCCESS", { email: identifier, messageId, timestamp: new Date().toISOString(), }); } catch (err) { console.error("EMAIL_MAGIC_LINK_FAILURE", { email: identifier, error: err instanceof Error ? err.message : String(err), timestamp: new Date().toISOString(), }); throw err; } }, ``` Now log presence proves "we tried", `EMAIL_MAGIC_LINK_SUCCESS` proves "SES accepted at SMTP", and `messageId` correlates to SES events. 4. **Add `SES_CONFIGURATION_SET` env var** to the deployment platform (Vercel / wherever). Value: `dev-ses-config`. 5. **Terraform — add `renderingFailure` event type and output the cfg-set name.** In the SES module: ```hcl resource "aws_ses_event_destination" "sns" { # ... matching_types = [ "send", "bounce", "reject", "complaint", "delivery", "renderingFailure", # ← add ] # ... } output "configuration_set_name" { value = aws_ses_configuration_set.this.name } ``` ### Acceptance criteria - Sending a magic link produces exactly one of: `EMAIL_MAGIC_LINK_SUCCESS` (with messageId) or `EMAIL_MAGIC_LINK_FAILURE` (with error). No bare `EMAIL MAGIC LINK SENT` lines. - The `messageId` in app logs matches the `mail.messageId` SES uses (verify once via SES console for one test send). - Terraform plan is clean after the two SES module changes. --- # IT track (parallel, out-of-band) These happen in parallel to PR 1; no engineer dependency. 1. **Create `noreply@domna.homes` as a real shared mailbox** in M365. Configure an auto-reply pointing to a monitored support address. Set `Reply-To:` on outgoing mail in PR 2 once the mailbox exists, OR change `EMAIL_FROM` to something like `accounts@domna.homes` and make *that* monitored. Update `EMAIL_FROM` env var when ready. 2. **Update the DMARC TXT record** at `_dmarc.domna.homes` from: ``` v=DMARC1; p=none; ``` to: ``` v=DMARC1; p=none; rua=mailto:; fo=1; adkim=r; aspf=r; pct=100; ``` Recommend Postmark DMARC Monitor (free) for the reporting address — they provide a unique mailbox like `@inbound-smtp.dmarc.postmarkapp.com` and parse the XML into readable reports. 3. **Progression** (after ~4 weeks of clean reports): bump to `p=quarantine; pct=10;`, then `p=quarantine; pct=100;`, then `p=reject; pct=100;`. Don't skip the ramp — HubSpot / Outlook sends need to be confirmed aligned first. ### Nice-to-have (do when convenient, not blocking) - **Custom MAIL FROM domain** (`mail.domna.homes`) — adds SPF alignment via *our* domain instead of `*.amazonses.com`. In Terraform: ```hcl resource "aws_ses_domain_mail_from" "this" { domain = aws_ses_domain_identity.this.domain mail_from_domain = "mail.${var.domain_name}" } ``` Plus the corresponding `MX 10 feedback-smtp.eu-west-2.amazonses.com` and `TXT v=spf1 include:amazonses.com -all` at `mail.domna.homes`. Modest but real deliverability win. --- # PR 2 — Code fallback flow (behind feature flag) **Goal:** every magic-link email now also contains a 6-digit code. The post-submit UX leads with code entry; the link still works as a fast path. **Risk:** medium — changes the user-facing auth flow. Feature flag mitigates. **Effort:** ~2-3 days. ### Schema migration Add to the existing `verificationToken` table in [src/app/db/schema/users.ts](../../src/app/db/schema/users.ts): ```ts export const verificationTokens = pgTable("verificationToken", { identifier: text("identifier").notNull(), token: text("token").notNull(), expires: timestamp("expires", { mode: "date" }).notNull(), codeHash: text("code_hash"), // ← new (nullable) attempts: integer("attempts").notNull().default(0), // ← new }); ``` Also new table for rate-limit state: ```ts export const authRateLimits = pgTable("authRateLimits", { scope: text("scope").notNull(), // "email-send" | "ip-verify" key: text("key").notNull(), // email or IP count: integer("count").notNull().default(0), windowStart: timestamp("window_start").notNull(), }, (t) => ({ pk: primaryKey({ columns: [t.scope, t.key] }), })); ``` Drizzle migration is additive — existing rows get `code_hash = NULL`, `attempts = 0`. Pre-deploy magic-link emails (those without codes) continue to work via the link path; users on the code-entry page will fail and need to resend. Acceptable transient. ### Code utility — `src/lib/auth/code.ts` ```ts import crypto from "crypto"; export function generateCode(): string { return crypto.randomInt(0, 1_000_000).toString().padStart(6, "0"); } export function hashCode(code: string, secret: string): string { return crypto.createHash("sha256").update(code + secret).digest("hex"); } ``` `crypto.randomInt` not `Math.random` — must be CSPRNG. ### `sendVerificationRequest` update In [authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94): 1. Generate code via `generateCode()` 2. Hash with `NEXTAUTH_SECRET` 3. `UPDATE verificationToken SET code_hash = ?, attempts = 0 WHERE identifier = ? AND token = ?` 4. Pass `code` (plaintext) into `MagicLinksEmail` 5. Existing logging unchanged ### Email template update [magic_link.ts](../../src/app/email_templates/magic_link.ts) takes a new `code` argument. The HTML and plain-text bodies lead with: > **Your sign-in code: 482 911** > > Enter this at `ara.domna.homes/auth/verify-code` > > Or click here to skip the code step: [Sign in to Ara] → Code rendered large and monospace; the link is a smaller secondary button. ### NextAuth providers — add `CredentialsProvider` In [authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts): ```ts import CredentialsProvider from "next-auth/providers/credentials"; CredentialsProvider({ id: "email-code", name: "Email Code", credentials: { email: { type: "text" }, code: { type: "text" }, }, async authorize(credentials) { if (!credentials?.email || !credentials?.code) return null; const email = credentials.email.toLowerCase(); const hashed = hashCode(credentials.code, process.env.NEXTAUTH_SECRET!); // Look up the row by (identifier, code_hash, not-expired) // If found: check attempts < 5, delete row, return user // If found but attempts >= 5: delete row, return null // If not found: increment attempts on the row matching identifier (if any), return null // Rate-limit check before any of this; on exceed, return null silently }, }), ``` Verify the `signIn` callback at [authOptions.ts:114-206](../../src/app/api/auth/[...nextauth]/authOptions.ts#L114-L206) handles `account.type === "credentials"` cleanly (it should fall through the OAuth branches at lines 142 and 180, but worth a unit test). ### Drop magic-link maxAge Change [authOptions.ts:84](../../src/app/api/auth/[...nextauth]/authOptions.ts#L84) from `60 * 60` to `60 * 10`. Code and link now share the 10-min window. ### New page — `src/app/auth/verify-code/page.tsx` Code-input form. POSTs `{email, code}` to NextAuth credentials endpoint via `signIn("email-code", { email, code, redirect: false })`. Includes a "Resend code" button (rate-limited via the new `auth_rate_limits` table). ### Redirect change [authOptions.ts:103](../../src/app/api/auth/[...nextauth]/authOptions.ts#L103): change `verifyRequest: "/auth/verify-request"` to `verifyRequest: "/auth/verify-code?email="` so post-submit lands on the code-entry page with the email pre-filled. ### Existing `/verify/[token]` page Unchanged behaviour — the link path still works. Optionally add a small "Type the code instead" link to the code-entry page for the rare user who prefers it. ### Feature flag Env var `AUTH_CODE_FALLBACK_ENABLED=true|false`. When `false`, skip the code generation in `sendVerificationRequest` and redirect to the old `/auth/verify-request` page. Lets us canary and roll back via env-var flip. ### Acceptance criteria - Submitting email at `/` lands on `/auth/verify-code` (when flag enabled) - The email body contains a clear 6-digit code AND a working link - Typing the code on the verify-code page signs the user in - Clicking the link still signs the user in (single artifact — using the link invalidates the code, and vice versa) - 5 wrong code attempts deletes the row; a 6th attempt with the correct code fails - 6th code request within an hour to the same email is silently no-op'd - After successful sign-in, behaviour identical to today (lands on `/home`, user record + `lastLogin` updated, etc.) --- # Testing strategy ### CI (every PR) - **Vitest** for pure-function unit tests (existing pattern — see [src/app/email_templates/buildMailHeaders.test.ts](../../src/app/email_templates/buildMailHeaders.test.ts) added in PR 1). - **Cypress** for E2E (existing harness at `cypress/e2e/`) covering: - Code path: submit → poll Mailpit → extract code → enter → land on `/home` - Link path: submit → poll Mailpit → extract URL → visit → confirm → land on `/home` - Wrong code: 5 wrong attempts invalidate; 6th with correct code fails - Expired code: code submitted >10 min after generation fails - **Mailpit** as a docker-compose service. SMTP at `:1025`, JSON API at `:8025`. Cypress task helpers (`cy.task`) call the Mailpit JSON API to extract codes/links from captured emails. - **Unit tests** for `generateCode()` distribution, `hashCode()` determinism, rate-limit math (Vitest). ### Pre-release (manual gate on template changes) - Run the template through [mail-tester.com](https://mail-tester.com) before shipping. Target >9/10. Free, 60 seconds. ### Production - Eyeball the SES Account dashboard weekly: bounce rate, complaint rate, reputation status. - One CloudWatch alarm: bounce rate >2% sustained → SNS → email the team. - When investigating an incident, temporarily subscribe an SQS queue or email endpoint to `dev-ses-events` SNS topic to capture events for the duration. Unsubscribe after. When volume grows past ~50 sign-ups/day, pick up the backlog ticket for the SES event-consumer Lambda. --- # Files touched (summary) | Phase | File | Change | |---|---|---| | PR 1 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Add `X-SES-CONFIGURATION-SET` header; return `messageId` | | PR 1 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Replace `sendVerificationRequest` logging with try/catch + messageId | | PR 1 | Terraform `modules/ses/` | Add `renderingFailure` event type + cfg-set name output | | PR 1 | Deployment env vars | Add `SES_CONFIGURATION_SET=dev-ses-config` | | IT | DNS — `_dmarc.domna.homes` | Update DMARC TXT value | | IT | M365 admin | Create `noreply@domna.homes` shared mailbox | | PR 2 | [src/app/db/schema/users.ts](../../src/app/db/schema/users.ts) | Add `code_hash`, `attempts` columns; new `authRateLimits` table | | PR 2 | `src/lib/auth/code.ts` (new) | `generateCode()`, `hashCode()` | | PR 2 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Accept `code` arg; render code prominently in HTML + plaintext | | PR 2 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Add `CredentialsProvider`; generate+persist code in `sendVerificationRequest`; drop `maxAge` to 600; change `verifyRequest` redirect | | PR 2 | `src/app/auth/verify-code/page.tsx` (new) | Code-input form + resend button | | PR 2 | `src/lib/auth/rate-limit.ts` (new) | Per-email-send, per-code-attempt, per-IP-verify limiters backed by `authRateLimits` table | | PR 2 | `.github/workflows/test.yml` | Mailpit service + Cypress E2E | | PR 2 | `docker-compose.yml` | Mailpit local-dev service | | PR 2 | Deployment env vars | Add `AUTH_CODE_FALLBACK_ENABLED` flag | --- # Open questions None blocking. Possibilities to revisit after PR 2 ships: - Do enterprise customers (paid relationships, formal SLAs) want Microsoft work-account OAuth as their primary sign-in? - Does the `.homes` TLD continue to be a deliverability bottleneck after MAIL FROM + DMARC `p=reject` are in place? If yes, evaluate moving sends to a `.com` / `.co.uk` we own. - Does the SES event volume justify the Lambda from the backlog ticket?