diff --git a/docs/wip/auth-email-code-fallback-plan.md b/docs/wip/auth-email-code-fallback-plan.md new file mode 100644 index 0000000..66713b4 --- /dev/null +++ b/docs/wip/auth-email-code-fallback-plan.md @@ -0,0 +1,451 @@ +# Email-auth code-fallback — implementation plan + +**Status:** Ready to implement (2026-05-27) +**Author:** Khalim (with Claude, via `/grill-me`) +**Trigger:** Multiple corporate-domain users (Atkins, Sustainable Building UK, +Arup pre-fix) reporting magic-link emails never arriving. One IT department +confirmed direct sender-policy block. + +## Problem + +[Magic-link sign-in](../../src/app/api/auth/[...nextauth]/authOptions.ts#L74-L95) +emails are being **silently quarantined** by some corporate email gateways +(Microsoft 365 Defender / Mimecast / Proofpoint family). The mail reaches the +recipient's MX, is accepted at SMTP, then disappears post-acceptance — never +hits the inbox or junk folder. SES sees a clean `Delivery` event; the user +sees nothing. + +This is invisible to the sender by design. We confirmed it via the Atkins- +adjacent case (Craig Williams, sustainablebuildinguk.com) where the +recipient's IT department reached out, confirmed the block, and manually +allow-listed `noreply@domna.homes`. + +### What's been ruled out + +- SPF, DKIM are correctly configured (verified via `dig`) +- SES suppression list does not contain the failing users +- Bounce rate is 0.03%, complaint rate 0.01% — reputation is fine +- Link pre-fetching is already defended by + [/verify/[token]/page.tsx](../../src/app/verify/[token]/page.tsx) (two-step + consent before token consumption) + +### What's contributing but not the root cause + +- `noreply@domna.homes` is not a real mailbox → small reputation hit and + broken `List-Unsubscribe` ([magic_link.ts:54](../../src/app/email_templates/magic_link.ts#L54)) +- DMARC is at `p=none` with no `rua=` reporting — zero visibility into + alignment failures +- `.homes` is a 2014-era new gTLD with thin aggregate reputation at some + receivers (structural, hard to fully mitigate without changing sender domain) +- App's `EMAIL_MAGIC_LINK_SENT` log fires *before* the SMTP transaction and + isn't wrapped in try/catch, so it can't distinguish "tried" from "sent" + ([authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94)) +- SES configuration set exists but the app never sets the + `X-SES-CONFIGURATION-SET` header, so events don't flow through it + ([magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55)) + +## Decisions (grilled and locked) + +| # | Decision | Rationale | +|---|---|---| +| D1 | Code-first email auth with link as fast-path fallback (option b2) — both delivered in the same email | Code is more filter-friendly than a long opaque URL; SafeLinks can't break a 6-digit code; cross-device UX is naturally same-device | +| D2 | 6-digit numeric code | Industry standard (AWS/Stripe/GitHub/Anthropic Console); easy to read aloud and type on mobile | +| D3 | 10-minute expiry for both code and link (single artifact) | Code's lower entropy demands shorter window; unifies behaviour | +| D4 | Single DB row per verification — extend `verificationToken` with `code_hash` and `attempts` columns | No new table; both paths consume the same row; deletion on first success enforces single-use | +| D5 | Rate limits: 5 attempts/code, 5 codes/email/hour, 20 verify-attempts/IP/10min — stored in Postgres | Postgres avoids Redis dependency; numbers give attacker ~1-in-40k/hour worst case | +| D6 | NextAuth v4 `CredentialsProvider` alongside existing `EmailProvider` | Cleanest extension point; works with existing JWT session strategy at [authOptions.ts:269](../../src/app/api/auth/[...nextauth]/authOptions.ts#L269); reuses existing user / `signIn` callback | +| D7 | Ship in two PRs: observability first, code-flow second behind feature flag | Lets us measure the improvement; isolates risk | +| D8 | Testing: Mailpit + Playwright in CI, mail-tester.com manual on template changes, SES dashboard for prod | Lean — no paid SaaS at current scale | +| D9 | Skip the SES event-consumer Lambda for now; ticket on backlog | Volume doesn't justify it; SES dashboard + suppression list + customer-success channel are enough for forensics today | + +## Out of scope / deferred + +- **Microsoft work-account OAuth** (Azure AD multi-tenant) — would skip email + entirely for M365 corporate users but locks them to MS SSO; revisit if + enterprise customers demand it +- **Switching sending domain off `.homes`** — measurable deliverability lift + but requires re-warmup and brand thinking; revisit if `domna.homes` + reputation stays a bottleneck after the DMARC + MAIL FROM work +- **SES event-consumer Lambda** — ticketed for the backlog (see planning + board: "Wire up SES event consumer for email auth forensics") +- **BIMI / VMC, dedicated SES IP, GlockApps placement testing** — overkill at + current scale +- **Case-sensitivity bug in portfolio invitations** (separate issue) — user + `Craig.Williams@…` invited with mixed case, then signed in lowercase, + creating two user records. The lookup at + [authOptions.ts:117](../../src/app/api/auth/[...nextauth]/authOptions.ts#L117) + normalises but the *invitation* path doesn't. Track separately — not in + this plan. + +--- + +# PR 1 — Observability + logging fixes + +**Goal:** when the next Atkins-shaped incident happens, we know within minutes +whether SES tried, whether SES accepted, and whether the recipient SMTP +accepted. Today we know none of these. + +**Risk:** near-zero. Additive only. + +**Effort:** ~half a day. + +### Changes + +1. **Add `X-SES-CONFIGURATION-SET` header** to outbound mail. + + In [magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55): + + ```ts + headers: { + ...(process.env.SES_CONFIGURATION_SET && { + "X-SES-CONFIGURATION-SET": process.env.SES_CONFIGURATION_SET, + }), + "List-Unsubscribe": ``, + }, + ``` + + Conditional so CI (Mailpit) doesn't need the env var. + +2. **Capture and return Nodemailer messageId** from `MagicLinksEmail`. + + In [magic_link.ts:39-62](../../src/app/email_templates/magic_link.ts#L39-L62), + change the return type to `Promise<{ messageId: string }>` and return + `{ messageId: result.messageId }` after the `result.rejected` check. + +3. **Fix the misleading log** in + [authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94): + + ```ts + sendVerificationRequest: async ({ identifier, url, provider }) => { + try { + const { messageId } = await MagicLinksEmail({ identifier, url, provider }); + console.log("EMAIL_MAGIC_LINK_SUCCESS", { + email: identifier, + messageId, + timestamp: new Date().toISOString(), + }); + } catch (err) { + console.error("EMAIL_MAGIC_LINK_FAILURE", { + email: identifier, + error: err instanceof Error ? err.message : String(err), + timestamp: new Date().toISOString(), + }); + throw err; + } + }, + ``` + + Now log presence proves "we tried", `EMAIL_MAGIC_LINK_SUCCESS` proves "SES + accepted at SMTP", and `messageId` correlates to SES events. + +4. **Add `SES_CONFIGURATION_SET` env var** to the deployment platform + (Vercel / wherever). Value: `dev-ses-config`. + +5. **Terraform — add `renderingFailure` event type and output the cfg-set + name.** In the SES module: + + ```hcl + resource "aws_ses_event_destination" "sns" { + # ... + matching_types = [ + "send", + "bounce", + "reject", + "complaint", + "delivery", + "renderingFailure", # ← add + ] + # ... + } + + output "configuration_set_name" { + value = aws_ses_configuration_set.this.name + } + ``` + +### Acceptance criteria + +- Sending a magic link produces exactly one of: `EMAIL_MAGIC_LINK_SUCCESS` + (with messageId) or `EMAIL_MAGIC_LINK_FAILURE` (with error). No bare + `EMAIL MAGIC LINK SENT` lines. +- The `messageId` in app logs matches the `mail.messageId` SES uses (verify + once via SES console for one test send). +- Terraform plan is clean after the two SES module changes. + +--- + +# IT track (parallel, out-of-band) + +These happen in parallel to PR 1; no engineer dependency. + +1. **Create `noreply@domna.homes` as a real shared mailbox** in M365. + Configure an auto-reply pointing to a monitored support address. Set + `Reply-To:` on outgoing mail in PR 2 once the mailbox exists, OR change + `EMAIL_FROM` to something like `accounts@domna.homes` and make *that* + monitored. Update `EMAIL_FROM` env var when ready. + +2. **Update the DMARC TXT record** at `_dmarc.domna.homes` from: + + ``` + v=DMARC1; p=none; + ``` + + to: + + ``` + v=DMARC1; p=none; rua=mailto:; fo=1; adkim=r; aspf=r; pct=100; + ``` + + Recommend Postmark DMARC Monitor (free) for the reporting address — they + provide a unique mailbox like `@inbound-smtp.dmarc.postmarkapp.com` + and parse the XML into readable reports. + +3. **Progression** (after ~4 weeks of clean reports): bump to + `p=quarantine; pct=10;`, then `p=quarantine; pct=100;`, then + `p=reject; pct=100;`. Don't skip the ramp — HubSpot / Outlook sends need + to be confirmed aligned first. + +### Nice-to-have (do when convenient, not blocking) + +- **Custom MAIL FROM domain** (`mail.domna.homes`) — adds SPF alignment via + *our* domain instead of `*.amazonses.com`. In Terraform: + ```hcl + resource "aws_ses_domain_mail_from" "this" { + domain = aws_ses_domain_identity.this.domain + mail_from_domain = "mail.${var.domain_name}" + } + ``` + Plus the corresponding `MX 10 feedback-smtp.eu-west-2.amazonses.com` and + `TXT v=spf1 include:amazonses.com -all` at `mail.domna.homes`. Modest but + real deliverability win. + +--- + +# PR 2 — Code fallback flow (behind feature flag) + +**Goal:** every magic-link email now also contains a 6-digit code. The +post-submit UX leads with code entry; the link still works as a fast path. + +**Risk:** medium — changes the user-facing auth flow. Feature flag mitigates. + +**Effort:** ~2-3 days. + +### Schema migration + +Add to the existing `verificationToken` table in +[src/app/db/schema/users.ts](../../src/app/db/schema/users.ts): + +```ts +export const verificationTokens = pgTable("verificationToken", { + identifier: text("identifier").notNull(), + token: text("token").notNull(), + expires: timestamp("expires", { mode: "date" }).notNull(), + codeHash: text("code_hash"), // ← new (nullable) + attempts: integer("attempts").notNull().default(0), // ← new +}); +``` + +Also new table for rate-limit state: + +```ts +export const authRateLimits = pgTable("authRateLimits", { + scope: text("scope").notNull(), // "email-send" | "ip-verify" + key: text("key").notNull(), // email or IP + count: integer("count").notNull().default(0), + windowStart: timestamp("window_start").notNull(), +}, (t) => ({ + pk: primaryKey({ columns: [t.scope, t.key] }), +})); +``` + +Drizzle migration is additive — existing rows get `code_hash = NULL`, +`attempts = 0`. Pre-deploy magic-link emails (those without codes) continue +to work via the link path; users on the code-entry page will fail and need +to resend. Acceptable transient. + +### Code utility — `src/lib/auth/code.ts` + +```ts +import crypto from "crypto"; + +export function generateCode(): string { + return crypto.randomInt(0, 1_000_000).toString().padStart(6, "0"); +} + +export function hashCode(code: string, secret: string): string { + return crypto.createHash("sha256").update(code + secret).digest("hex"); +} +``` + +`crypto.randomInt` not `Math.random` — must be CSPRNG. + +### `sendVerificationRequest` update + +In [authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94): + +1. Generate code via `generateCode()` +2. Hash with `NEXTAUTH_SECRET` +3. `UPDATE verificationToken SET code_hash = ?, attempts = 0 WHERE identifier = ? AND token = ?` +4. Pass `code` (plaintext) into `MagicLinksEmail` +5. Existing logging unchanged + +### Email template update + +[magic_link.ts](../../src/app/email_templates/magic_link.ts) takes a new +`code` argument. The HTML and plain-text bodies lead with: + +> **Your sign-in code: 482 911** +> +> Enter this at `ara.domna.homes/auth/verify-code` +> +> Or click here to skip the code step: [Sign in to Ara] → + +Code rendered large and monospace; the link is a smaller secondary button. + +### NextAuth providers — add `CredentialsProvider` + +In [authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts): + +```ts +import CredentialsProvider from "next-auth/providers/credentials"; + +CredentialsProvider({ + id: "email-code", + name: "Email Code", + credentials: { + email: { type: "text" }, + code: { type: "text" }, + }, + async authorize(credentials) { + if (!credentials?.email || !credentials?.code) return null; + const email = credentials.email.toLowerCase(); + const hashed = hashCode(credentials.code, process.env.NEXTAUTH_SECRET!); + + // Look up the row by (identifier, code_hash, not-expired) + // If found: check attempts < 5, delete row, return user + // If found but attempts >= 5: delete row, return null + // If not found: increment attempts on the row matching identifier (if any), return null + // Rate-limit check before any of this; on exceed, return null silently + }, +}), +``` + +Verify the `signIn` callback at +[authOptions.ts:114-206](../../src/app/api/auth/[...nextauth]/authOptions.ts#L114-L206) +handles `account.type === "credentials"` cleanly (it should fall through the +OAuth branches at lines 142 and 180, but worth a unit test). + +### Drop magic-link maxAge + +Change [authOptions.ts:84](../../src/app/api/auth/[...nextauth]/authOptions.ts#L84) +from `60 * 60` to `60 * 10`. Code and link now share the 10-min window. + +### New page — `src/app/auth/verify-code/page.tsx` + +Code-input form. POSTs `{email, code}` to NextAuth credentials endpoint via +`signIn("email-code", { email, code, redirect: false })`. Includes a +"Resend code" button (rate-limited via the new `auth_rate_limits` table). + +### Redirect change + +[authOptions.ts:103](../../src/app/api/auth/[...nextauth]/authOptions.ts#L103): +change `verifyRequest: "/auth/verify-request"` to +`verifyRequest: "/auth/verify-code?email="` so post-submit lands on +the code-entry page with the email pre-filled. + +### Existing `/verify/[token]` page + +Unchanged behaviour — the link path still works. Optionally add a small +"Type the code instead" link to the code-entry page for the rare user who +prefers it. + +### Feature flag + +Env var `AUTH_CODE_FALLBACK_ENABLED=true|false`. When `false`, skip the code +generation in `sendVerificationRequest` and redirect to the old +`/auth/verify-request` page. Lets us canary and roll back via env-var flip. + +### Acceptance criteria + +- Submitting email at `/` lands on `/auth/verify-code` (when flag enabled) +- The email body contains a clear 6-digit code AND a working link +- Typing the code on the verify-code page signs the user in +- Clicking the link still signs the user in (single artifact — using the + link invalidates the code, and vice versa) +- 5 wrong code attempts deletes the row; a 6th attempt with the correct code + fails +- 6th code request within an hour to the same email is silently no-op'd +- After successful sign-in, behaviour identical to today (lands on `/home`, + user record + `lastLogin` updated, etc.) + +--- + +# Testing strategy + +### CI (every PR) + +- **Vitest** for pure-function unit tests (existing pattern — see + [src/app/email_templates/buildMailHeaders.test.ts](../../src/app/email_templates/buildMailHeaders.test.ts) + added in PR 1). +- **Cypress** for E2E (existing harness at `cypress/e2e/`) covering: + - Code path: submit → poll Mailpit → extract code → enter → land on `/home` + - Link path: submit → poll Mailpit → extract URL → visit → confirm → land on `/home` + - Wrong code: 5 wrong attempts invalidate; 6th with correct code fails + - Expired code: code submitted >10 min after generation fails +- **Mailpit** as a docker-compose service. SMTP at `:1025`, JSON API at + `:8025`. Cypress task helpers (`cy.task`) call the Mailpit JSON API to + extract codes/links from captured emails. +- **Unit tests** for `generateCode()` distribution, `hashCode()` + determinism, rate-limit math (Vitest). + +### Pre-release (manual gate on template changes) + +- Run the template through [mail-tester.com](https://mail-tester.com) before + shipping. Target >9/10. Free, 60 seconds. + +### Production + +- Eyeball the SES Account dashboard weekly: bounce rate, complaint rate, + reputation status. +- One CloudWatch alarm: bounce rate >2% sustained → SNS → email the team. +- When investigating an incident, temporarily subscribe an SQS queue or + email endpoint to `dev-ses-events` SNS topic to capture events for the + duration. Unsubscribe after. + +When volume grows past ~50 sign-ups/day, pick up the backlog ticket for the +SES event-consumer Lambda. + +--- + +# Files touched (summary) + +| Phase | File | Change | +|---|---|---| +| PR 1 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Add `X-SES-CONFIGURATION-SET` header; return `messageId` | +| PR 1 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Replace `sendVerificationRequest` logging with try/catch + messageId | +| PR 1 | Terraform `modules/ses/` | Add `renderingFailure` event type + cfg-set name output | +| PR 1 | Deployment env vars | Add `SES_CONFIGURATION_SET=dev-ses-config` | +| IT | DNS — `_dmarc.domna.homes` | Update DMARC TXT value | +| IT | M365 admin | Create `noreply@domna.homes` shared mailbox | +| PR 2 | [src/app/db/schema/users.ts](../../src/app/db/schema/users.ts) | Add `code_hash`, `attempts` columns; new `authRateLimits` table | +| PR 2 | `src/lib/auth/code.ts` (new) | `generateCode()`, `hashCode()` | +| PR 2 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Accept `code` arg; render code prominently in HTML + plaintext | +| PR 2 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Add `CredentialsProvider`; generate+persist code in `sendVerificationRequest`; drop `maxAge` to 600; change `verifyRequest` redirect | +| PR 2 | `src/app/auth/verify-code/page.tsx` (new) | Code-input form + resend button | +| PR 2 | `src/lib/auth/rate-limit.ts` (new) | Per-email-send, per-code-attempt, per-IP-verify limiters backed by `authRateLimits` table | +| PR 2 | `.github/workflows/test.yml` | Mailpit service + Cypress E2E | +| PR 2 | `docker-compose.yml` | Mailpit local-dev service | +| PR 2 | Deployment env vars | Add `AUTH_CODE_FALLBACK_ENABLED` flag | + +--- + +# Open questions + +None blocking. Possibilities to revisit after PR 2 ships: + +- Do enterprise customers (paid relationships, formal SLAs) want Microsoft + work-account OAuth as their primary sign-in? +- Does the `.homes` TLD continue to be a deliverability bottleneck after + MAIL FROM + DMARC `p=reject` are in place? If yes, evaluate moving sends + to a `.com` / `.co.uk` we own. +- Does the SES event volume justify the Lambda from the backlog ticket? diff --git a/src/app/api/auth/[...nextauth]/authOptions.ts b/src/app/api/auth/[...nextauth]/authOptions.ts index 09b6b70..3a3b13b 100644 --- a/src/app/api/auth/[...nextauth]/authOptions.ts +++ b/src/app/api/auth/[...nextauth]/authOptions.ts @@ -82,15 +82,26 @@ export const AuthOptions: NextAuthOptions = { }, from: EMAIL_FROM, // noreply email maxAge: 60 * 60, // magic link valid for 1 hour - // Slightly extended magic link email sender, to log sends sendVerificationRequest: async ({ identifier, url, provider }) => { - console.log("EMAIL MAGIC LINK SENT:", { - email: identifier, - url, - timestamp: new Date().toISOString(), - }); - - await MagicLinksEmail({ identifier, url, provider }); + try { + const { messageId } = await MagicLinksEmail({ + identifier, + url, + provider, + }); + console.log("EMAIL_MAGIC_LINK_SUCCESS", { + email: identifier, + messageId, + timestamp: new Date().toISOString(), + }); + } catch (err) { + console.error("EMAIL_MAGIC_LINK_FAILURE", { + email: identifier, + error: err instanceof Error ? err.message : String(err), + timestamp: new Date().toISOString(), + }); + throw err; + } }, }), ], diff --git a/src/app/email_templates/buildMailHeaders.test.ts b/src/app/email_templates/buildMailHeaders.test.ts new file mode 100644 index 0000000..bc02dcc --- /dev/null +++ b/src/app/email_templates/buildMailHeaders.test.ts @@ -0,0 +1,31 @@ +import { describe, expect, it } from "vitest"; +import { buildMailHeaders } from "./buildMailHeaders"; + +describe("buildMailHeaders", () => { + it("includes X-SES-CONFIGURATION-SET when a configuration set is provided", () => { + const headers = buildMailHeaders({ + fromAddress: "noreply@domna.homes", + sesConfigurationSet: "dev-ses-config", + }); + + expect(headers["X-SES-CONFIGURATION-SET"]).toBe("dev-ses-config"); + }); + + it("omits X-SES-CONFIGURATION-SET entirely when no configuration set is provided", () => { + const headers = buildMailHeaders({ + fromAddress: "noreply@domna.homes", + sesConfigurationSet: undefined, + }); + + expect(headers).not.toHaveProperty("X-SES-CONFIGURATION-SET"); + }); + + it("always includes a List-Unsubscribe mailto pointing at the from address", () => { + const headers = buildMailHeaders({ + fromAddress: "noreply@domna.homes", + sesConfigurationSet: undefined, + }); + + expect(headers["List-Unsubscribe"]).toBe(""); + }); +}); diff --git a/src/app/email_templates/buildMailHeaders.ts b/src/app/email_templates/buildMailHeaders.ts new file mode 100644 index 0000000..9aed8b4 --- /dev/null +++ b/src/app/email_templates/buildMailHeaders.ts @@ -0,0 +1,12 @@ +export function buildMailHeaders(opts: { + fromAddress: string; + sesConfigurationSet: string | undefined; +}): Record { + const headers: Record = { + "List-Unsubscribe": ``, + }; + if (opts.sesConfigurationSet) { + headers["X-SES-CONFIGURATION-SET"] = opts.sesConfigurationSet; + } + return headers; +} diff --git a/src/app/email_templates/magic_link.ts b/src/app/email_templates/magic_link.ts index de84b41..f129489 100644 --- a/src/app/email_templates/magic_link.ts +++ b/src/app/email_templates/magic_link.ts @@ -3,6 +3,7 @@ // links import { createTransport } from "nodemailer"; +import { buildMailHeaders } from "./buildMailHeaders"; export async function MagicLinksEmail({ identifier, @@ -12,7 +13,7 @@ export async function MagicLinksEmail({ identifier: string; url: string; provider: { server: any; from: string }; -}) { +}): Promise<{ messageId: string }> { const parsed = new URL(url); const host = parsed.host; @@ -50,15 +51,18 @@ export async function MagicLinksEmail({ brown, background, }), - headers: { - "List-Unsubscribe": ``, - }, + headers: buildMailHeaders({ + fromAddress: provider.from, + sesConfigurationSet: process.env.SES_CONFIGURATION_SET, + }), }); const failed = result.rejected.filter(Boolean); if (failed.length) { throw new Error(`Email(s) (${failed.join(", ")}) could not be sent`); } + + return { messageId: result.messageId }; } function domnaHtml({