assessment-model/docs/wip/auth-email-code-fallback-plan.md
Khalim Conn-Kowlessar 9c569f5584 Add SES observability foundation for email auth (PR 1 of code-fallback)
Wires the X-SES-CONFIGURATION-SET header on outbound auth emails so SES
bounce/delivery events flow through dev-ses-config to the dev-ses-events
SNS topic. Replaces the fire-and-forget "EMAIL MAGIC LINK SENT" log
(which fired before the SMTP transaction and swallowed downstream errors)
with structured EMAIL_MAGIC_LINK_SUCCESS/_FAILURE logs carrying the
Nodemailer messageId, so app-side sends are now correlatable with SES
events.

Motivated by the Atkins / Sustainable Building UK silent-quarantine
incidents where we couldn't tell whether SES had even tried to send.

Plan doc at docs/wip/auth-email-code-fallback-plan.md tracks the
broader email-code-fallback design that PR 2 will implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:15:25 +00:00


# Email-auth code-fallback — implementation plan
**Status:** Ready to implement (2026-05-27)
**Author:** Khalim (with Claude, via `/grill-me`)
**Trigger:** Multiple corporate-domain users (Atkins, Sustainable Building UK,
Arup pre-fix) reporting magic-link emails never arriving. One IT department
confirmed direct sender-policy block.
## Problem
[Magic-link sign-in](../../src/app/api/auth/[...nextauth]/authOptions.ts#L74-L95)
emails are being **silently quarantined** by some corporate email gateways
(Microsoft 365 Defender / Mimecast / Proofpoint family). The mail reaches the
recipient's MX, is accepted at SMTP, then disappears post-acceptance — never
hits the inbox or junk folder. SES sees a clean `Delivery` event; the user
sees nothing.
This is invisible to the sender by design. We confirmed it via the
Atkins-adjacent case (Craig Williams, sustainablebuildinguk.com) where the
recipient's IT department reached out, confirmed the block, and manually
allow-listed `noreply@domna.homes`.
### What's been ruled out
- SPF, DKIM are correctly configured (verified via `dig`)
- SES suppression list does not contain the failing users
- Bounce rate is 0.03%, complaint rate 0.01% — reputation is fine
- Link pre-fetching is already defended by
[/verify/[token]/page.tsx](../../src/app/verify/[token]/page.tsx) (two-step
consent before token consumption)
### What's contributing but not the root cause
- `noreply@domna.homes` is not a real mailbox → small reputation hit and
broken `List-Unsubscribe` ([magic_link.ts:54](../../src/app/email_templates/magic_link.ts#L54))
- DMARC is at `p=none` with no `rua=` reporting — zero visibility into
alignment failures
- `.homes` is a 2014-era new gTLD with thin aggregate reputation at some
receivers (structural, hard to fully mitigate without changing sender domain)
- App's `EMAIL MAGIC LINK SENT` log fires *before* the SMTP transaction and
isn't wrapped in try/catch, so it can't distinguish "tried" from "sent"
([authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94))
- SES configuration set exists but the app never sets the
`X-SES-CONFIGURATION-SET` header, so events don't flow through it
([magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55))
## Decisions (grilled and locked)
| # | Decision | Rationale |
|---|---|---|
| D1 | Code-first email auth with link as fast-path fallback (option b2), both delivered in the same email | Code is more filter-friendly than a long opaque URL; SafeLinks can't break a 6-digit code; a code read on one device can be typed into the browser session on another, so sign-in finishes on the device where it started |
| D2 | 6-digit numeric code | Industry standard (AWS/Stripe/GitHub/Anthropic Console); easy to read aloud and type on mobile |
| D3 | 10-minute expiry for both code and link (single artifact) | Code's lower entropy demands shorter window; unifies behaviour |
| D4 | Single DB row per verification — extend `verificationToken` with `code_hash` and `attempts` columns | No new table; both paths consume the same row; deletion on first success enforces single-use |
| D5 | Rate limits: 5 attempts/code, 5 codes/email/hour, 20 verify-attempts/IP/10min — stored in Postgres | Postgres avoids Redis dependency; numbers give attacker ~1-in-40k/hour worst case |
| D6 | NextAuth v4 `CredentialsProvider` alongside existing `EmailProvider` | Cleanest extension point; works with existing JWT session strategy at [authOptions.ts:269](../../src/app/api/auth/[...nextauth]/authOptions.ts#L269); reuses existing user / `signIn` callback |
| D7 | Ship in two PRs: observability first, code-flow second behind feature flag | Lets us measure the improvement; isolates risk |
| D8 | Testing: Mailpit + Cypress in CI, mail-tester.com manual on template changes, SES dashboard for prod | Lean: no paid SaaS at current scale |
| D9 | Skip the SES event-consumer Lambda for now; ticket on backlog | Volume doesn't justify it; SES dashboard + suppression list + customer-success channel are enough for forensics today |
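
The ~1-in-40k figure in D5 is the product of the two per-email limits set against the size of the code space. A quick check, assuming codes are uniform over the 1,000,000 possibilities and the attacker uses every allowed guess (the function name is illustrative only):

```typescript
// Worst-case guessing odds per email per hour under the D5 limits.
// Assumes codes are uniform over 000000-999999 and every allowed attempt is used.
export function guessOddsPerHour(
  attemptsPerCode: number,
  codesPerHour: number,
  codeSpace: number,
): { guesses: number; oneIn: number } {
  const guesses = attemptsPerCode * codesPerHour;
  return { guesses, oneIn: codeSpace / guesses };
}

const d5 = guessOddsPerHour(5, 5, 1_000_000);
console.log(d5); // { guesses: 25, oneIn: 40000 }
```

The per-IP limit (20 verify attempts per 10 minutes) caps an attacker who spreads guesses across many emails; it does not change the per-email figure.
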
## Out of scope / deferred
- **Microsoft work-account OAuth** (Azure AD multi-tenant) — would skip email
entirely for M365 corporate users but locks them to MS SSO; revisit if
enterprise customers demand it
- **Switching sending domain off `.homes`** — measurable deliverability lift
but requires re-warmup and brand thinking; revisit if `domna.homes`
reputation stays a bottleneck after the DMARC + MAIL FROM work
- **SES event-consumer Lambda** — ticketed for the backlog (see planning
board: "Wire up SES event consumer for email auth forensics")
- **BIMI / VMC, dedicated SES IP, GlockApps placement testing** — overkill at
current scale
- **Case-sensitivity bug in portfolio invitations** (separate issue) — user
`Craig.Williams@…` invited with mixed case, then signed in lowercase,
creating two user records. The lookup at
[authOptions.ts:117](../../src/app/api/auth/[...nextauth]/authOptions.ts#L117)
normalises but the *invitation* path doesn't. Track separately — not in
this plan.
---
# PR 1 — Observability + logging fixes
**Goal:** when the next Atkins-shaped incident happens, we know within minutes
whether SES tried, whether SES accepted, and whether the recipient SMTP
accepted. Today we know none of these.
**Risk:** near-zero. Additive only.
**Effort:** ~half a day.
### Changes
1. **Add `X-SES-CONFIGURATION-SET` header** to outbound mail.
In [magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55):
```ts
headers: {
  ...(process.env.SES_CONFIGURATION_SET && {
    "X-SES-CONFIGURATION-SET": process.env.SES_CONFIGURATION_SET,
  }),
  "List-Unsubscribe": `<mailto:${provider.from}>`,
},
```
Conditional so CI (Mailpit) doesn't need the env var.
2. **Capture and return Nodemailer messageId** from `MagicLinksEmail`.
In [magic_link.ts:39-62](../../src/app/email_templates/magic_link.ts#L39-L62),
change the return type to `Promise<{ messageId: string }>` and return
`{ messageId: result.messageId }` after the `result.rejected` check.
3. **Fix the misleading log** in
[authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94):
```ts
sendVerificationRequest: async ({ identifier, url, provider }) => {
  try {
    const { messageId } = await MagicLinksEmail({ identifier, url, provider });
    console.log("EMAIL_MAGIC_LINK_SUCCESS", {
      email: identifier,
      messageId,
      timestamp: new Date().toISOString(),
    });
  } catch (err) {
    console.error("EMAIL_MAGIC_LINK_FAILURE", {
      email: identifier,
      error: err instanceof Error ? err.message : String(err),
      timestamp: new Date().toISOString(),
    });
    throw err;
  }
},
```
With this change, the presence of either log line proves "we tried",
`EMAIL_MAGIC_LINK_SUCCESS` proves "SES accepted at SMTP", and `messageId`
correlates to SES events.
4. **Add `SES_CONFIGURATION_SET` env var** to the deployment platform
(Vercel / wherever). Value: `dev-ses-config`.
5. **Terraform — add `renderingFailure` event type and output the cfg-set
name.** In the SES module:
```hcl
resource "aws_ses_event_destination" "sns" {
  # ...
  matching_types = [
    "send",
    "bounce",
    "reject",
    "complaint",
    "delivery",
    "renderingFailure", # ← add
  ]
  # ...
}

output "configuration_set_name" {
  value = aws_ses_configuration_set.this.name
}
```
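
The conditional header from step 1 can be pulled into a pure function so it is unit-testable without a mail transport. The testing section refers to a `buildMailHeaders.test.ts`, so the sketch below guesses at a helper of that name; its real signature is not shown in this plan and the one here is an assumption.

```typescript
// Hypothetical pure helper. The name mirrors buildMailHeaders.test.ts,
// but the parameter list is an assumption made for illustration.
export function buildMailHeaders(
  from: string,
  configurationSet: string | undefined,
): Record<string, string> {
  return {
    // Attach the SES header only when a configuration set is configured
    // (so CI with Mailpit can run without the env var).
    ...(configurationSet ? { "X-SES-CONFIGURATION-SET": configurationSet } : {}),
    "List-Unsubscribe": `<mailto:${from}>`,
  };
}
```

A call site would pass `provider.from` and `process.env.SES_CONFIGURATION_SET`; an unset or empty variable simply omits the header.
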
### Acceptance criteria
- Sending a magic link produces exactly one of: `EMAIL_MAGIC_LINK_SUCCESS`
(with messageId) or `EMAIL_MAGIC_LINK_FAILURE` (with error). No bare
`EMAIL MAGIC LINK SENT` lines.
- The `messageId` in app logs matches the `mail.messageId` SES uses (verify
once via SES console for one test send).
- Terraform plan is clean after the two SES module changes.
---
# IT track (parallel, out-of-band)
These happen in parallel to PR 1; no engineer dependency.
1. **Create `noreply@domna.homes` as a real shared mailbox** in M365.
Configure an auto-reply pointing to a monitored support address. Set
`Reply-To:` on outgoing mail in PR 2 once the mailbox exists, OR change
`EMAIL_FROM` to something like `accounts@domna.homes` and make *that*
monitored. Update `EMAIL_FROM` env var when ready.
2. **Update the DMARC TXT record** at `_dmarc.domna.homes` from:
```
v=DMARC1; p=none;
```
to:
```
v=DMARC1; p=none; rua=mailto:<reporting-address>; fo=1; adkim=r; aspf=r; pct=100;
```
Recommend Postmark DMARC Monitor (free) for the reporting address — they
provide a unique mailbox like `<uuid>@inbound-smtp.dmarc.postmarkapp.com`
and parse the XML into readable reports.
3. **Progression** (after ~4 weeks of clean reports): bump to
`p=quarantine; pct=10;`, then `p=quarantine; pct=100;`, then
`p=reject; pct=100;`. Don't skip the ramp — HubSpot / Outlook sends need
to be confirmed aligned first.
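
Once IT has published the new record, a small check like the one below could confirm it carries the reporting tag. This sketch only parses a TXT string; fetching the record (for example with `dig`) is left out, and both function names are illustrative.

```typescript
// Parse a DMARC TXT value into a tag map,
// e.g. "v=DMARC1; p=none;" -> { v: "DMARC1", p: "none" }.
export function parseDmarc(record: string): Record<string, string> {
  const tags: Record<string, string> = {};
  for (const part of record.split(";")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue; // skip empty trailing segments
    const key = part.slice(0, idx).trim().toLowerCase();
    const value = part.slice(idx + 1).trim();
    if (key) tags[key] = value;
  }
  return tags;
}

// True when the record is DMARC and names an aggregate-report address.
export function hasAggregateReporting(record: string): boolean {
  const tags = parseDmarc(record);
  return tags["v"] === "DMARC1" && Boolean(tags["rua"]);
}
```
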
### Nice-to-have (do when convenient, not blocking)
- **Custom MAIL FROM domain** (`mail.domna.homes`) — adds SPF alignment via
*our* domain instead of `*.amazonses.com`. In Terraform:
```hcl
resource "aws_ses_domain_mail_from" "this" {
  domain           = aws_ses_domain_identity.this.domain
  mail_from_domain = "mail.${var.domain_name}"
}
```
Plus the corresponding `MX 10 feedback-smtp.eu-west-2.amazonses.com` and
`TXT v=spf1 include:amazonses.com -all` at `mail.domna.homes`. Modest but
real deliverability win.
---
# PR 2 — Code fallback flow (behind feature flag)
**Goal:** every magic-link email now also contains a 6-digit code. The
post-submit UX leads with code entry; the link still works as a fast path.
**Risk:** medium — changes the user-facing auth flow. Feature flag mitigates.
**Effort:** ~2-3 days.
### Schema migration
Add to the existing `verificationToken` table in
[src/app/db/schema/users.ts](../../src/app/db/schema/users.ts):
```ts
export const verificationTokens = pgTable("verificationToken", {
  identifier: text("identifier").notNull(),
  token: text("token").notNull(),
  expires: timestamp("expires", { mode: "date" }).notNull(),
  codeHash: text("code_hash"), // ← new (nullable)
  attempts: integer("attempts").notNull().default(0), // ← new
});
```
Also new table for rate-limit state:
```ts
export const authRateLimits = pgTable("authRateLimits", {
  scope: text("scope").notNull(), // "email-send" | "ip-verify"
  key: text("key").notNull(), // email or IP
  count: integer("count").notNull().default(0),
  windowStart: timestamp("window_start").notNull(),
}, (t) => ({
  pk: primaryKey({ columns: [t.scope, t.key] }),
}));
```
The Drizzle migration is additive: existing rows get `code_hash = NULL` and
`attempts = 0`. Magic-link emails sent before the deploy (which carry no
code) continue to work via the link path; a user who lands on the code-entry
page with one of those emails has no code to enter and must request a new
email. Acceptable as a transient.
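
The `count` / `windowStart` pair suggests a fixed-window counter. A minimal sketch of that logic, kept pure so it can be unit-tested without Postgres; the type and function names are hypothetical, and the choice not to count denied hits is an assumption rather than something the plan specifies:

```typescript
// One authRateLimits row, reduced to the fields the window logic needs.
export interface RateLimitRow {
  count: number;
  windowStart: Date;
}

export interface RateLimitResult {
  allowed: boolean;
  next: RateLimitRow; // row state to persist after this hit
}

// Fixed-window counter: the window resets once windowMs has elapsed since windowStart.
export function hitRateLimit(
  row: RateLimitRow | undefined,
  now: Date,
  limit: number,
  windowMs: number,
): RateLimitResult {
  if (!row || now.getTime() - row.windowStart.getTime() >= windowMs) {
    // No row yet, or the old window has lapsed: start a new window with this hit.
    return { allowed: limit > 0, next: { count: 1, windowStart: now } };
  }
  if (row.count >= limit) {
    // Over the limit: deny and leave the row unchanged.
    return { allowed: false, next: row };
  }
  return { allowed: true, next: { count: row.count + 1, windowStart: row.windowStart } };
}
```

In the real limiter the read and write would need to happen atomically (for example a single upsert), otherwise concurrent requests can both pass the check.
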
### Code utility — `src/lib/auth/code.ts`
```ts
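import crypto from "crypto";

// Returns a zero-padded 6-digit string, uniform over 000000-999999.
// randomInt's upper bound is exclusive.
export function generateCode(): string {
  return crypto.randomInt(0, 1_000_000).toString().padStart(6, "0");
}

// Deterministic hex SHA-256 of code + secret, so only the hash is stored in `code_hash`.
export function hashCode(code: string, secret: string): string {
  return crypto.createHash("sha256").update(code + secret).digest("hex");
}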
```
`crypto.randomInt` not `Math.random` — must be CSPRNG.
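
A keyed HMAC is a common alternative to hashing `code + secret` by concatenation. The sketch below shows that alternative together with a constant-time comparison. It is not the plan's chosen `hashCode`, and `hmacCode` and `codesMatch` are hypothetical names.

```typescript
import { createHmac, randomInt, timingSafeEqual } from "node:crypto";

// Same generator as the plan: CSPRNG-backed, zero-padded to six digits.
export function generateCode(): string {
  return randomInt(0, 1_000_000).toString().padStart(6, "0");
}

// Alternative to sha256(code + secret): HMAC-SHA256 keyed with the secret.
export function hmacCode(code: string, secret: string): string {
  return createHmac("sha256", secret).update(code).digest("hex");
}

// Compare a submitted code against a stored hex digest in constant time.
export function codesMatch(submitted: string, storedHash: string, secret: string): boolean {
  const a = Buffer.from(hmacCode(submitted, secret), "hex");
  const b = Buffer.from(storedHash, "hex");
  // timingSafeEqual throws on unequal lengths, so check the length first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

If the lookup stays as a database match on `code_hash`, the comparison helper is unnecessary; it only matters when the row is fetched by identifier and compared in application code.
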
### `sendVerificationRequest` update
In [authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94):
1. Generate code via `generateCode()`
2. Hash with `NEXTAUTH_SECRET`
3. `UPDATE verificationToken SET code_hash = ?, attempts = 0 WHERE identifier = ? AND token = ?`
4. Pass `code` (plaintext) into `MagicLinksEmail`
5. Existing logging unchanged
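
The five steps above could be wired together roughly as follows. This is a sketch with injected dependencies: `saveCodeHash` and `sendEmail` are hypothetical stand-ins for the Drizzle `UPDATE` and `MagicLinksEmail`, and `sendWithCode` is not a name from the codebase. It also assumes the `verificationToken` row already exists when `sendVerificationRequest` runs and that `token` is passed in the same form the table stores it; both are worth checking against the NextAuth version in use.

```typescript
import { createHash, randomInt } from "node:crypto";

// Injected so the flow can be exercised without a database or SMTP server.
// Both shapes are hypothetical stand-ins, not the real signatures.
export interface SendDeps {
  saveCodeHash(identifier: string, token: string, codeHash: string): Promise<void>;
  sendEmail(args: { identifier: string; url: string; code: string }): Promise<{ messageId: string }>;
}

export async function sendWithCode(
  identifier: string,
  token: string,
  url: string,
  secret: string,
  deps: SendDeps,
): Promise<{ messageId: string }> {
  // Step 1: generate the 6-digit code from a CSPRNG.
  const code = randomInt(0, 1_000_000).toString().padStart(6, "0");
  // Step 2: hash it with the secret; only the hash is persisted.
  const codeHash = createHash("sha256").update(code + secret).digest("hex");
  // Step 3: store the hash (and reset attempts) before anything is sent.
  await deps.saveCodeHash(identifier, token, codeHash);
  // Step 4: the plaintext code goes only into the email.
  return deps.sendEmail({ identifier, url, code });
}
```

The returned `messageId` feeds the existing success/failure logging unchanged (step 5).
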
### Email template update
[magic_link.ts](../../src/app/email_templates/magic_link.ts) takes a new
`code` argument. The HTML and plain-text bodies lead with:
> **Your sign-in code: 482 911**
>
> Enter this at `ara.domna.homes/auth/verify-code`
>
> Or click here to skip the code step: [Sign in to Ara] →
Code rendered large and monospace; the link is a smaller secondary button.
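
Because the example groups the code as `482 911`, a user may paste it with the space included. A pair of helpers along these lines would keep display and input consistent; both names are hypothetical and the grouping is taken from the example above.

```typescript
// "482911" -> "482 911" for display in the email body.
export function formatCodeForDisplay(code: string): string {
  return `${code.slice(0, 3)} ${code.slice(3)}`;
}

// Strip whitespace and hyphens so "482 911" pasted from the email still verifies.
// Returns null unless exactly six digits remain.
export function normaliseCodeInput(input: string): string | null {
  const cleaned = input.replace(/[\s-]/g, "");
  return /^\d{6}$/.test(cleaned) ? cleaned : null;
}
```

Normalising before hashing matters: the stored hash is of the six bare digits, so a code submitted with its display space would otherwise never match.
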
### NextAuth providers — add `CredentialsProvider`
In [authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts):
```ts
import CredentialsProvider from "next-auth/providers/credentials";

CredentialsProvider({
  id: "email-code",
  name: "Email Code",
  credentials: {
    email: { type: "text" },
    code: { type: "text" },
  },
  async authorize(credentials) {
    if (!credentials?.email || !credentials?.code) return null;
    const email = credentials.email.toLowerCase();
    const hashed = hashCode(credentials.code, process.env.NEXTAUTH_SECRET!);
    // Rate-limit check comes first; on exceed, return null silently.
    // Then look up the row by (identifier, code_hash, not-expired):
    //   - found and attempts < 5: delete row, return user
    //   - found but attempts >= 5: delete row, return null
    //   - not found: increment attempts on the row matching identifier (if any), return null
    return null; // placeholder so the stub type-checks until the steps above are implemented
  },
}),
```
Verify the `signIn` callback at
[authOptions.ts:114-206](../../src/app/api/auth/[...nextauth]/authOptions.ts#L114-L206)
handles `account.type === "credentials"` cleanly (it should fall through the
OAuth branches at lines 142 and 180, but worth a unit test).
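
The lookup rules in the `authorize` comments can be exercised in isolation. The sketch below uses an in-memory `Map` as a stand-in for the `verificationToken` table and assumes at most one live row per email; `verifyCode` and `TokenRow` are hypothetical names, and the rate-limit check is left out.

```typescript
import { createHash } from "node:crypto";

export interface TokenRow {
  identifier: string;
  codeHash: string | null; // null for rows created before the migration
  attempts: number;
  expires: Date;
}

const MAX_ATTEMPTS = 5;

export function hashCode(code: string, secret: string): string {
  return createHash("sha256").update(code + secret).digest("hex");
}

// Returns true only when the code matches a live row; success consumes the row.
export function verifyCode(
  rows: Map<string, TokenRow>,
  email: string,
  code: string,
  secret: string,
  now: Date,
): boolean {
  const identifier = email.toLowerCase();
  const row = rows.get(identifier);
  // No row, or a pre-migration row without a code: nothing to verify against.
  if (!row || row.codeHash === null) return false;
  if (row.expires.getTime() <= now.getTime() || row.attempts >= MAX_ATTEMPTS) {
    rows.delete(identifier); // expired or exhausted rows are dead either way
    return false;
  }
  if (row.codeHash === hashCode(code, secret)) {
    rows.delete(identifier); // single use
    return true;
  }
  row.attempts += 1;
  if (row.attempts >= MAX_ATTEMPTS) rows.delete(identifier); // fifth miss kills the row
  return false;
}
```

Deleting the row on the fifth miss is what makes a later attempt with the correct code fail, and deleting on success is what invalidates the link once the code has been used.
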
### Drop magic-link maxAge
Change [authOptions.ts:84](../../src/app/api/auth/[...nextauth]/authOptions.ts#L84)
from `60 * 60` to `60 * 10`. Code and link now share the 10-min window.
### New page — `src/app/auth/verify-code/page.tsx`
Code-input form. POSTs `{email, code}` to the NextAuth credentials endpoint via
`signIn("email-code", { email, code, redirect: false })`. Includes a
"Resend code" button (rate-limited via the new `authRateLimits` table).
### Redirect change
[authOptions.ts:103](../../src/app/api/auth/[...nextauth]/authOptions.ts#L103):
change `verifyRequest: "/auth/verify-request"` to
`verifyRequest: "/auth/verify-code?email=<encoded>"` so post-submit lands on
the code-entry page with the email pre-filled.
### Existing `/verify/[token]` page
Behaviour is unchanged: the link path still works. Optionally add a small
"Type the code instead" link on this page, pointing to the code-entry page,
for the rare user who prefers it.
### Feature flag
Env var `AUTH_CODE_FALLBACK_ENABLED=true|false`. When `false`, skip the code
generation in `sendVerificationRequest` and redirect to the old
`/auth/verify-request` page. Lets us canary and roll back via env-var flip.
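
One way to read the flag is to treat anything other than the literal string `true` as off, so a missing or mistyped value falls back to today's behaviour. A sketch with hypothetical helper names:

```typescript
type Env = Record<string, string | undefined>;

// Only the exact string "true" enables the code flow; unset or any other value disables it.
export function isCodeFallbackEnabled(env: Env): boolean {
  return env["AUTH_CODE_FALLBACK_ENABLED"] === "true";
}

// Which page the user lands on after submitting their email.
export function verifyRequestPage(env: Env): string {
  return isCodeFallbackEnabled(env) ? "/auth/verify-code" : "/auth/verify-request";
}
```

A call site would pass `process.env`; taking the environment as a parameter keeps both helpers trivially testable.
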
### Acceptance criteria
- Submitting email at `/` lands on `/auth/verify-code` (when flag enabled)
- The email body contains a clear 6-digit code AND a working link
- Typing the code on the verify-code page signs the user in
- Clicking the link still signs the user in (single artifact — using the
link invalidates the code, and vice versa)
- Five wrong code attempts delete the row; a sixth attempt, even with the
correct code, fails
- 6th code request within an hour to the same email is silently no-op'd
- After successful sign-in, behaviour identical to today (lands on `/home`,
user record + `lastLogin` updated, etc.)
---
# Testing strategy
### CI (every PR)
- **Vitest** for pure-function unit tests (existing pattern — see
[src/app/email_templates/buildMailHeaders.test.ts](../../src/app/email_templates/buildMailHeaders.test.ts)
added in PR 1).
- **Cypress** for E2E (existing harness at `cypress/e2e/`) covering:
- Code path: submit → poll Mailpit → extract code → enter → land on `/home`
- Link path: submit → poll Mailpit → extract URL → visit → confirm → land on `/home`
- Wrong code: 5 wrong attempts invalidate; 6th with correct code fails
- Expired code: code submitted >10 min after generation fails
- **Mailpit** as a docker-compose service. SMTP at `:1025`, JSON API at
`:8025`. Cypress task helpers (`cy.task`) call the Mailpit JSON API to
extract codes/links from captured emails.
- **Unit tests** for `generateCode()` distribution, `hashCode()`
determinism, rate-limit math (Vitest).
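
The extraction step in the E2E flows reduces to two small parsers over the captured plain-text body. The sketch below assumes the body keeps the "sign-in code:" wording from the template example; the function names are hypothetical, and fetching the message from the Mailpit JSON API is left to the task helper.

```typescript
// Pull the 6-digit code out of a plain-text body; tolerates the "482 911" grouping.
// Relies on the "sign-in code:" label so digits inside the URL are never matched.
export function extractCode(text: string): string | null {
  const match = text.match(/sign-in code:\s*(\d{3})\s?(\d{3})\b/i);
  return match ? `${match[1]}${match[2]}` : null;
}

// Pull the first http(s) URL out of the body.
export function extractLink(text: string): string | null {
  const match = text.match(/https?:\/\/[^\s<>"')\]]+/);
  return match ? match[0] : null;
}
```

Keeping these pure means they can be covered by Vitest as well as used from `cy.task`.
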
### Pre-release (manual gate on template changes)
- Run the template through [mail-tester.com](https://mail-tester.com) before
shipping. Target >9/10. Free, 60 seconds.
### Production
- Eyeball the SES Account dashboard weekly: bounce rate, complaint rate,
reputation status.
- One CloudWatch alarm: bounce rate >2% sustained → SNS → email the team.
- When investigating an incident, temporarily subscribe an SQS queue or
email endpoint to `dev-ses-events` SNS topic to capture events for the
duration. Unsubscribe after.
When volume grows past ~50 sign-ups/day, pick up the backlog ticket for the
SES event-consumer Lambda.
---
# Files touched (summary)
| Phase | File | Change |
|---|---|---|
| PR 1 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Add `X-SES-CONFIGURATION-SET` header; return `messageId` |
| PR 1 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Replace `sendVerificationRequest` logging with try/catch + messageId |
| PR 1 | Terraform `modules/ses/` | Add `renderingFailure` event type + cfg-set name output |
| PR 1 | Deployment env vars | Add `SES_CONFIGURATION_SET=dev-ses-config` |
| IT | DNS — `_dmarc.domna.homes` | Update DMARC TXT value |
| IT | M365 admin | Create `noreply@domna.homes` shared mailbox |
| PR 2 | [src/app/db/schema/users.ts](../../src/app/db/schema/users.ts) | Add `code_hash`, `attempts` columns; new `authRateLimits` table |
| PR 2 | `src/lib/auth/code.ts` (new) | `generateCode()`, `hashCode()` |
| PR 2 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Accept `code` arg; render code prominently in HTML + plaintext |
| PR 2 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Add `CredentialsProvider`; generate+persist code in `sendVerificationRequest`; drop `maxAge` to 600; change `verifyRequest` redirect |
| PR 2 | `src/app/auth/verify-code/page.tsx` (new) | Code-input form + resend button |
| PR 2 | `src/lib/auth/rate-limit.ts` (new) | Per-email-send and per-IP-verify limiters backed by the `authRateLimits` table; the per-code attempt cap uses `verificationToken.attempts` |
| PR 2 | `.github/workflows/test.yml` | Mailpit service + Cypress E2E |
| PR 2 | `docker-compose.yml` | Mailpit local-dev service |
| PR 2 | Deployment env vars | Add `AUTH_CODE_FALLBACK_ENABLED` flag |
---
# Open questions
None blocking. Possibilities to revisit after PR 2 ships:
- Do enterprise customers (paid relationships, formal SLAs) want Microsoft
work-account OAuth as their primary sign-in?
- Does the `.homes` TLD continue to be a deliverability bottleneck after
MAIL FROM + DMARC `p=reject` are in place? If yes, evaluate moving sends
to a `.com` / `.co.uk` we own.
- Does the SES event volume justify the Lambda from the backlog ticket?