mirror of
https://github.com/Hestia-Homes/assessment-model.git
synced 2026-06-08 11:37:25 +00:00
Wires the X-SES-CONFIGURATION-SET header on outbound auth emails so SES bounce/delivery events flow through dev-ses-config to the dev-ses-events SNS topic. Replaces the fire-and-forget "EMAIL MAGIC LINK SENT" log (which fired before the SMTP transaction and swallowed downstream errors) with structured EMAIL_MAGIC_LINK_SUCCESS/_FAILURE logs carrying the Nodemailer messageId, so app-side sends are now correlatable with SES events. Motivated by the Atkins / Sustainable Building UK silent-quarantine incidents where we couldn't tell whether SES had even tried to send. Plan doc at docs/wip/auth-email-code-fallback-plan.md tracks the broader email-code-fallback design that PR 2 will implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
451 lines
18 KiB
Markdown
# Email-auth code-fallback — implementation plan

**Status:** Ready to implement (2026-05-27)
**Author:** Khalim (with Claude, via `/grill-me`)
**Trigger:** Multiple corporate-domain users (Atkins, Sustainable Building UK,
Arup pre-fix) reporting magic-link emails never arriving. One IT department
confirmed direct sender-policy block.

## Problem

[Magic-link sign-in](../../src/app/api/auth/[...nextauth]/authOptions.ts#L74-L95)
emails are being **silently quarantined** by some corporate email gateways
(Microsoft 365 Defender / Mimecast / Proofpoint family). The mail reaches the
recipient's MX, is accepted at SMTP, then disappears post-acceptance — never
hits the inbox or junk folder. SES sees a clean `Delivery` event; the user
sees nothing.

This is invisible to the sender by design. We confirmed it via the
Atkins-adjacent case (Craig Williams, sustainablebuildinguk.com) where the
recipient's IT department reached out, confirmed the block, and manually
allow-listed `noreply@domna.homes`.

### What's been ruled out

- SPF and DKIM are correctly configured (verified via `dig`)
- SES suppression list does not contain the failing users
- Bounce rate is 0.03%, complaint rate 0.01% — reputation is fine
- Link pre-fetching is already defended by
  [/verify/[token]/page.tsx](../../src/app/verify/[token]/page.tsx) (two-step
  consent before token consumption)

### What's contributing but not the root cause

- `noreply@domna.homes` is not a real mailbox → small reputation hit and
  broken `List-Unsubscribe` ([magic_link.ts:54](../../src/app/email_templates/magic_link.ts#L54))
- DMARC is at `p=none` with no `rua=` reporting — zero visibility into
  alignment failures
- `.homes` is a 2014-era new gTLD with thin aggregate reputation at some
  receivers (structural, hard to fully mitigate without changing sender domain)
- App's `EMAIL_MAGIC_LINK_SENT` log fires *before* the SMTP transaction and
  isn't wrapped in try/catch, so it can't distinguish "tried" from "sent"
  ([authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94))
- SES configuration set exists but the app never sets the
  `X-SES-CONFIGURATION-SET` header, so events don't flow through it
  ([magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55))

## Decisions (grilled and locked)

| # | Decision | Rationale |
|---|---|---|
| D1 | Code-first email auth with link as fast-path fallback (option b2) — both delivered in the same email | Code is more filter-friendly than a long opaque URL; SafeLinks can't break a 6-digit code; the code is typed into the browser that started the sign-in, so the session lands on the same device even when the email is read elsewhere |
| D2 | 6-digit numeric code | Industry standard (AWS/Stripe/GitHub/Anthropic Console); easy to read aloud and type on mobile |
| D3 | 10-minute expiry for both code and link (single artifact) | Code's lower entropy demands shorter window; unifies behaviour |
| D4 | Single DB row per verification — extend `verificationToken` with `code_hash` and `attempts` columns | No new table; both paths consume the same row; deletion on first success enforces single-use |
| D5 | Rate limits: 5 attempts/code, 5 codes/email/hour, 20 verify-attempts/IP/10min — stored in Postgres | Postgres avoids Redis dependency; 5 attempts × 5 codes = 25 guesses/email/hour against 10^6 possible codes, so attacker gets ~1-in-40k/hour worst case |
| D6 | NextAuth v4 `CredentialsProvider` alongside existing `EmailProvider` | Cleanest extension point; works with existing JWT session strategy at [authOptions.ts:269](../../src/app/api/auth/[...nextauth]/authOptions.ts#L269); reuses existing user / `signIn` callback |
| D7 | Ship in two PRs: observability first, code-flow second behind feature flag | Lets us measure the improvement; isolates risk |
| D8 | Testing: Mailpit + Cypress in CI, mail-tester.com manual on template changes, SES dashboard for prod | Lean — no paid SaaS at current scale |
| D9 | Skip the SES event-consumer Lambda for now; ticket on backlog | Volume doesn't justify it; SES dashboard + suppression list + customer-success channel are enough for forensics today |
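
The ~1-in-40k figure in D5 follows from the two per-email limits. A minimal sketch of the arithmetic, assuming an attacker targets a single email address and spends every allowed attempt on every allowed code (the constant and function names are illustrative, not from the codebase):

```typescript
// Worst-case odds of guessing a code for one email address within an hour,
// using the D5 limits. Names are illustrative only.
const ATTEMPTS_PER_CODE = 5;
const CODES_PER_EMAIL_PER_HOUR = 5;
const CODE_SPACE = 1_000_000; // 6 decimal digits: 000000..999999

function worstCaseOddsPerHour(): { guesses: number; oneIn: number } {
  const guesses = ATTEMPTS_PER_CODE * CODES_PER_EMAIL_PER_HOUR; // 25
  // Upper bound: each guess is treated as a separate 1/CODE_SPACE chance.
  return { guesses, oneIn: CODE_SPACE / guesses };
}

console.log(worstCaseOddsPerHour()); // { guesses: 25, oneIn: 40000 }
```

The per-IP limit (20 verify-attempts per 10 minutes) is a separate ceiling and is not part of this figure.
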

## Out of scope / deferred

- **Microsoft work-account OAuth** (Azure AD multi-tenant) — would skip email
  entirely for M365 corporate users but locks them to MS SSO; revisit if
  enterprise customers demand it
- **Switching sending domain off `.homes`** — measurable deliverability lift
  but requires re-warmup and brand thinking; revisit if `domna.homes`
  reputation stays a bottleneck after the DMARC + MAIL FROM work
- **SES event-consumer Lambda** — ticketed for the backlog (see planning
  board: "Wire up SES event consumer for email auth forensics")
- **BIMI / VMC, dedicated SES IP, GlockApps placement testing** — overkill at
  current scale
- **Case-sensitivity bug in portfolio invitations** (separate issue) — user
  `Craig.Williams@…` invited with mixed case, then signed in lowercase,
  creating two user records. The lookup at
  [authOptions.ts:117](../../src/app/api/auth/[...nextauth]/authOptions.ts#L117)
  normalises but the *invitation* path doesn't. Track separately — not in
  this plan.

---

# PR 1 — Observability + logging fixes

**Goal:** when the next Atkins-shaped incident happens, we know within minutes
whether SES tried, whether SES accepted, and whether the recipient SMTP
accepted. Today we know none of these.

**Risk:** near-zero. Additive only.

**Effort:** ~half a day.

### Changes

1. **Add `X-SES-CONFIGURATION-SET` header** to outbound mail.

In [magic_link.ts:53-55](../../src/app/email_templates/magic_link.ts#L53-L55):

```ts
headers: {
  ...(process.env.SES_CONFIGURATION_SET && {
    "X-SES-CONFIGURATION-SET": process.env.SES_CONFIGURATION_SET,
  }),
  "List-Unsubscribe": `<mailto:${provider.from}>`,
},
```

Conditional so CI (Mailpit) doesn't need the env var.

2. **Capture and return Nodemailer messageId** from `MagicLinksEmail`.

In [magic_link.ts:39-62](../../src/app/email_templates/magic_link.ts#L39-L62),
change the return type to `Promise<{ messageId: string }>` and return
`{ messageId: result.messageId }` after the `result.rejected` check.

3. **Fix the misleading log** in
[authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94):

```ts
sendVerificationRequest: async ({ identifier, url, provider }) => {
  try {
    const { messageId } = await MagicLinksEmail({ identifier, url, provider });
    console.log("EMAIL_MAGIC_LINK_SUCCESS", {
      email: identifier,
      messageId,
      timestamp: new Date().toISOString(),
    });
  } catch (err) {
    console.error("EMAIL_MAGIC_LINK_FAILURE", {
      email: identifier,
      error: err instanceof Error ? err.message : String(err),
      timestamp: new Date().toISOString(),
    });
    // Rethrow so NextAuth still surfaces the failure to the caller.
    throw err;
  }
},
```

Now the presence of either log line proves "we tried",
`EMAIL_MAGIC_LINK_SUCCESS` proves "SES accepted at SMTP", and `messageId`
correlates to SES events.

4. **Add `SES_CONFIGURATION_SET` env var** to the deployment platform
(Vercel / wherever). Value: `dev-ses-config`.

5. **Terraform — add `renderingFailure` event type and output the cfg-set
name.** In the SES module:

```hcl
resource "aws_ses_event_destination" "sns" {
  # ...
  matching_types = [
    "send",
    "bounce",
    "reject",
    "complaint",
    "delivery",
    "renderingFailure", # ← add
  ]
  # ...
}

output "configuration_set_name" {
  value = aws_ses_configuration_set.this.name
}
```

### Acceptance criteria

- Sending a magic link produces exactly one of: `EMAIL_MAGIC_LINK_SUCCESS`
  (with messageId) or `EMAIL_MAGIC_LINK_FAILURE` (with error). No bare
  `EMAIL MAGIC LINK SENT` lines.
- The `messageId` in app logs matches the `mail.messageId` SES uses (verify
  once via SES console for one test send).
- Terraform plan is clean after the two SES module changes.
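
Changes 1 to 3 of this PR can be sketched together as one testable unit. This is a minimal sketch, not the code in `magic_link.ts`: the `MailTransport` interface, the `buildMailHeaders` signature, and `sendMagicLink` are assumptions made for illustration, and `SendResult` models only the two fields of Nodemailer's send result that the plan relies on. Injecting the transport lets a unit test substitute a fake for SMTP.

```typescript
// Subset of Nodemailer's sendMail result that this sketch relies on.
interface SendResult {
  messageId: string;
  rejected: string[];
}

// Hypothetical narrow transport interface, so tests can pass in a fake.
interface MailTransport {
  sendMail(message: {
    to: string;
    from: string;
    subject: string;
    text: string;
    headers: Record<string, string>;
  }): Promise<SendResult>;
}

function buildMailHeaders(from: string, configurationSet?: string): Record<string, string> {
  return {
    // Only emit the SES header when a configuration set is configured,
    // so CI (Mailpit) works without the env var.
    ...(configurationSet ? { "X-SES-CONFIGURATION-SET": configurationSet } : {}),
    "List-Unsubscribe": `<mailto:${from}>`,
  };
}

async function sendMagicLink(
  transport: MailTransport,
  args: { identifier: string; url: string; from: string; configurationSet?: string },
): Promise<{ messageId: string }> {
  const result = await transport.sendMail({
    to: args.identifier,
    from: args.from,
    subject: "Sign in",
    text: `Sign in: ${args.url}`,
    headers: buildMailHeaders(args.from, args.configurationSet),
  });
  // Treat any rejected recipient as a failure so the caller logs FAILURE, not SUCCESS.
  if (result.rejected.length > 0) {
    throw new Error(`Email rejected for: ${result.rejected.join(", ")}`);
  }
  return { messageId: result.messageId };
}
```

The caller would wrap `sendMagicLink` in the try/catch from change 3, so a thrown rejection produces `EMAIL_MAGIC_LINK_FAILURE` and a returned `messageId` produces `EMAIL_MAGIC_LINK_SUCCESS`.
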

---

# IT track (parallel, out-of-band)

These happen in parallel to PR 1; no engineer dependency.

1. **Create `noreply@domna.homes` as a real shared mailbox** in M365.
Configure an auto-reply pointing to a monitored support address. Set
`Reply-To:` on outgoing mail in PR 2 once the mailbox exists, OR change
`EMAIL_FROM` to something like `accounts@domna.homes` and make *that*
monitored. Update `EMAIL_FROM` env var when ready.

2. **Update the DMARC TXT record** at `_dmarc.domna.homes` from:

```
v=DMARC1; p=none;
```

to:

```
v=DMARC1; p=none; rua=mailto:<reporting-address>; fo=1; adkim=r; aspf=r; pct=100;
```

Recommend Postmark DMARC Monitor (free) for the reporting address — they
provide a unique mailbox like `<uuid>@inbound-smtp.dmarc.postmarkapp.com`
and parse the XML into readable reports.

3. **Progression** (after ~4 weeks of clean reports): bump to
`p=quarantine; pct=10;`, then `p=quarantine; pct=100;`, then
`p=reject; pct=100;`. Don't skip the ramp — HubSpot / Outlook sends need
to be confirmed aligned first.
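
Because each ramp step means hand-editing a TXT value, a small check before publishing can catch a malformed record. A minimal sketch, assuming a hypothetical `parseDmarc` helper that is not part of the repo; `reports@example.com` stands in for the real `<reporting-address>`:

```typescript
// Parse a DMARC TXT value into tag/value pairs so it can be checked before publishing.
function parseDmarc(record: string): Map<string, string> {
  const tags = new Map<string, string>();
  for (const part of record.split(";")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // tolerate the trailing ";"
    const eq = trimmed.indexOf("=");
    if (eq === -1) throw new Error(`Malformed DMARC tag: "${trimmed}"`);
    tags.set(trimmed.slice(0, eq).trim().toLowerCase(), trimmed.slice(eq + 1).trim());
  }
  if (tags.get("v") !== "DMARC1") throw new Error("Record must declare v=DMARC1");
  return tags;
}

// Placeholder reporting address; substitute the real one.
const proposed = parseDmarc(
  "v=DMARC1; p=none; rua=mailto:reports@example.com; fo=1; adkim=r; aspf=r; pct=100;",
);
console.log(proposed.get("p"), proposed.get("pct"), proposed.has("rua"));
```

This checks syntax only. Whether HubSpot and Outlook sends are aligned still has to come from the aggregate reports.
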

### Nice-to-have (do when convenient, not blocking)

- **Custom MAIL FROM domain** (`mail.domna.homes`) — adds SPF alignment via
*our* domain instead of `*.amazonses.com`. In Terraform:

```hcl
resource "aws_ses_domain_mail_from" "this" {
  domain           = aws_ses_domain_identity.this.domain
  mail_from_domain = "mail.${var.domain_name}"
}
```

Plus the corresponding `MX 10 feedback-smtp.eu-west-2.amazonses.com` and
`TXT v=spf1 include:amazonses.com -all` at `mail.domna.homes`. Modest but
real deliverability win.

---

# PR 2 — Code fallback flow (behind feature flag)

**Goal:** every magic-link email now also contains a 6-digit code. The
post-submit UX leads with code entry; the link still works as a fast path.

**Risk:** medium — changes the user-facing auth flow. Feature flag mitigates.

**Effort:** ~2-3 days.

### Schema migration

Add to the existing `verificationToken` table in
[src/app/db/schema/users.ts](../../src/app/db/schema/users.ts):

```ts
import { integer, pgTable, primaryKey, text, timestamp } from "drizzle-orm/pg-core";

export const verificationTokens = pgTable("verificationToken", {
  identifier: text("identifier").notNull(),
  token: text("token").notNull(),
  expires: timestamp("expires", { mode: "date" }).notNull(),
  codeHash: text("code_hash"), // ← new (nullable)
  attempts: integer("attempts").notNull().default(0), // ← new
});
```

Also new table for rate-limit state:

```ts
export const authRateLimits = pgTable("authRateLimits", {
  scope: text("scope").notNull(), // "email-send" | "ip-verify"
  key: text("key").notNull(), // email or IP
  count: integer("count").notNull().default(0),
  windowStart: timestamp("window_start").notNull(),
}, (t) => ({
  // One row per (scope, key) pair.
  pk: primaryKey({ columns: [t.scope, t.key] }),
}));
```

Drizzle migration is additive — existing rows get `code_hash = NULL`,
`attempts = 0`. Pre-deploy magic-link emails (those without codes) continue
to work via the link path; users on the code-entry page will fail and need
to resend. Acceptable transient.
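
The `count` and `windowStart` columns of `authRateLimits` suggest a fixed-window limiter. A minimal in-memory sketch of that logic, assuming a fixed window (the plan does not specify the windowing scheme) and using a hypothetical `FixedWindowLimiter` class as a stand-in for the table. A Postgres version would need to do the read and the increment atomically, for example in a single upsert.

```typescript
type Scope = "email-send" | "ip-verify";

interface RateLimitRow {
  count: number;
  windowStart: Date;
}

// In-memory stand-in for the authRateLimits table: one row per (scope, key).
class FixedWindowLimiter {
  private readonly rows = new Map<string, RateLimitRow>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the action if allowed; false once the limit is reached.
  hit(scope: Scope, key: string, now: Date = new Date()): boolean {
    const id = `${scope}:${key}`;
    const row = this.rows.get(id);
    if (!row || now.getTime() - row.windowStart.getTime() >= this.windowMs) {
      // No row yet, or the previous window has elapsed: start a fresh window.
      this.rows.set(id, { count: 1, windowStart: now });
      return true;
    }
    if (row.count >= this.limit) return false;
    row.count += 1;
    return true;
  }
}

// D5: at most 5 codes per email per hour.
const emailSendLimiter = new FixedWindowLimiter(5, 60 * 60 * 1000);
```

With these settings the first five `emailSendLimiter.hit("email-send", email)` calls in an hour return `true` and the sixth returns `false`, which is the "silently no-op'd" case in the acceptance criteria.
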

### Code utility — `src/lib/auth/code.ts`

```ts
import crypto from "crypto";

// Uniformly random 6-digit code, zero-padded ("000000" to "999999").
export function generateCode(): string {
  return crypto.randomInt(0, 1_000_000).toString().padStart(6, "0");
}

// Hex SHA-256 of code + secret; deterministic, so a row can be looked up by hash.
export function hashCode(code: string, secret: string): string {
  return crypto.createHash("sha256").update(code + secret).digest("hex");
}
```

`crypto.randomInt` not `Math.random` — must be CSPRNG.
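
One possible variation, not something this plan specifies: key the hash with HMAC rather than concatenating the secret, and compare digests in constant time when a stored hash is checked directly. A minimal sketch using Node's built-in `crypto`; `hmacCode` and `verifyCode` are hypothetical names.

```typescript
import { createHmac, randomInt, timingSafeEqual } from "node:crypto";

// Same generator as the plan: CSPRNG-backed, zero-padded to 6 digits.
function generateCode(): string {
  return randomInt(0, 1_000_000).toString().padStart(6, "0");
}

// HMAC-SHA256 keyed with the secret, hex-encoded (64 characters).
function hmacCode(code: string, secret: string): string {
  return createHmac("sha256", secret).update(code).digest("hex");
}

// Constant-time check of a submitted code against a stored hex digest.
function verifyCode(submitted: string, storedHash: string, secret: string): boolean {
  const candidate = Buffer.from(hmacCode(submitted, secret), "hex");
  const stored = Buffer.from(storedHash, "hex");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return candidate.length === stored.length && timingSafeEqual(candidate, stored);
}

const exampleCode = generateCode();
const exampleHash = hmacCode(exampleCode, "example-secret");
console.log(verifyCode(exampleCode, exampleHash, "example-secret")); // true
```

Either construction is deterministic, so the lookup by `(identifier, code_hash)` described later works unchanged.
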

### `sendVerificationRequest` update

In [authOptions.ts:86-94](../../src/app/api/auth/[...nextauth]/authOptions.ts#L86-L94):

1. Generate code via `generateCode()`
2. Hash with `NEXTAUTH_SECRET`
3. `UPDATE verificationToken SET code_hash = ?, attempts = 0 WHERE identifier = ? AND token = ?`
4. Pass `code` (plaintext) into `MagicLinksEmail`
5. Existing logging unchanged

### Email template update

[magic_link.ts](../../src/app/email_templates/magic_link.ts) takes a new
`code` argument. The HTML and plain-text bodies lead with:

> **Your sign-in code: 482 911**
>
> Enter this at `ara.domna.homes/auth/verify-code`
>
> Or click here to skip the code step: [Sign in to Ara] →

Code rendered large and monospace; the link is a smaller secondary button.

### NextAuth providers — add `CredentialsProvider`

In [authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts):

```ts
import CredentialsProvider from "next-auth/providers/credentials";

CredentialsProvider({
  id: "email-code",
  name: "Email Code",
  credentials: {
    email: { type: "text" },
    code: { type: "text" },
  },
  async authorize(credentials) {
    if (!credentials?.email || !credentials?.code) return null;
    const email = credentials.email.toLowerCase();
    const hashed = hashCode(credentials.code, process.env.NEXTAUTH_SECRET!);

    // Rate-limit check before any of this; on exceed, return null silently
    // Look up the row by (identifier, code_hash, not-expired)
    // If found: check attempts < 5, delete row, return user
    // If found but attempts >= 5: delete row, return null
    // If not found: increment attempts on the row matching identifier (if any), return null
    return null; // placeholder until the steps above are implemented
  },
}),
```

Verify the `signIn` callback at
[authOptions.ts:114-206](../../src/app/api/auth/[...nextauth]/authOptions.ts#L114-L206)
handles `account.type === "credentials"` cleanly (it should fall through the
OAuth branches at lines 142 and 180, but worth a unit test).
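
The lookup steps listed in the `authorize` comments can be exercised without a database. A minimal sketch, assuming an in-memory array in place of the `verificationToken` table; `TokenRow` and `consumeCode` are hypothetical names, and the rate-limit check and user lookup are left out. A real implementation would need the increment and the delete to be atomic in Postgres.

```typescript
interface TokenRow {
  identifier: string;
  codeHash: string | null;
  attempts: number;
  expires: Date;
}

const MAX_ATTEMPTS = 5;

// `rows` stands in for the verificationToken table and is mutated in place.
function consumeCode(rows: TokenRow[], email: string, codeHash: string, now: Date): "ok" | "rejected" {
  const identifier = email.toLowerCase();
  const match = rows.find(
    (r) =>
      r.identifier === identifier &&
      r.codeHash === codeHash &&
      r.expires.getTime() > now.getTime(),
  );
  if (match) {
    // Single use: the row is deleted whether or not the attempt budget is left.
    rows.splice(rows.indexOf(match), 1);
    return match.attempts < MAX_ATTEMPTS ? "ok" : "rejected";
  }
  // Wrong or expired code: count the failure against this identifier's rows.
  for (const r of rows) {
    if (r.identifier === identifier) r.attempts += 1;
  }
  return "rejected";
}
```

In this sketch five wrong submissions leave `attempts` at 5, so a sixth submission with the correct code matches the row, deletes it, and is still rejected, as in the PR 2 acceptance criteria.
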

### Drop magic-link maxAge

Change [authOptions.ts:84](../../src/app/api/auth/[...nextauth]/authOptions.ts#L84)
from `60 * 60` to `60 * 10`. Code and link now share the 10-min window.

### New page — `src/app/auth/verify-code/page.tsx`

Code-input form. POSTs `{email, code}` to NextAuth credentials endpoint via
`signIn("email-code", { email, code, redirect: false })`. Includes a
"Resend code" button (rate-limited via the new `authRateLimits` table).

### Redirect change

[authOptions.ts:103](../../src/app/api/auth/[...nextauth]/authOptions.ts#L103):
change `verifyRequest: "/auth/verify-request"` to
`verifyRequest: "/auth/verify-code?email=<encoded>"` so post-submit lands on
the code-entry page with the email pre-filled.

### Existing `/verify/[token]` page

Unchanged behaviour — the link path still works. Optionally add a small
"Type the code instead" link to the code-entry page for the rare user who
prefers it.

### Feature flag

Env var `AUTH_CODE_FALLBACK_ENABLED=true|false`. When `false`, skip the code
generation in `sendVerificationRequest` and redirect to the old
`/auth/verify-request` page. Lets us canary and roll back via env-var flip.
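
A minimal sketch of how the flag could gate both behaviours, assuming hypothetical helpers `isCodeFallbackEnabled` and `verifyRequestPath` (neither exists in the codebase). Passing the environment in as an argument keeps the helpers testable.

```typescript
type Env = Record<string, string | undefined>;

// Only the literal string "true" enables the flow; unset or any other value is off.
function isCodeFallbackEnabled(env: Env): boolean {
  return env.AUTH_CODE_FALLBACK_ENABLED === "true";
}

// Which page the user lands on after submitting their email.
function verifyRequestPath(env: Env): string {
  return isCodeFallbackEnabled(env) ? "/auth/verify-code" : "/auth/verify-request";
}
```

In the app these would be called with `process.env`. Treating an unset variable as off means a missing env var falls back to today's behaviour.
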

### Acceptance criteria

- Submitting email at `/` lands on `/auth/verify-code` (when flag enabled)
- The email body contains a clear 6-digit code AND a working link
- Typing the code on the verify-code page signs the user in
- Clicking the link still signs the user in (single artifact — using the
  link invalidates the code, and vice versa)
- 5 wrong code attempts invalidate the row; a 6th attempt with the correct
  code fails
- A 6th code request within an hour to the same email is silently no-op'd
- After successful sign-in, behaviour is identical to today (lands on `/home`,
  user record + `lastLogin` updated, etc.)

---

# Testing strategy

### CI (every PR)

- **Vitest** for pure-function unit tests (existing pattern — see
  [src/app/email_templates/buildMailHeaders.test.ts](../../src/app/email_templates/buildMailHeaders.test.ts)
  added in PR 1).
- **Cypress** for E2E (existing harness at `cypress/e2e/`) covering:
  - Code path: submit → poll Mailpit → extract code → enter → land on `/home`
  - Link path: submit → poll Mailpit → extract URL → visit → confirm → land on `/home`
  - Wrong code: 5 wrong attempts invalidate; 6th with correct code fails
  - Expired code: code submitted >10 min after generation fails
- **Mailpit** as a docker-compose service. SMTP at `:1025`, JSON API at
  `:8025`. Cypress task helpers (`cy.task`) call the Mailpit JSON API to
  extract codes/links from captured emails.
- **Unit tests** for `generateCode()` distribution, `hashCode()`
determinism, rate-limit math (Vitest).
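
The "extract code" step in the Cypress flow is a pure function and can be unit-tested on its own. A minimal sketch, assuming the plain-text body keeps the "Your sign-in code:" wording from the template section and that the code may be rendered grouped as `482 911`; `extractCode` is a hypothetical helper, and fetching the message from Mailpit is not shown.

```typescript
// Pull the 6-digit code out of a captured plain-text email body.
// Accepts "482911" as well as the grouped "482 911" rendering.
function extractCode(body: string): string | null {
  const match = body.match(/sign-in code:\s*(\d{3})[ \u00a0]?(\d{3})(?!\d)/i);
  return match ? `${match[1]}${match[2]}` : null;
}

console.log(extractCode("Your sign-in code: 482 911\n\nEnter this at ara.domna.homes/auth/verify-code"));
// prints "482911"
```

If the template does group the digits, the verify-code form would likewise need to strip whitespace before hashing, or a pasted `482 911` will not match.
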

### Pre-release (manual gate on template changes)

- Run the template through [mail-tester.com](https://mail-tester.com) before
  shipping. Target >9/10. Free, 60 seconds.

### Production

- Eyeball the SES Account dashboard weekly: bounce rate, complaint rate,
  reputation status.
- One CloudWatch alarm: bounce rate >2% sustained → SNS → email the team.
- When investigating an incident, temporarily subscribe an SQS queue or
  email endpoint to `dev-ses-events` SNS topic to capture events for the
  duration. Unsubscribe after.

When volume grows past ~50 sign-ups/day, pick up the backlog ticket for the
SES event-consumer Lambda.

---

# Files touched (summary)

| Phase | File | Change |
|---|---|---|
| PR 1 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Add `X-SES-CONFIGURATION-SET` header; return `messageId` |
| PR 1 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Replace `sendVerificationRequest` logging with try/catch + messageId |
| PR 1 | Terraform `modules/ses/` | Add `renderingFailure` event type + cfg-set name output |
| PR 1 | Deployment env vars | Add `SES_CONFIGURATION_SET=dev-ses-config` |
| IT | DNS — `_dmarc.domna.homes` | Update DMARC TXT value |
| IT | M365 admin | Create `noreply@domna.homes` shared mailbox |
| PR 2 | [src/app/db/schema/users.ts](../../src/app/db/schema/users.ts) | Add `code_hash`, `attempts` columns; new `authRateLimits` table |
| PR 2 | `src/lib/auth/code.ts` (new) | `generateCode()`, `hashCode()` |
| PR 2 | [src/app/email_templates/magic_link.ts](../../src/app/email_templates/magic_link.ts) | Accept `code` arg; render code prominently in HTML + plaintext |
| PR 2 | [src/app/api/auth/[...nextauth]/authOptions.ts](../../src/app/api/auth/[...nextauth]/authOptions.ts) | Add `CredentialsProvider`; generate+persist code in `sendVerificationRequest`; drop `maxAge` to 600; change `verifyRequest` redirect |
| PR 2 | `src/app/auth/verify-code/page.tsx` (new) | Code-input form + resend button |
| PR 2 | `src/lib/auth/rate-limit.ts` (new) | Per-email-send, per-code-attempt, per-IP-verify limiters backed by `authRateLimits` table |
| PR 2 | `.github/workflows/test.yml` | Mailpit service + Cypress E2E |
| PR 2 | `docker-compose.yml` | Mailpit local-dev service |
| PR 2 | Deployment env vars | Add `AUTH_CODE_FALLBACK_ENABLED` flag |

---

# Open questions

None blocking. Possibilities to revisit after PR 2 ships:

- Do enterprise customers (paid relationships, formal SLAs) want Microsoft
  work-account OAuth as their primary sign-in?
- Does the `.homes` TLD continue to be a deliverability bottleneck after
  MAIL FROM + DMARC `p=reject` are in place? If yes, evaluate moving sends
  to a `.com` / `.co.uk` we own.
- Does the SES event volume justify the Lambda from the backlog ticket?