mirror of https://github.com/Hestia-Homes/assessment-model.git synced 2026-06-08 11:37:25 +00:00

Khalim Conn-Kowlessar 9c569f5584 Add SES observability foundation for email auth (PR 1 of code-fallback)

Wires the X-SES-CONFIGURATION-SET header on outbound auth emails so SES
bounce/delivery events flow through dev-ses-config to the dev-ses-events
SNS topic. Replaces the fire-and-forget "EMAIL MAGIC LINK SENT" log
(which fired before the SMTP transaction and swallowed downstream errors)
with structured EMAIL_MAGIC_LINK_SUCCESS/_FAILURE logs carrying the
Nodemailer messageId, so app-side sends are now correlatable with SES
events.

Motivated by the Atkins / Sustainable Building UK silent-quarantine
incidents where we couldn't tell whether SES had even tried to send.

Plan doc at docs/wip/auth-email-code-fallback-plan.md tracks the
broader email-code-fallback design that PR 2 will implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 13:15:25 +00:00


Email-auth code-fallback — implementation plan

Status: Ready to implement (2026-05-27)

Author: Khalim (with Claude, via /grill-me)

Trigger: Multiple corporate-domain users (Atkins, Sustainable Building UK, Arup pre-fix) reporting magic-link emails never arriving. One IT department confirmed direct sender-policy block.

Problem

Magic-link sign-in emails are being silently quarantined by some corporate email gateways (Microsoft 365 Defender / Mimecast / Proofpoint family). The mail reaches the recipient's MX, is accepted at SMTP, then disappears post-acceptance — never hits the inbox or junk folder. SES sees a clean Delivery event; the user sees nothing.

This is invisible to the sender by design. We confirmed it via the Atkins-adjacent case (Craig Williams, sustainablebuildinguk.com), where the recipient's IT department reached out, confirmed the block, and manually allow-listed noreply@domna.homes.

What's been ruled out

SPF, DKIM are correctly configured (verified via dig)
SES suppression list does not contain the failing users
Bounce rate is 0.03%, complaint rate 0.01% — reputation is fine
Link pre-fetching is already defended by /verify/[token]/page.tsx (two-step consent before token consumption)

What's contributing but not the root cause

noreply@domna.homes is not a real mailbox → small reputation hit and broken List-Unsubscribe (magic_link.ts:54)
DMARC is at p=none with no rua= reporting — zero visibility into alignment failures
.homes is a 2014-era new gTLD with thin aggregate reputation at some receivers (structural, hard to fully mitigate without changing sender domain)
The app's "EMAIL MAGIC LINK SENT" log fires before the SMTP transaction and isn't wrapped in try/catch, so it can't distinguish "tried" from "sent" (authOptions.ts:86-94)
SES configuration set exists but the app never sets the X-SES-CONFIGURATION-SET header, so events don't flow through it (magic_link.ts:53-55)

Decisions (grilled and locked)

| # | Decision | Rationale |
|---|----------|-----------|
| D1 | Code-first email auth with link as fast-path fallback (option b2); both delivered in the same email | Code is more filter-friendly than a long opaque URL; SafeLinks can't break a 6-digit code; cross-device use stays same-device, since a code read on a phone still completes sign-in in the browser that requested it |
| D2 | 6-digit numeric code | Industry standard (AWS/Stripe/GitHub/Anthropic Console); easy to read aloud and type on mobile |
| D3 | 10-minute expiry for both code and link (single artifact) | Code's lower entropy demands shorter window; unifies behaviour |
| D4 | Single DB row per verification: extend `verificationToken` with `code_hash` and `attempts` columns | No new table; both paths consume the same row; deletion on first success enforces single-use |
| D5 | Rate limits: 5 attempts/code, 5 codes/email/hour, 20 verify-attempts/IP/10min, stored in Postgres | Postgres avoids Redis dependency; numbers give attacker ~1-in-40k/hour worst case |
| D6 | NextAuth v4 `CredentialsProvider` alongside existing `EmailProvider` | Cleanest extension point; works with existing JWT session strategy at authOptions.ts:269; reuses existing user / `signIn` callback |
| D7 | Ship in two PRs: observability first, code-flow second behind feature flag | Lets us measure the improvement; isolates risk |
| D8 | Testing: Mailpit + Playwright in CI, mail-tester.com manual on template changes, SES dashboard for prod | Lean: no paid SaaS at current scale |
| D9 | Skip the SES event-consumer Lambda for now; ticket on backlog | Volume doesn't justify it; SES dashboard + suppression list + customer-success channel are enough for forensics today |
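The "~1-in-40k/hour" figure in D5 follows from the per-email limits alone. A minimal sketch of the arithmetic, assuming each guess is independent and ignoring the per-IP cap; the constant and function names are illustrative and not taken from the codebase:

```typescript
// Limits from D5. Names here are assumptions made for this sketch.
const ATTEMPTS_PER_CODE = 5;
const CODES_PER_EMAIL_PER_HOUR = 5;
const CODE_SPACE = 1_000_000; // 6 digits: 000000..999999

/** Upper bound on the chance of guessing a valid code for one email within an hour. */
function worstCaseHourlySuccess(
  attemptsPerCode: number = ATTEMPTS_PER_CODE,
  codesPerHour: number = CODES_PER_EMAIL_PER_HOUR,
  codeSpace: number = CODE_SPACE,
): number {
  // At most attemptsPerCode * codesPerHour guesses, each hitting with 1 / codeSpace.
  return (attemptsPerCode * codesPerHour) / codeSpace;
}

const p = worstCaseHourlySuccess();
console.log(`about 1 in ${Math.round(1 / p)} per email per hour`);
// prints "about 1 in 40000 per email per hour"
```

Changing either limit scales the bound linearly, so doubling the attempts per code would halve the odds against an attacker.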

Out of scope / deferred

Microsoft work-account OAuth (Azure AD multi-tenant) — would skip email entirely for M365 corporate users but locks them to MS SSO; revisit if enterprise customers demand it
Switching sending domain off .homes — measurable deliverability lift but requires re-warmup and brand thinking; revisit if domna.homes reputation stays a bottleneck after the DMARC + MAIL FROM work
SES event-consumer Lambda — ticketed for the backlog (see planning board: "Wire up SES event consumer for email auth forensics")
BIMI / VMC, dedicated SES IP, GlockApps placement testing — overkill at current scale
Case-sensitivity bug in portfolio invitations (separate issue) — user Craig.Williams@… invited with mixed case, then signed in lowercase, creating two user records. The lookup at authOptions.ts:117 normalises but the invitation path doesn't. Track separately — not in this plan.

PR 1 — Observability + logging fixes

Goal: when the next Atkins-shaped incident happens, we know within minutes whether SES tried, whether SES accepted, and whether the recipient SMTP accepted. Today we know none of these.

Risk: near-zero. Additive only.

Effort: ~half a day.

Changes

Add X-SES-CONFIGURATION-SET header to outbound mail.

In magic_link.ts:53-55:

```
headers: {
  ...(process.env.SES_CONFIGURATION_SET && {
    "X-SES-CONFIGURATION-SET": process.env.SES_CONFIGURATION_SET,
  }),
  "List-Unsubscribe": `<mailto:${provider.from}>`,
},
```

Conditional so CI (Mailpit) doesn't need the env var.

Capture and return Nodemailer messageId from MagicLinksEmail.

In magic_link.ts:39-62, change the return type to Promise<{ messageId: string }> and return { messageId: result.messageId } after the result.rejected check.

Fix the misleading log in authOptions.ts:86-94:

```
sendVerificationRequest: async ({ identifier, url, provider }) => {
  try {
    const { messageId } = await MagicLinksEmail({ identifier, url, provider });
    console.log("EMAIL_MAGIC_LINK_SUCCESS", {
      email: identifier,
      messageId,
      timestamp: new Date().toISOString(),
    });
  } catch (err) {
    console.error("EMAIL_MAGIC_LINK_FAILURE", {
      email: identifier,
      error: err instanceof Error ? err.message : String(err),
      timestamp: new Date().toISOString(),
    });
    throw err;
  }
},
```

With this change, the presence of either log line proves "we tried", EMAIL_MAGIC_LINK_SUCCESS proves "SES accepted at SMTP", and messageId correlates the send to SES events.

Add SES_CONFIGURATION_SET env var to the deployment platform (Vercel / wherever). Value: dev-ses-config.

Terraform — add renderingFailure event type and output the cfg-set name. In the SES module:

```
resource "aws_ses_event_destination" "sns" {
  # ...
  matching_types = [
    "send",
    "bounce",
    "reject",
    "complaint",
    "delivery",
    "renderingFailure",   # ← add
  ]
  # ...
}

output "configuration_set_name" {
  value = aws_ses_configuration_set.this.name
}
```
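The first two changes above (the conditional header and the returned messageId) could be sketched as follows. This is an illustration under assumptions, not the actual module: the signature of `buildMailHeaders` is guessed, `MinimalTransport` stands in for the Nodemailer transport, and `sendAndCaptureMessageId` is a hypothetical name for the core of `MagicLinksEmail`.

```typescript
type MailHeaders = Record<string, string>;

/** Adds the SES configuration-set header only when a set name is configured. */
function buildMailHeaders(from: string, configurationSet?: string): MailHeaders {
  const headers: MailHeaders = { "List-Unsubscribe": `<mailto:${from}>` };
  if (configurationSet) {
    headers["X-SES-CONFIGURATION-SET"] = configurationSet;
  }
  return headers;
}

/** The subset of a Nodemailer-style send result that this sketch relies on. */
interface SendResult {
  messageId: string;
  rejected: string[];
}

/** Assumed stand-in for the real transport so the sketch runs without SMTP. */
interface MinimalTransport {
  sendMail(mail: { to: string; from: string; headers: MailHeaders }): Promise<SendResult>;
}

/** Sends one message and returns its messageId; throws if any recipient was rejected. */
async function sendAndCaptureMessageId(
  transport: MinimalTransport,
  to: string,
  from: string,
  configurationSet?: string,
): Promise<{ messageId: string }> {
  const result = await transport.sendMail({
    to,
    from,
    headers: buildMailHeaders(from, configurationSet),
  });
  if (result.rejected.length > 0) {
    throw new Error(`Email rejected for: ${result.rejected.join(", ")}`);
  }
  return { messageId: result.messageId };
}
```

Keeping the header logic in a pure function means it can be unit-tested without a mail server, and leaving the variable unset in CI simply omits the header.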

Acceptance criteria

Sending a magic link produces exactly one of: EMAIL_MAGIC_LINK_SUCCESS (with messageId) or EMAIL_MAGIC_LINK_FAILURE (with error). No bare EMAIL MAGIC LINK SENT lines.
The messageId in app logs matches the mail.messageId SES uses (verify once via SES console for one test send).
Terraform plan is clean after the two SES module changes.

IT track (parallel, out-of-band)

These happen in parallel to PR 1; no engineer dependency.

Create noreply@domna.homes as a real shared mailbox in M365. Configure an auto-reply pointing to a monitored support address. Set Reply-To: on outgoing mail in PR 2 once the mailbox exists, OR change EMAIL_FROM to something like accounts@domna.homes and make that monitored. Update EMAIL_FROM env var when ready.
Update the DMARC TXT record at _dmarc.domna.homes from:
```
v=DMARC1; p=none;
```
to:
```
v=DMARC1; p=none; rua=mailto:<reporting-address>; fo=1; adkim=r; aspf=r; pct=100;
```
Recommend Postmark DMARC Monitor (free) for the reporting address — they provide a unique mailbox like <uuid>@inbound-smtp.dmarc.postmarkapp.com and parse the XML into readable reports.
Progression (after ~4 weeks of clean reports): bump to p=quarantine; pct=10;, then p=quarantine; pct=100;, then p=reject; pct=100;. Don't skip the ramp — HubSpot / Outlook sends need to be confirmed aligned first.
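To check a candidate DMARC value before handing it to IT, a small parser is enough. The sketch below is a rough illustration rather than a full RFC 7489 validator; `parseDmarcRecord` and `dmarcVisibilityGaps` are hypothetical helper names, and the reporting address used in the usage note is a placeholder.

```typescript
/** Parses a DMARC TXT value into a tag -> value map (tag names lower-cased). */
function parseDmarcRecord(record: string): Map<string, string> {
  const tags = new Map<string, string>();
  for (const part of record.split(";")) {
    const trimmed = part.trim();
    if (trimmed === "") continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue; // malformed fragments are ignored in this sketch
    tags.set(trimmed.slice(0, eq).trim().toLowerCase(), trimmed.slice(eq + 1).trim());
  }
  return tags;
}

/** Lists the gaps this plan cares about; an empty array means none were found. */
function dmarcVisibilityGaps(record: string): string[] {
  const tags = parseDmarcRecord(record);
  const gaps: string[] = [];
  if (tags.get("v") !== "DMARC1") gaps.push("missing v=DMARC1");
  if (!tags.has("p")) gaps.push("no policy (p=) tag");
  if (!tags.has("rua")) gaps.push("no rua= aggregate reporting address");
  return gaps;
}
```

Run against the current value `v=DMARC1; p=none;`, this reports only the missing `rua=` tag; a value that adds `rua=mailto:dmarc@example.com` (placeholder address) reports no gaps.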

Nice-to-have (do when convenient, not blocking)

Custom MAIL FROM domain (mail.domna.homes) — adds SPF alignment via our domain instead of *.amazonses.com. In Terraform:
```
resource "aws_ses_domain_mail_from" "this" {
  domain           = aws_ses_domain_identity.this.domain
  mail_from_domain = "mail.${var.domain_name}"
}
```
Plus the corresponding MX 10 feedback-smtp.eu-west-2.amazonses.com and TXT v=spf1 include:amazonses.com -all at mail.domna.homes. Modest but real deliverability win.

PR 2 — Code fallback flow (behind feature flag)

Goal: every magic-link email now also contains a 6-digit code. The post-submit UX leads with code entry; the link still works as a fast path.

Risk: medium — changes the user-facing auth flow. Feature flag mitigates.

Effort: ~2-3 days.

Schema migration

Add to the existing verificationToken table in src/app/db/schema/users.ts:

```
export const verificationTokens = pgTable("verificationToken", {
  identifier: text("identifier").notNull(),
  token: text("token").notNull(),
  expires: timestamp("expires", { mode: "date" }).notNull(),
  codeHash: text("code_hash"),                              // ← new (nullable)
  attempts: integer("attempts").notNull().default(0),       // ← new
});
```

Also new table for rate-limit state:

```
export const authRateLimits = pgTable("authRateLimits", {
  scope: text("scope").notNull(),         // "email-send" | "ip-verify"
  key: text("key").notNull(),             // email or IP
  count: integer("count").notNull().default(0),
  windowStart: timestamp("window_start").notNull(),
}, (t) => ({
  pk: primaryKey({ columns: [t.scope, t.key] }),
}));
```

Drizzle migration is additive: existing rows get code_hash = NULL and attempts = 0. Magic-link emails sent before the deploy (which carry no code) continue to work via the link path; users who reach the code-entry page from one of those emails have no code to enter and must resend. Acceptable as a transient.
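The rate-limit table stores one counter and one window start per (scope, key), which suggests a fixed-window limiter. A minimal sketch of that logic as a pure function, assuming fixed windows are the intended scheme; `applyFixedWindow` and its types are hypothetical, and a real implementation would need to run the read and write atomically in Postgres.

```typescript
interface RateLimitRow {
  count: number;
  windowStart: Date;
}

interface RateLimitDecision {
  allowed: boolean;
  next: RateLimitRow; // the row state to persist after this request
}

/** Decides whether one more request fits in the current window. */
function applyFixedWindow(
  row: RateLimitRow | undefined,
  limit: number,
  windowMs: number,
  now: Date,
): RateLimitDecision {
  // No row yet, or the window has elapsed: start a fresh window with this request.
  if (!row || now.getTime() - row.windowStart.getTime() >= windowMs) {
    return { allowed: true, next: { count: 1, windowStart: now } };
  }
  // Window is live and already full: deny and leave the row unchanged.
  if (row.count >= limit) {
    return { allowed: false, next: row };
  }
  // Window is live with room left: count this request.
  return { allowed: true, next: { count: row.count + 1, windowStart: row.windowStart } };
}
```

With `limit = 5` and a one-hour window this reproduces the "5 codes/email/hour" rule: the sixth call inside the hour is denied, and the first call after the hour opens a new window.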

Code utility — `src/lib/auth/code.ts`

```
import crypto from "crypto";

export function generateCode(): string {
  return crypto.randomInt(0, 1_000_000).toString().padStart(6, "0");
}

export function hashCode(code: string, secret: string): string {
  return crypto.createHash("sha256").update(code + secret).digest("hex");
}
```

Use crypto.randomInt, not Math.random: the code must come from a CSPRNG.
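A short round trip shows how the two helpers fit together. The sketch repeats `generateCode` and `hashCode` so it runs on its own, and adds `codeMatches`, a hypothetical helper that is only relevant if the hash comparison happens in application code rather than in a SQL WHERE clause.

```typescript
import { createHash, randomInt, timingSafeEqual } from "node:crypto";

function generateCode(): string {
  // randomInt's upper bound is exclusive, so this yields 000000..999999.
  return randomInt(0, 1_000_000).toString().padStart(6, "0");
}

function hashCode(code: string, secret: string): string {
  return createHash("sha256").update(code + secret).digest("hex");
}

/** Hypothetical helper: compares a submitted code to a stored hash in constant time. */
function codeMatches(submitted: string, secret: string, storedHash: string): boolean {
  const a = Buffer.from(hashCode(submitted, secret), "hex");
  const b = Buffer.from(storedHash, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}

const secret = "example-secret"; // placeholder; the plan uses NEXTAUTH_SECRET
const code = generateCode();
const stored = hashCode(code, secret);
console.log(codeMatches(code, secret, stored)); // prints true
```

Only the hash is persisted; the plaintext code exists in memory long enough to be rendered into the email.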

`sendVerificationRequest` update

In authOptions.ts:86-94:

Generate code via generateCode()
Hash with NEXTAUTH_SECRET
UPDATE verificationToken SET code_hash = ?, attempts = 0 WHERE identifier = ? AND token = ?
Pass code (plaintext) into MagicLinksEmail
Existing logging unchanged

Email template update

magic_link.ts takes a new code argument. The HTML and plain-text bodies lead with:

Your sign-in code: 482 911

Enter this at ara.domna.homes/auth/verify-code

Or click here to skip the code step: [Sign in to Ara] →

Code rendered large and monospace; the link is a smaller secondary button.
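The plain-text body could be produced by a small pure function, which also keeps the "482 911" grouping in one place. A sketch under assumptions: `formatCodeForDisplay` and `renderPlainTextBody` are hypothetical names, and the wording simply mirrors the sample above.

```typescript
/** Formats a 6-digit code as two groups of three, e.g. "482911" -> "482 911". */
function formatCodeForDisplay(code: string): string {
  if (!/^\d{6}$/.test(code)) {
    throw new Error("expected a 6-digit numeric code");
  }
  return `${code.slice(0, 3)} ${code.slice(3)}`;
}

/** Builds the plain-text email body: code first, link as the secondary path. */
function renderPlainTextBody(code: string, linkUrl: string): string {
  return [
    `Your sign-in code: ${formatCodeForDisplay(code)}`,
    "",
    "Enter this at ara.domna.homes/auth/verify-code",
    "",
    `Or click here to skip the code step: ${linkUrl}`,
  ].join("\n");
}
```

Because the code is displayed with a space, whatever reads it back (the form or a test helper) has to tolerate that space.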

NextAuth providers — add `CredentialsProvider`

In authOptions.ts:

```
import CredentialsProvider from "next-auth/providers/credentials";

CredentialsProvider({
  id: "email-code",
  name: "Email Code",
  credentials: {
    email: { type: "text" },
    code: { type: "text" },
  },
  async authorize(credentials) {
    if (!credentials?.email || !credentials?.code) return null;
    const email = credentials.email.toLowerCase();
    const hashed = hashCode(credentials.code, process.env.NEXTAUTH_SECRET!);

    // Look up the row by (identifier, code_hash, not-expired)
    // If found: check attempts < 5, delete row, return user
    // If found but attempts >= 5: delete row, return null
    // If not found: increment attempts on the row matching identifier (if any), return null
    // Rate-limit check before any of this; on exceed, return null silently
    return null; // placeholder so the sketch type-checks until the steps above are implemented
  },
}),
```
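The commented steps in `authorize` can be expressed as a pure decision function, which makes the attempt-counting rules easy to unit-test. This is one possible reading, not the planned implementation: it assumes a single live row per email and deletes the row on the fifth wrong attempt so that a later correct code fails. `decideCodeAttempt` and its types are hypothetical.

```typescript
const MAX_ATTEMPTS = 5;

interface VerificationRow {
  identifier: string;
  codeHash: string | null;
  attempts: number;
  expires: Date;
}

type CodeDecision =
  | { outcome: "accept"; deleteRow: true }
  | { outcome: "reject"; deleteRow: boolean; incrementAttempts: boolean };

/** Decides what to do with one submitted code hash for the given row. */
function decideCodeAttempt(
  row: VerificationRow | undefined,
  submittedHash: string,
  now: Date,
): CodeDecision {
  // No live row for this email: nothing to consume or count against.
  if (!row || row.expires.getTime() <= now.getTime()) {
    return { outcome: "reject", deleteRow: false, incrementAttempts: false };
  }
  const matches = row.codeHash !== null && row.codeHash === submittedHash;
  if (matches) {
    // A correct code consumes the row either way; it only signs in under the cap.
    return row.attempts < MAX_ATTEMPTS
      ? { outcome: "accept", deleteRow: true }
      : { outcome: "reject", deleteRow: true, incrementAttempts: false };
  }
  // Wrong code: count it, and burn the row once the cap is reached.
  const attemptsAfter = row.attempts + 1;
  return {
    outcome: "reject",
    deleteRow: attemptsAfter >= MAX_ATTEMPTS,
    incrementAttempts: attemptsAfter < MAX_ATTEMPTS,
  };
}
```

The caller would apply the returned decision inside one transaction, so that the delete or increment cannot race with a parallel attempt.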

Verify the signIn callback at authOptions.ts:114-206 handles account.type === "credentials" cleanly (it should fall through the OAuth branches at lines 142 and 180, but worth a unit test).

Drop magic-link maxAge

Change authOptions.ts:84 from 60 * 60 to 60 * 10. Code and link now share the 10-min window.

New page — `src/app/auth/verify-code/page.tsx`

Code-input form. POSTs {email, code} to NextAuth credentials endpoint via signIn("email-code", { email, code, redirect: false }). Includes a "Resend code" button (rate-limited via the new authRateLimits table).
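Since the email shows the code as "482 911", the form should accept the value with or without the space before calling signIn. A minimal sketch of that normalisation; `normalizeCodeInput` is a hypothetical helper, not an existing function.

```typescript
/**
 * Strips whitespace and hyphens from user input and returns the six digits,
 * or null when the result is not exactly six digits.
 */
function normalizeCodeInput(raw: string): string | null {
  const digits = raw.replace(/[\s-]/g, "");
  return /^\d{6}$/.test(digits) ? digits : null;
}
```

Returning null lets the page show a validation message without spending one of the five attempts on input that could never match.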

Redirect change

authOptions.ts:103: change verifyRequest: "/auth/verify-request" to verifyRequest: "/auth/verify-code?email=<encoded>" so post-submit lands on the code-entry page with the email pre-filled.

Existing `/verify/[token]` page

Unchanged behaviour: the link path still works. Optionally add a small "Type the code instead" link on this page, pointing to the code-entry page, for the rare user who prefers it.

Feature flag

Env var AUTH_CODE_FALLBACK_ENABLED=true|false. When false, skip the code generation in sendVerificationRequest and redirect to the old /auth/verify-request page. Lets us canary and roll back via env-var flip.
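Reading the flag through one helper keeps the two call sites (code generation and the redirect) in agreement. A sketch, assuming that only the literal value "true" enables the feature; `isCodeFallbackEnabled` and `verifyRequestPath` are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

/** True only when AUTH_CODE_FALLBACK_ENABLED is "true" (case-insensitive, trimmed). */
function isCodeFallbackEnabled(env: Env): boolean {
  const raw = env["AUTH_CODE_FALLBACK_ENABLED"];
  return raw !== undefined && raw.trim().toLowerCase() === "true";
}

/** Chooses the post-submit page based on the flag. */
function verifyRequestPath(env: Env): string {
  return isCodeFallbackEnabled(env) ? "/auth/verify-code" : "/auth/verify-request";
}
```

Treating anything other than "true" as off means a missing or mistyped variable falls back to the existing link-only flow.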

Acceptance criteria

Submitting email at / lands on /auth/verify-code (when flag enabled)
The email body contains a clear 6-digit code AND a working link
Typing the code on the verify-code page signs the user in
Clicking the link still signs the user in (single artifact — using the link invalidates the code, and vice versa)
Five wrong code attempts delete the row; a 6th attempt with the correct code fails
A 6th code request within an hour to the same email is silently no-op'd
After successful sign-in, behaviour identical to today (lands on /home, user record + lastLogin updated, etc.)

Testing strategy

CI (every PR)

Vitest for pure-function unit tests (existing pattern — see src/app/email_templates/buildMailHeaders.test.ts added in PR 1).
Cypress for E2E (existing harness at cypress/e2e/) covering:
- Code path: submit → poll Mailpit → extract code → enter → land on /home
- Link path: submit → poll Mailpit → extract URL → visit → confirm → land on /home
- Wrong code: 5 wrong attempts invalidate; 6th with correct code fails
- Expired code: code submitted >10 min after generation fails
Mailpit as a docker-compose service. SMTP at :1025, JSON API at :8025. Cypress task helpers (cy.task) call the Mailpit JSON API to extract codes/links from captured emails.
Unit tests for generateCode() distribution, hashCode() determinism, rate-limit math (Vitest).
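The E2E helpers need to pull the code and the link out of whatever text Mailpit captured. A sketch of the extraction step only, assuming the plain-text body follows the template wording above; `extractSignInCode` and `extractFirstUrl` are hypothetical names, and fetching the message from Mailpit is left out.

```typescript
/** Returns the six digits after "sign-in code:", tolerating the display space. */
function extractSignInCode(text: string): string | null {
  const match = /sign-in code:\s*(\d{3})\s?(\d{3})/i.exec(text);
  return match ? `${match[1]}${match[2]}` : null;
}

/** Returns the first http(s) URL in the text, or null when there is none. */
function extractFirstUrl(text: string): string | null {
  const match = /https?:\/\/[^\s<>"')\]]+/.exec(text);
  return match ? match[0] : null;
}

const sample = [
  "Your sign-in code: 482 911",
  "",
  "Enter this at ara.domna.homes/auth/verify-code",
  "",
  "Or click here to skip the code step: https://ara.domna.homes/verify/abc123",
].join("\n");

console.log(extractSignInCode(sample)); // prints "482911"
console.log(extractFirstUrl(sample)); // prints "https://ara.domna.homes/verify/abc123"
```

The sample URL is made up for the example; the bare `ara.domna.homes/auth/verify-code` line is skipped because it has no scheme.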

Pre-release (manual gate on template changes)

Run the template through mail-tester.com before shipping. Target >9/10. Free, 60 seconds.

Production

Eyeball the SES Account dashboard weekly: bounce rate, complaint rate, reputation status.
One CloudWatch alarm: bounce rate >2% sustained → SNS → email the team.
When investigating an incident, temporarily subscribe an SQS queue or email endpoint to dev-ses-events SNS topic to capture events for the duration. Unsubscribe after.

When volume grows past ~50 sign-ups/day, pick up the backlog ticket for the SES event-consumer Lambda.

Files touched (summary)

| Phase | File | Change |
|-------|------|--------|
| PR 1 | src/app/email_templates/magic_link.ts | Add `X-SES-CONFIGURATION-SET` header; return `messageId` |
| PR 1 | src/app/api/auth/[...nextauth]/authOptions.ts | Replace `sendVerificationRequest` logging with try/catch + messageId |
| PR 1 | Terraform `modules/ses/` | Add `renderingFailure` event type + cfg-set name output |
| PR 1 | Deployment env vars | Add `SES_CONFIGURATION_SET=dev-ses-config` |
| IT | DNS: `_dmarc.domna.homes` | Update DMARC TXT value |
| IT | M365 admin | Create `noreply@domna.homes` shared mailbox |
| PR 2 | src/app/db/schema/users.ts | Add `code_hash`, `attempts` columns; new `authRateLimits` table |
| PR 2 | `src/lib/auth/code.ts` (new) | `generateCode()`, `hashCode()` |
| PR 2 | src/app/email_templates/magic_link.ts | Accept `code` arg; render code prominently in HTML + plaintext |
| PR 2 | src/app/api/auth/[...nextauth]/authOptions.ts | Add `CredentialsProvider`; generate+persist code in `sendVerificationRequest`; drop `maxAge` to 600; change `verifyRequest` redirect |
| PR 2 | `src/app/auth/verify-code/page.tsx` (new) | Code-input form + resend button |
| PR 2 | `src/lib/auth/rate-limit.ts` (new) | Per-email-send, per-code-attempt, per-IP-verify limiters backed by `authRateLimits` table |
| PR 2 | `.github/workflows/test.yml` | Mailpit service + Cypress E2E |
| PR 2 | `docker-compose.yml` | Mailpit local-dev service |
| PR 2 | Deployment env vars | Add `AUTH_CODE_FALLBACK_ENABLED` flag |

Open questions

None blocking. Possibilities to revisit after PR 2 ships:

Do enterprise customers (paid relationships, formal SLAs) want Microsoft work-account OAuth as their primary sign-in?
Does the .homes TLD continue to be a deliverability bottleneck after MAIL FROM + DMARC p=reject are in place? If yes, evaluate moving sends to a .com / .co.uk we own.
Does the SES event volume justify the Lambda from the backlog ticket?
