assessment-model/docs/wip/auth-email-code-fallback-plan.md
Khalim Conn-Kowlessar 9c569f5584 Add SES observability foundation for email auth (PR 1 of code-fallback)
Wires the X-SES-CONFIGURATION-SET header on outbound auth emails so SES
bounce/delivery events flow through dev-ses-config to the dev-ses-events
SNS topic. Replaces the fire-and-forget "EMAIL MAGIC LINK SENT" log
(which fired before the SMTP transaction and swallowed downstream errors)
with structured EMAIL_MAGIC_LINK_SUCCESS/_FAILURE logs carrying the
Nodemailer messageId, so app-side sends are now correlatable with SES
events.

Motivated by the Atkins / Sustainable Building UK silent-quarantine
incidents where we couldn't tell whether SES had even tried to send.

Plan doc at docs/wip/auth-email-code-fallback-plan.md tracks the
broader email-code-fallback design that PR 2 will implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:15:25 +00:00

18 KiB

Email-auth code-fallback — implementation plan

Status: Ready to implement (2026-05-27) Author: Khalim (with Claude, via /grill-me) Trigger: Multiple corporate-domain users (Atkins, Sustainable Building UK, Arup pre-fix) reporting magic-link emails never arriving. One IT department confirmed direct sender-policy block.

Problem

Magic-link sign-in emails are being silently quarantined by some corporate email gateways (Microsoft 365 Defender / Mimecast / Proofpoint family). The mail reaches the recipient's MX, is accepted at SMTP, then disappears post-acceptance — never hits the inbox or junk folder. SES sees a clean Delivery event; the user sees nothing.

This is invisible to the sender by design. We confirmed it via the Atkins- adjacent case (Craig Williams, sustainablebuildinguk.com) where the recipient's IT department reached out, confirmed the block, and manually allow-listed noreply@domna.homes.

What's been ruled out

  • SPF, DKIM are correctly configured (verified via dig)
  • SES suppression list does not contain the failing users
  • Bounce rate is 0.03%, complaint rate 0.01% — reputation is fine
  • Link pre-fetching is already defended by /verify/[token]/page.tsx (two-step consent before token consumption)

What's contributing but not the root cause

  • noreply@domna.homes is not a real mailbox → small reputation hit and broken List-Unsubscribe (magic_link.ts:54)
  • DMARC is at p=none with no rua= reporting — zero visibility into alignment failures
  • .homes is a 2014-era new gTLD with thin aggregate reputation at some receivers (structural, hard to fully mitigate without changing sender domain)
  • App's EMAIL_MAGIC_LINK_SENT log fires before the SMTP transaction and isn't wrapped in try/catch, so it can't distinguish "tried" from "sent" (authOptions.ts:86-94)
  • SES configuration set exists but the app never sets the X-SES-CONFIGURATION-SET header, so events don't flow through it (magic_link.ts:53-55)

Decisions (grilled and locked)

# Decision Rationale
D1 Code-first email auth with link as fast-path fallback (option b2) — both delivered in the same email Code is more filter-friendly than a long opaque URL; SafeLinks can't break a 6-digit code; cross-device UX is naturally same-device
D2 6-digit numeric code Industry standard (AWS/Stripe/GitHub/Anthropic Console); easy to read aloud and type on mobile
D3 10-minute expiry for both code and link (single artifact) Code's lower entropy demands shorter window; unifies behaviour
D4 Single DB row per verification — extend verificationToken with code_hash and attempts columns No new table; both paths consume the same row; deletion on first success enforces single-use
D5 Rate limits: 5 attempts/code, 5 codes/email/hour, 20 verify-attempts/IP/10min — stored in Postgres Postgres avoids Redis dependency; numbers give attacker ~1-in-40k/hour worst case
D6 NextAuth v4 CredentialsProvider alongside existing EmailProvider Cleanest extension point; works with existing JWT session strategy at authOptions.ts:269; reuses existing user / signIn callback
D7 Ship in two PRs: observability first, code-flow second behind feature flag Lets us measure the improvement; isolates risk
D8 Testing: Mailpit + Playwright in CI, mail-tester.com manual on template changes, SES dashboard for prod Lean — no paid SaaS at current scale
D9 Skip the SES event-consumer Lambda for now; ticket on backlog Volume doesn't justify it; SES dashboard + suppression list + customer-success channel are enough for forensics today

Out of scope / deferred

  • Microsoft work-account OAuth (Azure AD multi-tenant) — would skip email entirely for M365 corporate users but locks them to MS SSO; revisit if enterprise customers demand it
  • Switching sending domain off .homes — measurable deliverability lift but requires re-warmup and brand thinking; revisit if domna.homes reputation stays a bottleneck after the DMARC + MAIL FROM work
  • SES event-consumer Lambda — ticketed for the backlog (see planning board: "Wire up SES event consumer for email auth forensics")
  • BIMI / VMC, dedicated SES IP, GlockApps placement testing — overkill at current scale
  • Case-sensitivity bug in portfolio invitations (separate issue) — user Craig.Williams@… invited with mixed case, then signed in lowercase, creating two user records. The lookup at authOptions.ts:117 normalises but the invitation path doesn't. Track separately — not in this plan.

PR 1 — Observability + logging fixes

Goal: when the next Atkins-shaped incident happens, we know within minutes whether SES tried, whether SES accepted, and whether the recipient SMTP accepted. Today we know none of these.

Risk: near-zero. Additive only.

Effort: ~half a day.

Changes

  1. Add X-SES-CONFIGURATION-SET header to outbound mail.

    In magic_link.ts:53-55:

    headers: {
      ...(process.env.SES_CONFIGURATION_SET && {
        "X-SES-CONFIGURATION-SET": process.env.SES_CONFIGURATION_SET,
      }),
      "List-Unsubscribe": `<mailto:${provider.from}>`,
    },
    

    Conditional so CI (Mailpit) doesn't need the env var.

  2. Capture and return Nodemailer messageId from MagicLinksEmail.

    In magic_link.ts:39-62, change the return type to Promise<{ messageId: string }> and return { messageId: result.messageId } after the result.rejected check.

  3. Fix the misleading log in authOptions.ts:86-94:

    sendVerificationRequest: async ({ identifier, url, provider }) => {
      try {
        const { messageId } = await MagicLinksEmail({ identifier, url, provider });
        console.log("EMAIL_MAGIC_LINK_SUCCESS", {
          email: identifier,
          messageId,
          timestamp: new Date().toISOString(),
        });
      } catch (err) {
        console.error("EMAIL_MAGIC_LINK_FAILURE", {
          email: identifier,
          error: err instanceof Error ? err.message : String(err),
          timestamp: new Date().toISOString(),
        });
        throw err;
      }
    },
    

    Now log presence proves "we tried", EMAIL_MAGIC_LINK_SUCCESS proves "SES accepted at SMTP", and messageId correlates to SES events.

  4. Add SES_CONFIGURATION_SET env var to the deployment platform (Vercel / wherever). Value: dev-ses-config.

  5. Terraform — add renderingFailure event type and output the cfg-set name. In the SES module:

    resource "aws_ses_event_destination" "sns" {
      # ...
      matching_types = [
        "send",
        "bounce",
        "reject",
        "complaint",
        "delivery",
        "renderingFailure",   # ← add
      ]
      # ...
    }
    
    output "configuration_set_name" {
      value = aws_ses_configuration_set.this.name
    }
    

Acceptance criteria

  • Sending a magic link produces exactly one of: EMAIL_MAGIC_LINK_SUCCESS (with messageId) or EMAIL_MAGIC_LINK_FAILURE (with error). No bare EMAIL MAGIC LINK SENT lines.
  • The messageId in app logs matches the mail.messageId SES uses (verify once via SES console for one test send).
  • Terraform plan is clean after the two SES module changes.

IT track (parallel, out-of-band)

These happen in parallel to PR 1; no engineer dependency.

  1. Create noreply@domna.homes as a real shared mailbox in M365. Configure an auto-reply pointing to a monitored support address. Set Reply-To: on outgoing mail in PR 2 once the mailbox exists, OR change EMAIL_FROM to something like accounts@domna.homes and make that monitored. Update EMAIL_FROM env var when ready.

  2. Update the DMARC TXT record at _dmarc.domna.homes from:

    v=DMARC1; p=none;
    

    to:

    v=DMARC1; p=none; rua=mailto:<reporting-address>; fo=1; adkim=r; aspf=r; pct=100;
    

    Recommend Postmark DMARC Monitor (free) for the reporting address — they provide a unique mailbox like <uuid>@inbound-smtp.dmarc.postmarkapp.com and parse the XML into readable reports.

  3. Progression (after ~4 weeks of clean reports): bump to p=quarantine; pct=10;, then p=quarantine; pct=100;, then p=reject; pct=100;. Don't skip the ramp — HubSpot / Outlook sends need to be confirmed aligned first.

Nice-to-have (do when convenient, not blocking)

  • Custom MAIL FROM domain (mail.domna.homes) — adds SPF alignment via our domain instead of *.amazonses.com. In Terraform:
    resource "aws_ses_domain_mail_from" "this" {
      domain           = aws_ses_domain_identity.this.domain
      mail_from_domain = "mail.${var.domain_name}"
    }
    
    Plus the corresponding MX 10 feedback-smtp.eu-west-2.amazonses.com and TXT v=spf1 include:amazonses.com -all at mail.domna.homes. Modest but real deliverability win.

PR 2 — Code fallback flow (behind feature flag)

Goal: every magic-link email now also contains a 6-digit code. The post-submit UX leads with code entry; the link still works as a fast path.

Risk: medium — changes the user-facing auth flow. Feature flag mitigates.

Effort: ~2-3 days.

Schema migration

Add to the existing verificationToken table in src/app/db/schema/users.ts:

export const verificationTokens = pgTable("verificationToken", {
  identifier: text("identifier").notNull(),
  token: text("token").notNull(),
  expires: timestamp("expires", { mode: "date" }).notNull(),
  codeHash: text("code_hash"),                              // ← new (nullable)
  attempts: integer("attempts").notNull().default(0),       // ← new
});

Also new table for rate-limit state:

export const authRateLimits = pgTable("authRateLimits", {
  scope: text("scope").notNull(),         // "email-send" | "ip-verify"
  key: text("key").notNull(),             // email or IP
  count: integer("count").notNull().default(0),
  windowStart: timestamp("window_start").notNull(),
}, (t) => ({
  pk: primaryKey({ columns: [t.scope, t.key] }),
}));

Drizzle migration is additive — existing rows get code_hash = NULL, attempts = 0. Pre-deploy magic-link emails (those without codes) continue to work via the link path; users on the code-entry page will fail and need to resend. Acceptable transient.

Code utility — src/lib/auth/code.ts

import crypto from "crypto";

export function generateCode(): string {
  return crypto.randomInt(0, 1_000_000).toString().padStart(6, "0");
}

export function hashCode(code: string, secret: string): string {
  return crypto.createHash("sha256").update(code + secret).digest("hex");
}

crypto.randomInt not Math.random — must be CSPRNG.

sendVerificationRequest update

In authOptions.ts:86-94:

  1. Generate code via generateCode()
  2. Hash with NEXTAUTH_SECRET
  3. UPDATE verificationToken SET code_hash = ?, attempts = 0 WHERE identifier = ? AND token = ?
  4. Pass code (plaintext) into MagicLinksEmail
  5. Existing logging unchanged

Email template update

magic_link.ts takes a new code argument. The HTML and plain-text bodies lead with:

Your sign-in code: 482 911

Enter this at ara.domna.homes/auth/verify-code

Or click here to skip the code step: [Sign in to Ara] →

Code rendered large and monospace; the link is a smaller secondary button.

NextAuth providers — add CredentialsProvider

In authOptions.ts:

import CredentialsProvider from "next-auth/providers/credentials";

CredentialsProvider({
  id: "email-code",
  name: "Email Code",
  credentials: {
    email: { type: "text" },
    code: { type: "text" },
  },
  async authorize(credentials) {
    if (!credentials?.email || !credentials?.code) return null;
    const email = credentials.email.toLowerCase();
    const hashed = hashCode(credentials.code, process.env.NEXTAUTH_SECRET!);

    // Look up the row by (identifier, code_hash, not-expired)
    // If found: check attempts < 5, delete row, return user
    // If found but attempts >= 5: delete row, return null
    // If not found: increment attempts on the row matching identifier (if any), return null
    // Rate-limit check before any of this; on exceed, return null silently
  },
}),

Verify the signIn callback at authOptions.ts:114-206 handles account.type === "credentials" cleanly (it should fall through the OAuth branches at lines 142 and 180, but worth a unit test).

Change authOptions.ts:84 from 60 * 60 to 60 * 10. Code and link now share the 10-min window.

New page — src/app/auth/verify-code/page.tsx

Code-input form. POSTs {email, code} to NextAuth credentials endpoint via signIn("email-code", { email, code, redirect: false }). Includes a "Resend code" button (rate-limited via the new auth_rate_limits table).

Redirect change

authOptions.ts:103: change verifyRequest: "/auth/verify-request" to verifyRequest: "/auth/verify-code?email=<encoded>" so post-submit lands on the code-entry page with the email pre-filled.

Existing /verify/[token] page

Unchanged behaviour — the link path still works. Optionally add a small "Type the code instead" link to the code-entry page for the rare user who prefers it.

Feature flag

Env var AUTH_CODE_FALLBACK_ENABLED=true|false. When false, skip the code generation in sendVerificationRequest and redirect to the old /auth/verify-request page. Lets us canary and roll back via env-var flip.

Acceptance criteria

  • Submitting email at / lands on /auth/verify-code (when flag enabled)
  • The email body contains a clear 6-digit code AND a working link
  • Typing the code on the verify-code page signs the user in
  • Clicking the link still signs the user in (single artifact — using the link invalidates the code, and vice versa)
  • 5 wrong code attempts deletes the row; a 6th attempt with the correct code fails
  • 6th code request within an hour to the same email is silently no-op'd
  • After successful sign-in, behaviour identical to today (lands on /home, user record + lastLogin updated, etc.)

Testing strategy

CI (every PR)

  • Vitest for pure-function unit tests (existing pattern — see src/app/email_templates/buildMailHeaders.test.ts added in PR 1).
  • Cypress for E2E (existing harness at cypress/e2e/) covering:
    • Code path: submit → poll Mailpit → extract code → enter → land on /home
    • Link path: submit → poll Mailpit → extract URL → visit → confirm → land on /home
    • Wrong code: 5 wrong attempts invalidate; 6th with correct code fails
    • Expired code: code submitted >10 min after generation fails
  • Mailpit as a docker-compose service. SMTP at :1025, JSON API at :8025. Cypress task helpers (cy.task) call the Mailpit JSON API to extract codes/links from captured emails.
  • Unit tests for generateCode() distribution, hashCode() determinism, rate-limit math (Vitest).

Pre-release (manual gate on template changes)

  • Run the template through mail-tester.com before shipping. Target >9/10. Free, 60 seconds.

Production

  • Eyeball the SES Account dashboard weekly: bounce rate, complaint rate, reputation status.
  • One CloudWatch alarm: bounce rate >2% sustained → SNS → email the team.
  • When investigating an incident, temporarily subscribe an SQS queue or email endpoint to dev-ses-events SNS topic to capture events for the duration. Unsubscribe after.

When volume grows past ~50 sign-ups/day, pick up the backlog ticket for the SES event-consumer Lambda.


Files touched (summary)

Phase File Change
PR 1 src/app/email_templates/magic_link.ts Add X-SES-CONFIGURATION-SET header; return messageId
PR 1 src/app/api/auth/[...nextauth]/authOptions.ts Replace sendVerificationRequest logging with try/catch + messageId
PR 1 Terraform modules/ses/ Add renderingFailure event type + cfg-set name output
PR 1 Deployment env vars Add SES_CONFIGURATION_SET=dev-ses-config
IT DNS — _dmarc.domna.homes Update DMARC TXT value
IT M365 admin Create noreply@domna.homes shared mailbox
PR 2 src/app/db/schema/users.ts Add code_hash, attempts columns; new authRateLimits table
PR 2 src/lib/auth/code.ts (new) generateCode(), hashCode()
PR 2 src/app/email_templates/magic_link.ts Accept code arg; render code prominently in HTML + plaintext
PR 2 src/app/api/auth/[...nextauth]/authOptions.ts Add CredentialsProvider; generate+persist code in sendVerificationRequest; drop maxAge to 600; change verifyRequest redirect
PR 2 src/app/auth/verify-code/page.tsx (new) Code-input form + resend button
PR 2 src/lib/auth/rate-limit.ts (new) Per-email-send, per-code-attempt, per-IP-verify limiters backed by authRateLimits table
PR 2 .github/workflows/test.yml Mailpit service + Cypress E2E
PR 2 docker-compose.yml Mailpit local-dev service
PR 2 Deployment env vars Add AUTH_CODE_FALLBACK_ENABLED flag

Open questions

None blocking. Possibilities to revisit after PR 2 ships:

  • Do enterprise customers (paid relationships, formal SLAs) want Microsoft work-account OAuth as their primary sign-in?
  • Does the .homes TLD continue to be a deliverability bottleneck after MAIL FROM + DMARC p=reject are in place? If yes, evaluate moving sends to a .com / .co.uk we own.
  • Does the SES event volume justify the Lambda from the backlog ticket?