AWS AppSync Integration with NestJS in Four Days

Why We Made the Move

STRV Pulse is an internal feedback platform our teams use to run feedback cycles. Managers assign feedback requests, employees write reviews and the system generates Excel reports and Slack notifications, so it touches every person in the company.

The Original Architecture

The backend was entirely serverless on AWS:

API Layer: AWS AppSync (GraphQL) with 56 JavaScript resolvers using the APPSYNC_JS runtime
Database: DynamoDB single-table design with 5 Global Secondary Indexes
Business Logic: 5 Node.js Lambda functions for complex operations (Excel export, Slack messaging, data sync, coworker graph computation, user aggregation)
Auth: Amazon Cognito with Google OAuth, three privilege groups (ADMIN, EMPLOYEE, MANAGER)
Infrastructure: Terraform modules across dev/staging environments
Schema: 399-line GraphQL schema with 28 queries/mutations

What Pushed Us to Migrate?

AppSync's resolver model created friction as the product matured. Five problems kept compounding:

Limited testability: AppSync JS resolvers run in a restricted sandbox with no standard testing framework. Business logic lived in untested resolvers or had to be pushed to Lambda functions.
Fragmented business logic: Validation spread across resolvers, Lambdas and VTL templates. The create_feedback resolver alone was 240+ lines of validation logic that could not be unit tested.
Engineer experience: No local development server. Debugging required deploying to AWS. No TypeScript support in resolvers.
Limited middleware: No request pipeline, logging or error handling middleware. Each resolver reimplemented common patterns.
Operational complexity: 56 resolver files and 5 Lambda packages, each with their own package.json and Terraform wiring between them.

We needed testability, a unified codebase, standard tooling and faster iteration, all without disrupting the live system or touching the database.

How We Planned the Migration

Before writing any code, we locked in the key architectural decisions. Most were straightforward. The interesting one was choosing code-first GraphQL over schema-first.

The Coexistence Principle

The most important architectural decision was to leave the database and auth system untouched. NestJS connects directly to the same DynamoDB table, using the same key patterns, and to the same Cognito user pool. This meant:

Both APIs could run simultaneously against production data
Regression tests could compare live responses from both systems
Rollback was trivial: just point traffic back to AppSync
No coordinated cutover was needed

Parallel Agent Architecture with Claude Code

The migration was designed for maximum parallelism using Claude Code's agent spawning. After a sequential foundation phase, three waves of three agents each ran in isolated git worktrees:

Phase 0 (Foundation): Sequential. One agent scaffolds the entire NestJS project
Wave 1 (3 agents in parallel): Users, Feedback Categories, Projects & Coworkers
Wave 2 (3 agents in parallel): Feedback Queries, Feedback Request Queries, Excel Export
Wave 3 (3 agents in parallel): Feedback Mutations, Create Feedback Request + Slack, Full Users Dashboard

Each agent branched from master in its own worktree. The only shared conflict point was app.module.ts (adding module imports), which was trivially resolved during rebasing.

Why code-first GraphQL was essential for parallelism: with schema-first GraphQL, all agents would modify a single .graphql file, creating constant merge conflicts. Code-first decorators let each module define its own types independently.

The Migration: Day by Day

Day 1: Foundation + Wave 1

Phase 0 established the project skeleton: NestJS application scaffold, DynamoDB service wrapping AWS SDK v3, Cognito JWT authentication with global guards, role-based authorization, configuration validation and a health endpoint. This produced 11,369 lines, including comprehensive test helpers. PR #83 was reviewed and merged.

Wave 1 launched immediately after, with three parallel agents migrating Users (PR #84), Feedback Categories (PR #86) and Projects & Coworkers (PR #85). Each agent:

Read the original AppSync resolvers to extract exact DynamoDB key patterns
Implemented the NestJS module (resolver, service, entities, DTOs)
Wrote unit tests covering all query patterns
Created a draft PR with inline review comments explaining key decisions

A CI workflow (PR #87) was added in parallel for automated testing on every push.

Day 2: Wave 2

Three agents tackled the read-heavy middle tier: Feedback Queries (PR #92), Feedback Request Queries (PR #91) and Excel Export (PR #93). These depended on Wave 1 modules: feedback queries needed the categories module for "recent" lookups, and Excel export needed the users module.

The Excel export was the most complex single unit: querying users under a manager, fetching all feedback per user, generating XLSX with 5 different content-type layouts (engineering, design, product, leadership and non-tech IT), uploading to S3 and returning a pre-signed URL. The original Lambda was 400+ lines; the NestJS implementation matched it line-for-line while adding 47 unit tests.

Day 3: Wave 3

The final wave tackled mutations, the riskiest endpoints since they modify data:

Feedback Mutations (PR #94): Ported the 240-line create_feedback resolver's validation logic: rating range checks (0-5), required fields for non-drafts, hardcoded closed-category blocking, conditional cleanup of feedback requests.
Create Feedback Request + Slack (PR #95): 4-step pipeline — verify user exists, create record with duplicate prevention, increment category counter, publish to SNS (keeping the existing SNS > Lambda > Slack flow to minimize blast radius).
Full Users Dashboard (PR #96): A complex aggregation query for the manager view, querying feedback received, requests to give and requests to receive in parallel for each user.
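
The validation rules ported in PR #94 are the kind of logic that becomes easy to unit test once it lives in a plain function. The following is a minimal sketch, not the project's code: the FeedbackInput shape, its field names and the closedCategoryIds parameter are assumptions made for illustration, and so is the choice to block drafts in closed categories. Only the rules themselves (rating between 0 and 5, required fields for non-drafts, closed-category blocking) come from the description above.

```typescript
// Hypothetical input shape; field names are assumptions, not the real schema.
interface FeedbackInput {
  categoryId: string;
  feedbackFor: string;
  isDraft: boolean;
  rating?: number;
  text?: string;
}

// Returns a list of validation errors; an empty list means the input is valid.
function validateFeedback(
  input: FeedbackInput,
  closedCategoryIds: ReadonlySet<string>,
): string[] {
  const errors: string[] = [];

  // Assumption: a closed category rejects all feedback, drafts included.
  if (closedCategoryIds.has(input.categoryId)) {
    errors.push(`category ${input.categoryId} is closed`);
  }

  // The 0-5 range check applies whenever a rating is present.
  if (input.rating !== undefined && (input.rating < 0 || input.rating > 5)) {
    errors.push("rating must be between 0 and 5");
  }

  // Drafts may be incomplete; submitted feedback may not.
  if (!input.isDraft) {
    if (input.rating === undefined) errors.push("rating is required");
    if (!input.text || input.text.trim() === "") errors.push("text is required");
  }

  return errors;
}
```

Returning a list of errors instead of throwing on the first one keeps each rule independently assertable in a unit test.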

Day 4: Regression Testing + CI

With all endpoints migrated, the focus shifted to verification. Three parallel agents wrote regression tests that sent identical GraphQL queries to both AppSync and NestJS, then deep-compared the responses:

PR #97: 11 tests for Users, Projects and Categories
PR #98: 15 tests for Feedback and Request Queries (both user and manager perspectives)
PR #99: 11 tests for Mutations (create > read-back > compare > cleanup)
PR #103: 3 tests for the Full Users manager dashboard query
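
The create > read-back > compare > cleanup flow used for mutations can be sketched roughly as follows. This is an illustration under assumptions, not the actual test code: the FeedbackApi interface and both function names are hypothetical, and running the two APIs one after the other (so that the two writes never target the same key at the same time) is a choice made for this sketch.

```typescript
// Hypothetical abstraction over one GraphQL API (AppSync or NestJS).
interface FeedbackApi {
  create(input: Record<string, unknown>): Promise<string>; // resolves to the new record's id
  read(id: string): Promise<unknown>;
  remove(id: string): Promise<void>;
}

// Create a record, read it back, and always clean up, even if the read fails.
async function createAndReadBack(
  api: FeedbackApi,
  input: Record<string, unknown>,
): Promise<unknown> {
  const id = await api.create(input);
  try {
    return await api.read(id);
  } finally {
    await api.remove(id);
  }
}

// Runs the same mutation through both APIs in sequence, then compares
// what each API returned on read-back using the supplied comparator.
async function mutationMatches(
  appSync: FeedbackApi,
  nest: FeedbackApi,
  input: Record<string, unknown>,
  isEqual: (expected: unknown, actual: unknown) => boolean,
): Promise<boolean> {
  const expected = await createAndReadBack(appSync, input);
  const actual = await createAndReadBack(nest, input);
  return isEqual(expected, actual);
}
```

The try/finally block is the important part: a failed read-back should not leave test records behind in a table that production also uses.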

A dedicated CI workflow (PR #100) runs regression tests on demand or when a PR carries the nestjs-migration label. The workflow authenticates test users via Cognito's USER_PASSWORD_AUTH flow, exchanging credentials for fresh ID tokens at runtime — avoiding the problem of storing expiring tokens as secrets.
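
For readers who have not used the USER_PASSWORD_AUTH flow, here is a dependency-free sketch of the token exchange against Cognito's InitiateAuth API. The function names and the FetchLike type are hypothetical, and the real workflow may well use the AWS SDK or CLI instead. The sketch also assumes an app client that has the flow enabled and no client secret.

```typescript
// Minimal fetch-like signature so the sketch has no runtime dependencies.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface AuthRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the InitiateAuth call for the USER_PASSWORD_AUTH flow.
function buildAuthRequest(
  region: string,
  clientId: string,
  username: string,
  password: string,
): AuthRequest {
  return {
    url: `https://cognito-idp.${region}.amazonaws.com/`,
    headers: {
      "Content-Type": "application/x-amz-json-1.1",
      "X-Amz-Target": "AWSCognitoIdentityProviderService.InitiateAuth",
    },
    body: JSON.stringify({
      AuthFlow: "USER_PASSWORD_AUTH",
      ClientId: clientId,
      AuthParameters: { USERNAME: username, PASSWORD: password },
    }),
  };
}

// Sends the request and extracts a fresh ID token from the response.
async function fetchIdToken(fetchFn: FetchLike, request: AuthRequest): Promise<string> {
  const response = await fetchFn(request.url, {
    method: "POST",
    headers: request.headers,
    body: request.body,
  });
  if (!response.ok) {
    throw new Error(`InitiateAuth failed with HTTP ${response.status}`);
  }
  const payload = (await response.json()) as {
    AuthenticationResult?: { IdToken?: string };
  };
  const token = payload.AuthenticationResult?.IdToken;
  if (!token) {
    throw new Error("InitiateAuth response did not contain an ID token");
  }
  return token;
}
```

In CI, Node's built-in fetch could be passed as fetchFn, with the test user's credentials read from repository secrets. Injecting the fetch function also lets the exchange be tested without network access.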

Terraform changes (PRs #101, #102) added dedicated test users to Cognito and fixed the OIDC trust policy for CI.

Technical Deep Dive

DynamoDB Key Pattern Discovery

The project's DATA_GUIDE.md was outdated. It documented simple key patterns like FEEDBACK#<id>, but the actual resolvers used compound keys with embedded emails:

PK: FEEDBACK#FEEDBACK_FROM#alice@strv.com#CATEGORY#cat-123
SK: FEEDBACK_FOR#bob@strv.com
GSI1PK: FEEDBACK#FEEDBACK_FOR#bob@strv.com#CATEGORY#cat-123

Every agent was instructed to read the actual AppSync resolver files rather than trusting the documentation. This was formalized in the migration plan and proved critical: incorrect key patterns would have caused silent data mismatches.
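
One way to keep compound keys like these from drifting is to build them in a single place. The helper below is a hypothetical sketch (the function and interface names are not from the project); the key formats are copied from the example above.

```typescript
interface FeedbackKeys {
  PK: string;
  SK: string;
  GSI1PK: string;
}

// Builds the compound keys for one feedback item.
// Formats mirror the example keys shown above; names are illustrative.
function buildFeedbackKeys(
  fromEmail: string,
  forEmail: string,
  categoryId: string,
): FeedbackKeys {
  return {
    PK: `FEEDBACK#FEEDBACK_FROM#${fromEmail}#CATEGORY#${categoryId}`,
    SK: `FEEDBACK_FOR#${forEmail}`,
    GSI1PK: `FEEDBACK#FEEDBACK_FOR#${forEmail}#CATEGORY#${categoryId}`,
  };
}

const keys = buildFeedbackKeys("alice@strv.com", "bob@strv.com", "cat-123");
console.log(keys.PK); // FEEDBACK#FEEDBACK_FROM#alice@strv.com#CATEGORY#cat-123
```

A unit test that pins these strings against values taken from the original resolvers would turn a silent data mismatch into a failing build.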

Auth Mapping

AppSync uses directive-based auth. NestJS uses decorator-based guards:

// AppSync
users(...): [FullUser]! @aws_cognito_user_pools(cognito_groups: ["managers"])

// NestJS equivalent
@Roles(Role.MANAGER)
@Query(() => [FullUser], { name: 'users' })
async users(...): Promise<FullUser[]> { ... }

A global APP_GUARD applies JWT authentication to all endpoints by default. The @Public() decorator opts out (health endpoint) and @Roles() adds group-based restrictions.
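
The order in which those three pieces interact is easier to see without the framework around it. The sketch below is a framework-free illustration of the decision such a guard makes, not the NestJS guard itself: decideAccess, RouteMetadata and AuthenticatedUser are hypothetical names, and the role values are taken from the privilege groups listed earlier.

```typescript
type Role = "ADMIN" | "EMPLOYEE" | "MANAGER";

// What the decorators would attach to a route (hypothetical shape).
interface RouteMetadata {
  isPublic: boolean; // set by @Public()
  requiredRoles: Role[]; // set by @Roles(...); empty means any authenticated user
}

// What a verified JWT would yield; null means no valid token was presented.
interface AuthenticatedUser {
  email: string;
  groups: Role[];
}

type AccessDecision = "allow" | "unauthenticated" | "forbidden";

function decideAccess(
  route: RouteMetadata,
  user: AuthenticatedUser | null,
): AccessDecision {
  // Public routes skip authentication entirely.
  if (route.isPublic) return "allow";
  // Everything else requires a valid token by default.
  if (user === null) return "unauthenticated";
  // No role restriction: any authenticated user may proceed.
  if (route.requiredRoles.length === 0) return "allow";
  // Otherwise the user must belong to at least one required group.
  return route.requiredRoles.some((role) => user.groups.includes(role))
    ? "allow"
    : "forbidden";
}
```

Keeping authentication as the default and making public access the explicit opt-out means a newly added endpoint is protected even if nobody remembers to annotate it.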

Regression Test Architecture

The regression tests run the NestJS app in-process via supertest while making HTTP calls to the live AppSync endpoint. A custom comparison helper handles:

Null normalization: undefined and null equivalence
Array ordering: Sort by stable keys (id, email) for order-independent comparison
Field filtering: Ignore fields where APIs intentionally differ (e.g., NestJS consistently maps name = full_name while AppSync does so only in some resolvers)
Partial error tolerance: AppSync returns null entries when DynamoDB data violates GraphQL schema constraints. The comparison filters these out since they represent data quality issues, not migration bugs.
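
A comparison helper with these four behaviors might look roughly like the sketch below. It is an illustration rather than the project's helper: the function names and the sortKeys option are hypothetical, while the filterNulls and ignoreFields option names follow the ones mentioned later in this post. Treating a missing key and a null value as equal is this sketch's interpretation of null normalization.

```typescript
interface CompareOptions {
  ignoreFields?: string[]; // fields where the APIs intentionally differ
  filterNulls?: boolean; // drop null array entries (partial errors)
  sortKeys?: string[]; // stable keys used to order arrays of objects
}

// Recursively rewrites a response into a canonical, comparable form.
function normalize(value: unknown, options: CompareOptions = {}): unknown {
  const { ignoreFields = [], filterNulls = false, sortKeys = ["id", "email"] } = options;

  // Null normalization: undefined and null are treated as the same thing.
  if (value === undefined || value === null) return null;

  if (Array.isArray(value)) {
    let items: unknown[] = value.map((item) => normalize(item, options));
    if (filterNulls) items = items.filter((item) => item !== null);
    // Order-independent comparison: sort by the first stable key present,
    // falling back to the serialized item.
    const keyOf = (item: unknown): string => {
      if (item !== null && typeof item === "object") {
        const record = item as Record<string, unknown>;
        const key = sortKeys.find((k) => record[k] !== undefined);
        if (key !== undefined) return String(record[key]);
      }
      return JSON.stringify(item);
    };
    return [...items].sort((a, b) => {
      const ka = keyOf(a);
      const kb = keyOf(b);
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    });
  }

  if (typeof value === "object") {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      if (ignoreFields.includes(key)) continue;
      const normalized = normalize(source[key], options);
      // Dropping null-valued keys makes { a: null } equal to {}.
      if (normalized !== null) result[key] = normalized;
    }
    return result;
  }

  return value;
}

function responsesMatch(a: unknown, b: unknown, options: CompareOptions = {}): boolean {
  return JSON.stringify(normalize(a, options)) === JSON.stringify(normalize(b, options));
}
```

Making filterNulls and ignoreFields opt-in per test keeps the default comparison strict, so a tolerance has to be justified where it is used.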

Data Quality Issues Discovered

The regression tests exposed several data quality issues that AppSync had been silently handling:

Invalid enum values: Some users had empty strings for department, causing both APIs to fail serialization
Null non-nullable fields: Projects with null members or start_date that AppSync's schema declared as required
Inconsistent name mapping: AppSync's getUser resolver returned raw DynamoDB data without mapping name = full_name, while listUsers and me did the mapping. NestJS corrected this.

These findings are independently valuable: they inform future data cleanup work.

Results

Four days. One engineer. One AI coding partner. Here's what came out the other side:

Business Impact

Testability: From 0 tests covering business logic to 200 automated tests. Critical validation logic (feedback creation, role-based access) is now tested.
Engineer velocity: Local development server with hot reload, standard TypeScript tooling. No AWS deployment required to test changes.
Operational confidence: The regression test suite can verify any future change against the known-good AppSync baseline.
Reduced complexity: 56 AppSync resolvers + 5 Lambda packages consolidated into a single NestJS application.
Future flexibility: NestJS is transport-agnostic. Adding REST endpoints, WebSocket support or switching GraphQL engines requires minimal changes.

Lessons Learned

What Worked Well

Coexistence over cutover: Running both APIs against the same database eliminated migration risk. Regression tests provided automated verification instead of manual QA.
Parallel agent waves with code-first GraphQL: Three agents working simultaneously with zero schema conflicts. The only merge point (app.module.ts) was trivially resolvable.
Git worktrees for agent isolation: Each parallel agent worked in its own worktree, preventing file conflicts and enabling true concurrent development.
Reading resolvers over documentation: The DATA_GUIDE.md was wrong. Treating actual resolver code as the source of truth prevented subtle data access bugs.
Incremental PR structure: Small, focused PRs (one per migration unit) made review manageable and kept each change independently revertible.

What Required Iteration

Regression test robustness: Initial tests failed due to data quality issues in DynamoDB. The comparison helper needed filterNulls and ignoreFields options to handle real-world data gracefully.
CI auth tokens: Cognito ID tokens expire after 1 hour by default. The initial approach of storing tokens as secrets was replaced with storing test user credentials and exchanging them for fresh tokens at runtime.
OIDC trust policy: The GitHub Actions OIDC role trusted the old repository name. This was discovered only when the regression CI ran for the first time.
AppSync behavioral inconsistencies: Some resolvers mapped name = full_name, others did not. NestJS corrected this, which the regression tests needed to accommodate.

Recommendations for Similar Migrations

Start with the database layer: Get key patterns right by reading actual production code, not documentation.
Use code-first GraphQL if parallelizing: Schema-first creates a single-file bottleneck.
Write regression tests early: They catch subtle behavioral differences that unit tests miss.
Keep the existing notification/event infrastructure: Migrating SNS > Lambda > Slack flows alongside the API would double the risk.
Plan for data quality issues: Production data often violates the schema's stated constraints. Your tests need to handle this gracefully.
