# AWS AppSync to NestJS In Four Days

[Marek Cermak](https://www.strv.com/blog/authors/marekcermak)  
Go Platform Engineering Manager

---

**TL;DR**  
One engineer. Claude Code. Four days. We migrated the entire backend of STRV Pulse, our internal peer feedback platform, from AWS AppSync to NestJS: 28 GraphQL endpoints, 56 resolvers and five Lambdas. Zero downtime, 200 automated tests.

## Why We Made the Move

STRV Pulse is an internal feedback platform our teams use to run feedback cycles. Managers assign feedback requests, employees write reviews, and the system generates Excel reports and Slack notifications. It touches every person in the company.

### The Original Architecture

The backend was entirely serverless on AWS:  
- **API Layer:** AWS AppSync (GraphQL) with 56 resolvers  
- **Database:** DynamoDB single-table design with five Global Secondary Indexes  
- **Business Logic:** five Node.js Lambda functions for complex operations (Excel export, Slack messaging, data sync, coworker graph computation, user aggregation)  
- **Auth:** Amazon Cognito with Google OAuth, three privilege groups (ADMIN, EMPLOYEE, MANAGER)  
- **Infrastructure:** Terraform modules across dev/staging environments  
- **Schema:** GraphQL schema with 28 queries/mutations

### What Pushed Us to Migrate?

AppSync's resolver model created friction as the product matured. Five problems kept compounding:  
1. **Limited testability:** AppSync JS resolvers run in a restricted sandbox with no standard testing framework. Business logic lived in untested resolvers or had to be pushed to Lambda functions.  
2. **Fragmented business logic:** Validation spread across resolvers, Lambdas and VTL templates. The `create_feedback` resolver alone packed a sprawling block of validation logic that could not be unit tested.  
3. **Engineer experience:** No local development server. Debugging required deploying to AWS. No TypeScript support in resolvers.  
4. **Limited middleware:** No request pipeline, logging or error handling middleware. Each resolver reimplemented common patterns.  
5. **Operational complexity:** 56 resolver files and five Lambda packages, each with their own package.json and Terraform wiring between them.

We needed testability, a unified codebase, standard tooling and faster iteration, all without disrupting the live system or touching the database.

## How We Planned the Migration

Before writing any code, we locked in the key architectural decisions. Most were straightforward. The interesting one was choosing code-first GraphQL over schema-first.

### The Coexistence Principle

The most important architectural decision was to keep the database and auth system as they were. NestJS connects directly to the same DynamoDB table using the same key patterns and the same Cognito user pool. This meant:  
- Both APIs could run simultaneously against production data  
- Regression tests could compare live responses from both systems  
- Rollback was trivial: just point traffic back to AppSync  
- No coordinated cutover was needed

### Parallel Agent Architecture with Claude Code

The migration was designed for maximum parallelism using Claude Code's agent spawning. Three waves of three agents each, all working in isolated git worktrees:  
- **Phase 0 (Foundation):** Sequential. One agent scaffolds the entire NestJS project  
- **Wave 1 (3 agents in parallel):** Users, Feedback Categories, Projects & Coworkers  
- **Wave 2 (3 agents in parallel):** Feedback Queries, Feedback Request Queries, Excel Export  
- **Wave 3 (3 agents in parallel):** Feedback Mutations, Create Feedback Request + Slack, Full Users Dashboard

Each worktree branched from master. The only shared conflict point was `app.module.ts`, where every module adds its import, and those conflicts were trivially resolved during rebasing.

Why code-first GraphQL was essential for parallelism: with schema-first GraphQL, all agents would modify a single `.graphql` file, creating constant merge conflicts. Code-first decorators let each module define its own types independently.

## The Migration: Day by Day

### Day 1: Foundation + Wave 1

Phase 0 established the project skeleton: NestJS application scaffold, DynamoDB service wrapping AWS SDK v3, Cognito JWT authentication with global guards, role-based authorization, configuration validation and a health endpoint. It also shipped a full set of test helpers. PR #83 was reviewed and merged.

Wave 1 launched immediately after, with three parallel agents migrating:  
- Users (PR #84)  
- Feedback Categories (PR #86)  
- Projects & Coworkers (PR #85)  

Each agent:  
1. Read the original AppSync resolvers to extract exact DynamoDB key patterns  
2. Implemented the NestJS module (resolver, service, entities, DTOs)  
3. Wrote unit tests covering all query patterns  
4. Created a draft PR with inline review comments explaining key decisions

A CI workflow (PR #87) was added in parallel for automated testing on every push.

### Day 2: Wave 2

Three agents tackled the read-heavy middle tier: Feedback Queries (PR #92), Feedback Request Queries (PR #91) and Excel Export (PR #93). These depended on Wave 1 modules: feedback queries needed categories for "recent" lookups; Excel export needed the users module.

The Excel export was the most complex: querying users under a manager, fetching feedback per user, generating XLSX with five different content-type layouts (engineering, design, product, leadership, non-tech IT), uploading to S3 and returning a pre-signed URL. The NestJS implementation reproduced the original Lambda's behavior end-to-end and added 47 unit tests.

### Day 3: Wave 3

The final wave addressed mutations, which are the riskiest endpoints because they modify data:  
- **Feedback Mutations (PR #94):** Ported the create_feedback resolver's validation (rating range, required fields, category blocking, cleanup of feedback requests)  
- **Create Feedback Request + Slack (PR #95):** Four-step pipeline: verify user exists, create record, increment category counter, publish to SNS  
- **Full Users Dashboard (PR #96):** Complex aggregation query for the manager view, querying feedback received, requests to give and requests to receive.
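
To illustrate why moving the `create_feedback` checks out of a resolver matters for testability, here is a minimal sketch of the kind of pure function they could collapse into. The field names (`feedbackFor`, `categoryId`, `rating`, `message`), the `ValidationRules` shape and the function name are assumptions for illustration, not the actual Pulse schema. The rating bounds are passed in rather than hard-coded because the article does not state them.

```typescript
// Hypothetical input shape; the real Pulse fields may differ.
interface CreateFeedbackInput {
  feedbackFor?: string;
  categoryId?: string;
  rating?: number;
  message?: string;
}

// Rules are injected so the function stays pure and easy to unit test.
interface ValidationRules {
  minRating: number;
  maxRating: number;
  blockedCategoryIds: ReadonlySet<string>;
}

// Returns a list of human-readable problems; an empty list means the input is valid.
function validateCreateFeedback(
  input: CreateFeedbackInput,
  rules: ValidationRules,
): string[] {
  const errors: string[] = [];

  // Required fields
  if (!input.feedbackFor) errors.push('feedbackFor is required');
  if (!input.categoryId) errors.push('categoryId is required');
  if (!input.message || input.message.trim() === '') errors.push('message is required');

  // Rating range
  if (
    typeof input.rating !== 'number' ||
    !Number.isInteger(input.rating) ||
    input.rating < rules.minRating ||
    input.rating > rules.maxRating
  ) {
    errors.push(`rating must be an integer between ${rules.minRating} and ${rules.maxRating}`);
  }

  // Category blocking
  if (input.categoryId && rules.blockedCategoryIds.has(input.categoryId)) {
    errors.push(`category ${input.categoryId} is blocked`);
  }

  return errors;
}
```

Because the function takes plain data and returns plain data, a unit test needs no AWS sandbox, no DynamoDB mock and no deployed resolver.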

### Day 4: Regression Testing + CI

With all endpoints migrated, focus shifted to verification. Three parallel agents wrote regression tests sending identical GraphQL queries to both AppSync and NestJS, then comparing responses:  
- PR #97: 11 tests for Users, Projects, Categories  
- PR #98: 15 tests for Feedback and Request Queries  
- PR #99: 11 tests for Mutations  
- PR #103: 3 tests for the Full Users manager dashboard

A dedicated CI workflow (PR #100) runs regression tests on demand or when the `nestjs-migration` label is present, authenticating test users via Cognito. Terraform changes (PRs #101, #102) added dedicated test users and fixed the OIDC trust policy.

## Technical Deep Dive

### DynamoDB Key Pattern Discovery

The project's `DATA_GUIDE.md` was outdated. It documented simple key patterns like `FEEDBACK#<id>`, but actual resolvers used compound keys with embedded emails:  

PK: `FEEDBACK#FEEDBACK_FROM#alice@strv.com#CATEGORY#cat-123`  
SK: `FEEDBACK_FOR#bob@strv.com`  
GSI1PK: `FEEDBACK#FEEDBACK_FOR#bob@strv.com#CATEGORY#cat-123`  

Every agent was therefore instructed to read the actual resolver files rather than trust the documentation. This became a formal rule for every agent and proved critical.
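
One way to keep such compound keys consistent across modules is to build them in a single place. The sketch below copies the key layout from the example above; the function name, parameter names and the `FeedbackKeys` type are illustrative and not taken from the Pulse codebase.

```typescript
// Key layout copied from the patterns above; names are illustrative.
interface FeedbackKeys {
  PK: string;
  SK: string;
  GSI1PK: string;
}

function buildFeedbackKeys(
  fromEmail: string,
  forEmail: string,
  categoryId: string,
): FeedbackKeys {
  return {
    PK: `FEEDBACK#FEEDBACK_FROM#${fromEmail}#CATEGORY#${categoryId}`,
    SK: `FEEDBACK_FOR#${forEmail}`,
    GSI1PK: `FEEDBACK#FEEDBACK_FOR#${forEmail}#CATEGORY#${categoryId}`,
  };
}

const exampleKeys = buildFeedbackKeys('alice@strv.com', 'bob@strv.com', 'cat-123');
console.log(exampleKeys.PK); // FEEDBACK#FEEDBACK_FROM#alice@strv.com#CATEGORY#cat-123
```

A builder like this also gives unit tests a cheap way to pin the exact strings the original resolvers wrote.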

### Auth Mapping

AppSync uses directive-based auth:

`users(...): [FullUser]! @aws_cognito_user_pools(cognito_groups: ["managers"])`

NestJS uses decorator-based guards:

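```typescript
// AppSync (GraphQL SDL, shown as a comment for comparison):
// users(...): [FullUser]! @aws_cognito_user_pools(cognito_groups: ["managers"])

// NestJS: the same restriction expressed with decorators on the resolver method
@Roles(Role.MANAGER)
@Query(() => [FullUser], { name: 'users' })
async users(...): Promise<FullUser[]> { ... }
```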

A global `APP_GUARD` applies JWT auth to every endpoint. The `@Public()` decorator opts an endpoint out (used for the health endpoint), and `@Roles()` adds group restrictions.
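
The three rules can be read as one decision function. The sketch below is framework-free so it runs on its own; in NestJS this logic would sit inside a guard's `canActivate`, reading metadata set by the decorators. The `RouteMeta` and `VerifiedClaims` shapes and the function name are hypothetical, and the sketch assumes role names equal Cognito group names, which may not hold in the real setup. Cognito does list group membership in the `cognito:groups` token claim.

```typescript
// Privilege groups named in the article's auth setup.
type Role = 'ADMIN' | 'EMPLOYEE' | 'MANAGER';

// Metadata a route would carry via @Public() / @Roles(); this shape is hypothetical.
interface RouteMeta {
  isPublic: boolean;
  requiredRoles: Role[];
}

// Subset of verified JWT claims; Cognito puts group membership under "cognito:groups".
interface VerifiedClaims {
  sub: string;
  'cognito:groups'?: string[];
}

function isRequestAllowed(meta: RouteMeta, claims: VerifiedClaims | null): boolean {
  // @Public() opts out of authentication entirely.
  if (meta.isPublic) return true;
  // No valid JWT: reject.
  if (claims === null) return false;
  // Authenticated and no role restriction on this route.
  if (meta.requiredRoles.length === 0) return true;
  // Assumption: role names match Cognito group names one-to-one.
  const groups = claims['cognito:groups'] ?? [];
  return meta.requiredRoles.some((role) => groups.includes(role));
}
```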

### Regression Test Architecture

Regression tests run the NestJS app in-process via supertest and send the same queries over HTTP to the live AppSync endpoint. A custom comparison helper handles:  
- Null normalization (undefined vs null)  
- Array ordering (sort by stable keys)  
- Field filtering (ignore fields that differ between the two APIs, such as `name` vs `full_name`)  
- Partial error tolerance (filter null entries from DynamoDB inconsistencies)
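
A comparison helper along these lines could look like the sketch below. It is not the helper from the Pulse repository: the `Json` type, the `NormalizeOptions` fields (`ignoreFields`, `sortKey`) and the function names are assumptions chosen to mirror the four behaviors listed above.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json | undefined };

interface NormalizeOptions {
  ignoreFields?: string[]; // fields known to differ between the two APIs
  sortKey?: string;        // stable key used to order arrays of objects
}

// Value used for ordering array items; falls back to the stringified item.
function sortValue(item: Json, key: string): string {
  if (item !== null && typeof item === 'object' && !Array.isArray(item)) {
    return String(item[key] ?? '');
  }
  return String(item);
}

function normalize(value: Json | undefined, opts: NormalizeOptions = {}): Json {
  // Null normalization: undefined and null compare as equal.
  if (value === undefined || value === null) return null;
  if (Array.isArray(value)) {
    const items = value
      .filter((item) => item !== null) // partial error tolerance: drop null entries
      .map((item) => normalize(item, opts));
    const key = opts.sortKey;
    if (key !== undefined) {
      items.sort((a, b) => sortValue(a, key).localeCompare(sortValue(b, key)));
    }
    return items;
  }
  if (typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    // Sorted keys make the serialized form independent of property order.
    for (const field of Object.keys(value).sort()) {
      if (opts.ignoreFields?.includes(field)) continue; // field filtering
      out[field] = normalize(value[field], opts);
    }
    return out;
  }
  return value;
}

function responsesMatch(a: Json, b: Json, opts: NormalizeOptions = {}): boolean {
  return JSON.stringify(normalize(a, opts)) === JSON.stringify(normalize(b, opts));
}
```

A test would then call `responsesMatch(appSyncData, nestData, { ignoreFields: ['name', 'full_name'], sortKey: 'id' })`, where `id` stands in for whatever stable key the compared entity has.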

### Data Quality Issues Discovered

Regression tests exposed silent data issues:  
1. Invalid enum values (e.g., empty strings for `department`)  
2. Null non-nullable fields (projects with null `members` or `start_date`)  
3. Inconsistent name mapping (`getUser` resolver vs `listUsers`)  

These findings give us a concrete starting list for future data cleanup.
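
Until the data is cleaned, one option is to tolerate these records at the mapping layer instead of failing the whole query. The sketch below is an assumption about how that could be done, not the approach used in Pulse: the department values, the `RawProject` shape and both function names are hypothetical.

```typescript
// Hypothetical department values; the real enum is not shown in this article.
const DEPARTMENTS = ['ENGINEERING', 'DESIGN', 'PRODUCT'] as const;
type Department = (typeof DEPARTMENTS)[number];

// Maps a raw stored attribute to a valid enum value, or null when the value is
// missing, empty or unknown, so one bad record does not fail the whole response.
function parseDepartment(raw: unknown): Department | null {
  if (typeof raw !== 'string') return null;
  const match = DEPARTMENTS.find((d) => d === raw);
  return match ?? null;
}

// Hypothetical raw record shape as it might come back from DynamoDB.
interface RawProject {
  id: string;
  members?: string[] | null;
}

interface Project {
  id: string;
  members: string[];
}

// Replaces a null or missing list with an empty one so a non-nullable field stays valid.
function toProject(raw: RawProject): Project {
  return { id: raw.id, members: raw.members ?? [] };
}
```

Whether a bad value should become `null`, a default or a logged error is a product decision; the point is that the choice lives in one tested function.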

## What We Shipped

Four days. One engineer. One AI coding partner. Here's what came out the other side:

### What Was Migrated

- 28 GraphQL endpoints and 56 resolvers to NestJS  
- Five Lambdas replaced by NestJS handlers  
- DynamoDB logic ported with key pattern fidelity  
- Cognito authentication enforced through NestJS guards and decorators  
- Business logic integrated into NestJS modules

### PR Sequence

(The detailed PR list, with each PR's number and purpose, is omitted for brevity.)

### How This Migration Changed Our Engineering Workflow

- **Testability:** From zero to 200 automated tests covering core logic  
- **Engineer velocity:** Local development with hot reload, TypeScript, no AWS deployment needed  
- **Operational confidence:** Regression tests verify future changes  
- **Reduced complexity:** 56 resolvers + five Lambda packages consolidated into a single NestJS app  
- **Future flexibility:** NestJS enables adding REST, WebSocket, or alternative GraphQL engines easily

## Lessons From a Four-Day Serverless API Migration

### What Worked Well

- **Coexistence over cutover:** Running both APIs against the same DB avoided risk  
- **Parallel agent waves with code-first GraphQL:** Zero schema conflicts with concurrent development  
- **Git worktrees for agent isolation:** Prevented conflicts and enabled true parallel work  
- **Reading resolvers over documentation:** Trusted actual code to avoid mismatches  
- **Incremental PRs:** Small, focused changes that are easy to review and revert

### What Required Iteration

- **Regression test robustness:** Real-world data issues required filtering options in the comparison helper  
- **CI auth tokens:** Replaced stored tokens with runtime exchange of credentials  
- **OIDC trust policy:** Fixed the GitHub Actions role's trust policy once the misconfiguration was detected  
- **Inconsistent resolver behavior:** Corrected name mapping differences

### Recommendations for Similar Migrations

- Start with the database layer: read production code, not docs  
- Use code-first GraphQL for parallelization  
- Write regression tests early  
- Keep existing notification/event infrastructure during migration  
- Plan for data quality issues and handle them gracefully
