# AWS AppSync to NestJS In Four Days

[Marek Cermak](https://www.strv.com/blog/authors/marekcermak)  
Go Platform Engineering Manager

---

**TL;DR**  
One engineer. Claude Code. Four days. We migrated the entire backend of STRV Pulse, our internal peer feedback platform, from AWS AppSync to NestJS: 28 GraphQL endpoints, 56 resolvers and five Lambdas. Zero downtime, 200 automated tests.

## Why We Made the Move

STRV Pulse is an internal feedback platform our teams use to run feedback cycles. Managers assign feedback requests, employees write reviews, and the system generates Excel reports and Slack notifications. It touches every person in the company.

### The Original Architecture

The backend was entirely serverless on AWS:
- **API Layer:** AWS AppSync (GraphQL) with 56 resolvers
- **Database:** DynamoDB single-table design with five Global Secondary Indexes
- **Business Logic:** five Node.js Lambda functions for complex operations (Excel export, Slack messaging, data sync, coworker graph computation, user aggregation)
- **Auth:** Amazon Cognito with Google OAuth, three privilege groups (ADMIN, EMPLOYEE, MANAGER)
- **Infrastructure:** Terraform modules across dev/staging environments
- **Schema:** GraphQL schema with 28 queries/mutations

### What Pushed Us to Migrate?

AppSync's resolver model created friction as the product matured. Five problems kept compounding:
1. **Limited testability:** AppSync JS resolvers run in a restricted sandbox with no standard testing framework. Business logic lived in untested resolvers or had to be pushed to Lambda functions.
2. **Fragmented business logic:** Validation spread across resolvers, Lambdas and VTL templates. The create_feedback resolver alone packed a sprawling block of validation logic that could not be unit tested.
3. **Engineer experience:** No local development server. Debugging required deploying to AWS. No TypeScript support in resolvers.
4. **Limited middleware:** No request pipeline, logging or error handling middleware. Each resolver reimplemented common patterns.
5. **Operational complexity:** 56 resolver files and five Lambda packages, each with their own package.json and Terraform wiring between them.

We needed testability, a unified codebase, standard tooling and faster iteration, all without disrupting the live system or touching the database.

## How We Planned the Migration

Before writing any code, we locked in the key architectural decisions. Most were straightforward. The interesting one was choosing code-first GraphQL over schema-first.

### The Coexistence Principle

The most important architectural decision was to keep the database and auth system as they were. NestJS connects directly to the same DynamoDB table using the same key patterns and the same Cognito user pool. This meant:
- Both APIs could run simultaneously against production data
- Regression tests could compare live responses from both systems
- Rollback was trivial: just point traffic back to AppSync
- No coordinated cutover was needed

### Parallel Agent Architecture with Claude Code

The migration was designed for maximum parallelism using Claude Code's agent spawning. Three waves of three agents each, all working in isolated git worktrees:
- **Phase 0 (Foundation):** Sequential. One agent scaffolds the entire NestJS project
- **Wave 1 (3 agents in parallel):** Users, Feedback Categories, Projects & Coworkers
- **Wave 2 (3 agents in parallel):** Feedback Queries, Feedback Request Queries, Excel Export
- **Wave 3 (3 agents in parallel):** Feedback Mutations, Create Feedback Request + Slack, Full Users Dashboard

Every agent branched from master into its own worktree. The only shared conflict point was `app.module.ts` (where each module adds its import), and it was trivially resolved during rebasing.

Code-first GraphQL was essential for this parallelism. With schema-first GraphQL, all agents would modify a single `.graphql` file, creating constant merge conflicts. With code-first decorators, each module defines its own types independently.

## The Migration: Day by Day

### Day 1: Foundation + Wave 1

Phase 0 established the project skeleton: NestJS app scaffold, DynamoDB service wrapping AWS SDK v3, Cognito JWT auth with global guards, role-based authorization, config validation, health endpoint. It also shipped test helpers. PR #83 was reviewed and merged.

Wave 1 followed immediately, migrating:
- Users (PR #84)
- Feedback Categories (PR #86)
- Projects & Coworkers (PR #85)

Each agent:
- Read original AppSync resolvers to extract DynamoDB key patterns
- Implemented NestJS module (resolver, service, entities, DTOs)
- Wrote unit tests
- Created draft PR with inline review comments

A CI workflow (PR #87) was added for automated testing on every push.

### Day 2: Wave 2

Agents tackled:
- Feedback Queries (PR #92)
- Feedback Request Queries (PR #91)
- Excel Export (PR #93)

These depended on Wave 1 modules.

Excel export was complex: it queries the users under a manager, fetches feedback per user, generates an XLSX file with various content types, uploads it to S3 and returns a pre-signed URL. The NestJS implementation reproduces the original Lambda's behavior, with 47 additional unit tests.

### Day 3: Wave 3

The final wave addressed mutations and the manager dashboard:
- **Feedback Mutations (PR #94):** ported full validation, including rating checks, required fields, closed-category blocking, feedback request cleanup
- **Create Feedback Request + Slack (PR #95):** multi-step pipeline: verify user, create record, increment category counter, publish to SNS (to maintain existing SNS > Lambda > Slack flow)
- **Full Users Dashboard (PR #96):** complex aggregation for manager view
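Moving validation out of the resolver sandbox is what makes it unit-testable. The following is a minimal sketch of that idea, not the project's actual code: it covers the required-field, rating and closed-category checks named above as a pure function and leaves feedback request cleanup out. All field names, the `Category` shape and the 1 to 5 rating range are assumptions made for illustration.

```typescript
// Hypothetical input shape; the real schema fields are not shown in this article.
interface FeedbackInput {
  feedbackFor: string;
  categoryId: string;
  rating: number | null;
  text: string;
}

// Hypothetical category record with a flag for closed feedback cycles.
interface Category {
  id: string;
  closed: boolean;
}

interface RatingRange {
  min: number;
  max: number;
}

// Returns a list of validation errors; an empty list means the input is valid.
// The default range of 1 to 5 is an assumption, not taken from the source.
function validateFeedback(
  input: FeedbackInput,
  category: Category,
  range: RatingRange = { min: 1, max: 5 },
): string[] {
  const errors: string[] = [];

  // Required fields
  if (input.feedbackFor.trim() === "") errors.push("feedbackFor is required");
  if (input.text.trim() === "") errors.push("text is required");

  // Rating checks
  if (
    input.rating === null ||
    isNaN(input.rating) ||
    input.rating < range.min ||
    input.rating > range.max
  ) {
    errors.push(`rating must be between ${range.min} and ${range.max}`);
  }

  // Closed-category blocking
  if (category.closed) errors.push(`category ${category.id} is closed`);

  return errors;
}
```

Because the function takes plain data and returns plain data, a unit test can exercise every branch without DynamoDB, Cognito or a running server.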

### Day 4: Regression Testing + CI

With all endpoints migrated, the focus shifted to verification. Regression tests were written in parallel:
- PR #97: Users, Projects, Categories
- PR #98: Feedback and Request Queries
- PR #99: Mutations
- PR #103: Full Users dashboard

A dedicated CI workflow (PR #100) runs the regression tests on demand or when a PR is labeled. It authenticates test users against Cognito at runtime, exchanging credentials for fresh ID tokens, which avoids token expiration issues.

Terraform changes (PRs #101, #102) added test users and fixed the OIDC trust policy.

## Technical Deep Dive

### DynamoDB Key Pattern Discovery

The project's `DATA_GUIDE.md` was outdated: it documented simple key patterns like `FEEDBACK#<id>`, while the actual resolvers used compound keys with embedded emails:

```
PK: FEEDBACK#FEEDBACK_FROM#alice@strv.com#CATEGORY#cat-123
SK: FEEDBACK_FOR#bob@strv.com
GSI1PK: FEEDBACK#FEEDBACK_FOR#bob@strv.com#CATEGORY#cat-123
```

Agents read the actual resolver code rather than the documentation, which was critical for getting the key patterns right.
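One way to keep such compound keys consistent is to build them in a single place. The helper below is a hypothetical sketch (the function and type names are not from the project) that reproduces the three key strings shown above for a single feedback item.

```typescript
// Key attributes for one feedback item, matching the pattern shown above.
interface FeedbackKeys {
  PK: string;
  SK: string;
  GSI1PK: string;
}

// Hypothetical helper: composes the compound keys from the author's email,
// the recipient's email and the category id.
function buildFeedbackKeys(
  fromEmail: string,
  forEmail: string,
  categoryId: string,
): FeedbackKeys {
  return {
    PK: `FEEDBACK#FEEDBACK_FROM#${fromEmail}#CATEGORY#${categoryId}`,
    SK: `FEEDBACK_FOR#${forEmail}`,
    GSI1PK: `FEEDBACK#FEEDBACK_FOR#${forEmail}#CATEGORY#${categoryId}`,
  };
}

const exampleKeys = buildFeedbackKeys("alice@strv.com", "bob@strv.com", "cat-123");
console.log(exampleKeys.PK);
```

A unit test that pins these strings would catch an accidental change to the key format before it reaches the shared table.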

### Auth Mapping

AppSync uses directive-based auth:

```graphql
users(...): [FullUser]! @aws_cognito_user_pools(cognito_groups: ["managers"])
```

NestJS uses guard decorators:

```typescript
@Roles(Role.MANAGER)
@Query(() => [FullUser], { name: 'users' })
async users(...): Promise<FullUser[]> { ... }
```

A global JWT guard applies authentication to every request. The `@Public()` decorator skips it for the health endpoint, and `@Roles()` enforces group restrictions.
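The decision such a guard makes can be sketched as a pure function, independent of NestJS. This is an illustration under assumptions, not the project's guard: `HandlerMeta` is a hypothetical stand-in for the metadata that `@Public()` and `@Roles()` attach, and only the `managers` group name comes from the schema excerpt above (`admins` and `employees` are assumed). Cognito does list the caller's groups in the `cognito:groups` token claim.

```typescript
enum Role {
  ADMIN = "ADMIN",
  EMPLOYEE = "EMPLOYEE",
  MANAGER = "MANAGER",
}

// Cognito group name for each role. Only "managers" appears in the source;
// "admins" and "employees" are assumed names.
const ROLE_TO_GROUP: Record<Role, string> = {
  [Role.ADMIN]: "admins",
  [Role.EMPLOYEE]: "employees",
  [Role.MANAGER]: "managers",
};

// Hypothetical shape of the metadata set by @Public() and @Roles().
interface HandlerMeta {
  isPublic: boolean;
  requiredRoles: Role[];
}

// Verified token claims; Cognito puts group membership in "cognito:groups".
interface TokenClaims {
  "cognito:groups"?: string[];
}

// Pass null for claims when the request carries no valid JWT.
function isAllowed(meta: HandlerMeta, claims: TokenClaims | null): boolean {
  if (meta.isPublic) return true; // @Public(): skip auth entirely
  if (claims === null) return false; // global JWT guard: token required
  if (meta.requiredRoles.length === 0) return true; // any authenticated user
  const groups = claims["cognito:groups"] || [];
  // @Roles(): the caller must belong to at least one required group
  return meta.requiredRoles.some(
    (role) => groups.indexOf(ROLE_TO_GROUP[role]) !== -1,
  );
}
```

In a real guard, the same checks would sit inside `canActivate`, with the metadata read from the handler and the claims taken from the verified token.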

### Regression Test Architecture

The regression suite runs NestJS in-process through supertest and makes HTTP calls to the live AppSync API, then compares the two responses after normalizing them:

- Null normalization: treats `undefined` and `null` as equal
- Array ordering: sorts arrays by stable keys (id, email)
- Field filtering: ignores fields that differ intentionally
- Partial error handling: filters out nulls caused by partial data errors
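A comparison like this can be sketched as a small recursive normalizer. The version below is an assumption-laden illustration rather than the project's helper: the option names are hypothetical, it sorts every array (so it assumes order never carries meaning), and it covers null normalization, array ordering and field filtering while leaving partial error handling out.

```typescript
type Json =
  | null
  | boolean
  | number
  | string
  | Json[]
  | { [key: string]: Json | undefined };

// Hypothetical option names for the rules listed above.
interface NormalizeOptions {
  ignoreFields: string[]; // fields that differ intentionally
  sortKeys: string[]; // stable keys used to order arrays, e.g. id, email
}

// Picks the value an array item is sorted by: the first stable key present,
// falling back to the item's JSON text.
function sortKey(item: Json, opts: NormalizeOptions): string {
  if (item !== null && typeof item === "object" && !Array.isArray(item)) {
    for (const key of opts.sortKeys) {
      const v = item[key];
      if (typeof v === "string" || typeof v === "number") return String(v);
    }
  }
  return JSON.stringify(item);
}

function normalize(value: Json | undefined, opts: NormalizeOptions): Json {
  // Null normalization: undefined and null compare as equal
  if (value === undefined || value === null) return null;
  if (Array.isArray(value)) {
    // Array ordering: normalize items first, then sort by a stable key
    const items = value.map((v) => normalize(v, opts));
    return items.sort((a, b) =>
      sortKey(a, opts).localeCompare(sortKey(b, opts)),
    );
  }
  if (typeof value === "object") {
    const out: { [key: string]: Json } = {};
    // Sorted keys make the serialized form independent of field order
    for (const key of Object.keys(value).sort()) {
      if (opts.ignoreFields.indexOf(key) !== -1) continue; // field filtering
      out[key] = normalize(value[key], opts);
    }
    return out;
  }
  return value;
}

// Two responses match when their normalized forms serialize identically.
function sameResponse(a: Json, b: Json, opts: NormalizeOptions): boolean {
  return (
    JSON.stringify(normalize(a, opts)) === JSON.stringify(normalize(b, opts))
  );
}
```

In a test, the NestJS response body and the AppSync response body would both pass through `sameResponse` with the same options, so any remaining difference points at a real behavioral gap.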

### Data Quality Issues Discovered

Regression tests revealed issues AppSync handled silently:
- Empty strings for enum fields
- Null for non-nullable fields
- Name mapping inconsistencies

These findings will guide future data cleanup.

## What We Shipped

Four days. One engineer. One AI coding partner. Here's what we achieved:

### What Was Migrated

## PR Sequence

### How This Migration Changed Our Engineering Workflow

- **Testability:** From zero tests to 200! Critical logic now tested.
- **Engineer velocity:** Local dev server, hot reload, standard tooling, no AWS deployments.
- **Operational confidence:** Regression suite verifies future changes.
- **Reduced complexity:** 56 resolvers + five Lambdas consolidated into one codebase.
- **Future flexibility:** Framework-agnostic transport layer, easy to extend.

### Lessons From a Four-Day Serverless API Migration

#### What Worked Well
- **Coexistence over cutover:** Running both APIs side by side minimized risk and made automated regression tests possible.
- **Parallel agent waves with code-first GraphQL:** Zero schema conflicts, minimal merge points.
- **Git worktrees for isolation:** Enabled truly concurrent development.
- **Reading resolvers over docs:** Actual code as source of truth.
- **Incremental PRs:** Small, manageable, revertible.

#### What Required Iteration
- **Regression test robustness:** The comparison needed filters, because data quality issues caused initial failures.
- **CI auth tokens:** Exchanging credentials at runtime instead of storing tokens.
- **OIDC trust policy:** Fixed GitHub Actions role trust.
- **Resolver inconsistencies:** Corrected schema mappings to match expectations.

#### Recommendations for Similar Migrations
- **Start with DB key patterns:** Validate with actual code, not docs.
- **Use code-first GraphQL:** Easier parallelization.
- **Write regression tests early:** Catch subtle behaviors.
- **Keep existing notification flows:** Avoid doubling risk.
- **Plan for data issues:** Data may violate constraints; tests should handle that.
