AS
JD

Cost-Free Waiting in Serverless: The AWS Step Functions Callback Pattern

· 10 phút đọc

Cost-Free Waiting in Serverless: The AWS Step Functions Callback Pattern

Introduction

Long-running serverless tasks present a fundamental challenge: how do you wait for external operations to complete without burning compute costs? Whether you’re waiting for a third-party API to process data, a human to approve a request, or an external system to send a callback, the naive approach of polling or sleeping inside a Lambda function can quickly become expensive.

This post explores the waitForTaskToken callback pattern in AWS Step Functions—a technique that allows your workflow to pause execution completely, incurring zero compute charges, until an external event signals completion.

The Cost Problem

Consider a typical scenario: you need to call a third-party API that processes data asynchronously. The API accepts your request, does its work, and eventually calls your webhook with the results. This might take anywhere from seconds to minutes.

The Expensive Approaches

Polling (Active Waiting)

// DON'T DO THIS - expensive!
while (!isComplete) {
  await sleep(5000);
  const status = await checkStatus(jobId);
  isComplete = status === 'complete';
}
// You're paying for every second of this loop

Lambda with Long Timeout

// DON'T DO THIS - also expensive!
const response = await fetch(thirdPartyApi);
// Lambda sits idle waiting for response
// Billed for the entire duration

Cost Analysis:

  • Lambda pricing: ~$0.0000166667 per GB-second
  • A 1GB Lambda waiting 5 minutes = $0.005 per execution
  • At 10,000 daily executions = $50/day = $1,500/month just for waiting

The Solution: waitForTaskToken

Step Functions offers a callback pattern where the state machine pauses completely—with zero compute charges—until an external system signals completion.

How It Works

┌─────────────────────────────────────────────────────────────────────────┐
│                        Step Functions State Machine                      │
└─────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────┐
│  1. Lambda receives input + TaskToken                                    │
│     - Stores TaskToken in database                                       │
│     - Calls 3rd party API with webhook URL                               │
│     - Returns immediately (Lambda exits)                                 │
└─────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────┐
│  2. Step Functions PAUSES (no charges)                                   │
│     - State machine is suspended                                         │
│     - No compute resources consumed                                      │
│     - Can wait up to 1 year                                              │
└─────────────────────────────────────────────────────────────────────────┘

         ┌──────────────────────────┴──────────────────────────┐
         │         3rd Party API processes request              │
         │              (takes 1 min - 30 min)                  │
         └──────────────────────────┬──────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────┐
│  3. Webhook receives callback                                            │
│     - Looks up TaskToken from database                                   │
│     - Calls SendTaskSuccess or SendTaskFailure                           │
│     - Step Functions RESUMES execution                                   │
└─────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────┐
│  4. Workflow continues with result data                                  │
└─────────────────────────────────────────────────────────────────────────┘

Why This Pattern?

  1. Zero wait-time charges - Step Functions only charges for state transitions, not for waiting
  2. Scalable - Can handle thousands of concurrent waiting workflows
  3. Reliable - AWS manages the state; your workflow survives Lambda restarts
  4. Flexible timeout - Can wait up to 1 year (vs Lambda’s 15-minute max)

Implementation Deep Dive

Let’s walk through a complete implementation using SST v3, demonstrating each component of the pattern.

1. Step Functions State Machine Definition (SST v3)

The key is setting integration: "token" on your Lambda task. This tells Step Functions to pass a TaskToken and wait for a callback instead of waiting for the Lambda to return.

// infra/stepFunctions.ts
import { dbUrl, linkedInScraperApiKey, linkedInScraperApiUrl } from "./config";
import { getDomain } from "./router";

// Lambda handler for the async task
export const scrapePersonLinkedIn = new sst.aws.Function(
  "ScrapePersonLinkedIn",
  {
    handler: "packages/functions/src/question-generation/scrape-person-li.handler",
    timeout: "2 minutes",
    link: [dbUrl, linkedInScraperApiKey, linkedInScraperApiUrl],
    environment: {
      WEBHOOK_BASE_URL: getDomain({ protocol: "https", skipLocalhost: false }),
    },
  },
);

// Define the task with waitForTaskToken pattern
// The key is `integration: "token"` - this enables the callback pattern
const scrapePersonTask = sst.aws.StepFunctions.lambdaInvoke({
  name: "ScrapePersonLinkedIn",
  function: scrapePersonLinkedIn,
  payload: {
    // Pass the original input
    Payload: "{% $states.input %}",
    // CRITICAL: Pass the TaskToken so Lambda can store it
    TaskToken: "{% $states.context.Task.Token %}",
  },
  // This is what enables waitForTaskToken!
  integration: "token",
  // Timeout for the callback (30 minutes)
  heartbeat: 1800,
});

// Handle failures gracefully
const scrapePersonFailed = sst.aws.StepFunctions.pass({
  name: "ScrapePersonLinkedInFailed",
  output: { error: "Person LinkedIn scraping failed" },
});

// Build the workflow with error handling
const workflow = scrapePersonTask.catch(scrapePersonFailed);

// Create the state machine
export const questionGenerationStateMachine = new sst.aws.StepFunctions(
  "QuestionGenerationStateMachine",
  {
    definition: workflow,
  },
);

2. Lambda Handler (Async Pattern)

The Lambda receives the TaskToken, stores it in a database, calls the third-party API, and returns immediately. The Step Function stays paused until the webhook resumes it.

// packages/functions/src/question-generation/scrape-person-li.ts
import type { Handler } from "aws-lambda";
import { getDb } from "@last10/core/src/sql";
import { getSecret } from "@last10/core/src/config/secret";
import { asyncJobs } from "@last10/core/src/sql/schema";

const ASYNC_JOB_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes

interface StepFunctionsTaskTokenInput<T> {
  Payload: T;
  TaskToken: string;
}

export const handler: Handler<
  StepFunctionsTaskTokenInput<{ customerProfile: any; demoId: string }>,
  void
> = async (event) => {
  const { Payload, TaskToken } = event;
  const { customerProfile, demoId } = Payload;

  // Validate TaskToken is present
  if (!TaskToken) {
    throw new Error("TaskToken is required for async operation");
  }

  const db = getDb();
  const jobId = crypto.randomUUID();
  const expiresAt = new Date(Date.now() + ASYNC_JOB_TIMEOUT_MS);

  // Build webhook URL for callback
  const webhookUrl = `${process.env.WEBHOOK_BASE_URL}/api/webhooks/linkedin-scraping`;

  // Get API credentials
  const apiKey = getSecret("LINKEDIN_SCRAPER_API_KEY");
  const apiUrl = getSecret("LINKEDIN_SCRAPER_API_URL");

  // Call 3rd party API with webhook URL
  const response = await fetch(apiUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": apiKey,
    },
    body: JSON.stringify({
      company_li_url: customerProfile.customerLinkedIn,
      callback_url: webhookUrl,
    }),
  });

  if (!response.ok) {
    throw new Error(`API returned ${response.status}`);
  }

  const apiResponse = await response.json();
  const externalJobId = apiResponse.jobId;

  // CRITICAL: Store TaskToken in database for webhook to retrieve
  await db.insert(asyncJobs).values({
    id: jobId,
    externalJobId,      // Correlates webhook callback to our job
    demoId,
    jobType: "linkedin_person_scraping",
    taskToken: TaskToken, // This is what the webhook needs!
    status: "pending",
    expiresAt,
  });

  console.log("Stored async job", { jobId, externalJobId, expiresAt });

  // Lambda returns immediately - Step Function waits for webhook
};

3. Webhook Handler (Resume Execution)

When the third-party API completes, it calls your webhook. The webhook looks up the TaskToken from the database and signals Step Functions to resume.

// packages/web/src/app/api/webhooks/linkedin-scraping/route.ts
import {
  SFNClient,
  SendTaskSuccessCommand,
  SendTaskFailureCommand,
} from "@aws-sdk/client-sfn";
import { and, eq, gt } from "drizzle-orm";
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";
import { getDb } from "@last10/core/src/sql";
import { asyncJobs } from "@last10/core/src/sql/schema";

export async function POST(request: NextRequest) {
  const payload = await request.json();
  const { jobId: externalJobId, success, data, error } = payload;

  if (!externalJobId) {
    return NextResponse.json({ error: "jobId is required" }, { status: 400 });
  }

  const db = getDb();

  // ATOMIC CLAIM: Prevents race conditions and replay attacks
  // Only succeeds if job is "pending" and not expired
  const [job] = await db
    .update(asyncJobs)
    .set({ status: "processing", updatedAt: new Date() })
    .where(
      and(
        eq(asyncJobs.externalJobId, externalJobId),
        eq(asyncJobs.status, "pending"),
        gt(asyncJobs.expiresAt, new Date()),
      ),
    )
    .returning();

  if (!job) {
    // Handle various error cases...
    const existingJob = await db.query.asyncJobs.findFirst({
      where: eq(asyncJobs.externalJobId, externalJobId),
    });

    if (!existingJob) {
      return NextResponse.json({ error: "Job not found" }, { status: 404 });
    }
    if (existingJob.expiresAt < new Date()) {
      return NextResponse.json({ error: "Job has expired" }, { status: 410 });
    }
    return NextResponse.json({ error: "Job already processed" }, { status: 409 });
  }

  const sfnClient = new SFNClient({});

  try {
    if (success && data) {
      // Resume Step Function with SUCCESS
      await sfnClient.send(
        new SendTaskSuccessCommand({
          taskToken: job.taskToken,
          output: JSON.stringify(data),  // This becomes the task output
        }),
      );

      await db
        .update(asyncJobs)
        .set({ status: "completed", result: data, updatedAt: new Date() })
        .where(eq(asyncJobs.id, job.id));
    } else {
      // Resume Step Function with FAILURE
      await sfnClient.send(
        new SendTaskFailureCommand({
          taskToken: job.taskToken,
          error: "ScrapingFailed",
          cause: error || "Unknown error",
        }),
      );

      await db
        .update(asyncJobs)
        .set({ status: "failed", error, updatedAt: new Date() })
        .where(eq(asyncJobs.id, job.id));
    }

    return NextResponse.json({ success: true });
  } catch (err) {
    console.error("Failed to resume Step Function", err);
    return NextResponse.json({ error: "Failed to resume" }, { status: 500 });
  }
}

4. Database Schema (Drizzle ORM)

Store task tokens with expiration and status tracking for reliability:

// packages/core/src/sql/schema/asyncJobs.ts
import { jsonb, text, timestamp, uuid, pgEnum } from "drizzle-orm/pg-core";

export const asyncJobStatusEnum = pgEnum("async_job_status", [
  "pending",
  "processing",
  "completed",
  "failed",
  "expired"
]);

export const asyncJobs = pgTable("async_jobs", {
  id: uuid("id").primaryKey().defaultRandom(),
  createdAt: timestamp("created_at").notNull().defaultNow(),
  updatedAt: timestamp("updated_at").notNull().defaultNow(),

  // Reference to parent entity
  demoId: uuid("demo_id").notNull().references(() => demos.id),

  // Job classification
  jobType: text("job_type").notNull(),

  // CRITICAL: Step Functions task token for callback
  taskToken: text("task_token").notNull(),

  // External job ID from 3rd party API (for correlation)
  externalJobId: text("external_job_id"),

  // Job status with enum
  status: asyncJobStatusEnum("status").default("pending").notNull(),

  // Expiration timestamp
  expiresAt: timestamp("expires_at").notNull(),

  // Result data from async operation
  result: jsonb("result"),

  // Error message if failed
  error: text("error"),
});

5. IAM Permissions

Grant your webhook handler permission to call Step Functions APIs:

// infra/nextPage.ts
import { questionGenerationStateMachine } from "./stepFunctions";

// IAM permissions for webhook to resume Step Functions
const stepFunctionsPermissions = {
  actions: ["states:SendTaskSuccess", "states:SendTaskFailure"],
  resources: [questionGenerationStateMachine.nodes.stateMachine.arn],
};

export const nextJsPage = new sst.aws.Nextjs("MyApp", {
  // ... other config
  link: [questionGenerationStateMachine],
  permissions: [stepFunctionsPermissions],
});

Security Considerations

1. Atomic Updates Prevent Race Conditions

The webhook handler uses an atomic UPDATE with WHERE clause to claim jobs:

const [job] = await db
  .update(asyncJobs)
  .set({ status: "processing" })
  .where(
    and(
      eq(asyncJobs.externalJobId, externalJobId),
      eq(asyncJobs.status, "pending"),  // Only claim if still pending
      gt(asyncJobs.expiresAt, new Date()),  // Only claim if not expired
    ),
  )
  .returning();

This ensures:

  • Only one webhook call can claim a job (prevents double-processing)
  • Expired jobs cannot be claimed
  • Race conditions are handled at the database level

2. Expiration Prevents Stale Tokens

Task tokens have a limited lifetime. The 30-minute expiration ensures:

  • Old tokens can’t be used to resume workflows
  • Failed/abandoned jobs don’t accumulate

3. Replay Prevention

The status transition pending → processing is atomic. Replayed webhooks will fail with “Job already processed” (409 Conflict).

4. External Job IDs as Correlation Keys

Using the third-party API’s job ID for correlation means:

  • Attackers can’t guess valid job IDs
  • Each callback maps to exactly one job

Cost Comparison

Scenario: 10,000 daily API calls, average 5-minute wait

Polling Approach:

  • Lambda running for 5 minutes × 10,000 = 50,000 minutes/day
  • At 1GB memory: 50,000 × 60 = 3,000,000 GB-seconds
  • Cost: 3,000,000 × $0.0000166667 = $50/day = $1,500/month

Callback Pattern:

  • Lambda runs for ~2 seconds (initiate + webhook) × 10,000 = 20,000 seconds
  • At 1GB memory: 20,000 GB-seconds
  • Lambda cost: 20,000 × $0.0000166667 = $0.33/day
  • Step Functions: 10,000 × 2 transitions × $0.000025 = $0.50/day
  • Total: $0.83/day = $25/month

Savings: 98% ($1,500 → $25/month)

Common Pitfalls

1. Forgetting to Pass TaskToken

// WRONG - no TaskToken in payload
payload: { Payload: "{% $states.input %}" }

// CORRECT - include TaskToken
payload: {
  Payload: "{% $states.input %}",
  TaskToken: "{% $states.context.Task.Token %}"
}

2. Lambda Returning a Value

With integration: "token", the Lambda’s return value is ignored. The output comes from SendTaskSuccess. Don’t waste time formatting a return value.

3. Missing IAM Permissions

The webhook handler needs states:SendTaskSuccess and states:SendTaskFailure permissions on the specific state machine ARN.

4. Not Handling Expiration

Task tokens expire. Always set and check expiration timestamps. Handle the TaskTimedOut error in your state machine.

Conclusion

The waitForTaskToken pattern transforms how you handle async operations in serverless architectures. By storing task tokens in a database and resuming execution via webhooks, you eliminate wait-time compute charges while building more resilient, scalable systems.

Key takeaways:

  1. Use integration: "token" to enable callback pattern
  2. Store TaskToken in a database with expiration
  3. Resume with SendTaskSuccess or SendTaskFailure
  4. Use atomic updates to prevent race conditions
  5. Expect 95%+ cost reduction for long-running async operations

This pattern is particularly valuable when integrating with:

  • Third-party APIs with webhook callbacks
  • Human approval workflows
  • Long-running batch processes
  • External systems with unpredictable response times

The code examples in this post are production-ready and battle-tested. Adapt them to your use case and start saving on serverless costs today.

Share: