Cost-Free Waiting in Serverless: The AWS Step Functions Callback Pattern
Introduction
Long-running serverless tasks present a fundamental challenge: how do you wait for external operations to complete without burning compute costs? Whether you’re waiting for a third-party API to process data, a human to approve a request, or an external system to send a callback, the naive approach of polling or sleeping inside a Lambda function can quickly become expensive.
This post explores the waitForTaskToken callback pattern in AWS Step Functions—a technique that allows your workflow to pause execution completely, incurring zero compute charges, until an external event signals completion.
The Cost Problem
Consider a typical scenario: you need to call a third-party API that processes data asynchronously. The API accepts your request, does its work, and eventually calls your webhook with the results. This might take anywhere from seconds to minutes.
The Expensive Approaches
Polling (Active Waiting)
// DON'T DO THIS - expensive!
while (!isComplete) {
  await sleep(5000);
  const status = await checkStatus(jobId);
  isComplete = status === 'complete';
}
// You're paying for every second of this loop
Lambda with Long Timeout
// DON'T DO THIS - also expensive!
const response = await fetch(thirdPartyApi);
// Lambda sits idle waiting for response
// Billed for the entire duration
Cost Analysis:
- Lambda pricing: ~$0.0000166667 per GB-second
- A 1GB Lambda waiting 5 minutes = $0.005 per execution
- At 10,000 daily executions = $50/day = $1,500/month just for waiting
The Solution: waitForTaskToken
Step Functions offers a callback pattern where the state machine pauses completely—with zero compute charges—until an external system signals completion.
How It Works
┌─────────────────────────────────────────────────────────────────────────┐
│ Step Functions State Machine │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 1. Lambda receives input + TaskToken │
│ - Stores TaskToken in database │
│ - Calls 3rd party API with webhook URL │
│ - Returns immediately (Lambda exits) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 2. Step Functions PAUSES (no charges) │
│ - State machine is suspended │
│ - No compute resources consumed │
│ - Can wait up to 1 year │
└─────────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────────┴──────────────────────────┐
│ 3rd Party API processes request │
│ (takes 1 min - 30 min) │
└──────────────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 3. Webhook receives callback │
│ - Looks up TaskToken from database │
│ - Calls SendTaskSuccess or SendTaskFailure │
│ - Step Functions RESUMES execution │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 4. Workflow continues with result data │
└─────────────────────────────────────────────────────────────────────────┘
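The four steps above can be sketched as a small in-memory simulation. Nothing here calls AWS: the two maps are hypothetical stand-ins for Step Functions and the token table, and the names `startTask` and `handleWebhook` are illustrative only.

```typescript
// In-memory stand-ins for Step Functions and the token table.
// All names here are hypothetical; nothing below calls AWS.
type Execution = { status: "waiting" | "succeeded" | "failed"; output?: unknown };

const executions = new Map<string, Execution>(); // keyed by task token
const jobs = new Map<string, string>(); // externalJobId -> task token

// Step 1: the task starts, stores the token, and returns immediately.
function startTask(taskToken: string, externalJobId: string): void {
  executions.set(taskToken, { status: "waiting" });
  jobs.set(externalJobId, taskToken);
}

// Step 3: the webhook looks up the token and resumes the execution.
function handleWebhook(externalJobId: string, success: boolean, data?: unknown): boolean {
  const taskToken = jobs.get(externalJobId);
  if (taskToken === undefined) return false; // unknown job
  const execution = executions.get(taskToken);
  if (execution === undefined || execution.status !== "waiting") return false;
  execution.status = success ? "succeeded" : "failed";
  execution.output = data; // Step 4: the workflow continues with this result
  return true;
}

// Step 2 is the gap between these two calls: nothing runs while waiting.
startTask("token-abc", "job-123");
const resumed = handleWebhook("job-123", true, { profile: "example" });
```

A second callback for the same job returns `false`, which mirrors the replay protection described later in this post.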
Why This Pattern?
- Zero wait-time charges - Standard workflows are billed per state transition, not for time spent waiting (Express workflows are billed by duration and do not support this pattern)
- Scalable - Can handle thousands of concurrent waiting workflows
- Reliable - AWS manages the state; your workflow survives Lambda restarts
- Flexible timeout - Can wait up to 1 year (vs Lambda’s 15-minute max)
Implementation Deep Dive
Let’s walk through a complete implementation using SST v3, demonstrating each component of the pattern.
1. Step Functions State Machine Definition (SST v3)
The key is setting integration: "token" on your Lambda task. This tells Step Functions to pass a TaskToken and wait for a callback instead of waiting for the Lambda to return.
// infra/stepFunctions.ts
import { dbUrl, linkedInScraperApiKey, linkedInScraperApiUrl } from "./config";
import { getDomain } from "./router";

// Lambda handler for the async task
export const scrapePersonLinkedIn = new sst.aws.Function(
  "ScrapePersonLinkedIn",
  {
    handler: "packages/functions/src/question-generation/scrape-person-li.handler",
    timeout: "2 minutes",
    link: [dbUrl, linkedInScraperApiKey, linkedInScraperApiUrl],
    environment: {
      WEBHOOK_BASE_URL: getDomain({ protocol: "https", skipLocalhost: false }),
    },
  },
);

// Define the task with waitForTaskToken pattern
// The key is `integration: "token"` - this enables the callback pattern
const scrapePersonTask = sst.aws.StepFunctions.lambdaInvoke({
  name: "ScrapePersonLinkedIn",
  function: scrapePersonLinkedIn,
  payload: {
    // Pass the original input
    Payload: "{% $states.input %}",
    // CRITICAL: Pass the TaskToken so Lambda can store it
    TaskToken: "{% $states.context.Task.Token %}",
  },
  // This is what enables waitForTaskToken!
  integration: "token",
  // Callback deadline: 1800 seconds (30 minutes)
  heartbeat: 1800,
});

// Handle failures gracefully
const scrapePersonFailed = sst.aws.StepFunctions.pass({
  name: "ScrapePersonLinkedInFailed",
  output: { error: "Person LinkedIn scraping failed" },
});

// Build the workflow with error handling
const workflow = scrapePersonTask.catch(scrapePersonFailed);

// Create the state machine
export const questionGenerationStateMachine = new sst.aws.StepFunctions(
  "QuestionGenerationStateMachine",
  {
    definition: workflow,
  },
);
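For orientation, the task above corresponds roughly to a Task state like the following in Amazon States Language, written here as a TypeScript object. This is a hand-written sketch, not the output SST generates: the function ARN is a made-up placeholder, and the 30-minute `TimeoutSeconds` is an assumption standing in for the deadline configured above. The part that matters is the `.waitForTaskToken` suffix on the resource.

```typescript
// Hand-written sketch of the equivalent ASL state (JSONata mode).
// The ARN and the timeout value are placeholders, not taken from a real deployment.
const scrapePersonState = {
  Type: "Task",
  QueryLanguage: "JSONata",
  // The ".waitForTaskToken" suffix makes Step Functions pause until a callback arrives
  Resource: "arn:aws:states:::lambda:invoke.waitForTaskToken",
  Arguments: {
    FunctionName: "arn:aws:lambda:us-east-1:123456789012:function:ScrapePersonLinkedIn",
    Payload: {
      Payload: "{% $states.input %}",
      // The token the Lambda must persist for the webhook
      TaskToken: "{% $states.context.Task.Token %}",
    },
  },
  // Assumed 30-minute deadline for the callback
  TimeoutSeconds: 1800,
  Catch: [{ ErrorEquals: ["States.ALL"], Next: "ScrapePersonLinkedInFailed" }],
  End: true,
} as const;
```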
2. Lambda Handler (Async Pattern)
The Lambda receives the TaskToken, stores it in a database, calls the third-party API, and returns immediately. The Step Function stays paused until the webhook resumes it.
// packages/functions/src/question-generation/scrape-person-li.ts
import { randomUUID } from "node:crypto";
import type { Handler } from "aws-lambda";
import { getDb } from "@last10/core/src/sql";
import { getSecret } from "@last10/core/src/config/secret";
import { asyncJobs } from "@last10/core/src/sql/schema";

const ASYNC_JOB_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes

interface StepFunctionsTaskTokenInput<T> {
  Payload: T;
  TaskToken: string;
}

export const handler: Handler<
  StepFunctionsTaskTokenInput<{ customerProfile: any; demoId: string }>,
  void
> = async (event) => {
  const { Payload, TaskToken } = event;
  const { customerProfile, demoId } = Payload;

  // Validate TaskToken is present
  if (!TaskToken) {
    throw new Error("TaskToken is required for async operation");
  }

  const db = getDb();
  const jobId = randomUUID();
  const expiresAt = new Date(Date.now() + ASYNC_JOB_TIMEOUT_MS);

  // Build webhook URL for callback
  const webhookUrl = `${process.env.WEBHOOK_BASE_URL}/api/webhooks/linkedin-scraping`;

  // Get API credentials
  const apiKey = getSecret("LINKEDIN_SCRAPER_API_KEY");
  const apiUrl = getSecret("LINKEDIN_SCRAPER_API_URL");

  // Call 3rd party API with webhook URL
  const response = await fetch(apiUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": apiKey,
    },
    body: JSON.stringify({
      company_li_url: customerProfile.customerLinkedIn,
      callback_url: webhookUrl,
    }),
  });

  if (!response.ok) {
    // Throwing here fails the task, so the state machine's catch handler runs
    throw new Error(`API returned ${response.status}`);
  }

  const apiResponse = await response.json();
  const externalJobId = apiResponse.jobId;

  // CRITICAL: Store TaskToken in database for webhook to retrieve.
  // Note: the webhook can only match this job once the insert completes,
  // so a callback that arrives before this point will not find it.
  await db.insert(asyncJobs).values({
    id: jobId,
    externalJobId, // Correlates webhook callback to our job
    demoId,
    jobType: "linkedin_person_scraping",
    taskToken: TaskToken, // This is what the webhook needs!
    status: "pending",
    expiresAt,
  });

  console.log("Stored async job", { jobId, externalJobId, expiresAt });
  // Lambda returns immediately - Step Function waits for webhook
};
3. Webhook Handler (Resume Execution)
When the third-party API completes, it calls your webhook. The webhook looks up the TaskToken from the database and signals Step Functions to resume.
// packages/web/src/app/api/webhooks/linkedin-scraping/route.ts
import {
  SFNClient,
  SendTaskSuccessCommand,
  SendTaskFailureCommand,
} from "@aws-sdk/client-sfn";
import { and, eq, gt } from "drizzle-orm";
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";
import { getDb } from "@last10/core/src/sql";
import { asyncJobs } from "@last10/core/src/sql/schema";

export async function POST(request: NextRequest) {
  const payload = await request.json();
  const { jobId: externalJobId, success, data, error } = payload;

  if (!externalJobId) {
    return NextResponse.json({ error: "jobId is required" }, { status: 400 });
  }

  const db = getDb();

  // ATOMIC CLAIM: Prevents race conditions and replay attacks
  // Only succeeds if job is "pending" and not expired
  const [job] = await db
    .update(asyncJobs)
    .set({ status: "processing", updatedAt: new Date() })
    .where(
      and(
        eq(asyncJobs.externalJobId, externalJobId),
        eq(asyncJobs.status, "pending"),
        gt(asyncJobs.expiresAt, new Date()),
      ),
    )
    .returning();

  if (!job) {
    // The claim failed: work out why so the caller gets a meaningful status
    const existingJob = await db.query.asyncJobs.findFirst({
      where: eq(asyncJobs.externalJobId, externalJobId),
    });
    if (!existingJob) {
      return NextResponse.json({ error: "Job not found" }, { status: 404 });
    }
    if (existingJob.expiresAt < new Date()) {
      return NextResponse.json({ error: "Job has expired" }, { status: 410 });
    }
    return NextResponse.json({ error: "Job already processed" }, { status: 409 });
  }

  const sfnClient = new SFNClient({});

  try {
    if (success && data) {
      // Resume Step Function with SUCCESS
      await sfnClient.send(
        new SendTaskSuccessCommand({
          taskToken: job.taskToken,
          output: JSON.stringify(data), // This becomes the task output
        }),
      );
      await db
        .update(asyncJobs)
        .set({ status: "completed", result: data, updatedAt: new Date() })
        .where(eq(asyncJobs.id, job.id));
    } else {
      // Resume Step Function with FAILURE
      await sfnClient.send(
        new SendTaskFailureCommand({
          taskToken: job.taskToken,
          error: "ScrapingFailed",
          cause: error || "Unknown error",
        }),
      );
      await db
        .update(asyncJobs)
        .set({ status: "failed", error, updatedAt: new Date() })
        .where(eq(asyncJobs.id, job.id));
    }
    return NextResponse.json({ success: true });
  } catch (err) {
    // Note: the job stays in "processing" here, so a retried webhook gets 409
    console.error("Failed to resume Step Function", err);
    return NextResponse.json({ error: "Failed to resume" }, { status: 500 });
  }
}
4. Database Schema (Drizzle ORM)
Store task tokens with expiration and status tracking for reliability:
// packages/core/src/sql/schema/asyncJobs.ts
import { jsonb, pgEnum, pgTable, text, timestamp, uuid } from "drizzle-orm/pg-core";
import { demos } from "./demos"; // assumed path to the demos table definition

export const asyncJobStatusEnum = pgEnum("async_job_status", [
  "pending",
  "processing",
  "completed",
  "failed",
  "expired",
]);

export const asyncJobs = pgTable("async_jobs", {
  id: uuid("id").primaryKey().defaultRandom(),
  createdAt: timestamp("created_at").notNull().defaultNow(),
  updatedAt: timestamp("updated_at").notNull().defaultNow(),
  // Reference to parent entity
  demoId: uuid("demo_id").notNull().references(() => demos.id),
  // Job classification
  jobType: text("job_type").notNull(),
  // CRITICAL: Step Functions task token for callback
  taskToken: text("task_token").notNull(),
  // External job ID from 3rd party API (for correlation)
  externalJobId: text("external_job_id"),
  // Job status with enum
  status: asyncJobStatusEnum("status").default("pending").notNull(),
  // Expiration timestamp
  expiresAt: timestamp("expires_at").notNull(),
  // Result data from async operation
  result: jsonb("result"),
  // Error message if failed
  error: text("error"),
});
5. IAM Permissions
Grant your webhook handler permission to call Step Functions APIs:
// infra/nextPage.ts
import { questionGenerationStateMachine } from "./stepFunctions";

// IAM permissions for webhook to resume Step Functions
const stepFunctionsPermissions = {
  actions: ["states:SendTaskSuccess", "states:SendTaskFailure"],
  resources: [questionGenerationStateMachine.nodes.stateMachine.arn],
};

export const nextJsPage = new sst.aws.Nextjs("MyApp", {
  // ... other config
  link: [questionGenerationStateMachine],
  permissions: [stepFunctionsPermissions],
});
Security Considerations
1. Atomic Updates Prevent Race Conditions
The webhook handler uses an atomic UPDATE with WHERE clause to claim jobs:
const [job] = await db
  .update(asyncJobs)
  .set({ status: "processing" })
  .where(
    and(
      eq(asyncJobs.externalJobId, externalJobId),
      eq(asyncJobs.status, "pending"), // Only claim if still pending
      gt(asyncJobs.expiresAt, new Date()), // Only claim if not expired
    ),
  )
  .returning();
This ensures:
- Only one webhook call can claim a job (prevents double-processing)
- Expired jobs cannot be claimed
- Race conditions are handled at the database level
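The claim logic can be illustrated without a database. The sketch below is an in-memory analogue of the conditional UPDATE, with a hypothetical `claimJob` function; it runs single-threaded, so it only shows the rule being enforced, whereas in Postgres the atomicity comes from the row-level lock taken by the UPDATE itself.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed" | "expired";

interface Job {
  status: JobStatus;
  expiresAt: number; // epoch milliseconds
}

// Mimics: UPDATE ... SET status = 'processing'
//         WHERE status = 'pending' AND expires_at > now RETURNING *
// Returns true only for the caller that actually flips the status.
function claimJob(job: Job, now: number): boolean {
  if (job.status !== "pending" || job.expiresAt <= now) return false;
  job.status = "processing";
  return true;
}

const now = 1000000;
const job: Job = { status: "pending", expiresAt: now + 60000 };

const firstClaim = claimJob(job, now); // first webhook delivery wins
const secondClaim = claimJob(job, now); // duplicate delivery is rejected

const expiredJob: Job = { status: "pending", expiresAt: now - 1 };
const expiredClaim = claimJob(expiredJob, now); // past the deadline, rejected
```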
2. Expiration Prevents Stale Tokens
A task token is only useful while its task is still waiting. The 30-minute expiration, matched to the task's callback deadline, ensures:
- Old tokens can’t be used to resume workflows
- Failed/abandoned jobs don’t accumulate
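The expiration check rejects stale callbacks, but rows for abandoned jobs stay "pending" until something updates them. One option is a periodic sweep that moves them to the "expired" status. The sketch below shows only the selection rule as a pure function; `findExpiredJobIds` and the `JobRow` shape are hypothetical, and the actual database update is left out.

```typescript
interface JobRow {
  id: string;
  status: string;
  expiresAt: Date;
}

// Returns the ids of jobs that are still "pending" but past their deadline.
// A scheduled task could then set these rows to "expired".
function findExpiredJobIds(rows: JobRow[], now: Date): string[] {
  return rows
    .filter((row) => row.status === "pending" && row.expiresAt.getTime() <= now.getTime())
    .map((row) => row.id);
}

const sweepTime = new Date(2000000);
const expiredIds = findExpiredJobIds(
  [
    { id: "a", status: "pending", expiresAt: new Date(1000000) }, // overdue
    { id: "b", status: "pending", expiresAt: new Date(3000000) }, // still valid
    { id: "c", status: "completed", expiresAt: new Date(1000000) }, // already done
  ],
  sweepTime,
);
```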
3. Replay Prevention
The status transition pending → processing is atomic. Replayed webhooks will fail with “Job already processed” (409 Conflict).
4. External Job IDs as Correlation Keys
Using the third-party API’s job ID for correlation means:
- Attackers can’t easily guess valid job IDs, provided the third party issues random, unguessable IDs
- Each callback maps to exactly one job
Cost Comparison
Scenario: 10,000 daily API calls, average 5-minute wait
Polling Approach:
- Lambda running for 5 minutes × 10,000 = 50,000 minutes/day
- At 1GB memory: 50,000 × 60 = 3,000,000 GB-seconds
- Cost: 3,000,000 × $0.0000166667 = $50/day = $1,500/month
Callback Pattern:
- Lambda runs for ~2 seconds (initiate + webhook) × 10,000 = 20,000 seconds
- At 1GB memory: 20,000 GB-seconds
- Lambda cost: 20,000 × $0.0000166667 = $0.33/day
- Step Functions: 10,000 × 2 transitions × $0.000025 = $0.50/day
- Total: $0.83/day = $25/month
Savings: 98% ($1,500 → $25/month)
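The arithmetic above can be reproduced with a few lines of TypeScript, which makes it easy to plug in your own volumes. The prices are the ones quoted in this post; the function names are illustrative, and actual pricing varies by region and over time.

```typescript
// Prices as quoted in this post; check current AWS pricing for your region.
const LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667;
const SFN_PRICE_PER_TRANSITION = 0.000025;

// Lambda compute cost for a number of executions of a given duration and memory size.
function lambdaCost(executions: number, seconds: number, memoryGb: number): number {
  return executions * seconds * memoryGb * LAMBDA_PRICE_PER_GB_SECOND;
}

// Callback pattern: short Lambda runs plus Step Functions state transitions.
function callbackCost(
  executions: number,
  activeSeconds: number,
  memoryGb: number,
  transitionsPerExecution: number,
): number {
  const compute = lambdaCost(executions, activeSeconds, memoryGb);
  const transitions = executions * transitionsPerExecution * SFN_PRICE_PER_TRANSITION;
  return compute + transitions;
}

// Scenario from the post: 10,000 daily calls, 5-minute wait, 1 GB memory.
const pollingPerDay = lambdaCost(10000, 300, 1);
const callbackPerDay = callbackCost(10000, 2, 1, 2);
const savingsRatio = 1 - callbackPerDay / pollingPerDay;
```

With these inputs, `pollingPerDay` comes to about $50 and `callbackPerDay` to about $0.83, matching the figures above.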
Common Pitfalls
1. Forgetting to Pass TaskToken
// WRONG - no TaskToken in payload
payload: { Payload: "{% $states.input %}" }

// CORRECT - include TaskToken
payload: {
  Payload: "{% $states.input %}",
  TaskToken: "{% $states.context.Task.Token %}"
}
2. Lambda Returning a Value
With integration: "token", the Lambda’s return value is ignored. The output comes from SendTaskSuccess. Don’t waste time formatting a return value.
3. Missing IAM Permissions
The webhook handler needs states:SendTaskSuccess and states:SendTaskFailure permissions on the specific state machine ARN.
4. Not Handling Expiration
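Task tokens expire. Always set and check expiration timestamps. In the state machine, catch States.Timeout (or States.HeartbeatTimeout when a heartbeat window is used). In the webhook, expect SendTaskSuccess and SendTaskFailure to throw TaskTimedOut if the task has already timed out.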
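On the webhook side, one way to handle this is to map the error name from the failed SendTaskSuccess or SendTaskFailure call to an HTTP status. The sketch below is one possible mapping, not the only reasonable one: `statusForSendTaskError` is a hypothetical helper, and the status codes are a design choice. The error names `TaskTimedOut`, `TaskDoesNotExist`, and `InvalidToken` are the ones the Step Functions API documents for these calls.

```typescript
// Maps an error thrown while resuming a task to an HTTP status for the webhook caller.
// It inspects only the error name, so it does not depend on the AWS SDK at compile time.
function statusForSendTaskError(err: unknown): number {
  const name = err instanceof Error ? err.name : "";
  switch (name) {
    case "TaskTimedOut": // the task already hit its deadline
    case "TaskDoesNotExist": // the task or its execution is gone
      return 410; // retrying will not help
    case "InvalidToken": // the stored token is malformed
      return 400;
    default:
      return 500; // possibly transient; the caller may retry
  }
}

// Usage example with a simulated SDK error.
const simulatedTimeout = new Error("Task timed out");
simulatedTimeout.name = "TaskTimedOut";
const timeoutStatus = statusForSendTaskError(simulatedTimeout);
```

Returning 410 for a timed-out task tells a well-behaved sender to stop retrying, while 500 leaves room for retries on transient failures.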
Conclusion
The waitForTaskToken pattern transforms how you handle async operations in serverless architectures. By storing task tokens in a database and resuming execution via webhooks, you eliminate wait-time compute charges while building more resilient, scalable systems.
Key takeaways:
- Use integration: "token" to enable the callback pattern
- Store the TaskToken in a database with an expiration
- Resume with SendTaskSuccess or SendTaskFailure
- Use atomic updates to prevent race conditions
- Expect 95%+ cost reduction for long-running async operations
This pattern is particularly valuable when integrating with:
- Third-party APIs with webhook callbacks
- Human approval workflows
- Long-running batch processes
- External systems with unpredictable response times
The code examples in this post are production-ready and battle-tested. Adapt them to your use case and start saving on serverless costs today.