Mastering Prompt Optimization with Amazon Bedrock: A Step-by-Step Migration and Improvement Guide
Introduction
Amazon Bedrock's Advanced Prompt Optimization tool empowers you to refine prompts for any supported model, compare up to five models side by side, and smoothly transition between models without losing performance. This guide walks you through the entire process—from preparing your data to launching an optimization job—so you can boost accuracy, reduce costs, and ensure your prompts work flawlessly across different LLMs.

What You Need
- An active AWS account with permissions to use Amazon Bedrock and access to the Advanced Prompt Optimization feature.
- Prompt templates in JSONL format (one JSON object per line) that include:
  - version (fixed value: bedrock-2026-05-14)
  - templateId (unique string)
  - promptTemplate (your prompt with variable placeholders)
  - evaluationSamples (at least one sample with inputVariables and referenceResponse)
  - Optional but recommended: steeringCriteria, customEvaluationMetricLabel, customLLMJConfig, or evaluationMetricLambdaArn
- Example user inputs for your variable values (text, PNG, JPG, or PDF allowed for multimodal tasks).
- Ground truth answers for each example to serve as reference responses.
- An evaluation metric or rewriting guidance—choose one of:
- An AWS Lambda function ARN
- An LLM-as-a-judge rubric (custom prompt + model ID)
- A short natural language description
- Up to five inference models you want to test (select your current model as baseline if migrating).
Step-by-Step Instructions
Step 1: Navigate to the Advanced Prompt Optimization Page
Log in to the Amazon Bedrock console and choose Advanced Prompt Optimization from the left navigation panel. Click Create prompt optimization to start a new job.
Step 2: Select Inference Models
On the model selection screen, pick up to five models that you want to evaluate. If you are migrating from an existing model, include your current model as a baseline. Otherwise, select your preferred model to compare the original and optimized versions.
Step 3: Prepare and Upload Your Prompt Templates
Create a JSONL file where each line is a valid JSON object. Use the structure described in the prerequisites. For example:
```json
{
  "version": "bedrock-2026-05-14",
  "templateId": "doc-analysis-v1",
  "promptTemplate": "Analyze this document: {{user_doc}}",
  "steeringCriteria": ["Focus on key insights"],
  "customEvaluationMetricLabel": "accuracy",
  "evaluationSamples": [
    {
      "inputVariables": [{"user_doc": "Sales report Q3..."}],
      "referenceResponse": "Revenue increased by 15%..."
    }
  ]
}
```
Upload the file via the console, or provide an S3 path.
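Before starting the job, it can help to sanity-check the file so the run doesn't fail on a malformed line. Below is a minimal Python sketch, assuming a local file named prompt_templates.jsonl and a placeholder S3 bucket; it verifies that each line parses as a JSON object with the required keys, then stages the file with boto3.

```python
import json

import boto3

REQUIRED_KEYS = {"version", "templateId", "promptTemplate", "evaluationSamples"}

def validate_jsonl(path: str) -> None:
    """Fail fast if any line is not a JSON object with the required keys."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            if not record["evaluationSamples"]:
                raise ValueError(f"line {lineno}: needs at least one evaluation sample")

validate_jsonl("prompt_templates.jsonl")

# Stage the file in S3 so the job can reference it by path
# (bucket and key are placeholders for your own).
boto3.client("s3").upload_file(
    "prompt_templates.jsonl", "my-prompt-bucket", "templates/prompt_templates.jsonl"
)
```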
Step 4: Define the Evaluation Metric
Choose one of these methods to guide optimization:
- Lambda function – Provide the ARN of a function that receives model responses and returns a numeric score (a sketch follows below).
- LLM-as-a-judge – Provide a custom prompt and a model ID to act as judge.
- Natural language description – Write a short instruction like “Maximize factual accuracy and conciseness.”
If using a custom metric, specify a customEvaluationMetricLabel in your JSONL.
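If you go the Lambda route, note that this guide doesn't document the exact event contract, so the field names below (modelResponse, referenceResponse) are assumptions rather than a published schema. The sketch scores a response by whether it contains the reference answer:

```python
def lambda_handler(event, context):
    """Hypothetical evaluator; the event field names are assumptions,
    not a documented contract."""
    response = event.get("modelResponse", "")
    reference = event.get("referenceResponse", "")
    # Crude exact-containment check; swap in token overlap or embedding
    # similarity for anything beyond a smoke test.
    score = 1.0 if reference.strip().lower() in response.strip().lower() else 0.0
    return {"score": score}
```

A real metric would usually be fuzzier than substring matching; the point is simply that the function maps a response-reference pair to a number the optimizer can maximize.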

Step 5: Launch the Optimization Job
After uploading and configuring, click Start optimization. The tool runs a feedback loop: it rewrites your prompt, scores the responses against your chosen metric, and iterates. The job may take several minutes depending on the number of samples and models.
Step 6: Review Results
Once complete, you’ll see a report comparing original vs. optimized prompts for each model. The report includes:
- Evaluation scores for each prompt version
- Cost estimates per inference call after optimization
- Latency figures
Use these to identify the best-performing prompt for your use case. If you selected multiple models, you can compare across them to find the sweet spot of accuracy, cost, and speed.
Step 7: Deploy the Optimized Prompt
Once satisfied, copy the final prompt template and integrate it into your application. Test on a few real-world examples to confirm no regressions occur on previously well-performing tasks.
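As a quick integration check, you can fill in the template variables and call your chosen model through the Bedrock Converse API. A minimal sketch, assuming the optimized template uses a {{user_doc}} placeholder and that the model ID matches whichever model won your comparison:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Optimized template copied from the report; {{user_doc}} is the placeholder.
template = "Analyze this document: {{user_doc}}"
prompt = template.replace("{{user_doc}}", "Sales report Q3: revenue grew 15%...")

# Model ID is a placeholder; substitute your winning model.
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```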
Tips for Success
- Start small: Begin with 3–5 representative samples to avoid long optimization runs. Scale up once you verify the process works.
- Use multimodal inputs wisely: PNG, JPG, and PDF are supported—leverage them for tasks like document analysis or image captioning.
- Validate ground truth: Ensure your reference responses are accurate and consistent; garbage in, garbage out.
- Compare baseline first: When migrating, always include your current model to confirm optimized prompts don’t degrade performance.
- Iterate on steering criteria: Add concise steering criteria to guide the optimizer toward desired behavior (e.g., “Be concise” or “Always cite sources”).
- Monitor costs: The report shows estimated costs—choose a model that balances quality and budget.
- Use LLM-as-a-judge for nuanced tasks: A well-crafted judge prompt can capture complex evaluation dimensions better than a Lambda function.