Mastering Prompt Optimization with Amazon Bedrock: Your Comprehensive Guide

Welcome to our deep dive into Amazon Bedrock's latest feature: Advanced Prompt Optimization. This powerful tool helps you fine-tune prompts for any model on the platform, compare performance across multiple models, and smoothly migrate to new LLMs—all while keeping your existing use cases intact. Whether you're looking to boost accuracy, cut costs, or reduce latency, this guide answers your most pressing questions. Use the links below to jump to each topic.

What exactly is Amazon Bedrock Advanced Prompt Optimization?
How does the prompt optimizer work behind the scenes?
When should I use this tool—migration or performance improvement?
How many models can I test at once, and do I need a baseline?
What input format does the optimizer require?
Can I work with multimodal inputs like images or PDFs?
What evaluation metrics does the optimizer support?
How do I get started in the AWS console?

What exactly is Amazon Bedrock Advanced Prompt Optimization?

Amazon Bedrock Advanced Prompt Optimization is a cloud-based tool that automatically refines your prompt templates to get better results from any model on the Bedrock platform. Instead of manually tweaking prompts through trial and error, you supply a template, example user inputs, ground truth answers, and an evaluation metric. The optimizer then runs a metric-driven feedback loop: it generates multiple prompt variants, tests them against the evaluation metric, and iterates until it finds the best-performing version. You can see side-by-side comparisons of original vs. optimized prompts, along with cost and latency estimates. This helps you migrate from one model to another without regression, or simply squeeze more performance out of your current setup. The whole process is designed to save time and improve prompt quality systematically.

How does the prompt optimizer work behind the scenes?

The optimizer uses a metric-driven feedback loop. First, you upload a prompt template in JSONL format, along with sample inputs and correct outputs (ground truth). You also specify how success should be measured, using a short natural language description, an AWS Lambda function, or an LLM-as-a-judge rubric. The system then generates several prompt variations, scores them with the chosen evaluation metric, and compares the results. It repeats this loop, progressively refining the prompt, until the metric stops improving significantly. When the run finishes, it outputs the original and final prompt templates, plus evaluation scores, estimated costs, and latency for each.
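To make the loop concrete, here is a minimal sketch in Python. It is not Bedrock's actual algorithm, which the source does not describe in detail. The function optimize_prompt, the min_gain stopping threshold, and the toy variant generator and scorer are all hypothetical stand-ins for the service's variant generation and your chosen evaluation metric.

```python
from typing import Callable, List, Tuple


def optimize_prompt(
    template: str,
    propose_variants: Callable[[str], List[str]],
    score: Callable[[str], float],
    min_gain: float = 0.01,   # assumed stopping threshold, not a Bedrock setting
    max_rounds: int = 10,
) -> Tuple[str, float]:
    """Greedy loop: keep the best-scoring variant until gains fall below min_gain."""
    best, best_score = template, score(template)
    for _ in range(max_rounds):
        candidates = propose_variants(best)
        if not candidates:
            break
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score - best_score < min_gain:
            break  # the metric has stopped improving significantly
        best, best_score = top, top_score
    return best, best_score


# Toy stand-ins so the sketch runs without any model calls.
HINTS = ["Answer concisely.", "Cite the source text."]


def toy_variants(prompt: str) -> List[str]:
    # Propose one variant per hint that is not in the prompt yet.
    return [f"{prompt} {hint}" for hint in HINTS if hint not in prompt]


def toy_score(prompt: str) -> float:
    # Fraction of hints present; a real metric would score model responses.
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


best, best_score = optimize_prompt("Summarize: {{document}}", toy_variants, toy_score)
print(best, best_score)
```

In a real run, the scorer would call the model with each variant and compare its responses to your ground truth answers rather than inspecting the prompt text.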

When should I use this tool—migration or performance improvement?

You can use Advanced Prompt Optimization in two primary scenarios. Model migration: If you're switching from one LLM to another (e.g., from Anthropic Claude to AI21 Labs Jurassic), select your current model as a baseline and up to four candidate models. The optimizer will tune prompts for each new model while testing your known use cases to avoid regressions. Performance improvement: Even if you're staying with your current model, you can still run optimization to boost accuracy on underperforming tasks. Simply select your current model (and optionally others for comparison). The tool will refine your prompt to better match your evaluation criteria, often revealing prompt patterns you hadn't considered. In both cases, you get concrete before-and-after metrics to validate improvements.

How many models can I test at once, and do I need a baseline?

You can test up to five inference models simultaneously with a single optimization run. When migrating, it's recommended to include your current model as a baseline. This ensures the optimized prompts for new models perform at least as well on your existing use cases. For pure optimization without migration, you can just select your current model alone, but adding other models gives you a broader view of which one best matches your optimized prompt. There's no strict requirement to use a baseline—if you're exploring new models, you can omit it. However, including a baseline makes regression testing straightforward. The console allows you to pick any combination of models available on Amazon Bedrock, so you can compare outputs side by side after optimization.
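If you script your run setup, a small pre-check can catch an oversized model selection before you reach the console. The sketch below only encodes the five-model limit and the optional baseline described above. The function build_model_list and the model names are hypothetical placeholders, not real Bedrock model IDs.

```python
from typing import List, Optional

MAX_MODELS = 5  # limit per optimization run, as stated in the text


def build_model_list(candidates: List[str], baseline: Optional[str] = None) -> List[str]:
    """Return the baseline (if any) followed by candidates, without duplicates."""
    ordered = ([baseline] if baseline else []) + candidates
    models = list(dict.fromkeys(ordered))  # de-duplicate while keeping order
    if not models:
        raise ValueError("select at least one model")
    if len(models) > MAX_MODELS:
        raise ValueError(f"at most {MAX_MODELS} models per run, got {len(models)}")
    return models


# Migration: current model as baseline plus up to four candidates.
migration = build_model_list(["candidate-a", "candidate-b"], baseline="current-model")

# Pure optimization: the current model alone, no baseline needed.
tuning = build_model_list(["current-model"])
print(migration, tuning)
```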

What input format does the optimizer require?

You need to prepare your prompt templates in JSONL format, where each JSON object is on a single line. The required fields include version (always "bedrock-2026-05-14"), templateId, and promptTemplate. The promptTemplate contains your base prompt with placeholder variables. You also provide an evaluationSamples array, each sample having inputVariables (key-value pairs for the placeholders) and a referenceResponse (the correct answer for that sample). Optional fields like steeringCriteria (guidelines for the optimizer), customEvaluationMetricLabel, and either a customLLMJConfig (LLM-as-a-judge prompt and model ID) or evaluationMetricLambdaArn (Lambda function ARN) let you customize evaluation. A good practice is to include at least 10–20 diverse samples for robust optimization.
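As a rough illustration, the following Python sketch assembles one JSONL line from the fields named above. The field names come from this guide. The {{ticket}} placeholder syntax, the templateId value, the sample data, and the assumption that evaluationSamples sits in the same JSON object as the template are all illustrative guesses, so check the AWS documentation for the exact schema.

```python
import json
from typing import Dict, List


def to_jsonl_line(template_id: str, prompt_template: str, samples: List[Dict]) -> str:
    """Serialize one prompt template and its samples as a single JSONL line."""
    for sample in samples:
        if "inputVariables" not in sample or "referenceResponse" not in sample:
            raise ValueError("each sample needs inputVariables and referenceResponse")
    record = {
        "version": "bedrock-2026-05-14",   # version string given in the text
        "templateId": template_id,
        "promptTemplate": prompt_template,
        "evaluationSamples": samples,      # assumed to live in the same object
    }
    # json.dumps escapes newlines inside strings, so the output stays on one line.
    return json.dumps(record)


samples = [
    {"inputVariables": {"ticket": "I was charged twice this month."},
     "referenceResponse": "billing"},
    {"inputVariables": {"ticket": "The app crashes when I log in."},
     "referenceResponse": "technical"},
]

# Placeholder syntax is an assumption; use whatever syntax Bedrock documents.
line = to_jsonl_line("support-triage-v1", "Classify this ticket: {{ticket}}", samples)
print(line)
```

Writing each such line to a file, one per template, gives you the JSONL upload. A real file should carry more samples than this two-item example.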

Can I work with multimodal inputs like images or PDFs?

Yes! Advanced Prompt Optimization supports multimodal inputs including PNG, JPG, and PDF files. You can incorporate these as part of your prompt templates—for example, to optimize prompts for document analysis, image captioning, or visual question answering. In your JSONL file, the inputVariables can reference file paths or base64-encoded content of these media types. The optimizer will treat them as part of the prompt and evaluate the model's responses accordingly. This is particularly useful for tasks that require understanding both text and visual elements, such as extracting data from invoices, classifying images, or generating descriptions. Just ensure the total file size per sample stays within Bedrock's input limits (check the AWS documentation for current constraints).
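For the base64 route, a helper along these lines can prepare a file for a sample. This is a sketch under assumptions: the guide does not show how encoded media must be wrapped inside inputVariables, so the "document" key and the bare-string value are hypothetical. The temporary file exists only so the example runs on its own.

```python
import base64
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".png", ".jpg", ".pdf"}  # formats listed in the text


def encode_media(path: Path) -> str:
    """Return the file's contents as a base64 string, rejecting other formats."""
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported media type: {path.suffix}")
    return base64.b64encode(path.read_bytes()).decode("ascii")


# Create a tiny stand-in file so the example does not depend on local data.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "invoice.pdf"
    sample_file.write_bytes(b"%PDF-1.4 placeholder bytes")
    encoded = encode_media(sample_file)

# Hypothetical sample layout; the real schema may wrap media differently.
input_variables = {"document": encoded}
print(len(encoded))
```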

What evaluation metrics does the optimizer support?

You have three ways to define the evaluation metric: a natural language description, an LLM-as-a-judge rubric, or an AWS Lambda function. A natural language description is the simplest: write a short sentence like "prefer concise, factual answers," and the optimizer uses it to guide prompt refinement. The LLM-as-a-judge approach lets you provide a custom prompt in JSON (customLLMJConfig) that instructs a designated LLM (e.g., Claude Haiku) to score responses against a rubric with specific criteria. For maximum control, use a Lambda function (evaluationMetricLambdaArn) that implements your own scoring logic, which could include exact match, semantic similarity, or domain-specific checks. When you use either the LLM-as-a-judge or the Lambda option, you must also provide a customEvaluationMetricLabel (e.g., "accuracy"). Whichever method you choose, the optimizer reports evaluation scores before and after optimization.
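For the Lambda option, the scoring logic could look something like the exact-match handler below. The lambda_handler(event, context) signature is the standard Python entry point for AWS Lambda, but the event keys modelResponse and referenceResponse and the {"score": ...} return shape are assumptions made for illustration. The actual request and response contract is defined in the AWS documentation.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def lambda_handler(event, context):
    # Hypothetical event shape: the real payload keys may differ.
    response = event["modelResponse"]
    reference = event["referenceResponse"]
    # Exact match after normalization: 1.0 for a match, 0.0 otherwise.
    score = 1.0 if normalize(response) == normalize(reference) else 0.0
    return {"score": score}


# Local smoke test; no AWS resources are involved.
print(lambda_handler(
    {"modelResponse": "  Billing ", "referenceResponse": "billing"}, None
))
```

Exact match suits classification-style tasks with short answers. For free-form text, a semantic similarity or domain-specific check would replace the comparison line.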

How do I get started in the AWS console?

To begin, navigate to the Amazon Bedrock console and choose Create prompt optimization on the Advanced Prompt Optimization page. You'll then select up to five models you want to optimize for. Next, upload your JSONL file containing the prompt template, sample inputs, ground truth responses, and evaluation instructions. You can optionally provide steering criteria or a custom evaluation setup. After submission, the optimizer runs and typically finishes within minutes. When done, you'll see a summary page comparing original and optimized prompts across all selected models, including evaluation scores, estimated costs, and latency. You can download the optimized prompt templates to use in your applications. For detailed steps, refer to the Advanced Prompt Optimization guide in the official AWS documentation.
