LLM Test Bench

Production-ready benchmarking tool for comparing Large Language Model providers on vision tasks with advanced multi-tool testing and structured output validation.

Advanced AI Model Comparison

Test OpenAI, AWS Bedrock, and Google Gemini vision models with sophisticated benchmarking capabilities designed for production environments.

Multi-Provider Support

Compare OpenAI GPT-4V, AWS Bedrock (Claude, Llama 4, Pixtral), and Google Gemini vision models simultaneously with unified API handling.
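
As a rough sketch of what that unification implies (the class and method names here are hypothetical, not the tool's actual API), each provider sits behind a common interface:

from abc import ABC, abstractmethod

class VisionProvider(ABC):
    """Hypothetical common interface that each provider adapter implements."""

    @abstractmethod
    async def analyze_image(self, image_bytes: bytes, prompt: str) -> dict:
        """Return the model response plus latency and token metadata."""

class OpenAIProvider(VisionProvider):
    async def analyze_image(self, image_bytes: bytes, prompt: str) -> dict:
        ...  # call the OpenAI Chat Completions API with an image content part

class BedrockProvider(VisionProvider):
    async def analyze_image(self, image_bytes: bytes, prompt: str) -> dict:
        ...  # call Bedrock Converse or InvokeModel depending on the model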

🎯 Multi-Tool Testing

Define multiple analysis schemas and let the model choose the most appropriate one based on image content, testing genuine AI decision-making capabilities.
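
For illustration, two analysis schemas might be registered as tools and offered together, so the model's choice of tool reveals how it classified the image (the tool names and fields below are made up, not taken from the shipped config):

# Hypothetical tool definitions; the real field names live in config.yaml.
product_analysis = {
    "name": "analyze_product",
    "description": "Extract product details from a retail photo",
    "input_schema": {
        "type": "object",
        "properties": {"brand": {"type": "string"}, "category": {"type": "string"}},
        "required": ["brand", "category"],
    },
}

scene_analysis = {
    "name": "analyze_scene",
    "description": "Describe an indoor or outdoor scene",
    "input_schema": {
        "type": "object",
        "properties": {
            "setting": {"type": "string"},
            "objects": {"type": "array", "items": {"type": "string"}},
        },
    },
}

# Both tools are sent with the request; which one the model calls shows
# how it decided to analyze the image.
tools = [product_analysis, scene_analysis]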

📊 Structured Output Validation

Compare how well each provider follows JSON schemas with native structured output support (OpenAI json_schema, Claude tools, Gemini responseSchema).
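
On the OpenAI side, for example, the check relies on the json_schema response format; a minimal sketch follows, in which the model name, image path, and schema are placeholders:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
image_b64 = base64.b64encode(open("test_images/image1.jpg", "rb").read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use the model set in config.yaml
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the main object in this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "object_analysis",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"label": {"type": "string"}, "confidence": {"type": "number"}},
                "required": ["label", "confidence"],
                "additionalProperties": False,
            },
        },
    },
)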

🖼️ Multi-Image Processing

Batch process entire image directories automatically. Test prompts against multiple images without configuration changes.
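
The expansion amounts to globbing the image directory and repeating the test case per file, roughly like this sketch (the directory name and extensions are illustrative):

from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def expand_images(image_dir: str = "test_images") -> list[Path]:
    """Collect every image in the directory so one run happens per file."""
    return sorted(
        path for path in Path(image_dir).iterdir()
        if path.suffix.lower() in IMAGE_EXTENSIONS
    )

for image_path in expand_images():
    ...  # run the same prompt and providers against image_path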

🚀 Production Ready

Async operations, comprehensive error handling, rate limiting, and an architecture optimized for serverless deployment on AWS Lambda.

📈 Performance Analytics

Track latency, token usage, success rates, and costs across providers to optimize your AI infrastructure decisions.
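
A minimal sketch of that aggregation, assuming each per-call record carries provider, latency_ms, tokens, and success fields (these field names are assumptions, not the tool's exact schema):

from collections import defaultdict
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Group per-call metrics by provider and compute averages."""
    by_provider = defaultdict(list)
    for record in results:
        by_provider[record["provider"]].append(record)
    return {
        provider: {
            "avg_latency_ms": mean(r["latency_ms"] for r in calls),
            "avg_tokens": mean(r["tokens"] for r in calls),
            "success_rate": sum(r["success"] for r in calls) / len(calls),
        }
        for provider, calls in by_provider.items()
    }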

See It In Action

Real benchmark results showing multi-provider comparison with detailed performance metrics and structured output validation.

llm_test_bench.py
$ python llm_test_bench.py

INFO: Running test case 1/1: Smart Object Analysis
INFO: Expanding test case to process 3 images
INFO: Testing openai_gpt4_nano...
INFO: Testing gemini_flash_lite...
INFO: Testing bedrock_sonnet_4...

🎉 Test complete!
📊 Test Cases: 1
🖼️ Images Processed: 3
✅ Successful Provider Calls: 9
❌ Failed Provider Calls: 0

📝 Smart Object Analysis (3 images):
  📸 image1:
    ✅ openai_gpt4_nano: 1,101ms (107 tokens)
    ✅ gemini_flash_lite: 987ms (134 tokens)
    ✅ bedrock_sonnet_4: 2,145ms (89 tokens)
  📸 image2:
    ✅ openai_gpt4_nano: 923ms (142 tokens)
    ✅ gemini_flash_lite: 876ms (156 tokens)
    ✅ bedrock_sonnet_4: 1,987ms (98 tokens)
  📸 image3:
    ✅ openai_gpt4_nano: 1,045ms (128 tokens)
    ✅ gemini_flash_lite: 934ms (167 tokens)
    ✅ bedrock_sonnet_4: 2,234ms (102 tokens)

📊 Results saved to results/test_results_2025-07-07_14-30-22.json
🏆 Fastest: Gemini Flash Lite (avg 932ms)
💰 Most Token Efficient: Bedrock Sonnet 4 (avg 96 tokens)

Technical Architecture

Built with modern async patterns and provider-specific optimizations for maximum performance and reliability.

🔧 Smart API Selection

Uses the optimal API for each model (Converse for Llama 4 vision, InvokeModel for others)
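
A rough sketch of that routing with boto3 (the model-ID check and payload shape are simplified for illustration):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def call_bedrock(model_id: str, payload: dict):
    """Route to the Converse API for Llama 4 vision, InvokeModel otherwise."""
    if "llama4" in model_id:  # simplified check, for illustration only
        return bedrock.converse(modelId=model_id, messages=payload["messages"])
    return bedrock.invoke_model(modelId=model_id, body=json.dumps(payload))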

📝 Native Structured Output

Provider-specific implementations: OpenAI json_schema, Claude tools, Gemini responseSchema
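
For Gemini, for instance, the same idea goes through the generation config; a sketch with the google-generativeai SDK, where the model name, image path, and schema are placeholders and the exact schema form can vary by SDK version:

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="...")  # GEMINI_API_KEY from .env
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

response = model.generate_content(
    ["Describe the main object in this image.", Image.open("test_images/image1.jpg")],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {"label": {"type": "string"}},
            "required": ["label"],
        },
    ),
)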

Async Operations

Non-blocking API calls with comprehensive error handling and retry logic
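
The retry behaviour can be pictured roughly as the following sketch (the attempt count and backoff values are illustrative, not the tool's actual settings):

import asyncio

async def call_with_retries(provider_call, *, max_attempts: int = 3, base_delay: float = 1.0):
    """Await a provider coroutine, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await provider_call()
        except Exception:
            if attempt == max_attempts:
                raise
            # wait 1s, 2s, 4s, ... before the next attempt
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))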

🛡️ Rate Limiting

Configurable delays and request throttling to respect API limits
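
Conceptually, the throttling looks like this sketch, with the concurrency cap and delay standing in for whatever values the configuration sets:

import asyncio

MAX_CONCURRENT = 3        # illustrative; tune to each provider's limits
DELAY_BETWEEN_CALLS = 0.5  # seconds

semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def throttled(provider_call):
    """Cap concurrent requests and space them out to respect API limits."""
    async with semaphore:
        result = await provider_call()
        await asyncio.sleep(DELAY_BETWEEN_CALLS)
        return result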

📊 Optimized Storage

Efficient JSON format grouping results by test case for easy analysis
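
An illustrative shape of that grouping, using figures from the run above (the field names are assumptions about the layout, not guaranteed to match the actual files):

# Illustrative layout: one entry per test case, then per image, then per provider.
results = {
    "Smart Object Analysis": {
        "image1": {
            "openai_gpt4_nano": {"success": True, "latency_ms": 1101, "tokens": 107, "output": {}},
            "gemini_flash_lite": {"success": True, "latency_ms": 987, "tokens": 134, "output": {}},
        },
    },
}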

☁️ Serverless Ready

Environment variable configuration and Lambda-compatible architecture
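
A minimal sketch of how the tool could be wrapped for Lambda (the handler, event fields, and run_benchmarks entry point are hypothetical):

import asyncio

from llm_test_bench import run_benchmarks  # hypothetical entry point

def lambda_handler(event, context):
    """Run the benchmark suite inside Lambda; API keys come from environment variables."""
    results = asyncio.run(run_benchmarks(config_path=event.get("config", "config.yaml")))
    return {"statusCode": 200, "body": results}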

Get Started in Minutes

Simple setup process to start benchmarking AI models with advanced structured output testing.

1. Clone & Install

Get the repository and install dependencies:

git clone https://github.com/realadeel/llm-test-bench.git
cd llm-test-bench
pip install -r requirements.txt

2. Configure API Keys

Set up your provider credentials in .env:

OPENAI_API_KEY=your_key_here
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
GEMINI_API_KEY=your_gemini_key

3. Configure Tests

Customize your test configuration:

cp config.yaml.example config.yaml
# Edit config.yaml with your test cases
# Add images to test_images/ directory

4. Run Benchmarks

Execute tests and analyze results:

python llm_test_bench.py
# Results saved to results/ directory
# View detailed JSON output and metrics