Production-ready benchmarking tool for comparing Large Language Model providers on vision tasks with advanced multi-tool testing and structured output validation.
Test OpenAI, AWS Bedrock, and Google Gemini vision models with sophisticated benchmarking capabilities designed for production environments.
Compare OpenAI GPT-4V, AWS Bedrock (Claude, Llama 4, Pixtral), and Google Gemini vision models simultaneously with unified API handling.
Define multiple analysis schemas and let the AI choose the most appropriate method based on image content, testing true AI decision-making capabilities.
Compare how well each provider follows JSON schemas with native structured output support (OpenAI json_schema, Claude tools, Gemini responseSchema).
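Each provider encodes schema constraints differently. The sketch below is a simplified illustration rather than the tool's internal code; it shows roughly how one schema maps onto each provider's request shape (`object_schema` and its fields are a hypothetical example):

```python
# One logical schema, encoded three ways (simplified sketch, not the tool's code).
object_schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["label", "confidence"],
}

# OpenAI: native json_schema response format.
openai_response_format = {
    "type": "json_schema",
    "json_schema": {"name": "object_analysis", "schema": object_schema},
}

# Claude (Anthropic Messages format, including via Bedrock): a tool whose
# input schema is the target schema; forcing the tool yields structured
# arguments instead of free text.
claude_tool = {
    "name": "object_analysis",
    "description": "Report the detected object.",
    "input_schema": object_schema,
}

# Gemini: the schema goes in the generation config
# (responseMimeType / responseSchema in the REST API).
gemini_generation_config = {
    "response_mime_type": "application/json",
    "response_schema": object_schema,
}
```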
Batch-process entire image directories automatically. Test prompts against multiple images without configuration changes.
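A minimal sketch of what that directory expansion might look like (the helper name and extension list are illustrative, not the tool's actual code):

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def expand_images(directory: str) -> list[Path]:
    """Collect every supported image in a directory, sorted for a stable run order."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.suffix.lower() in IMAGE_EXTENSIONS
    )
```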
Async operations, comprehensive error handling, rate limiting, and AWS Lambda compatibility for serverless deployment.
Track latency, token usage, success rates, and costs across providers to optimize your AI infrastructure decisions.
Real benchmark results showing multi-provider comparison with detailed performance metrics and structured output validation.
```
$ python llm_test_bench.py
INFO: Running test case 1/1: Smart Object Analysis
INFO: Expanding test case to process 3 images
INFO: Testing openai_gpt4_nano...
INFO: Testing gemini_flash_lite...
INFO: Testing bedrock_sonnet_4...

🎉 Test complete!
📊 Test Cases: 1
🖼️ Images Processed: 3
✅ Successful Provider Calls: 9
❌ Failed Provider Calls: 0

📝 Smart Object Analysis (3 images):
  📸 image1:
    ✅ openai_gpt4_nano: 1,101ms (107 tokens)
    ✅ gemini_flash_lite: 987ms (134 tokens)
    ✅ bedrock_sonnet_4: 2,145ms (89 tokens)
  📸 image2:
    ✅ openai_gpt4_nano: 923ms (142 tokens)
    ✅ gemini_flash_lite: 876ms (156 tokens)
    ✅ bedrock_sonnet_4: 1,987ms (98 tokens)
  📸 image3:
    ✅ openai_gpt4_nano: 1,045ms (128 tokens)
    ✅ gemini_flash_lite: 934ms (167 tokens)
    ✅ bedrock_sonnet_4: 2,234ms (102 tokens)

📊 Results saved to results/test_results_2025-07-07_14-30-22.json

🏆 Fastest: Gemini Flash Lite (avg 932ms)
💰 Most Token Efficient: Bedrock Sonnet 4 (avg 96 tokens)
```
Built with modern async patterns and provider-specific optimizations for maximum performance and reliability.
Uses the optimal API for each model (Converse for Llama 4 vision, InvokeModel for others); see the routing sketch after this list
Provider-specific implementations: OpenAI json_schema, Claude tools, Gemini responseSchema
Non-blocking API calls with comprehensive error handling and retry logic (sketched after this list)
Configurable delays and request throttling to respect API limits
Efficient JSON format grouping results by test case for easy analysis
Environment variable configuration and Lambda-compatible architecture
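A condensed sketch of how these pieces can fit together: async provider calls behind a semaphore, a configurable inter-request delay, exponential-backoff retries, and the Bedrock API routing mentioned above. All identifiers here (`call_with_retries`, `bedrock_api_for`, `MAX_RETRIES`, `REQUEST_DELAY`, `MAX_CONCURRENCY`) are illustrative assumptions, not the tool's actual names:

```python
import asyncio
import os

# Assumed environment variables; the real tool's names may differ.
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
REQUEST_DELAY = float(os.getenv("REQUEST_DELAY", "0.5"))
MAX_CONCURRENCY = int(os.getenv("MAX_CONCURRENCY", "4"))

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

def bedrock_api_for(model_id: str) -> str:
    """Illustrative routing: Llama 4 vision via Converse, everything else via InvokeModel."""
    return "converse" if "llama4" in model_id.lower() else "invoke_model"

async def call_with_retries(call, *args):
    """Run one provider coroutine with throttling and exponential-backoff retries."""
    async with semaphore:                        # cap concurrent in-flight requests
        for attempt in range(MAX_RETRIES):
            try:
                result = await call(*args)
                await asyncio.sleep(REQUEST_DELAY)   # configurable spacing between calls
                return result
            except Exception:                    # real code should narrow the exception types
                if attempt == MAX_RETRIES - 1:
                    raise
                await asyncio.sleep(2 ** attempt)    # back off: 1s, 2s, 4s, ...
```

A runner would wrap each provider call along the lines of `await call_with_retries(provider.analyze, prompt, image_bytes)`, so a failing provider surfaces as one failed call rather than aborting the whole batch.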
Simple setup process to start benchmarking AI models with advanced structured output testing.
Get the repository and install dependencies:
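Assuming a standard Python layout with a requirements.txt (the clone URL and directory name below are placeholders):

```bash
git clone <repository-url>
cd llm-test-bench
pip install -r requirements.txt
```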
Set up your provider credentials in .env:
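The variable names below are the conventional ones for each SDK; treat them as assumptions and match whatever the tool actually reads:

```bash
OPENAI_API_KEY=sk-...
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=us-east-1
GOOGLE_API_KEY=...
```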
Customize your test configuration:
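The exact configuration format isn't shown here; the sketch below is a hypothetical JSON layout that reuses the provider and test-case names from the sample run above:

```json
{
  "test_cases": [
    {
      "name": "Smart Object Analysis",
      "prompt": "Identify the main object and choose the best analysis schema.",
      "images_dir": "images/",
      "providers": ["openai_gpt4_nano", "gemini_flash_lite", "bedrock_sonnet_4"]
    }
  ]
}
```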
Execute tests and analyze results:
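Run the benchmark from the project root; per-image, per-provider results are written to a timestamped JSON file under results/, as in the sample run above:

```bash
python llm_test_bench.py
```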