This is a modified version of TIGER-AI-Lab/MMLU-Pro that lets you run the MMLU-Pro benchmark via the OpenAI Chat Completions API. It has been tested with Ollama and llama.cpp, but it should also work with LM Studio, KoboldCpp, text-generation-webui (Oobabooga) with the OpenAI extension, and other compatible servers.
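Any server that exposes the Chat Completions endpoint can be targeted. As a rough sketch of what such a request body looks like (the model name, question text, and temperature here are illustrative, not the script's actual prompt or settings):

```python
import json

# Minimal Chat Completions request body, POSTed to <url>/chat/completions.
# The model name and question are made up for illustration.
payload = {
    "model": "phi3",
    "messages": [
        {
            "role": "user",
            "content": "Answer with the letter of the correct option.\n"
                       "Q: 2 + 2 = ?\nA) 3\nB) 4\nC) 5",
        },
    ],
    "temperature": 0.0,
}

# The script serializes this to JSON and reads the assistant's reply
# from choices[0].message.content in the response.
body = json.dumps(payload)
print(body[:40])
```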
For example, to run the benchmark against Phi-3 on Ollama:
pip install -r requirements.txt
python run_openai.py --url http://localhost:11434/v1 --model phi3
By default, it tests all subjects, but you can use the --category option to test only a specific subject.
Subjects include: 'business', 'law', 'psychology', 'biology', 'chemistry', 'history', 'other', 'health', 'economics', 'math', 'physics', 'computer science', 'philosophy', 'engineering'
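For instance, to benchmark only the math subject (assuming the same local Ollama server as above):

```shell
python run_openai.py --url http://localhost:11434/v1 --model phi3 --category math
```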
The default timeout is 600 seconds (10 minutes). If the model being tested is slow to respond and you encounter an "error Request timed out" message, use the --timeout number_of_seconds option to increase it.
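For example, to allow up to 20 minutes per request (the model name here is illustrative):

```shell
python run_openai.py --url http://localhost:11434/v1 --model phi3 --timeout 1200
```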
You can optionally run multiple tests in parallel with the --parallel option. For example, to run 2 tests in parallel:
python run_openai.py --url http://localhost:11434/v1 --model llama3 --parallel 2