
Releases: svilupp/Julia-LLM-Leaderboard

v0.2.0

02 Feb 20:51

Added

  • Added new models (OpenAI "0125" versions, Codellama, and more)
  • Capability to evaluate code with the AgentCodeFixer loop (set codefixing_num_rounds > 0); a minimal usage sketch follows this list
  • Automatically set a different seed for commercial API providers (MistralAI, OpenAI) to avoid their caching mechanism
  • Re-scored all past submissions with the new methodology
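
A minimal, hypothetical sketch of how the new options might be combined. Only run_benchmark and codefixing_num_rounds are named in these notes; the module name, the models keyword, and the model strings are assumptions for illustration.

```julia
# Hypothetical sketch: only `run_benchmark` and `codefixing_num_rounds` appear in the
# notes above; the module name, the `models` keyword, and the model strings are assumptions.
using JuliaLLMLeaderboard

results = run_benchmark(;
    models = ["gpt-3.5-turbo-0125", "codellama"],  # assumed keyword and model names
    codefixing_num_rounds = 2,  # > 0 enables the AgentCodeFixer loop; 0 keeps the single-shot behaviour
    verbose = true)
```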

Fixed

  • Improved code loading and debugging via Julia's code loading mechanism (include_string), which makes it easier to locate the lines that caused the errors (run evaluate(....; verbose=true) to see which lines failed, or set return_debug=true to return the debug information as a secondary output); a minimal sketch follows this list
  • Improved error capture and scoring (e.g., imports of Base modules are now correctly recognized as "safe")
  • Improved detection of parse errors (i.e., this reduces the score of submissions that previously "executed" only because the parse error went undetected)
  • Fixed mkdir bug in run_benchmark
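
A minimal, hypothetical sketch of the debugging options mentioned above. evaluate, verbose=true, and return_debug=true come from these notes; msg is only a placeholder for the real positional argument(s).

```julia
# Hypothetical sketch: `evaluate`, `verbose = true`, and `return_debug = true` are
# named in the release notes; `msg` is a placeholder for the actual argument(s).
msg = "..."  # e.g. the model response containing the generated Julia code (placeholder)

result = evaluate(msg; verbose = true)              # prints which lines caused the errors
result, debug = evaluate(msg; return_debug = true)  # also returns debug info as a secondary output
```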

Removed

  • @timeout macro has been upstreamed to PromptingTools

Case Studies

  • Quantization effects on Yi34b and Magicoder 7b
  • Effect of English vs Chinese on performance with Yi34b

v0.1.0

29 Dec 20:07

Added

  • Documentation with detailed methodology, test case definitions, and results across various data cuts.
  • Ran ~5 samples for each model/prompt/test case combination for more robust results.