
Releases: svilupp/Julia-LLM-Leaderboard

v0.2.0

02 Feb 20:51

Added

  • Added new models (OpenAI "0125" versions, Codellama, and more)
  • Capability to evaluate code with the AgentCodeFixer loop (set codefixing_num_rounds > 0); a minimal usage sketch follows this list
  • Automatically set a different seed for commercial API providers (MistralAI, OpenAI) to avoid their caching mechanism
  • Re-scored all past submissions with the new methodology
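
A minimal, hypothetical sketch of how the new options might be combined. Only run_benchmark and codefixing_num_rounds are named in these notes; the module name, the models keyword, and the model strings are assumptions for illustration.

```julia
# Hypothetical sketch: only `run_benchmark` and `codefixing_num_rounds` appear in the
# notes above; the module name, the `models` keyword, and the model strings are assumptions.
using JuliaLLMLeaderboard

results = run_benchmark(;
    models = ["gpt-3.5-turbo-0125", "codellama"],  # assumed keyword and model names
    codefixing_num_rounds = 2,  # > 0 enables the AgentCodeFixer loop; 0 keeps the single-shot behaviour
    verbose = true)
```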

Fixed

  • Improved code loading and debugging via Julia's code loading mechanism (include_string), which makes it easier to locate the lines that caused the errors (run evaluate(....; verbose=true) to see which lines failed, or set return_debug=true to return the debug information as a secondary output); a minimal sketch follows this list
  • Improved error capture and scoring (e.g., imports of Base modules are now correctly recognized as "safe")
  • Improved detection of parse errors (i.e., this reduces the score of submissions that previously "executed" only because the parse error went undetected)
  • Fixed mkdir bug in run_benchmark
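
A minimal, hypothetical sketch of the debugging options mentioned above. evaluate, verbose=true, and return_debug=true come from these notes; msg is only a placeholder for the real positional argument(s).

```julia
# Hypothetical sketch: `evaluate`, `verbose = true`, and `return_debug = true` are
# named in the release notes; `msg` is a placeholder for the actual argument(s).
msg = "..."  # e.g. the model response containing the generated Julia code (placeholder)

result = evaluate(msg; verbose = true)              # prints which lines caused the errors
result, debug = evaluate(msg; return_debug = true)  # also returns debug info as a secondary output
```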

Removed

  • @timeout macro has been upstreamed to PromptingTools

Case Studies

  • Quantization effects on Yi34b and Magicoder 7b
  • Effect of English vs Chinese on performance with Yi34b

v0.1.0

29 Dec 20:07

Added

  • Documentation with detailed methodology, test case definitions, and results across various data cuts.
  • Ran ~5 samples for each model/prompt/test case combination for more robust results.