evaluation

Here are 1,187 public repositories matching this topic...

mrgloom / awesome-semantic-segmentation

🤘 awesome-semantic-segmentation

benchmark evaluation deeplearning semantic-segmentation

Updated May 8, 2021

langfuse / langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai gpt observability large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Sep 4, 2024
TypeScript

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Sep 4, 2024
TypeScript

Knetic / govaluate

Arbitrary expression evaluation for golang

go parsing evaluation expression

Updated May 31, 2024
Go

open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Updated Sep 4, 2024
Python

MichaelGrupp / evo

Python package for the evaluation of odometry and SLAM

benchmark robotics tum mapping metrics evaluation ros slam trajectory-analysis odometry trajectory ros2 kitti euroc trajectory-evaluation

Updated Sep 4, 2024
Python

sdiehl / write-you-a-haskell

Building a modern functional compiler from first principles. (https://dev.stephendiehl.com/fun/)

compiler functional-programming book lambda-calculus evaluation type-theory type pdf-book type-checking haskel type-system functional-language hindley-milner type-inference intermediate-representation

Updated Jan 11, 2021
Haskell

viebel / klipse

Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.

react javascript ruby python scheme clojure lua clojurescript reactjs common-lisp ocaml brainfuck evaluation prolog codemirror-editor reasonml interactive-snippets code-evaluation klipse-plugin

Updated Oct 7, 2022
HTML

CLUEbenchmark / SuperCLUE

SuperCLUE: A Comprehensive Benchmark for General-Purpose Foundation Models in Chinese

evaluation chinese gpt-4 foundation-models chatgpt

Updated May 23, 2024

zzw922cn / Automatic_Speech_Recognition

End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

audio deep-learning tensorflow paper end-to-end evaluation cnn lstm speech-recognition rnn automatic-speech-recognition feature-vector data-preprocessing phonemes timit-dataset layer-normalization rnn-encoder-decoder chinese-speech-recognition

Updated Mar 24, 2023
Python

microsoft / promptbench

A unified evaluation framework for large language models

benchmark evaluation prompt robustness adversarial-attacks large-language-models prompt-engineering chatgpt

Updated Aug 20, 2024
Python

ianarawjo / ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

ai evaluation large-language-models prompt-engineering llms llmops

Updated Aug 16, 2024
TypeScript

uptrain-ai / uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

machine-learning monitoring evaluation experimentation jailbreak-detection autoevaluation root-cause-analysis prompt-engineering llmops openai-evals llm-prompting llm-eval llm-test hallucination-detection

Updated Aug 18, 2024
Python

huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

machine-learning evaluation

Updated Aug 14, 2024
Python

ContinualAI / avalanche

Avalanche: an End-to-End Library for Continual Learning based on PyTorch.

training library framework deep-learning metrics evaluation pytorch benchmarks strategies lifelong-learning continual-learning continualai

Updated Jun 21, 2024
Python

Cloud-CV / EvalAI

☁️ 🚀 📊 📈 Evaluating state of the art in AI

python angularjs docker challenge machine-learning django ai reproducible-research leaderboard evaluation artificial-intelligence ai-challenges reproducibility evalai angular7

Updated Aug 29, 2024
Python

Helicone / helicone

🧊 Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc. Supports OpenAI SDK, Vercel AI SDK, Anthropic SDK, LiteLLM, LlamaIndex, LangChain, and more. 🍓 YC W23