FlagEval is an evaluation toolkit for large AI foundation models.

Python · 298 stars · 28 forks · Updated Jul 13, 2024

AI Verify

Python · 121 stars · 34 forks · Updated Nov 2, 2024

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Python · 37,076 stars · 5,887 forks · Updated Aug 19, 2024

Collection of evals for Inspect AI

Python · 15 stars · 20 forks · Updated Oct 31, 2024

A collection of projects designed to help developers quickly get started with building deployable applications using the Anthropic API

TypeScript · 6,460 stars · 875 forks · Updated Oct 29, 2024

RuLES: a benchmark for evaluating rule-following in language models

Python · 210 stars · 15 forks · Updated Sep 30, 2024

Contains all assets to run with Moonshot Library (Connectors, Datasets and Metrics)

Python · 18 stars · 17 forks · Updated Nov 2, 2024

Web UI for Moonshot

TypeScript · 7 stars · 4 forks · Updated Nov 1, 2024

S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

40 stars · 3 forks · Updated Oct 27, 2024

LLM evaluation.

Python · 13 stars · 2 forks · Updated Nov 7, 2023

Deep learning for dummies. All the practical details and useful utilities that go into working with real models.

Python · 703 stars · 36 forks · Updated Sep 24, 2024

A fast + lightweight implementation of the GCG algorithm in PyTorch

Python · 113 stars · 29 forks · Updated Oct 21, 2024

A curated list of awesome resources dedicated to Scaling Laws for LLMs

62 stars · 3 forks · Updated Apr 10, 2023

[NeurIPS 2023] Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Python · 4,771 stars · 445 forks · Updated Jun 22, 2024

LLM101n: Let's build a Storyteller

29,586 stars · 1,620 forks · Updated Aug 1, 2024

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

Python · 173 stars · 34 forks · Updated Nov 2, 2024

A benchmark for prompt injection detection systems.

Jupyter Notebook · 86 stars · 10 forks · Updated Sep 10, 2024

Make your GenAI Apps Safe & Secure 🚀 Test & harden your system prompt

Python · 395 stars · 51 forks · Updated Oct 16, 2024

A framework for few-shot evaluation of language models.

Python · 6,862 stars · 1,830 forks · Updated Nov 1, 2024

A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights i…

960 stars · 54 forks · Updated Oct 28, 2024

METR Task Standard

TypeScript · 118 stars · 28 forks · Updated Oct 30, 2024

A fast, clean, responsive Hugo theme.

HTML · 9,900 stars · 2,691 forks · Updated Sep 15, 2024

Inspect: A framework for large language model evaluations

Python · 597 stars · 111 forks · Updated Nov 2, 2024

[ACL 2024] SALAD benchmark & MD-Judge

Python · 103 stars · 11 forks · Updated Oct 11, 2024

The implementation of Sophon

Python · 8 stars · 2 forks · Updated Jun 22, 2024

[NeurIPS 2024] SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challen…

Python · 13,615 stars · 1,379 forks · Updated Oct 31, 2024

A Comprehensive Assessment of Trustworthiness in GPT Models

Python · 258 stars · 55 forks · Updated Sep 16, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Jupyter Notebook · 316 stars · 53 forks · Updated Aug 16, 2024