{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/mshumer/gpt-llm-trainer/blob/main/claude_llm_trainer.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Describe your model -> fine-tuned LLaMA 2\n",
        "By Matt Shumer (https://twitter.com/mattshumer_)\n",
        "\n",
        "The goal of this notebook is to experiment with a new way to make it very easy to build a task-specific model for your use-case.\n",
        "\n",
        "First, use the best GPU available (go to Runtime -> change runtime type)\n",
        "\n",
        "To create your model, just go to the first code cell, and describe the model you want to build in the prompt. Be descriptive and clear.\n",
        "\n",
        "Select a temperature (high=creative, low=precise), and the number of training examples to generate to train the model. From there, just run all the cells.\n",
        "\n",
        "You can change the model you want to fine-tune by changing `model_name` in the `Define Hyperparameters` cell."
      ],
      "metadata": {
        "id": "wM8MRkf8Dr94"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#Data generation step"
      ],
      "metadata": {
        "id": "Way3_PuPpIuE"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Write your prompt here. Make it as descriptive as possible!\n",
        "\n",
        "Then, choose the temperature (between 0 and 1) to use when generating data. Lower values are great for precise tasks, like writing code, whereas larger values are better for creative tasks, like writing stories.\n",
        "\n",
        "Finally, choose how many examples you want to generate. The more you generate, a) the longer it takes and b) the more expensive data generation will be. But generally, more examples will lead to a higher-quality model. 100 is usually the minimum to start."
      ],
      "metadata": {
        "id": "lY-3DvlIpVSl"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "prompt = \"A model that takes in a puzzle-like reasoning-heavy question in English, and responds with a well-reasoned, step-by-step thought out response in Spanish.\"\n",
        "temperature = .5\n",
        "number_of_examples = 100\n",
        "\n",
        "ANTHROPIC_API_KEY = \"YOUR API KEY\""
      ],
      "metadata": {
        "id": "R7WKZyxtpUPS"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Run this to generate the dataset."
      ],
      "metadata": {
        "id": "1snNou5PrIci"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "import random\n",
        "import requests\n",
        "\n",
        "def generate_example(prompt, prev_examples, temperature=0.5):\n",
        "    messages = [{\"role\": \"user\", \"content\": f'Now, generate a prompt/response pair for `{prompt}`. Do so in the exact format requested:\\n```\\n<prompt>prompt</prompt>\\n<response>response_goes_here</response>\\n```\\n\\nOnly one prompt/response pair should be generated per turn.'}]\n",
        "\n",
        "    if len(prev_examples) > 0:\n",
        "        if len(prev_examples) > 10:\n",
        "            prev_examples = random.sample(prev_examples, 10)\n",
        "\n",
        "        for example in prev_examples:\n",
        "            messages.append({\n",
        "                \"role\": \"assistant\",\n",
        "                \"content\": example\n",
        "            })\n",
        "\n",
        "            messages.append({\n",
        "                \"role\": \"user\",\n",
        "                \"content\": 'Now, generate another prompt/response pair. Make it unique.'\n",
        "            })\n",
        "\n",
        "    headers = {\n",
        "        \"x-api-key\": ANTHROPIC_API_KEY,\n",
        "        \"anthropic-version\": \"2023-06-01\",\n",
        "        \"content-type\": \"application/json\"\n",
        "    }\n",
        "\n",
        "    data = {\n",
        "        \"model\": 'claude-3-haiku-20240307', # \"claude-3-opus-20240229\",\n",
        "        \"max_tokens\": 1354,\n",
        "        \"temperature\": temperature,\n",
        "        \"messages\": messages,\n",
        "        \"system\": f\"You are generating data which will be used to train a machine learning model.\\n\\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\\n\\nYou will do so in this format:\\n```\\n<prompt>prompt</prompt>\\n<response>response_goes_here</response>\\n```\\n\\nOnly one prompt/response pair should be generated per turn.\\n\\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\\n\\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\\n\\nHere is the type of model we want to train:\\n`{prompt}`\"\n",
        "    }\n",
        "    print(messages)\n",
        "\n",
        "    response = requests.post(\"https://api.anthropic.com/v1/messages\", headers=headers, json=data)\n",
        "    print(response.json())\n",
        "\n",
        "    return '<prompt>' + response.json()['content'][0]['text'].split('<prompt>')[1]\n",
        "\n",
        "# Generate examples\n",
        "prev_examples = []\n",
        "for i in range(number_of_examples):\n",
        "    print(f'Generating example {i}')\n",
        "    example = generate_example(prompt, prev_examples, temperature)\n",
        "    print(example)\n",
        "    prev_examples.append(example)\n",
        "\n",
        "print(prev_examples)"
      ],
      "metadata": {
        "id": "Rdsd82ngpHCG"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We also need to generate a system message."
      ],
      "metadata": {
        "id": "KC6iJzXjugJ-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def generate_system_message(prompt):\n",
        "    headers = {\n",
        "        \"x-api-key\": ANTHROPIC_API_KEY,\n",
        "        \"anthropic-version\": \"2023-06-01\",\n",
        "        \"content-type\": \"application/json\"\n",
        "    }\n",
        "    data = {\n",
        "        \"model\": \"claude-3-opus-20240229\",\n",
        "        \"max_tokens\": 500,\n",
        "        \"temperature\": temperature,\n",
        "        \"system\":\"\"\"\n",
        "                You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given WHAT_THE_MODEL_SHOULD_DO.`.\\n\\nMake it as concise as possible. Include nothing but the system prompt in your response.\\n\\nFor example, never write: `\\\"SYSTEM_PROMPT_HERE`.\\\"\n",
        "                \"\"\".strip(),\n",
        "        \"messages\": [\n",
        "            {\n",
        "                \"role\": \"user\",\n",
        "                \"content\": f\"Here is the prompt: `{prompt.strip()}`. Write a fantastic system message.\",\n",
        "            }\n",
        "        ],\n",
        "    }\n",
        "    response = requests.post(\"https://api.anthropic.com/v1/messages\", headers=headers, json=data)\n",
        "    return response.json()['content'][0]['text']\n",
        "\n",
        "system_message = generate_system_message(prompt)\n",
        "\n",
        "print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')"
      ],
      "metadata": {
        "id": "xMcfhW6Guh2E"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's put our examples into a dataframe and turn them into a final pair of datasets."
      ],
      "metadata": {
        "id": "G6BqZ-hjseBF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Initialize lists to store prompts and responses\n",
        "prompts = []\n",
        "responses = []\n",
        "\n",
        "# Parse out prompts and responses from examples\n",
        "for example in prev_examples:\n",
        "    try:\n",
        "        prompt_start = example.index('<prompt>') + len('<prompt>')\n",
        "        prompt_end = example.index('</prompt>')\n",
        "        prompt = example[prompt_start:prompt_end].strip()\n",
        "\n",
        "        response_start = example.index('<response>') + len('<response>')\n",
        "        response_end = example.index('</response>')\n",
        "        response = example[response_start:response_end].strip()\n",
        "\n",
        "        prompts.append(prompt)\n",
        "        responses.append(response)\n",
        "    except (ValueError, IndexError):\n",
        "        pass\n",
        "\n",
        "# Create a DataFrame\n",
        "df = pd.DataFrame({\n",
        "    'prompt': prompts,\n",
        "    'response': responses\n",
        "})\n",
        "\n",
        "# Remove duplicates\n",
        "df = df.drop_duplicates()\n",
        "\n",
        "print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')\n",
        "df.head()"
      ],
      "metadata": {
        "id": "7CEdkYeRsdmB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Split into train and test sets."
      ],
      "metadata": {
        "id": "A-8dt5qqtpgM"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Split the data into train and test sets, with 90% in the train set\n",
        "train_df = df.sample(frac=0.9, random_state=42)\n",
        "test_df = df.drop(train_df.index)\n",
        "\n",
        "# Save the dataframes to .jsonl files\n",
        "train_df.to_json('train.jsonl', orient='records', lines=True)\n",
        "test_df.to_json('test.jsonl', orient='records', lines=True)"
      ],
      "metadata": {
        "id": "GFPEn1omtrXM"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Install necessary libraries"
      ],
      "metadata": {
        "id": "AbrFgrhG_xYi"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7\n",
        "\n",
        "import os\n",
        "import torch\n",
        "from datasets import load_dataset\n",
        "from transformers import (\n",
        "    AutoModelForCausalLM,\n",
        "    AutoTokenizer,\n",
        "    BitsAndBytesConfig,\n",
        "    HfArgumentParser,\n",
        "    TrainingArguments,\n",
        "    pipeline,\n",
        "    logging,\n",
        ")\n",
        "from peft import LoraConfig, PeftModel\n",
        "from trl import SFTTrainer"
      ],
      "metadata": {
        "id": "lPG7wEPetFx2"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Define Hyperparameters"
      ],
      "metadata": {
        "id": "moVo0led-6tu"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model_name = \"NousResearch/llama-2-7b-chat-hf\" # use this if you have access to the official LLaMA 2 model \"meta-llama/Llama-2-7b-chat-hf\", though keep in mind you'll need to pass a Hugging Face key argument\n",
        "dataset_name = \"/content/train.jsonl\"\n",
        "new_model = \"llama-2-7b-custom\"\n",
        "lora_r = 32\n",
        "lora_alpha = 16\n",
        "lora_dropout = 0.1\n",
        "use_4bit = True\n",
        "bnb_4bit_compute_dtype = \"float16\"\n",
        "bnb_4bit_quant_type = \"nf4\"\n",
        "use_nested_quant = False\n",
        "output_dir = \"./results\"\n",
        "num_train_epochs = 1\n",
        "fp16 = False\n",
        "bf16 = False\n",
        "per_device_train_batch_size = 4\n",
        "per_device_eval_batch_size = 4\n",
        "gradient_accumulation_steps = 1\n",
        "gradient_checkpointing = True\n",
        "max_grad_norm = 0.3\n",
        "learning_rate = 2e-4\n",
        "weight_decay = 0.001\n",
        "optim = \"paged_adamw_32bit\"\n",
        "lr_scheduler_type = \"constant\"\n",
        "max_steps = -1\n",
        "warmup_ratio = 0.03\n",
        "group_by_length = True\n",
        "save_steps = 25\n",
        "logging_steps = 5\n",
        "max_seq_length = None\n",
        "packing = False\n",
        "device_map = {\"\": 0}"
      ],
      "metadata": {
        "id": "bqfbhUZI-4c_"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#Load Datasets and Train"
      ],
      "metadata": {
        "id": "F-J5p5KS_MZY"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Load datasets\n",
        "train_dataset = load_dataset('json', data_files='/content/train.jsonl', split=\"train\")\n",
        "valid_dataset = load_dataset('json', data_files='/content/test.jsonl', split=\"train\")\n",
        "\n",
        "# Preprocess datasets\n",
        "train_dataset_mapped = train_dataset.map(lambda examples: {'text': [f'[INST] <<SYS>>\\n{system_message.strip()}\\n<</SYS>>\\n\\n' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)\n",
        "valid_dataset_mapped = valid_dataset.map(lambda examples: {'text': [f'[INST] <<SYS>>\\n{system_message.strip()}\\n<</SYS>>\\n\\n' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)\n",
        "\n",
        "compute_dtype = getattr(torch, bnb_4bit_compute_dtype)\n",
        "bnb_config = BitsAndBytesConfig(\n",
        "    load_in_4bit=use_4bit,\n",
        "    bnb_4bit_quant_type=bnb_4bit_quant_type,\n",
        "    bnb_4bit_compute_dtype=compute_dtype,\n",
        "    bnb_4bit_use_double_quant=use_nested_quant,\n",
        ")\n",
        "model = AutoModelForCausalLM.from_pretrained(\n",
        "    model_name,\n",
        "    quantization_config=bnb_config,\n",
        "    device_map=device_map\n",
        ")\n",
        "model.config.use_cache = False\n",
        "model.config.pretraining_tp = 1\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n",
        "tokenizer.pad_token = tokenizer.eos_token\n",
        "tokenizer.padding_side = \"right\"\n",
        "peft_config = LoraConfig(\n",
        "    lora_alpha=lora_alpha,\n",
        "    lora_dropout=lora_dropout,\n",
        "    r=lora_r,\n",
        "    bias=\"none\",\n",
        "    task_type=\"CAUSAL_LM\",\n",
        ")\n",
        "# Set training parameters\n",
        "training_arguments = TrainingArguments(\n",
        "    output_dir=output_dir,\n",
        "    num_train_epochs=num_train_epochs,\n",
        "    per_device_train_batch_size=per_device_train_batch_size,\n",
        "    gradient_accumulation_steps=gradient_accumulation_steps,\n",
        "    optim=optim,\n",
        "    save_steps=save_steps,\n",
        "    logging_steps=logging_steps,\n",
        "    learning_rate=learning_rate,\n",
        "    weight_decay=weight_decay,\n",
        "    fp16=fp16,\n",
        "    bf16=bf16,\n",
        "    max_grad_norm=max_grad_norm,\n",
        "    max_steps=max_steps,\n",
        "    warmup_ratio=warmup_ratio,\n",
        "    group_by_length=group_by_length,\n",
        "    lr_scheduler_type=lr_scheduler_type,\n",
        "    report_to=\"all\",\n",
        "    evaluation_strategy=\"steps\",\n",
        "    eval_steps=5  # Evaluate every 20 steps\n",
        ")\n",
        "# Set supervised fine-tuning parameters\n",
        "trainer = SFTTrainer(\n",
        "    model=model,\n",
        "    train_dataset=train_dataset_mapped,\n",
        "    eval_dataset=valid_dataset_mapped,  # Pass validation dataset here\n",
        "    peft_config=peft_config,\n",
        "    dataset_text_field=\"text\",\n",
        "    max_seq_length=max_seq_length,\n",
        "    tokenizer=tokenizer,\n",
        "    args=training_arguments,\n",
        "    packing=packing,\n",
        ")\n",
        "trainer.train()\n",
        "trainer.model.save_pretrained(new_model)\n",
        "\n",
        "# Cell 4: Test the model\n",
        "logging.set_verbosity(logging.CRITICAL)\n",
        "prompt = f\"[INST] <<SYS>>\\n{system_message}\\n<</SYS>>\\n\\nWrite a function that reverses a string. [/INST]\" # replace the command here with something relevant to your task\n",
        "pipe = pipeline(task=\"text-generation\", model=model, tokenizer=tokenizer, max_length=200)\n",
        "result = pipe(prompt)\n",
        "print(result[0]['generated_text'])"
      ],
      "metadata": {
        "id": "qf1qxbiF-x6p"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#Run Inference"
      ],
      "metadata": {
        "id": "F6fux9om_c4-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import pipeline\n",
        "\n",
        "prompt = f\"[INST] <<SYS>>\\n{system_message}\\n<</SYS>>\\n\\nWrite a function that reverses a string. [/INST]\" # replace the command here with something relevant to your task\n",
        "num_new_tokens = 100  # change to the number of new tokens you want to generate\n",
        "\n",
        "# Count the number of tokens in the prompt\n",
        "num_prompt_tokens = len(tokenizer(prompt)['input_ids'])\n",
        "\n",
        "# Calculate the maximum length for the generation\n",
        "max_length = num_prompt_tokens + num_new_tokens\n",
        "\n",
        "gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)\n",
        "result = gen(prompt)\n",
        "print(result[0]['generated_text'].replace(prompt, ''))"
      ],
      "metadata": {
        "id": "7hxQ_Ero2IJe"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#Merge the model and store in Google Drive"
      ],
      "metadata": {
        "id": "Ko6UkINu_qSx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Merge and save the fine-tuned model\n",
        "from google.colab import drive\n",
        "drive.mount('/content/drive')\n",
        "\n",
        "model_path = \"/content/drive/MyDrive/llama-2-7b-custom\"  # change to your preferred path\n",
        "\n",
        "# Reload model in FP16 and merge it with LoRA weights\n",
        "base_model = AutoModelForCausalLM.from_pretrained(\n",
        "    model_name,\n",
        "    low_cpu_mem_usage=True,\n",
        "    return_dict=True,\n",
        "    torch_dtype=torch.float16,\n",
        "    device_map=device_map,\n",
        ")\n",
        "model = PeftModel.from_pretrained(base_model, new_model)\n",
        "model = model.merge_and_unload()\n",
        "\n",
        "# Reload tokenizer to save it\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n",
        "tokenizer.pad_token = tokenizer.eos_token\n",
        "tokenizer.padding_side = \"right\"\n",
        "\n",
        "# Save the merged model\n",
        "model.save_pretrained(model_path)\n",
        "tokenizer.save_pretrained(model_path)"
      ],
      "metadata": {
        "id": "AgKCL7fTyp9u"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Load a fine-tuned model from Drive and run inference"
      ],
      "metadata": {
        "id": "do-dFdE5zWGO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from google.colab import drive\n",
        "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
        "\n",
        "drive.mount('/content/drive')\n",
        "\n",
        "model_path = \"/content/drive/MyDrive/llama-2-7b-custom\"  # change to the path where your model is saved\n",
        "\n",
        "model = AutoModelForCausalLM.from_pretrained(model_path)\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_path)"
      ],
      "metadata": {
        "id": "xg6nHPsLzMw-"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import pipeline\n",
        "\n",
        "prompt = \"What is 2 + 2?\"  # change to your desired prompt\n",
        "gen = pipeline('text-generation', model=model, tokenizer=tokenizer)\n",
        "result = gen(prompt)\n",
        "print(result[0]['generated_text'])"
      ],
      "metadata": {
        "id": "fBK2aE2KzZ05"
      },
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "machine_shape": "hm",
      "provenance": [],
      "gpuType": "A100",
      "authorship_tag": "ABX9TyPbwUEGHVxF1MtqNEeRSs+9",
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}