Add text-clustering-embedding-vs-prompting.ipynb

intellectronica committed Aug 22, 2023
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Embeddings are a foundational building block of LLMs - by locating semantics in a multi-dimensional space they allow LLMs to make sense of meaning and \"understand\" what texts are about. In this sample, we look at two ways of using embeddings for the purpose of clustering texts by their meaning. In the first, we use the embeddings directly and learn the clusters based on their embedding vectors. In the second, with summarise the texts and pass them on to a powerful LLM, prompting it to cluster the articles. As we'll see, both approaches work well, and each approach has its advantages.\n",
"\n",
"Learning clusters directly from embeddings:\n",
"- Fast and inexpensive\n",
"- Requires an understanding of basic machine learning methods (K-Means clustering)\n",
"- Does not provide any additional information about the clusters, since the simple machine learning algorithm we're using is working directly with the numerical representation of the embeddings and does not have any insight into their meaning.\n",
"\n",
"Prompting an LLM to cluster the articles:\n",
"- Relatively slow and expensive - this is a task for GPT-4, simpler LLMs do not achieve it with convincing results\n",
"- Easy to accomplish - we're \"programming\" the LLM in English, no additional knowledge of machine-learning techniques required\n",
"- Can provides additional information about the clusters, since the clustering is happening within the context of the LLM, where the summary of the articles is available"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%pip install openai sklearn numpy requests BeautifulSoup4\n",
"from IPython.display import clear_output ; clear_output()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we installed the necessary package, we'll gather some data for our experiment by fetching 23 random articles from Wikipedia."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"articles = []\n",
"\n",
"def get_random_wikipedia_article_title():\n",
" response = requests.get(\"https://en.wikipedia.org/wiki/Special:Random\")\n",
" soup = BeautifulSoup(response.content, \"html.parser\")\n",
" title = soup.find(class_=\"firstHeading\").text\n",
" text = soup.get_text()\n",
" return title, text\n",
"\n",
"for _ in range(23):\n",
" title, text = get_random_wikipedia_article_title()\n",
" articles.append({\n",
" \"title\": title,\n",
" \"text\": text,\n",
" })"
]
},
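{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that soup.get_text() returns the text of the whole page, including navigation and footer chrome. That extra text is mostly harmless here, but as a possible refinement - a sketch, assuming Wikipedia's standard mw-content-text container - we could extract only the article body:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_random_wikipedia_article_clean():\n",
"    # A variant of the function above that keeps only the main article\n",
"    # body (the div with id \"mw-content-text\" on standard Wikipedia pages).\n",
"    response = requests.get(\"https://en.wikipedia.org/wiki/Special:Random\")\n",
"    soup = BeautifulSoup(response.content, \"html.parser\")\n",
"    title = soup.find(class_=\"firstHeading\").text\n",
"    body = soup.find(id=\"mw-content-text\")\n",
"    text = body.get_text() if body is not None else soup.get_text()\n",
"    return title, text"
]
},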
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We configure Azure Open AI with our deployment."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"openai.api_type = \"azure\"\n",
"openai.api_version = \"2023-06-01-preview\"\n",
"openai.api_key = \" ... \" # Replace with your Azure Open AI key\n",
"openai.api_base = \" ... \" # Replace with your Azure Open AI endpoint"
]
},
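{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hardcoding credentials in a notebook is risky. A minimal sketch that reads them from environment variables instead (the variable names AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT are our own choice, not a standard):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Read the credentials from the environment rather than the notebook source,\n",
"# so the key never ends up committed to version control.\n",
"openai.api_key = os.environ[\"AZURE_OPENAI_API_KEY\"]\n",
"openai.api_base = os.environ[\"AZURE_OPENAI_ENDPOINT\"]"
]
},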
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use the Open AI embeddings endpoint to calculate embeddings for all articles.\n",
"\n",
"Note that we're trimming the text of the article a bit to fit it within the available context window of the model. This shouldn't be a problem, since encyclopedia articles usually contain the most important information in the beginning of the article."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"for article in articles:\n",
" response = openai.Embedding.create(\n",
" engine=\"text-embedding-ada-002\",\n",
" input=[article[\"text\"][:5555]],\n",
" )\n",
" article[\"embedding\"] = response[\"data\"][0][\"embedding\"]"
]
},
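{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trimming by character count is only a rough proxy for the model's token limit. A more precise sketch (the trim_to_tokens helper is ours, just for illustration), assuming the tiktoken package is installed, trims by token count using the cl100k_base encoding that text-embedding-ada-002 uses (its input limit is 8,191 tokens):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tiktoken\n",
"\n",
"encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
"\n",
"def trim_to_tokens(text, max_tokens=2000):\n",
"    # Encode, cut to the token budget, and decode back to text.\n",
"    tokens = encoding.encode(text)\n",
"    return encoding.decode(tokens[:max_tokens])\n",
"\n",
"# e.g. trim_to_tokens(article[\"text\"]) instead of article[\"text\"][:5555]"
]
},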
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to embeddings, we'll also create a summary for each article, which we will use later to get an LLM to do the clustering without having to include the complete articles. GPT-3.5 is more than capable for this task, and we're using the version of the model with a larger context window to be able to comfortably fit more of the article content in our request."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"for article in articles:\n",
" completion = openai.ChatCompletion.create(\n",
" engine = \"gpt-35-turbo-16k\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": (\n",
" \"You are an expert encyclopedia editor. \"\n",
" \"Your task is to read an article from Wikipedia and write a short summary of it. \"\n",
" \"Your response should always be a single paragraph of text with the article summary, \"\n",
" \"and should never include any titles, headings or markup.\"\n",
" )\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": (\n",
" f\"Read the following article and write a short summary of it.\"\n",
" f\"\\n\\nTitle:\\n{article['title']}\\nArticle:\\n{article['text'][:9999]}\"\n",
" )\n",
" },\n",
" ],\n",
" temperature=0.3,\n",
" max_tokens=250,\n",
" )\n",
" article[\"summary\"] = completion['choices'][0]['message'][\"content\"]"
]
},
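{
"cell_type": "markdown",
"metadata": {},
"source": [
"Making dozens of sequential API calls like this can run into rate limits. A minimal retry sketch with exponential backoff (the create_with_retries helper is our own), using the openai.error.RateLimitError exception from the pre-1.0 openai library this notebook uses:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"def create_with_retries(max_retries=5, **kwargs):\n",
"    # Retry a chat completion with exponential backoff on rate limiting.\n",
"    for attempt in range(max_retries):\n",
"        try:\n",
"            return openai.ChatCompletion.create(**kwargs)\n",
"        except openai.error.RateLimitError:\n",
"            time.sleep(2 ** attempt)\n",
"    raise RuntimeError(\"Exceeded maximum number of retries\")"
]
},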
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at option 1 - learning the clustering of the articles from their embeddings. We'll use K-Means clustering, a simple but very effective clustering model."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster 1\n",
" - 2010 Campania regional election\n",
" - List of mayors of Vicenza\n",
"\n",
"Cluster 2\n",
" - Boxing at the 2014 Commonwealth Games – Lightweight\n",
" - 1965 Nuneaton by-election\n",
" - 2019 Copa do Brasil\n",
" - 1934 County Championship\n",
" - 1699 in science\n",
"\n",
"Cluster 3\n",
" - Paracongroides\n",
" - Berlin Correspondent\n",
" - Karl Gether Bomhoff\n",
" - Zeitraumexit\n",
" - Fotonovela (film)\n",
"\n",
"Cluster 4\n",
" - Jewish Neo-Aramaic dialect of Koy Sanjaq\n",
" - Special Assault Team\n",
" - Olympic Security Command Centre\n",
"\n",
"Cluster 5\n",
" - Carlos Uzabeaga\n",
" - Thunder Live\n",
" - Luke Kibet Bowen\n",
" - Dan Duffy (artist)\n",
" - Arthit Sunthornpit\n",
"\n",
"Cluster 6\n",
" - Mohtarma Benazir Bhutto Shaheed Medical College\n",
"\n",
"Cluster 7\n",
" - Array Network Facility\n",
" - Young African Leaders Initiative\n",
"\n"
]
}
],
"source": [
"from sklearn.cluster import KMeans\n",
"import numpy as np\n",
"\n",
"embeddings = np.array([article[\"embedding\"] for article in articles])\n",
"kmeans = KMeans(n_clusters=7, random_state=0, n_init=10).fit(embeddings)\n",
"\n",
"for cluster in range(7):\n",
" print(f\"Cluster {cluster + 1}\")\n",
" for article in articles:\n",
" if kmeans.predict([article[\"embedding\"]])[0] == cluster:\n",
" print(f\" - {article['title']}\")\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like K-means got us pretty good results. We don't have an explicitly name title for each cluster, but from looking at the included articles we can see that it did a decent job grouping them by topic. We may be able to achieve even better results by adjusting the parameters of the model, but for a generic solution this is not bad, if a bit obscure."
]
},
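{
"cell_type": "markdown",
"metadata": {},
"source": [
"One parameter worth adjusting is the number of clusters itself. Here is a sketch of picking k with the silhouette score (higher is better), which needs no labels and no LLM calls:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import silhouette_score\n",
"\n",
"# Evaluate a range of cluster counts; the highest-scoring k is a\n",
"# reasonable default choice for this set of articles.\n",
"for k in range(2, 11):\n",
"    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(embeddings)\n",
"    print(f\"k={k}: silhouette score {silhouette_score(embeddings, labels):.3f}\")"
]
},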
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's see what a powerful LLM (GPT-4) can come up with, with just prompting and with the titles and and summaries of the articles included in the context."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster 1: Politics and Governance\n",
" - 2010 Campania regional election\n",
" - List of mayors of Vicenza\n",
" - 1965 Nuneaton by-election\n",
" - Karl Gether Bomhoff\n",
"\n",
"Cluster 2: Sports and Athletics\n",
" - Carlos Uzabeaga\n",
" - Boxing at the 2014 Commonwealth Games – Lightweight\n",
" - Luke Kibet Bowen\n",
" - 2019 Copa do Brasil\n",
" - Arthit Sunthornpit\n",
"\n",
"Cluster 3: Science and Research\n",
" - Paracongroides\n",
" - Array Network Facility\n",
" - 1699 in science\n",
"\n",
"Cluster 4: Language and Culture\n",
" - Jewish Neo-Aramaic dialect of Koy Sanjaq\n",
" - Young African Leaders Initiative\n",
"\n",
"Cluster 5: Law Enforcement and Security\n",
" - Special Assault Team\n",
" - Olympic Security Command Centre\n",
"\n",
"Cluster 6: Arts and Entertainment\n",
" - Berlin Correspondent\n",
" - Thunder Live\n",
" - Dan Duffy (artist)\n",
" - Fotonovela (film)\n",
" - Zeitraumexit\n",
"\n",
"Cluster 7: Education and Health\n",
" - Mohtarma Benazir Bhutto Shaheed Medical College\n",
" - 1934 County Championship\n"
]
}
],
"source": [
"completion = openai.ChatCompletion.create(\n",
" engine=\"gpt-4\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": (\n",
" \"You are an expert encyclopedia editor. \"\n",
" \"Your task is to read through the summaries of Wikipedia articles, \"\n",
" \"and assign them to one of seven clusters. \"\n",
" )\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": (\n",
" \"The following are titles and summaries of Wikipedia articles. \"\n",
" \"Read the summaries. For each article, come up with a few tags that would \"\n",
" \"describe the cluster it belongs to. \"\n",
" \"Then assign the article to one of exactly 7 clusters (from Cluster 1 to Cluster 7), \"\n",
" \"based on their topics. It is very important that you only assign each article to one cluster. \"\n",
" \"You must have exactly 7 clusters, and each cluster must have at least two articles assigned to it. \"\n",
" \"Your final output should follow the format:\\n\"\n",
" \"Cluster N: Cluster Title\\n\"\n",
" \" - First Article Title\\n\"\n",
" \" - Second Article Title\\n\"\n",
" \"etc...\\n\\n\\n\"\n",
" ) + \"\\n\".join(f\"Title: {article['title']}\\nSummary: {article['summary']}\\n\\n\" for article in articles)\n",
" },\n",
" ],\n",
" temperature=0.1,\n",
")\n",
"clusters = completion['choices'][0]['message'][\"content\"]\n",
"\n",
"print(clusters)"
]
},
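{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible middle ground - a sketch, not something we benchmarked here - is to keep the cheap K-Means clusters and only ask the LLM to name each cluster from its article titles:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Group the article titles by their K-Means cluster, then make one short\n",
"# LLM call per cluster to suggest a descriptive name.\n",
"titles_by_cluster = {}\n",
"for article, label in zip(articles, kmeans.labels_):\n",
"    titles_by_cluster.setdefault(label, []).append(article[\"title\"])\n",
"\n",
"for label, titles in sorted(titles_by_cluster.items()):\n",
"    completion = openai.ChatCompletion.create(\n",
"        engine=\"gpt-35-turbo-16k\",\n",
"        messages=[{\n",
"            \"role\": \"user\",\n",
"            \"content\": \"Suggest a short topic name for a group of articles titled: \" + \"; \".join(titles),\n",
"        }],\n",
"        temperature=0.3,\n",
"        max_tokens=20,\n",
"    )\n",
"    print(f\"Cluster {label + 1}: {completion['choices'][0]['message']['content']}\")"
]
},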
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In conclusion, we can achieve pretty good clustering using both techniques. Using LLMs to summarise the articles and then asking for a clustering gets us excellent results, but at a price - we need one minute and a half to complete the task, and the costs for making the LLM calls add up. Learning the clustering from the embeddings directly is very fast (4 seconds) and inexpensive, but getting high quality results is much harder, and we don't get the added benefit of the LLM being able to reason about the clusters and give them a clear description, we have to do that using a human, and that's many times more expensive."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
