lesson 2 files uploaded

l-newbould · Oct 5, 2023 · commit 9b40179

Showing 8 changed files with 1,830 additions and 0 deletions.
diff --git a/2.2 Lowercase.ipynb b/2.2 Lowercase.ipynb
@@ -0,0 +1,119 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 2.2 Lowercase"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "An important first step in working with text data is simply converting it to lowercase. Why do we do this? It helps maintain consistency in our data and our output. Whether we're doing exploratory analysis or machine learning, we want variants of a word to be understood and counted as the same word; otherwise a model might treat a capitalised word differently from its lowercase form. Lowercasing ensures conformity.\n",
+ "\n",
+ "It also makes further cleaning of the data easier, as we don’t have to account for different cases.\n",
+ "\n",
+ "However, remember that lowercasing can change the meaning of some text, e.g. \"US\" in uppercase is understood as a country, whereas \"us\" is a pronoun.\n",
+ "\n",
+ "Let's take a look at how easy it is to convert our data to lowercase using Python's built-in `lower()` string method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Her cat's name is Luna\n"
+ ]
+ }
+ ],
+ "source": [
+ "sentence = \"Her cat's name is Luna\"\n",
+ "print(sentence)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "her cat's name is luna\n"
+ ]
+ }
+ ],
+ "source": [
+ "lower_sentence = sentence.lower()\n",
+ "print(lower_sentence)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Could you pass me the TV remote?', 'It is IMPOSSIBLE to find this hotel', 'Want to go for dinner on Tuesday?']\n"
+ ]
+ }
+ ],
+ "source": [
+ "sentence_list = ['Could you pass me the TV remote?', \n",
+ " 'It is IMPOSSIBLE to find this hotel', \n",
+ " 'Want to go for dinner on Tuesday?']\n",
+ "print(sentence_list)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['could you pass me the tv remote?', 'it is impossible to find this hotel', 'want to go for dinner on tuesday?']\n"
+ ]
+ }
+ ],
+ "source": [
+ "lower_sentence_list = [x.lower() for x in sentence_list]\n",
+ "print(lower_sentence_list)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
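A minimal sketch of the two points made in 2.2 Lowercase: lowercasing merges counts for the same word, and it can erase distinctions such as "US" versus "us". The `PROTECTED` set and the `lower_except` helper are hypothetical names introduced here for illustration; they are not part of the notebook, and a real project would need its own list of case-sensitive terms.

```python
from collections import Counter

# Hypothetical set of tokens whose case carries meaning (an assumption, not from the notebook).
PROTECTED = {"US", "TV"}

def lower_except(text, protected=PROTECTED):
    """Lowercase every whitespace-separated token unless it is in `protected`."""
    return " ".join(tok if tok in protected else tok.lower() for tok in text.split())

text = "The US team met us at the hotel"

# Without lowercasing, "The" and "the" are counted as different words.
raw_counts = Counter(text.split())
# After lowercasing, they are merged, but so are "US" and "us".
lower_counts = Counter(text.lower().split())

print(raw_counts["the"], lower_counts["the"])
print(lower_counts["us"])
print(lower_except(text))
```

Whether to protect tokens like this depends on the task; for plain word counts, lowercasing everything, as the notebook does, is usually enough.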
diff --git a/2.3 Stopwords.ipynb b/2.3 Stopwords.ipynb
@@ -0,0 +1,161 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 2.3 Stopwords\n",
+ "\n",
+ "In this lesson we'll be using the nltk package to remove stop words from text.\n",
+ "\n",
+ "Stop words are common words in a language which don't carry much meaning, e.g. \"and\", \"of\", \"a\", \"to\".\n",
+ "\n",
+ "We remove these words because doing so strips a lot of complexity from the data. They add little meaning to the text, so removing them leaves a smaller, cleaner dataset. Smaller, cleaner datasets often lead to increased accuracy in machine learning and also speed up processing."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "[nltk_data] Downloading package stopwords to /home/lauren/nltk_data...\n",
+ "[nltk_data] Package stopwords is already up-to-date!\n"
+ ]
+ }
+ ],
+ "source": [
+ "# import packages\n",
+ "import nltk\n",
+ "nltk.download('stopwords')\n",
+ "from nltk.corpus import stopwords"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# assign our stop words to a variable\n",
+ "en_stopwords = stopwords.words('english')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# print the list of stop words to see what we will be removing\n",
+ "print(en_stopwords)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sentence = \"it was too far to go to the shop and he did not want her to walk\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "far go shop want walk\n"
+ ]
+ }
+ ],
+ "source": [
+ "# keep each word in the sentence only if it is not in the list of stop words\n",
+ "sentence_no_stopwords = ' '.join([word for word in sentence.split() if word not in en_stopwords])\n",
+ "print(sentence_no_stopwords)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# remove \"did\" and \"not\" from the stop word list so they stay in the text\n",
+ "en_stopwords.remove(\"did\")\n",
+ "en_stopwords.remove(\"not\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# add custom stop words\n",
+ "en_stopwords.append(\"go\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "far shop did not want walk\n"
+ ]
+ }
+ ],
+ "source": [
+ "sentence_no_stopwords_custom = ' '.join([word for word in sentence.split() if word not in en_stopwords])\n",
+ "print(sentence_no_stopwords_custom)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
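The stop word steps in 2.3 Stopwords can be sketched as two small functions. This is an illustration under stated assumptions, not the notebook's method: `BASE_STOPWORDS` is a short hand-written stand-in for `stopwords.words('english')` so the example runs without downloading NLTK data, and `remove_stopwords` and `customise` are hypothetical helper names. The sketch also lowercases each word before the lookup and builds a new set instead of mutating the original list.

```python
# Hand-written stand-in for stopwords.words('english') (an assumption, so no download is needed).
BASE_STOPWORDS = ["it", "was", "too", "to", "the", "and", "he", "did", "not", "her"]

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated word whose lowercase form is a stop word."""
    stopword_set = set(stopwords)  # set membership tests are faster than list scans
    return " ".join(w for w in text.split() if w.lower() not in stopword_set)

def customise(stopwords, keep=(), extra=()):
    """Return a new stop word set without the `keep` words and with the `extra` words."""
    return (set(stopwords) - set(keep)) | set(extra)

sentence = "It was too far to go to the shop and he did not want her to walk"

print(remove_stopwords(sentence, BASE_STOPWORDS))

# Keep the negation "did not" in the text and treat "go" as a stop word.
custom = customise(BASE_STOPWORDS, keep=["did", "not"], extra=["go"])
print(remove_stopwords(sentence, custom))
```

Returning a new set leaves the base list untouched, so re-running a cell does not raise the `ValueError` that a repeated `list.remove` call would.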