-
Notifications
You must be signed in to change notification settings - Fork 21
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit 9b40179
Showing
8 changed files
with
1,830 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 2.2 Lowercase" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"An important first step in working with text data is simply converting it into lowercase. Why do we do this? Well, it helps maintain consistency in our data and our output. When we're working with text, be that exploratory analysis or machine learning, we want to ensure words are understood and counted as the same word, your model might treat a word with a capital letter different from the same word without any capital letter. Lowercasing ensures conformity.\n", | ||
"\n", | ||
"It also make it easier to continue with additonal cleaning of the data as we don’t have to account for different cases.\n", | ||
"\n", | ||
"However, do remember that lowercasing can change the meaning of some text e.g \"US\" in uppercase is understood as a country, as opposed to \"us\".\n", | ||
"\n", | ||
"Let's take a look at how easy it is to convert our data to lowercase using python's built in lower() function." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Her cat's name is Luna\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"sentence = \"Her cat's name is Luna\"\n", | ||
"print(sentence)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"her cat's name is luna\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"lower_sentence = sentence.lower()\n", | ||
"print(lower_sentence)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"['Could you pass me the TV remote?', 'It is IMPOSSIBLE to find this hotel', 'Want to go for dinner on Tuesday?']\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"sentence_list = ['Could you pass me the TV remote?', \n", | ||
" 'It is IMPOSSIBLE to find this hotel', \n", | ||
" 'Want to go for dinner on Tuesday?']\n", | ||
"print(sentence_list)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"['could you pass me the tv remote?', 'it is impossible to find this hotel', 'want to go for dinner on tuesday?']\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"lower_sentence_list = [x.lower() for x in sentence_list]\n", | ||
"print(lower_sentence_list)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,161 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 2.3 Stopwords\n", | ||
"\n", | ||
"In this lesson we'll be using the nltk package to remove stop words from text.\n", | ||
"\n", | ||
"Stop words are common words in the language which don't carry much meaning e.g. \"and\", \"of\", \"a\", \"to\". \n", | ||
"\n", | ||
"We remove these words because it removes a lot of complexity from the data. These words don't add much meaning to text so by removing them we are left with a smaller, cleaner dataset. Smaller, cleaner datasets often lead to increased accuracy in machine learning and will also speed up processing times." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"[nltk_data] Downloading package stopwords to /home/lauren/nltk_data...\n", | ||
"[nltk_data] Package stopwords is already up-to-date!\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# import packages\n", | ||
"import nltk\n", | ||
"nltk.download('stopwords')\n", | ||
"from nltk.corpus import stopwords" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# assign our stop words to a variable\n", | ||
"en_stopwords = stopwords.words('english')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# print the list of stop words to see what we will be removing\n", | ||
"print(en_stopwords)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sentence = \"it was too far to go to the shop and he did not want her to walk\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"far go shop want walk\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# keep the words in the sentance if the word is not in the list of stop words\n", | ||
"sentance_no_stopwords = ' '.join([word for word in sentence.split() if word not in (en_stopwords)])\n", | ||
"print(sentance_no_stopwords)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# removing stop words from list\n", | ||
"en_stopwords.remove(\"did\")\n", | ||
"en_stopwords.remove(\"not\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# add custom stop words\n", | ||
"en_stopwords.append(\"go\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"far shop did not want walk\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"sentance_no_stopwords_custom = ' '.join([word for word in sentence.split() if word not in (en_stopwords)])\n", | ||
"print(sentance_no_stopwords_custom)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
Oops, something went wrong.