lesson 2 files uploaded
l-newbould committed Oct 5, 2023
0 parents commit 9b40179
Showing 8 changed files with 1,830 additions and 0 deletions.
119 changes: 119 additions & 0 deletions 2.2 Lowercase.ipynb
@@ -0,0 +1,119 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.2 Lowercase"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An important first step in working with text data is simply converting it into lowercase. Why do we do this? Well, it helps maintain consistency in our data and our output. When we're working with text, be that exploratory analysis or machine learning, we want to ensure words are understood and counted as the same word, your model might treat a word with a capital letter different from the same word without any capital letter. Lowercasing ensures conformity.\n",
"\n",
"It also make it easier to continue with additonal cleaning of the data as we don’t have to account for different cases.\n",
"\n",
"However, do remember that lowercasing can change the meaning of some text e.g \"US\" in uppercase is understood as a country, as opposed to \"us\".\n",
"\n",
"Let's take a look at how easy it is to convert our data to lowercase using python's built in lower() function."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Her cat's name is Luna\n"
]
}
],
"source": [
"sentence = \"Her cat's name is Luna\"\n",
"print(sentence)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"her cat's name is luna\n"
]
}
],
"source": [
"lower_sentence = sentence.lower()\n",
"print(lower_sentence)"
]
},
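{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the caveat above: once lowercased, \"US\" (the country) and \"us\" (the pronoun) become the same token, so that distinction is lost."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# \"US\" and \"us\" both become \"us\" once lowercased\n",
"print(\"The US sent us a letter\".lower())"
]
},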
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Could you pass me the TV remote?', 'It is IMPOSSIBLE to find this hotel', 'Want to go for dinner on Tuesday?']\n"
]
}
],
"source": [
"sentence_list = ['Could you pass me the TV remote?', \n",
" 'It is IMPOSSIBLE to find this hotel', \n",
" 'Want to go for dinner on Tuesday?']\n",
"print(sentence_list)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['could you pass me the tv remote?', 'it is impossible to find this hotel', 'want to go for dinner on tuesday?']\n"
]
}
],
"source": [
"lower_sentence_list = [x.lower() for x in sentence_list]\n",
"print(lower_sentence_list)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
161 changes: 161 additions & 0 deletions 2.3 Stopwords.ipynb
@@ -0,0 +1,161 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.3 Stopwords\n",
"\n",
"In this lesson we'll be using the nltk package to remove stop words from text.\n",
"\n",
"Stop words are common words in the language which don't carry much meaning e.g. \"and\", \"of\", \"a\", \"to\". \n",
"\n",
"We remove these words because it removes a lot of complexity from the data. These words don't add much meaning to text so by removing them we are left with a smaller, cleaner dataset. Smaller, cleaner datasets often lead to increased accuracy in machine learning and will also speed up processing times."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /home/lauren/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
}
],
"source": [
"# import packages\n",
"import nltk\n",
"nltk.download('stopwords')\n",
"from nltk.corpus import stopwords"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# assign our stop words to a variable\n",
"en_stopwords = stopwords.words('english')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"
]
}
],
"source": [
"# print the list of stop words to see what we will be removing\n",
"print(en_stopwords)"
]
},
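{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `stopwords.words('english')` returns a regular Python list, we can inspect it like any other list, e.g. check how many stop words there are or whether a particular word is included:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the stop word list is a plain Python list, so we can inspect it directly\n",
"print(len(en_stopwords))\n",
"print(\"and\" in en_stopwords)"
]
},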
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"sentence = \"it was too far to go to the shop and he did not want her to walk\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"far go shop want walk\n"
]
}
],
"source": [
"# keep the words in the sentance if the word is not in the list of stop words\n",
"sentance_no_stopwords = ' '.join([word for word in sentence.split() if word not in (en_stopwords)])\n",
"print(sentance_no_stopwords)"
]
},
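{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this comparison is case-sensitive, and the nltk stop words are all lowercase. Our sentence is already lowercase, but in general it's worth lowercasing the text first (as in lesson 2.2). A quick sketch with a made-up mixed-case sentence:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mixed_case = \"It was TOO far to go to the shop\"\n",
"\n",
"# without lowercasing, capitalised stop words like \"It\" and \"TOO\" are not matched\n",
"print(' '.join([word for word in mixed_case.split() if word not in en_stopwords]))\n",
"\n",
"# lowercasing first means every stop word is caught\n",
"print(' '.join([word for word in mixed_case.lower().split() if word not in en_stopwords]))"
]
},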
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# removing stop words from list\n",
"en_stopwords.remove(\"did\")\n",
"en_stopwords.remove(\"not\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# add custom stop words\n",
"en_stopwords.append(\"go\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"far shop did not want walk\n"
]
}
],
"source": [
"sentance_no_stopwords_custom = ' '.join([word for word in sentence.split() if word not in (en_stopwords)])\n",
"print(sentance_no_stopwords_custom)"
]
},
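{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we could wrap lowercasing and stop word removal into a small reusable function. This is just a sketch (the name `remove_stopwords` is our own, not something provided by nltk):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a small helper: lowercase the text, then drop any stop words\n",
"def remove_stopwords(text, stop_words):\n",
"    words = text.lower().split()\n",
"    return ' '.join([word for word in words if word not in stop_words])\n",
"\n",
"print(remove_stopwords(\"It was too far to go to the shop\", en_stopwords))"
]
},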
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}