{ "cells": [ { "cell_type": "code", "execution_count": null, "source": [ "'''\r\n", "What is Stemming?\r\n", "Stemming is the process of producing morphological variants of a root/base word. \r\n", "Stemming programs are commonly referred to as stemming algorithms or stemmers.\r\n", "\r\n", "A stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to \r\n", "the root word, “chocolate” and “retrieval”, “retrieved”, “retrieves” reduce to the stem “retrieve”.\r\n", "'''" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "from nltk.stem import PorterStemmer\r\n", "from nltk.tokenize import word_tokenize" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "ps = PorterStemmer()\r\n", "\r\n", "example_words = [\"python\", \"pythonic\", \"pythoner\", \"pythoning\", \"pythoned\", \"pythonly\"]" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "for w in example_words:\r\n", " print(ps.stem(w))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "python\n", "python\n", "python\n", "python\n", "python\n", "pythonli\n" ] } ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "new_text = \"It is very important to be pythonly while you're pythoning with python. All pythoners should be pythonic. All pythoners have pythoned poorly atleast once.\"" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "for w in word_tokenize(new_text):\r\n", " print(ps.stem(w))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "It\n", "is\n", "veri\n", "import\n", "to\n", "be\n", "pythonli\n", "while\n", "you\n", "'re\n", "python\n", "with\n", "python\n", ".\n", "all\n", "python\n", "should\n", "be\n", "python\n", ".\n", "all\n", "python\n", "have\n", "python\n", "poorli\n", "atleast\n", "onc\n", ".\n" ] } ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "\"Lets try on some bigger set of sentences / paragraph\"" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "from nltk.tokenize import word_tokenize\r\n", "from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer\r\n", "\r\n", "porter = PorterStemmer()\r\n", "snowball = SnowballStemmer(language='english')\r\n", "lanc = LancasterStemmer()\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ "# Example of EU_Definition:\r\n", "\r\n", "eu_definition = '''\r\n", "The European Union (EU) is a political and economic union of 27 member states that are located primarily in Europe. \r\n", "Its members have a combined area of 4,233,255.3 km2 (1,634,469.0 sq mi) and an estimated total population of about 447 million. \r\n", "The EU has developed an internal single market through a standardised system of laws that apply in all member states in those matters, \r\n", "and only those matters, where members have agreed to act as one. EU policies aim to ensure the free movement of people, goods, \r\n", "services and capital within the internal market; enact legislation in justice and home affairs; and maintain common policies on trade, \r\n", "agriculture, fisheries and regional development. Passport controls have been abolished for travel within the Schengen Area. \r\n", "A monetary union was established in 1999, coming into full force in 2002, and is composed of 19 EU member states which use the euro \r\n", "currency. The EU has often been described as a sui generis political entity (without precedent or comparison).\r\n", "The EU and European citizenship were established when the Maastricht Treaty came into force in 1993. \r\n", "The EU traces its origins to the European Coal and Steel Community (ECSC) and the European Economic Community (EEC), established, \r\n", "respectively, by the 1951 Treaty of Paris and 1957 Treaty of Rome. The original members of what came to be known as the European \r\n", "Communities were the Inner Six: Belgium, France, Italy, Luxembourg, the Netherlands, and West Germany. The Communities and their \r\n", "successors have grown in size by the accession of new member states and in power by the addition of policy areas to their remit. \r\n", "The United Kingdom became the first member state to leave the EU on 31 January 2020. Before this, three territories of member states \r\n", "had left the EU or its forerunners. The latest major amendment to the constitutional basis of the EU, the Treaty of Lisbon, \r\n", "came into force in 2009.\r\n", "'''" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "\r\n", "sentence_example = (\r\n", " 'This is definitely a controversy as the attorney labeled the case \"extremely controversial\"'\r\n", ")\r\n", "\r\n", "# Porter Stemmed version of sentence example\r\n", "stemmed_sentence = [\r\n", " porter.stem(word) for word in word_tokenize(sentence_example)\r\n", "]\r\n", "\r\n", "\r\n", "\r\n", "# Tokenizing and Stemming the eu_definition\r\n", "tokenized_eu = word_tokenize(eu_definition)\r\n", "porter_eu = [porter.stem(word) for word in tokenized_eu]\r\n", "\r\n", "# Examples of single words used in the post:\r\n", "porter.stem('cats')\r\n", "porter.stem('amazing')\r\n", "porter.stem('amazement')\r\n", "porter.stem('amaze')\r\n", "porter.stem('amazed')\r\n", "porter.stem('amazon')\r\n", "porter.stem('nation')\r\n", "porter.stem('premonition')\r\n", "\r\n", "# Comparison between Porter and Snowball\r\n", "porter.stem('loudly')\r\n", "snowball.stem('loudly')\r\n", "\r\n", "# Comparison between Snowball and Lancaster\r\n", "porter.stem('salty')\r\n", "snowball.stem('salty')\r\n" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'salti'" ] }, "metadata": {}, "execution_count": 11 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 12, "source": [ "len(''.join(porter_eu))" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1430" ] }, "metadata": {}, "execution_count": 12 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 13, "source": [ "len(''.join(word_tokenize(eu_definition)))\r\n" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1591" ] }, "metadata": {}, "execution_count": 13 } ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "#The percentage of information retained is the ratio between these numbers (1430/1591), around 89.88%." ], "outputs": [], "metadata": {} } ], "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3.6.10 64-bit ('tf-gpu': conda)" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "interpreter": { "hash": "135e78ef6267b613ce7b86630936d470174b66187aad9f784a45e5cc3235687c" } }, "nbformat": 4, "nbformat_minor": 2 }