{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Natural language Processing With SpaCy and Python\n", "+ NLP a form of AI or Artificial Intelligence (Building systems that can do intelligent things)\n", "+ NLP or Natural Language Processing – Building systems that can understand everyday language. It is a subset of Artificial Intelligence. \n", "+ SpaCy by Explosion.ai (Matthew Honnibal)\n", "![alt text](SpaCy_logo.png \"Title\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Basic Terms\n", "+ Tokenization:\tSegmenting text into words, punctuations marks etc.\n", "+ Part-of-speech: (POS) Tagging\tAssigning word types to tokens, like verb or noun.\n", "+ Dependency Parsing:\tAssigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.\n", "+ Lemmatization:\tAssigning the base forms of words. For example, the lemma of \"was\" is \"be\", and the lemma of \"rats\" is \"rat\".\n", "+ Named Entity Recognition (NER):\tLabelling named \"real-world\" objects, like persons, companies or locations.\n", "+ Similarity:\tComparing words, text spans and documents and how similar they are to each other.\n", "+ Sentence Boundary Detection (SBD):\tFinding and segmenting individual sentences.\n", "+ Text Classification:\tAssigning categories or labels to a whole document, or parts of a document.\n", "+ Rule-based Matching:\tFinding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.\n", "+ Training:\tUpdating and improving a statistical model's predictions.\n", "+ Serialization:\tSaving objects to files or byte strings." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installing the Library on Linux/Unix\n", "+ sudo pip install spacy\n", "+ sudo python -m spacy download en\n", "+ sudo python -m spacy download fr\n", "\n", "### Installing On Windows using Conda\n", "\n", "+ conda install tqdm\n", "+ conda install -c conda-forge spacy /conda install spacy\n", "+ python -m spacy download en\n", "- - with cmd administrator\n", "\n", "\n", "### Installing using Conda\n", "+ conda install -c conda-forge spacy\n", "+ sudo python -m spacy download en\n", "+ sudo python -m spacy download fr\n", "\n", "\n", "\n", "\n", "#### For Download the Models of other languages\n", "+ sudo python -m spacy download de\n", "+ sudo python -m spacy download es\n", "+ sudo python -m spacy download xx \n", "\n", "\n", "### Installing On Windows using Conda\n", "+ conda config-add channel conda-forge\n", "+ conda update anaconda\n", "+ conda install tqdm\n", "+ conda install -c conda-forge spacy\n", "+ sudo python -m spacy download en" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Loading the package\n", "import spacy\n", "nlp = spacy.load('en')\n", "\n", "#nlp = en_core_web_sm.load()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](BehindSpacy.jpg \"Title\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading A Document or Text" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Reading the text /tokens\n", "docx = nlp(\"SpaCy is a cool tool\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SpaCy is a cool tool" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docx" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "docx2 = nlp(u\"SpaCy is an amazing tool like nltk\")" ] }, { "cell_type": "code", 
"execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Reading a file\n", "myfile = open(\"examplefile.txt\").read()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "doc_file = nlp(myfile)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "The best error message is the one that never shows up.\n", "Gates, currently a journalist in America, played with David Cotton in the final years of the seventies in Newcastle.\n", "\n", "You Learn More From Failure Than From Success. \n", "The purpose of software engineering is to control complexity, not to create it\n", "\n", "This is one of the most interesting programming books I have ever read, and it's so easy to jump right in and play with the NLTK. I have devoured this book.\n", "\n", "Although this text is available for free online through NLTK, it is an incredible resource for anybody trying to get started with NLP in Python. With only basic knowledge of Python and a week with this book, I was able to write a work application to identify key themes in survey data utilizing part of speech tagging and a custom built Regex parser. I highly recommend this text." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calling the file\n", "doc_file" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Simplified one liner\n", "doc_file2 = nlp(open(\"examplefile.txt\").read())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "The best error message is the one that never shows up.\n", "Gates, currently a journalist in America, played with David Cotton in the final years of the seventies in Newcastle.\n", "\n", "You Learn More From Failure Than From Success. 
\n", "The purpose of software engineering is to control complexity, not to create it\n", "\n", "This is one of the most interesting programming books I have ever read, and it's so easy to jump right in and play with the NLTK. I have devoured this book.\n", "\n", "Although this text is available for free online through NLTK, it is an incredible resource for anybody trying to get started with NLP in Python. With only basic knowledge of Python and a week with this book, I was able to write a work application to identify key themes in survey data utilizing part of speech tagging and a custom built Regex parser. I highly recommend this text." ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc_file2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Decoding the file as UTF-8\n", "#myfile2 = open(\"examplefile.txt\").read().decode('utf8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentence Tokens\n", "+ Tokenization == Splitting or segmenting the text into sentences or tokens\n", "+ .sent" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "The best error message is the one that never shows up.\n", "Gates, currently a journalist in America, played with David Cotton in the final years of the seventies in Newcastle.\n", "\n", "You Learn More From Failure Than From Success. \n", "The purpose of software engineering is to control complexity, not to create it\n", "\n", "This is one of the most interesting programming books I have ever read, and it's so easy to jump right in and play with the NLTK. I have devoured this book.\n", "\n", "Although this text is available for free online through NLTK, it is an incredible resource for anybody trying to get started with NLP in Python. 
With only basic knowledge of Python and a week with this book, I was able to write a work application to identify key themes in survey data utilizing part of speech tagging and a custom built Regex parser. I highly recommend this text." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List of Sentences in File\n", "doc_file" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: The best error message is the one that never shows up.\n", "\n", "1: Gates, currently a journalist in America, played with David Cotton in the final years of the seventies in Newcastle.\n", "\n", "\n", "2: You Learn More From Failure\n", "3: Than From Success. \n", "\n", "4: The purpose of software engineering is to control complexity, not to create it\n", "\n", "\n", "5: This is one of the most interesting programming books I have ever read, and it's so easy to jump right in and play with the NLTK.\n", "6: I have devoured this book.\n", "\n", "\n", "7: Although this text is available for free online through NLTK, it is an incredible resource for anybody trying to get started with NLP in Python.\n", "8: With only basic knowledge of Python and a week with this book, I was able to write a work application to identify key themes in survey data utilizing part of speech tagging and a custom built Regex parser.\n", "9: I highly recommend this text.\n" ] } ], "source": [ "# Sentence Tokens\n", "for num,sentence in enumerate(doc_file.sents):\n", " #print(f'{num}: {sentence}') # For Python 3.6 upwards\n", " print('{0}: {1}'.format(num,sentence))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Word Tokens\n", "+ Splitting or segmenting the text into words\n", "+ .text" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "doc = nlp(u\"Spacy is an amazing tool\")" ] }, { "cell_type": "code", "execution_count": 
13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Spacy\n", "is\n", "an\n", "amazing\n", "tool\n" ] } ], "source": [ "# Word Tokens\n", "for token in doc:\n", " print(token.text)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Spacy', 'is', 'an', 'amazing', 'tool']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List of Word Tokens\n", "[token.text for token in doc ]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Spacy', 'is', 'an', 'amazing', 'tool']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Similar to splitting on spaces\n", "doc.text.split(\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More about words\n", "+ .shape_ ==> for shape of word eg. capital,lowercase,etc\n", "+ .is_alpha ==> returns boolean(true or false) if word is alphabet\n", "+ .is_stop ==> returns boolean(true or false) if word is a stop word" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SpaCy is a cool tool" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docx" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SpaCy 14101195205177134206\n", "is 4370460163704169311\n", "a 11123243248953317070\n", "cool 13110060611322374290\n", "tool 13110060611322374290\n" ] } ], "source": [ "# Word Shape As Hash Value\n", "for word in docx:\n", " print(word.text,word.shape)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SpaCy XxxXx\n", "is xx\n", "a x\n", "cool xxxx\n", "tool xxxx\n" ] } ], "source": [ "# Word Shape As readable 
representation\n", "for word in docx:\n", " print(word.text,word.shape_)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "ex_doc = nlp(\"Hello hello HELLO HeLLO\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Token => Hello Shape Xxxxx True False\n", "Token => hello Shape xxxx True False\n", "Token => HELLO Shape XXXX True False\n", "Token => HeLLO Shape XxXXX True False\n" ] } ], "source": [ "for word in ex_doc:\n", " print(\"Token =>\", word.text, \"Shape \",word.shape_,word.is_alpha,word.is_stop)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part of Speech Tagging\n", "+ NB attribute_ ==> Returns readable string representation of attribute\n", "+ .pos\n", "+ .pos_ ==> exposes Google Universal pos_tag,simple \n", "+ .tag\n", "+ .tag_ ==> exposes Treebank, detailed,for training your own model\n", "+ + Uses \n", "- - Sentiment Analysis,Homonym Disambuguity ,Prediction" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Parts of Speech\n", "ex1 = nlp(\"He drinks a drink\")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "He PRON\n", "drinks VERB\n", "a DET\n", "drink NOUN\n" ] } ], "source": [ "# pos_ = Parts of Speech Simplified\n", "for word in ex1:\n", " print(word.text,word.pos_)\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Parts of Speech Simple Term (.pos_)\n", "ex2 = nlp(\"I fish a fish\")" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I PRON PRP\n", "fish VERB VBP\n", "a DET DT\n", "fish NOUN NN\n" ] } ], "source": [ "for word in ex2:\n", " print(word.text,word.pos_,word.tag_)" ] }, { "cell_type": "code", "execution_count": 25, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I PRON PRP\n", "fish VERB VBP\n", "a DET DT\n", "fish NOUN NN\n" ] } ], "source": [ "# Parts of Speech Detailed (.tag_) (Good for training your own model,features) \n", "# Parts of Speech of Tag \n", "for word in ex2:\n", " print(word.text,word.pos_,word.tag_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### If you want to know the meaning of the pos abbreviation\n", "+ spacy.explain('DT')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'verb, non-3rd person singular present'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy.explain('VBP')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "exercise1 = nlp(u\"All the faith he had had had had no effect on the outcome of his life\")\n", "#the first is a modifier while the second is the main verb of the sentence\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('All', 'PDT', 'ADJ')\n", "('the', 'DT', 'DET')\n", "('faith', 'NN', 'NOUN')\n", "('he', 'PRP', 'PRON')\n", "('had', 'VBD', 'VERB')\n", "('had', 'VBN', 'VERB')\n", "('had', 'VBN', 'VERB')\n", "('had', 'VBN', 'VERB')\n", "('no', 'DT', 'DET')\n", "('effect', 'NN', 'NOUN')\n", "('on', 'IN', 'ADP')\n", "('the', 'DT', 'DET')\n", "('outcome', 'NN', 'NOUN')\n", "('of', 'IN', 'ADP')\n", "('his', 'PRP$', 'ADJ')\n", "('life', 'NN', 'NOUN')\n" ] } ], "source": [ "for word in exercise1:\n", " print((word.text,word.tag_,word.pos_))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "exercise2 = nlp(\"The man the professor the student has studies Rome.\")\n", "#The student has the professor who knows the man who studies ancient Rome" ] }, { "cell_type": "code", "execution_count": 30, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('The', 'DT', 'DET')\n", "('man', 'NN', 'NOUN')\n", "('the', 'DT', 'DET')\n", "('professor', 'NN', 'NOUN')\n", "('the', 'DT', 'DET')\n", "('student', 'NN', 'NOUN')\n", "('has', 'VBZ', 'VERB')\n", "('studies', 'NNS', 'NOUN')\n", "('Rome', 'NNP', 'PROPN')\n", "('.', '.', 'PUNCT')\n" ] } ], "source": [ "for word in exercise2:\n", " print((word.text,word.tag_,word.pos_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Syntactic Dependency\n", "+ It helps us to know the relation between tokens \n", "+ How each word is connected and dependent on each other" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "ex3 = nlp(\"Sally likes Sam\")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Sally', 'NNP', 'PROPN', 'advmod')\n", "('likes', 'VBZ', 'VERB', 'ROOT')\n", "('Sam', 'NNP', 'PROPN', 'dobj')\n" ] } ], "source": [ "for word in ex3:\n", " print((word.text,word.tag_,word.pos_,word.dep_))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'adverbial modifier'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What does Advmod mean?\n", "spacy.explain('advmod')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing Dependency using displaCy\n", "+ from spacy import displacy\n", "+ displacy.serve()\n", "+ displacy.render(jupyter=True) # for jupyter notebook" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# To dispay the dependences and any other visualization\n", "from spacy import displacy" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Sally\n", " PROPN\n", "\n", "\n", "\n", " likes\n", " 
VERB\n", "\n", "\n", "\n", " Sam\n", " PROPN\n", "\n", "\n", "\n", " \n", " \n", " advmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " dobj\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# For Jupyter Notebooks you can set jupter=True to render it properly\n", "displacy.render(ex3,style='dep',jupyter=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualizing Named Entity Recognistion \n", "#displacy.render(ex1,style='ent',jupyter=True,options={'distance':140})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing using displaCy\n", "+ For IDEs\n", "+ For Jupyter notebook\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For IDEs\n", "#from spacy import displacy" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "docx3 = nlp('Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Buffalo', 'NNP', 'PROPN', 'compound')\n", "('buffalo', 'NN', 'NOUN', 'compound')\n", "('Buffalo', 'NNP', 'PROPN', 'compound')\n", "('buffalo', 'NN', 'NOUN', 'compound')\n", "('buffalo', 'NN', 'NOUN', 'compound')\n", "('buffalo', 'NN', 'NOUN', 'compound')\n", "('Buffalo', 'NNP', 'PROPN', 'compound')\n", "('buffalo', 'NN', 'NOUN', 'ROOT')\n" ] } ], "source": [ "for word in docx3:\n", " print((word.text,word.tag_,word.pos_,word.dep_))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start a server running on your localhost\n", "#displacy.serve(docx3,style='dep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using displaCy in Jupyter notebooks\n", "+ displacy.render(jupyter=True)" ] }, { "cell_type": "code", 
"execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(docx3,style='dep',jupyter=True)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "## Customizing the Diplays\n", "# Compact set it to square arrows or curved arrows\n", "# Color:#09a3d5\n", "options = {'compact': True, 'bg': 'cornflowerblue',\n", " 'color': '#fff', 'font': 'Sans Serif'}\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", 
"\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(docx3,style='dep',options=options,jupyter=True)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# Adding Title\n", "docx3.user_data['title']= 'Buffalo Complex Sentence'" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " Buffalo\n", " PROPN\n", "\n", "\n", "\n", " buffalo\n", " NOUN\n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(docx3,style='dep',options=options,jupyter=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rendering HTML\n", "+ Default is svg\n", "+ set page to True\n", "+ minify=True For Minified format" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "html = displacy.render(docx3,style='dep',page=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exporting The Rendered Graphic" ] }, { "cell_type": "code", 
"execution_count": 44, "metadata": {}, "outputs": [], "source": [ "svg = displacy.render(doc, style='dep')\n", "output = 'buffalosentence.svg'\n", "with open(output,'w', encoding='utf-8') as f:\n", " f.write(svg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternative Method\n", "# svg = displacy.render(doc, style='dep')\n", "# output_path = Path('/images/sentence.svg')\n", "# output_path.open('w', encoding='utf-8').write(svg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Named Entity Recognition or Detection\n", "+ Classifying a text into predefined categories or real world object entities.\n", "+ takes a string of text (sentence or paragraph) as input and identifies relevant nouns (people, places, and organizations) that are mentioned in that string.\n", "\n", "##### Uses\n", "+ Classifying or Categorizing contents by getting the relevant tags\n", "+ Improve search algorithms\n", "+ For content recommendations\n", "+ For info extraction\n", "\n", "+ .ents\n", "+ .label_" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "wikitext = nlp(u\"By 2020 the telecom company Orange, will relocate from Turkey to Orange County in the U.S. close to Apple.It will cost them 2 billion dollars.\")" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2020 DATE\n", "Turkey GPE\n", "Orange County GPE\n", "U.S. 
GPE\n", "Apple ORG\n", "2 billion dollars MONEY\n" ] } ], "source": [ "for entity in wikitext.ents:\n", " print(entity.text,entity.label_)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Countries, cities, states'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What does GPE means\n", "spacy.explain('GPE')" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
By \n", "\n", " 2020\n", " DATE\n", "\n", " the telecom company Orange, will relocate from \n", "\n", " Turkey\n", " GPE\n", "\n", " to \n", "\n", " Orange County\n", " GPE\n", "\n", " in the \n", "\n", " U.S.\n", " GPE\n", "\n", " close to \n", "\n", " Apple\n", " ORG\n", "\n", ".It will cost them \n", "\n", " 2 billion dollars\n", " MONEY\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize With DiSplaCy\n", "displacy.render(wikitext,style='ent',jupyter=True)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "wikitext2 = nlp(u\"Linus Benedict Torvalds is a Finnish-American software engineer who is the creator, and for a long time, principal developer of the Linux kernel, which became the kernel for operating systems such as the Linux operating systems, Android, and Chrome OS.\")" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Linus Benedict Torvalds\n", " PERSON\n", "\n", " is a \n", "\n", " Finnish\n", " NORP\n", "\n", "-American software engineer who is the creator, and for a long time, principal developer of the \n", "\n", " Linux\n", " ORG\n", "\n", " kernel, which became the kernel for operating systems such as the \n", "\n", " Linux\n", " ORG\n", "\n", " operating systems, \n", "\n", " Android\n", " GPE\n", "\n", ", and Chrome OS.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize With DiSplaCy\n", "displacy.render(wikitext2,style='ent',jupyter=True)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Nationalities or religious or political groups'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy.explain('NORP')" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "doc1 = nlp(\"Facebook, Explosion.ai, JCharisTech are all internet companies\")" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Facebook\n", " ORG\n", "\n", ", \n", "\n", " Explosion.ai\n", " ORG\n", "\n", ", \n", "\n", " JCharisTech\n", " ORG\n", "\n", " are all internet companies
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize With DiSplaCy\n", "displacy.render(doc1,style='ent',jupyter=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text Normalization and Word Inflection\n", "+ Word inflection == syntactic differences between word forms \n", "+ Reducing a word to its base/root form\n", "+ Lemmatization **\n", "- - a word based on its intended meaning\n", "+ Stemming \n", "- - Cutting of the prefixes/suffices to reduce a word to base form\n", "+ Word Shape Analysis" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "## Lemmatization \n", "docx_lemma = nlp(\"studying student study studies studio studious\")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Token=> studying Lemma=> study VERB\n", "Token=> student Lemma=> student NOUN\n", "Token=> study Lemma=> study NOUN\n", "Token=> studies Lemma=> study NOUN\n", "Token=> studio Lemma=> studio NOUN\n", "Token=> studious Lemma=> studious ADJ\n" ] } ], "source": [ "for word in docx_lemma:\n", " print(\"Token=>\",word.text,\"Lemma=>\",word.lemma_,word.pos_)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "docx_lemma1 = nlp(\"good goods run running runner was be were\")" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Token=> good Lemma=> good ADJ\n", "Token=> goods Lemma=> good NOUN\n", "Token=> run Lemma=> run VERB\n", "Token=> running Lemma=> run VERB\n", "Token=> runner Lemma=> runner NOUN\n", "Token=> was Lemma=> be VERB\n", "Token=> be Lemma=> be VERB\n", "Token=> were Lemma=> be VERB\n" ] } ], "source": [ "for word in docx_lemma1:\n", " print(\"Token=>\",word.text,\"Lemma=>\",word.lemma_,word.pos_)" ] }, { "cell_type": "code", "execution_count": 58, 
"metadata": {}, "outputs": [], "source": [ "docx_lemma2 = nlp(\"walking walks walk walker\")" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Token=> walking Lemma=> walk VERB\n", "Token=> walks Lemma=> walk NOUN\n", "Token=> walk Lemma=> walk VERB\n", "Token=> walker Lemma=> walker ADV\n" ] } ], "source": [ "for word in docx_lemma2:\n", " print(\"Token=>\",word.text,\"Lemma=>\",word.lemma_,word.pos_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Semantic Similarity\n", "+ object1.similarity(object2)\n", "+ Uses:\n", "+ - Recommendation systems\n", "+ - Data Preprocessing eg removing duplicates\n", "- - python -m spacy download en_core_web_lg" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "# Loading Packages\n", "import spacy\n", "nlp = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# Similarity of object\n", "doc1 = nlp(\"wolf\")\n", "doc2 = nlp(\"dog\")" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6759108958707175" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc1.similarity(doc2)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "doc3 = nlp(\"cat\")" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7344887997583573" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc3.similarity(doc2)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "# Synonmys\n", "doc4 = nlp(\"smart\")\n", "doc5 = nlp(\"clever\")" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8051825859624082" ] }, 
"execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Similarity of words\n", "doc4.similarity(doc5)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "similarword = nlp(\"wolf dog cat bird fish\")" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wolf\n", "dog\n", "cat\n", "bird\n", "fish\n" ] } ], "source": [ "for token in similarword:\n", " print(token.text)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('wolf', 'wolf') similarity=> 1.0\n", "('wolf', 'dog') similarity=> 0.5234999\n", "('wolf', 'cat') similarity=> 0.30953428\n", "('wolf', 'bird') similarity=> 0.52796596\n", "('wolf', 'fish') similarity=> 0.051317338\n", "('dog', 'wolf') similarity=> 0.5234999\n", "('dog', 'dog') similarity=> 1.0\n", "('dog', 'cat') similarity=> 0.62507194\n", "('dog', 'bird') similarity=> 0.4794655\n", "('dog', 'fish') similarity=> 0.32915178\n", "('cat', 'wolf') similarity=> 0.30953428\n", "('cat', 'dog') similarity=> 0.62507194\n", "('cat', 'cat') similarity=> 1.0\n", "('cat', 'bird') similarity=> 0.4474155\n", "('cat', 'fish') similarity=> 0.447517\n", "('bird', 'wolf') similarity=> 0.52796596\n", "('bird', 'dog') similarity=> 0.4794655\n", "('bird', 'cat') similarity=> 0.4474155\n", "('bird', 'bird') similarity=> 1.0\n", "('bird', 'fish') similarity=> 0.35412988\n", "('fish', 'wolf') similarity=> 0.051317338\n", "('fish', 'dog') similarity=> 0.32915178\n", "('fish', 'cat') similarity=> 0.447517\n", "('fish', 'bird') similarity=> 0.35412988\n", "('fish', 'fish') similarity=> 1.0\n" ] } ], "source": [ "# Similarity Between Tokens\n", "for token1 in similarword:\n", " for token2 in similarword:\n", " print((token1.text,token2.text),\"similarity=>\",token1.similarity(token2))" ] }, { "cell_type": "code", "execution_count": 
70, "metadata": {}, "outputs": [], "source": [ "#[x for b in a for x in b] \n", "mylist = [(token1.text,token2.text,token1.similarity(token2)) for token2 in similarword for token1 in similarword]" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('wolf', 'wolf', 1.0),\n", " ('dog', 'wolf', 0.5234999),\n", " ('cat', 'wolf', 0.30953428),\n", " ('bird', 'wolf', 0.52796596),\n", " ('fish', 'wolf', 0.051317338),\n", " ('wolf', 'dog', 0.5234999),\n", " ('dog', 'dog', 1.0),\n", " ('cat', 'dog', 0.62507194),\n", " ('bird', 'dog', 0.4794655),\n", " ('fish', 'dog', 0.32915178),\n", " ('wolf', 'cat', 0.30953428),\n", " ('dog', 'cat', 0.62507194),\n", " ('cat', 'cat', 1.0),\n", " ('bird', 'cat', 0.4474155),\n", " ('fish', 'cat', 0.447517),\n", " ('wolf', 'bird', 0.52796596),\n", " ('dog', 'bird', 0.4794655),\n", " ('cat', 'bird', 0.4474155),\n", " ('bird', 'bird', 1.0),\n", " ('fish', 'bird', 0.35412988),\n", " ('wolf', 'fish', 0.051317338),\n", " ('dog', 'fish', 0.32915178),\n", " ('cat', 'fish', 0.447517),\n", " ('bird', 'fish', 0.35412988),\n", " ('fish', 'fish', 1.0)]" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mylist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using DataFrames" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(mylist)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>0</th><th>1</th><th>2</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>wolf</td><td>wolf</td><td>1.000000</td></tr>\n", "    <tr><th>1</th><td>dog</td><td>wolf</td><td>0.523500</td></tr>\n", "    <tr><th>2</th><td>cat</td><td>wolf</td><td>0.309534</td></tr>\n", "    <tr><th>3</th><td>bird</td><td>wolf</td><td>0.527966</td></tr>\n", "    <tr><th>4</th><td>fish</td><td>wolf</td><td>0.051317</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " 0 1 2\n", "0 wolf wolf 1.000000\n", "1 dog wolf 0.523500\n", "2 cat wolf 0.309534\n", "3 bird wolf 0.527966\n", "4 fish wolf 0.051317" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>2</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>2</th><td>1.0</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " 2\n", "2 1.0" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Correlation\n", "df.corr()" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "df.columns = [\"Token1\",\"Token2\",\"Similarity\"]" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>Token1</th><th>Token2</th><th>Similarity</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>wolf</td><td>wolf</td><td>1.000000</td></tr>\n", "    <tr><th>1</th><td>dog</td><td>wolf</td><td>0.523500</td></tr>\n", "    <tr><th>2</th><td>cat</td><td>wolf</td><td>0.309534</td></tr>\n", "    <tr><th>3</th><td>bird</td><td>wolf</td><td>0.527966</td></tr>\n", "    <tr><th>4</th><td>fish</td><td>wolf</td><td>0.051317</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Token1 Token2 Similarity\n", "0 wolf wolf 1.000000\n", "1 dog wolf 0.523500\n", "2 cat wolf 0.309534\n", "3 bird wolf 0.527966\n", "4 fish wolf 0.051317" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Token1 object\n", "Token2 object\n", "Similarity float64\n", "dtype: object" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Types\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "# Visualization Package with Seaborn\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "# Encoding it\n", "df_viz = df.replace({'wolf':0,'dog':1,'cat':2,'fish':3,'bird':4})" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>Token1</th><th>Token2</th><th>Similarity</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>0</td><td>1.000000</td></tr>\n", "    <tr><th>1</th><td>1</td><td>0</td><td>0.523500</td></tr>\n", "    <tr><th>2</th><td>2</td><td>0</td><td>0.309534</td></tr>\n", "    <tr><th>3</th><td>4</td><td>0</td><td>0.527966</td></tr>\n", "    <tr><th>4</th><td>3</td><td>0</td><td>0.051317</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Token1 Token2 Similarity\n", "0 0 0 1.000000\n", "1 1 0 0.523500\n", "2 2 0 0.309534\n", "3 4 0 0.527966\n", "4 3 0 0.051317" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_viz.head()" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting with Correlation\n", "plt.figure(figsize=(20,10))\n", "sns.heatmap(df_viz.corr(),annot=True)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting without correlation\n", "plt.figure(figsize=(20,10))\n", "sns.heatmap(df_viz,annot=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word Analysis\n", "+ shape of word\n", "+ is_alpha\n", "+ is_stop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Noun Chunks\n", "+ noun + word describing the noun\n", "+ noun phrases\n", "+ adnominal\n", "+ root.text" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "# Noun Phrase or Chunks\n", "doc_phrase1 = nlp(\"The man reading the news is very tall.\")" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The man\n", "the news\n" ] } ], "source": [ "for word in doc_phrase1.noun_chunks:\n", " print(word.text)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "man\n", "news\n" ] } ], "source": [ "# Root Text\n", "# the Main Noun \n", "for word in 
doc_phrase1.noun_chunks:\n", "    print(word.root.text)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "man connector_text to root head : is\n", "news connector_text to root head : reading\n" ] } ], "source": [ "# Text of the root token head\n", "for token in doc_phrase1.noun_chunks:\n", "    print(token.root.text,\"connector_text to root head :\",token.root.head.text)" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "doc_phrase2 = nlp(\"For us the news is a concern.\")" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "us Connector: us Text of Root Tokens Head: For\n", "the news Connector: news Text of Root Tokens Head: is\n", "a concern Connector: concern Text of Root Tokens Head: is\n" ] } ], "source": [ "for word in doc_phrase2.noun_chunks:\n", "    print(word.text,\"Connector:\",word.root.text,\"Text of Root Tokens Head: \",word.root.head.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence Segmentation or Boundary Detection\n", "+ Deciding where sentences begin and end\n", "+ A simple rule-based approach:\n", "+ (a) If it's a period, it ends a sentence.\n", "+ (b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.\n", "+ (c) If the next token is capitalized, then it ends a sentence.\n", "+ Default: spaCy uses the dependency parser\n", "+ Custom rule-based or manual\n", "  - You set the boundaries before the parser runs" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "# Manual or Custom Based\n", "# Start a new sentence right after every '...' token\n", "def mycustom_boundary(docx):\n", "    for token in docx[:-1]:\n", "        if token.text == '...':\n", "            docx[token.i+1].is_sent_start = True\n", "    return docx\n" ] }, { 
"cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "import spacy \n", "nlp = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "# Adding the rule before parsing\n", "nlp.add_pipe(mycustom_boundary,before='parser')" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "mydoc = nlp(u\"This is my first sentence...the last comment was so cuul... what if...? this is the last sentence\")" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is my first sentence...\n", "the last comment was so cuul...\n", "what if...\n", "?\n", "this is the last sentence\n" ] } ], "source": [ "for sentence in mydoc.sents:\n", " print(sentence.text)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "# Manual or Custom Based\n", "def mycustom_boundary2(docx):\n", " for token in docx[:-1]:\n", " if token.text == '---':\n", " docx[token.i+1].is_sent_start = True\n", " return docx" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "nlp2 = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "# Adding the rule before parsing\n", "nlp2.add_pipe(mycustom_boundary2,before='parser')" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "mydoc2 = nlp2(u\"Last year was great---this year 2018-05-22 will be so cuul. when was your birthday? 
---this is the last sentence\")" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last year was great---\n", "this year 2018-05-22 will be so cuul.\n", "when was your birthday?\n", "---this is the last sentence\n" ] } ], "source": [ "for sentence in mydoc2.sents:\n", "    print(sentence.text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Removing the parser\n", "nlp.remove_pipe('parser')" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "mydoc3 = nlp(u\"This is my first sentence...the last comment was so cuul... what if...? this is the last sentence\")" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is my first sentence...the last comment was so cuul...\n", "what if...?\n", "this is the last sentence\n" ] } ], "source": [ "# Normal Sentence Segmenter\n", "for sentence in mydoc3.sents:\n", "    print(sentence.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Custom Rule Based" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "from spacy.lang.en import English\n", "from spacy.pipeline import SentenceSegmenter\n" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "def split_on_newlines(doc):\n", "    start = 0\n", "    seen_newline = False\n", "    for word in doc:\n", "        if seen_newline and not word.is_space:\n", "            yield doc[start:word.i]\n", "            start = word.i\n", "            seen_newline = False\n", "        elif word.text == '\\n':\n", "            seen_newline = True\n", "    if start < len(doc):\n", "        yield doc[start:len(doc)]" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, 
"outputs": [], "source": [ "def split_on_tab(doc):\n", " start = 0\n", " seen_newline = False\n", " for word in doc:\n", " if seen_newline and not word.is_space:\n", " yield doc[start:word.i]\n", " start = word.i\n", " seen_newline = False\n", " elif word.text == '\\t':\n", " seen_newline = True\n", " if start < len(doc):\n", " yield doc[start:len(doc)]" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "nlp = English() # just the language with no model\n", "sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)\n", "nlp.add_pipe(sbd)" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a great sentence\n", "\n", "This is another comment\n", "\n", "And more\n" ] } ], "source": [ "doc = nlp(u\"This is a great sentence\\n\\nThis is another comment\\nAnd more\")\n", "for sent in doc.sents:\n", " print(sent.text)\n", " " ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['This', 'is', 'a', 'great', 'sentence', '\\n\\n', 'This', 'is', 'another', 'comment', '\\n']\n", "['And', 'more']\n" ] } ], "source": [ "doc = nlp(u\"This is a great sentence\\n\\nThis is another comment\\nAnd more\")\n", "for sent in doc.sents:\n", " print([token.text for token in sent])" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "nlp_tab = English() # just the language with no model\n", "sbd_tab = SentenceSegmenter(nlp.vocab, strategy=split_on_tab)\n", "nlp_tab.add_pipe(sbd_tab)" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "## Spliting on Tabs\n", "doc_tab = nlp_tab(u\"This is a great sentence\\t This is another\\t comment\\t And more\")" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": 
[ "This is a great sentence\t This is another\t comment\t And more\n" ] } ], "source": [ "for sent in doc_tab.sents:\n", "    print(sent.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stop Words in spaCy\n", "+ A stop word list (stop list) is a set of very common words\n", "+ These words are filtered out before further processing\n", "+ Usually the most common words in a language\n", "\n", "#### Uses\n", "+ Improve performance in search engines\n", "+ + e.g. in the query \"how to perform sentiment analysis\", \"how\" and \"to\" are stop words\n", "+ Eliminating noise and distraction in sentiment classification\n", "+ + Makes ML training faster because there are fewer features\n", "+ + Can make predictions more accurate by reducing noise" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "import spacy \n", "nlp = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "from spacy.lang.en.stop_words import STOP_WORDS" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'although', 'ours', 'since', 'she', 'last', 'anyone', 'together', 'herself', 'besides', 'was', 'else', 'one', 'toward', 'often', 'unless', 'indeed', 'something', 'itself', 'becomes', 'several', 'front', 'meanwhile', 'whereafter', 'will', 'whereby', 'two', 'before', 'each', 'empty', 'do', 'fifteen', 'otherwise', 'own', 'has', 'anywhere', 'really', 'i', 'an', 'have', 'hundred', 'nine', 'the', 'all', 'others', 'beyond', 'hereby', 'yet', 'bottom', 'perhaps', 'into', 'anyway', 'when', 'both', 'sixty', 'onto', 'whereupon', 'once', 'six', 'where', 'would', 'about', 'by', 'because', 'no', 'except', 'ten', 'yourselves', 'above', 'cannot', 'thereby', 'forty', 'on', 'hers', 'myself', 'name', 'side', 'latter', 'from', 'nowhere', 'then', 'throughout', 'for', 'himself', 'same', 'their', 'again', 'there', 'full', 'done', 'under', 'it', 'thus', 'might', 'already', 'whether', 'call', 'now', 'via', 'could', 'make', 'elsewhere', 'always', 'many', 'seems', 
'they', 'third', 'seemed', 'did', 'get', 'his', 'well', 'does', 'may', 'few', 'whence', 'please', 'say', 'such', 'while', 'whom', 'hereupon', 'beside', 'who', 'anyhow', 'however', 'amongst', 'your', 'move', 'be', 'along', 'ca', 'back', 'none', 'therein', 'within', 'how', 'nothing', 'among', 'than', 'across', 'but', 'neither', 'whither', 'here', 'if', 'nor', 'being', 'below', 'is', 'afterwards', 'seem', 'through', 'whole', 'take', 'nobody', 'sometimes', 'with', 'became', 'he', 'see', 'less', 'over', 'am', 'either', 'various', 'most', 'five', 'somehow', 'twelve', 'becoming', 'whenever', 'beforehand', 'anything', 'become', 'sometime', 'very', 'are', 'just', 'mine', 'its', 'noone', 'ever', 'must', 'part', 'against', 'thereupon', 'though', 'hence', 'another', 'any', 'behind', 'everyone', 'until', 'mostly', 'can', 'namely', 'only', 'or', 'enough', 'someone', 'a', 'us', 'without', 'of', 'whoever', 'seeming', 'more', 'these', 'themselves', 'also', 'due', 'formerly', 'them', 'eleven', 'give', 'keep', 'using', 'off', 'former', 'therefore', 'up', 'first', 'whereas', 'after', 'as', 'every', 'made', 'towards', 'fifty', 'what', 'yourself', 're', 'my', 'our', 'which', 'quite', 'wherever', 'three', 'used', 'had', 'regarding', 'been', 'between', 'everywhere', 'somewhere', 'this', 'too', 'still', 'we', 'whose', 'even', 'hereafter', 'thru', 'almost', 'ourselves', 'and', 'doing', 'four', 'never', 'next', 'during', 'eight', 'whatever', 'rather', 'further', 'much', 'out', 'him', 'least', 'those', 'not', 'some', 'should', 'moreover', 'amount', 'to', 'you', 'twenty', 'down', 'why', 'thence', 'were', 'serious', 'everything', 'wherein', 'around', 'in', 'other', 'thereafter', 'upon', 'nevertheless', 'per', 'alone', 'go', 'show', 'top', 'her', 'herein', 'at', 'that', 'so', 'me', 'latterly', 'put', 'yours'}\n" ] } ], "source": [ "# Print List of Stop words\n", "print(STOP_WORDS)" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "305" 
] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(STOP_WORDS)" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Checking If A Word is a Stopword\n", "nlp.vocab[\"the\"].is_stop" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab[\"theme\"].is_stop" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [], "source": [ "mysentence = nlp(u\"This is a group of word to check for stop words\")" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "is\n", "a\n", "of\n", "to\n", "for\n" ] } ], "source": [ "# Printing the stop words in the sentence\n", "for word in mysentence:\n", "    if word.is_stop:\n", "        print(word)" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This\n", "group\n", "word\n", "check\n", "stop\n", "words\n" ] } ], "source": [ "# Printing the words that are not stop words\n", "for word in mysentence:\n", "    if not word.is_stop:\n", "        print(word)" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[This, group, word, check, stop, words]\n" ] } ], "source": [ "# Filtering out stop words into a list\n", "filteredwords = []\n", "for word in mysentence:\n", "    if not word.is_stop:\n", "        filteredwords.append(word)\n", "print(filteredwords)" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[This, group, word, check, stop, words]" ] }, "execution_count": 
125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[word for word in mysentence if not word.is_stop]" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[This, group, word, check, stop, words]\n" ] } ], "source": [ "# Filtering stop words against the STOP_WORDS set\n", "# STOP_WORDS holds strings, so compare the token text, not the Token itself\n", "filtered = []\n", "for word in mysentence:\n", "    if word.text not in STOP_WORDS:\n", "        filtered.append(word)\n", "print(filtered)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Filtering Stop Words\n", "\n", "# for word in STOP_WORDS:\n", "#     if nlp.vocab[word].is_stop == True:\n", "#         print(word)\n", "\n", "# for mysentence in STOP_WORDS:\n", "#     lexeme = nlp.vocab[mysentence]\n", "#     lexeme.is_stop = True\n", "#     print(lexeme)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Adding Your Own Stop Words" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "# set.add() returns None, so there is nothing useful to assign\n", "STOP_WORDS.add(\"lol\")" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "example2 = nlp(u\"There are a lot of lol in this sentence but what does it mean.\")" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "are\n", "a\n", "of\n", "lol\n", "in\n", "this\n", "but\n", "what\n", "does\n", "it\n" ] } ], "source": [ "for word in example2:\n", "    if word.is_stop:\n", "        print(word)" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# \"lol\" has been added as a stop word\n", "nlp.vocab[\"lol\"].is_stop " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Removing Stop 
Words\n", "STOP_WORDS.remove(\"lol\")\n", "# Note: set.pop() takes no argument and removes an arbitrary element,\n", "# so use remove() or discard() to drop a specific word" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text Similarity With ML" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "# Using ML\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.metrics.pairwise import euclidean_distances\n" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "documents = ['wolf','dog','cat','bird','fish']\n" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer()\n", "features = vectorizer.fit_transform(documents).todense()" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'dog': 2, 'fish': 3, 'wolf': 4, 'cat': 1, 'bird': 0}\n" ] } ], "source": [ "print(vectorizer.vocabulary_)\n" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.]] [[0 0 0 0 1]]\n", "[[1.41421356]] [[0 0 1 0 0]]\n", "[[1.41421356]] [[0 1 0 0 0]]\n", "[[1.41421356]] [[1 0 0 0 0]]\n", "[[1.41421356]] [[0 0 0 1 0]]\n" ] } ], "source": [ "# Distance from the first document ('wolf') to every document\n", "# With one-hot count vectors, all distinct words are equally far apart\n", "for word in features:\n", "    print(euclidean_distances(features[0], word),word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence Similarity\n", "+ To do" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "# Example sentence groups to compare\n", "corpus1 = [\"I like that bachelor\",\"I like that unmarried man\",\"I don't like the married man\"]\n", "corpus2 = [\"Jane is very nice.\", \"Is Jane very nice?\"]\n", "corpus3 = [\"He is a bachelor\",\"He is an unmarried man\"]\n", "corpus4 = [\"She is a wife\",\"She is a wife\"]\n", "corpus5 = [\"He is a 
king\",\"He is a doctor\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rule-based Matching\n", "+ Tokenize\n", "+ Pattern matching" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "from spacy.matcher import Matcher\n", "import spacy\n", "nlp = spacy.load('en')\n" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "patterns = {'HelloWorld': [{'LOWER': 'hello'}, {'LOWER': 'world'}]}\n", "matcher = Matcher(nlp.vocab)\n" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [], "source": [ "matcher = Matcher(nlp.vocab)\n" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "pattern = [{'LOWER': \"hello\"}, {'LOWER': \"world\"}]\n", "# matcher.add(\"HelloWorld\", None, pattern)\n" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [], "source": [ "matcher.add(\"HelloWorld\", None, pattern)\n" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [], "source": [ "doc = nlp(u'hello world this is not it!')\n", "matches = matcher(doc)\n" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(15578876784678163569, 0, 2)]\n" ] } ], "source": [ "print(matches)" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "### Thanks\n", "# By Jesse JCharis\n", "# J-Secur1ty\n", "# Jesus Saves @ JCharisTech" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", 
"name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }