Commit

Merged CODAIT master into branch

PokkeFe committed Jul 2, 2021
2 parents 2bec5ef + e414fcd commit f8f175d
Showing 9 changed files with 2,965 additions and 1,885 deletions.
4 changes: 2 additions & 2 deletions notebooks/Analyze_Text.ipynb
@@ -15,7 +15,7 @@
"source": [
"# Introduction\n",
"\n",
"This notebook shows how the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) lets you use use [Pandas](https://pandas.pydata.org/) DataFrames and the [Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding) service to analyze natural language text. \n",
"This notebook shows how the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) lets you use [Pandas](https://pandas.pydata.org/) DataFrames and the [Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding) service to analyze natural language text. \n",
"\n",
"We start out with an excerpt from the [plot synopsis from the Wikipedia page\n",
"for *Monty Python and the Holy Grail*](https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail#Plot). \n",
@@ -1220,7 +1220,7 @@
"source": [
"That's it. With the DataFrame version of this data, we can perform our example task with **one line of code**.\n",
"\n",
"Specifically, we use a Pandas selection condition to filter out the tokens that aren't pronouns, and then then we \n",
"Specifically, we use a Pandas selection condition to filter out the tokens that aren't pronouns, and then we \n",
"project down to the columns containing sentence and token spans. The result is another DataFrame that \n",
"we can display directly in our Jupyter notebook."
]
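For readers skimming this diff, a minimal sketch of the kind of one-line filter the cell above refers to might look like the following. It assumes a `tokens` DataFrame with `part_of_speech`, `sentence`, and `span` columns, as produced earlier in the notebook; the variable and column names here are assumptions, not the notebook's verbatim code.

    # Keep only the pronoun tokens, then project down to the sentence and token spans.
    pronoun_tokens = tokens[tokens["part_of_speech"] == "PRON"][["sentence", "span"]]
    pronoun_tokens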
15 changes: 7 additions & 8 deletions text_extensions_for_pandas/resources/span_array.js
@@ -1,11 +1,11 @@
// Increment the version to invalidate the cached script
const VERSION = 0.76
const VERSION = 0.75
const global_stylesheet = document.head.querySelector("style.span-array-css")
const local_stylesheet = document.currentScript.parentElement.querySelector("style.span-array-css")

if(window.SpanArray == undefined || window.SpanArray.VERSION == undefined || window.SpanArray.VERSION < VERSION) {

// Replace global SpanArray CSS with latest copy
const global_stylesheet = document.head.querySelector("style.span-array-css")
const local_stylesheet = document.currentScript.parentElement.querySelector("style.span-array-css")
if(local_stylesheet != undefined) {
if(global_stylesheet != undefined) {
document.head.removeChild(global_stylesheet)
@@ -385,17 +385,18 @@ if(window.SpanArray == undefined || window.SpanArray.VERSION == undefined || win
if(closest_tr == undefined) return

const matching_span = doc_object.lookup_table[closest_tr.getAttribute("data-id")]
if(matching_span == undefined) return

switch(closest_control_button.getAttribute("data-control")) {
case "visibility":
{
if(matching_span != undefined) matching_span.visible = !matching_span.visible
matching_span.visible = !matching_span.visible
source_spanarray.render()
}
break;
case "highlight":
{
if(matching_span != undefined) matching_span.highlighted = !matching_span.highlighted
matching_span.highlighted = !matching_span.highlighted
source_spanarray.render()
}
break;
@@ -437,9 +438,7 @@ if(window.SpanArray == undefined || window.SpanArray.VERSION == undefined || win
} else {
// SpanArray JS is already defined and not an outdated copy
// Replace global SpanArray CSS with latest copy IFF global stylesheet is undefined

const global_stylesheet = document.head.querySelector("style.span-array-css")
const local_stylesheet = document.currentScript.parentElement.querySelector("style.span-array-css")

if(local_stylesheet != undefined) {
if(global_stylesheet == undefined) {
document.head.appendChild(local_stylesheet)
69 changes: 69 additions & 0 deletions tutorials/corpus/CoNLL_2.ipynb
@@ -42,6 +42,8 @@
" \"from the directory containing this notebook, or use a Python \"\n",
" \"environment on which you have used `pip` to install the package.\")\n",
"\n",
"PROJECT_ROOT = \"../..\" \n",
" \n",
"# Code shared among notebooks is kept in util.py, in this directory.\n",
"import util"
]
@@ -3523,6 +3525,73 @@
"difficult_precision.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fold</th>\n",
" <th>doc_offset</th>\n",
" <th>span</th>\n",
" <th>ent_type</th>\n",
" <th>gold</th>\n",
" <th>num_teams</th>\n",
" <th>context</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>test</td>\n",
" <td>155</td>\n",
" <td>[776, 783): 'Antwerp'</td>\n",
" <td>LOC</td>\n",
" <td>False</td>\n",
" <td>16</td>\n",
" <td>...kish man smuggled heroin from Turkey to [Antwerp] from where it was taken to Spain, Franc...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fold doc_offset span ent_type gold num_teams \\\n",
"16 test 155 [776, 783): 'Antwerp' LOC False 16 \n",
"\n",
" context \n",
"16 ...kish man smuggled heroin from Turkey to [Antwerp] from where it was taken to Spain, Franc... "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"difficult_precision.loc[[16]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
2 changes: 1 addition & 1 deletion tutorials/market/Market_Intelligence_Part1.ipynb
@@ -1309,7 +1309,7 @@
" entity_mentions_df = dataframes[\"entity_mentions\"]\n",
" semantic_roles_df = dataframes[\"semantic_roles\"]\n",
" \n",
" # Extract mentions of person names and company names\n",
" # Extract mentions of person names\n",
" person_mentions_df = entity_mentions_df[entity_mentions_df[\"type\"] == \"Person\"]\n",
" \n",
" # Extract instances of subjects that made statements\n",
14 changes: 8 additions & 6 deletions tutorials/market/Market_Intelligence_Part2.ipynb
@@ -250,7 +250,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This graph is always a tree, so we refer to it as the *dependency-based parse tree* of the sentence. We often shorten the phrase \"dependency-based parse tree\" to **dependency parse** or **parse tree**.\n",
"This graph is always a tree, so we call it the *dependency-based parse tree* of the sentence. We often shorten the phrase \"dependency-based parse tree\" to **dependency parse** or **parse tree**.\n",
"\n",
"Every word in the sentence (including the period at the end) becomes a node of the parse tree:\n",
"![Parse tree for the sentence \"I like natural language processing\". Each word of the sentence becomes a node of the tree.](images/parse_tree_nodes.png)\n",
@@ -263,7 +263,7 @@
"\n",
"Each edge is tagged with information about why the words are related. For example, the first two words in the sentence, \"I\" and \"like\", have an `nsubj` relationship. The pronoun \"I\" is the subject for the verb \"like\".\n",
"\n",
"Dependency parsing is useful because it lets you solve problems with very little code. The parser acts as a sort of universal machine learning model. The output of the parser is much easier to filter and manipulate with code, compared with the original text."
"Dependency parsing is useful because it lets you solve problems with very little code. The parser acts as a universal machine learning model, extracting many facts at once from the text. Pattern matching over the parse tree lets you filter this set of facts down to the ones that are relevant to your application."
]
},
{
@@ -923,7 +923,7 @@
"\n",
"The edge types in this parse tree come from the [Universal Dependencies](https://universaldependencies.org/) framework. The edge between the name and job title has the type `appos`. `appos` is short for \"[appositional modifier](https://universaldependencies.org/docs/en/dep/appos.html)\", or [appositive](https://owl.purdue.edu/owl/general_writing/grammar/appositives.html). An appositive is a noun that describes another noun. In this case, the noun phrase \"general manager, Data and AI, IBM\" describes the noun phrase \"Daniel Hernandez\".\n",
"\n",
"This pattern happens whenever a person's job title is an appositive for that person's name. The title will be below the name in the tree, and the head nodes of the name and title will be connected by an `appos` edge. We can use this pattern to find the job title via a three-step process:\n",
"The pattern in the picture above happens whenever a person's job title is an appositive for that person's name. The title will be below the name in the tree, and the head nodes of the name and title will be connected by an `appos` edge. We can use this pattern to find the job title via a three-step process:\n",
"\n",
"1. Look for an `appos` edge coming out of any of the parse tree nodes for the name.\n",
"2. The node at the other end of this edge should be the head node of the job title.\n",
@@ -1213,7 +1213,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This `tokens` DataFrame contains one row for every *token* in the document. The term \"token\" here refers to a part of the document that is a word, an abbreviation, or a piece of punctuation. The columns \"id\", \"dep\" and \"head\" encode the edges of the parse tree. Since we're going to be analyzing the parse tree, it's more convenient to have the nodes and edges in separate DataFrames. So let's split `tokens` into DataFrames of nodes and edges:"
"This `tokens` DataFrame contains one row for every *token* in the document. The term \"token\" here refers to a part of the document that is a word, an abbreviation, or a piece of punctuation. The columns \"id\", \"dep\" and \"head\" encode the edges of the parse tree.\n",
"\n",
"Since we're going to be analyzing the parse tree, it's more convenient to have the nodes and edges in separate DataFrames. So let's split `tokens` into DataFrames of nodes and edges:"
]
},
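A minimal sketch of that split, using the `id`, `dep`, and `head` columns the cell above describes (the hidden code cells that follow may differ in detail):

    # Nodes of the parse tree: one row per token, keyed by "id".
    nodes = tokens.drop(columns=["dep", "head"])
    # Edges of the parse tree: each token ("id") points at its head token ("head"),
    # labelled with the dependency type ("dep").
    edges = tokens[["id", "head", "dep"]]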
{
@@ -1687,7 +1689,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Each element of the `span` column of `appos_targets` holds the head node of a person's title. To find the remaining nodes of the titles, we'll do the transitive closure operation we described earlier. We use a Pandas DataFrame to store our set of selected nodes. We use the `traverse_edges_once` function to perform each step of walking the tree. We use `Pandas.concat()` to add the new nodes to our selected set of nodes. Then we use `DataFrame.drop_duplicates()` to remove duplicates from the set. The entire algorithm looks like this:"
"Each element of the \"span\" column of `appos_targets` holds the head node of a person's title. To find the remaining nodes of the titles, we'll do the transitive closure operation we described earlier. We use a Pandas DataFrame to store our set of selected nodes. We use the `traverse_edges_once` function to perform each step of walking the tree. Then we use `Pandas.concat()` and `DataFrame.drop_duplicates()` to add the new nodes to our selected set of nodes. The entire algorithm looks like this:"
]
},
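A minimal sketch of that transitive-closure loop, with pandas imported as `pd` and assuming `traverse_edges_once(selected, edges)` returns the nodes reachable in one step from the current selection (the function name comes from the text above; its exact signature is an assumption):

    # Start from the head nodes of the titles and keep adding newly reached nodes
    # until the selected set stops growing.
    selected = appos_targets
    while True:
        newly_reached = traverse_edges_once(selected, edges)
        combined = pd.concat([selected, newly_reached]).drop_duplicates()
        if len(combined) == len(selected):
            break  # no new nodes were added; the closure is complete
        selected = combined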
{
@@ -2182,7 +2184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we combine this `find_titles_of_persons()` function with the `find_persons_quoted_by_name()` function we created in our previous post, we can build a data mining pipeline. This pipeline find the names and titles of executives in corporate press releases. Here's the output that we get if we pass a year's worth of IBM press releases through this pipeline:"
"If we combine this `find_titles_of_persons()` function with the `find_persons_quoted_by_name()` function we created in our previous post, we can build a data mining pipeline. This pipeline finds the names and titles of executives in corporate press releases. Here's the output that we get if we pass a year's worth of IBM press releases through this pipeline:"
]
},
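A heavily hedged sketch of how the two functions might be chained into that pipeline; the argument passed to each function and the join key are assumptions for illustration only:

    # Illustrative only: run both extractors over one document's analysis results,
    # then join each quoted person to that person's title.
    persons_df = find_persons_quoted_by_name(doc_dataframes)
    titles_df = find_titles_of_persons(doc_dataframes)
    names_and_titles = persons_df.merge(titles_df, on="person", how="left")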
{