Commit

Merged CODAIT master into branch

PokkeFe committed Jul 2, 2021
2 parents 2bec5ef + e414fcd commit f8f175d
Showing 9 changed files with 2,965 additions and 1,885 deletions.
4 changes: 2 additions & 2 deletions notebooks/Analyze_Text.ipynb
@@ -15,7 +15,7 @@
"source": [
"# Introduction\n",
"\n",
"This notebook shows how the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) lets you use use [Pandas](https://pandas.pydata.org/) DataFrames and the [Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding) service to analyze natural language text. \n",
"This notebook shows how the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) lets you use [Pandas](https://pandas.pydata.org/) DataFrames and the [Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding) service to analyze natural language text. \n",
"\n",
"We start out with an excerpt from the [plot synopsis from the Wikipedia page\n",
"for *Monty Python and the Holy Grail*](https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail#Plot). \n",
@@ -1220,7 +1220,7 @@
"source": [
"That's it. With the DataFrame version of this data, we can perform our example task with **one line of code**.\n",
"\n",
"Specifically, we use a Pandas selection condition to filter out the tokens that aren't pronouns, and then then we \n",
"Specifically, we use a Pandas selection condition to filter out the tokens that aren't pronouns, and then we \n",
"project down to the columns containing sentence and token spans. The result is another DataFrame that \n",
"we can display directly in our Jupyter notebook."
]
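For readers skimming this diff, a minimal sketch of the kind of one-line filter the cell above refers to might look like the following. It assumes a `tokens` DataFrame with `part_of_speech`, `sentence`, and `span` columns, as produced earlier in the notebook; the variable and column names here are assumptions, not the notebook's verbatim code.

    # Keep only the pronoun tokens, then project down to the sentence and token spans.
    pronoun_tokens = tokens[tokens["part_of_speech"] == "PRON"][["sentence", "span"]]
    pronoun_tokens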
15 changes: 7 additions & 8 deletions text_extensions_for_pandas/resources/span_array.js
@@ -1,11 +1,11 @@
// Increment the version to invalidate the cached script
const VERSION = 0.76
const VERSION = 0.75
const global_stylesheet = document.head.querySelector("style.span-array-css")
const local_stylesheet = document.currentScript.parentElement.querySelector("style.span-array-css")

if(window.SpanArray == undefined || window.SpanArray.VERSION == undefined || window.SpanArray.VERSION < VERSION) {

// Replace global SpanArray CSS with latest copy
const global_stylesheet = document.head.querySelector("style.span-array-css")
const local_stylesheet = document.currentScript.parentElement.querySelector("style.span-array-css")
if(local_stylesheet != undefined) {
if(global_stylesheet != undefined) {
document.head.removeChild(global_stylesheet)
@@ -385,17 +385,18 @@ if(window.SpanArray == undefined || window.SpanArray.VERSION == undefined || win
if(closest_tr == undefined) return

const matching_span = doc_object.lookup_table[closest_tr.getAttribute("data-id")]
if(matching_span == undefined) return

switch(closest_control_button.getAttribute("data-control")) {
case "visibility":
{
if(matching_span != undefined) matching_span.visible = !matching_span.visible
matching_span.visible = !matching_span.visible
source_spanarray.render()
}
break;
case "highlight":
{
if(matching_span != undefined) matching_span.highlighted = !matching_span.highlighted
matching_span.highlighted = !matching_span.highlighted
source_spanarray.render()
}
break;
@@ -437,9 +438,7 @@ if(window.SpanArray == undefined || window.SpanArray.VERSION == undefined || win
} else {
// SpanArray JS is already defined and not an outdated copy
// Replace global SpanArray CSS with latest copy IFF global stylesheet is undefined

const global_stylesheet = document.head.querySelector("style.span-array-css")
const local_stylesheet = document.currentScript.parentElement.querySelector("style.span-array-css")

if(local_stylesheet != undefined) {
if(global_stylesheet == undefined) {
document.head.appendChild(local_stylesheet)
69 changes: 69 additions & 0 deletions tutorials/corpus/CoNLL_2.ipynb
@@ -42,6 +42,8 @@
" \"from the directory containing this notebook, or use a Python \"\n",
" \"environment on which you have used `pip` to install the package.\")\n",
"\n",
"PROJECT_ROOT = \"../..\" \n",
" \n",
"# Code shared among notebooks is kept in util.py, in this directory.\n",
"import util"
]
@@ -3523,6 +3525,73 @@
"difficult_precision.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fold</th>\n",
" <th>doc_offset</th>\n",
" <th>span</th>\n",
" <th>ent_type</th>\n",
" <th>gold</th>\n",
" <th>num_teams</th>\n",
" <th>context</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>test</td>\n",
" <td>155</td>\n",
" <td>[776, 783): 'Antwerp'</td>\n",
" <td>LOC</td>\n",
" <td>False</td>\n",
" <td>16</td>\n",
" <td>...kish man smuggled heroin from Turkey to [Antwerp] from where it was taken to Spain, Franc...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fold doc_offset span ent_type gold num_teams \\\n",
"16 test 155 [776, 783): 'Antwerp' LOC False 16 \n",
"\n",
" context \n",
"16 ...kish man smuggled heroin from Turkey to [Antwerp] from where it was taken to Spain, Franc... "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"difficult_precision.loc[[16]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
2 changes: 1 addition & 1 deletion tutorials/market/Market_Intelligence_Part1.ipynb
@@ -1309,7 +1309,7 @@
" entity_mentions_df = dataframes[\"entity_mentions\"]\n",
" semantic_roles_df = dataframes[\"semantic_roles\"]\n",
" \n",
" # Extract mentions of person names and company names\n",
" # Extract mentions of person names\n",
" person_mentions_df = entity_mentions_df[entity_mentions_df[\"type\"] == \"Person\"]\n",
" \n",
" # Extract instances of subjects that made statements\n",
14 changes: 8 additions & 6 deletions tutorials/market/Market_Intelligence_Part2.ipynb
@@ -250,7 +250,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This graph is always a tree, so we refer to it as the *dependency-based parse tree* of the sentence. We often shorten the phrase \"dependency-based parse tree\" to **dependency parse** or **parse tree**.\n",
"This graph is always a tree, so we call it the *dependency-based parse tree* of the sentence. We often shorten the phrase \"dependency-based parse tree\" to **dependency parse** or **parse tree**.\n",
"\n",
"Every word in the sentence (including the period at the end) becomes a node of the parse tree:\n",
"![Parse tree for the sentence \"I like natural language processing\". Each word of the sentence becomes a node of the tree.](images/parse_tree_nodes.png)\n",
@@ -263,7 +263,7 @@
"\n",
"Each edge is tagged with information about why the words are related. For example, the first two words in the sentence, \"I\" and \"like\", have an `nsubj` relationship. The pronoun \"I\" is the subject for the verb \"like\".\n",
"\n",
"Dependency parsing is useful because it lets you solve problems with very little code. The parser acts as a sort of universal machine learning model. The output of the parser is much easier to filter and manipulate with code, compared with the original text."
"Dependency parsing is useful because it lets you solve problems with very little code. The parser acts as a universal machine learning model, extracting many facts at once from the text. Pattern matching over the parse tree lets you filter this set of facts down to the ones that are relevant to your application."
]
},
{
@@ -923,7 +923,7 @@
"\n",
"The edge types in this parse tree come from the [Universal Dependencies](https://universaldependencies.org/) framework. The edge between the name and job title has the type `appos`. `appos` is short for \"[appositional modifier](https://universaldependencies.org/docs/en/dep/appos.html)\", or [appositive](https://owl.purdue.edu/owl/general_writing/grammar/appositives.html). An appositive is a noun that describes another noun. In this case, the noun phrase \"general manager, Data and AI, IBM\" describes the noun phrase \"Daniel Hernandez\".\n",
"\n",
"This pattern happens whenever a person's job title is an appositive for that person's name. The title will be below the name in the tree, and the head nodes of the name and title will be connected by an `appos` edge. We can use this pattern to find the job title via a three-step process:\n",
"The pattern in the picture above happens whenever a person's job title is an appositive for that person's name. The title will be below the name in the tree, and the head nodes of the name and title will be connected by an `appos` edge. We can use this pattern to find the job title via a three-step process:\n",
"\n",
"1. Look for an `appos` edge coming out of any of the parse tree nodes for the name.\n",
"2. The node at the other end of this edge should be the head node of the job title.\n",
@@ -1213,7 +1213,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This `tokens` DataFrame contains one row for every *token* in the document. The term \"token\" here refers to a part of the document that is a word, an abbreviation, or a piece of punctuation. The columns \"id\", \"dep\" and \"head\" encode the edges of the parse tree. Since we're going to be analyzing the parse tree, it's more convenient to have the nodes and edges in separate DataFrames. So let's split `tokens` into DataFrames of nodes and edges:"
"This `tokens` DataFrame contains one row for every *token* in the document. The term \"token\" here refers to a part of the document that is a word, an abbreviation, or a piece of punctuation. The columns \"id\", \"dep\" and \"head\" encode the edges of the parse tree.\n",
"\n",
"Since we're going to be analyzing the parse tree, it's more convenient to have the nodes and edges in separate DataFrames. So let's split `tokens` into DataFrames of nodes and edges:"
]
},
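A minimal sketch of that split, using the `id`, `dep`, and `head` columns the cell above describes (the hidden code cells that follow may differ in detail):

    # Nodes of the parse tree: one row per token, keyed by "id".
    nodes = tokens.drop(columns=["dep", "head"])
    # Edges of the parse tree: each token ("id") points at its head token ("head"),
    # labelled with the dependency type ("dep").
    edges = tokens[["id", "head", "dep"]]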
{
@@ -1687,7 +1689,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Each element of the `span` column of `appos_targets` holds the head node of a person's title. To find the remaining nodes of the titles, we'll do the transitive closure operation we described earlier. We use a Pandas DataFrame to store our set of selected nodes. We use the `traverse_edges_once` function to perform each step of walking the tree. We use `Pandas.concat()` to add the new nodes to our selected set of nodes. Then we use `DataFrame.drop_duplicates()` to remove duplicates from the set. The entire algorithm looks like this:"
"Each element of the \"span\" column of `appos_targets` holds the head node of a person's title. To find the remaining nodes of the titles, we'll do the transitive closure operation we described earlier. We use a Pandas DataFrame to store our set of selected nodes. We use the `traverse_edges_once` function to perform each step of walking the tree. Then we use `Pandas.concat()` and `DataFrame.drop_duplicates()` to add the new nodes to our selected set of nodes. The entire algorithm looks like this:"
]
},
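A minimal sketch of that transitive-closure loop, with pandas imported as `pd` and assuming `traverse_edges_once(selected, edges)` returns the nodes reachable in one step from the current selection (the function name comes from the text above; its exact signature is an assumption):

    # Start from the head nodes of the titles and keep adding newly reached nodes
    # until the selected set stops growing.
    selected = appos_targets
    while True:
        newly_reached = traverse_edges_once(selected, edges)
        combined = pd.concat([selected, newly_reached]).drop_duplicates()
        if len(combined) == len(selected):
            break  # no new nodes were added; the closure is complete
        selected = combined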
{
@@ -2182,7 +2184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we combine this `find_titles_of_persons()` function with the `find_persons_quoted_by_name()` function we created in our previous post, we can build a data mining pipeline. This pipeline find the names and titles of executives in corporate press releases. Here's the output that we get if we pass a year's worth of IBM press releases through this pipeline:"
"If we combine this `find_titles_of_persons()` function with the `find_persons_quoted_by_name()` function we created in our previous post, we can build a data mining pipeline. This pipeline finds the names and titles of executives in corporate press releases. Here's the output that we get if we pass a year's worth of IBM press releases through this pipeline:"
]
},
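A heavily hedged sketch of how the two functions might be chained into that pipeline; the argument passed to each function and the join key are assumptions for illustration only:

    # Illustrative only: run both extractors over one document's analysis results,
    # then join each quoted person to that person's title.
    persons_df = find_persons_quoted_by_name(doc_dataframes)
    titles_df = find_titles_of_persons(doc_dataframes)
    names_and_titles = persons_df.merge(titles_df, on="person", how="left")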
{