Introduction To Text Mining
Lotfi NAJDI
Introduction
Text is still some of the most valuable data out there for those who know how
to use it.
In this lab we will take on some of the most important tasks in working with text.
Business problem
You're a consultant for DelFalco's Italian Restaurant, and the owner asked
you to identify whether there are any foods on their menu that diners find
disappointing.
The business owner suggested you use diner reviews from the Yelp
website to determine which dishes people liked and disliked.
Business problem
review_id: 109
stars: 4
text: i used to work food service and my manager at the time recommended i try
defalco's. he knows food well so i was excited to try one of his favorites
spots. this place is really, really good. lot of authentic italian choices and
they even have a grocery section with tons of legit italian goodies. i had a
chicken parmigiana sandwich that was to die for. anytime my ex-manager comes
back to town (he left for vegas and i think he misses defalco's more than
anything else in the valley), he is sure to stop by and grab his favorite grub.
parking is a bit tricky during busy hours and the wait times for food can get
a bit long, so i recommend calling your order ahead of time (unless you want
to take a look around while you wait, first-timers).
Business problem
The owner also gave you this list of menu items and common alternate
spellings.
menu_items
cheese steak
cheesesteak
steak and cheese
italian combo
tiramisu
cannoli
chicken salad
chicken spinach salad
meatball
pizza
Your turn 1
Before you get to the analysis, run the code to load the data into
DelFalco_reviews.
Your turn 1
library(readxl)  # read_excel() comes from the readxl package
DelFalco_reviews <- read_excel("data/yelp_reviews.xlsx")
glimpse(DelFalco_reviews)
# Rows: 1,321
# Columns: 4
# $ review_id <dbl> 109, 1013, 1204, 1251, 1354, 1504, 1739,~
# $ stars <dbl> 4, 4, 5, 1, 2, 5, 5, 4, 5, 3, 3, 5, 3, 4~
# $ date <dttm> 2013-01-27, 2015-04-15, 2011-03-20, 201~
# $ text <chr> "i used to work food service and my mana~
Any ideas for solving the problem?
Given the data from Yelp and the list of menu items, do you have any ideas for
how you could find which menu items have disappointed diners?
We can tell which foods are mentioned in reviews with low scores, so the
restaurant can fix the recipe or remove those foods from the menu.
Pattern Matching
Items in one review
As a first step, we will write code to extract the foods mentioned in a single
review:
text
i felt like i was eating in a storage room. soup was good...sandwich nothing
special.bread was like pizza dough.....doughy in the middle and done on the
outside....service was very good....might go back for take out and try the pizza.
if you ever had a pat's philly cheesesteak in philly this place can't compare.
Pattern Matching
Items in one review
Using the str_detect() function from the stringr package, we will check
whether single_review contains, for example, pizza 🍕.
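A minimal sketch of this check, assuming single_review holds the review text
as a plain character string:

library(stringr)
str_detect(single_review, "pizza")
# [1] TRUE   (this review does mention pizza)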
Pattern Matching
Items in one review
Then we will check single_review against the whole menu_items vector, using
the same syntax.
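A sketch of the call that yields the output below (same assumption that
single_review is a plain string): str_detect() is vectorised over its pattern
argument, so passing the menu_items vector returns one logical per menu item.

str_detect(single_review, menu_items)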
# [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Pattern Matching
Items in one review
In order to identify the items mentioned in each review, we can use str_extract():
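A sketch of the call producing the output below (again assuming single_review
is a plain string); str_extract() returns the matched item, or NA where a
menu item does not appear in the review:

str_extract(single_review, menu_items)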
# [1] NA "cheesesteak" NA NA NA
# [6] NA NA NA NA "pizza"
Your turn 2
Create a data set single_review containing just the review at position 10.
Then try str_detect() with the menu_items vector. Could you explain the
result?
Your turn 2
str_detect
single_review <- DelFalco_reviews %>%
  slice(10) %>%
  pull(text)  # pull() returns a plain character vector; select() would keep a tibble

single_review %>% str_detect("pizza")
str_extract
single_review %>%
  str_extract(menu_items)
Pattern Matching
Scale pattern matching to items mentioned in all
reviews
Now let's consider the whole dataset and collect ratings for each menu item.
Each review has a rating (stars).
Pattern Matching
Scale pattern matching to items mentioned in all
reviews
The map functions transform their input by applying a function to each
element and returning a vector the same length as the input.
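A sketch of this step (the names review_items and item are assumptions, and
menu_items is assumed to be a character vector): map() applies str_extract()
to every review, storing each review's matches in a list-column.

review_items <- DelFalco_reviews %>%
  mutate(item = map(text, ~ str_extract(.x, menu_items)))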
Your turn 3
Complete the following chunk in order to extract the list of matching items
for each review (see the map() sketch above).
Your turn 3
The item column is a list-column.
Use the unnest() function from the tidyr package to put each element of
the list on its own row.
Drop rows with NA, then select just the item and stars columns.
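A sketch of these steps, assuming the review_items data with its item
list-column from the previous step:

item_ratings <- review_items %>%
  unnest(item) %>%    # one row per review-item pair
  drop_na(item) %>%   # keep only rows where a menu item was mentioned
  select(item, stars)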
Analysis
Now we are ready for further analysis.
Analysis
The 10 best-rated items
mean_ratings %>%
arrange(-average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 artichoke salad 5 5
# 2 corned beef 5 2
# 3 fettuccini alfredo 5 6
# 4 turkey breast 5 1
# 5 steak and cheese 4.89 9
# 6 reuben 4.75 4
# 7 prosciutto 4.68 50
# 8 purista 4.67 63
# 9 chicken salad 4.6 5
# 10 chicken pesto 4.56 27
Analysis
The 10 worst-rated items
mean_ratings %>%
arrange(average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 chicken cutlet 3.55 11
# 2 spaghetti 3.89 36
# 3 italian beef 3.92 25
# 4 macaroni 4 5
# 5 tuna salad 4 5
# 6 turkey sandwich 4 6
# 7 italian combo 4.05 22
# 8 garlic bread 4.13 39
# 9 roast beef 4.14 7
# 10 eggplant 4.16 69
Your turn 4
Calculate the mean ratings and the number of reviews for each menu
item.
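A sketch that produces mean_ratings as used on the surrounding slides,
assuming the item_ratings data from Your turn 3:

mean_ratings <- item_ratings %>%
  group_by(item) %>%
  summarise(average_rating = mean(stars),  # mean star rating per item
            number_reviews = n())          # how many reviews mention it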
Your turn 4
mean_ratings %>%
arrange(average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 chicken cutlet 3.55 11
# 2 spaghetti 3.89 36
# 3 italian beef 3.92 25
# 4 macaroni 4 5
# 5 tuna salad 4 5
# 6 turkey sandwich 4 6
# 7 italian combo 4.05 22
# 8 garlic bread 4.13 39
# 9 roast beef 4.14 7
# 10 eggplant 4.16 69
Your turn 4
mean_ratings %>%
arrange(-average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 artichoke salad 5 5
# 2 corned beef 5 2
# 3 fettuccini alfredo 5 6
# 4 turkey breast 5 1
# 5 steak and cheese 4.89 9
# 6 reuben 4.75 4
# 7 prosciutto 4.68 50
# 8 purista 4.67 63
# 9 chicken salad 4.6 5
# 10 chicken pesto 4.56 27
Matching patterns with regular
expressions
Regular expressions are a concise and flexible tool for describing patterns
in strings.
They take a little while to get your head around, but once you understand
them, you'll find them extremely useful.
Matching patterns with regular
expressions
Basic matches
The simplest patterns match exact strings:
apple
banana
pear
pineapple
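A sketch of an exact-string match with str_view() (the fruit vector is taken
from the list above; the pattern "an" is an assumption for illustration):

library(stringr)
fruit <- c("apple", "banana", "pear", "pineapple")
str_view(fruit, "an")  # matches the literal characters "an" in banana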
Matching patterns with regular
expressions
The next step up in complexity is ., which matches any character (except a
newline):
apple
banana
pear
pineapple
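A sketch using the same fruit vector (the pattern is an assumption): "."
stands for any single character, so ".a." is an "a" with one character on
each side.

str_view(fruit, ".a.")  # matches in banana, pear, and pineapple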
Matching patterns with regular
expressions
Anchors
By default, regular expressions will match any part of a string. It’s often useful
to anchor the regular expression so that it matches from the start or end of the
string. You can use:
^ to match the start of the string.
$ to match the end of the string.
apple
banana
pear
pineapple
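For example, a sketch anchoring the pattern to the start of the string (the
pattern is an assumption):

str_view(fruit, "^a")  # only apple starts with "a"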
Matching patterns with regular
expressions
Anchors
apple
banana
pear
pineapple
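And a sketch anchoring the pattern to the end of the string (again, the
pattern is an assumption):

str_view(fruit, "a$")  # only banana ends with "a"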
Matching patterns with regular
expressions
Character classes and alternatives
There are a number of special patterns that match more than one character.
[abc] : matches a, b, or c.
Matching patterns with regular
expressions
Character classes and alternatives
str_view_all(fruit, "[pe]")
# highlights every "p" and "e" in apple, pear, and pineapple (banana contains neither)
Matching patterns with regular
expressions
Character classes and alternatives
apple
219 733 8965
329-293-8753
Work: 579-499-7527 ; Home: 543.355.3679
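A sketch of putting a character class to work on the strings above (the
vector name x and the pattern are assumptions): [- .] matches a dash, a
space, or a dot, so one pattern covers all three phone-number separators.

x <- c("apple", "219 733 8965", "329-293-8753",
       "Work: 579-499-7527 ; Home: 543.355.3679")
str_view_all(x, "\\d\\d\\d[- .]\\d\\d\\d[- .]\\d\\d\\d\\d")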
Stringr
The stringr package provides the pattern-matching functions and
regular-expression helpers used in this lab; see the "Introduction to stringr"
vignette on the package website.
Tokenization
"A token is a meaningful unit of text, such as a word, that we are interested
in using for analysis, and tokenization is the process of splitting text into
tokens." (Text Mining with R: A Tidy Approach)
To illustrate this method, we will consider a case study on the TED talks
dataset created by Katherine M. Kinnaird and John Laudun for their paper
"TED Talks as Data".
Tokenization
speaker: Al Gore
text: Thank you so much, Chris. And it's truly a great honor to have the
opportunity to come to this stage twice; I'm extremely grateful. I
have been blown away by this conference, and I want to thank all
of you for the many nice comments about what I had to say the
other night. And I say that sincerely, partly because (Mock sob) I
need that. (Laughter) Put yourselves in my position. (Laughter) I
flew on Air Force Two for eight years. (Laughter) Now I have to take
off my shoes or boots to get on an airplane! (Laughter) (Applause)
I'll tell you one quick story to illustrate what that's been like for me.
(Laughter) It's a true story — every bit of this is true. Soon after
Tipper and I left the — (Mock sob) White House — (Laughter) we
were driving from our home in Nashville to a little farm we have 50
miles east of Nashville. Driving ourselves. (Laughter) I know it
sounds like a little thing to you, but — (Laughter) I looked in the
rear-view mirror and all of a sudden it just hit me. There was no
motorcade back there. (Laughter) You've heard of phantom limb
pain? (Laughter) This was a rented Ford Taurus. (Laughter) It was
dinnertime, and we started looking for a place to eat. We were on
I-40. We got to Exit 238, Lebanon, Tennessee. We got off the exit,
we found a Shoney's restaurant. Low-cost family restaurant chain, ...
Your turn 5
1. Start by loading the two packages below (first tidyverse, and then tidytext)
by replacing the _ with the package names.
2. Complete the chunk to read the TED transcripts from the file ted_talks.rds.
Your turn 5
library(tidyverse)
library(tidytext)
ted_talks <- read_rds("data/ted_talks.rds")
talk_id: the identifier from the TED website for this particular talk
text: the text of this TED talk
speaker: the main or first listed speaker (some TED talks have more than
one speaker)
Tokenization
Let's start with one talk. We will use unnest_tokens() to break the text into
individual tokens and transform it to a tidy data structure.
Its first argument is the input data frame that contains your text (often you
will use the pipe %>% to send this argument to unnest_tokens()); the next two
arguments name the output column to create and the input text column.
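A sketch of tokenizing a single talk (picking the first row with slice(1) is
an assumption):

ted_talks %>%
  slice(1) %>%
  unnest_tokens(word, text)  # one row per token, in a new column `word`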
Your turn 6
Try the unnest_tokens() function in order to tokenize the text column. You
might start with the first row before scaling to the whole dataset.
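A possible solution (a sketch; the name tidy_talks matches the object used on
the following slides):

tidy_talks <- ted_talks %>%
  unnest_tokens(word, text)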
Most common TED talk words
Now that our data is in a tidy format, a whole world of analysis opportunity
has opened up for us.
We can start by computing term frequencies in just one line. What are the
most common words in these TED talks?
word n
the 95
to 75
and 71
of 62
a 59
i 50
in 40
Stop words
Words like "the", "and", and "to" that aren't very interesting for a text analysis are
called stop words. Often the best choice is to remove them. The tidytext
package provides access to stop-word lexicons, with a default list plus other
lexicons and other languages.
get_stopwords()
# # A tibble: 175 x 2
# word lexicon
# <chr> <chr>
# 1 i snowball
# 2 me snowball
# 3 my snowball
# 4 myself snowball
# 5 we snowball
# 6 our snowball
# 7 ours snowball
# 8 ourselves snowball
# 9 you snowball
# 10 your snowball
# # ... with 165 more rows
Stop words
When text data is in a tidy format, stop words can be removed using an
anti_join().
This type of join will "filter" or remove items that are in the right-hand side,
keeping those in the left-hand side.
These are now more interesting words and are starting to show the focus
of TED talks.
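A sketch of the call behind the counts below, using the tidy_talks data from
Your turn 6:

tidy_talks %>%
  anti_join(get_stopwords()) %>%  # drop rows whose `word` is a stop word
  count(word, sort = TRUE)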
word n
laughter 22
going 10
can 9
carbon 9
much 9
Visualize Most common TED talk
words
We can fluently pipe from the code we just wrote straight to ggplot2
functions.
Your turn 7
Use count() to find the most common words.
tidy_talks %>%
count(word, sort = TRUE)
# # A tibble: 730 x 2
# word n
# <chr> <int>
# 1 the 95
# 2 to 75
# 3 and 71
# 4 of 62
# 5 a 59
# 6 i 50
# 7 in 40
# 8 you 39
# 9 it 37
# 10 that 33
# # ... with 720 more rows
Your turn 8
To exclude stop words, you might use anti_join() together with a call to the
get_stopwords() function from the tidytext package.
tidy_talks %>%
_____join(_________) %>%
count(word, sort = TRUE)
Your turn 9
Complete the following chunk in order to:
Visualize the top 15 words by putting n on the x-axis and word on the y-axis.
Your turn 9
tidy_talks %>%
  filter(word != "laughter") %>%
  # remove stop words
  anti_join(get_stopwords()) %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%  # top 15 words, as in the instructions
  mutate(word = reorder(word, n)) %>%
  # put `n` on the x-axis and `word` on the y-axis
  ggplot(aes(n, word)) +
  geom_col() +
  theme_xaringan(background_color = "#FFFFFF")
Compare TED talk vocabularies
Text Mining with R
For further reading, see Text Mining with R: A Tidy Approach by Julia Silge
and David Robinson, available online at https://www.tidytextmining.com,
starting with chapter 1, Introduction.