In this project, I conducted an exploratory data analysis of the Yelp data from the Yelp Dataset Challenge using correlation analysis and visualizations. The research question is whether a restaurant's overall rating correlates with its various attributes. My hypothesis was confirmed: some restaurant attributes are more strongly associated with the overall rating than others, but most attributes show little or no correlation with it (i.e. the correlation coefficient lies between -0.3 and 0.3). This might be due to the large number of null values in the dataset and to unbalanced data. In general, the Yelp ratings appear to be normally distributed, but we need to be cautious when interpreting them because Yelp ratings may suffer from bias due to stereotypes or customers' psychological suggestions.
You will need Python 3.x and Jupyter Notebook installed to reproduce this project by running data512-final-project.ipynb. To install Python 3, see the download page and the beginner's guide. To install Jupyter Notebook, follow the installation instructions.
Additionally, you will need the following packages:
This project uses datasets from the Yelp Dataset Challenge. The datasets are available as both JSON and SQL files (see the documentation). I mainly use the business dataset in JSON format:
- Business.json
The business dataset contains business data including location data, attributes, and categories.
The categories field looks like this:
// an array of strings of business categories
"categories": [
"Mexican",
"Burgers",
"Gastropubs"
],
The categories field carries a wide variety of tags, so it is impossible to get all restaurants in the data by simply searching for “restaurant” in the category. Thus, I need to explore the categories in the business data and figure out which set of categories applies to my research.
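One way to start that exploration is to count which tags appear alongside a seed tag. The sketch below is only an illustration, not the notebook's actual method. It assumes the business file stores one JSON object per line, that a seed tag named "Restaurants" exists, and that the file path shown in the final comment is correct for your copy of the data. The helper names (`iter_businesses`, `category_list`, `co_occurring_categories`) and the sample records are hypothetical.

```python
import json
from collections import Counter


def iter_businesses(lines):
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def category_list(business):
    """Return the business's categories as a clean list of strings.

    Accepts an array of strings (as in the snippet above), a single
    comma-separated string, or a missing/null value.
    """
    raw = business.get("categories") or []
    if isinstance(raw, str):
        raw = raw.split(",")
    return [c.strip() for c in raw if c.strip()]


def co_occurring_categories(businesses, seed="Restaurants"):
    """Count tags that appear on businesses also tagged with `seed`."""
    counts = Counter()
    for business in businesses:
        cats = category_list(business)
        if seed in cats:
            counts.update(c for c in cats if c != seed)
    return counts


# Made-up sample records, used only to show the shape of the input.
sample_lines = [
    '{"name": "A", "categories": ["Mexican", "Restaurants"]}',
    '{"name": "B", "categories": ["Burgers", "Gastropubs", "Restaurants"]}',
    '{"name": "C", "categories": ["Mexican", "Food Trucks"]}',
    '{"name": "D", "categories": null}',
]
counts = co_occurring_categories(iter_businesses(sample_lines))
print(counts.most_common())

# On the real data (path is an assumption):
# with open("business.json", encoding="utf-8") as f:
#     counts = co_occurring_categories(iter_businesses(f))
```

Reviewing the most frequent co-occurring tags by hand gives a candidate list of food-related categories, which can then be used to filter the businesses kept for the analysis.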
Yelp provides a guideline regarding the dataset license.
This project is licensed under the MIT License. See the LICENSE.md file for details.