
Data description
- 5,200,000 user reviews
- Information on 174,000 businesses
- The data spans 11 metropolitan areas
We initially started a Spark version of the project using the entire dataset on Microsoft Azure. However, due to limited time and budget, we pivoted to a local version with 50,000 reviews (approximately 100,000 sentences).
Sentiment Analysis
We use the nltk.sentiment.vader module to label our data.
- When the 'compound' polarity score is greater than zero, we label the observation 'positive'.
- Otherwise, we label it 'negative'.
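
The labeling rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `label_from_compound`, `label_text`, and `build_analyzer` are hypothetical, and the sketch assumes the `vader_lexicon` resource can be downloaded through `nltk.download`.

```python
def label_from_compound(compound: float) -> str:
    """Apply the rule above: strictly positive compound score -> 'positive'."""
    return "positive" if compound > 0 else "negative"


def label_text(text: str, analyzer) -> str:
    """Score `text` with a VADER-style analyzer and return its label.

    `analyzer` is any object exposing polarity_scores(text) -> dict
    with a 'compound' key, such as nltk's SentimentIntensityAnalyzer.
    """
    scores = analyzer.polarity_scores(text)
    return label_from_compound(scores["compound"])


def build_analyzer():
    """Create the nltk VADER analyzer (hypothetical helper).

    Imports are kept local so the labeling functions above can be used
    and tested without nltk installed.
    """
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # VADER needs its lexicon; this downloads it if it is not present.
    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()
```

For example, `label_text("The food was great!", build_analyzer())` returns one of the two labels. Under this rule, a compound score of exactly zero (neutral text) is labeled 'negative'.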
Result
The processed data is stored in the database in five tables:
- counts
- restaurant_scores
- review
- scores
- user_counts