Database

Data description

5,200,000 user reviews
Information on 174,000 businesses
The data spans 11 metropolitan areas

We initially started a Spark version of the project using the entire dataset on Microsoft Azure. However, due to limited time and budget, we pivoted to a local version with 50,000 reviews (approximately 100,000 sentences).

Sentiment Analysis

We use the nltk.sentiment.vader module to label our data.

When VADER's 'compound' polarity score is greater than zero, we label the observation 'positive'.
Otherwise, we label it 'negative'.
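
The labeling rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names label_from_compound, label_sentence, and make_vader_analyzer are hypothetical, and the sketch assumes NLTK is installed and the 'vader_lexicon' resource can be downloaded.

```python
def label_from_compound(compound: float) -> str:
    """Apply the rule: a strictly positive compound score is 'positive', anything else 'negative'."""
    return "positive" if compound > 0 else "negative"


def label_sentence(sentence: str, analyzer) -> str:
    """Label one sentence with any analyzer that exposes VADER's polarity_scores()."""
    # polarity_scores returns a dict with the keys 'neg', 'neu', 'pos' and 'compound'.
    scores = analyzer.polarity_scores(sentence)
    return label_from_compound(scores["compound"])


def make_vader_analyzer():
    """Build NLTK's VADER analyzer (assumes nltk and the 'vader_lexicon' resource are available)."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()
```

A call such as label_sentence("The food was great!", make_vader_analyzer()) returns one of the two labels. Note that under this rule a sentence with a compound score of exactly zero is labeled 'negative'.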

Result

The prepared data is stored in the database in five tables:

counts
restaurant_scores
review
scores
user_counts
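
As a quick sanity check, one could confirm that all five tables are present. The sketch below assumes a SQLite database, which the text does not specify; the helper missing_tables is hypothetical, and only the table names come from the list above.

```python
import sqlite3

# Table names taken from the list above.
EXPECTED_TABLES = {"counts", "restaurant_scores", "review", "scores", "user_counts"}


def missing_tables(conn: sqlite3.Connection) -> set:
    """Return the expected table names that are not present in the database."""
    # sqlite_master is SQLite's built-in catalog of schema objects.
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    present = {name for (name,) in rows}
    return EXPECTED_TABLES - present
```

An empty result means every expected table exists; any returned names point to tables that still need to be created.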
