Prepare The Dataset

Yelp Dataset

A trove of reviews, businesses, users, tips, and check-in data!

Data description

  • 5,200,000 user reviews
  • Information on 174,000 businesses
  • The data spans 11 metropolitan areas

We initially started a Spark version of the project using the entire dataset on Microsoft Azure. However, due to limited time and budget, we pivoted to a local version with 50,000 reviews (approximately 100,000 sentences).

Sentiment Analysis

We use the nltk.sentiment.vader module to label our data.

  • When the 'compound' polarity score is greater than zero, we label the observation 'positive'.
  • Otherwise (including a score of exactly zero), we label it 'negative'.
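
The labeling rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names label_from_compound and label_sentences are hypothetical, and the sketch assumes NLTK is installed with the VADER lexicon downloaded (nltk.download("vader_lexicon")).

```python
from typing import Iterable, List, Tuple


def label_from_compound(compound: float) -> str:
    """Apply the rule above: strictly positive -> 'positive', else 'negative'."""
    return "positive" if compound > 0 else "negative"


def label_sentences(sentences: Iterable[str], analyzer=None) -> List[Tuple[str, str]]:
    """Label each sentence from its VADER 'compound' score.

    `analyzer` is any object with a polarity_scores(text) method that returns
    a dict containing a 'compound' key. By default an NLTK
    SentimentIntensityAnalyzer is created.
    """
    if analyzer is None:
        # Imported lazily so the rule itself can be exercised without NLTK.
        # Assumption: the lexicon is available via nltk.download("vader_lexicon").
        from nltk.sentiment.vader import SentimentIntensityAnalyzer

        analyzer = SentimentIntensityAnalyzer()

    labeled = []
    for sentence in sentences:
        compound = analyzer.polarity_scores(sentence)["compound"]
        labeled.append((sentence, label_from_compound(compound)))
    return labeled
```

Calling label_sentences on a list of review sentences returns (sentence, label) pairs. Note that a neutral sentence with a compound score of exactly zero falls into the 'negative' class under this rule.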

Result

Our labeled data is now ready in the database, which contains five tables:

  • counts
  • restaurant_scores
  • review
  • scores
  • user_counts