A General View

Presented by:
Huayi Zhang
Jiaming Di
Weiqing Li
Yingnan Han

We built a restaurant recommendation system for the Yelp dataset, based on each user's preferences across different aspects of a restaurant.

First, we used Hierarchical Latent Tree Analysis (HLTA) to cluster the words into several topics. Then we labeled each review sentence by sentence. If a word in a sentence belongs to a topic A, we add 1 to this review's entry for A in the count table, and we add the sentence's sentiment score to this review's entry for A in the score table.
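The labeling step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the topic word sets, the toy sentiment lexicon, and the function names are all hypothetical stand-ins (in the project the topics come from HLTA, and the sentiment scorer is not specified here). It also assumes each sentence adds at most 1 to a topic's count.

```python
import re
from collections import defaultdict

# Hypothetical topic -> word sets; in the project these come from HLTA.
TOPICS = {
    "food": {"pizza", "burger", "taste"},
    "service": {"waiter", "staff", "service"},
}

# Toy sentiment lexicon standing in for a real sentiment scorer (assumption).
LEXICON = {"great": 1.0, "good": 0.5, "slow": -0.5, "rude": -1.0}


def sentence_sentiment(words):
    """Sum the lexicon values of the words in one sentence."""
    return sum(LEXICON.get(w, 0.0) for w in words)


def label_review(text):
    """Return (counts, scores) per topic for a single review."""
    counts = defaultdict(int)
    scores = defaultdict(float)
    # Naive sentence split on ., ! and ? (a real tokenizer would do better).
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        sentiment = sentence_sentiment(words)
        for topic, vocab in TOPICS.items():
            # The sentence counts once for a topic if it contains any topic word.
            if vocab.intersection(words):
                counts[topic] += 1
                scores[topic] += sentiment
    return dict(counts), dict(scores)


counts, scores = label_review("The pizza was great. The waiter was rude and slow!")
```

Here the first sentence contributes to the hypothetical "food" topic and the second to "service", each with its own sentence-level sentiment.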

Second, we generated two tables: one with every restaurant's mean score in each topic, and one with every user's total count in each topic.
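As a rough sketch of how these two tables could be derived from per-review labels: the sample rows and variable names below are hypothetical, and the mean is assumed to be taken over the reviews that mention the topic, which may differ from the project's exact definition.

```python
from collections import defaultdict

# Hypothetical labeled reviews: (user_id, business_id, topic counts, topic scores).
labelled = [
    ("u1", "b1", {"food": 2}, {"food": 1.0}),
    ("u2", "b1", {"food": 1, "service": 3}, {"food": 0.0, "service": 1.5}),
    ("u1", "b2", {"service": 1}, {"service": -0.5}),
]

user_counts = defaultdict(lambda: defaultdict(int))    # user -> topic -> total count
score_sums = defaultdict(lambda: defaultdict(float))   # restaurant -> topic -> score sum
score_n = defaultdict(lambda: defaultdict(int))        # restaurant -> topic -> review count

for user_id, business_id, counts, scores in labelled:
    for topic, n in counts.items():
        user_counts[user_id][topic] += n
    for topic, s in scores.items():
        score_sums[business_id][topic] += s
        score_n[business_id][topic] += 1

# Assumption: mean over the reviews of a restaurant that mention the topic.
restaurant_means = {
    biz: {t: total / score_n[biz][t] for t, total in topics.items()}
    for biz, topics in score_sums.items()
}
```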

Thus, given a user_id, we can calculate a score for every restaurant for that user. The score is a linear combination of the restaurant's scores in all topics, where the weight of each topic is the user's log-scaled count for it. We return the top k restaurants as recommendations.
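The scoring rule can be illustrated with a small sketch. The table contents and the recommend function are hypothetical, and log(1 + count) is assumed as the log scaling, since the exact form is not stated above.

```python
import math

# Hypothetical tables: mean topic scores per restaurant, topic counts per user.
restaurant_scores = {
    "b1": {"food": 0.8, "service": -0.2},
    "b2": {"food": 0.1, "service": 0.9},
    "b3": {"food": 0.5, "service": 0.5},
}
user_counts = {"u1": {"food": 9, "service": 0}}


def recommend(user_id, k):
    """Rank restaurants by the weighted sum of their topic scores."""
    # Assumption: log(1 + count) as the log scaling, so a zero count gives weight 0.
    weights = {t: math.log1p(c) for t, c in user_counts[user_id].items()}
    totals = {
        biz: sum(weights.get(t, 0.0) * s for t, s in topic_scores.items())
        for biz, topic_scores in restaurant_scores.items()
    }
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:k]


top = recommend("u1", k=2)
```

Because the hypothetical user u1 only ever wrote about food, the ranking is driven entirely by each restaurant's food score.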

Prepare the Dataset

Originally obtained from the Yelp Dataset Challenge

The original dataset is 6.53 GB in total. We mainly used its review.json file (4.2 GB). For the local implementation, we used the first 50,000 reviews. After generating all the tables, the database is 137.8 MB.
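One way to take only the first 50,000 reviews without loading the whole file is sketched below. It assumes review.json stores one JSON object per line, as the Yelp dataset does; the load_reviews helper is a hypothetical name, not part of the project.

```python
import json
from itertools import islice


def load_reviews(path, limit=50_000):
    """Read at most `limit` lines of a JSON-lines file and parse each as a review."""
    with open(path, encoding="utf-8") as handle:
        # islice stops after `limit` lines, so the rest of the file is never read.
        return [json.loads(line) for line in islice(handle, limit) if line.strip()]
```

A call such as load_reviews("review.json") would then return a list of dictionaries, one per review.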

Cluster all the words

Use HLTA to get word clusters

We ran HLTA on 50,000 reviews to generate the word cluster model. The words are allocated into 18 clusters.

Build a recommendation system

Build a recommendation system based on users' preferences across 18 topics.

The system resembles a content-based system; however, we built the user profiles and restaurant profiles from the 18 topics we obtained.

Evaluate the model

Use NDCG to compare the ranking outcomes of our system and collaborative filtering.

Our system typically reaches an NDCG score of 0.95.
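For reference, a minimal sketch of the standard NDCG computation is given below. It uses the common log2 discount and takes a list of relevance values in ranked order; how relevance was defined in this project is not specified here, so the inputs are purely illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1), with ranks starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that already lists items in decreasing relevance scores 1.0, and any other ordering of the same items scores lower.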

Programming Language

We used Python for the local version, with the nltk, pandas, and numpy packages for computation.

Data storage

Locally, we use SQLite3 for data storage. We also use Azure Databricks for big data computation.
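A small sketch of how a topic score table might be stored and queried with Python's built-in sqlite3 module follows. The table name, column names, and sample rows are hypothetical and do not reflect the project's actual schema.

```python
import sqlite3

# In-memory database for illustration; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurant_scores ("
    "business_id TEXT, topic TEXT, mean_score REAL, "
    "PRIMARY KEY (business_id, topic))"
)

# Hypothetical rows: one mean score per (restaurant, topic) pair.
rows = [("b1", "food", 0.8), ("b1", "service", -0.2)]
conn.executemany("INSERT INTO restaurant_scores VALUES (?, ?, ?)", rows)
conn.commit()

# Parameterized query for one restaurant's topic scores.
fetched = conn.execute(
    "SELECT topic, mean_score FROM restaurant_scores "
    "WHERE business_id = ? ORDER BY topic",
    ("b1",),
).fetchall()
conn.close()
```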

Version control

We use GitHub for version control and code sharing, and Google Drive for document sharing. Please check the code here.

Thank you

This is a course project for CS595. Thank you for reading.
And special thanks to Professor Kyumin Lee.