Have you ever read a review of a product or business and wondered how truthful it was? What if there was a way to determine their accuracy, and make more informed decisions? One of our Data Science Flex grads created a capstone project to help you do just that.
An accurate Yelp review helps you decide which boozy brunch to hit on your birthday or which happy hour is the best spot to host a visiting friend. But with 6.5 million reviews on the site, data science student Clayton M. wanted a way to identify which ones were of the highest quality. Using data, he was able to gain insight into which factors indicate greater accuracy and less bias.
Below are the highlights of predicting useful Yelp reviews with data science:
Project Approach/Defining Value
With so many Yelp reviews out there, the dataset was huge. Clayton reduced the computation space to 15% of the full set (about one million reviews) so that Dask and a modern PC could handle it feasibly. The standard libraries used were NumPy, pandas, scikit-learn, and spaCy.
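The downsampling step can be sketched in a few lines. This is a minimal illustration with a made-up stand-in DataFrame, not Clayton's actual code; Dask's `DataFrame.sample` accepts the same `frac` keyword as pandas, so the sampling line is unchanged when the frame is a `dask.dataframe` instead:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full review dump; the real data comes
# from Yelp's open dataset (6.5 million reviews).
rng = np.random.default_rng(42)
n_total = 10_000
reviews = pd.DataFrame({
    "text": ["review " + str(i) for i in range(n_total)],
    "useful": rng.poisson(0.5, size=n_total),  # count of "useful" votes
})

# Downsample to 15% of the corpus so a single modern PC can cope.
# Dask's DataFrame.sample takes the same `frac` argument.
sample = reviews.sample(frac=0.15, random_state=0)

print(len(sample))  # 1,500 rows here; ~1 million on the real corpus
```

Sampling by fraction rather than fixed row count keeps the code identical whether it runs against the toy frame above or the full distributed dataset.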
Determining the value of a Yelp review was next. For this project, Clayton defined a valuable entry as one containing:

- any useful votes
- the word count
- the readability and length of the review
- embeddings
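A feature builder along these lines might look like the sketch below. It is a hypothetical reconstruction, not Clayton's code, and the article does not name the readability metric he used; the Automated Readability Index is used here because it needs only character, word, and sentence counts:

```python
import re
import pandas as pd

def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature builder mirroring the 'value' definition:
    useful-vote label, word count, length, and a readability score."""
    out = pd.DataFrame(index=df.index)
    out["is_useful"] = (df["useful"] > 0).astype(int)  # any useful votes
    out["word_count"] = df["text"].str.split().str.len()
    out["char_count"] = df["text"].str.len()
    # Automated Readability Index: a readability formula that needs only
    # character, word, and sentence counts (an assumption; the article
    # does not specify which metric Clayton used).
    sentences = df["text"].apply(lambda t: max(len(re.findall(r"[.!?]+", t)), 1))
    out["readability"] = (
        4.71 * out["char_count"] / out["word_count"]
        + 0.5 * out["word_count"] / sentences
        - 21.43
    )
    return out

demo = pd.DataFrame({
    "text": ["Great tacos. Friendly staff!", "meh"],
    "useful": [3, 0],
})
print(extract_features(demo))
```

The embeddings column is omitted here; in the full project it would come from spaCy document vectors or the LSA pipeline described further down.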
When establishing functions to run data calculations, Clayton came up with five, chosen for clarity and workflow efficiency.
Challenges cropped up in this distributed environment. Clayton found Dask difficult to work with even though it is built for larger-than-memory datasets. Even after shrinking the set to one million reviews, computation times were long and memory allocation was limited.
Using spaCy, Clayton built a scatterplot showing a relationship between the readability of a review and its length: the longer the review, the more readable it tends to be.
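The same relationship the scatterplot revealed can be summarized numerically with a correlation coefficient. The data below is synthetic, generated so that readability rises with length to mimic the trend described; it is an illustration of the check, not Clayton's data:

```python
import numpy as np

# Synthetic stand-ins for the per-review length and readability columns;
# the positive slope mimics the trend the scatterplot showed.
rng = np.random.default_rng(0)
length = rng.integers(10, 500, size=1000)
readability = 0.02 * length + rng.normal(0, 1.5, size=1000)

# A scatterplot makes the trend visible; the Pearson correlation
# summarizes the same relationship as a single number.
r = np.corrcoef(length, readability)[0, 1]
print(f"correlation between length and readability: {r:.2f}")
```

A correlation near +1 would confirm what the eye picks up from the plot: longer reviews score as more readable.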
Clayton saw striking similarities between reviews classified as useful and not useful when comparing the length of text from the starting line. On first impression, the peaks and valleys are consistent, meaning that useful reviews can be any length.
To determine whether most reviews are useful or novel, Clayton had to combine predictions from two models: one built on readability and length features, the other on PCA-compressed LSA vectors. The scatterplots show that: 1) predicted classifications cluster more tightly than the actuals, 2) the model tends to vote that a review is 'novel', and 3) correct classifications aren't horrible, but there is still room for interpretation of how reviews actually behave.
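A minimal sketch of that two-model setup is below, on a toy corpus. The article does not give Clayton's exact features, hyperparameters, or ensembling rule; here LSA is truncated SVD over TF-IDF, the compression is PCA, and the predictions are combined by averaging probabilities, all of which are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the million-review sample; labels mark
# whether a review collected any "useful" votes.
texts = [
    "great food friendly staff excellent cocktails would absolutely return soon",
    "terrible service never again",
    "good happy hour cheap drinks nice patio attentive servers lovely view",
    "awful rude waiter slow",
] * 10
labels = np.array([1, 0, 1, 0] * 10)

# Model A: hand-crafted length features (a stand-in for the project's
# readability-and-length feature set).
feat = np.array([[len(t), len(t.split())] for t in texts], dtype=float)
clf_feat = LogisticRegression(max_iter=1000).fit(feat, labels)

# Model B: LSA (truncated SVD over TF-IDF), then PCA compression.
tfidf = TfidfVectorizer().fit_transform(texts)
lsa = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)
lsa_pca = PCA(n_components=2).fit_transform(lsa)
clf_lsa = LogisticRegression(max_iter=1000).fit(lsa_pca, labels)

# Combine the two models by averaging predicted probabilities — one
# simple ensembling rule; the article does not specify which was used.
p = (clf_feat.predict_proba(feat)[:, 1]
     + clf_lsa.predict_proba(lsa_pca)[:, 1]) / 2
combined = (p > 0.5).astype(int)
print("training accuracy:", (combined == labels).mean())
```

Averaging probabilities lets a confident text model outvote an uncertain length model and vice versa, which matches the spirit of combining the two prediction sources.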
Dig deeper into Clayton’s analysis here.
Clayton flexed his data science skills with a project that can be used in everyday life – that's why it's a standout student project. Data science gave Clayton the skills to surface business findings for a top social platform used by millions of people. His capstone project allowed him to discover relationships between the type of review a Yelp user leaves and its actual usefulness. His analysis could help the platform surface more useful reviews first, ultimately helping consumers find the best places to go for the ultimate Wing Wednesday, Margarita Monday, or last-minute dinner date.
Artwork by Rachel Knobloch.
Learning new tech skills means that you’re equipping yourself with the tools to create useful and interesting digital products. Check out our Data Science Flex course to see how far it could take you.