My favorites | Sign in
Project Home Wiki Issues Source
READ-ONLY: This project has been archived. For more information see this post.
Search
for
NetflixAlgorithm  
About our algorithm implementation
Updated Oct 1, 2011 by tyler.s.rhodes

Introduction

The Netflix Problem is described here. Essentially, we are trying to guess how well a user will like a movie, based on their previous ratings and on what others have rated the movie. We are given a large dataset containing information about the ratings of movies and about the customers that made those ratings. We are trying to achieve a root mean squared error (RMSE, for short) of less than 1.0 with a running time of less than 10 minutes.

Caches

Because the dataset is so large, caches are needed to drastically bring down the running time. Therefore, we are using two cache files: one for the average rating per movie, and one for the average rating per customer. This cuts the running time down because every time we want to make a guess, we can simply look up the answer in our cache instead of finding the answer in the files.

For the actual implementation of the caches, we use a dictionary for the customer rating cache and a list for the movie rating cache. We had to use a dictionary for the customer rating cache because the customer ids are sparsely populated. As for the movie rating cache, a list was sufficient because the movies are numbered 1,...,n, where n is the number of movies.

Calculating RMSE

A RMSE can be calculated by first getting all of the differences between two sets, squaring all of those differences, finding the mean of them, and finally taking the square root of the mean. More info on RMSEs here

Guessing

We simply use a linear combination of the two values that we retrieved from our cache (average rating by the user and average rating of the movie). By manipulating the ratios of the combination, we were able to achieve an RMSE < 1.0.

We also came to the realization that usually people who rate things in the extremities (1 or 5) will do so often. The same goes for movie ratings. Therefore, if the average movie/user rating is less than 2.0 or greater than 4.0, we put more emphasis on the extreme values. Otherwise, we set the weights of the ratings to be more or less equal. Below are our current weights, in table form.

customer rating weight movie rating weight
avg movie rating < 2.5 or > 4.0 .25 .75
avg customer rating < 2.5 or > 4.0 .75 .25
else .6 .4

Plug these values into the code below to get our formula for guessing.

guess = customer_rating_weight * avg_customer_rating + movie_rating_weight * avg_movie_rating
Powered by Google Project Hosting