netflix  
Updated Feb 29, 2012 by su.s...@gmail.com
Netflix.py is a program that predicts customers' ratings of movies and calculates the root mean square error (RMSE) between the actual ratings given by Netflix customers and the predicted ratings. The program is based on a contest held by Netflix to reduce the RMSE that their software was producing. The contest data listed each customer's rating for each movie, along with the date of each rating and the production date and title of each movie.

The program that I have created uses a hash table to store key,value pairs that relate the various pieces of data. In this case, a hash table of hash tables was created using Python's dictionary. For each movie, this hash table stores a customer,rating pair for each customer that rated the movie, so the data is easily accessed in subsequent processing. Because the movie ids and customer ids were not presented sequentially, and because there were gaps in the data due to Netflix's attempt to increase anonymity, the hash table was determined to be the most efficient data structure.
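As a minimal sketch of the hash table of hash tables described above, the outer dictionary can be keyed by movie id and each inner dictionary by customer id. The ids, ratings, and the helper name get_rating below are made up for illustration and are not taken from Netflix.py.

```python
# Hypothetical sample data: movie id -> {customer id -> rating}.
# The ids are sparse and unordered, which a dictionary handles naturally.
ratings = {
    1: {30878: 4, 2647871: 3},
    10: {1952305: 5},
}


def get_rating(table, movie_id, customer_id):
    """Return the stored rating, or None if the pair is not present."""
    return table.get(movie_id, {}).get(customer_id)


print(get_rating(ratings, 1, 30878))   # a pair that exists
print(get_rating(ratings, 2, 30878))   # a movie id that falls in a gap
```

Both lookups take constant time on average, regardless of how large the gaps between ids are.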

A large part of creating this hash table was the calculation of the predicted rating for each movie,customer pair. The prediction was made using three caches: a movie average cache, which lists the average rating for each movie; a customer average cache, which lists the average of all the ratings that each customer has made; and the cdr cache, which lists each customer's ratings on movies produced in each decade from 1890 to 1950. These three caches represent a significant portion of what can be deduced from a customer-movie relationship. The prediction calculation applies a different weight to the predicted value from each of the cache files and sums the weighted values to produce the final prediction.
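The weighted sum described above might look roughly like the following sketch. The function name predict_rating, the default weights, and the clamping to the 1 to 5 star range are all assumptions made for illustration; the actual weights used in Netflix.py are not stated here.

```python
def predict_rating(movie_avg, customer_avg, decade_avg, weights=(0.4, 0.4, 0.2)):
    """Combine three cached estimates into one prediction.

    The weights are placeholders (assumed, not from Netflix.py).
    """
    w_movie, w_customer, w_decade = weights
    estimate = (w_movie * movie_avg
                + w_customer * customer_avg
                + w_decade * decade_avg)
    # Assumption: keep the result inside the 1-5 star rating scale.
    return min(5.0, max(1.0, estimate))


# Hypothetical cache values for one movie,customer pair.
print(predict_rating(movie_avg=4.0, customer_avg=3.0, decade_avg=3.5))
```

In practice the weights would be tuned by trying different combinations and keeping the one that gives the lowest error on known ratings.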
The hash tables are created by two functions, make_dict and make_cache_dict. make_dict assumes that the input contains the complete information of movie, customer, and rating. It therefore reads through files in the csv format, first collecting the customer,rating pairs into a hash table and then assigning that table to a movie key as it steps through the movie ids. make_cache_dict assumes that the cache files are more specialized, such as the movie,rating pairs in the movie_avg file or the customer,rating pairs in the customer_avg file. The function is therefore designed to produce a one-dimensional hash table.
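The two builders could be sketched as follows. The input layout is an assumption: the nested builder expects a line such as "1:" to introduce a movie, followed by "customer,rating" lines, and the flat builder expects one "key,value" pair per line. The function names carry a _sketch suffix to make clear that they are illustrations rather than the original make_dict and make_cache_dict.

```python
def make_dict_sketch(lines):
    """Build {movie id: {customer id: rating}} from an iterable of lines."""
    table = {}
    current_movie = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Assumed layout: "movie_id:" starts a new block of ratings.
            current_movie = int(line[:-1])
            table[current_movie] = {}
        else:
            if current_movie is None:
                raise ValueError("rating line found before any movie id")
            fields = line.split(",")
            table[current_movie][int(fields[0])] = float(fields[1])
    return table


def make_cache_dict_sketch(lines):
    """Build a one-dimensional {id: value} table from "id,value" lines."""
    cache = {}
    for line in lines:
        line = line.strip()
        if line:
            key, value = line.split(",")[:2]
            cache[int(key)] = float(value)
    return cache


# Hypothetical input; in practice the lines would come from open(path).
nested = make_dict_sketch(["1:", "30878,4", "2647871,3", "10:", "1952305,5"])
flat = make_cache_dict_sketch(["1,3.75", "10,3.2"])
```

Taking an iterable of lines rather than a file name keeps both functions easy to try with a small list, while an open file object works the same way.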
The rsme value is the root mean square error (RMSE). It is calculated by summing the squares of the differences between each actual,predicted rating pair, dividing the sum by the total number of pairs, and then taking the square root. The get_rsme function takes in a hash table of movies paired with customer,predicted rating pairs and a hash table of movies paired with customer,actual rating pairs. It iterates through the tables to compute the RMSE and returns the rounded value.
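A minimal sketch of this calculation over two nested tables is shown below. The name compute_rmse, the digits parameter, and the choice to return the pair count alongside the error are assumptions for illustration, not the actual signature of get_rsme.

```python
import math


def compute_rmse(predicted, actual, digits=3):
    """Return (rounded RMSE, number of pairs) for two nested rating tables."""
    total = 0.0
    count = 0
    for movie_id, customers in predicted.items():
        for customer_id, guess in customers.items():
            diff = actual[movie_id][customer_id] - guess
            total += diff * diff
            count += 1
    if count == 0:
        raise ValueError("no rating pairs to compare")
    # Mean of the squared differences, then the square root.
    return round(math.sqrt(total / count), digits), count


# Hypothetical tables: one prediction is off by 1.0, the other is exact.
predicted = {1: {10: 3.0, 11: 4.0}}
actual = {1: {10: 4.0, 11: 4.0}}
print(compute_rmse(predicted, actual))  # (0.707, 2)
```

Here the squared differences sum to 1.0 over 2 pairs, so the result is the square root of 0.5, rounded to 0.707.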
The program begins with the solve_netflix method. This method first creates a hash table of actual ratings, using another cache file, actual_ratings, and a hash table of the predicted ratings. It then calls get_rsme on the two hash tables and retrieves both the RMSE and the number of elements. The writer is used to write output lines as the program iterates over the hash table of predicted ratings, printing each movie id followed by its predicted ratings. The total number of elements is then used to end the output file with a line that indicates the total number of records.
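The output step might be sketched as follows. The exact line format written by Netflix.py is not given here, so the "movie_id:" header, the one-decimal rating format, and the wording of the final line are assumptions, as is the function name write_predictions.

```python
import io


def write_predictions(predicted, count, writer):
    """Write each movie id, then its predicted ratings, then a total line."""
    for movie_id in sorted(predicted):
        writer.write(f"{movie_id}:\n")
        for rating in predicted[movie_id].values():
            writer.write(f"{rating:.1f}\n")
    # Assumed wording for the closing line.
    writer.write(f"Total records: {count}\n")


# Hypothetical predictions; io.StringIO stands in for an output file.
out = io.StringIO()
write_predictions({1: {10: 3.5, 11: 4.0}}, 2, out)
print(out.getvalue(), end="")
```

Any object with a write method can serve as the writer, so the same function works with sys.stdout or an open file.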