netflixpreprocess  

netflix, dataset, transfer
Updated May 13, 2011 by honglianglv

#Steps for preprocessing the Netflix dataset.

Introduction

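This page describes how to preprocess the Netflix dataset: making the userId sequential, merging the rating files into one file, and separating the probe set from the training data.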

Details

The Netflix dataset is preprocessed in the following steps:

  • Make the userId sequential
(1) Scan all the ratings to build a map from each original userId to a new sequential userId, then scan all the rating files again and transfer each rating to the new sequential userId (./dataset/netflix/getUserSeqId.php). The new rating files are written to ./dataset/netflix/transfer_set/

cd ./dataset/netflix/

mkdir transfer_set

php getUserSeqId.php //(this is a PHP script, so you need PHP installed; I may rewrite this program in C++ soon)
Important notice
Because there are 17770 files to process, the PHP program uses a lot of memory. You should raise PHP's memory limit to more than 1G (it is usually 16M by default). To change the memory limit: a. Find the php.ini file: "php -i | grep 'php\.ini'". b. In php.ini, find the memory_limit setting and change it.

(2) Transfer the probe set and training set to the new sequential userId (./dataset/netflix/transferProbeUserId.cpp)

./dataset/netflix/probe.txt -> ./dataset/netflix/probe_t.txt
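
As a rough illustration of the userId remapping described in step (1), here is a minimal C++ sketch. It is not the project's getUserSeqId.php: the names UserIdMapper and transferLine are hypothetical, and it assumes that each rating line has the form "userId,rating,date" while movie header lines such as "1:" contain no comma.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: hands out sequential IDs (1, 2, 3, ...) in order of
// first appearance of each original userId.
class UserIdMapper {
public:
    int seqId(int originalId) {
        assert(originalId > 0);  // assumption: original userIds are positive
        auto it = map_.find(originalId);
        if (it != map_.end()) return it->second;       // already seen
        int next = static_cast<int>(map_.size()) + 1;  // next unused sequential ID
        map_.emplace(originalId, next);
        return next;
    }
    std::size_t userCount() const { return map_.size(); }

private:
    std::unordered_map<int, int> map_;
};

// Rewrites one line of a rating file. Assumed format: "userId,rating,date".
// Lines without a comma (e.g. a "movieId:" header) are returned unchanged.
std::string transferLine(const std::string& line, UserIdMapper& mapper) {
    std::size_t comma = line.find(',');
    if (comma == std::string::npos) return line;
    int originalId = std::stoi(line.substr(0, comma));
    return std::to_string(mapper.seqId(originalId)) + line.substr(comma);
}
```

The same mapper object would need to be shared across all rating files, and its seqId could also be applied to the userIds in probe.txt so that both sets use one consistent numbering.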
  • Merge the rating files into one file (./dataset/netflix/mergeData.cpp):
    ./dataset/netflix/transfer_set/(files) -> ./dataset/netflix/data.txt
  • Get the real ratings of the probe set (./dataset/netflix/getProbeReal.cpp):
    ./dataset/netflix/probe_t.txt -> ./dataset/netflix/probe_real.txt
  • Get the training dataset without the probe ratings (./dataset/netflix/getDataWithoutProbe.cpp):
    ./dataset/netflix/transfer_set/(files) -> ./dataset/netflix/data_without_prob.txt
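
The last two steps above (looking up the real ratings of the probe set, and removing the probe ratings from the training data) can be sketched in C++ roughly as follows. This is only an in-memory illustration, not the code of getProbeReal.cpp or getDataWithoutProbe.cpp: the names Rating, pairKey, SplitResult and splitByProbe are hypothetical, and the sketch assumes the ratings and the probe (movieId, userId) pairs have already been read from the files.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// One rating record (hypothetical layout for this sketch).
struct Rating {
    int movieId;
    int userId;
    int rate;
};

// Packs a (movieId, userId) pair into one 64-bit key for hashing.
inline std::uint64_t pairKey(int movieId, int userId) {
    assert(movieId >= 0 && userId >= 0);
    return (static_cast<std::uint64_t>(movieId) << 32) |
           static_cast<std::uint64_t>(userId);
}

struct SplitResult {
    std::vector<Rating> training;   // ratings not listed in the probe set
    std::vector<Rating> probeReal;  // probe pairs together with their real ratings
};

// Splits all ratings into training data and probe data in one pass.
SplitResult splitByProbe(const std::vector<Rating>& all,
                         const std::vector<std::pair<int, int>>& probePairs) {
    std::unordered_set<std::uint64_t> probe;
    probe.reserve(probePairs.size());
    for (const auto& p : probePairs) probe.insert(pairKey(p.first, p.second));

    SplitResult out;
    for (const Rating& r : all) {
        if (probe.count(pairKey(r.movieId, r.userId)) != 0) {
            out.probeReal.push_back(r);  // rating belongs to the probe set
        } else {
            out.training.push_back(r);   // rating stays in the training data
        }
    }
    return out;
}
```

Because the records live in std::vector, the bulk of the data is allocated on the heap rather than in fixed-size local arrays.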

Comment by xinshengwen, Apr 17, 2011

At the last step, the data transfer failed with a stack overflow:
Exception: STATUS_STACK_OVERFLOW at eip=00401EA2
eax=00904E9C ebx=6125E65E ecx=00032D3C edx=00401216 esi=6125E670 edi=61179FC7
ebp=0022CD48 esp=0022CD34 program=D:\cygwin\home\Hoodoo\recs\dataset\netflix\with_out_prob.exe, pid 1588, thread main
cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
Stack trace:
Frame Function Args
0022CD48 00401EA2 (6125E670, 61179FC7, 0022CD88, 61006CD3)
0022CD88 61006CD3 (00000000, 0022CDC4, 61006570, 7FFD8000)
End of stack trace

Comment by project member honglianglv, Apr 18, 2011

If you test with the Netflix dataset, you must run the program on a machine with more than 2G of RAM. This error is caused by insufficient RAM.

