word2vec - issue #4

Segfault in script demo-phrase-accuracy.sh


Posted on Aug 19, 2013 by Happy Horse

$ ./demo-phrase-accuracy.sh
make: Nothing to be done for `all'.
Starting training using file text8
Words processed: 17000K     Vocab size: 4399K
Vocab size (unigrams + bigrams): 2586139
Words in train file: 17005206
Words written: 17000K

real    0m21.130s
user    0m20.062s
sys     0m1.054s
Starting training using file text8-phrase
Vocab size: 123636
Words in train file: 16337523
Alpha: 0.000119  Progress: 99.59%  Words/thread/sec: 22.70k

real    1m38.617s
user    12m0.795s
sys     0m1.501s
newspapers:
./demo-phrase-accuracy.sh: line 12: 36538 Segmentation fault: 11  ./compute-accuracy vectors-phrase.bin < questions-phrases.txt

I'm on OSX (latest non-beta), and had to switch an #include over to <stdlib.h> to get it to compile, but made no other changes.
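
In case anyone else hits the same compile error, a guarded include along these lines should build on both platforms. This is only a sketch, and it assumes the header being swapped out was <malloc.h>, which doesn't exist on OSX; adjust if your error names a different header.

    /* Assumption: the Linux-only header being replaced is <malloc.h>. */
    #ifdef __APPLE__
    #include <stdlib.h>   /* malloc/free are declared here on OSX */
    #else
    #include <malloc.h>
    #endif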

Comment #1

Posted on Aug 19, 2013 by Happy Horse

demo-word-accuracy.sh also crashes. The other demos run great.

Comment #2

Posted on Aug 22, 2013 by Happy Bear

I'm on OSX Lion, compiled with clang.

Using valgrind, the issue appears to be on line 102 of compute-accuracy.c:

vec[a] = M[a + b2 * size] - M[a + b1 * size] + M[a + b3 * size];

With 30k given on the command line as the word limit, M is 24,000,000 bytes, i.e. a 6M-entry float array, but by putting in an if statement I can see that the program regularly accesses memory outside of this range.

Putting in the if statement with a printf message stops the segfault.

I have: if (a + b3 * size > 6000000) printf("Memory overflow\n");

Putting this statement in there produces a bunch of memory overflow messages, but aside from that the program seems to keep trucking along, and I get a final output of:

ACCURACY TOP1: 18.77 % (122 / 650)
Total accuracy: 26.19 %   Semantic accuracy: 24.76 %   Syntactic accuracy: 26.91 %
Questions seen / total: 12268  19544   62.77 %

This is obviously not a fix; it's something to do with buffer bounds, but I'm not a C expert by any means.
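
For what it's worth, a guard along these lines around the suspect line would skip any question whose word lookup falls outside the loaded matrix, instead of reading past the end of M. This is only a sketch of the kind of check that seems to be missing, using the variable names from compute-accuracy.c (words, size, b1, b2, b3); it is not necessarily the same as the upstream fix.

    /* M holds words * size floats, so any index a + b * size with
       b >= words reads past the end of the allocation. Skip such questions. */
    if (b1 >= words || b2 >= words || b3 >= words) continue;
    for (a = 0; a < size; a++)
      vec[a] = M[a + b2 * size] - M[a + b1 * size] + M[a + b3 * size];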

Comment #3

Posted on Aug 23, 2013 by Happy Bird

Thanks for reporting this bug, it should be fixed now.

Comment #4

Posted on Aug 23, 2013 by Happy Horse

Seems still broken. I deleted all the data files, updated to the latest version, re-applied the OSX fix (#include becomes stdlib.h), ran make clean and make, and re-ran the script.

Starting training using file text8
Words processed: 17000K     Vocab size: 4399K
Vocab size (unigrams + bigrams): 2586139
Words in train file: 17005206
Words written: 17000K

real    0m20.452s
user    0m19.601s
sys     0m0.816s
Starting training using file text8-phrase
Vocab size: 123636
Words in train file: 16337523
Alpha: 0.000119  Progress: 99.59%  Words/thread/sec: 22.46k

real    1m37.069s
user    12m8.130s
sys     0m1.240s
newspapers:
./demo-phrase-accuracy.sh: line 12: 1189 Segmentation fault: 11  ./compute-accuracy vectors-phrase.bin < questions-phrases.txt

Comment #5

Posted on Aug 23, 2013 by Happy Horse

No idea what I'm doing, but if it helps:

(gdb) run vectors-phrase.bin

Comment #6

Posted on Aug 23, 2013 by Happy Horse

Removing -Ofast from the makefile seems to have helped. But wow, is it slower; maybe a 90% speed reduction?

output:

newspapers:
ACCURACY TOP1: 8.33 % (1 / 12)
Total accuracy: 8.33 %   Semantic accuracy: 8.33 %   Syntactic accuracy: nan %
ice_hockey:
ACCURACY TOP1: 0.00 % (0 / 56)
Total accuracy: 1.47 %   Semantic accuracy: 1.47 %   Syntactic accuracy: nan %
basketball:
ACCURACY TOP1: 0.00 % (0 / 30)
Total accuracy: 1.02 %   Semantic accuracy: 1.02 %   Syntactic accuracy: nan %
airlines:
ACCURACY TOP1: 14.29 % (6 / 42)
Total accuracy: 5.00 %   Semantic accuracy: 5.00 %   Syntactic accuracy: nan %
people-companies:
ACCURACY TOP1: 25.00 % (1 / 4)
Total accuracy: 5.56 %   Semantic accuracy: 5.56 %   Syntactic accuracy: nan %
Questions seen / total: 144  3218   4.47 %

Status: Fixed

Labels:
Type-Defect Priority-Medium