|
Join large files

Synopsis

    java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin [options] {files(gz)|url(gz)}

Usage

This tool merges large files, using one or more columns as the common key for each file. It temporarily stores and sorts its data using BerkeleyDB. Two or more files can be joined.

Requirements
Download

Download bigjoin.jar at https://code.google.com/p/code915/downloads/list

Source code

Options
Examples

The following example joins ensGene.txt.gz and refGene.txt.gz using chromosome/start/end as the common key. If a key is missing from one file, the missing values are replaced by "NULL". The output is ordered by chrom/start/end.

    java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
        -c 3,5,6 -L -s \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz |\
        grep -v NULL | head -n 20 | tr " " ";"

    (...)
    594;ENST00000323275;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;ENSG00000127054;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,;594;NM_017871;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;CPSF3L;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,
    594;ENST00000343938;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000215792;cmpl;cmpl;-1,0,2,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
    594;ENST00000360706;chr1;+;1250005;1254139;1253433;1254099;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000187488;cmpl;cmpl;-1,-1,0,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
    594;ENST00000338338;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300396,;1299145,1299688,1300033,1300443,;0;ENSG00000175756;cmpl;cmpl;0,1,0,-1,;594;NM_001127229;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300239,;1299145,1299688,1300033,1300443,;0;AURKAIP1;cmpl;cmpl;0,1,0,-1,

Same, but count the lines, use the default ordering, and join the missing values:

    time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
        -c 3,5,6 -L \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

    93252 2984064 36481577

    real 0m10.229s
    user 0m12.697s
    sys 0m0.548s

Same, but count the lines, use the default ordering, and do not join the missing values:

    time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
        -c 3,5,6 \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

    11519 368608 7775937

    real 0m8.966s
    user 0m11.925s
    sys 0m0.456s

Same, but count the lines, use the default ordering, do not join the missing values, and zip the data (slower but uses less space):

    time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
        -c 3,5,6 -z \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

    11519 368608 7775937

    real 0m16.631s
    user 0m19.181s
    sys 0m0.500s

Same, but count the lines, use the default ordering, join the missing values, and assume columns are unique:

    time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
        -c 3,5,6 -L -u \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

    76958 2462656 27531602

    real 0m8.466s
    user 0m11.233s
    sys 0m0.348s

Same, but count the lines, use the default ordering, join the missing values, and use smart sorting:

    time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
        -c 3,5,6 -L -s \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
        http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc

    93252 2984064 36481577

    real 0m10.018s
    user 0m12.353s
    sys 0m0.496s

TODO

Use different column indexes for each file. |
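To make the join semantics described under Usage and Examples more concrete, here is a minimal in-memory sketch in Python. It is not BigJoin's implementation, which stores and sorts its data in BerkeleyDB. The function name `join_rows`, its parameters, and the toy rows are hypothetical. The lexical ordering of keys and the cross-product handling of repeated keys are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import product


def join_rows(files, key_columns, fill_missing=False, null="NULL"):
    """Join several tables on a composite key (illustrative sketch only).

    files        -- list of tables; each table is a list of rows (lists of str)
    key_columns  -- 1-based column indexes forming the common key
    fill_missing -- if True, a file lacking a key contributes a row of `null`
                    values; if False, that key is dropped from the output
    """
    idx = [c - 1 for c in key_columns]

    # Group the rows of each file by their key tuple.
    groups = []
    for rows in files:
        by_key = defaultdict(list)
        for row in rows:
            by_key[tuple(row[i] for i in idx)].append(row)
        groups.append(by_key)

    # Width of each file, used to build the placeholder row.
    widths = [len(rows[0]) if rows else 0 for rows in files]

    # Assumption: keys are visited in plain lexical order.
    all_keys = sorted(set().union(*groups))

    joined = []
    for key in all_keys:
        per_file = []
        for by_key, width in zip(groups, widths):
            rows = by_key.get(key)
            if rows is None:
                if not fill_missing:
                    per_file = None  # key absent from one file: skip it
                    break
                rows = [[null] * width]
            per_file.append(rows)
        if per_file is None:
            continue
        # Assumption: rows sharing a key are combined as a cross product.
        for combo in product(*per_file):
            joined.append([field for row in combo for field in row])
    return joined


# Toy data: column 2 = chromosome, column 3 = start, column 4 = end.
ens = [
    ["ENST1", "chr1", "100", "200"],
    ["ENST2", "chr1", "300", "400"],
]
ref = [
    ["NM_1", "chr1", "100", "200"],
    ["NM_2", "chr2", "500", "600"],
]

inner = join_rows([ens, ref], key_columns=[2, 3, 4])
outer = join_rows([ens, ref], key_columns=[2, 3, 4], fill_missing=True)

for row in outer:
    print("\t".join(row))
```

In this sketch, `fill_missing=True` plays the role that the examples above attribute to joining the missing values, and the default corresponds to keeping only the keys present in every file.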