BigJoin  

data, java
Updated May 12, 2010 by plindenb...@gmail.com

Join large files

Synopsis

java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin [options] {files(gz)|url(gz)}

Usage

This tool merges two or more large files, using one or more columns as the common key for each file. It temporarily stores and sorts its data using BerkeleyDB.
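The idea of joining on a multi-column key can be sketched in a few lines of Java. This is only an in-memory illustration, not the BigJoin source: the class name `KeyJoinSketch`, its methods, and the sample rows are hypothetical, and a `TreeMap` stands in for the on-disk BerkeleyDB store. Keys are compared in plain string order, which may differ from the tool's own ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KeyJoinSketch {
    /** Builds a composite key from 1-based column indexes, e.g. {3, 5, 6}. */
    static String key(String line, int[] cols) {
        String[] fields = line.split("\t", -1);
        StringBuilder sb = new StringBuilder();
        for (int c : cols) {
            if (sb.length() > 0) sb.append('\t');
            sb.append(fields[c - 1]);
        }
        return sb.toString();
    }

    /** Groups lines by key; the sorted map is a stand-in for an on-disk sorted store. */
    static TreeMap<String, List<String>> index(List<String> lines, int[] cols) {
        TreeMap<String, List<String>> map = new TreeMap<>();
        for (String line : lines) {
            map.computeIfAbsent(key(line, cols), k -> new ArrayList<>()).add(line);
        }
        return map;
    }

    /** Emits one joined row per pair of lines sharing a key, in key order. */
    static List<String> join(List<String> a, List<String> b, int[] cols) {
        TreeMap<String, List<String>> left = index(a, cols);
        TreeMap<String, List<String>> right = index(b, cols);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : left.entrySet()) {
            List<String> matches = right.get(e.getKey());
            if (matches == null) continue; // key missing from the second file
            for (String l : e.getValue()) {
                for (String r : matches) out.add(l + "\t" + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows: columns 2 and 3 form the key.
        List<String> a = Arrays.asList("g1\tchr1\t100", "g2\tchr2\t200");
        List<String> b = Arrays.asList("t9\tchr2\t200", "t8\tchr3\t300");
        for (String row : join(a, b, new int[] {2, 3})) System.out.println(row);
    }
}
```

Holding everything in memory only works for small inputs; an on-disk sorted store is what makes the same approach practical for large files.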

Requirements

Download

Download bigjoin.jar from https://code.google.com/p/code915/downloads/list

Source code

https://code.google.com/p/code915/source/browse/trunk/tools/src/java/fr/inserm/umr915/tools/BigJoin.java

Options

  • -h help; prints this screen.
  • -i case insensitive
  • -u expect unique keys per file (faster)
  • -d [regex-delim] (default:tab)
  • -c [columns] key column indexes, separated by commas (first column is '1')
  • -g ignore keys that are empty after trim(key)
  • -bdb bdb directory default:${HOME}
  • -t delimiter default:tab
  • -s smart sorting (slower) default:true
  • -z zip values (slower, less space) default:false
  • -L do a 'left join' if data is missing
  • -null string value if data is missing default:NULL
  • -p print key(s) as the very first columns. default:false
  • --log-level [level] optional. One of the java.util.logging.Level values
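What -L and -null mean for the output can be illustrated with a small sketch. It assumes that every key found in either file yields one row and that the absent side is padded with one placeholder per column; that assumption is consistent with the constant number of words per line reported by wc in the examples, but it is not taken from the BigJoin source. The class `NullFillSketch` and its methods are hypothetical, and each file is reduced to a map with one row per key for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class NullFillSketch {
    /** Builds a tab-separated run of placeholders, one per missing column. */
    static String filler(String nullValue, int columns) {
        return String.join("\t", Collections.nCopies(columns, nullValue));
    }

    /** Joins two key-to-row maps, padding whichever side lacks the key. */
    static List<String> joinWithFill(Map<String, String> a, int colsA,
                                     Map<String, String> b, int colsB,
                                     String nullValue) {
        TreeSet<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet()); // assumption: keys from both files are kept
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            String left = a.containsKey(k) ? a.get(k) : filler(nullValue, colsA);
            String right = b.containsKey(k) ? b.get(k) : filler(nullValue, colsB);
            out.add(left + "\t" + right);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up two-column rows keyed by their first column.
        Map<String, String> a = new TreeMap<>();
        a.put("k1", "k1\ta");
        Map<String, String> b = new TreeMap<>();
        b.put("k2", "k2\tb");
        for (String row : joinWithFill(a, 2, b, 2, "NULL")) System.out.println(row);
    }
}
```

Padding every missing column keeps the column count constant, so downstream tools such as cut or awk can address fields by position.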

Examples

The following example joins ensGene.txt.gz and refGene.txt.gz using chromosome/start/end as the common key. If a key is missing from one file, the missing values are replaced by "NULL". The output is ordered by chrom/start/end.

java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
  -c 3,5,6 -L -s \
  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz |\
   grep -v NULL | head -n 20 | tr "\t" ";"
(...)
594;ENST00000323275;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;ENSG00000127054;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,;594;NM_017871;chr1;-;1236827;1249909;1237101;1249851;17;1236827,1237260,1237468,1237682,1237835,1238029,1238277,1238751,1238974,1239543,1240066,1240646,1240762,1244538,1245698,1246238,1249823,;1237167,1237390,1237611,1237744,1237943,1238192,1238367,1238835,1239164,1239608,1240205,1240681,1240861,1244767,1245772,1246336,1249909,;0;CPSF3L;cmpl;cmpl;0,2,0,1,1,0,0,0,2,0,2,0,0,2,0,1,0,
594;ENST00000343938;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000215792;cmpl;cmpl;-1,0,2,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
594;ENST00000360706;chr1;+;1250005;1254139;1253433;1254099;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;ENSG00000187488;cmpl;cmpl;-1,-1,0,;594;NM_001029885;chr1;+;1250005;1254139;1252153;1253006;3;1250005,1252078,1252483,;1250345,1252275,1254139,;0;GLTPD1;cmpl;cmpl;-1,0,2,
594;ENST00000338338;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300396,;1299145,1299688,1300033,1300443,;0;ENSG00000175756;cmpl;cmpl;0,1,0,-1,;594;NM_001127229;chr1;-;1298972;1300443;1299043;1299999;4;1298972,1299242,1299947,1300239,;1299145,1299688,1300033,1300443,;0;AURKAIP1;cmpl;cmpl;0,1,0,-1,

Same join, but count the lines with wc, use default ordering, and fill in the missing values (-L):

time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
   -c 3,5,6  -L \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc
  
   93252 2984064 36481577

real	0m10.229s
user	0m12.697s
sys	0m0.548s

Same, counting the lines with default ordering, without filling in the missing values (no -L):

time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
   -c 3,5,6  \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc
  
  11519  368608 7775937

real	0m8.966s
user	0m11.925s
sys	0m0.456s

Same, counting the lines with default ordering, without filling in the missing values, and zipping the stored data (-z; slower but uses less space):

time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
   -c 3,5,6 -z \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc
  
  11519  368608 7775937

real	0m16.631s
user	0m19.181s
sys	0m0.500s

Same, counting the lines with default ordering, filling in the missing values, and assuming the keys are unique in each file (-u):

time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
   -c 3,5,6 -L -u \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc
  
  76958 2462656 27531602

real	0m8.466s
user	0m11.233s
sys	0m0.348s
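The run with -u reports fewer lines (76958) than the same join without it (93252). One plausible reading, offered here as an assumption rather than something confirmed from the source, is that repeated keys otherwise pair up row by row, while -u keeps a single row per key. The hypothetical `DuplicateKeySketch` below only counts rows for keys present in both files, to show how much repeated keys can inflate a join.

```java
import java.util.HashMap;
import java.util.Map;

class DuplicateKeySketch {
    /**
     * Counts joined rows for keys present in both files.
     * With uniqueKeys, each shared key contributes one row; otherwise it
     * contributes (rows in A) x (rows in B), i.e. every pairing.
     */
    static long joinedRows(Map<String, Integer> rowsPerKeyA,
                           Map<String, Integer> rowsPerKeyB,
                           boolean uniqueKeys) {
        long total = 0;
        for (Map.Entry<String, Integer> e : rowsPerKeyA.entrySet()) {
            Integer other = rowsPerKeyB.get(e.getKey());
            if (other == null) continue; // key absent from the second file
            total += uniqueKeys ? 1 : (long) e.getValue() * other;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up counts: one key appears twice in A and three times in B.
        Map<String, Integer> a = new HashMap<>();
        a.put("chr1:100:200", 2);
        Map<String, Integer> b = new HashMap<>();
        b.put("chr1:100:200", 3);
        System.out.println(joinedRows(a, b, false));
        System.out.println(joinedRows(a, b, true));
    }
}
```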

Same, counting the lines, filling in the missing values, with smart sorting (-s):

time java -cp bigjoin.jar:je-4.0.92.jar fr.inserm.umr915.tools.BigJoin \
   -c 3,5,6 -L -s \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGene.txt.gz \
   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz | wc
  
  93252 2984064 36481577

real	0m10.018s
user	0m12.353s
sys	0m0.496s

TODO

Use different column indexes for each file.

