dataset - Fast access to lines in a huge text file
I am working with a relatively large text file (70 GB uncompressed, 15 GB gzipped) that contains 3 columns. The lines of the file are of the form:
x1 | y1 | a1
x1 | y2 | a2
x2 | y3 | a3
x3 | y4 | a4

The x and y are sequences of words that can contain between 1 and 4 words. The strings in the first column are sorted and not unique. The strings in the second column are not unique either, and are sorted within the same element of the first column.
There are 700,000,000 lines in the uncompressed text file, and what I want is, for a query of tuples (x, y), the value in the third column. I need this access to be as fast as possible.
What I tried is to create 2 dictionaries of (string, list of integers). The first dictionary maps a string to the indexes of the lines that contain that string in the first column, and the second dictionary does the same for the second column. For a query (x, y) I can intersect these 2 lists and get the line that contains "x | y | a". I can then use a dictionary that maps the line number to its offset in the file, and use a random access file to read the line.
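To make the approach above concrete, here is a minimal sketch of it in Java. The class name `TwoIndexLookup` and its method names are hypothetical, and it assumes that fields never contain the `|` character, that the file is UTF-8, that every line has exactly three columns, and that returning the first matching line is acceptable. It is meant to illustrate the idea on a tiny file, not to scale to 700,000,000 lines.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; a sketch of the two-dictionary approach, not a tuned implementation.
class TwoIndexLookup {
    private final Map<String, List<Integer>> byX = new HashMap<>(); // first column  -> line numbers
    private final Map<String, List<Integer>> byY = new HashMap<>(); // second column -> line numbers
    private final List<Long> offsets = new ArrayList<>();           // line number   -> byte offset
    private final Path file;

    TwoIndexLookup(Path file) throws IOException {
        this.file = file;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long pos = 0;
            String raw;
            while ((raw = raf.readLine()) != null) {
                String[] cols = split(raw);
                int lineNo = offsets.size();
                offsets.add(pos);
                byX.computeIfAbsent(cols[0], k -> new ArrayList<>()).add(lineNo);
                byY.computeIfAbsent(cols[1], k -> new ArrayList<>()).add(lineNo);
                pos = raf.getFilePointer(); // byte offset where the next line starts
            }
        }
    }

    // readLine() maps each byte to one char, so re-decode as UTF-8 (assumed encoding).
    private static String[] split(String raw) {
        String line = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        return line.trim().split("\\s*\\|\\s*", -1);
    }

    /** Returns the third column of the first line matching (x, y), or null if there is none. */
    String lookup(String x, String y) throws IOException {
        List<Integer> xs = byX.get(x);
        List<Integer> ys = byY.get(y);
        if (xs == null || ys == null) {
            return null;
        }
        // Both lists are in ascending line order, so a linear merge finds the intersection.
        int i = 0, j = 0;
        while (i < xs.size() && j < ys.size()) {
            int a = xs.get(i), b = ys.get(j);
            if (a < b) {
                i++;
            } else if (a > b) {
                j++;
            } else {
                return readThirdColumn(a);
            }
        }
        return null;
    }

    // Seeks to the stored byte offset and parses just that one line.
    private String readThirdColumn(int lineNo) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offsets.get(lineNo));
            return split(raf.readLine())[2];
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("triples", ".txt");
        Files.write(tmp, Arrays.asList("x1 | y1 | a1", "x1 | y2 | a2", "x2 | y3 | a3", "x3 | y4 | a4"));
        TwoIndexLookup index = new TwoIndexLookup(tmp);
        System.out.println(index.lookup("x1", "y2")); // prints a2
        Files.delete(tmp);
    }
}
```

Note that `RandomAccessFile.readLine()` is unbuffered, so building the index this way would be slow on a large file. It is used here only because it keeps the byte offsets easy to track.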
The problem is that this requires way too much memory (maybe it's because I'm using Java!). I am looking for a solution that can query the text file quickly but doesn't require more than 20 to 30 GB of RAM.
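A rough back-of-the-envelope estimate suggests why the memory use gets so high. The sketch below is only an illustration: the class name `IndexMemoryEstimate` is hypothetical, and the per-element sizes are assumptions for a 64-bit HotSpot JVM with compressed references (about 16 bytes per boxed `Integer` plus a 4-byte reference in the list). Other JVMs or settings would give different numbers.

```java
// Hypothetical helper; the byte sizes below are assumptions, not measurements.
class IndexMemoryEstimate {
    static final long LINES = 700_000_000L; // line count stated in the question

    // Assumed: ~16 bytes per boxed Integer object plus a 4-byte reference to it.
    static final long BOXED_INT_BYTES = 16 + 4;
    static final long PRIMITIVE_INT_BYTES = 4;
    static final long PRIMITIVE_LONG_BYTES = 8;

    /** Two posting lists with one entry per line each, stored as List<Integer>. */
    static long boxedPostingsBytes() {
        return 2 * LINES * BOXED_INT_BYTES;
    }

    /** The same two posting lists stored in primitive int arrays instead. */
    static long primitivePostingsBytes() {
        return 2 * LINES * PRIMITIVE_INT_BYTES;
    }

    /** One 8-byte file offset per line, stored in a primitive long array. */
    static long offsetTableBytes() {
        return LINES * PRIMITIVE_LONG_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("boxed postings (bytes):     " + boxedPostingsBytes());
        System.out.println("primitive postings (bytes): " + primitivePostingsBytes());
        System.out.println("offset table (bytes):       " + offsetTableBytes());
    }
}
```

Under these assumptions the boxed line-number lists alone come to about 28 GB, against about 5.6 GB if the same numbers were kept in primitive arrays. The estimate leaves out the dictionary keys, the hash map entries, and any spare capacity in the lists, so the real figure would be higher.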
I guess there are methods for this kind of thing, but I'm not familiar with them. Any ideas?
Thanks