apache spark - pyspark: how to load a snappy-compressed file -


I have a file compressed using python-snappy and put in my HDFS store. I am now trying to read it in, but I get the following traceback. I can't find an example of how to read the file in so I can process it. I can read the text file (uncompressed) version fine. Should I be using sc.sequenceFile? Thanks!

I first compressed the file and pushed it to HDFS with python-snappy:

    python -m snappy -c gene_regions.vcf gene_regions.vcf.snappy
    hdfs dfs -put gene_regions.vcf.snappy /

I then added the following to spark-env.sh:

    export SPARK_EXECUTOR_MEMORY=16G
    export HADOOP_HOME=/usr/local/hadoop
    export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.1.1.8-SNAPSHOT.jar

I then launch my Spark master and slave plus an IPython notebook, and execute the code below:

    a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
    a_file.first()

    ValueError                                Traceback (most recent call last)
    in ()
    ----> 1 a_file.first()

    /home/user/software/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.pyc in first(self)
       1244         if rs:
       1245             return rs[0]
    -> 1246         raise ValueError("RDD is empty")
       1247
       1248     def isEmpty(self):

    ValueError: RDD is empty

Here is the working code (and output) for the uncompressed text file:

    a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf")
    a_file.first()

Output:

    u'##fileformat=VCFv4.1'

The issue here is that python-snappy is not compatible with Hadoop's snappy codec, which is what Spark will use to read the data when it sees a ".snappy" suffix. They are based on the same underlying algorithm, but they aren't compatible in that you can't compress with one and decompress with the other.

You can make this work either by writing the data out in the first place in snappy format using Spark or Hadoop, or by having Spark read the data in as binary blobs and then manually invoking the python-snappy decompression yourself (see binaryFiles here: http://spark.apache.org/docs/latest/api/python/pyspark.html). The binary blob approach is a bit more brittle because it needs to fit each entire input file in memory. But if your data is small enough, that will work.

