Apache Spark - PySpark: how to load a snappy-compressed file
I have compressed a file using python-snappy and put it in my HDFS store. I am now trying to read it in, but I get the following traceback. I can't find an example of how to read the file in so I can process it. I can read the text (uncompressed) version fine. Should I be using sc.sequenceFile? Thanks!
I first compressed the file with python-snappy and pushed it to HDFS:

    python -m snappy -c gene_regions.vcf gene_regions.vcf.snappy
    hdfs dfs -put gene_regions.vcf.snappy /

I then added the following to spark-env.sh:

    export SPARK_EXECUTOR_MEMORY=16G
    export HADOOP_HOME=/usr/local/hadoop
    export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.1.1.8-SNAPSHOT.jar

I then launch the Spark master and slave plus an IPython notebook, and execute the code below:

    a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
    a_file.first()
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    in ()
    ----> 1 a_file.first()

    /home/user/software/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.pyc in first(self)
       1244         if rs:
       1245             return rs[0]
    -> 1246         raise ValueError("RDD is empty")
       1247
       1248     def isEmpty(self):

    ValueError: RDD is empty
Working code for the (uncompressed) text file:

    a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf")
    a_file.first()
Output:

    u'##fileformat=VCFv4.1'
The issue here is that python-snappy is not compatible with Hadoop's snappy codec, which is what Spark will use to read the data when it sees a ".snappy" suffix. They are based on the same underlying algorithm, but they aren't compatible in the sense that you can compress with one and decompress with the other.
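To make the mismatch concrete, here is a small sketch (my own illustration, not from the original post) of the two formats python-snappy itself produces; the Hadoop SnappyCodec container that Spark expects behind a ".snappy" suffix is a third, different framing:

    import snappy

    sample = b"##fileformat=VCFv4.1\n"

    # Raw snappy block: what snappy.compress()/snappy.decompress() handle.
    raw_block = snappy.compress(sample)
    assert snappy.decompress(raw_block) == sample

    # Framed "streaming" format: what `python -m snappy -c` writes to disk.
    framed = snappy.StreamCompressor().compress(sample)

    # Neither of these matches Hadoop's SnappyCodec container, which is why
    # sc.textFile() finds no records and the RDD comes back empty.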
You can make this work either by writing the data out as snappy in the first place using Spark or Hadoop, or by having Spark read the data in as binary blobs and manually invoking python-snappy decompression yourself (see binaryFiles here: http://spark.apache.org/docs/latest/api/python/pyspark.html). The binary blob approach is a bit more brittle because it needs to fit each entire input file in memory, but if your data is small enough that will work.
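Here is a minimal sketch of both options, reusing the paths from the question. The output directory name, the decompress_blob helper, and the choice of StreamDecompressor (for the framed format that `python -m snappy -c` writes) are my assumptions, not something from the original post:

    import snappy  # python-snappy

    # Option 1: have Spark/Hadoop write the snappy data in the first place.
    # saveAsTextFile takes a Hadoop compression codec class, and the resulting
    # part files read back fine with a plain sc.textFile().
    plain = sc.textFile("hdfs://master:54310/gene_regions.vcf")
    plain.saveAsTextFile("hdfs://master:54310/gene_regions_snappy",
                         compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")
    sc.textFile("hdfs://master:54310/gene_regions_snappy").first()

    # Option 2: read each python-snappy file as one binary blob and decompress
    # it yourself. `python -m snappy -c` produces the framed stream format, so
    # StreamDecompressor is used here; if the file had been written with raw
    # snappy.compress(), snappy.decompress() would be the call instead.
    # Each whole file has to fit in executor memory.
    def decompress_blob(path_and_blob):
        path, blob = path_and_blob
        return snappy.StreamDecompressor().decompress(blob).splitlines()

    lines = sc.binaryFiles("hdfs://master:54310/gene_regions.vcf.snappy") \
              .flatMap(decompress_blob)
    lines.first()

Either way, reading uncompressed text stays exactly the same as the working sc.textFile call you already have.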