hadoop - Using an HBase table as MapReduce source
As far as I understand, when using an HBase table as the source for a MapReduce job, we have to define a value for the scan. Let's say we set it to 500. Does this mean that each mapper is only given 500 rows from the HBase table? Is there any problem if we set it to a very high value?
If the scan size is small, don't we have the same problem as having small files in MapReduce?
Here's the sample code from the HBase book on how to run a MapReduce job reading from an HBase table.
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "ExampleRead");
    job.setJarByClass(MyReadJob.class);     // class that contains mapper

    Scan scan = new Scan();
    scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
    scan.setCacheBlocks(false);  // don't set to true for MR jobs
    // set other scan attrs ...

    TableMapReduceUtil.initTableMapperJob(
        tableName,        // input HBase table name
        scan,             // Scan instance to control CF and attribute selection
        MyMapper.class,   // mapper
        null,             // mapper output key
        null,             // mapper output value
        job);
    job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

    boolean b = job.waitForCompletion(true);
    if (!b) {
        throw new IOException("error with job!");
    }

When you say "value for scan", that's not a real thing. You either mean scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().
- setCaching is used to tell the server how many rows to load before returning the result to the client.
- setBatch is used to limit the number of columns returned in each call if you have a very wide table.
- setMaxResultSize is used to limit the size (in bytes) of the results returned to the client.
You typically don't set the MaxResultSize in a MapReduce job, so you will see all of the data.
The reference for the above information is here.
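To make the original question concrete: the caching value does not cap how many rows a mapper processes. With the default table input format a mapper is created per region and reads every row of that region that matches the scan; caching only decides how many rows travel in each round trip to the region server. The sketch below is a plain-Java model of that arithmetic and does not call HBase at all. The class name, the method names, and the row and column counts are made up for illustration, and it ignores details such as the extra call a real scanner makes to detect the end of the scan.

```java
// Hypothetical, HBase-free model: it only counts round trips and result chunks.
final class ScanCachingSketch {

    // Round trips needed to read `rows` rows when each call returns
    // at most `caching` rows (ceiling division).
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching;
    }

    // Number of partial results one row is split into when each result
    // carries at most `batch` columns (models the effect of setBatch).
    static int resultsPerRow(int columns, int batch) {
        if (columns < 0 || batch <= 0) {
            throw new IllegalArgumentException("columns must be >= 0 and batch > 0");
        }
        return (columns + batch - 1) / batch;
    }

    public static void main(String[] args) {
        long rowsInRegion = 1_000_000L; // assumed region size, for illustration only
        for (int caching : new int[] {1, 500, 10_000}) {
            System.out.println("caching=" + caching + " -> "
                    + roundTrips(rowsInRegion, caching) + " round trips");
        }
        // A wide row with 1000 columns read with a batch size of 100.
        System.out.println("results per wide row: " + resultsPerRow(1000, 100));
    }
}
```

In this model a mapper over a one-million-row region still sees all one million rows whether caching is 1 or 500; the only difference is one million round trips versus 2000. The cost of a very high value is that each fetched chunk has to be held in memory on both the client and the region server, and a mapper that works through a large chunk slowly can run into scanner timeouts, so the value is a trade-off rather than something to maximize.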