hadoop - Using an HBase table as MapReduce source


As far as I understand, when using an HBase table as the source for a MapReduce job, we have to define a value for the scan. Let's say we set it to 500: does that mean each mapper is given only 500 rows from the HBase table? Is there a problem if we set it to a very high value?

If the scan size is small, don't we have the same problem as having small files in MapReduce?

Here's the sample code from the HBase book on how to run a MapReduce job reading from an HBase table.

    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "ExampleRead");
    job.setJarByClass(MyReadJob.class);     // class that contains the mapper

    Scan scan = new Scan();
    scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
    scan.setCacheBlocks(false);  // don't set to true for MR jobs
    // set other scan attrs
    ...

    TableMapReduceUtil.initTableMapperJob(
        tableName,        // input HBase table name
        scan,             // Scan instance to control CF and attribute selection
        MyMapper.class,   // mapper
        null,             // mapper output key
        null,             // mapper output value
        job);
    job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from the mapper

    boolean b = job.waitForCompletion(true);
    if (!b) {
        throw new IOException("error with job!");
    }

When you say "value for the scan", that's not a real thing. You either mean scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().

  1. setCaching is used to tell the server how many rows to load before returning the result to the client.
  2. setBatch is used to limit the number of columns returned in each call if you have a very wide table.
  3. setMaxResultSize is used to limit the size (in bytes) of the results returned to the client.

Typically you don't set MaxResultSize in a MapReduce job, so you will see all of the data.

The reference for the above information is here.
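
To make the distinction concrete, here is a minimal sketch in plain Java, not HBase code. It assumes the default behaviour where TableInputFormat creates one input split (and so one mapper) per region, so a mapper reads every row in its region whatever the caching value is. Caching only changes how many rows come back per round trip to the server. The class ScanCachingModel, its methods, and the region and row sizes are hypothetical, chosen only for illustration.

```java
// Simplified model of scanner caching. Not HBase code: every name and
// number here is an assumption made for illustration.
final class ScanCachingModel {

    // Round trips one mapper needs to read its whole region:
    // ceil(rowsInRegion / caching), using integer arithmetic.
    static long rpcCalls(long rowsInRegion, int caching) {
        if (rowsInRegion < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rowsInRegion + caching - 1) / caching;
    }

    // Rough estimate of the bytes buffered per round trip,
    // assuming every row has the same average size.
    static long bytesPerFetch(int caching, long avgRowBytes) {
        return Math.multiplyExact((long) caching, avgRowBytes);
    }

    public static void main(String[] args) {
        long rowsInRegion = 1_000_000L; // assumed rows in one region
        long avgRowBytes = 1_024L;      // assumed average row size

        // The mapper still sees all rowsInRegion rows in every case;
        // only the number of round trips and the buffer size change.
        for (int caching : new int[] {1, 500, 10_000}) {
            System.out.printf("caching=%d -> %d round trips, ~%d bytes per fetch%n",
                    caching,
                    rpcCalls(rowsInRegion, caching),
                    bytesPerFetch(caching, avgRowBytes));
        }
    }
}
```

Under these assumptions, raising caching from 1 to 500 cuts the round trips per mapper from 1,000,000 to 2,000, while the number of rows each mapper processes stays the same. So a small caching value does not create a "small files" problem. It just makes each mapper chattier. The cost of a very high value is that more rows are held in memory per fetch.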

