hadoop - Using an HBase table as MapReduce source -

April 15, 2010

As far as I understand, when using an HBase table as the source for a MapReduce job, we have to define a value for the scan. Let's say we set it to 500. Does this mean that each mapper is given only 500 rows from the HBase table? Is there any problem if we set it to a very high value?

If the scan size is small, don't we have the same problem as having small files in MapReduce?

Here's the sample code from the HBase book on how to run a MapReduce job that reads from an HBase table.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains the mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which is bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs ...

TableMapReduceUtil.initTableMapperJob(
    tableName,        // input HBase table name
    scan,             // Scan instance to control CF and attribute selection
    MyMapper.class,   // mapper
    null,             // mapper output key
    null,             // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from the mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

When you say "value for the scan", that's not a real thing. You mean either scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().

setCaching is used to tell the server how many rows to load before returning the result to the client
setBatch is used to limit the number of columns returned in each call if you have a very wide table
setMaxResultSize is used to limit the size (in bytes) of the results returned to the client
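The arithmetic behind setCaching and setBatch can be sketched in plain Java. This is only a simplified model and not the HBase client: the class name ScanKnobsSketch, its methods, and the row and column counts are hypothetical, and a real scanner may issue a few extra calls (for example to open and close the scanner).

```java
// Simplified model, NOT the HBase client: it only mirrors the arithmetic
// implied by Scan.setCaching() and Scan.setBatch().
final class ScanKnobsSketch {

    // Number of fetches needed to stream `rows` rows when each fetch
    // returns at most `caching` rows (ceiling division).
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching must be > 0");
        }
        return (rows + caching - 1) / caching;
    }

    // Number of partial results a single row is split into when each
    // result carries at most `batch` columns.
    static int resultsPerRow(int columns, int batch) {
        if (columns < 0 || batch <= 0) {
            throw new IllegalArgumentException("columns must be >= 0 and batch must be > 0");
        }
        return (columns + batch - 1) / batch;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical number of rows scanned by one mapper
        System.out.println("caching=1:   " + roundTrips(rows, 1) + " fetches");
        System.out.println("caching=500: " + roundTrips(rows, 500) + " fetches");
        System.out.println("10 columns, batch=3: " + resultsPerRow(10, 3) + " results per row");
    }
}
```

Under this model, raising the caching value from 1 to 500 cuts the fetches for one million rows from 1,000,000 to 2,000. A very high value trades fewer round trips for more memory held per fetch on both the client and the server.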

Typically you don't set maxResultSize in a MapReduce job, so you will see all of the data.
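To tie this back to the original question: the caching value does not cap how many rows a mapper processes. As far as I know, TableInputFormat creates one input split (and so one mapper) per region by default, and each mapper keeps fetching until its region is exhausted. The sketch below simulates that loop in plain Java. It does not call HBase, and the class name RegionMapperSketch and the region sizes are hypothetical.

```java
// Simplified simulation, NOT the HBase client: one mapper drains one region,
// pulling rows in chunks of at most `caching` rows per fetch.
final class RegionMapperSketch {

    // Returns how many rows the mapper ends up processing for a region
    // holding `regionRows` rows. The chunk size affects only the number
    // of fetches, never the total.
    static long rowsSeenByMapper(long regionRows, int caching) {
        if (regionRows < 0 || caching <= 0) {
            throw new IllegalArgumentException("regionRows must be >= 0 and caching must be > 0");
        }
        long seen = 0;
        long remaining = regionRows;
        while (remaining > 0) {
            long chunk = Math.min(remaining, caching); // one simulated fetch
            seen += chunk;
            remaining -= chunk;
        }
        return seen;
    }

    public static void main(String[] args) {
        long[] regionRowCounts = {1_200L, 50_000L, 499L}; // hypothetical regions
        int caching = 500;
        for (int i = 0; i < regionRowCounts.length; i++) {
            long seen = rowsSeenByMapper(regionRowCounts[i], caching);
            System.out.println("mapper " + i + " sees " + seen + " rows");
        }
    }
}
```

Because the amount of work per mapper follows the region size rather than the caching value, a small caching value means more round trips, not more or smaller input splits, so it is not the same as the small-files problem.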

The reference for the above information is here.
