database - Scoring a very large dataset
I have fit a machine learning classifier on a 1-2% sample of the data using R/Python, and I'm pretty satisfied with the accuracy measures (precision, recall, F-score).
Now I need to score a huge dataset of 70 million rows/instances that resides in a Hadoop/Hive environment; the classifier is coded in R.
Information about the dataset:
70 million rows x 40 variables (columns): 18 categorical variables, the remaining 22 numeric (integers included).
How do I go about doing this? Any suggestions?
The things I have thought of doing are:
a) chunking the data out of the Hadoop system in 1 million row increments as CSV files and feeding them to R (a rough sketch of this is below);
b) some kind of batch processing.
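For option (a), what I have in mind is roughly this, assuming the trained model is saved as model.rds, the exported chunks sit in a chunks/ directory, each chunk has an id column, and predict() returns class probabilities (all of these names are placeholders):

    # Score exported ~1M-row CSV chunks one at a time.
    library(data.table)

    model <- readRDS("model.rds")            # trained classifier
    chunk_files <- list.files("chunks", pattern = "\\.csv$", full.names = TRUE)

    for (f in chunk_files) {
      chunk <- fread(f)                      # fast CSV read
      # Positive-class probability; [, 2] assumes a binary classifier with a
      # predict() method like randomForest's.
      chunk$score <- predict(model, newdata = chunk, type = "prob")[, 2]
      fwrite(chunk[, .(id, score)], sub("\\.csv$", "_scored.csv", f))
    }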
It's not a real-time system and it doesn't need to happen every day, but I would still like to score everything within 2-3 hours.
If you can install an R runtime on the datanodes, a simple Hadoop streaming map-only job can invoke the R code.
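A rough sketch of such a mapper, assuming the model file is shipped to each node via -files and the input rows are tab-delimited with an id followed by the 40 feature columns (the column names here are made up):

    #!/usr/bin/env Rscript
    # Map-only streaming mapper: read rows from stdin, emit id<TAB>score.
    model <- readRDS("model.rds")
    cols  <- c("id", paste0("v", 1:40))      # placeholder column names

    con <- file("stdin", open = "r")
    while (length(lines <- readLines(con, n = 10000)) > 0) {
      df <- read.delim(text = lines, header = FALSE, col.names = cols)
      score <- predict(model, newdata = df[, -1], type = "prob")[, 2]
      cat(sprintf("%s\t%f\n", df$id, score), sep = "")
    }
    close(con)

    # Launched with something like (paths are placeholders):
    #   hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=0 \
    #     -input /hive/warehouse/scoring_table -output /user/me/scores \
    #     -mapper score_mapper.R -files score_mapper.R,model.rds

Since there is no reduce phase, the scoring parallelizes across however many map slots the cluster has, which is what makes a 2-3 hour budget for 70 million rows plausible.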
You can also take a look at SparkR.
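A minimal SparkR sketch (Spark 2.x API), assuming the data is exposed as a Hive table named scoring_table with an id column, and using dapply to score each partition with the locally trained model (all names are placeholders):

    library(SparkR)
    sparkR.session(enableHiveSupport = TRUE)

    df    <- sql("SELECT * FROM scoring_table")   # the 70M-row Hive table
    model <- readRDS("model.rds")                 # captured by the closure

    scored <- dapply(
      df,
      function(part) {
        # part arrives as a plain R data.frame for one partition
        part$score <- predict(model, newdata = part, type = "prob")[, 2]
        part[, c("id", "score")]
      },
      schema = structType(structField("id", "string"),
                          structField("score", "double")))

    write.df(scored, path = "/user/me/scores", source = "parquet")

The advantage over exporting CSV chunks is that the data never leaves the cluster: Spark reads the Hive table, distributes the partition scoring, and writes the results back to HDFS.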