database - Scoring a very huge dataset


I have fit a machine learning classifier on a 1-2% sample of the data using R/Python, and I'm pretty satisfied with the accuracy measures (precision, recall, F-score).

Now I need to score a huge dataset of 70 million rows/instances that resides in a Hadoop/Hive environment. The classifier is coded in R.

Information about the dataset:

70 million rows x 40 variables (columns): 18 of the variables are categorical, the remaining 22 are numeric (integers included).

How should I go about doing this? Any suggestions?

The things I have thought of doing are:

a) Chunking the data out of the Hadoop system in 1-million-row increments as CSV files and feeding them to R (see the sketch after this list).

b) Some kind of batch processing.
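A minimal sketch of option (a), assuming the fitted model was saved with saveRDS() and the Hive export produced CSV files in an exports/ directory. The file names, the "row_id" column, and the predict() arguments are illustrative assumptions, not details from the original post:

    # Score exported CSV chunks one at a time with the fitted model.
    model <- readRDS("model.rds")

    chunk_files <- list.files("exports", pattern = "\\.csv$", full.names = TRUE)

    for (f in chunk_files) {
      chunk <- read.csv(f, stringsAsFactors = TRUE)

      # The 18 categorical columns must use the same factor levels the
      # model saw at training time, or predict() can fail on unseen levels.
      scores <- predict(model, newdata = chunk, type = "response")

      out <- sub("\\.csv$", "_scored.csv", f)
      write.csv(data.frame(row_id = chunk$row_id, score = scores),
                file = out, row.names = FALSE)
    }

At 1 million rows per chunk this means roughly 70 read/score/write passes, so the 2-3 hour budget depends on keeping each pass short or running several chunks in parallel (e.g. with the parallel package).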

It's not a real-time system and it doesn't need to happen every day, but the scoring should still finish within 2-3 hours.

If you can install the R runtime on the datanodes, a simple Hadoop streaming map-only job can invoke the R code.
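A minimal sketch of such a mapper, assuming the model ships to the nodes via -files and the input is tab-separated with a row id followed by the 40 feature columns in a fixed order; the script name, paths, and column names are illustrative assumptions:

    #!/usr/bin/env Rscript
    # R mapper for a Hadoop streaming map-only job. Illustrative launch:
    #   hadoop jar hadoop-streaming.jar \
    #     -D mapreduce.job.reduces=0 \
    #     -files score_mapper.R,model.rds \
    #     -mapper score_mapper.R \
    #     -input /data/features -output /data/scores
    model <- readRDS("model.rds")

    # Hypothetical fixed column order: row id plus the 40 features.
    feature_names <- c("row_id", paste0("cat_", 1:18), paste0("num_", 1:22))

    con <- file("stdin", open = "r")
    while (length(lines <- readLines(con, n = 10000)) > 0) {
      # Buffer lines and score in batches; calling predict() once per
      # row would be far too slow for 70 million records.
      fields <- strsplit(lines, "\t", fixed = TRUE)
      batch <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
      names(batch) <- feature_names

      # In practice: coerce the 22 numeric columns with as.numeric() and
      # align the 18 categorical columns to the training factor levels here.
      scores <- predict(model, newdata = batch, type = "response")
      cat(paste(batch$row_id, scores, sep = "\t"), sep = "\n")
    }
    close(con)

Because the job is map-only, the scoring parallelizes across all the map slots in the cluster and each datanode works mostly on its local blocks, which is what makes the 2-3 hour window realistic.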

You can also take a look at SparkR.
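A minimal sketch of the SparkR route, assuming Spark 2.x's dapply(); the table name "mydb.features", the "row_id" column, and "model.rds" are illustrative assumptions, not details from the original post:

    library(SparkR)
    sparkR.session(enableHiveSupport = TRUE)

    # Read the features straight from the Hive table.
    df <- sql("SELECT * FROM mydb.features")

    # Output schema for the scored rows.
    schema <- structType(structField("row_id", "string"),
                         structField("score", "double"))

    scored <- dapply(df, function(part) {
      # Runs once per partition on the executors; model.rds must be
      # shipped there, e.g. via spark-submit --files.
      model <- readRDS("model.rds")
      data.frame(row_id = part$row_id,
                 score = as.numeric(predict(model, newdata = part,
                                            type = "response")),
                 stringsAsFactors = FALSE)
    }, schema)

    write.df(scored, path = "/data/scores", source = "parquet",
             mode = "overwrite")

This keeps the data inside the cluster end to end and avoids the CSV export step of option (a) entirely.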

