scala - Apache Spark Internal Job Scheduling
I came across a feature in Spark that allows you to schedule different tasks within a Spark context.
I want to use this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], then make a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
The rest of the pipeline involves calling some statistical methods from MLlib on both RDDs and a join operation, followed by writing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation, but got more confused by the concepts of pools, users, and tasks.
What exactly are the pools? Are they certain 'tasks' that can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads, or to something like SQL context queries?
I guess it relates to how tasks are scheduled within a Spark context, but reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipelined procedure you described in paragraph 2:
map -> map -> map -> filter will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. This is because there is no need to repartition or shuffle your data: you place no requirements on the correlation between records. Spark will chain as many transformations as possible into the same stage before creating a new one, because doing so is lightweight. More information on stage separation can be found in the paper Resilient Distributed Datasets, section 5.1, Job Scheduling.
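To see the single-stage behaviour described above, here is a minimal sketch that assumes a local master and a tiny in-memory dataset in place of the text source. The object name `NarrowChainSketch`, the sample records, and the `v > 1` predicate are made up for illustration. It chains two maps and a filter, then prints the RDD lineage with `toDebugString`, which should show no shuffle boundary between the transformations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: names and sample data are assumptions, not from the question.
object NarrowChainSketch {
  def run(): Array[((String, String), Int)] = {
    val conf = new SparkConf()
      .setAppName("narrow-chain-sketch")
      .setMaster("local[2]") // assumption: run locally with two worker threads
    val sc = new SparkContext(conf)
    try {
      // Stand-in for an RDD read from a text source.
      val lines = sc.parallelize(Seq("a,x,1", "b,y,2", "a,y,3"))

      // map: text line -> key-value pair [K, V]
      val keyed = lines.map { line =>
        val parts = line.split(",")
        (parts(0), (parts(1), parts(2).toInt))
      }

      // map: key-value pair -> composite-key pair [(K1, K2), V]
      val composite = keyed.map { case (k1, (k2, v)) => ((k1, k2), v) }

      // filter: keep only records with specific values
      val filtered = composite.filter { case (_, v) => v > 1 }

      // Print the lineage; all three transformations are narrow dependencies.
      println(filtered.toDebugString)

      // The action triggers one job; the chain above runs as one stage.
      filtered.collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().foreach(println)
}
```

A later join between two pair RDDs would normally require a shuffle, so that is where a new stage would be expected to begin.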
When a stage gets executed, it becomes one task set (the same task running in different threads), and its tasks are scheduled simultaneously from Spark's perspective.
The fair scheduler is meant to schedule unrelated task sets against each other, so it is not relevant here.
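For contrast, here is a hedged sketch of a case where the fair scheduler does apply: two unrelated jobs submitted from separate threads of the same SparkContext, each assigned to its own pool. It uses the documented `spark.scheduler.mode` setting and the `spark.scheduler.pool` local property. The pool names `poolA` and `poolB`, the object name, and the data are assumptions for illustration. No allocation file is supplied, so the pools should simply get Spark's default settings.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical example: pool names and data are assumptions, not from the question.
object FairPoolsSketch {
  def run(): (Long, Long) = {
    val conf = new SparkConf()
      .setAppName("fair-pools-sketch")
      .setMaster("local[4]")               // assumption: local run
      .set("spark.scheduler.mode", "FAIR") // enable fair scheduling between jobs
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000).cache()

      // Each Future body runs on its own thread. The pool is a thread-local
      // property, so it is set inside the thread that submits the job.
      val evens = Future {
        sc.setLocalProperty("spark.scheduler.pool", "poolA")
        data.filter(_ % 2 == 0).count()
      }
      val odds = Future {
        sc.setLocalProperty("spark.scheduler.pool", "poolB")
        data.filter(_ % 2 != 0).count()
      }

      // Two independent jobs (two unrelated task sets) may now run concurrently.
      (Await.result(evens, 5.minutes), Await.result(odds, 5.minutes))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

In this reading, a "pool" is a named group of jobs that share scheduling settings, and a "user" is whoever submits jobs from a given thread of a long-running application, not a Linux account.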