Apache Spark RDD partitioning and join -


when join 2 rdds data joined, i.e. data aggregated on driver , shipped out worker nodes, or 1 of nodes randomly selected "receive" data? furthermore, if call partition on pairrdd partitioning done key automatically?

no, not proceed via driver or single node. shuffle happens wherein each of many tasks across executors collects values (from both parents) subset of keys. tasks form join product each key iterated through. partitioning key. joining 2 identically-partitioned rdds advantageous avoid shuffle.


Comments

Popular posts from this blog

jquery - How do you format the date used in the popover widget title of FullCalendar? -

Bubble Sort Manually a Linked List in Java -

asp.net mvc - SSO between MVCForum and Umbraco7 -