Apache Spark RDD partitioning and join


When I join two RDDs, how is the data joined? That is, is the data aggregated on the driver and then shipped out to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Furthermore, if I call partition on a pair RDD, is the partitioning done by key automatically?

No, the join does not proceed via the driver or any single node. A shuffle happens, in which each of many tasks across the executors collects the values (from both parent RDDs) for a subset of the keys. Each task then forms the join product for every key it iterates through. Partitioning is by key. Joining two identically partitioned RDDs is advantageous because it can avoid the shuffle.
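To make the mechanism above concrete, here is a minimal pure-Python sketch of a hash-partitioned join. It is only a simulation under stated assumptions, not Spark's actual implementation: the function names (`hash_partition`, `join_partition`, `simulated_join`) are hypothetical, partitions are plain lists, and "tasks" run one after another in a loop. In Spark itself, key-based partitioning of a pair RDD is what `partitionBy` with a `HashPartitioner` provides.

```python
from collections import defaultdict
from itertools import product


def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in the bucket chosen by hash(key).

    Hypothetical helper: mimics key-based partitioning, so equal keys
    always land in the same partition index.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_partition(left_part, right_part):
    """Join one pair of same-index partitions, as a single task would."""
    left_by_key = defaultdict(list)
    for key, value in left_part:
        left_by_key[key].append(value)

    right_by_key = defaultdict(list)
    for key, value in right_part:
        right_by_key[key].append(value)

    joined = []
    for key, left_values in left_by_key.items():
        if key in right_by_key:
            # Join product: every left value paired with every right value.
            for lv, rv in product(left_values, right_by_key[key]):
                joined.append((key, (lv, rv)))
    return joined


def simulated_join(left, right, num_partitions=4):
    """Inner join of two lists of (key, value) pairs, partition by partition.

    No step gathers all data in one place: each partition index is
    handled independently, standing in for one task on an executor.
    """
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)

    result = []
    for left_part, right_part in zip(left_parts, right_parts):
        result.extend(join_partition(left_part, right_part))
    return result
```

Because both inputs use the same partitioning function and the same number of partitions, matching keys already sit at the same partition index on both sides. That is the property that lets a join of two identically partitioned RDDs skip the shuffle.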

