Apache Spark RDD partitioning and join


When I join two RDDs, how is the data joined? That is, is the data aggregated on the driver and then shipped out to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Furthermore, if I partition a pair RDD, is the partitioning done by key automatically?

No, the join does not proceed via the driver or any single node. A shuffle happens, in which each of many tasks across the executors collects the values (from both parent RDDs) for a subset of the keys. Each task then forms the join product for each of its keys, which can be iterated through. The partitioning is by key. Joining two identically partitioned RDDs is advantageous because it can avoid the shuffle.
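To make the shuffle step concrete, here is a minimal sketch in plain Python that imitates the idea rather than using Spark itself. The function names `shuffle` and `join`, the sample data, and the choice of `hash(key) % num_partitions` as the partitioning rule are all assumptions made for illustration; in Spark the corresponding operations on pair RDDs are `partitionBy` and `join`.

```python
from collections import defaultdict

def shuffle(records, num_partitions):
    """Route each (key, value) record to a partition chosen by hashing its key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        # Hypothetical partitioning rule: same key always maps to the same partition.
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions

def join(left, right, num_partitions=4):
    """Simulate an inner join: shuffle both inputs, then join partition by partition."""
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    result = []
    # Each loop iteration stands in for one independent task on some executor.
    for lpart, rpart in zip(left_parts, right_parts):
        for key, lvalues in lpart.items():
            for lv in lvalues:
                # Only values for the same key, found in the same partition, are paired.
                for rv in rpart.get(key, []):
                    result.append((key, (lv, rv)))
    return result

# Illustrative sample data (not from the original question).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]

joined = join(users, orders)
print(sorted(joined))
```

No step in this sketch gathers everything in one place: each partition pair is joined on its own. If both inputs had already been partitioned with the same rule and the same number of partitions, the two `shuffle` calls would have nothing left to move, which is the intuition behind the advantage of joining identically partitioned RDDs.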

