Apache Spark RDD partitioning and join

February 15, 2015

When I join two RDDs, how is the data joined? That is, is the data aggregated on the driver and then shipped out to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Furthermore, if I call partition on a PairRDD, is the partitioning done by key automatically?

No, the join does not proceed via the driver or any single node. A shuffle happens, in which each of many tasks across the executors collects the values (from both parent RDDs) for a subset of the keys. Each task then forms the join product for each key as it iterates through them. Partitioning is by key. Joining two identically partitioned RDDs is advantageous because it can avoid the shuffle.
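The mechanism described above can be sketched in plain Python, without Spark, to make the per-task work concrete. This is only an illustrative simulation, not Spark's implementation or API: the names partition_by_key, join_partition and simulated_join, the sample data, and the partition count of 3 are all assumptions made for this example. Each "task" handles one bucket index, sees only the keys that hash to that bucket from both inputs, and emits the per-key product.

```python
from collections import defaultdict
from itertools import product

NUM_PARTITIONS = 3  # hypothetical partition count chosen for this sketch


def partition_by_key(pairs, num_partitions):
    """Place each (key, value) pair in the bucket chosen by hashing its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def join_partition(left_bucket, right_bucket):
    """One simulated task: group both sides by key, then emit the per-key product."""
    left_by_key = defaultdict(list)
    right_by_key = defaultdict(list)
    for key, value in left_bucket:
        left_by_key[key].append(value)
    for key, value in right_bucket:
        right_by_key[key].append(value)
    joined = []
    for key, left_values in left_by_key.items():
        # Keys missing on the right side produce no output (inner join).
        for pair in product(left_values, right_by_key.get(key, [])):
            joined.append((key, pair))
    return joined


def simulated_join(left, right, num_partitions=NUM_PARTITIONS):
    """Partition both inputs the same way, then join bucket i with bucket i."""
    left_buckets = partition_by_key(left, num_partitions)
    right_buckets = partition_by_key(right, num_partitions)
    result = []
    for i in range(num_partitions):
        result.extend(join_partition(left_buckets[i], right_buckets[i]))
    return result


# Made-up sample data with integer keys.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
joined = sorted(simulated_join(users, orders))
```

In this sketch, no single place ever holds all the data: bucket i of one input only ever meets bucket i of the other. It also suggests why identical partitioning helps. If both inputs were already split by the same function into the same number of buckets, the partition_by_key step (the stand-in for the shuffle here) would be unnecessary.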
