Apache Spark: RDD partitioning and joins
When I join two RDDs, how is the data joined? That is, is the data aggregated on the driver and then shipped out to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Furthermore, if I call partition on a pair RDD, is the partitioning done by key automatically?
No, the join does not proceed via the driver or any single node. A shuffle happens, wherein each of many tasks across the executors collects the values (from both parents) for a subset of the keys. Each task then forms the join product for each key as it iterates through them. Partitioning is by key. Joining two identically partitioned RDDs is advantageous because it avoids the shuffle.
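To make the shuffle concrete, here is a rough plain-Python simulation of a hash-partitioned join. It is only a sketch of the idea and not Spark's actual implementation: the function names (`hash_partition`, `join_partition`, `shuffle_join`) are hypothetical, the "partitions" are ordinary lists, and the loop over buckets stands in for independent tasks on the executors.

```python
from collections import defaultdict
from itertools import product


def hash_partition(pairs, num_partitions):
    """Route each (key, value) pair to the bucket chosen by the key's hash."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def join_partition(left_bucket, right_bucket):
    """Work of one simulated task: group both sides by key, emit per-key products."""
    left_by_key = defaultdict(list)
    right_by_key = defaultdict(list)
    for key, value in left_bucket:
        left_by_key[key].append(value)
    for key, value in right_bucket:
        right_by_key[key].append(value)

    joined = []
    for key, left_values in left_by_key.items():
        # Inner join: keys missing on the right contribute nothing.
        for v, w in product(left_values, right_by_key.get(key, [])):
            joined.append((key, (v, w)))
    return joined


def shuffle_join(left, right, num_partitions=3):
    """Partition both inputs the same way, then join bucket i with bucket i."""
    left_buckets = hash_partition(left, num_partitions)
    right_buckets = hash_partition(right, num_partitions)
    result = []
    # Each iteration stands in for one task; no bucket needs data from another.
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        result.extend(join_partition(left_bucket, right_bucket))
    return result


left = [(1, "a"), (2, "b"), (1, "c")]
right = [(1, "x"), (3, "y"), (1, "z")]
print(sorted(shuffle_join(left, right)))
```

Because both inputs go through the same `hash_partition` function, all values for a given key end up in the same bucket index on both sides, so each simulated task can build its part of the result without a central collector. This is also why two RDDs that already share a partitioning can be joined without moving data again. In Spark itself the corresponding calls on pair RDDs are `partitionBy` and `join`.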