Pyspark: RDD join, intersection and cartesian

This post explains three transformations on Spark RDD: join, intersection and cartesian.

What is needed to replicate these examples:

  • Access to Pyspark

If you have not used Spark, here is a post for introduction and installation of Spark in local mode (in contrast to cluster).

What can be done using these three actions? From a higher level, they combine information from two RDDs together so we can use the map function to manipulate it. We need this because Spark does not allow access to RDD from within RDD – meaning no for loop. Also, lambda expression does not support for loop too.

Join

Operations are performed on the element pairs that have the same keys.

  • Each element in the first RDD, is combined with every element in the second RDD that has the same key. 
  • How to combine? The key is not changed. But the two values are combined into a tuple. The key-tuple is then wrapped around by a container.
  • Note the level of containers here. The element structures in the third RDD, is different from the element structures in the two original RDDs.

Example:


rdd = sc.parallelize([("red",20),("red",30),("blue", 100)])
rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.join(rdd2).collect()

# output
[('red', (30, 50)), ('red', (30, 40)), ('red', (20, 50)), ('red', (20, 40))]

Intersection

For two RDDs, return the part that they are in common.


rdd1 = sc.parallelize([(1, 2), (2, 3), (3, 4), (2, 4), (1, 5)])
rdd = sc.parallelize([(1, 2), (2, 3), (3, 4), (1, 5), (2, 6)])
rdd.intersection(rdd1).collect()

Cartesian

Compute the “product” of two RDDs. Every element in the first RDD is combined with every element in the second RDD, to form a element in the new RDD.

If RDD1 is of length 3, RDD2 is of length 3, this will produce a new RDD of length 9.

Example: an RDD cartesian with itself.

rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.cartesian(rdd2).collect()

# output
[(('red', 20), ('red', 40)), (('red', 20), ('red', 50)), (('red', 20), ('yellow', 10000)), (('red', 30), ('red', 40)), (('blue', 100), ('red', 40)), (('red', 30), ('red', 50)), (('red', 30), ('yellow', 10000)), (('blue', 100), ('red', 50)), (('blue', 100), ('yellow', 10000))]
Share this post

Leave a Reply

Your email address will not be published. Required fields are marked *