Spark Parallelism


1 Answer(s)


Each transformation creates a new RDD, which means there will be separate partition sets for the FilteredRDD, MappedRDD, and ShuffledRDD. Some transformations preserve the number of partitions of their parent: for example, map() will not change the number of partitions. Others, such as reduceByKey() (and the other *ByKey operations), can change the number of partitions. An RDD does not hold data, and hence its partitions do not hold data either. Unless we persist an RDD, each time we call an action on it the data will be re-read from HDFS (or the local file system), because nothing is stored in the RDD (or its partitions).
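To illustrate the recomputation point, here is a minimal toy model in Python (the ToyRDD class and the read counter are hypothetical stand-ins, not Spark's real API): an un-persisted lineage re-reads its source on every action, while a persisted one materializes the data once.

```python
# Hypothetical toy model of RDD laziness -- not Spark's real API.
class ToyRDD:
    def __init__(self, compute):
        self.compute = compute        # lazily evaluated lineage
        self.cache = None             # filled only after persist()

    def persist(self):
        self.cache = list(self.compute())   # materialize once
        return self

    def collect(self):                # an "action"
        if self.cache is not None:
            return self.cache         # served from memory
        return list(self.compute())   # recompute the whole lineage

reads = {"count": 0}
def read_source():
    reads["count"] += 1               # stands in for an HDFS read
    return [1, 2, 3]

rdd = ToyRDD(read_source)
rdd.collect(); rdd.collect()          # two actions -> two source reads
assert reads["count"] == 2

cached = ToyRDD(read_source).persist()
cached.collect(); cached.collect()    # persisted -> no extra reads
assert reads["count"] == 3            # only the one read from persist()
```

The same pattern is why calling persist()/cache() before repeated actions matters in real Spark jobs.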

For example, let the input data be 192 MB, which is 3 blocks in HDFS Gen1 (64 MB block size). val x = sc.textFile("file://path") will create a HadoopRDD with 3 partitions (one for each block). val y = x.map(a => (a, a.length)) will create a MappedRDD with 3 partitions, as map() preserves the number of partitions. val f = y.filter(a => a._2 > 10).coalesce(1) (note that coalesce() needs a target partition count) will produce an RDD with fewer partitions.
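The block arithmetic in that example can be checked directly (assuming the classic 64 MB HDFS Gen1 default block size):

```python
import math

file_size_mb = 192
block_size_mb = 64   # default HDFS (Gen1) block size
# textFile() creates one partition per HDFS block by default
num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)    # -> 3
```

With the newer 128 MB default block size, the same file would start with only 2 partitions instead.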

The number of tasks per stage equals the number of partitions of the RDD being computed. The total number of tasks in a job therefore depends on both the partition counts and the number of stages, and the number of stages is determined by the operations (data manipulations) you perform on the data: each wide (shuffle) transformation starts a new stage.
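As a rough model (a sketch of the idea, not Spark's actual scheduler), a stage runs one task per partition, and each shuffle operation such as reduceByKey cuts a stage boundary. Assuming the partition count stays constant across stages:

```python
# Toy model: total tasks = partitions-per-stage summed over stages.
# The set of "wide" ops below is illustrative, not exhaustive.
WIDE_OPS = {"reduceByKey", "groupByKey", "join"}

def total_tasks(ops, partitions):
    # Narrow ops (map, filter) stay in the current stage;
    # each wide op closes the stage and opens a new one.
    stages = 1 + sum(1 for op in ops if op in WIDE_OPS)
    return stages * partitions  # one task per partition per stage

# map -> filter -> reduceByKey over 3 partitions: 2 stages, 6 tasks
print(total_tasks(["map", "filter", "reduceByKey"], 3))  # -> 6
```

In real Spark the per-stage partition count can change (e.g. via coalesce() or spark.sql.shuffle.partitions), so the per-stage counts would be summed individually.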