
Spark, Data Structure, Shuffle in MapReduce

Updated: Mar 25, 2021

Data Structure in MapReduce


Key-value pairs are the basic data structure in MapReduce:

  • Keys and values can be integers, floats, strings, or raw bytes

  • They can also be arbitrary data structures

The design of MapReduce algorithms involves:

  • Imposing the key-value structure on arbitrary datasets

    • E.g., for a collection of Web pages, input keys may be URLs and values may be the HTML content

  • In some algorithms, input keys are not used (e.g., word count; see the sketch below), in others they uniquely identify a record

  • Keys can be combined in complex ways to design various algorithms
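As a minimal illustration, word count in PySpark imposes the key-value structure with a map step and aggregates with a reduce step. This is only a sketch: the data and the local-mode setup are illustrative, not part of the original example.

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")  # local mode, for illustration
lines = sc.parallelize(["a rose is a rose", "is it"])
counts = (lines.flatMap(lambda line: line.split())  # one record per word
               .map(lambda w: (w, 1))               # impose the key-value structure
               .reduceByKey(lambda a, b: a + b))    # aggregate counts per key
print(counts.collect())  # e.g. [('rose', 2), ('a', 2), ('is', 2), ('it', 1)]; order may vary
sc.stop()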


Recall of Map and Reduce


Map

  • Reads data (split in Hadoop, RDD in Spark)

  • Produces key-value pairs as intermediate outputs

Reduce

  • Receives key-value pairs from multiple map jobs

  • Aggregates the intermediate data tuples into the final output

MapReduce in Hadoop

  • Data stored in HDFS (organized as blocks)

  • Hadoop MapReduce divides the input into fixed-size pieces called input splits

    • Hadoop creates one map task for each split

    • The map task runs the user-defined map function for each record in the split

    • The size of a split is normally the size of an HDFS block

Data locality optimization

  • Run the map task on a node where the input data resides in HDFS

  • This is the reason why the split size is the same as the block size

    • A block is the largest amount of input that can be guaranteed to be stored on a single node

    • If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks

MapReduce in Hadoop

Map tasks write their output to local disk (not to HDFS)

  • Map output is intermediate output

  • Once the job is complete the map output can be thrown away

  • Storing it in HDFS with replication would be overkill

  • If the node of the map task fails, Hadoop will automatically rerun the map task on another node

Reduce tasks don’t have the advantage of data locality

  • Input to a single reduce task is normally the output from all mappers

  • Output of the reduce is stored in HDFS for reliability

  • The number of reduce tasks is not governed by the size of the input, but is specified independently


More Detailed MapReduce Dataflow

When there are multiple reducers, the map tasks partition their output:

  • One partition for each reduce task

  • The records for every key are all in a single partition

  • Partitioning can be controlled by a user-defined partitioning function (see the sketch below)
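As a hedged PySpark sketch (the partitioning function and the data are illustrative), a user-defined function can decide which partition each key goes to:

from pyspark import SparkContext

sc = SparkContext("local[2]", "partitioning")
pairs = sc.parallelize([("apple", 1), ("banana", 2), ("avocado", 3)])

def first_char_partitioner(key):
    # user-defined: keys starting with 'a' go to partition 0, the rest to partition 1
    return 0 if key.startswith("a") else 1

partitioned = pairs.partitionBy(2, first_char_partitioner)
print(partitioned.glom().collect())  # all records for a given key land in one partition
sc.stop()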



Shuffle

  • Shuffling is the process of data redistribution

    • To make sure each reducer obtains all values associated with the same key.

  • It is needed for all operations that require grouping

    • E.g., word count, or computing the average score for each department

  • Spark and Hadoop have different approaches implemented for handling the shuffles.

Shuffle In Hadoop

  • Happens between each Map and Reduce phase

  • Use the Shuffle and Sort mechanism

    • Results of each Mapper are sorted by the key

    • Starts as soon as each mapper finishes

  • Use a combiner to reduce the amount of data shuffled

    • The combiner combines key-value pairs with the same key within each partition, before the shuffle

    • This is not done automatically by the framework; the user must supply the combiner! (see the PySpark analogue after this list)
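Hadoop combiners are written against the Java API, but the idea can be sketched in PySpark, where reduceByKey performs map-side combining automatically while groupByKey does not (the data is illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "combiner")
pairs = sc.parallelize([("a", 1), ("a", 1), ("b", 1), ("a", 1)], 2)

# reduceByKey pre-aggregates within each partition (like a combiner),
# so less data crosses the network during the shuffle.
with_combiner = pairs.reduceByKey(lambda x, y: x + y)

# groupByKey ships every single (key, value) pair to the reducers.
without_combiner = pairs.groupByKey().mapValues(sum)

print(sorted(with_combiner.collect()))     # [('a', 3), ('b', 1)]
print(sorted(without_combiner.collect()))  # same result, more shuffle traffic
sc.stop()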


Shuffle in Spark


Hash Shuffle

  • Data are hash partitioned on the map side

    • Hashing is much faster than sorting

  • One file is created per mapper-reducer pair to store that reducer's portion of the data

    • # of files = # of mappers × # of reducers

  • Use consolidateFiles to reduce the # of files

    • From M × R to E × C/T × R (M: mappers, R: reducers, E: executors, C: cores per executor, T: cores per task)

  • Pros:

    • Fast

    • No memory overhead of sorting

  • Cons:

    • Large number of output files (when the number of partitions is big); a legacy configuration sketch follows this list
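For reference, on old Spark releases (1.x) the shuffle implementation was selectable through configuration; hash shuffle was removed in Spark 2.0 and consolidateFiles even earlier, so treat this purely as a historical sketch:

from pyspark import SparkConf, SparkContext

# Spark 1.x only: select the hash shuffle and consolidate its output files.
# Both settings were removed in later releases.
conf = (SparkConf()
        .set("spark.shuffle.manager", "hash")
        .set("spark.shuffle.consolidateFiles", "true"))
sc = SparkContext(conf=conf)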


Sort Shuffle

  • For each mapper, 2 files are created

    • The data itself, ordered by key

    • An index recording where each reducer's block of the data file begins and ends

  • Merged on the fly while being read by the reducers

  • The default shuffle in Spark

    • Falls back to a hash-style path if the number of partitions is small (see the configuration sketch after this list)

  • Pros

    • Far fewer files created

  • Cons

    • Sorting is slower than hashing
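The fallback mentioned above is governed by a threshold on the number of reduce partitions. A hedged configuration sketch (200 is the documented default for spark.shuffle.sort.bypassMergeThreshold):

from pyspark import SparkConf, SparkContext

# If a shuffle needs no map-side aggregation and has fewer reduce partitions
# than this threshold, the sort shuffle skips sorting and writes
# per-reducer files, hash-shuffle style.
conf = SparkConf().set("spark.shuffle.sort.bypassMergeThreshold", "200")
sc = SparkContext(conf=conf)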


MapReduce in Spark

  • Transformation

    • Narrow transformation

    • Wide transformation

  • Action

    • The job is a list of transformations followed by one action

    • Only the action triggers the 'real' execution

      • I.e., lazy evaluation (sketched below)
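A small PySpark sketch of lazy evaluation (the data and functions are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "lazy")
rdd = sc.parallelize([1, 2, 3, 4])

doubled = rdd.map(lambda x: x * 2)          # narrow transformation: nothing runs yet
grouped = doubled.groupBy(lambda x: x % 4)  # wide transformation: still nothing runs

print(grouped.count())  # the action triggers the whole lineage at once
sc.stop()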


Transformation = Map? Action = Reduce?


combineByKey

  • RDD([K, V]) to RDD([K, C])

    • K: key, V: value, C: combined type

  • Three parameters (functions); a worked example follows this list

    • createCombiner: V => C

      • What is done to a value the FIRST time its key is met

    • mergeValue: (C, V) => C

      • What is done when a value meets the previously combined value for its key

      • Within a partition

    • mergeCombiners: (C, C) => C

      • What is done to two previously combined values

      • Across partitions
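The classic use is a per-key average, where the combined type C is a (sum, count) pair. A minimal sketch with illustrative data:

from pyspark import SparkContext

sc = SparkContext("local", "combineByKey")
scores = sc.parallelize([("cs", 80), ("cs", 90), ("ee", 70)])

avg = scores.combineByKey(
    lambda v: (v, 1),                              # createCombiner: V => C
    lambda c, v: (c[0] + v, c[1] + 1),             # mergeValue: (C, V) => C, within a partition
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1])  # mergeCombiners: (C, C) => C, across partitions
).mapValues(lambda c: c[0] / c[1])

print(avg.collect())  # [('cs', 85.0), ('ee', 70.0)]; order may vary
sc.stop()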


The Efficiency of MapReduce in Spark

  • Number of transformations

    • Each transformation involves a linear scan of the dataset (RDD)

  • Size of transformations

    • Smaller input size => lower cost for the linear scan

  • Shuffles

    • Transferring data between partitions is costly

      • especially in a cluster!

        • Disk I/O

        • Data serialization and deserialization

        • Network I/O


Number of Transformations (and Shuffles)

rdd = sc.parallelize(data)

  • data: (id, score) pairs

Bad design

  • maxByKey = rdd.combineByKey(…)

  • sumByKey = rdd.combineByKey(…)

  • sumMaxRdd = maxByKey.join(sumByKey)

Good design

  • sumMaxRdd = rdd.combineByKey(…)  (one scan, no join; sketched below)
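A hedged sketch of the single-pass design: compute both the max and the sum per key in one combineByKey, so the data is scanned and shuffled only once (the data is illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "onepass")
rdd = sc.parallelize([("a", 3), ("a", 5), ("b", 2)])  # (id, score) pairs

# The combined type C is a (max, sum) tuple, built in a single pass.
sumMaxRdd = rdd.combineByKey(
    lambda v: (v, v),
    lambda c, v: (max(c[0], v), c[1] + v),
    lambda c1, c2: (max(c1[0], c2[0]), c1[1] + c2[1]))

print(sumMaxRdd.collect())  # [('a', (5, 8)), ('b', (2, 2))]; order may vary
sc.stop()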


Size of Transformation

rdd = sc.parallelize(data)

  • data: (word, count) pairs

Bad design

  • countRdd = rdd.reduceByKey(…)

  • filteredRdd = countRdd.filter(…)

Good design

  • filteredRdd = rdd.filter(…)

  • countRdd = filteredRdd.reduceByKey(…)  (filter first, so the shuffle moves less data; sketched below)
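Concretely (the stop-word filter is an illustrative predicate): shrinking the RDD before reduceByKey means less data is scanned, serialized, and shuffled:

from pyspark import SparkContext

sc = SparkContext("local", "filterfirst")
rdd = sc.parallelize([("the", 1), ("spark", 1), ("the", 1)])

stopwords = {"the", "a", "an"}
filteredRdd = rdd.filter(lambda kv: kv[0] not in stopwords)  # shrink the data first
countRdd = filteredRdd.reduceByKey(lambda a, b: a + b)       # then shuffle far less

print(countRdd.collect())  # [('spark', 1)]
sc.stop()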


Partition

rdd = sc.parallelize(data)

  • data: (word, count) pairs


Bad design

  • countRdd = rdd.reduceByKey(…)

  • countBy2ndCharRdd = countRdd.map(…).reduceByKey(…)


Good design

  • partitionedRdd = rdd.partitionBy(…)

  • countBy2ndCharRdd = partitionedRdd.map(…).reduceByKey(…)  (a sketch follows)
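A hedged sketch of the idea: pay for one up-front shuffle with partitionBy, so that later aggregations on the same keys avoid re-shuffling. One caveat, which is my own substitution rather than the slide's: a plain map(…) discards the partitioner, so the sketch uses mapValues, which keeps it.

from pyspark import SparkContext

sc = SparkContext("local[2]", "prepartition")
rdd = sc.parallelize([("ab", 1), ("cb", 1), ("ad", 1)])

partitionedRdd = rdd.partitionBy(2)  # one up-front shuffle by key
countRdd = (partitionedRdd
            .mapValues(lambda v: v)            # mapValues preserves the partitioner
            .reduceByKey(lambda a, b: a + b))  # same partitioner => no second shuffle

print(countRdd.collect())
sc.stop()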


How to Merge Two RDDs?

  • Union

    • Concatenate two RDDs

  • Zip

    • Pair two RDDs

  • Join

    • Merge based on the keys from 2 RDDs

    • Just like join in DB

Union

  • How do A and B union together?

    • What is the number of partitions for the union of A and B?

  • Case 1: Different partitioners (note: the default partitioner is None)

    • The union simply concatenates the partitions, so the counts add up

  • Case 2: Same partitioner

    • The union keeps that partitioning and its number of partitions (see the sketch below)
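A hedged sketch of the two cases on toy data (the partition-count behaviour shown here matches recent Spark versions):

from pyspark import SparkContext

sc = SparkContext("local[2]", "union")
a = sc.parallelize([("x", 1)], 2)
b = sc.parallelize([("y", 2)], 3)

# Case 1: no/different partitioner: partitions are concatenated (2 + 3 = 5)
print(a.union(b).getNumPartitions())  # 5

# Case 2: same partitioner: Spark can keep the shared partitioning
pa = a.partitionBy(2)
pb = b.partitionBy(2)
print(pa.union(pb).getNumPartitions())  # 2
sc.stop()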

Zip

  • Key-value pairs after A.zip(B)

    • Keys: elements of A

    • Values: elements of B

  • Assumes that the two RDDs have

    • The same number of partitions

    • The same number of elements in each partition

    • I.e., a one-to-one correspondence (see the sketch below)
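A minimal sketch with toy data: both RDDs are built with the same number of partitions and the same number of elements per partition, as zip requires:

from pyspark import SparkContext

sc = SparkContext("local", "zip")
a = sc.parallelize([1, 2, 3], 2)
b = sc.parallelize(["x", "y", "z"], 2)  # same layout as a

print(a.zip(b).collect())  # [(1, 'x'), (2, 'y'), (3, 'z')]
sc.stop()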

Join

  • E.g., A.*Join(B)

  • join

    • All pairs with matching keys from A and B

  • leftOuterJoin

    • Case 1: key in both A and B => the pair of values is emitted

    • Case 2: key in A but not B => the value from A is paired with None

    • Case 3: key in B but not A => dropped

  • rightOuterJoin

    • Opposite of leftOuterJoin

  • fullOuterJoin

    • Union of leftOuterJoin and rightOuterJoin (see the sketch below)
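A sketch of the four variants on toy data (output order may vary):

from pyspark import SparkContext

sc = SparkContext("local", "joins")
a = sc.parallelize([("k1", 1), ("k2", 2)])
b = sc.parallelize([("k1", "x"), ("k3", "y")])

print(a.join(b).collect())            # [('k1', (1, 'x'))]: matching keys only
print(a.leftOuterJoin(b).collect())   # adds ('k2', (2, None))
print(a.rightOuterJoin(b).collect())  # adds ('k3', (None, 'y'))
print(a.fullOuterJoin(b).collect())   # all three keys
sc.stop()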


MapReduce of "Stripes"

  • Map a sentence into stripes

    • For all terms u in sentence s: H_u = new dictionary

    • For all terms v in Neighbors(u): H_u(v) = H_u(v) + 1; then emit (u, H_u)

  • Reduce by key and merge the dictionaries

    • Element-wise sum of dictionaries (see the sketch below)
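A hedged PySpark sketch, taking Neighbors(u) to be all other terms in the same sentence (one reasonable reading; the original does not pin this down) and using Counter for the element-wise sum:

from pyspark import SparkContext
from collections import Counter

sc = SparkContext("local", "stripes")
sents = sc.parallelize([["a", "b", "a"], ["b", "a"]])  # pre-tokenized sentences

def stripes(sent):
    # Map: emit one stripe (dictionary of co-occurring terms) per term occurrence
    for i, u in enumerate(sent):
        yield (u, Counter(sent[:i] + sent[i + 1:]))

cooc = sents.flatMap(stripes).reduceByKey(lambda h1, h2: h1 + h2)  # element-wise sum
print(cooc.collect())
sc.stop()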


“Stripes” Analysis

Advantages

  • Far less sorting and shuffling of key-value pairs

Disadvantages

  • More difficult to implement

  • The underlying objects (dictionaries) are more heavyweight than plain pairs

  • Fundamental limitation in terms of size of event space


Pairs vs. Stripes

  • The pairs approach

    • Keeps track of each pair of co-occurring terms separately

    • Generates a large number of (intermediate) key-value pairs

    • The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of a pair of words

  • The stripes approach

    • Keeps track of all terms that co-occur with the same term

    • Generates fewer and shorter intermediate keys

      • The framework has less sorting to do

    • Greatly benefits from combiners, as the keyspace is the vocabulary

    • More efficient, but may suffer from memory problems as stripes grow (a pairs sketch follows for contrast)
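For contrast with the stripes sketch above, a hedged pairs-approach sketch on the same toy data; note how many more intermediate records it emits:

from pyspark import SparkContext

sc = SparkContext("local", "pairs")
sents = sc.parallelize([["a", "b", "a"], ["b", "a"]])

def pairs(sent):
    # Map: one ((u, v), 1) record per co-occurrence: many small key-value pairs
    for i, u in enumerate(sent):
        for j, v in enumerate(sent):
            if i != j:
                yield ((u, v), 1)

cooc = sents.flatMap(pairs).reduceByKey(lambda a, b: a + b)
print(cooc.collect())
sc.stop()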


MapReduce in Real World: Search Engine

  • Information retrieval (IR)

    • Focus on textual information (= text/document retrieval)

    • Other possibilities include image, video, and music retrieval

  • Boolean text retrieval

    • Each document or query is treated as a "bag" of words or terms; word sequence is not considered

    • Query terms are combined logically using the Boolean operators AND, OR, and NOT.

      • E.g., ((data AND mining) AND (NOT text))

    • Retrieval

      • Given a Boolean query, the system retrieves every document that makes the query logically true

      • Exact match (a toy inverted-index sketch follows this list)
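A toy sketch of Boolean retrieval in PySpark (the documents, index layout, and query are all illustrative): build an inverted index from term to document ids, then answer ((data AND mining) AND (NOT text)) with exact-match set operations:

from pyspark import SparkContext

sc = SparkContext("local", "boolean-ir")
docs = sc.parallelize([(1, "data mining text"), (2, "data mining")])

# Inverted index: term -> set of ids of the documents containing it
index = dict(docs.flatMap(lambda d: [(w, d[0]) for w in d[1].split()])
                 .groupByKey()
                 .mapValues(set)
                 .collect())

all_ids = {1, 2}
result = (index["data"] & index["mining"]) & (all_ids - index["text"])
print(result)  # {2}: every returned document makes the query logically true
sc.stop()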



Contact us to get help related to "map-reduce assignment help", "map-reduce project help", "map-reduce homework help" or other project topics related to MapReduce at: contact@codersarts.com