
MapReduce input
The Map step of a MapReduce job hinges on the nature of the input provided to the job. It is the step that delivers the bulk of a job's parallelism gains, so crafting it smartly is important for job speedup. Input data is split into chunks, and a Map task operates on each chunk. Each chunk is called an InputSplit. Two other classes, InputFormat and RecordReader, are also significant in handling inputs to Hadoop jobs.
The InputFormat class
The input data specification for a MapReduce Hadoop job is given via the InputFormat hierarchy of classes. The InputFormat class family has the following main functions:
- Validating the input data, for example, checking for the presence of the file in the given path.
- Splitting the input data into logical chunks (InputSplits) and assigning each split to a Map task.
- Instantiating a RecordReader object that can iterate over each InputSplit and produce records to the Map task as key-value pairs, as sketched below.
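The following is a minimal sketch of how these pieces come together when configuring a job; it assumes plain-text input under a hypothetical path /data/input. TextInputFormat is the line-oriented InputFormat that Hadoop uses by default, and it supplies a LineRecordReader that turns each split into byte-offset/line key-value pairs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        // TextInputFormat validates the input path, computes the InputSplits,
        // and instantiates a LineRecordReader for each split.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
        // Mapper, reducer, and output settings are elided for brevity.
    }
}
```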
The FileInputFormat subclass and associated classes are commonly used for jobs that take inputs from HDFS. The DBInputFormat subclass is a specialized class that can be used to read data from a SQL database. CombineFileInputFormat, a direct abstract subclass of FileInputFormat, can combine multiple files into a single split.
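Combining files matters when a job reads many files that are much smaller than an HDFS block, because one split per file would spawn many short-lived Map tasks. As a rough sketch, the snippet below uses CombineTextInputFormat, the concrete text-oriented subclass of CombineFileInputFormat, and caps each combined split at an assumed 128 MB:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFiles {
    // Pack many small input files into fewer, larger splits.
    static void configure(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound for each combined split; 128 MB is an assumed target.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```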
The InputSplit class
The abstract InputSplit class and its associated concrete classes represent a byte view of the input data. An InputSplit is characterized by the following main attributes (illustrated in the sketch after this list):
- The input filename
- The byte offset in the file where the split starts
- The length of the split in bytes
- The node locations where the split resides
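For file-based inputs, these attributes surface on the concrete FileSplit class. The helper below is illustrative only (the describe method is ours, not Hadoop's) and simply prints the four attributes:

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfo {
    // Print the four attributes that characterize a file-backed split.
    static void describe(FileSplit split) throws IOException {
        System.out.println("file:   " + split.getPath());   // input filename
        System.out.println("start:  " + split.getStart());  // byte offset where the split starts
        System.out.println("length: " + split.getLength()); // split length in bytes
        System.out.println("hosts:  " + String.join(",", split.getLocations())); // node locations
    }
}
```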
In HDFS, one InputSplit is created per file if the file size is less than the HDFS block size. For example, if the HDFS block size is 128 MB, any file smaller than 128 MB resides in its own InputSplit. For files that are broken up into blocks (that is, files larger than the HDFS block size), the split size is calculated by the formula given at the end of this section. The split size has an upper bound of the HDFS block size, unless the minimum split size is configured to be greater than the block size. Such cases are rare and can lead to locality problems.
Based on the location of a split and the availability of resources, the scheduler decides which node should execute the Map task for that split. The split is then communicated to the node that executes the task.
InputSplitSize = Maximum(minSplitSize, Minimum(blockSize, maxSplitSize))

- minSplitSize: mapreduce.input.fileinputformat.split.minsize
- blockSize: dfs.blocksize
- maxSplitSize: mapreduce.input.fileinputformat.split.maxsize
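Hadoop's FileInputFormat applies this same maximum-of-minimum rule when computing split sizes. A minimal Java mirror of the formula, with the defaults shown as comments, demonstrates why the block size is the usual upper bound:

```java
public class SplitSizeDemo {
    // Same expression as the formula above:
    // max(minSplitSize, min(blockSize, maxSplitSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // dfs.blocksize: 128 MB
        long minSize = 1L;                   // effective default minimum split size
        long maxSize = Long.MAX_VALUE;       // default split.maxsize
        // With the defaults, the split size equals the block size (134217728 bytes).
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}
```

With the default minimum and maximum, the expression collapses to the block size; only a minSplitSize larger than the block size can push a split beyond one block, which is the rare, locality-unfriendly case noted above.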