
MapReduce input

The Map step of a MapReduce job hinges on the nature of the input provided to the job. The Map step offers the greatest parallelism gains, and crafting it smartly is important for job speedup. Input data is split into chunks called InputSplits, and a Map task operates on each split. Two other classes, InputFormat and RecordReader, are also significant in handling inputs to Hadoop jobs.

The InputFormat class

The input data specification for a Hadoop MapReduce job is given via the InputFormat hierarchy of classes. The InputFormat class family has the following main functions (a sketch of this contract follows the list):

  • Validating the input data; for example, checking that the files are present in the given path.
  • Splitting the input data into logical chunks (InputSplits) and assigning each split to a Map task.
  • Instantiating the RecordReader object that works on each InputSplit and produces records for the Map task as key-value pairs.
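
In the new API, these responsibilities correspond to the two abstract methods of org.apache.hadoop.mapreduce.InputFormat. The following is a simplified sketch of that contract:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Simplified sketch of org.apache.hadoop.mapreduce.InputFormat.
public abstract class InputFormat<K, V> {

    // Splits the input into logical InputSplits, one per Map task.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Creates the RecordReader that turns a split into key-value records.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}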

The FileInputFormat subclass and associated classes are commonly used for jobs that take their inputs from HDFS. The DBInputFormat subclass is a specialized class that can be used to read data from a SQL database. CombineFileInputFormat is a direct abstract subclass of FileInputFormat that can combine multiple files into a single split.
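
As an illustration, a job reading text files from HDFS would typically wire these classes together as follows. The class name and input path are hypothetical, and the mapper, reducer, and output settings are omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-example");
        // TextInputFormat, a FileInputFormat subclass, produces one
        // <byte offset, line> key-value pair per line of the input files.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
    }
}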

The InputSplit class

The abstract InputSplit class and its associated concrete classes represent a byte-oriented view of the input data. An InputSplit is characterized by the following main attributes (a simplified sketch of the class follows the list):

  • The input filename
  • The byte offset in the file where the split starts
  • The length of the split in bytes
  • The node locations where the split resides
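
The base class exposes only the last two of these attributes; the filename and starting byte offset come from the concrete FileSplit subclass via its getPath(), getStart(), and getLength() methods. A simplified sketch of the base class:

import java.io.IOException;

// Simplified sketch of org.apache.hadoop.mapreduce.InputSplit.
public abstract class InputSplit {

    // The length of the split in bytes; the framework sorts splits by
    // size so that the largest ones are scheduled first.
    public abstract long getLength() throws IOException, InterruptedException;

    // Hostnames of the nodes where the split's data resides; the
    // scheduler uses these for data-local task placement.
    public abstract String[] getLocations() throws IOException, InterruptedException;
}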

In HDFS, one InputSplit is created per file if the file is smaller than the HDFS block size. For example, if the HDFS block size is 128 MB, any file smaller than 128 MB resides in its own InputSplit. For files that are broken up into blocks (that is, files larger than the HDFS block size), the formula given below is used to calculate the split size. The split size is bounded above by the HDFS block size, unless the configured minimum split size is greater than the block size. Such cases are rare and can lead to locality problems.

Based on the locations of the splits and the availability of resources, the scheduler decides which node should execute the Map task for each split. The split is then communicated to the node that executes the task.

InputSplitSize = Maximum(minSplitSize, Minimum(blockSize, maxSplitSize))

where the three parameters come from the following configuration properties:

  minSplitSize: mapreduce.input.fileinputformat.split.minsize
  blockSize: dfs.blocksize
  maxSplitSize: mapreduce.input.fileinputformat.split.maxsize
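
FileInputFormat implements this calculation in a one-line protected helper; a sketch of the equivalent logic:

// Equivalent of FileInputFormat's computeSplitSize() helper: the block
// size is capped by maxSplitSize and floored by minSplitSize.
static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}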

Note

In previous releases of Hadoop, the minimum split size was given by the mapred.min.split.size property and the maximum split size by the mapred.max.split.size property. Both properties are now deprecated.
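
Assuming a new-API Job instance named job, the bounds can be set through the FileInputFormat helpers; the 64 MB and 256 MB values below are illustrative only:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative bounds: floor splits at 64 MB, cap them at 256 MB.
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

// The equivalent configuration-based form:
// conf.setLong("mapreduce.input.fileinputformat.split.minsize", 64L * 1024 * 1024);
// conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);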