PrepAway - Latest Free Exam Questions & Answers

Which best describes how TextInputFormat processes input files and line breaks?


A.
Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader
of the split that contains the beginning of the broken line.

B.
Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReaders of both splits containing the broken line.

C.
The input file is split exactly at the line breaks, so each RecordReader will read a series of
complete lines.

D.
Input file splits may cross line breaks. A line that crosses file splits is ignored.

E.
Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader
of the split that contains the end of the broken line.

Explanation:
Because the map phase is parallelized, the input file set is first divided into several pieces
called FileSplits. If an individual file is so large that reading it as one piece would hurt seek
time, it is divided into several FileSplits. The splitting knows nothing about the input file's
internal logical structure; line-oriented text files, for example, are split on arbitrary byte
boundaries. A new map task is then created for each FileSplit.
When an individual map task starts, it opens a new output writer per configured reduce task. It
then reads its FileSplit using the RecordReader it gets from the specified InputFormat. The
InputFormat, through its RecordReader, parses the input and generates key-value pairs. It must also
handle records that straddle a FileSplit boundary. For example, TextInputFormat reads the last line
of a FileSplit past the split boundary and, for every FileSplit other than the first, ignores the
content up to the first newline. A line that crosses a split boundary is therefore read once, in
full, by the RecordReader of the split that contains the beginning of that line.
Reference: How Map and Reduce operations are actually carried out
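
The boundary rule in the explanation can be tried out with a small in-memory sketch. This is not Hadoop's code: the class name SplitLineDemo, the method readSplit, the sample data, and the split offsets are all assumptions made for illustration, and the input is a plain byte array rather than an HDFS stream. It follows the same idea as the explanation: every split except the first discards content through the first newline, and every split finishes any line that starts before its end, even if that means reading past the boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitLineDemo {

    /** Returns the lines owned by the split [start, end) of data (hypothetical helper). */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: back up one byte and discard everything
            // through the first newline. If the byte just before `start` is
            // already '\n', only that newline is discarded.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        // Read whole lines as long as the line STARTS before the split end;
        // the last line may run past `end` into the next split's bytes.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Arbitrary 8-byte splits: "bravo" and "charlie" both cross a boundary.
        System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(data, 8, 16));  // [charlie]
        System.out.println(readSplit(data, 16, 20)); // []
    }
}
```

In this sketch each line is emitted exactly once, by the split in which the line begins, even though the split boundaries fall in the middle of "bravo" and "charlie".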

11 Comments on “Which best describes how TextInputFormat processes input files and line breaks?”

  1. yogeswaran says:

    It is option A.

    If the split's start position is not zero, the reader seeks to one byte before the start and skips the first record (everything up to and including the first \n). If the previous split ends exactly at an end of line (\n), the current split starts processing from its own beginning. If the previous split does not end at an end of line, the previous reader continues to the next \n, and the current split starts reading after that \n.

    if (start != 0) {
      // Not the first split: back up one byte so that the readLine() call
      // below discards everything through the first newline.
      skipFirstLine = true;
      --start;
      fileIn.seek(start);
    }
    in = new LineReader(fileIn, job);
    if (skipFirstLine) { // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,
          (int) Math.min((long) Integer.MAX_VALUE, end - start));
    }
    this.pos = start;

    Please correct me if I’m wrong. Else please update the answer. Thanks!




  2. Vishal says:

    The answer is C. TextInputFormat considers each line a record/value. For more information, read Hadoop: The Definitive Guide, page 246.




  3. anonymous says:

    The correct one is ‘E’: “Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.” Validated in the exam.



