PrepAway - Latest Free Exam Questions & Answers

To process input key-value pairs, your mapper needs to load a 512 MB data file into memory. What is the best way to accomplish this?

A.
Place the data file in the DataCache and read the data into memory in the configure method of the mapper.

B.
Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.

C.
Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.

D.
Serialize the data file, insert it in the JobConf object, and read the data into memory in the configure method of the mapper.

Correct Answer: C

Explanation:
Hadoop has a distributed cache mechanism to make files that may be needed by Map/Reduce jobs available locally.
Use Case
Let's understand our use case in a bit more detail so that we can follow the code snippets. We have a key-value file that we need to use in our Map jobs. For simplicity, let's say we need to replace all keywords that we encounter during parsing with some other value.
So what we need is:
A key-value file (let's use a Properties file; a hypothetical sample is shown below)
The Mapper code that uses it
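For illustration only, such a Properties file might look like the following; the keywords and replacement values here are made up:

# keywords.properties (hypothetical sample)
colour=color
organise=organize
favourite=favorite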
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Properties cache;

    // setup() in the new MapReduce API plays the role of configure() in the old API
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Get the local paths of the files placed in the DistributedCache
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        if (localCacheFiles != null) {
            // expecting only a single file here
            for (Path localCacheFile : localCacheFiles) {
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use the cache here:
        // if value contains some attribute, look it up with cache.get(<value>),
        // then do some action or replace it with something else
    }
}
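For the keyword-replacement use case described above, the map method could be filled in along the following lines. This is a minimal sketch: the whitespace tokenization and emitting the input byte offset as the output key are illustrative assumptions, not part of the original snippet.

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringBuilder rewritten = new StringBuilder();
        for (String token : value.toString().split("\\s+")) {
            // emit the replacement if the token is a known keyword, otherwise keep it
            rewritten.append(cache.getProperty(token, token)).append(' ');
        }
        context.write(new Text(Long.toString(key.get())), new Text(rewritten.toString().trim()));
    }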
Note:
* Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, JARs, etc.) needed by applications.
Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that files specified via hdfs:// URLs are already present on the FileSystem at the path specified by the URL.
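To complete the picture, here is a minimal driver-side sketch of registering the file with the DistributedCache. It uses the new org.apache.hadoop.mapreduce API to match the mapper above (a JobConf-based driver works the same way via DistributedCache.addCacheFile); the HDFS path and class names are hypothetical:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class KeywordReplaceDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "keyword-replace");
        job.setJarByClass(KeywordReplaceDriver.class);
        job.setMapperClass(DistributedCacheMapper.class);
        // Register the cached file; the framework copies it to every task node.
        // The file must already exist on HDFS at this (hypothetical) path.
        DistributedCache.addCacheFile(new URI("/cache/keywords.properties"), job.getConfiguration());
        // ... set input/output formats and paths, then submit the job:
        // job.waitForCompletion(true);
    }
}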
Reference: Using Hadoop Distributed Cache

4 Comments on "What is the best way to accomplish this?"

  1. Ramesh Hiremath says:

    C.
    Place the data file in the DistributedCache and read the data into memory in the configure
    method of the mapper.



