PrepAway - Latest Free Exam Questions & Answers

What is the best way to accomplish this?

To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What
is the best way to accomplish this?


A.
Serialize the data file, insert it into the JobConf object, and read the data into memory in the
configure method of the mapper.

B.
Place the data file in the DistributedCache and read the data into memory in the map method of
the mapper.

C.
Place the data file in the DataCache and read the data into memory in the configure method of
the mapper.

D.
Place the data file in the DistributedCache and read the data into memory in the configure
method of the mapper.

Explanation:
Hadoop has a distributed cache mechanism to make files that may be needed by Map/Reduce jobs available locally on each task node.
Use Case
Let's understand our use case in a bit more detail so that we can follow the code snippets. We have a key-value file that we need to use in our map tasks. For simplicity, let's say we need to replace all keywords that we encounter during parsing with some other value.
So what we need is:
A key-value file (let's use a Properties file)
The Mapper code that uses it
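
For illustration only, such a Properties file could look like the hypothetical keywords.properties below (the keys and replacement values are invented for this example, not taken from the original explanation):

colour=color
favourite=favorite
centre=center

And here is the Mapper code that uses it: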
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

    Properties cache;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Files placed in the DistributedCache are available on the local disk of each task node
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        if (localCacheFiles != null) {
            // expecting only a single file here
            for (int i = 0; i < localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // use the cache here:
        // if value contains some attribute, cache.get(<value>)
        // do some action or replace with something else
    }
}
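
The map method above is left as a comment-only stub. A minimal sketch of how it could be filled in for the keyword-replacement use case, assuming whitespace-separated input tokens and emitting the input byte offset as the output key (both assumptions, not part of the original snippet), might be:

@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Assumption: each input line is a set of whitespace-separated tokens, and any
    // token that appears as a key in the cached Properties is replaced by its value.
    StringBuilder rewritten = new StringBuilder();
    for (String token : value.toString().split("\\s+")) {
        String replacement = cache.getProperty(token);
        rewritten.append(replacement != null ? replacement : token).append(' ');
    }
    // Emit the byte offset (as text) and the rewritten line.
    context.write(new Text(Long.toString(key.get())), new Text(rewritten.toString().trim()));
}

Because the Properties object is loaded once in setup, the 512 MB file is read a single time per task rather than once per record, which is why loading in the configure/setup method (option D) is preferred over loading in the map method (option B).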
Note:
* Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives,
jars, etc.) needed by applications.
Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The
DistributedCache assumes that files specified via hdfs:// URLs are already present on the
FileSystem at the path specified by the URL.
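
For completeness, here is a minimal sketch of the driver side that registers such a file; the HDFS path, job name, and driver class name are hypothetical and are not part of the referenced article:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedCacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the cache file before the job is submitted; the framework copies
        // it to the local disk of every task node (hypothetical HDFS path).
        DistributedCache.addCacheFile(new URI("hdfs://namenode/cache/keywords.properties"), conf);

        Job job = new Job(conf, "distributed cache example");
        job.setJarByClass(DistributedCacheDriver.class);
        job.setMapperClass(DistributedCacheMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The file is localized on each task node before the mapper's setup method runs, which is what makes the configure/setup-time load in option D work.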
Reference: Using Hadoop Distributed Cache

10 Comments on “What is the best way to accomplish this?”

    1. Nishanth says:

      When you speak of 512 MB of data, you should also take into account the memory occupied by the Java objects that will be used to persist and access it. Moreover, the configuration is read into memory by all the tasks that run for the job, so this might cause an out-of-memory error. Risky!

  1. Westby says:

    Answer is D.

    When using the old MapReduce API, we use the static method on DistributedCache
    instead, as follows:

    @Override
    public void configure(JobConf conf) {
        metadata = new NcdcStationMetadata();
        try {
            Path[] localPaths = DistributedCache.getLocalCacheFiles(conf);
            if (localPaths.length == 0) {
                throw new FileNotFoundException("Distributed cache file not found.");
            }
            File localFile = new File(localPaths[0].toString());
            metadata.initialize(localFile);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    After MapReduce 0.21, you need to get the cached data from the setup method in map/reduce tasks:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        URI[] uris = DistributedCache.getCacheFiles(context.getConfiguration());
        Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        // TODO
    }

  2. Debi says:

    Answers A and D can both be correct depending on the context. A can be correct because 512 MB is tiny for Hadoop clusters, but D is the safer approach.



