You have a 20-node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High
Availability (HA). You want to minimize the chance of data loss in your cluster. What should you
do?

A.
Add another master node to increase the number of nodes running the JournalNode which
increases the number of machines available to HA to create a quorum
B.
Set an HDFS replication factor that provides data redundancy, protecting against node failure
C.
Run a Secondary NameNode on a different master from the NameNode in order to provide
automatic recovery from a NameNode failure.
D.
Run the ResourceManager on a different master from the NameNode in order to load-share
HDFS metadata processing
E.
Configure the cluster’s disk drives with an appropriate fault tolerant RAID level
B & D
Answer “B”
I don’t think D adds fault tolerance. It just reduces the load on a master node, which is not really necessary in such a small cluster.
Having more than 2 JournalNodes, however, adds more fault tolerance for the NameNode metadata, which is why A should be correct.
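As a sketch of what option A describes: in QJM-based HA, both NameNodes write their edit log to a quorum of JournalNodes listed in dfs.namenode.shared.edits.dir, and an odd number of JournalNodes (at least three) lets the quorum survive a failure. The hostnames (jn1–jn3.example.com) and the nameservice ID "mycluster" below are hypothetical placeholders, not values from the question.

```xml
<!-- hdfs-site.xml: minimal QJM sketch. Hostnames and nameservice ID are
     hypothetical. Three JournalNodes tolerate the loss of one, since a
     majority (2 of 3) is still available to form a quorum. -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```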
I agree that A is the correct answer: “Add another master node to increase the number of nodes running the JournalNode which increases the number of machines available to HA to create a quorum”. However, it shouldn’t say “another master node” — what is a master node? In HDFS we only have NameNodes and DataNodes. If “master node” were changed to “JournalNode”, it would be a perfect answer.
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
I think B is the only answer that makes any sense. Read the question carefully: you can have HA without setting a proper data replication factor, and data replication is directly related to potential data loss. The ResourceManager only relates to YARN functionality and is needed regardless of HA.
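For context on option B, the cluster-wide default replication factor is set in hdfs-site.xml; the value 3 below is HDFS’s default (each block is stored on three DataNodes), so the data survives the loss of two nodes.

```xml
<!-- hdfs-site.xml: dfs.replication controls how many DataNodes hold a
     copy of each block. 3 is the HDFS default. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Replication for files that already exist can be changed with `hdfs dfs -setrep -w 3 /some/path` (the path here is a hypothetical example).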
B is correct.
I think E is the best answer. Even with NameNode HA configured, you still have to worry about the risk of losing data; the only way to avoid that is to use RAID.
Why can’t it be C?
The NN and SNN should not be on the same master; the SNN should run on a different master.
From the Hadoop docs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
“Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.”
As per the Hadoop documentation, a maximum of two NameNodes can be configured.
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
dfs.ha.namenodes.[nameservice ID] – unique identifiers for each NameNode in the nameservice
Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used “mycluster” as the nameservice ID previously, and you wanted to use “nn1” and “nn2” as the individual IDs of the NameNodes, you would configure this as such:
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
Note: Currently, only a maximum of two NameNodes may be configured per nameservice.
Hence “A” is not a valid option.
I have the same idea: D.
E is the best answer.