Which of these should you choose for a row key to maximize your write throughput?

seenagape, August 27, 2016

You have 40 Web servers producing timeseries data from Web traffic logs. You want to attain high
write throughput for storing this data in an HBase table. Which of these should you choose for a
row key to maximize your write throughput?


A.
<hashCode(centralServerGeneratedSequenceID)><timestamp>

B.
<Long.MAX_VALUE - timestamp>

C.
<timestamp>

D.
<hashCode(serverGeneratingTheWeblog)><timestamp>

Explanation:
Note: The HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly) includes an
optimization note warning about a phenomenon in which an import process walks in lockstep, with all
clients in concert pounding one of the table’s regions (and thus a single node), then moving on to
the next region, and so on. With monotonically increasing row keys (e.g., a timestamp), this will
happen. The pile-up on a single region brought on by monotonically increasing keys can be mitigated
by randomizing the input records so that they are not in sorted order, but in general it is best to
avoid using a timestamp or a sequence (e.g., 1, 2, 3) as the row key.
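
To make the key layout concrete, here is a minimal sketch of how a key of the form <hashCode(serverGeneratingTheWeblog)><timestamp> could be built. It uses only the Java standard library and is not taken from the HBase API. The class name SaltedRowKey, the method rowKey, and the server names are hypothetical, and the 4-byte hash plus 8-byte timestamp layout is an assumption made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SaltedRowKey {
    // Hypothetical helper: builds <hashCode(serverName)><timestamp> as 12 bytes,
    // a 4-byte hash prefix followed by an 8-byte big-endian timestamp.
    static byte[] rowKey(String serverName, long timestampMillis) {
        return ByteBuffer.allocate(Integer.BYTES + Long.BYTES)
                .putInt(serverName.hashCode())   // prefix differs per server
                .putLong(timestampMillis)        // keeps time order within one server
                .array();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Illustrative server names, not from the original question.
        for (String server : new String[] {"web01", "web02", "web03"}) {
            System.out.println(server + " -> " + Arrays.toString(rowKey(server, now)));
        }
    }
}
```

Because the leading bytes depend on the server rather than on time, keys written at the same moment by different servers no longer sort next to each other, so they are less likely to land in the same region. Rows from a single server still sort by timestamp.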
