You have 40 web servers producing time-series data from web traffic logs. You want to achieve high
write throughput when storing this data in an HBase table. Which of these row keys should you
choose to maximize write throughput?
<hashCode(centralServerGeneratedSequenceID)><timestamp>
<Long.MAX_VALUE - timestamp>
Note: The HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly) includes an
optimization note warning about a phenomenon in which an import process walks in lockstep, with all
clients in concert pounding one of the table’s regions (and thus a single node) and then
moving on to the next region, and so on. With monotonically increasing row keys (e.g., a timestamp),
this will happen. The pile-up on a single region brought on by monotonically increasing keys can
be mitigated by randomizing the input records so that they are not in sorted order, but in general
it’s best to avoid using a timestamp or a sequence (e.g., 1, 2, 3) as the row key.
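
To illustrate the idea of leading the row key with a hashed component instead of a raw timestamp, here is a minimal sketch in plain Java. It does not use the HBase client API; it only builds the key bytes. The class name RowKeySketch, the hostnames, the bucket count of 16, and the choice to hash a per-server hostname (rather than the central sequence ID named in the first option) are all assumptions made for the example, not something the question specifies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;

public class RowKeySketch {
    // Hypothetical bucket count; in practice it would be chosen
    // with the number of regions and region servers in mind.
    public static final int BUCKETS = 16;

    // Assumed key layout: 1-byte salt | host bytes | 8-byte big-endian timestamp.
    // The salt is derived from the host, so one host always lands in the same
    // bucket while different hosts spread across buckets.
    public static byte[] saltedKey(String host, long timestampMillis) {
        byte[] hostBytes = host.getBytes(StandardCharsets.UTF_8);
        int salt = Math.floorMod(host.hashCode(), BUCKETS);
        return ByteBuffer.allocate(1 + hostBytes.length + Long.BYTES)
                .put((byte) salt)
                .put(hostBytes)
                .putLong(timestampMillis)
                .array();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Set<Integer> bucketsUsed = new TreeSet<>();
        // Simulate the 40 web servers writing at the same instant.
        for (int i = 1; i <= 40; i++) {
            byte[] key = saltedKey(String.format("web-%02d", i), now);
            bucketsUsed.add((int) key[0]);
        }
        System.out.println("Distinct leading bytes: " + bucketsUsed);
    }
}
```

Because the first byte no longer increases with time, writes issued at the same moment sort into different parts of the key space rather than all landing at the end of it. The trade-off is that a scan over a time range has to be issued once per bucket and the results merged.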