PrepAway - Latest Free Exam Questions & Answers

How will you obtain these user records?

You have user profile records in your OLTP database that you want to join with web logs you
have already ingested into the Hadoop file system. How will you obtain these user records?


A. HDFS command

B. Pig LOAD command

C. Sqoop import

D. Hive LOAD DATA command

E. Ingest with Flume agents

F. Ingest with Hadoop Streaming

Explanation:
Apache Hadoop and Pig provide excellent tools for extracting and analyzing data
from very large web logs.
We use Pig scripts to sift through the data and extract useful information from the web logs.
We load the log file into Pig using the LOAD command:
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
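As a minimal sketch only, continuing from the raw_logs relation above (the log layout, field names and HDFS paths below are illustrative assumptions, not part of the original question), the loaded lines could be parsed and joined with user profile records that have already been brought into HDFS, for example after a Sqoop import:

-- Pull the client IP, remote user and requested URL out of each combined-format log line.
parsed = FOREACH raw_logs GENERATE
             FLATTEN(REGEX_EXTRACT_ALL(line,
                 '^(\\S+) \\S+ (\\S+) \\[[^\\]]+\\] "\\S+ (\\S+) [^"]*".*$'))
             AS (ip:chararray, user_id:chararray, url:chararray);
-- User profile records assumed to be sitting in HDFS already (hypothetical path and schema).
users  = LOAD '/data/user_profiles' USING PigStorage(',')
             AS (user_id:chararray, name:chararray, country:chararray);
-- Join the web activity with the profile data for further analysis.
joined = JOIN parsed BY user_id, users BY user_id;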
Note 1:
Data Flow and Components
* Content is created by multiple web servers and logged to local hard disks. This content is
then pushed to HDFS using the Flume framework: Flume agents running on the web servers collect
the data, hand it to intermediate collectors, and finally push it to HDFS.
* Pig scripts are scheduled to run using a job scheduler (cron or any more sophisticated batch
job solution). These scripts analyze the logs along various dimensions and extract the
results. Results from Pig are written to HDFS by default, but storage implementations for other
repositories, such as HBase or MongoDB, can also be used. We have also tried the
solution with HBase (please see the implementation section). Pig scripts can either push the data
to HDFS, after which MR jobs are required to read it and push it into HBase, or they
can push the data into HBase directly (see the storage sketch after this list). In this article we
use scripts that push data onto HDFS, as we are showcasing the applicability of the Pig framework
to large-scale log analysis.
* The HBase database holds the data processed by the Pig scripts, ready for reporting and further
slicing and dicing.
* The data-access web service is a REST-based service that eases access and integration for
data clients. Because the API is REST-based, clients can be written in any language; these
clients could be BI or UI clients.
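As a minimal sketch of the storage step described in the list above (the paths, table name and column family are hypothetical), a Pig script can either write its results back to HDFS or store them directly in HBase:

-- Results extracted by an earlier Pig step (illustrative path and schema).
results = LOAD '/tmp/pig_extracted_results' USING PigStorage('\t')
              AS (user_id:chararray, url:chararray, hits:long);
-- Default route: write the results to HDFS as delimited text.
STORE results INTO '/output/log_report' USING PigStorage('\t');
-- Alternative route: push the results directly into an HBase table; the first field
-- (user_id) becomes the row key and the remaining fields map to the listed columns.
STORE results INTO 'hbase://log_report'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:url cf:hits');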
Note 2:
The Log Analysis Software Stack
* Hadoop is an open source framework that allows users to process very large data sets in parallel.
It is based on the framework that underpins the Google search engine. The Hadoop core is mainly
divided into two modules:
1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using
multiple commodity servers connected in a cluster.
2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default
implementation is bound to HDFS.
* The database can be a NoSQL database such as HBase. The advantage of a NoSQL database
is that it also provides scalability for the reporting module, as we can keep historical processed
data for reporting purposes. HBase is an open source columnar (NoSQL) database that uses
HDFS and can also use MR jobs to process data. It gives real-time, random read/write access to
very large data sets; HBase can store very large tables with millions of rows. It is a distributed
database and can also keep multiple versions of a single row.
* The Pig framework is an open source platform for analyzing large data sets, implemented as a
language layered over the Hadoop Map-Reduce framework. It eases the work of developers who would
otherwise have to write Map-Reduce code in Java; Pig instead lets users express the same logic in
a scripting language (see the sketch after this list).
* Flume is a distributed, reliable and available service for collecting, aggregating and moving
large amounts of log data (source: Flume wiki). It was built to push large logs into Hadoop HDFS for
further processing. It is a data-flow solution in which each node has an originator and a
destination, and it is divided into Agent and Collector tiers for collecting logs and pushing them
to the destination storage.
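To make the Pig point above concrete, here is a minimal sketch (the input path and regex are assumptions) of a hits-per-URL count that would otherwise require a full Java Map-Reduce program:

-- Load raw log lines and pull out the requested URL from each request line.
logs    = LOAD '/logs/apacheLog.log' USING TextLoader AS (line:chararray);
urls    = FOREACH logs GENERATE
              REGEX_EXTRACT(line, '"\\S+ (\\S+) [^"]*"', 1) AS url:chararray;
-- Group by URL and count the hits, then write the report to HDFS.
grouped = GROUP urls BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(urls) AS hits;
STORE counts INTO '/output/hits_per_url';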
Reference: Hadoop and Pig for Large-Scale Web Log Analysis

9 Comments on “How will you obtain these user records?”

  1. Ashu says:

    Sqoop imports data from an RDBMS to HDFS. In this question we need to obtain data from HDFS and join it with the RDBMS. In other words, this can be done with Sqoop export.




  2. Dhanya says:

    I believe the answer is C, but in many places the answer is shown as B. Is there any chance that the actual answer is B?




