A company has 20 software engineers working on a project. Over the past week, the team
has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of
the engineers fixed exactly five bugs last week.
One engineer points out that some bugs are more difficult to fix than others. What metric should
you use to estimate how hard a particular bug is to fix?
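The setup hints at a skewed distribution, where the mean and the median diverge. A minimal sketch with made-up per-engineer counts (not given in the question) shows how a few large values pull the mean away from the typical engineer:

```python
# Hypothetical bug-fix counts for the 20 engineers (illustrative data only).
from statistics import mean, median

fixes = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 6, 10, 12, 16, 18]

# 100 bugs / 20 engineers gives a mean of 5, yet nobody fixed exactly 5;
# the median reports the "typical" engineer's count instead.
print(mean(fixes))    # 5
print(median(fixes))  # 3.0
```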
In what way can Hadoop be used to improve the performance of Lloyd's algorithm for k-means
clustering on large data sets?
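For intuition: each iteration of Lloyd's algorithm parallelizes naturally over the data, which is what makes it a MapReduce fit. A single-machine sketch of one iteration in plain Python (not the Hadoop API; a driver would re-run the job until the centroids converge):

```python
# One Lloyd's iteration expressed as map / shuffle / reduce (illustrative only).
from collections import defaultdict

def mapper(point, centroids):
    # Map: emit (nearest centroid id, point).
    nearest = min(range(len(centroids)),
                  key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return nearest, point

def reducer(cluster_id, points):
    # Reduce: average the assigned points to get the new centroid.
    n = len(points)
    return cluster_id, tuple(sum(dim) / n for dim in zip(*points))

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.8, 8.2)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

grouped = defaultdict(list)  # the shuffle phase
for key, value in (mapper(p, centroids) for p in points):
    grouped[key].append(value)
new_centroids = dict(reducer(k, v) for k, v in grouped.items())
print(new_centroids)
```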
You have a data file that contains two trillion records, one record per line (comma separated).
Each record lists two friends and a unique message sent between them. Their names will not have
commas.
Michael, John, Pabst, Blue Ribbon
Tiffany, James, BMX Racing
John, Michael, Natural Lemon Flavor
Analyze the pseudo code snippets below and determine which set of mappers and reducers
will solve for the mean number of messages each user sends to all of their friends.
For example, Michael may have three friends to whom he sends 6, 10, and 200 messages,
respectively, so Michael’s mean would be (6+10+200)/3. The solution may require a pipeline of
two MapReduce jobs.
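The two-job shape can be sketched in plain Python (not real Hadoop code; the fourth record is made up to show repeated pairs, and splitting on the first two commas works because names contain no commas while messages may):

```python
# Illustrative two-stage pipeline: job 1 counts messages per (sender, recipient)
# pair; job 2 averages those counts per sender over their distinct friends.
from collections import defaultdict

records = [
    "Michael, John, Pabst, Blue Ribbon",
    "Tiffany, James, BMX Racing",
    "John, Michael, Natural Lemon Flavor",
    "Michael, John, Hello again",          # made-up extra message
]

# Job 1 -- map each line to ((sender, recipient), 1); reduce by summing.
pair_counts = defaultdict(int)
for line in records:
    sender, recipient = [f.strip() for f in line.split(",")[:2]]
    pair_counts[(sender, recipient)] += 1

# Job 2 -- map each pair count to (sender, count); reduce by averaging.
per_sender = defaultdict(list)
for (sender, _), count in pair_counts.items():
    per_sender[sender].append(count)
means = {s: sum(c) / len(c) for s, c in per_sender.items()}
print(means)  # Michael sent 2 messages to 1 friend -> mean 2.0
```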
You have just run a MapReduce job to filter user messages to only those of a selected
geographical region. The output for this job is in a directory named westUsers, located just below
your home directory in HDFS. Which command gathers these records into a single file on your
local file system?
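For reference, assuming a standard Hadoop installation and that westUsers sits directly under the HDFS home directory, the `hadoop fs -getmerge` shell command concatenates a directory's part files into one local file:

```shell
# Merge the MapReduce output parts in HDFS into a single local file.
hadoop fs -getmerge westUsers westUsers.txt
```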
Which two functions are convex?
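A convex function satisfies f((x+y)/2) ≤ (f(x)+f(y))/2 for all x, y in its domain, which suggests a quick (non-rigorous) numeric spot check for candidates:

```python
# Randomized midpoint-convexity check; a counterexample proves non-convexity,
# while passing all trials only suggests (does not prove) convexity.
import random

def looks_convex(f, lo=-10.0, hi=10.0, trials=1000):
    random.seed(0)  # deterministic sampling
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        if f((x + y) / 2) > (f(x) + f(y)) / 2 + 1e-9:
            return False
    return True

print(looks_convex(lambda x: x * x))   # x^2 is convex
print(looks_convex(lambda x: x ** 3))  # x^3 is not convex on [-10, 10]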
You need to analyze 60,000,000 images stored in JPEG format, each of which is approximately 25
KB. Because your Hadoop cluster isn't optimized for storing and processing many small files, you
decide to do the following actions:
1. Group the individual images into a set of larger files
2. Use the set of larger files as input for a MapReduce job that processes them directly with
Python using Hadoop streaming
Which data serialization system gives you the flexibility to do this?
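For intuition on step 1, a container format packs many small binary records into one large splittable file. The toy length-prefixed container below illustrates only the packing idea; it is not the wire format of any real serialization system (in practice an Avro data file or a Hadoop SequenceFile plays this role):

```python
# Toy length-prefixed container: pack (filename, bytes) records into one blob
# and recover them losslessly. Illustrative only -- not Avro/SequenceFile format.
import io
import struct

def pack(records):
    buf = io.BytesIO()
    for name, payload in records:
        key = name.encode()
        buf.write(struct.pack(">I", len(key)) + key)
        buf.write(struct.pack(">I", len(payload)) + payload)
    return buf.getvalue()

def unpack(blob):
    buf, out = io.BytesIO(blob), []
    while True:
        header = buf.read(4)
        if not header:
            return out
        key = buf.read(struct.unpack(">I", header)[0]).decode()
        payload = buf.read(struct.unpack(">I", buf.read(4))[0])
        out.append((key, payload))

images = [("img_001.jpg", b"\xff\xd8fake-jpeg-1"),
          ("img_002.jpg", b"\xff\xd8fake-jpeg-2")]
assert unpack(pack(images)) == images
```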
You have user profile records in an OLTP database that you want to join with web server logs
which you have already ingested into HDFS. What is the best way to acquire the user profile data for
use in HDFS?
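For context, bulk OLTP-to-HDFS ingestion is the pattern Apache Sqoop was built for. A hypothetical invocation (the host, database, table, and path names are made up):

```shell
# Bulk-import an OLTP table into HDFS over JDBC.
sqoop import \
  --connect jdbc:mysql://dbhost/users \
  --table profiles \
  --target-dir /user/me/profiles
```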
You are building a system to perform outlier detection for a large online retailer. You need to build
a system to detect if the total dollar value of sales is outside the norm for each U.S. state, as
determined from the physical location of the buyer for each purchase.
The retailer’s data sources are scattered across multiple systems and databases, with little
coordination and few shared keys between them.
Below are the sources of data available to you. Determine which three will give you the smallest
set of data sources but still allow you to implement the outlier detector by state.
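Whatever sources are chosen, the detector itself can be very simple. A minimal per-state sketch, assuming a z-score rule with an illustrative 3σ threshold and made-up historical totals:

```python
# Flag a state's sales total as an outlier if it falls more than k standard
# deviations from that state's historical mean (toy data, illustrative rule).
from statistics import mean, stdev

history = {"CA": [100.0, 110.0, 95.0, 105.0],
           "NY": [50.0, 55.0, 45.0, 52.0]}

def is_outlier(state, total, k=3.0):
    m, s = mean(history[state]), stdev(history[state])
    return abs(total - m) > k * s

print(is_outlier("CA", 104.0))  # close to CA's historical mean
print(is_outlier("CA", 500.0))  # far outside it
```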
How can the naiveté of the naive Bayes classifier be advantageous?
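The "naiveté" is the conditional-independence assumption: the model reduces to per-class, per-word counts, so training is a single counting pass. A toy multinomial-style sketch with made-up documents and add-one smoothing:

```python
# Tiny naive Bayes: independence means each word contributes its own factor,
# so the whole model is just counts (cheap to train, trivially parallel).
import math
from collections import Counter, defaultdict

docs = [("spam", ["buy", "pills", "now"]), ("spam", ["buy", "now"]),
        ("ham", ["meeting", "now"]), ("ham", ["project", "meeting"])]

class_counts = Counter(label for label, _ in docs)
word_counts = defaultdict(Counter)
vocab = set()
for label, words in docs:
    word_counts[label].update(words)
    vocab.update(words)

def predict(words):
    def log_prob(label):
        total = sum(word_counts[label].values())
        lp = math.log(class_counts[label] / len(docs))
        for w in words:  # independence: per-word log factors just add
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        return lp
    return max(class_counts, key=log_prob)

print(predict(["buy", "pills"]))
print(predict(["project", "now"]))
```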
What are two defining features of RMSE (root-mean square error or root-mean-square deviation)?
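As a refresher, RMSE is the square root of the mean squared error, so it is expressed in the same units as the target and, because errors are squared before averaging, large misses dominate the score:

```python
# RMSE = sqrt(mean((actual - predicted)^2)); squaring makes one big error
# weigh far more than several small ones of the same total magnitude.
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [3.0, 3.0, 3.0, 3.0]
print(rmse(actual, [2.0, 4.0, 2.0, 4.0]))  # every error is 1 -> 1.0
print(rmse(actual, [3.0, 3.0, 3.0, 7.0]))  # one error of 4  -> 2.0
```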