Which type of join would you use for this table?
You have two tables of customers in your database. Customers in cust_table_1 were sent an email promotion last year, and customers in cust_table_2 received a newsletter last year.
Customers can only be entered in once per table. You want to create a table that includes all
customers, and any of the communications they received last year. Which type of join would you
use for this table?
which lifecycle stage are initial hypotheses formed?
In which lifecycle stage are initial hypotheses formed?
How should you proceed?
You are given 10, 000, 000 user profile pages of an online dating site in XML files, and they are
stored in HDFS. You are assigned to divide the users into groups based on the content of their
profiles. You have been instructed to try K-means clustering on this data. How should you
proceed?
What is the next step you should take?
The Marketing department of your company wishes to track opinion on a new product that was
recently introduced. Marketing would like to know how many positive and negative reviews are
appearing over a given period and potentially retrieve each review for more in-depth insight.
They have identified several popular product review blogs that historically have published
thousands of user reviews of your company’s products.
You have been asked to provide the desired analysis. You examine the RSS feeds for each blog
and determine which fields are relevant. You then craft a regular expression to match your new
product’s name and extract the relevant text from each matching review.
What is the next step you should take?
Which word or phrase completes the statement?
Which word or phrase completes the statement? A Data Scientist would consider that a RDBMS is
to a Table as R is to a ______________ .
Which word or phrase completes the statement?
Which word or phrase completes the statement? Unix is to bash as Hadoop is to:
how does the call volume change over time?
A call center for a large electronics company handles an average of 35, 000 support calls a day.
The head of the call center would like to optimize the staffing of the call center during the rollout of
a new product due to recent customer complaints of long wait times. You have been asked to
create a model to optimize call center costs and customer wait times.
The goals for this project include:
1. Relative to the release of a product, how does the call volume change over time?
2. How to best optimize staffing based on the call volume for the newly released product, relative
to old products.
3. Historically, what time of day does the call center need to be most heavily staffed?
4. Determine the frequency of calls by both product type and customer language.
Which goals are suitable to be completed with MapReduce?
What will be your approach for loading data into the analytical sandbox for this analysis?
Consider the example of an analysis for fraud detection on credit card usage. You will need to
ensure higher-risk transactions that may indicate fraudulent credit card activity are retained in your
data for analysis, and not dropped as outliers during pre-processing. What will be your approach
for loading data into the analytical sandbox for this analysis?
What is another component?
Trend, seasonal, and cyclical are components of a time series. What is another component?
Which algorithm is the most appropriate for this study?
You are studying the behavior of a population, and you are provided with multidimensional data at
the individual level. You have identified four specific individuals who are valuable to your study,
and would like to find all users who are most similar to each individual. Which algorithm is the
most appropriate for this study?