How should you proceed?
You are given 10, 000, 000 user profile pages of an online dating site in XML files, and they are stored in HDFS. You are assigned to divide the users into groups based on the content of their profiles. You have been instructed to try K-means clustering on this data. How should you proceed?
In which lifecycle stage are initial hypotheses formed?
In which lifecycle stage are initial hypotheses formed?
Which type of join would you use for this table?
You have two tables of customers in your database. Customers in cust_table_1 were sent an e-mail promotion last year, and customers in cust_table_2 received a newsletter last year.
Customers can only be entered in once per table. You want to create a table that includes all customers, and any of the communications they received last year. Which type of join would you use for this table?
what is a way that you could try to increase the R2 of the model without artificially inflating it?
Refer to exhibit.
You are asked to write a report on how specific variables impact your clients sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only.
After a preliminary analysis of the data, the following findings were made:
1. Multicollinearity is not an issue among the variables
2. Only three variablesA, B, and Chave significant correlation with sales
You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit.
You cannot request additional datA. what is a way that you could try to increase the R2 of the model without artificially inflating it?
What is the sum of the probabilities that the model assigns to all the filers in your training set that have b
You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited?
In which phase of the analytic lifecycle would you expect to spend most of the project time?
In which phase of the analytic lifecycle would you expect to spend most of the project time?
Since R factors are categorical variables, they are most closely related to which data classification level?
Since R factors are categorical variables, they are most closely related to which data classification level?
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?
For which class of problem is MapReduce most suitable?
For which class of problem is MapReduce most suitable?
What does R code nv <- v[v < 1000] do?
What does R code nv <- v[v < 1000] do?