You are given 10,000,000 user profile pages from an online dating site as XML files stored in
HDFS. You are assigned to divide the users into groups based on the content of their
profiles. You have been instructed to try K-means clustering on this data. How should you
proceed?
A. Run MapReduce to transform the data, and find relevant key-value pairs.
B. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop
iteratively.
C. Run a Naive Bayes classification as a pre-processing step in HDFS.
D. Partition the data by XML file size, and run K-means clustering in each partition.
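
For background: K-means works on numeric feature vectors, so raw XML profiles have to be turned into such vectors at some point before clustering. The sketch below shows what a mapper-style transformation of one profile into a key-value pair might look like. The XML schema (`<profile>`, `<age>`, `<interests>`), the `VOCAB` list, and the function name `map_profile` are all hypothetical, chosen only for illustration; the question does not specify the actual profile format.

```python
import xml.etree.ElementTree as ET

# Assumed (hypothetical) fixed vocabulary of interests used as binary features.
VOCAB = ["hiking", "music", "cooking"]


def map_profile(xml_text):
    """Mapper-style step: one XML profile in, one (user_id, features) pair out."""
    root = ET.fromstring(xml_text)
    user_id = root.get("id")
    # Missing fields fall back to defaults so one bad profile does not fail the job.
    age = float(root.findtext("age", default="0"))
    raw_interests = root.findtext("interests", default="")
    interests = {s.strip().lower() for s in raw_interests.split(",")}
    # Feature vector: age followed by one 0/1 flag per vocabulary word.
    features = [age] + [1.0 if word in interests else 0.0 for word in VOCAB]
    return user_id, features


# Hypothetical sample profile in the assumed schema.
sample = '<profile id="u1"><age>31</age><interests>Hiking, music</interests></profile>'
key, value = map_profile(sample)
```

Here `key` identifies the user and `value` is the numeric vector a clustering step could consume. In practice the features would usually be scaled first, since K-means is distance-based and an unscaled field such as age would dominate the 0/1 flags.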