Question 1
You are provided with a dataset, “NewsArticles.json”, containing unlabeled news articles on mixed topics: business, entertainment, politics, sports, and technology.
Load the dataset: res/NewsArticles.json
You are required to build a clustering-based model.
Carry out the following tasks:
Perform K-Means clustering on the above dataset and report the Sum of Squared Errors (SSE).
Use PCA to reduce the dataset to about 100 dimensions, then perform K-Means clustering on the reduced dataset and report the Sum of Squared Errors (SSE).
Find the cluster with the highest count (before PCA), and state that count.
Find the cluster with the highest count (after PCA), and state that count.
Extract the top 50 words from each cluster in both cases, and print the last (50th) word from the cluster that you judge to contain entertainment news (before PCA).
Extract the top 50 words from each cluster in both cases, and print the last (50th) word from the third cluster (after PCA).
Hint: In both cases, use 5 clusters and compute the within-cluster Sum of Squared Errors.
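The clustering tasks above can be sketched as follows. This is a minimal sketch, not the reference solution. It assumes scikit-learn and NumPy are available, that the articles have already been read into a list of strings called `texts` (the JSON field names are not given here, so loading is left out), and that TF-IDF is an acceptable way to turn the text into numbers. The function names `cluster_and_score` and `run_both` are hypothetical. The `inertia_` attribute of a fitted `KMeans` model is the within-cluster sum of squared distances, which is used here as the SSE.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_and_score(features, n_clusters=5, seed=0):
    """Fit K-Means; return (labels, SSE, largest cluster id, its size)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    # Number of articles assigned to each cluster id
    counts = np.bincount(labels, minlength=n_clusters)
    largest = int(counts.argmax())
    # inertia_ = sum of squared distances of samples to their nearest centroid
    return labels, float(model.inertia_), largest, int(counts[largest])


def run_both(texts, n_clusters=5, n_components=100, seed=0):
    """Cluster the raw TF-IDF features, then the PCA-reduced features."""
    # Assumption: plain TF-IDF, with no stop-word removal or stemming
    tfidf = TfidfVectorizer().fit_transform(texts)
    before = cluster_and_score(tfidf, n_clusters, seed)

    # PCA is applied to a dense copy; this can be memory-hungry on large data
    dense = tfidf.toarray()
    # PCA cannot keep more components than min(n_samples, n_features)
    k = min(n_components, min(dense.shape))
    reduced = PCA(n_components=k, random_state=seed).fit_transform(dense)
    after = cluster_and_score(reduced, n_clusters, seed)
    return before, after
```

Each returned tuple holds the labels, the SSE, the id of the largest cluster, and that cluster's size. K-Means starts from random centroids, so the exact values depend on `seed`.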
NOTE:
1. Do not use any NLP techniques for cleansing or preprocessing.
2. Write the code only in the solution() function and do not pass any arguments to it. For the predefined stub, refer to stub.py.
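For the top-50-words tasks, one possible reading that stays within note 1 is to count raw word frequencies per cluster with no cleansing at all. The sketch below takes that reading as an assumption: "top" is interpreted as "most frequent", words are obtained by a plain whitespace split, and the helper names `top_words_per_cluster` and `nth_word` are hypothetical. It works with any list of cluster labels, whether they come from the run before or after PCA.

```python
from collections import Counter, defaultdict


def top_words_per_cluster(texts, labels, top_n=50):
    """Return {cluster id: list of the top_n most frequent words}."""
    counters = defaultdict(Counter)
    for text, label in zip(texts, labels):
        # Plain whitespace split: no stop-word removal, stemming, or cleanup
        counters[int(label)].update(text.split())
    return {
        cluster_id: [word for word, _ in counter.most_common(top_n)]
        for cluster_id, counter in sorted(counters.items())
    }


def nth_word(words, n=50):
    """Return the n-th ranked word, or None if fewer than n words exist."""
    return words[n - 1] if len(words) >= n else None
```

Cluster ids from K-Means are arbitrary, so treating "the third cluster" as label 2 (zero-based) is itself an assumption, and picking the entertainment cluster still requires reading the word lists by eye.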
Output Format:
Perform the above operations and write your output to a file named output.csv, located at output/output.csv.
output.csv should contain the answer to each question, one answer per row, in the order the questions are asked.
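Writing the answers one per row can be sketched with the standard `csv` module. The helper name `write_answers` is hypothetical, and the single-column layout is an assumption based on the "consecutive rows" wording.

```python
import csv
import os


def write_answers(answers, path="output/output.csv"):
    """Write each answer on its own row of a one-column CSV file."""
    folder = os.path.dirname(path)
    if folder:
        # Create the output directory if it does not exist yet
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for answer in answers:
            writer.writerow([answer])
```

Inside solution(), the answers would be collected in question order and passed to this helper as a list.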