
Implementing TF-IDFVectorizer | Sample assignment



In this assignment, we will use the GENIA POS corpus from Assignment A04. You can reuse the Python function load_genia_pos to load it.


  1. (4 points) Build a TF-IDF weighted document-term matrix from all the documents in the corpus. The terms in this matrix should include both lowercased unigrams and bigrams. Only tokens that are not stop words and that contain at least one alphabetic letter should be included. For example, if a document contains the sentences "I am Sam. I want 2 apples.", the term index should include "sam", "want", "apples", "want apples" and no other items. You can use sklearn.feature_extraction.text.TfidfVectorizer from the scikit-learn library to build this matrix. Note that the input documents are already tokenized; do not re-tokenize them. Read the documentation here carefully for the parameters you can adjust.
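One way to feed pre-tokenized input to TfidfVectorizer is to pass a callable as its analyzer parameter, so that scikit-learn skips its own tokenization. The sketch below is only an illustration on the "I am Sam" example, not a reference solution. It assumes that each document is a list of sentences and each sentence is a list of tokens (the actual structure returned by load_genia_pos may differ), that bigrams are formed within a sentence after filtering, and that scikit-learn's built-in English stop word list is acceptable. The helper names keep and unigrams_and_bigrams are hypothetical.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def keep(token):
    """True for a lowercased token that is not a stop word and has a letter."""
    return token not in ENGLISH_STOP_WORDS and any(ch.isalpha() for ch in token)


def unigrams_and_bigrams(document):
    """Turn one pre-tokenized document into its list of terms.

    Assumption: `document` is a list of sentences, each a list of tokens.
    Bigrams are built per sentence from the tokens that survive filtering.
    """
    terms = []
    for sentence in document:
        tokens = [t.lower() for t in sentence]
        tokens = [t for t in tokens if keep(t)]
        terms.extend(tokens)
        terms.extend(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return terms


# Toy corpus: one document with two already-tokenized sentences.
docs = [
    [["I", "am", "Sam", "."], ["I", "want", "2", "apples", "."]],
]

# A callable analyzer receives each document unchanged, so nothing is
# re-tokenized; the callable is responsible for producing the n-grams itself.
vectorizer = TfidfVectorizer(analyzer=unigrams_and_bigrams)
doc_term = vectorizer.fit_transform(docs)  # scipy sparse matrix

print(sorted(vectorizer.vocabulary_))
print(doc_term.shape)
```

With a callable analyzer, options such as ngram_range, stop_words and lowercase are not applied by the vectorizer, which is why the filtering and the bigram construction happen inside the callable here.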

  2. (a) (3 points) Using the dot product as the similarity measure, find the 10 documents most similar to the query string "immune response" in the document-term matrix from Question 1, ordered by similarity. Use a binary vector of unigrams to represent the query string. You can use sklearn.feature_extraction.text.CountVectorizer to construct the query vector. Read the documentation here. Also note that the matrices returned by CountVectorizer and TfidfVectorizer are scipy sparse matrices. When you compute dot products, use scipy's dot method, as numpy may not be aware of sparse matrices. See scipy's note for details.

(b) (3 points) Repeat the process from part (a), but use both unigrams and bigrams together as the binary vector representation of the query string. Which results do you prefer? Explain your reasoning.
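The query step in parts (a) and (b) can be sketched as follows. This is a minimal illustration on a made-up four-document corpus, not the GENIA data and not a reference solution. It assumes the document-term matrix was built with a callable analyzer that joins bigrams with a single space, and it reuses the fitted vocabulary so that the query vector has the same columns as the matrix. The names terms and top_k are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def terms(tokens):
    """Lowercased unigrams plus bigrams of one pre-tokenized document."""
    tokens = [t.lower() for t in tokens]
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Toy stand-in for the corpus: already tokenized and filtered documents.
docs = [
    ["immune", "response", "T", "cells"],
    ["response", "elements", "gene", "promoter"],
    ["immune", "cells", "cytokine", "response"],
    ["protein", "kinase", "activation"],
]

tfidf = TfidfVectorizer(analyzer=terms)
doc_term = tfidf.fit_transform(docs)  # sparse, shape (n_docs, n_terms)


def top_k(query, ngram_range, k=10):
    """Rank documents by dot product with a binary query vector."""
    query_vectorizer = CountVectorizer(
        binary=True,                   # 1 if the term occurs, else 0
        ngram_range=ngram_range,       # (1, 1) for part (a), (1, 2) for part (b)
        vocabulary=tfidf.vocabulary_,  # same column order as doc_term
    )
    q = query_vectorizer.transform([query])       # sparse, shape (1, n_terms)
    scores = doc_term.dot(q.T).toarray().ravel()  # sparse dot product
    order = np.argsort(-scores, kind="stable")[:k]
    # Documents with a zero score share no term with the query; drop them.
    return [(int(i), float(scores[i])) for i in order if scores[i] > 0]


print(top_k("immune response", (1, 1)))  # unigram query
print(top_k("immune response", (1, 2)))  # unigram + bigram query
```

On this toy corpus the unigram query cannot separate the two documents that contain both words, while the query with bigrams gives extra weight to the document in which "immune response" occurs as an adjacent pair. Whether that behavior is preferable on the real corpus is the judgment part (b) asks for.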

Get a solution to this project at an affordable price. Please send your query to contact@codersarts.com and we'll share a price quote.

How to contact us for assignment help:

  1. Via email: Send your complete requirement files directly to contact@codersarts.com. Our email team will follow up there to discuss the details, such as deadline, budget, programming language, payment, and a meeting with an expert if needed.

  2. Website live chat: Chat with our live chat assistants about basic queries and doubts, or for more information.

  3. Contact form: Fill out the contact form with complete details, and we'll review it and get back to you via email.

Codersarts Dashboard: Register at the Codersarts Dashboard to track order progress.


