What is Entity Resolution?
Entity resolution is the problem of determining which records in a database refer to the same real-world entity; it is an expensive and important step in the data mining process. In this article we discuss how to use Apache Spark to apply powerful, scalable text-analysis techniques and perform entity resolution across two datasets.
Put differently, entity resolution is the process of joining records from one data source to records in another that describe the same entity.
We perform the text analysis and entity resolution steps using Python and PySpark. We work with three files: Amazon.csv, Google.csv, and stopwords.txt. The Amazon dataset contains the fields id, title, description, manufacturer, and price. The Google dataset contains id, name, description, manufacturer, and price. The stop-words file contains common English words.
First, we read in each of the data files, "Google.csv" and "Amazon.csv", and create an RDD consisting of lines. The first column of Google.csv holds URLs, and the first column of Amazon.csv holds alphanumeric strings; for simplicity we call both of them the ID, and we parse the ID out of each row. We then load the data into a pair RDD in which the ID is the key and the remaining fields make up the value.
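The loading step can be sketched as follows. The function name `parse_line` and the quoting pattern are assumptions about the CSV layout, and `sc` stands for an already-created SparkContext, so the Spark calls are shown as comments:

```python
import re

def parse_line(line):
    """Split one CSV line into an (ID, rest-of-record) pair.

    Assumes (hypothetically) that the first field is quoted and followed
    by a comma, e.g. "some-id","rest of the record". Returns None for
    lines that do not match, so they can be filtered out.
    """
    match = re.match(r'^"(.*?)",(.*)$', line)
    return (match.group(1), match.group(2)) if match else None

# With an active SparkContext `sc`, the pair RDDs would then be built as:
#   google = sc.textFile('Google.csv').map(parse_line).filter(lambda kv: kv)
#   amazon = sc.textFile('Amazon.csv').map(parse_line).filter(lambda kv: kv)
```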
Bag-of-words is a conceptually simple approach to text analysis: we treat each document as an unordered collection of words. We will construct some components for bag-of-words analysis on the description fields of both datasets.
A bag of words is a representation of text that describes the occurrence of words within a document. It involves a vocabulary of known words and a measure of the presence of those words.
Implement a function that takes a string and returns the non-empty tokens produced by splitting it with a regular expression.
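A minimal version of such a tokenizer (the function name `simple_tokenize` is ours, not from any library):

```python
import re

def simple_tokenize(string):
    """Lower-case the input and split it on runs of non-alphanumeric
    characters, dropping any empty tokens the split produces."""
    return [token for token in re.split(r'\W+', string.lower()) if token]

# simple_tokenize('A quick, brown Fox!') yields ['a', 'quick', 'brown', 'fox']
```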
Stop words are common (English) words that do not contribute much to the content or meaning of a document (e.g., "the", "a", "is", "to"). Because they add noise to bag-of-words comparisons, they are usually excluded. We load the set of stop words from the included file "stopwords.txt".
Now implement an improved tokenizer that filters out stop words, so that none of them appear in the returned token list.
Next, we apply this improved tokenizer to the records of both datasets.
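One way to sketch the improved tokenizer and its application, assuming the stop words have already been read into a Python set (the names below are illustrative, and the Spark calls assume the pair RDDs from the loading step):

```python
import re

def tokenize(string, stopwords):
    """Tokenize a string as before, but drop any token found in `stopwords`."""
    tokens = [t for t in re.split(r'\W+', string.lower()) if t]
    return [t for t in tokens if t not in stopwords]

stopwords = {'the', 'a', 'is', 'to'}  # in practice: loaded from stopwords.txt
# tokenize('the price is right', stopwords) -> ['price', 'right']

# Applied to the pair RDDs built earlier (names assumed):
#   amazon_tokens = amazon.map(lambda kv: (kv[0], tokenize(kv[1], stopwords)))
#   google_tokens = google.map(lambda kv: (kv[0], tokenize(kv[1], stopwords)))
```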
TF - TF rewards tokens that appear many times in the same document. It is computed as the frequency of a token in a document: if a document "doc" contains 100 tokens and the token "w" appears in "doc" 5 times, then the TF weight of "w" in "doc" is 5/100 = 1/20. The intuition is that the more often a word occurs in a document, the more important it is to the document's meaning.
Implement a function that takes a list of tokens and returns a dictionary mapping each token to its term frequency.
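A straightforward sketch of that function (the name `tf` is ours):

```python
def tf(tokens):
    """Map each token to its term frequency: count / total number of tokens."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    total = float(len(tokens))
    return {token: count / total for token, count in counts.items()}

# tf(['one', 'two', 'two', 'three']) -> {'one': 0.25, 'two': 0.5, 'three': 0.25}
```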
Now combine the Amazon and Google datasets to create a corpus. Each element of the corpus is a <key, value> pair, where the key is an ID and the value is the list of tokens associated with that record in the combined datasets.
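In Spark this combination is a single `union`; a pure-Python sketch of the resulting shape, with illustrative IDs and tokens (the RDD names are assumptions carried over from the earlier steps):

```python
# With the tokenized pair RDDs from the previous step, the corpus is simply:
#   corpus = amazon_tokens.union(google_tokens)

# Locally, the same shape can be illustrated with plain dictionaries:
amazon_tokens = {'b001': ['clickart', 'image', 'pack']}        # ID -> tokens
google_tokens = {'http://g/1': ['image', 'editing', 'software']}

corpus = dict(amazon_tokens)
corpus.update(google_tokens)   # each element: ID -> token list
```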
IDF - IDF rewards tokens that are rare overall in a dataset. The intuition is that two documents sharing a rare word is more significant than their sharing a common one. The IDF weight of a token w over a set of documents U is computed as follows:
Let N be the total number of documents in U.
Determine n(w), the number of documents in U that contain w.
Then n(w)/N is the document frequency of w in U, and IDF(w) = N/n(w) is its inverse document frequency.
Build an IDF function that returns a pair RDD in which each key is a unique token and the value is its IDF weight. Then plot a histogram of the IDF values.
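A local sketch of the IDF computation over a dict-shaped corpus; the Spark version would instead count distinct tokens per document with `flatMap` and `reduceByKey`, and the histogram call assumes matplotlib is available:

```python
def idfs(corpus):
    """Return token -> N / n(token), where corpus maps doc ID -> token list."""
    N = len(corpus)
    doc_counts = {}                      # n(w): number of documents containing w
    for tokens in corpus.values():
        for token in set(tokens):        # count each document at most once
            doc_counts[token] = doc_counts.get(token, 0) + 1
    return {token: N / float(n) for token, n in doc_counts.items()}

corpus = {'doc1': ['price', 'image'], 'doc2': ['image']}   # illustrative
weights = idfs(corpus)                   # {'price': 2.0, 'image': 1.0}

# To plot the histogram of IDF values (assuming matplotlib is installed):
#   import matplotlib.pyplot as plt
#   plt.hist(list(weights.values()), bins=50)
#   plt.show()
```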
Histogram of IDF values.
Build a function that does the following on the entire corpus:
(a) Calculate the term frequency of each token
(b) Create a Python dictionary in which each token maps to the product of its term frequency and its IDF weight
TF-IDF - The total TF-IDF weight of a token in a document is the product of its TF weight and its IDF weight.
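Putting the two pieces together for a single document (the function names are ours, and `idfs_dict` is assumed to be the token-to-IDF mapping produced in the previous step):

```python
def tf(tokens):
    """Token -> term frequency within this document."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    total = float(len(tokens))
    return {token: count / total for token, count in counts.items()}

def tfidf(tokens, idfs_dict):
    """Token -> TF weight times IDF weight for one document."""
    tfs = tf(tokens)
    return {token: tfs[token] * idfs_dict.get(token, 0.0) for token in tfs}

# tfidf(['image', 'image', 'price', 'pack'],
#       {'image': 1.0, 'price': 2.0, 'pack': 4.0})
# -> {'image': 0.5, 'price': 0.5, 'pack': 1.0}
```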