This is part 4 of our series "Natural Language Processing". In the previous blog we learned about text analysis using NLP. In this blog we will learn about the language identifier, the third topic of this series, so get ready to learn language identification with NLP.
Before you start, I suggest you go through the previous parts of this series, since they provide helpful background for this post.
What is NLP?
It is the branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner.
First, install the NLP-related libraries used in this series:
nltk, numpy, matplotlib, tweepy, TwitterSearch, unidecode, langdetect, langid, gensim
Then import them. The comment next to each import links to the library's installation page:
import nltk # https://www.nltk.org/install.html
import numpy # https://www.scipy.org/install.html
import matplotlib.pyplot # https://matplotlib.org/downloads.html
import tweepy # https://github.com/tweepy/tweepy
import TwitterSearch # https://github.com/ckoepp/TwitterSearch
import unidecode # https://pypi.python.org/pypi/Unidecode
import langdetect # https://pypi.python.org/pypi/langdetect
import langid # https://github.com/saffsd/langid.py
List of topics we will cover in this series:
Text-analysis using NLTK library
Detecting text language
Stemming and Lemmatization using Bigrams
Finding unusual words
Part of speech and meaning
Classify document into categories
Sentiment Analysis with NLTK
Work with Twitter streaming and Cleaning
Now let's start the topic: Language identifier
What is a language identifier?
It is the process of identifying the language of a piece of text, for example English, French, or German. For this we need a good language identification algorithm.
In this post we will work with NLTK, a well-known NLP package, and identify the language using a bigram model.
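To make the idea concrete, here is a rough, self-contained sketch of one way a bigram model can identify a language. It is an illustration under stated assumptions, not the exact code of this series: it uses character bigrams, tiny hand-written sample sentences instead of a real corpus, and a simple overlap score. The names char_bigrams, build_profile, identify and SAMPLES are hypothetical.

```python
from collections import Counter

# Tiny hand-written samples (an assumption for this sketch);
# a real identifier would be trained on much larger corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the house",
    "french": "le renard brun saute par dessus le chien paresseux "
              "et puis le chien dort dans la maison",
    "german": "der schnelle braune fuchs springt hinter den faulen hund "
              "und dann geht der hund in das haus",
}

def char_bigrams(text):
    """Return the character bigrams of a text (letters and spaces only)."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]

def build_profile(sample):
    """Map each bigram to its relative frequency in the sample."""
    counts = Counter(char_bigrams(sample))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}

def identify(text, profiles):
    """Pick the language whose profile gives the text the highest score."""
    text_bigrams = char_bigrams(text)
    scores = {
        lang: sum(profile.get(bigram, 0.0) for bigram in text_bigrams)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)

profiles = {lang: build_profile(sample) for lang, sample in SAMPLES.items()}
print(identify("the dog sleeps in the house", profiles))
```

With samples this small the sketch only works reliably on text that resembles the samples; the steps below build up the pieces (tokenizing, n-grams, frequencies, ordering) one at a time.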
Step 1:
First we need to tokenize the text, as in the previous post. In this post we will tokenize it with our own function, which is another way to tokenize.
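A minimal sketch of such a tokenizing function, assuming a simple regular expression is enough for the text at hand. The name tokenize and the pattern are illustrative choices, not the original code of this series.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens made of letters only."""
    # [^\W\d_]+ matches runs of letters (including accented ones),
    # so digits, underscores, and punctuation act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())

tokens = tokenize("Hello World, hello NLP!")
print(tokens)
```

Lowercasing before matching means "Hello" and "hello" are counted as the same token in the later frequency steps.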
Now we will build unigrams and bigrams and compute their frequencies, depending on which you select.
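One way to build both from the same token list is a small helper that takes the n-gram size as a parameter. This is a sketch in plain Python; make_ngrams is a hypothetical name. NLTK ships comparable helpers (nltk.bigrams and nltk.ngrams), which can be used instead when NLTK is installed.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["to", "be", "or", "not", "to", "be"]
unigrams = make_ngrams(tokens, 1)  # n = 1: single tokens
bigrams = make_ngrams(tokens, 2)   # n = 2: pairs of adjacent tokens
print(unigrams)
print(bigrams)
```

If the token list is shorter than n, the helper returns an empty list.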
Here we count how often each unigram and bigram repeats, using get() and FreqDist().
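The counting step can be sketched in two ways: by hand with a dictionary and get(), or with a ready-made counter. NLTK's FreqDist is built on Python's collections.Counter, so Counter is used here as a stand-in that runs without NLTK installed; the function name count_with_get is hypothetical.

```python
from collections import Counter

def count_with_get(items):
    """Count repetitions using dict.get() with a default of 0."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

bigrams = [("to", "be"), ("be", "or"), ("or", "not"),
           ("not", "to"), ("to", "be")]

manual_counts = count_with_get(bigrams)  # plain dictionary
counter_counts = Counter(bigrams)        # stand-in for nltk.FreqDist

print(manual_counts[("to", "be")])
print(counter_counts[("to", "be")])
```

Both approaches give the same counts; the ready-made counter simply saves writing the loop.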
Finally, arrange the results in order (for example, descending order of frequency).
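A short sketch of the ordering step, assuming the frequencies are held in a Counter (NLTK's FreqDist inherits the same most_common() method). The example counts are made up for illustration.

```python
from collections import Counter

# Example counts (illustrative data, not results from a real corpus).
bigram_counts = Counter({
    ("to", "be"): 2,
    ("be", "or"): 1,
    ("or", "not"): 1,
    ("not", "to"): 1,
})

# Sort by count, highest first; ties are broken alphabetically
# so the order is the same on every run.
ranked = sorted(bigram_counts.items(), key=lambda item: (-item[1], item[0]))
for bigram, count in ranked:
    print(bigram, count)

# most_common(k) returns the k most frequent entries directly.
top = bigram_counts.most_common(1)
```

The most frequent bigrams end up at the front of the list, which is the part a language profile usually cares about.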
If you like the Codersarts blog and are looking for assignment help, project help, programming tutor help, or suggestions, you can send mail to firstname.lastname@example.org.
Please write your suggestions in the comment section below if you find anything incorrect in this blog post.