Natural Language Processing (NLP) in Python with spaCy


spaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython.


spaCy is designed to help you do real work:

  • Build real products

  • Gather real insights


The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive.



Features of spaCy


  • Non-destructive tokenization

  • Named entity recognition

  • Support for 52+ languages

  • 19 statistical models for 9 languages

  • Pre-trained word vectors

  • State-of-the-art speed

  • Easy deep learning integration

  • Part-of-speech tagging

  • Labelled dependency parsing

  • Syntax-driven sentence segmentation

  • Built-in visualizers for syntax and NER

  • Convenient string-to-hash mapping

  • Export to numpy data arrays

  • Efficient binary serialization

  • Easy model packaging and deployment

  • Robust, rigorously evaluated accuracy



Getting started


Install spaCy


$ pip install spacy

Statistical models


Download statistical models


Statistical models predict part-of-speech tags, dependency labels, named entities and more. See the spaCy models documentation for the available models.


Download en_core_web_sm


$ python -m spacy download en_core_web_sm

Import and load


import spacy
nlp = spacy.load("en_core_web_sm")

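If spacy.load("en_core_web_sm") fails with OSError: [E050] Can't find model 'en_core_web_sm', the model is most likely not installed in the Python environment you are running.


On spaCy v2 you can try the "en" shortcut instead. Shortcut names were removed in spaCy v3, so on v3 re-run the download command above with the same interpreter that runs your code.


$ python -m spacy download en

import spacy
nlp = spacy.load("en")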
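
The missing-model error can also be handled in code. The following is a minimal sketch, not an official recipe: load_pipeline is a hypothetical helper name, and falling back to a blank English pipeline is an assumption made here so that a script can still tokenize text when no trained model is installed.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English one.

    load_pipeline is a hypothetical helper for illustration, not a spaCy API.
    """
    try:
        return spacy.load(name)
    except OSError:
        # The model package is not installed in this environment.
        print(f"Model '{name}' not found; try: python -m spacy download {name}")
        # A blank pipeline still tokenizes, but has no tagger, parser or NER.
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.pipe_names)  # empty list if the fallback was used
```

With the fallback, attributes such as token.pos_ and doc.ents stay empty, so this only suits code that needs tokens and nothing else.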


Documents, tokens and spans


Processing text


Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.


doc = nlp("This is a text")
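
When there are several texts to process, nlp.pipe yields one Doc per text. The sketch below assumes a blank English pipeline (tokenizer only) so that it runs without a model download; the second example sentence is made up for illustration.

```python
import spacy

# Tokenizer-only pipeline; use spacy.load("en_core_web_sm") for full annotations
nlp = spacy.blank("en")

texts = ["This is a text", "Here is another one."]

# nlp.pipe yields one Doc per input text, in the same order
docs = list(nlp.pipe(texts))

token_counts = [len(doc) for doc in docs]
print(token_counts)  # [4, 5]
```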

A Doc is a sequence of Token objects. From a Doc you can access sentences and named entities, export annotations to numpy arrays, and losslessly serialize to compressed binary strings.
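
As a rough illustration of the numpy export and binary serialization mentioned above, the sketch below uses a blank English pipeline. The choice of the LOWER and IS_ALPHA attributes is arbitrary.

```python
import spacy
from spacy.attrs import IS_ALPHA, LOWER
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only, no model download needed
doc = nlp("This is a text")

# Export to numpy: one row per token, one column per requested attribute
arr = doc.to_array([LOWER, IS_ALPHA])
print(arr.shape)  # (4, 2)

# Serialize to bytes, then restore into a new Doc that shares the vocab
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)
print(restored.text)  # This is a text
```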



Accessing token attributes


doc = nlp("This is a text")
# Token texts
tokens = [token.text for token in doc]
print(tokens)

Output:

['This', 'is', 'a', 'text']
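
Tokens carry more than their text. The sketch below prints a few lexical attributes that need no statistical model, and checks that the original string can be rebuilt from the tokens, which is what "non-destructive tokenization" refers to. The example sentence is an assumption chosen to include a number and punctuation.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only
text = "I paid 20 dollars."
doc = nlp(text)

for token in doc:
    # i: token index, idx: character offset of the token in the original text
    print(token.i, token.text, token.idx, token.is_alpha, token.like_num, token.is_punct)

# Each token remembers its trailing whitespace, so the text can be rebuilt exactly
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)  # True
```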


Spans


A slice from a Doc object.


Accessing spans


doc = nlp("This is a text")
span = doc[2:4]
span.text

Output: 'a text'
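
A span also knows where it sits in the document, and a Span can be built directly with a label. In this sketch "EXAMPLE" is a made-up label used only for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("This is a text")

span = doc[2:4]
print(span.start, span.end)            # token offsets: 2 4
print(span.start_char, span.end_char)  # character offsets: 8 14

# Build the same span explicitly and attach a label ("EXAMPLE" is hypothetical)
labeled = Span(doc, 2, 4, label="EXAMPLE")
print(labeled.text, labeled.label_)    # a text EXAMPLE
```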


Linguistic features


Attributes without a trailing underscore, such as token.pos, return integer label IDs. For the string labels, use the same attribute name with an underscore, for example token.pos_.
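
The ID/string pairing can be seen without a trained model by looking at token.orth and token.orth_, which hold the token text as a hash and as a string. This is a sketch on a blank pipeline; pos/pos_, tag/tag_ and dep/dep_ follow the same pattern once a trained pipeline has filled them in.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("This is a text")
token = doc[3]

print(token.orth_)  # string form: text
print(token.orth)   # integer ID (a hash of the string)

# The vocab's string store maps in both directions
text_id = nlp.vocab.strings["text"]
print(text_id == token.orth)       # True
print(nlp.vocab.strings[text_id])  # text
```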


Part-of-speech tags (predicted by statistical model)


doc = nlp("This is a text.")

# Coarse-grained part-of-speech tags
[token.pos_ for token in doc] 

Output: ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']

# Fine-grained part-of-speech tags
[token.tag_ for token in doc]

Output: ['DT', 'VBZ', 'DT', 'NN', '.']

Exact tags depend on the model version; newer English models may tag "is" as AUX rather than VERB.


Syntactic dependencies


doc = nlp("This is a text.")
 
# Dependency labels 
[token.dep_ for token in doc] 

# ['nsubj', 'ROOT', 'det', 'attr', 'punct'] 

# Syntactic head token (governor) 
[token.head.text for token in doc]
 
# ['is', 'is', 'text', 'is', 'is']
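
Heads and dependency labels together form a tree that can be walked from any token. To keep the sketch below runnable without a model download, the Doc is built by hand from the labels and heads shown above rather than predicted. This assumes spaCy v3, where the Doc constructor accepts heads (as absolute token indices) and deps.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Annotations copied from the example output above, not predicted here
words = ["This", "is", "a", "text", "."]
heads = [1, 1, 3, 1, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "attr", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the token labelled ROOT; it is its own head
root = [token for token in doc if token.dep_ == "ROOT"][0]
print(root.text)                                 # is

# Immediate dependents of the root
print([child.text for child in root.children])   # ['This', 'text', '.']

# A token together with everything below it in the tree
print([token.text for token in doc[3].subtree])  # ['a', 'text']
```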


Named entities

doc = nlp("Larry Page founded Google") 

# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]
 
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]

Sentences

doc = nlp("This is a sentence. This is another one.")
 
# doc.sents is a generator that yields sentence spans 
[sent.text for sent in doc.sents]

# ['This is a sentence.', 'This is another one.']
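
Each sentence is a Span, so it exposes token offsets as well as text. The sketch below swaps in spaCy's rule-based sentencizer component instead of the dependency parser, so it runs on a blank pipeline. This is a different technique from the syntax-driven segmentation a trained model provides, and the add_pipe("sentencizer") call assumes spaCy v3.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based sentence splitter (punctuation-driven); spaCy v3 call signature
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another one.")

for sent in doc.sents:
    # start/end are token offsets into the Doc
    print(sent.start, sent.end, sent.text)

sentences = [sent.text for sent in doc.sents]
print(sentences)  # ['This is a sentence.', 'This is another one.']
```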

Base noun phrases

doc = nlp("I have a red car") 

# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]

# ['I', 'a red car']

Label explanations


spacy.explain("RB") 
# 'adverb'
 
spacy.explain("GPE") 
# 'Countries, cities, states'
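
spacy.explain works for tags, dependency labels and entity labels alike, which makes it handy for decoding a batch of labels at once. In this sketch "NOT_A_LABEL" is a made-up label included to show that unknown labels come back as None (recent versions may also emit a warning).

```python
import spacy

# Mix of POS, tag, dependency and entity labels, plus one made-up label
labels = ["DET", "NN", "nsubj", "GPE", "NOT_A_LABEL"]

explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(label, "->", description)
```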


Visualizing


If you're in a Jupyter notebook, use displacy.render. Otherwise, use displacy.serve to start a web server and show the visualization in your browser.



from spacy import displacy
doc = nlp("This is a sentence")
displacy.render(doc, style="dep")

[Figure: dependency visualization]



Visualize named entities

doc = nlp("Larry Page founded Google") 
displacy.render(doc, style="ent")

[Figure: named entity visualization]
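
Outside a notebook, displacy.render can also return the markup as a string instead of displaying it. The sketch below is one possible way to get a standalone HTML page. The entities are set by hand so that it runs on a blank pipeline; with a trained model they would be predicted instead.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("Larry Page founded Google")

# Entities assigned manually for illustration; a trained pipeline predicts these
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="ORG")]

# jupyter=False forces a string return value; page=True wraps it in a full HTML page
html = displacy.render(doc, style="ent", page=True, jupyter=False)

print(html[:15])  # start of the generated page
```

The returned string can be written to an .html file and opened in a browser.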
