spaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython.
spaCy is designed to help you do real work: build real products and gather real insights.
The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive.
Features of spaCy
Named entity recognition
Support for 52+ languages
19 statistical models for 9 languages
Pre-trained word vectors
Easy deep learning integration
Part-of-speech tagging
Labelled dependency parsing
Syntax-driven sentence segmentation
Built-in visualizers for syntax and NER
Convenient string-to-hash mapping
Export to numpy data arrays
Efficient binary serialization
Easy model packaging and deployment
Robust, rigorously evaluated accuracy
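As a small illustration of the string-to-hash mapping listed above: spaCy interns every string in a shared StringStore. A minimal sketch using a blank pipeline, so no statistical model needs to be downloaded:

```python
import spacy

# A blank English pipeline is enough to demonstrate the StringStore
nlp = spacy.blank("en")
doc = nlp("I like coffee")

# Every string is mapped to a 64-bit hash in the shared vocabulary...
coffee_hash = nlp.vocab.strings["coffee"]
# ...and the hash maps losslessly back to the original string
coffee_text = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_text)
```

Because tokens store hashes rather than strings internally, comparisons and lookups stay fast and memory-efficient.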
Install spaCy
$ pip install spacy
Download statistical models
Predict part-of-speech tags, dependency labels, named entities and more. See the spaCy documentation for the available models.
$ python -m spacy download en_core_web_sm
Import and load
import spacy
nlp = spacy.load("en_core_web_sm")
If loading fails with OSError: [E050] Can't find model 'en_core_web_sm', download the model under its shortcut name and load that instead:
$ python -m spacy download en
import spacy
nlp = spacy.load("en")
Documents, tokens and spans
Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.
doc = nlp("This is a text")
A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings.
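The numpy export and binary serialization mentioned above can be sketched as follows; a blank pipeline is used here so no model download is required:

```python
import spacy
from spacy.attrs import ORTH
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("This is a text")

# Export token attributes (here: the ORTH hash IDs) to a numpy array
arr = doc.to_array([ORTH])

# Losslessly serialize the Doc to a binary string and restore it
data = doc.to_bytes()
doc2 = Doc(nlp.vocab).from_bytes(data)
print(doc2.text)  # 'This is a text'
```

The restored Doc shares the vocabulary of the pipeline it is loaded into, which is why from_bytes is called on a fresh Doc created with nlp.vocab.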
Accessing token attributes
doc = nlp("This is a text")
# Token texts
tokens = [token.text for token in doc]
print(tokens)
['This', 'is', 'a', 'text']
A slice from a Doc object.
doc = nlp("This is a text")
span = doc[2:4]
span.text
Output: 'a text'
Attributes return label IDs. For string labels, use the attributes with an underscore. For example, token.pos_.
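This ID-versus-string convention can be illustrated without a statistical model, using the lower/lower_ pair on a blank pipeline (pos/pos_, tag/tag_, dep/dep_ and so on follow the same pattern):

```python
import spacy

nlp = spacy.blank("en")
token = nlp("Hello world")[0]

# Attributes without an underscore return integer hash IDs...
lower_id = token.lower
# ...while the underscored variant returns the string itself
lower_str = token.lower_

# The ID resolves back to the string via the shared vocabulary
print(lower_id, lower_str)
```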
Part-of-speech tags (predicted by statistical model)
doc = nlp("This is a text.")
# Coarse-grained part-of-speech tags
[token.pos_ for token in doc]
Output: ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
# Fine-grained part-of-speech tags
[token.tag_ for token in doc]
Output: ['DT', 'VBZ', 'DT', 'NN', '.']
Syntactic dependencies (predicted by statistical model)
doc = nlp("This is a text.")
# Dependency labels
[token.dep_ for token in doc]
# ['nsubj', 'ROOT', 'det', 'attr', 'punct']
# Syntactic head token (governor)
[token.head.text for token in doc]
# ['is', 'is', 'text', 'is', 'is']
Named entities (predicted by statistical model)
doc = nlp("Larry Page founded Google")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]
Sentence segmentation
doc = nlp("This is a sentence. This is another one.")
# doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]
# ['This is a sentence.', 'This is another one.']
Base noun phrases
doc = nlp("I have a red car")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]
# ['I', 'a red car']
Label explanations
spacy.explain("RB")
# 'adverb'
spacy.explain("GPE")
# 'Countries, cities, states'
Visualize dependencies
If you're in a Jupyter notebook, use displacy.render. Otherwise, use displacy.serve to start a web server and show the visualization in your browser.
from spacy import displacy
doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
Visualize named entities
doc = nlp("Larry Page founded Google")
displacy.render(doc, style="ent")
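Outside a notebook, displacy.render can also return the raw HTML markup as a string, which you can save or embed yourself. A sketch with manually set entities on a blank pipeline, so no model download is needed:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Larry Page founded Google")
# Set entities by hand, since the blank pipeline has no NER component
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="ORG")]

# jupyter=False forces render to return the markup instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)
print("PERSON" in html)
```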