Getting Started with spaCy
Update: Almost a year after writing this article, spaCy has been upgraded to version 1.2, with new data and new features added. I have fixed some problems in this article with the spaCy install and test steps, especially the data loading method. Just enjoy it. 2016.11.13
I first met spaCy by googling "nltk cython", and found it very interesting:
spaCy.io | Build Tomorrow’s Language Technologies
spaCy is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you’re a small company doing NLP, we want spaCy to seem like a minor miracle.
Now let's get started with spaCy.
Installing spaCy
Installing spaCy is very easy. I have installed it on my Mac and on an Ubuntu VPS, in both cases with pip:
$ sudo pip install -U spacy
You should then download the data and models from spaCy; here we download the English data:
$ sudo python -m spacy.en.download all
Note that the download is about 1 GB, split into two parts, the parser data and the GloVe word vectors, so you can also download them one by one:
$ python -m spacy.en.download parser
$ python -m spacy.en.download glove
The data is placed in the spaCy install directory, like this:
textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M en-1.1.0
774M en_glove_cc_300_1m_vectors-1.0.0
and the parser directory "en-1.1.0" looks like the following:
textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M deps
8.0K meta.json
35M ner
12M pos
84K tokenizer
300M vocab
6.3M wordnet
Test spaCy
After installing spaCy, you can test it in the Python or IPython interpreter.
First, load and initialize the nlp data and text processor; this took about one minute on my MacBook Pro:
In [1]: import spacy
In [2]: nlp = spacy.load('en')
Word Tokenize Test:
In [14]: doc1 = nlp(u"this's spacy tokenize test")
In [15]: print doc1
this's spacy tokenize test
In [16]: for token in doc1:
....: print token
....:
this
's
spacy
tokenize
test
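If you want the tokens as plain strings rather than Token objects, each token's orth_ attribute holds its text. A minimal sketch, continuing the session above:
# Collect the raw token strings from the Doc created above
tokens = [token.orth_ for token in doc1]
print tokens
# [u'this', u"'s", u'spacy', u'tokenize', u'test']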
Sentence Tokenize Test or Sentence Segmentation Test:
In [23]: doc2 = nlp(u"this is spacy sentence tokenize test. this is second sent! is this the third sent? final test.")
In [24]: for sent in doc2.sents:
print sent
....:
this is spacy sentence tokenize test.
this is second sent!
is this the third sent?
final test.
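doc.sents yields Span objects; if you just want the sentence strings, Span.text gives the raw text. A small sketch, continuing the session above:
# Collect each sentence as a plain string
sentences = [sent.text for sent in doc2.sents]
print sentences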
Lemmatize Test:
In [35]: doc3 = nlp(u"this is spacy lemmatize testing. programming books are more better than others")
In [36]: for token in doc3:
....: print token, token.lemma, token.lemma_
....:
this 496 this
is 488 be
spacy 173779 spacy
lemmatize 1510965 lemmatize
testing 2900 testing
. 419 .
programming 3408 programming
books 1011 book
are 488 be
more 529 more
better 615 better
than 555 than
others 871 others
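The integer in the middle column (496, 488, ...) is not a count: token.lemma is the lemma's integer ID in the vocabulary's StringStore, and token.lemma_ is the corresponding string. The two round-trip through nlp.vocab.strings, as this sketch (continuing the session above) shows:
token = doc3[0]
# The StringStore maps between integer IDs and strings in both directions
print token.lemma                     # e.g. 496
print nlp.vocab.strings[token.lemma]  # u'this', the same as token.lemma_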
POS Tagging Test:
In [40]: doc4 = nlp(u"This is pos tagger test for spacy pos tagger")
In [41]: for token in doc4:
print token, token.pos, token.pos_
....:
This 87 DET
is 97 VERB
pos 82 ADJ
tagger 89 NOUN
test 89 NOUN
for 83 ADP
spacy 89 NOUN
pos 89 NOUN
tagger 89 NOUN
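token.pos_ is the coarse-grained universal part-of-speech tag; spaCy also exposes token.tag_ with the fine-grained, Penn Treebank style tag. A sketch continuing the session above:
for token in doc4:
    # pos_ is the coarse universal tag, tag_ the fine-grained treebank tag
    print token, token.pos_, token.tag_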
Named Entity Recognizer (NER) Test:
Here I used the same example from Stanford NER Python Example:
In [51]: doc5 = nlp(u"Rami Eid is studying at Stony Brook University in New York")
In [52]: for ent in doc5.ents:
print ent, ent.label, ent.label_
....:
Rami Eid 346 PERSON
Stony Brook University 349 ORG
New York 350 GPE
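Each entity is a Span, so you can also recover where it sits in the original text via its character offsets (a sketch continuing the session above; start_char and end_char are standard Span attributes):
for ent in doc5.ents:
    # start_char/end_char are offsets into the original text
    print ent.text, ent.start_char, ent.end_char, ent.label_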
Noun Chunk Test:
In [58]: doc6 = nlp(u"Natural language processing (NLP) deals with the application of computational models to text or speech data.")
In [59]: for np in doc6.noun_chunks:
....: print np
....:
Natural language processing
the application
computational models
text
speech data
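Noun chunks are Spans as well; np.root gives the chunk's head noun and np.root.head the word it attaches to in the dependency tree. A sketch continuing the session above:
for np in doc6.noun_chunks:
    # root is the chunk's head noun; root.head is its syntactic parent
    print np.text, np.root.text, np.root.head.text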
Word Vectors Test:
This example is copied from the official spaCy example:
In [68]: doc7 = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
In [69]: apples = doc7[0]
In [70]: oranges = doc7[2]
In [71]: boots = doc7[6]
In [72]: hippos = doc7[8]
In [73]: apples.similarity(oranges)
Out[73]: 0.7857989796519943
In [74]: boots.similarity(hippos)
Out[74]: 0.41468512503416094
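Under the hood, similarity() is essentially cosine similarity over the GloVe vectors, which you can reproduce from token.vector with numpy. A sketch continuing the session above:
import numpy

def cosine(a, b):
    # Cosine similarity between two vectors
    return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

print cosine(apples.vector, oranges.vector)  # should match apples.similarity(oranges)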
Now it's your turn to explore spaCy.
Post by TextMiner
I am getting an error while doing this:
import spacy
In [2]: spacy.load('en')
Out[2]:
In [3]: import os
In [4]: from spacy.en import English, LOCAL_DATA_DIR
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
in <module>()
----> 1 from spacy.en import English, LOCAL_DATA_DIR
ImportError: cannot import name LOCAL_DATA_DIR
Have you downloaded the spaCy corpus?
$ python -m spacy.en.download --force all
Note that the download is about 750 MB; the default location is the "…python2.7/site-packages/spacy/en/data" directory, which includes the tokenizer, POS tagger, parser, and related models and data.
Just in case somebody is confused: that is --force, with two hyphens. The page fuses the -- into a smart dash, which confused me.
Still having problems. I downloaded the data.
jeffs@jeff-desktop:~/skyset/NLP$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy.en
>>> dir(spacy.en)
['English', 'Language', 'STOPWORDS', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__path__', 'path', 'print_function', 'unicode_literals']
>>> import os
>>> from spacy.en import English, LOCAL_DATA_DIR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name LOCAL_DATA_DIR
>>>
>>>
jeffs@jeff-desktop:~/skyset/NLP$ echo $LOCAL_DATA_DIR
/usr/local/lib/python2.7/dist-packages/spacy/data/
jeffs@jeff-desktop:~/skyset/NLP$ ls -l $LOCAL_DATA_DIR
total 12
drwxr-sr-x 5 root staff 4096 Sep 24 22:09 __cache__
-rw-r--r-- 1 root staff 201 Sep 24 22:27 cookies.txt
drwxr-sr-x 8 root staff 4096 Sep 24 22:27 en-1.1.0
jeffs@jeff-desktop:~/skyset/NLP$
I finally figured it out. I made a guess at the LOCAL_DATA_DIR value, LOCAL_DATA_DIR="/usr/local/lib/python2.7/dist-packages/spacy/data/", and got an error message looking for the file /usr/local/lib/python2.7/dist-packages/spacy/data/vocab/strings.json. So I changed the value to LOCAL_DATA_DIR="/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0" and that worked fine. Not sure why I have to do that; I suppose I should modify the code so the import works properly.
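For reference, the pattern the commenter describes looks roughly like this on older spaCy versions (a sketch, not a definitive recipe; the data_dir argument and the exact path are assumptions that depend on your version and install):
import os
from spacy.en import English
# Point directly at the downloaded model directory (path from the comment above)
data_dir = '/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0'
nlp = English(data_dir=data_dir)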
Can I extract a sentence given a certain keyword in spaCy?
It seems you cannot extract a sentence by a certain keyword directly in spaCy.
Actually, you can:
input:
import spacy
from nltk.corpus import webtext
doc = webtext.raw('grail.txt')
nlp = spacy.load('en')
grail = nlp(doc)
ni_sents = [grail[tok.i : tok.i + 1].sent for tok in grail if tok.lower_ == 'neee']
for c, sent in enumerate(ni_sents):
    print(f'Sent {c}: {sent}')
output:
We are the keepers of the sacred words: Ni, Peng, and Neee-wom!
RANDOM: Neee-wom!
Sorry. Forgot to repost the output after adding the string formatting. The output actually looks like:
Sent 0: We are the keepers of the sacred words: Ni, Peng, and Neee-wom!
Sent 1: RANDOM: Neee-wom!
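As an aside, newer spaCy versions expose the containing sentence directly on the token, which avoids the span-slicing trick above (an assumption about your version; Token.sent may not exist on older releases):
# Equivalent one-liner on newer spaCy versions
ni_sents = [tok.sent for tok in grail if tok.lower_ == 'neee']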
Hi, I am seeing a difference between the NER results in the demo on this website and the ones I get locally. What is causing this difference? An entity which is detected as an ORG in the demo is not even picked up as an entity locally.
Maybe it's caused by a different version?
I seem to have the latest version. Are you suggesting that an earlier version is better?
A well-known org is detected on your website but not in my local run. This is odd. Shouldn't the latest version have better accuracy?
If the NER result is odd, you should check the data or model you downloaded from spaCy.
apples.similarity(oranges) gives me 0.392899592931
boots.similarity(hippos) gives me 0.20734257731
Same example? That seems strange.
Hi, I am seeing a difference between the noun chunk output in the demo and when I run it locally. It is identifying adverbs like 'How many' as part of the noun chunk when I run it locally. I have all the latest spaCy models installed.
Can you suggest a reason for this?
Maybe the model spaCy uses now is different from the one in the demo.
Can't install spaCy. I am getting the following error when using python -m spacy.en.download all:
DLL load failed: The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail.
I cannot download the data via "python -m spacy.en.download all". I ran this command in cmd and from Python on Windows, but both failed. How can I run it on Windows?
@molly, @regina: This might be relevant. It's the message I got when running "python -m spacy.en.download all" on macOS.
The spacy.en.download command is now deprecated. Please use python -m
spacy download [model name or shortcut] instead. For more info and
available models, see the documentation: https://spacy.io/docs/usage.
Downloading default 'en' model now…
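So on newer spaCy versions the download step becomes, for example:
$ python -m spacy download en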
apples.similarity(oranges)
boots.similarity(hippos)
For both of them, the result is zero.
Hello, I have the same problem. Did you solve it?
I got the same problem. Does anyone know how to solve it?
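A likely cause (a guess, since no version details were given above): the GloVe vectors were never downloaded or loaded, so every vector is empty and every similarity comes out zero. You can check whether a token actually has a vector attached:
# has_vector is False when no word vector is loaded for the token
print apples.has_vector, oranges.has_vector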
Hey, while executing a few commands like NER and sentence segmentation, I am not getting the last word or the last line.
Example:
Python 2.7.10 (default, Feb 6 2017, 23:53:20)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
import spacy
nlp = spacy.load('en')
doc2 = nlp(u"this is spacy sentence tokenize test. this is second sent! is this the third sent? final test.")
for sent in doc2.sents:
    print sent
doc5 = nlp(u"Rami Eid is studying at Stony Brook University in New York")
for ent in doc5.ents:
    print (ent, ent.label, ent.label_)
this is spacy sentence tokenize test.
this is second sent!
is this the third sent? final test.
(Rami Eid, 377, u'PERSON')
(Stony Brook University, 380, u'ORG')
Hi,
First of all, thank you so much for posting a very good and precise summary of spaCy. I have been learning it for a few days. I have a few questions and would really appreciate it if you could help me with them.
I am creating one string as
text = "I am new to this"
and trying to pass it through nlp as
nlp = spacy.load('en')
text2 = nlp(text)
and I am getting an error.
Also, in the lemmatize test output,
"this 496 this"
what does the number 496 indicate?
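(For readers hitting the same error: on Python 2, spaCy 1.x expects unicode input, so passing a plain str like the one above can raise an error. A guess at the fix, assuming that is the error in question:)
import spacy
nlp = spacy.load('en')
text2 = nlp(u"I am new to this")  # the u prefix matters: spaCy 1.x on Python 2 wants unicode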
Which version of spaCy do you use in your lemmatizer? Is it 'en_core_web_lg' or 'en_core_web_sm'? I'm doing a college project in which I need to lemmatize words that come as input from users. I'm having issues with some words while using the large model, and they get resolved when I use the smaller one.
Eg. 'data sciences' -> 'data sciences' in the lg version
'data sciences' -> 'data science' in the sm version
Please advise me on which spaCy lemmatizer you use, or how I could solve the issue I'm facing.
Sorry, this article was written in 2015; it is very different from the current version of spaCy.