Getting Started with spaCy

Update: Almost one year after writing this article, spaCy has been upgraded to version 1.2, with new data and new features added. I have fixed some problems in this article with the spaCy installation and testing steps, especially the data-loading method. Enjoy. 2016.11.13

I found spaCy by googling “nltk cython”, and it looked very interesting:

spaCy.io | Build Tomorrow’s Language Technologies

spaCy is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you’re a small company doing NLP, we want spaCy to seem like a minor miracle.

Now let’s get started with spaCy.

Installing spaCy
Installing spaCy is very easy. I have installed it on my Mac and on an Ubuntu VPS, both times using pip:

$ sudo pip install -U spacy

You should also download spaCy’s data and models; here we download the English data:

$ sudo python -m spacy.en.download all

Note that the download is about 1 GB, split into two parts: the parser data and the GloVe word-vector model. You can also download them one at a time:

$ python -m spacy.en.download parser

$ python -m spacy.en.download glove

The data is in the spaCy install directory, like this:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M en-1.1.0
774M en_glove_cc_300_1m_vectors-1.0.0

and the parser directory “en-1.1.0” looks like this:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M deps
8.0K meta.json
35M ner
12M pos
84K tokenizer
300M vocab
6.3M wordnet

Testing spaCy
After installing spaCy, you can test it from the Python or IPython interpreter:

First, load and initialize the NLP data and text processor; this took about one minute on my MacBook Pro:

In [1]: import spacy

In [2]: nlp = spacy.load('en')

Word Tokenize Test:

In [14]: doc1 = nlp(u"this's spacy tokenize test")

In [15]: print doc1
this's spacy tokenize test

In [16]: for token in doc1:
....: print token
....:
this
's
spacy
tokenize
test
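
Each token also exposes character offsets and simple boolean flags. Here is a minimal sketch (attributes from the spaCy Token API; token.idx is the token's character offset in the original string):

for token in doc1:
    print token.text, token.idx, token.is_punct, token.is_space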

Sentence Tokenize Test or Sentence Segmentation Test:

In [23]: doc2 = nlp(u"this is spacy sentence tokenize test. this is second sent! is this the third sent? final test.")

In [24]: for sent in doc2.sents:
....: print sent
....:
this is spacy sentence tokenize test.
this is second sent!
is this the third sent?
final test.
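
Each sentence is a Span over the original Doc, so you can also recover its token boundaries. A small sketch (Span.start and Span.end are token indices into the doc, not character offsets):

for sent in doc2.sents:
    print sent.start, sent.end, sent.text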

Lemmatize Test:
In [35]: doc3 = nlp(u"this is spacy lemmatize testing. programming books are more better than others")

In [36]: for token in doc3:
....: print token, token.lemma, token.lemma_
....:
this 496 this
is 488 be
spacy 173779 spacy
lemmatize 1510965 lemmatize
testing 2900 testing
. 419 .
programming 3408 programming
books 1011 book
are 488 be
more 529 more
better 615 better
than 555 than
others 871 others
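
The integer in the middle column is token.lemma, the ID of the lemma string in spaCy's string store; token.lemma_ is the readable text. You can map between the two via nlp.vocab.strings, as in this minimal sketch (the concrete IDs, like 496 here, vary across spaCy versions and data files):

print nlp.vocab.strings[u'this']   # the lemma's integer ID, 496 in this session
print nlp.vocab.strings[496]       # and back to the string u'this'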

POS Tagging Test:
In [40]: doc4 = nlp(u"This is pos tagger test for spacy pos tagger")

In [41]: for token in doc4:
....: print token, token.pos, token.pos_
....:
This 87 DET
is 97 VERB
pos 82 ADJ
tagger 89 NOUN
test 89 NOUN
for 83 ADP
spacy 89 NOUN
pos 89 NOUN
tagger 89 NOUN
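
Besides the coarse-grained token.pos / token.pos_ pair shown above, spaCy also assigns a fine-grained tag (Penn Treebank style for English) as token.tag / token.tag_. A quick sketch on the same doc:

for token in doc4:
    print token.text, token.pos_, token.tag_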

Named Entity Recognizer (NER) Test:
Here I used the same example as in the Stanford NER Python example:

In [51]: doc5 = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [52]: for ent in doc5.ents:
....: print ent, ent.label, ent.label_
....:
Rami Eid 346 PERSON
Stony Brook University 349 ORG
New York 350 GPE
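
The entity annotations are also available token by token in an IOB scheme, which is handy when you need per-token labels. A minimal sketch (token.ent_iob_ marks whether a token begins, is inside, or is outside an entity; token.ent_type_ carries the label):

for token in doc5:
    print token.text, token.ent_iob_, token.ent_type_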

Noun Chunk Test:
In [58]: doc6 = nlp(u"Natural language processing (NLP) deals with the application of computational models to text or speech data.")

In [59]: for np in doc6.noun_chunks:
....: print np
....:
Natural language processing
the application
computational models
text
speech data
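
Each noun chunk is a Span whose root token ties the chunk into the dependency parse, so you can also see what each chunk attaches to. A short sketch:

for np in doc6.noun_chunks:
    # chunk text, the grammatical role of its root, and the head it attaches to
    print np.text, np.root.dep_, np.root.head.text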

Word Vectors Test:
This example is copied from the official spaCy examples:
In [68]: doc7 = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
In [69]: apples = doc7[0]

In [70]: oranges = doc7[2]

In [71]: boots = doc7[6]

In [72]: hippos = doc7[8]

In [73]: apples.similarity(oranges)
Out[73]: 0.7857989796519943

In [74]: boots.similarity(hippos)
Out[74]: 0.41468512503416094
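
These similarity scores come from the GloVe vectors, and the same API extends past single tokens: spans and docs average their token vectors. A small sketch (assuming the 300-dimensional GloVe vectors downloaded above are loaded):

print apples.has_vector       # True when the vector model covers the word
print apples.vector.shape     # (300,) for this GloVe model

sents = list(doc7.sents)
print sents[0].similarity(sents[1])   # span-level similarity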

Now it’s your turn to explore spaCy.

Post by TextMiner


Getting Started with spaCy — 27 Comments

  1. I am getting an error while doing this:

    import spacy

    In [2]: spacy.load('en')
    Out[2]:

    In [3]: import os

    In [4]: from spacy.en import English, LOCAL_DATA_DIR
    ---------------------------------------------------------------------------
    ImportError Traceback (most recent call last)
    in <module>()
    ----> 1 from spacy.en import English, LOCAL_DATA_DIR

    ImportError: cannot import name LOCAL_DATA_DIR

    • Have you downloaded the spaCy corpus?

      $ python -m spacy.en.download --force all

      Note that the download is about 750 MB; the default location is the “…python2.7/site-packages/spacy/en/data” directory, which includes the tokenizer, POS tagger, parser, and the related models and data.

        • Still having problems. I downloaded the data.

          jeffs@jeff-desktop:~/skyset/NLP$ python
          Python 2.7.12 (default, Jul 1 2016, 15:12:24)
          [GCC 5.4.0 20160609] on linux2
          Type "help", "copyright", "credits" or "license" for more information.
          >>> import spacy.en
          >>> dir(spacy.en)
          ['English', 'Language', 'STOPWORDS', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__path__', 'path', 'print_function', 'unicode_literals']
          >>> import os
          >>> from spacy.en import English, LOCAL_DATA_DIR
          Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          ImportError: cannot import name LOCAL_DATA_DIR
          >>>
          >>>
          jeffs@jeff-desktop:~/skyset/NLP$ echo $LOCAL_DATA_DIR
          /usr/local/lib/python2.7/dist-packages/spacy/data/
          jeffs@jeff-desktop:~/skyset/NLP$ ls -l $LOCAL_DATA_DIR
          total 12
          drwxr-sr-x 5 root staff 4096 Sep 24 22:09 __cache__
          -rw-r--r-- 1 root staff 201 Sep 24 22:27 cookies.txt
          drwxr-sr-x 8 root staff 4096 Sep 24 22:27 en-1.1.0
          jeffs@jeff-desktop:~/skyset/NLP$

          • I finally figured it out. I made a guess at the LOCAL_DATA_DIR value, LOCAL_DATA_DIR="/usr/local/lib/python2.7/dist-packages/spacy/data/", and got an error message looking for the file /usr/local/lib/python2.7/dist-packages/spacy/data/vocab/strings.json. So I changed the value to LOCAL_DATA_DIR="/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0" and that worked fine. Not sure why I have to do that. I should modify the code to make the import work properly.

      • Actually, you can:

        input:

        import spacy
        from nltk.corpus import webtext

        doc = webtext.raw('grail.txt')

        nlp = spacy.load('en')
        grail = nlp(doc)

        # the sentence containing each token whose lowercased text is 'neee'
        ni_sents = [grail[tok.i : tok.i + 1].sent for tok in grail if tok.lower_ == 'neee']

        for c, sent in enumerate(ni_sents):
            print(f'Sent {c}: {sent}')

        output:
        We are the keepers of the sacred words: Ni, Peng, and Neee-wom!

        RANDOM: Neee-wom!

        • Sorry. Forgot to repost the output after adding the string formatting. The output actually looks like:
          Sent 0: We are the keepers of the sacred words: Ni, Peng, and Neee-wom!

          Sent 1: RANDOM: Neee-wom!

  2. Hi, I am seeing a difference between the NER output of the demo on this website and what I get locally. What is causing this difference? An entity which is detected as ORG on the demo is not even picked up as an entity locally.

      • I seem to have the latest version. Are you suggesting that an earlier version is better?

        A well-known ORG is detected on your website but not recognized locally. This is odd. Shouldn’t the latest version have better accuracy?

  3. Hi, I am seeing a difference between the noun-chunk output in the demo and what I get when I run it locally. Locally it identifies adverbs like ‘How many’ as part of the noun chunk. I have all the latest spaCy models installed.

    Can you suggest a reason for this?

  4. Can’t install spaCy. I get the following error when using python -m spacy.en.download all:
    DLL load failed: The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail.

  5. I cannot download the data via “python -m spacy.en.download all”. I ran this command in cmd and in Python on Windows, but both failed. How can I run it on Windows?

  6. @molly, @regina: This might be relevant. It’s the message I got when running “python -m spacy.en.download all” on MacOS.
    The spacy.en.download command is now deprecated. Please use python -m
    spacy download [model name or shortcut] instead. For more info and
    available models, see the documentation: https://spacy.io/docs/usage.
    Downloading default ‘en’ model now…

  7. Hey, while executing a few of the commands, like the NER and sentence tokenize examples, I am not getting the last word or the last line.

    Example:

    Python 2.7.10 (default, Feb 6 2017, 23:53:20)
    [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
    import spacy
    nlp = spacy.load('en')
    doc2 = nlp(u"this is spacy sentence tokenize test. this is second sent! is this the third sent? final test.")
    for sent in doc2.sents:
        print sent
    doc5 = nlp(u"Rami Eid is studying at Stony Brook University in New York")
    for ent in doc5.ents:
        print (ent, ent.label, ent.label_)

    this is spacy sentence tokenize test.
    this is second sent!
    is this the third sent? final test.
    (Rami Eid, 377, u'PERSON')
    (Stony Brook University, 380, u'ORG')

  8. Hi,

    First of all, thank you so much for posting a very good and precise summary of spaCy. I have been learning it for a few days. I have a few questions and would really appreciate it if you could help me with them.

    I am creating one string as

    text = "I am new to this"

    and trying to pass it through nlp as

    nlp = spacy.load('en')

    text2 = nlp(text)

    and I am getting an error.

    Also, in the lemmatize test output,

    "this 496 this"

    what does the number 496 indicate?

  9. Which version of spaCy do you use in your lemmatizer? Is that ‘en_core_web_lg’ or ‘en_core_web_sm’? I’m making a college project in which I need to lemmatize words that come as input from users. I’m having issues with some words while using the large model, and they get resolved when I use the smaller one.

    E.g. ‘data sciences’ -> ‘data sciences’ with the lg model
    ‘data sciences’ -> ‘data science’ with the sm model

    Please advise me on which spaCy lemmatizer you use, or how I could solve the issue I’m facing.
