Introduction to NLP Library: Spacy in Python


Natural Language Processing (NLP) allows computers and machines to perform tasks like reading and understanding human language.
One of the major challenges we face when working with textual data is extracting meaningful patterns and using that information to find actionable insights.
In today's post we will explore an NLP library that can help us find patterns in text data.

spaCy is one of the most popular and easy-to-use natural language processing libraries in Python. It helps in building applications that can process and get insights from large volumes of text. It can be used in tasks related to information extraction, natural language understanding systems, deep learning, etc. Companies like Airbnb, Quora, and Uber use it in production, and it has an active open-source community.

spaCy is a good choice for NLP tasks. Some of the features provided by spaCy are:

  • Tokenisation
  • Lemmatisation
  • Entity recognition
  • Dependency parsing
  • Sentence recognition
  • Part-of-speech tagging

The full notebook can be found here

Installation

pip install spacy

After installing the library we need to download a language model. For more info and available models, see the docs on models. (Older spaCy versions used the shortcut `en`; newer versions use the full model name.)

python -m spacy download en_core_web_sm

Import Libraries

import spacy

print(spacy.__version__)
from spacy import displacy
nlp = spacy.load('en_core_web_sm') # loading the English language model
# define a test sentence

test_sent = "Pakistan got independence in 1947. Karachi, Lahore and Islamabad are a few of the major cities of Pakistan."

Tokenization

Tokenization is the process of splitting a text into words, symbols, punctuation, and spaces, in short, tokens.

# convert our test sentence into a spaCy Doc object
# we will use this object to extract tokens
parsed_sent = nlp(test_sent)
print(type(parsed_sent))
# naive whitespace split, for comparison
print(parsed_sent.text.split())

spaCy recognises punctuation and is able to split the sentence into both word tokens and punctuation tokens.

# .orth_ returns the verbatim text of each token

for token in parsed_sent:
    print(token.orth_)

# To print only words, without punctuation or spaces, use the code below
# Method # 01
for token in parsed_sent:
    if not (token.is_punct or token.is_space):
        print(token.orth_)

# Method # 02

word_list = [token.orth_ for token in parsed_sent if not (token.is_punct or token.is_space)]

print(word_list)
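Sentence recognition is also listed among spaCy's features above. As a minimal sketch, spaCy's rule-based `sentencizer` pipeline component can split a text into sentences even without a statistical model (the `spacy.blank` and `add_pipe` calls below follow the spaCy 3 API):

```python
import spacy

# build a blank English pipeline and add the rule-based sentence splitter
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

doc = nlp_blank("Pakistan got independence in 1947. Karachi is a major city.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With the full model loaded (as in the code above), `doc.sents` is available directly because the dependency parser sets sentence boundaries.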

Part-of-Speech Tagging

spaCy makes it easy to get part-of-speech tags.

N(oun) : This usually denotes words that depict some object or entity which may be living or nonliving.

V(erb) : Verbs are words that are used to describe certain actions, states, or occurrences.

Adj(ective) : Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The phrase beautiful flower has the noun (N) flower which is described or qualified using the adjective (ADJ) beautiful.

Adv(erb) : Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs. The phrase very beautiful flower has the adverb (ADV) very, which modifies the adjective (ADJ) beautiful.

# Method # 01: visualise the dependency parse, which also shows POS tags

sentence_spans = list(parsed_sent.sents)
displacy.render(sentence_spans, style='dep', jupyter=True)
# Method # 02: print the coarse (pos_) and fine-grained (tag_) tag for each token
for token in parsed_sent:
    print(token.orth_, token.pos_, token.tag_)
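If a tag abbreviation is unfamiliar, `spacy.explain` returns a human-readable description; it works for POS tags, dependency labels, and entity types alike:

```python
import spacy

# look up the meaning of tag and label abbreviations
print(spacy.explain("NNP"))  # a proper-noun tag
print(spacy.explain("VBD"))  # a past-tense verb tag
print(spacy.explain("GPE"))  # an entity label for countries/cities
```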

Entity recognition

Entity recognition is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, dates, etc.

parsed_sent = nlp(test_sent)
displacy.render(parsed_sent, style='ent', jupyter=True)
for ent in parsed_sent.ents:  # list each entity with its label
    print(ent.text, '->', ent.label_)

Lemmatization

Lemmatization is the reduction of each word to its root form, or lemma. In spaCy we call `token.lemma_` to get the lemma of each word.

for token in parsed_sent:
    print(token, ' -> lemma:', token.lemma_)

Convert Spacy data into a Dataframe

import pandas as pd

df_token = pd.DataFrame()

for i, token in enumerate(parsed_sent):
    df_token.loc[i, 'text'] = token.text
    df_token.loc[i, 'lemma'] = token.lemma_
    df_token.loc[i, 'pos'] = token.pos_
    df_token.loc[i, 'tag'] = token.tag_
    df_token.loc[i, 'dep'] = token.dep_
    #df_token.loc[i, 'shape'] = token.shape_
    #df_token.loc[i, 'is_alpha'] = token.is_alpha
    df_token.loc[i, 'is_stop'] = token.is_stop
    
print(df_token)

# writing dataframe into excel file

df_token.to_excel('Tokens Data.xlsx', index=False)
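Filling a DataFrame cell by cell with `.loc` works, but it grows slowly for long documents. A common alternative is to collect one dict per token and build the DataFrame in a single call. The sketch below uses a blank tokenizer so no model download is needed; attributes like `lemma_` or `pos_` would require the full pipeline loaded earlier in the article:

```python
import spacy
import pandas as pd

nlp_blank = spacy.blank("en")  # tokenizer only, no statistical model
doc = nlp_blank("Karachi is a major city.")

# one dict per token, then a single DataFrame construction
rows = [{"text": t.text, "is_stop": t.is_stop, "is_punct": t.is_punct} for t in doc]
df = pd.DataFrame(rows)
print(df)
```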

These are just a few of the functionalities of this amazing NLP library; you can refer to its documentation for more.

This brings us to the end of this article. You can also check out and explore the NLP library TextBlob.
http://uzairadamjee.com/blog/textblob/

The full code can be downloaded from my GitHub:
https://github.com/uzairaj/TextBlob

Check out more blogs on my website and YouTube channel
http://uzairadamjee.com/blog
https://www.youtube.com/channel/UCCxSpt0KMn17sMn8bQxWZXA

Thank you for reading 🙂
