Natural Language Processing (NLP) allows computers and machines to perform tasks like reading and understanding human language.
One of the major challenges when working with textual data is extracting meaningful patterns and using that information to find actionable insights.
In today’s post we will explore an NLP library that can help us find patterns in text data.
spaCy is one of the most popular and easy-to-use natural language processing libraries in Python. It helps in building applications that can process and get insights from large volumes of text. It can be used for tasks related to information extraction, natural language understanding systems, deep learning, and more. Companies like Airbnb, Quora, and Uber use it in production, and it has an active open-source community.
spaCy is a good choice for many NLP tasks. Some of the features it provides are:
- Entity recognition
- Dependency parsing
- Sentence recognition
- Part-of-speech tagging
The full notebook can be found here.
pip install spacy
After installing the library, we need to download a language model. For more info on the available models, see the spaCy models documentation.
python -m spacy download en_core_web_sm
import spacy
print(spacy.__version__)
from spacy.lang.en import English
from spacy import displacy

nlp = spacy.load('en_core_web_sm')  # loading the English language model
# define a test sentence
test_sent = "Pakistan got independence in 1947. Karachi, Lahore and Islamabad are a few of the major cities of Pakistan."
Tokenization is the process of splitting a text into words, symbols, punctuation marks, and spaces, or, in short, tokens.
# converting our test sentence into a spaCy Doc object
# we will use this object to extract tokens
parsed_sent = nlp(test_sent)
print(type(parsed_sent))
# for comparison, a plain whitespace split of the sentence text
print(parsed_sent.text.split())
spaCy recognises punctuation and is able to split the sentence into word tokens as well as punctuation tokens.
# .orth_ gives the raw text of each token
for token in parsed_sent:
    print(token.orth_)
# to print only the words, without punctuation or spaces
# Method # 01
for token in parsed_sent:
    if not (token.is_punct or token.is_space):
        print(token.orth_)
# Method # 02: the same filtering as a list comprehension
word_list = [token.orth_ for token in parsed_sent if not (token.is_punct or token.is_space)]
print(word_list)
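spaCy can also split the parsed text into sentences (the sentence recognition feature listed above). A minimal sketch using the Doc's sents attribute:

# splitting the parsed Doc into sentences
for sent in parsed_sent.sents:
    print(sent.text)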
spaCy makes it easy to get part-of-speech tags. The most common tag categories are described below, followed by a short example.
N(oun) : This usually denotes words that depict some object or entity which may be living or nonliving.
V(erb) : Verbs are words that are used to describe certain actions, states, or occurrences.
Adj(ective) : Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The phrase beautiful flower has the noun (N) flower which is described or qualified using the adjective (ADJ) beautiful.
Adv(erb) : Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs. The phrase very beautiful flower has the adverb (ADV) very, which modifies the adjective (ADJ) beautiful.
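As a quick check, here is a small sketch (the sentence is my own example) printing the part-of-speech tag spaCy assigns to each word in the "very beautiful flower" phrase from the definitions above:

# POS tags for the example phrase; expect very -> ADV, beautiful -> ADJ, flower -> NOUN
phrase = nlp("This is a very beautiful flower.")
for token in phrase:
    print(token.text, '->', token.pos_)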
# Method # 01: visualize the dependency parse; POS tags appear under each token
sentence_spans = list(parsed_sent.sents)
displacy.render(sentence_spans, style='dep', jupyter=True)
# Method # 02: print the part-of-speech tag of each token
for token in parsed_sent:
    print(token.orth_, token.pos_)
Entity recognition is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, dates, etc.
parsed_sent = nlp(test_sent)
displacy.render(parsed_sent, style='ent', jupyter=True)
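Besides the visualization, the recognized entities and their labels can also be read programmatically from the Doc's ents attribute:

# printing each named entity with its predicted label
for ent in parsed_sent.ents:
    print(ent.text, '->', ent.label_)
# e.g. Pakistan -> GPE (geopolitical entity), 1947 -> DATE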
Lemmatization is the reduction of each word to its root form, or lemma. In spaCy we access token.lemma_ to get the lemma of each word.
for token in parsed_sent:
    print(token.text, '-> lemma:', token.lemma_)
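Lemmatization also handles irregular forms. A small sketch with a made-up sentence of my own:

# lemmas of inflected and irregular forms
doc = nlp("The children were running and the mice ran away.")
for token in doc:
    print(token.text, '->', token.lemma_)
# e.g. children -> child, were -> be, mice -> mouse, ran -> run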
Convert spaCy data into a DataFrame
import pandas as pd

df_token = pd.DataFrame()
for i, token in enumerate(parsed_sent):
    df_token.loc[i, 'text'] = token.text
    df_token.loc[i, 'lemma'] = token.lemma_
    df_token.loc[i, 'pos'] = token.pos_
    df_token.loc[i, 'tag'] = token.tag_
    df_token.loc[i, 'dep'] = token.dep_
    # df_token.loc[i, 'shape'] = token.shape_
    # df_token.loc[i, 'is_alpha'] = token.is_alpha
    df_token.loc[i, 'is_stop'] = token.is_stop
print(df_token)

# writing the DataFrame to an Excel file (needs an Excel writer package such as openpyxl)
df_token.to_excel('Tokens Data.xlsx', index=False)
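Building the DataFrame cell by cell with .loc works, but it is slow for large documents. A minimal alternative sketch that collects the token attributes into a list of dicts first:

# alternative: build the DataFrame in one go from a list of dicts
rows = [{'text': token.text, 'lemma': token.lemma_, 'pos': token.pos_,
         'tag': token.tag_, 'dep': token.dep_, 'is_stop': token.is_stop}
        for token in parsed_sent]
df_token = pd.DataFrame(rows)
print(df_token)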
These are just a few of the functionalities of this amazing NLP library. For more, you can refer to its documentation.
This brings us to the end of this article. You can also check out and explore the NLP library TextBlob.
The full code can be downloaded from my GitHub.
Check out more blogs on my website and YouTube channel.
Thank you for reading 🙂