Introduction to NLTK library in Python


NLTK is an essential Python library that supports tasks such as classification, stemming, tagging, parsing, semantic reasoning, and tokenization. It is one of the main tools for natural language processing in Python, and today it also serves as an educational foundation for developers who are dipping their toes into NLP and machine learning.

It’s a free, open-source library available on Windows, macOS, and Linux, with plenty of tutorials to make your entry into the world of NLP smooth.

Resources

· Documentation — https://www.nltk.org/

· NLTK Book — http://www.nltk.org/book/

Features

  • Sentence and word tokenization
  • Part-of-speech tagging
  • Stemming and lemmatization
  • Sentiment analysis
  • WordNet support
  • Language translation

Installation

pip install nltk

Import Library

import nltk
# nltk.download()  # opens the interactive downloader to fetch corpora and models (run once)
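Instead of the interactive downloader, you can fetch just the resources this article relies on. A minimal sketch (on newer NLTK releases some resources have renamed variants such as "punkt_tab" and "averaged_perceptron_tagger_eng"):

```python
import nltk

# Download only the resources used in the examples below.
for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    ok = nltk.download(resource, quiet=True)  # returns True on success, False otherwise
    print(resource, "downloaded:", ok)
```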

Sentence Tokenize

from nltk.tokenize import sent_tokenize

text = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
tokenized_text = sent_tokenize(text)
print(tokenized_text)

Word Tokenize

from nltk.tokenize import word_tokenize

tokenized_word = word_tokenize(text)
print(tokenized_word)

Stopwords

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)

filtered_words = []
for w in word_tokens:
    if w not in stop_words:
        filtered_words.append(w)

print(word_tokens)
print()
print(filtered_words)

After running this cell, we get the words that remain in our text once the stopwords (is, the, how, are, you, etc.) have been filtered out.

Stemming

from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words = ["python", "pythoner", "rocks", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))

Lemmatization

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("pythonly"))

Stemming looks only at the surface form of a word and chops off suffixes, whereas lemmatization looks the word up in a vocabulary (WordNet) and returns its dictionary form. A lemmatizer therefore produces real words for terms it knows, while a stemmer can produce non-words; for unknown terms such as "pythonly", the lemmatizer simply returns the input unchanged.

POS Tagging

text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))

WordNet

WordNet is a lexical database, i.e. a dictionary for the English language, designed specifically for natural language processing.

Synset is a simple interface in NLTK for looking up words in WordNet. A Synset instance is a grouping of synonymous words that express the same concept. Some words have only one Synset, while others have several.

from nltk.corpus import wordnet

syns = wordnet.synsets("program")
print(syns[0].name())
print(syns[0].lemmas()[0].name())
print(syns[0].definition())
print(syns[0].examples())

Frequency Distribution

from nltk.probability import FreqDist

fdist = FreqDist(tokenized_word)
print(fdist)
# <FreqDist with 25 samples and 30 outcomes>
print(fdist.most_common(2))
# [('is', 3), (',', 2)]

# Frequency Distribution Plot
import matplotlib.pyplot as plt
fdist.plot(30, cumulative=False)
plt.show()
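FreqDist behaves like a Counter, so it works on any list of tokens. A tiny self-contained example (the sentence is my own):

```python
from nltk.probability import FreqDist

words = "the cat sat on the mat near the door".split()
fdist = FreqDist(words)

print(fdist.most_common(1))  # [('the', 3)] -- the single most frequent token
print(fdist["the"])          # 3 -- count of an individual token
print(fdist.N())             # 9 -- total number of tokens observed
```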

This brings us to the end of this article.

The full code can be downloaded from my GitHub:
https://github.com/uzairaj/Nltk

Check out more blogs on my website and YouTube channel:
http://uzairadamjee.com/blog
https://www.youtube.com/channel/UCCxSpt0KMn17sMn8bQxWZXA

Thank you for reading 🙂

