NLTK is an essential Python library that supports tasks such as classification, stemming, tagging, parsing, semantic reasoning, and tokenization. It is one of the main tools for natural language processing, and today it also serves as an educational foundation for Python developers who are dipping their toes into this field (and machine learning).
It is free, open source, and available on Windows, macOS, and Linux, with plenty of tutorials to make your entry into the world of NLP smooth.
· Documentation — https://www.nltk.org/
· NLTK Book — http://www.nltk.org/book/
- Tokenization (sentences/words)
- Part-of-speech tagging
- Stemming, Lemmatization
- Sentiment Analysis
- Wordnet Support
- Language translation
pip install nltk
from nltk.tokenize import sent_tokenize
text = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
tokenized_text = sent_tokenize(text)
print(tokenized_text)
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize(text)
from nltk.corpus import stopwords
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
filtered_words = []
for w in word_tokens:
    if w not in stop_words:
        filtered_words.append(w)
print(filtered_words)
After running this cell, we get the words from our data with the stopwords (is, the, how, are, you, etc.) filtered out.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python", "pythoner", "rocks", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
Stemming looks only at the form of the word, whereas lemmatization looks at its meaning. This means that after applying lemmatization, we always get a valid dictionary word.
import nltk
text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
WordNet is a lexical database (i.e. a dictionary) for the English language, designed specifically for natural language processing.
Synset is a simple interface in NLTK for looking up words in WordNet. Synset instances are groupings of synonymous words that express the same concept. Some words have only one Synset, while others have several.
from nltk.corpus import wordnet
syns = wordnet.synsets("program")
print(syns)
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)
print(fdist.most_common(2))
#<FreqDist with 25 samples and 30 outcomes>
[('is', 3), (',', 2)]
# Frequency Distribution Plot
import matplotlib.pyplot as plt
fdist.plot(30, cumulative=False)
plt.show()
This brings us to the end of this article.
Check out more blogs on my website and YouTube channel.
Thank you for reading 🙂