Introduction to Sentiment Analysis Using the Python NLTK Library


In this tutorial we will explore the Python NLTK library and how we can use it to understand text, i.e. sentiment analysis. We will start with the basics of NLTK and, after getting some idea of what it offers, move on to sentiment analysis. So, let's jump straight into it.

What is Sentiment Analysis?
Sentiment analysis is a common Natural Language Processing (NLP) task that can be used to identify and extract opinions from a given text. The goal is to understand the attitude, sentiments and emotions of a speaker or writer based on their text.

What is the Python NLTK library?
NLTK stands for Natural Language Toolkit. It is a massive toolkit containing packages that help machines understand human language and respond to it appropriately. The library can be used for tasks such as tokenization, stemming, lemmatization, punctuation handling, character counts, word counts and more.
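
For instance, once the library is installed (see the next section) and its data downloaded, stemming and lemmatization take only a couple of lines. Here is a minimal sketch using NLTK's PorterStemmer and WordNetLemmatizer (it assumes the wordnet data has been downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # studi  (crude suffix stripping)
print(lemmatizer.lemmatize("studies"))  # study  (dictionary-based base form)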

Installing the library

For Python 2.x:
pip install nltk

For Python 3.x:
pip3 install nltk

On macOS you may need: sudo pip3 install nltk

Once the library is installed, import it and open the data downloader:

# import the library and launch the NLTK data downloader
import nltk
nltk.download()

A download interface will open; click "all" and then click "Download" to fetch every package. This may take a while…
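
If you would rather not download everything, you can fetch only the data packages this tutorial actually uses; a minimal sketch:

import nltk

nltk.download('punkt')          # tokenizer models for sent_tokenize / word_tokenize
nltk.download('stopwords')      # stop word lists
nltk.download('movie_reviews')  # labelled movie review corpus used later
nltk.download('wordnet')        # only needed for the lemmatization example above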

Tokenize sentences

Text can be split into individual sentences: using the NLTK method sent_tokenize() we can tokenize a text into a set of sentences.

from nltk.tokenize import sent_tokenize
text = "Hi my name is Uzair. I studied from IBA. And live in Karachi Pakistan"
print(sent_tokenize(text))

Tokenize words

We can split a text or sentence into tokens (words) using the NLTK method word_tokenize():

from nltk.tokenize import word_tokenize
text = "Hi my name is Uzair. I studied from IBA. And live in Pakistan"
print(word_tokenize(text))
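
Note that word_tokenize also treats punctuation marks as separate tokens, so the printed list looks roughly like this:

# ['Hi', 'my', 'name', 'is', 'Uzair', '.', 'I', 'studied', 'from', 'IBA', '.', 'And', 'live', 'in', 'Pakistan']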

Let's save our results into different variables:

sentences = sent_tokenize(text)
words = word_tokenize(text)
print(sentences)
print(words)

Stop Words

A text may contain filler words like 'am', 'who' and 'where' that carry little meaning on their own; we can remove these stop words from the text. There is no universal list of stop words in NLP, but the NLTK library provides one.

from nltk.corpus import stopwords
set(stopwords.words('english'))

# A set is an unordered collection with no duplicate elements.
# For example: x = [1, 1, 2, 2, 2, 2, 2, 3, 3]
# set(x) returns {1, 2, 3}

In the example below, we will remove stop words from a sentence.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sample = "Stopwords code which contain a sample sentence, showing off the stop words filtration."

stop_words_array = set(stopwords.words('english'))

words = word_tokenize(sample)
filtered_sentence = []
for w in words:
    if w not in stop_words_array:   # keep only words that are not stop words
        filtered_sentence.append(w)

print(words)
print('After removing stopwords')
print(filtered_sentence)
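
The same filtering can also be written as a one-line list comprehension, which is the style we will use later for the movie review corpus:

filtered_sentence = [w for w in words if w not in stop_words_array]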

IMDB Data

We will now import a movie review dataset from nltk.corpus and clean it in the same way.

from nltk.corpus import stopwords
from nltk.corpus import movie_reviews
import string

# stop words plus punctuation carry little sentiment information
useless_words = stopwords.words('english') + list(string.punctuation)
filtered_words = [word for word in movie_reviews.words() if word not in useless_words]
print(len(filtered_words)/1e6)   # number of remaining words, in millions

Let's print out the most common words from the filtered words:

from collections import Counter
word_counter = Counter(filtered_words)
most_common_words = word_counter.most_common()[:10]
print(most_common_words)

Sentiment Analysis on Movie Corpus

Machine learning classification is a standard technique for building sentiment models. We will build a sentiment classifier using the movie review corpus. Classification requires labelled data, and this is where we take advantage of the bag-of-words representation together with the curated positive and negative reviews we downloaded. We will implement a bag-of-words function and attach a positive or negative label to each review's bag of words.

from nltk.corpus import movie_reviews

def build_bag_of_words_features(words):
    # map every useful word to 1, ignoring stop words and punctuation
    return {word: 1 for word in words if word not in useless_words}

positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')

negative_features = [(build_bag_of_words_features(movie_reviews.words(fileids=[f])), 'neg')
                     for f in negative_reviews]
positive_features = [(build_bag_of_words_features(movie_reviews.words(fileids=[f])), 'pos')
                     for f in positive_reviews]

print(len(negative_features))
print(len(positive_features))
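
To make the representation concrete, here is a small hypothetical call showing what the function produces for a handful of made-up tokens: the stop words 'but' and 'a' are dropped, and every remaining word maps to 1.

example = build_bag_of_words_features(['great', 'acting', 'but', 'a', 'weak', 'plot'])
print(example)   # {'great': 1, 'acting': 1, 'weak': 1, 'plot': 1}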

We will use the Naive Bayes classifier for this task. It is a very simple classifier with a probabilistic approach to classification: the relationships between the input features and the class labels are expressed as probabilities. Given the input features of a sample, the probability of each class is estimated, and the class with the highest probability determines the label.
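
As a purely conceptual sketch (not NLTK's internal implementation, and with made-up probabilities), the decision rule looks like this:

# score(label) = P(label) * product over words w of P(w | label)
p_pos = 0.5 * 0.08 * 0.02   # P(pos) * P("great" | pos) * P("boring" | pos)
p_neg = 0.5 * 0.01 * 0.09   # P(neg) * P("great" | neg) * P("boring" | neg)
print('pos' if p_pos > p_neg else 'neg')   # -> pos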

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

The Naive Bayes classifier is one of the simplest supervised machine learning classifiers; we will train it on 80% of the data so that it learns which words are generally associated with positive or negative reviews.
Remember, we had 1,000 records in each of the positive and negative feature lists. Using the first 800 reviews of each gives us our 80% training set, so we'll store that number, 800, in a variable called split.

split = 800

sentiment_classifier = NaiveBayesClassifier.train(positive_features[:split] + negative_features[:split])

We trained on the first 800 positive features and the first 800 negative features (remember they carry the labels pos and neg). Let's first check the accuracy on that same training data.

nltk.classify.util.accuracy(sentiment_classifier, positive_features[:split] + negative_features[:split])

We can see that the accuracy on the training data is about 98%, which looks good. But how will the model behave on the remaining 20% it has never seen?

nltk.classify.util.accuracy(sentiment_classifier, positive_features[split:] + negative_features[split:])

The accuracy on this held-out data is around 71%. The estimated accuracy for a human on this task is about 80%, so around 70% is pretty good for such a simple model.

Remember, we had a large vocabulary and the sentiment classifier used all of those words, but which of them gave us this fairly high accuracy? The classifier we built has a method called show_most_informative_features(); we can run it to see which words (features) in those reviews were most informative.

sentiment_classifier.show_most_informative_features()
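
As an extra usage example (not part of the original walkthrough), the trained classifier can also label a brand-new sentence by running it through the same bag-of-words function; the exact prediction depends on the trained model:

from nltk.tokenize import word_tokenize

review = "the plot was dull and the acting was terrible"
features = build_bag_of_words_features(word_tokenize(review))
print(sentiment_classifier.classify(features))   # most likely 'neg'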

So, in this tutorial we started with the basics of the NLTK library and worked our way up to using it for sentiment analysis. There are many training datasets available online, such as university-published corpora. A good dataset will increase the accuracy of your classifier: the more data you have, the better the results will be.

The full code can be downloaded from my GitHub:
https://github.com/uzairaj/Sentiment-Analysis

Thank you for reading; I hope you found this tutorial helpful.

