Text Classification using Naive Bayes Method

Naive Bayes Text Classification

Probabilistic Model of Classification

In a probabilistic classification model we want to estimate the value of $P(c|x)$ , the probability of a sample x being of class c. Naive Bayes is one such probabilistic classifier that uses Bayes’ Rule to classify samples. And Naive Bayes is “Naive” because it assumes strong independence among all the features of sample x.

Bayes Rule:

\[P(c|x) = \frac{P(x|c)P(c)}{P(x)}\]

Text Classification using Naive Bayes classifier

Consider the task of classifying textual documents into having positive or negative sentiments. We will design the Naive Bayes classifier for this problem as follows:

Samples are text documents, and their features are the words that comprises these documents.

Each document $d$ is a sequence of words, $d = w_1w_2...w_n$, where $w_i$ are the tokens of the document and $n$ is the total number of tokens in the document $d$.
The training dataset consists of many document, sentiment pairs, ${d_i, s_i}$
Each document $d_i$ is associated with a sentiment $s_i \in \{0,1\}$, $0$ being negative sentiment and $1$ being positive sentiment.

Using Bayes’ Rule we have

\[p(s|d) = \frac{p(d|s)p(s)}{p(d|s)p(s) + p(d|\bar{s})p(\bar{s})}\]

And from the independence assumption of features

\[p(d|s) = p(w_1,w_2,..., w_n|s) = p(w_1|s)p(w_2|s)...p(w_n|s)\]

Also in the IMDb reviews dataset that we are considering here have equal number of positive and negative datasets.

We have $p(s) = 0.5$ and $p(\bar{s})=0.5$.

This simplifies our formulation for $p(s|d)$

\[p(s|d) = \frac{p(d|s)}{p(d|s) + p(d|\bar{s})}\]

If we assign threshold of $p_T(s|d) = 0.5$ for deciding the final label, the model simplifies to,

$y= \begin{cases} 1, & \text{if } p(d|s=1) \geq p(d|s=0)\\ 0, & \text{otherwise} \end{cases}$

A measure for numerical stability

$p(w_i)$ will be very small in magnitude, and when we take a product of such very small numbers to compute $p(d|s)$ , even double precision floating points fail to store such small numbers and becomes zero. Hence, for numerical stability, we will convert the probabilities to log probability,

Github Repository: bsantraigi/Sentiment Analysis

# Import 'os' for preliminary tasks like directory listing etc.
import os

# Import re for regex string matching
import re

# Import nltk for word tokenization
import nltk

# Import Python's native data structures Counter and defaultdict
# Counter - maintains count of element
# defaultdict - dictionary data structure with exception handling for missing keys
from collections import Counter, defaultdict

# Import tqdm for fancy progressbars!
from tqdm import tqdm_notebook

# Import numpy for different mathematical operations on arrays / matrices
import numpy as np

# Install the nltk punkt component for tokenization
nltk.download('punkt')

Downloading the data

!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -P data/

--2019-07-14 03:04:30--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Connecting to 172.16.2.30:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: “data/aclImdb_v1.tar.gz”

100%[======================================>] 84,125,825  2.98M/s   in 36s     

2019-07-14 03:05:08 (2.23 MB/s) - “data/aclImdb_v1.tar.gz” saved [84125825/84125825]

Extract data

Please wait for ~15s

%%time
!tar -xzf data/aclImdb_v1.tar.gz -C data/

CPU times: user 8.3 s, sys: 1.7 s, total: 10 s
Wall time: 5min 13s

Data Samples

Dataset is split into two parts for training and testing
Positive and negative samples are organized in individual folders
Each sample document is stored in a .txt file

Let’s read in the data

data_folder = 'data/aclImdb/'

rp = os.path.join(data_folder, 'train/pos')
train_positive = [os.path.join(rp, f) for f in os.listdir(rp)]
rp = os.path.join(data_folder, 'train/neg')
train_negative = [os.path.join(rp, f) for f in os.listdir(rp)]

rp = os.path.join(data_folder, 'test/pos')
test_positive = [os.path.join(rp, f) for f in os.listdir(rp)]
rp = os.path.join(data_folder, 'test/neg')
test_negative = [os.path.join(rp, f) for f in os.listdir(rp)]

Regex for cleaning html tags

Pattern <.*?> means “anything within two angular brackets”. The qualifier *? denotes “as few times as possible”. This makes sure we match only one html tag at a time.

re_html_cleaner = re.compile(r"<.*?>")

Limit number of samples

To quickly train a small model, consider setting n_train and n_test to some relatively small numbers e.g. 1000. Set, n_train = n_test = -1 to use all the samples available.

n_train = 2500
n_test = 2500

(Conditional) Unigram Counter

Calculates the distribution $$p(w s=1)$and$p(w s=0)$$, empirically, from training data.

# Distribution of word tokens in positive samples
positive_word_counts = Counter()

for _fname in tqdm_notebook(train_positive[:n_train], desc="Crunching +ve samples: "):
    with open(_fname) as f:
        text = f.read().strip()
        text = re_html_cleaner.sub(" ", text)
        positive_word_counts += Counter(nltk.word_tokenize(text))

# Distribution of word tokens in negative samples
negative_word_counts = Counter()

for _fname in tqdm_notebook(train_negative[:n_train], desc="Crunching -ve samples: "):
    with open(_fname) as f:
        text = f.read().strip()
        text = re_html_cleaner.sub(" ", text)
        negative_word_counts += Counter(nltk.word_tokenize(text))

Unigram counts to probability distribution

\[p(w|s) = \frac{N_{s,w}}{N_{s,*}} = \frac{N_{s,w}}{\sum_{w' \in W}N_{s,w'}}\]

Additive Smoothing

Note that, if some token, $u$, unseen in training documents, occurrs in a test document, $p(doc_{test}|s)$ becomes $0$ as $N_{s,u}$ for that token is $0$.
We apply Additive Smoothing to prevent probability from going to zero.

\[p(w|s) = \frac{\alpha + N_{s,w}}{\sum_{w' \in W}(\alpha + N_{s,w'})} = \frac{\alpha + N_{s,w}}{\alpha V + \sum_{w' \in W}N_{s,w'}}\]

where V is the total vocab size.

len_corpus_pos = sum(positive_word_counts.values())
len_corpus_neg = sum(negative_word_counts.values())
V_pos = len(positive_word_counts)
V_neg = len(negative_word_counts)
alpha = 0.1
log_p_vocab_pos = defaultdict(
    lambda: np.log(alpha/len_corpus_pos),
    {w:np.log((alpha + c)/(V_pos*alpha + len_corpus_pos)) for w,c in positive_word_counts.items()}
)
log_p_vocab_neg = defaultdict(
    lambda: np.log(alpha/len_corpus_neg),
    {w:np.log((alpha + c)/(V_neg*alpha + len_corpus_neg)) for w,c in negative_word_counts.items()}
)

p_data_pos = len(train_positive)/(len(train_positive) + len(train_negative))
print(f"Prob. of +ve sentiment in our dataset: {p_data_pos}")

Prob. of +ve sentiment in our dataset: 0.5

get_prob_pos(doc)

A function that accepts a document string as input, tokenizes it and computes the probability $p(d|s=1)$ and $p(d|s=0)$ . It returns 1 if $p(d|s=1) \geq p(d|s=0)$ otherwise 0.

def get_prob_pos(doc):
    text = doc.strip()
    text = re_html_cleaner.sub(" ", text)
    tokens = nltk.word_tokenize(text)
    p_pos = 1
    p_neg = 1
    for token in tokens:
        p_pos += log_p_vocab_pos[token]
        p_neg += log_p_vocab_neg[token]

    return 1.0*(p_pos >= p_neg) #/(p_pos+p_neg)

results = []
for _fname in tqdm_notebook(test_positive[:n_test], desc="Classifying test data: "):
    with open(_fname) as f:
        results.append((1, get_prob_pos(f.read())))


for _fname in tqdm_notebook(test_negative[:n_test], desc="Classifying test data: "):
    with open(_fname) as f:
        results.append((0, get_prob_pos(f.read())))

Performance evaluation of our model

Accuracy: Overall performance of our model, fraction of samples that were labelled correctly

Recall: Out of all +ve data samples in test set, what fraction of it was labelled correctly

Precision: How precise is the model? Out of all samples that were tagged +ve by the model, how many were actually positive.

true_pos = 0
false_pos = 0
true_neg = 0
false_neg = 0
for true_label, pred_label in results:
    if true_label == 1 and pred_label == 1:
        true_pos += 1
    elif true_label == 1 and pred_label == 0:
        false_neg += 1
    elif true_label == 0 and pred_label == 1:
        false_pos += 1
    elif true_label == 0 and pred_label == 0:
        true_neg += 1

print(f"Accuracy: {(true_pos + true_neg)/(true_pos + true_neg + false_pos + false_neg):0.4F}")
print(f"Recall: {(true_pos)/(true_pos + false_neg):0.4F}")
print(f"Precision: {(true_pos)/(true_pos + false_pos):0.4F}")

Accuracy: 0.7904
Recall: 0.7416
Precision: 0.8218