
Sentiment Analysis: Detecting The Sentiment of a Text

I've been doing a bit of machine learning lately and have been messing around with TensorFlow's Keras API. As a fun little project I used their sentiment140 dataset to train a model to classify text as either positive or negative. I trained this model with Python and wrote it in a Jupyter notebook. I saved the model and used TensorFlow.js to load the model onto my website, which you can play around with below.
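The export step isn't shown in the post, but turning a saved Keras model into something TensorFlow.js can serve is roughly a one-liner with the tensorflowjs Python package. This is only a sketch; the file name and output directory below are placeholders, not necessarily what my notebook uses.

```python
import tensorflow as tf
import tensorflowjs as tfjs  # pip install tensorflowjs

# Load the trained Keras model (saved from the notebook) and write out a
# TensorFlow.js-compatible model.json plus weight shards for the website
model = tf.keras.models.load_model("sentiment_model.h5")
tfjs.converters.save_keras_model(model, "static/sentiment-model")
```

On the page itself, TensorFlow.js's tf.loadLayersModel() can then fetch the generated model.json; the tokenizer's word index also has to ship alongside it so the browser can turn words into the same numbers the model was trained on.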


I have a laptop that stores my daily backup, which runs at midnight every day. My machine is connected to this laptop by a spare Ethernet cable. I thought it a waste not to take advantage of its spare computing power, so I used it to train and test my model. I ran jupyter lab --no-browser on it to start a Jupyter Lab server, then ssh -4NL 8888:localhost:8888 backup on my machine to open an IPv4-only, shell-less SSH tunnel that forwards port 8888 on my machine to port 8888 on the laptop, where the Jupyter server is listening. I use my machine for a lot of stuff, such as web surfing and reading, so training the models on the laptop instead takes a lot of the stress off of it, since training models is really computationally expensive.

The model works by loading the sentiment140 dataset and cleaning each text: punctuation and stopwords (words that don't contribute to the sentiment of a text) are removed, and the polarity, a number indicating the sentiment of the text, is extracted.
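The notebook itself isn't embedded here, but the cleaning step looks roughly like the sketch below. The column names and stopword list are assumptions rather than a transcript of my code (judging from the output, my list keeps negations like "not"); sentiment140 encodes polarity as 0 for negative and 4 for positive, which gets mapped to 0 and 1.

```python
import re
import pandas as pd

# Placeholder stopword list -- the real notebook uses a fuller one,
# though it apparently keeps negations like "not"
STOPWORDS = {"a", "an", "the", "i", "is", "to", "and", "of", "in", "im", "my"}

def clean_text(text):
    """Lowercase a tweet, strip punctuation, and drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return [word for word in text.split() if word not in STOPWORDS]

# sentiment140 stores polarity as 0 (negative) or 4 (positive);
# the column names here are assumptions about how the CSV was loaded
df = pd.read_csv("sentiment140.csv", encoding="latin-1",
                 names=["polarity", "id", "date", "query", "user", "text"])
df["polarity"] = (df["polarity"] == 4).astype(int)  # map 4 -> 1
df["tokens"] = df["text"].apply(clean_text)
```

The result of this cleaning looks like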

1 ['10x', 'cooler']
0 ['okk', 'thats', 'weird', 'cant', 'stop', 'following', 'people', 'twitter', 'tons', 'people', 'unfollow']
1 ['beautiful', 'day', 'not', 'got', 'first', 'class']
1 ['hildygottlieb', 'saying', 'mahaal', 'yesterday', 'everything', 'ever', 'needed', 'know', 'beatles', 'lyrics', 'prove', 'point']
0 ['kinda', 'sad', 'confused', 'guys']
...

You can see the sentiment (0 means negative; 1 means positive) followed by the text, which has been turned into a list of words. The model then turns the words into numbers and left-pads each list with zeros so that every example has the same length.
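The word-to-number step is a standard Keras tokenizer plus padding; something like the following, where the vocabulary size is a guess and the sequence length of 15 comes from the padded examples below.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 3000  # assumption: words outside the top 3000 fall back to the OOV index
MAX_LEN = 15       # every padded example below is 15 numbers long

# df["tokens"] comes from the cleaning step above
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<oov>")
tokenizer.fit_on_texts(df["tokens"])

sequences = tokenizer.texts_to_sequences(df["tokens"])
# padding="pre" pads on the left with zeros; truncating="pre" trims long tweets
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="pre", truncating="pre")
labels = df["polarity"].values
```

The same examples, tokenized and padded, look like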

1 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]
0 [0 0 0 0 1 43 566 13 246 316 73 42 1786 73 1]
1 [0 0 0 0 0 0 0 0 0 224 4 2 14 89 331]
1 [0 0 0 1 602 1 206 284 134 798 20 1 2424 1 681]
0 [0 0 0 0 0 0 0 0 0 0 0 329 46 1072 117]
...

We then feed these lists of numbers into the model for training. It reaches an accuracy of about 0.7868, meaning the model guesses correctly about 79% of the time, which is not that good, but it beats the 50% that random guessing would get. The model gets simple sentences right, like "I love bananas," "The dog is so cute," or "I don't like the new art installation," but it misclassifies "I despise bananas," "I dislike bananas," and "I don't hate chocolate chip cookies."
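My guess as to why "despise" and "dislike" trip it up: they are rare in the tweets, so they either fall outside the vocabulary or never get a strong negative weight, while common words like "hate" and "love" dominate.

The model itself is a small Keras network. The exact architecture below is an assumption rather than a transcript of my notebook: an embedding layer, a pooling layer, and a sigmoid output that predicts the probability the text is positive.

```python
import tensorflow as tf

# VOCAB_SIZE, MAX_LEN, padded, and labels come from the tokenizing step above.
# Assumed architecture: embed each word, average the embeddings,
# and squash through a sigmoid to get P(positive)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(padded, labels, validation_split=0.1, epochs=5, batch_size=512)
model.save("sentiment_model.h5")  # this is the file that gets converted for TensorFlow.js
```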