Tweet Sentiment Analysis:
A Machine Learning Approach

Classifying tweet sentiments as positive, negative, or neutral using a custom logistic regression model.

By Amith MG

1. Project Overview

Understanding public opinion and sentiment is crucial in many domains, from brand monitoring to social trend analysis. This project builds a machine learning model that classifies the sentiment of a tweet as positive, negative, or neutral.

Problem: Automatically categorize the emotional tone of short, informal text (tweets).
Goal: Develop a robust predictive model capable of accurately classifying tweet sentiment.

Impact: Provide a tool for rapid sentiment assessment, valuable for social listening, market research, and public relations.


2. Data & Methodology

Data Summary

Source(s): Utilized the nltk.corpus.twitter_samples dataset, containing pre-labeled positive and negative tweets.

Key Features: Features were engineered based on the frequency of words appearing in positive vs. negative contexts within the training corpus.

Preprocessing: Each tweet was cleaned by removing stock tickers, retweets, hyperlinks, hashtags, punctuation, and numbers. The remaining text was then lowercased, tokenized, filtered for stop words, and stemmed (using SnowballStemmer).
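
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name process_tweet, the regular expressions, the tiny stop word list, and the choice to drop only the "#" sign while keeping the hashtag word are all assumptions. The stemmer is passed in as a parameter so the sketch runs without NLTK installed; passing nltk.stem.SnowballStemmer("english").stem would match the stemmer named above.

```python
import re
import string

# Tiny illustrative stop word list (an assumption); a fuller list such as
# nltk.corpus.stopwords.words("english") would normally be used.
STOP_WORDS = {"a", "an", "and", "are", "is", "i", "in", "it", "of", "the", "this", "to"}


def process_tweet(tweet, stem=lambda word: word, stop_words=STOP_WORDS):
    """Clean one tweet and return a list of (optionally stemmed) tokens."""
    text = re.sub(r"\$\w+", "", tweet)        # stock tickers such as $GE
    text = re.sub(r"^RT\s+", "", text)        # leading retweet marker
    text = re.sub(r"https?://\S+", "", text)  # hyperlinks
    text = text.replace("#", "")              # assumption: drop the sign, keep the word
    text = re.sub(r"\d+", "", text)           # numbers
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                     # simple whitespace tokenization
    return [stem(tok) for tok in tokens if tok not in stop_words]


tokens = process_tweet("RT Loving the new $AAPL phone!!! 100% #happy https://example.com/x")
print(tokens)
```

Links and tickers are removed before punctuation is stripped, because the patterns rely on characters such as "$" and "://" still being present.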

Approach

Custom Model: A Logistic Regression model was built from scratch, including custom implementations of the sigmoid function, cost function, and gradient descent.
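
For reference, the three pieces named above are usually written as follows. This is the standard textbook formulation of binary logistic regression, assumed here rather than taken from the project's code; the symbols m (number of training tweets), alpha (learning rate), X (feature matrix), and y (label vector) are introduced for illustration.

```latex
% Sigmoid hypothesis for a feature vector x^{(i)} and weights \theta
h\bigl(x^{(i)}\bigr) = \sigma\bigl(\theta^{\top} x^{(i)}\bigr)
                     = \frac{1}{1 + e^{-\theta^{\top} x^{(i)}}}

% Cost: mean binary cross-entropy over m training examples
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
    \Bigl[ y^{(i)} \log h\bigl(x^{(i)}\bigr)
         + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h\bigl(x^{(i)}\bigr)\bigr) \Bigr]

% Gradient descent update with learning rate \alpha, where h(X) applies h to each row of X
\theta \leftarrow \theta - \frac{\alpha}{m} \, X^{\top} \bigl( h(X) - y \bigr)
```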

Feature Extraction: A frequency dictionary (freqs) was constructed mapping (word, sentiment) pairs to their counts. Features for each tweet were then extracted as a vector representing bias, sum of positive word counts, and sum of negative word counts.
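
A minimal sketch of the frequency dictionary and the three-element feature vector is shown below. The function names build_freqs and extract_features, the label encoding (1 for positive, 0 for negative), and the choice to count every token occurrence (rather than each unique word once) are assumptions made for illustration.

```python
from collections import defaultdict


def build_freqs(token_lists, labels):
    """Map (word, label) pairs to how often the word appears under that label."""
    freqs = defaultdict(int)
    for tokens, label in zip(token_lists, labels):
        for word in tokens:
            freqs[(word, label)] += 1
    return freqs


def extract_features(tokens, freqs):
    """Return [bias, sum of positive counts, sum of negative counts] for one tweet."""
    pos = sum(freqs.get((word, 1), 0) for word in tokens)
    neg = sum(freqs.get((word, 0), 0) for word in tokens)
    return [1.0, float(pos), float(neg)]


# Toy corpus of already-preprocessed tweets (hypothetical data)
train_tokens = [["happi", "great"], ["sad", "bad"], ["great", "day"]]
train_labels = [1, 0, 1]

freqs = build_freqs(train_tokens, train_labels)
features = extract_features(["great", "sad", "unknown"], freqs)
print(features)  # [1.0, 2.0, 1.0]
```

Words never seen in training contribute zero to both sums, so an unfamiliar tweet falls back to the bias term alone.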

Training: The model was trained using gradient descent with a predefined learning rate and number of iterations to optimize the weight vector (theta).
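
The training loop can be sketched as below. This is an illustrative version under stated assumptions, not the project's implementation: it uses plain Python lists so it runs without dependencies (the project lists numpy, so the original is probably vectorized), it starts from a zero weight vector, and the toy data, learning rate, and iteration count are made up for the example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, theta):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(X, y, theta, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for x, label in zip(X, y):
        h = min(max(predict_proba(x, theta), eps), 1.0 - eps)
        total += label * math.log(h) + (1 - label) * math.log(1.0 - h)
    return -total / len(X)


def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent starting from a zero weight vector."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(num_iters):
        errors = [predict_proba(x, theta) - label for x, label in zip(X, y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) / m
                for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy feature vectors: [bias, positive count sum, negative count sum]
X = [[1.0, 3.0, 0.0], [1.0, 2.0, 1.0], [1.0, 0.0, 3.0], [1.0, 1.0, 2.0]]
y = [1, 1, 0, 0]

initial_cost = cost(X, y, [0.0, 0.0, 0.0])
theta = gradient_descent(X, y, alpha=0.1, num_iters=500)
final_cost = cost(X, y, theta)
print(initial_cost, final_cost, theta)
```

With real frequency features the count sums can be in the thousands, so a much smaller learning rate than the one used on this toy data is typically needed to keep the updates stable.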

3. Key Results & Insights

Model Performance: The trained Logistic Regression model achieved an accuracy of approximately 99.42% on the test set. Because the dataset contains only positive and negative labels, this figure measures binary positive/negative classification.
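
An accuracy figure of this kind is typically computed as the share of test tweets whose predicted label matches the true label. The sketch below assumes a 0.5 cutoff for the binary evaluation and uses made-up probabilities; the function name accuracy is hypothetical, and the example does not reproduce the reported result.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical model outputs and true labels (1 = positive, 0 = negative)
score = accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
print(score)  # 0.75
```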

Sentiment Range: The model outputs a probability (0 to 1). Tweets with a probability above 0.55 are classified as 'Positive', below 0.45 as 'Negative', and between 0.45 and 0.55 as 'Neutral'.
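
The mapping from probability to label can be written as a small function. The thresholds come from the description above; the function and parameter names are assumptions, as is the treatment of the exact boundary values 0.45 and 0.55 as 'Neutral'.

```python
def label_sentiment(probability, lower=0.45, upper=0.55):
    """Map a model probability in [0, 1] to a sentiment label."""
    if probability > upper:
        return "Positive"
    if probability < lower:
        return "Negative"
    return "Neutral"


for p in (0.9, 0.5, 0.1):
    print(p, label_sentiment(p))
```

The neutral band is a heuristic on the model's confidence rather than a learned class, since the training data has no neutral examples.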

Feature Influence: The model's weights (theta) directly reflect the influence of positive and negative word counts on the final sentiment prediction.

4. Tools & Technologies

Languages: Python, HTML, CSS (Tailwind CSS), JavaScript

Key Python Libraries: numpy, nltk, pickle, Flask, flask-cors, gunicorn

Development Environment: Jupyter Notebooks, VS Code

Version Control: Git

Deployment: Frontend on GitHub Pages, Backend API on Render

5. Conclusion & Future Work

This project demonstrates the development and deployment of a custom machine learning model for tweet sentiment classification. Building the logistic regression from scratch gave a deeper understanding of its mechanics and produced a highly accurate classifier for the given dataset.

Future Work 1: Expand to multi-class sentiment (e.g., happy, sad, angry) or fine-grained emotion detection.

Future Work 2: Explore more advanced NLP techniques and models (e.g., Transformers like BERT) for potentially higher accuracy and nuanced understanding.

Future Work 3: Integrate with live Twitter (X) API for real-time sentiment monitoring of specific topics or hashtags.

Future Work 4: Develop a more interactive dashboard to visualize sentiment trends over time or across different categories of tweets.

Connect with Me

Amith MG | amithds2017@gmail.com