Sentiment Analysis Model

Project Details

This project is a Python-based sentiment analysis model that I developed to predict user sentiment (positive, neutral, or negative) from text. It applies natural language processing and classification techniques, using libraries such as scikit-learn and pandas, to build and evaluate a logistic regression model.

Key Features

Text Vectorization with TF-IDF: The project utilizes the TfidfVectorizer from scikit-learn to convert raw text data into numerical features, capturing the importance of each word relative to the document and the entire corpus.
Model Training and Optimization: A logistic regression model is trained on the vectorized text data, with hyperparameter tuning performed using GridSearchCV to improve model performance.
Performance Evaluation: The model's predictions are assessed with a classification report, a confusion matrix, and the Jaccard index.
Cross-Validation: Cross-validation is used to estimate how well the model generalizes to unseen data.
Interactive Data Visualization: Sentiment clusters are visualized in 2D and 3D spaces using PCA, with color mapping to represent different sentiment classes.
User-Model Interaction: The trained model is serialized, and users can enter text in the console to have its sentiment analyzed.
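
The training, tuning, and evaluation steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy `texts` and `labels`, the use of a scikit-learn `Pipeline`, the `C` grid, and the macro-averaged Jaccard score are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, jaccard_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (8 examples per class); the real dataset is not shown here.
texts = [
    "I love this product, it works great",
    "Absolutely fantastic service and friendly staff",
    "What a wonderful experience, highly recommend",
    "Great quality and fast delivery, very happy",
    "This is the best purchase I have made",
    "Excellent support, they solved my problem quickly",
    "I am really pleased with the results",
    "Amazing value, I would buy it again",
    "The package arrived on Tuesday",
    "It is a phone with a screen and a battery",
    "The store opens at nine in the morning",
    "I ordered the blue version of the item",
    "The manual describes the setup steps",
    "The meeting is scheduled for next week",
    "This model comes in three sizes",
    "The report has ten pages",
    "I hate this product, it broke immediately",
    "Terrible service and rude staff",
    "What an awful experience, do not recommend",
    "Poor quality and slow delivery, very unhappy",
    "This is the worst purchase I have made",
    "Useless support, they ignored my problem",
    "I am really disappointed with the results",
    "Horrible value, I would never buy it again",
]
labels = ["positive"] * 8 + ["neutral"] * 8 + ["negative"] * 8
class_names = ["negative", "neutral", "positive"]

# Hold out a stratified test set so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Assumption: vectorizer and classifier are chained in a Pipeline so that
# TF-IDF statistics are refit inside each cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumption: only the regularization strength C is tuned.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set.
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred, zero_division=0))

cm = confusion_matrix(y_test, y_pred, labels=class_names)
print("Confusion matrix (rows = true, columns = predicted):")
print(cm)

# Multiclass Jaccard needs an averaging strategy; macro weights classes equally.
jaccard = jaccard_score(y_test, y_pred, average="macro")
print("Macro Jaccard index:", jaccard)

# Cross-validate the tuned pipeline on the training data.
cv_scores = cross_val_score(search.best_estimator_, X_train, y_train, cv=3)
print("Cross-validation accuracy per fold:", cv_scores)
```

With a corpus this small the scores are not meaningful; the sketch only shows how the pieces fit together.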

Technologies & Resources Used

IDE: Developed using PyCharm and Jupyter Notebook.
Libraries: Incorporated various Python libraries such as pandas, scikit-learn, scipy, matplotlib, seaborn, numpy, and pickle.
Natural Language Processing: Used scikit-learn’s TfidfVectorizer for text preprocessing and feature extraction.
Machine Learning: Implemented logistic regression for sentiment classification, with performance optimization via GridSearchCV.
Dimensionality Reduction: Utilized PCA for reducing the dimensionality of the feature set, enabling the visualization of sentiment clusters.
Data Handling: Employed pandas for data manipulation and preprocessing, and scipy for handling sparse matrices.
Visualization: Used matplotlib and seaborn for plotting and visualizing the results, including 2D and 3D plots.
Serialization: Model and vectorizer were serialized using pickle for future reuse and deployment.
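
As a rough illustration of the PCA and pickle items above, the sketch below projects TF-IDF features to 2D and 3D and round-trips the model and vectorizer through pickle. The sample `texts`, the dictionary bundle layout, and the `predict_sentiment` helper are hypothetical and chosen only for the example. The sparse TF-IDF matrix is converted to a dense array before PCA, which is practical only for small vocabularies.

```python
import pickle

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical sample data; the real dataset is not shown here.
texts = [
    "I love this, it is great",
    "Fantastic and wonderful service",
    "Very happy with the excellent quality",
    "The package arrived on Tuesday",
    "The store opens at nine",
    "The report has ten pages",
    "I hate this, it is terrible",
    "Awful and horrible service",
    "Very unhappy with the poor quality",
]
labels = ["positive"] * 3 + ["neutral"] * 3 + ["negative"] * 3

# Fit the vectorizer and classifier (X is a scipy sparse matrix).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Densify, then reduce to 2 and 3 principal components for plotting.
X_dense = X.toarray()
coords_2d = PCA(n_components=2).fit_transform(X_dense)
coords_3d = PCA(n_components=3).fit_transform(X_dense)

# Serialize both objects together so they stay in sync, then restore them.
# pickle.dump / pickle.load with a file handle is the on-disk equivalent.
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})
restored = pickle.loads(blob)


def predict_sentiment(text, bundle=restored):
    """Return the predicted sentiment label for one piece of text."""
    features = bundle["vectorizer"].transform([text])
    return bundle["model"].predict(features)[0]


print("2D shape:", coords_2d.shape, "3D shape:", coords_3d.shape)
print("Prediction:", predict_sentiment("the service was wonderful"))
```

The columns of `coords_2d` or `coords_3d` can be passed to a matplotlib scatter plot with one color per label, and wrapping `predict_sentiment` in an `input()` loop would give the console interaction described earlier. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.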

Main Takeaways

I gained an understanding of natural language processing techniques, focusing on transforming textual data into meaningful features for machine learning models.
I acquired valuable experience in optimizing machine learning models for better performance, including the use of cross-validation techniques to ensure robustness.
I gained hands-on experience with classification models, particularly logistic regression, and with improving their accuracy and interpretability.
I applied dimensionality reduction methods such as PCA to visualize high-dimensional data effectively, which made complex information more interpretable.
I developed strong proficiency in evaluating model performance using various metrics, which allowed me to make informed decisions based on the results.
I emphasized the importance of clear documentation and collaboration when working on complex machine learning tasks, ensuring that the code and outcomes are understandable and reproducible by others.
Throughout the project, I used scikit-learn extensively, which strengthened my practical machine learning skills.