Sentiment Analysis Model
Project Details
This project is a Python-based sentiment analysis model that I developed to predict user sentiment (positive, neutral, or negative) from text. It applies natural language processing techniques and classification algorithms, using libraries such as scikit-learn and pandas, to build and evaluate a logistic regression model.
Key Features
- Text Vectorization with TF-IDF: The project utilizes the TfidfVectorizer from scikit-learn to convert raw text data into numerical features, capturing the importance of each word relative to the document and the entire corpus.
- Model Training and Optimization: A logistic regression model is trained on the vectorized text data, with hyperparameter tuning performed using GridSearchCV to improve model performance.
- Performance Evaluation: The model’s predictions are assessed with a classification report (per-class precision, recall, and F1), a confusion matrix, and the Jaccard index.
- Cross-Validation: Cross-validation is conducted to ensure the model’s robustness and generalization to unseen data.
- Interactive Data Visualization: Sentiment clusters are visualized in 2D and 3D spaces using PCA, with color mapping to represent different sentiment classes.
- User-Model Interaction: The trained model is serialized, and users can enter text in the console to have its sentiment analyzed.
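
The training, tuning, and evaluation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy corpus, the parameter grid, the 3-fold setting, and the use of a scikit-learn Pipeline (so the vectorizer is re-fit inside each fold) are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, jaccard_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (6 examples per class); the real dataset is not shown here.
texts = [
    "I love this product", "Absolutely fantastic experience",
    "Great quality and fast delivery", "Very happy with the service",
    "This is wonderful and amazing", "Excellent, I would buy again",
    "The package arrived on Tuesday", "It is a phone with a screen",
    "The box contains one cable", "I ordered the blue version",
    "The store opens at nine", "Delivery took three days",
    "I hate this product", "Terrible experience, very disappointed",
    "Poor quality and slow delivery", "Very unhappy with the service",
    "This is awful and broken", "Worst purchase, I want a refund",
]
labels = ["positive"] * 6 + ["neutral"] * 6 + ["negative"] * 6
LABELS = ["negative", "neutral", "positive"]

# Hold out 6 examples (2 per class) for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=6, stratify=labels, random_state=0
)

# TF-IDF features feeding a logistic regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed grid: regularization strength and n-gram range.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out examples.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=LABELS)
jaccard = jaccard_score(y_test, y_pred, labels=LABELS, average="macro", zero_division=0)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred, labels=LABELS, zero_division=0))
print("Confusion matrix:\n", cm)
print("Macro Jaccard index:", jaccard)

# Separate cross-validation pass over the full toy corpus.
cv_scores = cross_val_score(search.best_estimator_, texts, labels, cv=3)
print("Cross-validation accuracy per fold:", cv_scores)
```

On a corpus this small the scores are not meaningful; the sketch only shows how the pieces fit together.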
Technologies & Resources Used
- IDE: Developed using PyCharm and Jupyter Notebook.
- Libraries: Incorporated various Python libraries such as pandas, scikit-learn, scipy, matplotlib, seaborn, numpy, and pickle.
- Natural Language Processing: Used scikit-learn’s TfidfVectorizer for text preprocessing and feature extraction.
- Machine Learning: Implemented logistic regression for sentiment classification, with performance optimization via GridSearchCV.
- Dimensionality Reduction: Utilized PCA for reducing the dimensionality of the feature set, enabling the visualization of sentiment clusters.
- Data Handling: Employed pandas for data manipulation and preprocessing, and scipy for handling sparse matrices.
- Visualization: Used matplotlib and seaborn for plotting and visualizing the results, including 2D and 3D plots.
- Serialization: Model and vectorizer were serialized using pickle for future reuse and deployment.
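
As a rough sketch of how the dimensionality reduction and serialization pieces might fit together, the example below projects TF-IDF features with PCA and round-trips the vectorizer and model through pickle. The toy texts, the `predict_sentiment` helper, and the in-memory `pickle.dumps` / `pickle.loads` round trip (instead of files on disk) are assumptions for illustration only.

```python
import pickle

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; the project's real dataset is not shown here.
texts = [
    "I love this product", "Great quality and fast delivery",
    "The package arrived on Tuesday", "The box contains one cable",
    "I hate this product", "Poor quality and slow delivery",
]
labels = ["positive", "positive", "neutral", "neutral", "negative", "negative"]

# TfidfVectorizer returns a scipy sparse matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Densify before PCA; this is fine for small data, but large corpora
# may call for a sparse-friendly alternative.
X_dense = X.toarray()
coords_2d = PCA(n_components=2).fit_transform(X_dense)
coords_3d = PCA(n_components=3).fit_transform(X_dense)
print("2D coordinates shape:", coords_2d.shape)
print("3D coordinates shape:", coords_3d.shape)

# Serialize vectorizer and model together, then restore them.
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})
restored = pickle.loads(blob)
restored_vectorizer = restored["vectorizer"]
restored_model = restored["model"]


def predict_sentiment(text, vectorizer, model):
    """Hypothetical helper: vectorize one string and return its predicted label."""
    return model.predict(vectorizer.transform([text]))[0]


print(predict_sentiment("I love it", restored_vectorizer, restored_model))
```

The columns of `coords_2d` can be passed to matplotlib's `scatter` with colors mapped from the class labels, and a console interface could wrap `predict_sentiment` around `input()`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.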
Main Takeaways
- I gained an understanding of natural language processing techniques, focusing on transforming textual data into meaningful features for machine learning models.
- I acquired valuable experience in optimizing machine learning models for better performance, including the use of cross-validation techniques to ensure robustness.
- I gained hands-on experience with classification models, particularly logistic regression, which was essential for improving model accuracy and interpretability.
- I applied dimensionality reduction methods such as PCA to visualize high-dimensional data effectively, which made complex information more interpretable.
- I developed strong proficiency in evaluating model performance using various metrics, which allowed me to make informed decisions based on the results.
- I emphasized the importance of clear documentation and collaboration when working on complex machine learning tasks, ensuring that the code and outcomes are understandable and reproducible by others.
- Throughout the project, I used scikit-learn extensively, which strengthened my practical machine learning skills.