Projects | Sentiment Analysis using Haskell

Project Link

Our Sentiment Analysis project uses a Naive Bayes classifier implemented in pure Haskell to classify movie reviews as either positive or negative. Users can train the model on their own data and get predictions for reviews they enter. We leveraged Haskell's type system and functional programming style to keep the code robust and maintainable.

The application includes a command-line interface (CLI) through which users enter their reviews and receive predictions. Preprocessing cleans the text data by removing stop words, symbols, and punctuation. The model was trained and validated on a publicly available dataset, the IMDb dataset, which contains 25,000 positive and 25,000 negative movie reviews. Users can also supply any other dataset, provided it is formatted correctly as a text file in the designated datasets directory.
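
The cleaning step described above can be sketched in a few lines of Haskell. This is a minimal illustration, not the project's actual code: the names preprocess and stopWords are hypothetical, and the stop-word list is a tiny placeholder. It assumes that lowercasing, replacing every non-letter with a space, and filtering against a Set of stop words is an acceptable approximation of the cleaning.

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Set as Set

-- A tiny illustrative stop-word list (an assumption; the project's real
-- list is not given in this write-up).
stopWords :: Set.Set String
stopWords = Set.fromList
  ["a", "an", "and", "is", "it", "of", "the", "this", "to", "was"]

-- Lowercase the text, turn every non-letter (punctuation, symbols, digits)
-- into a space, split on whitespace, and drop stop words.
preprocess :: String -> [String]
preprocess = filter (`Set.notMember` stopWords) . words . map normalize
  where
    normalize c
      | isAlpha c = toLower c
      | otherwise = ' '
```

For example, preprocess "This movie was GREAT, truly great!!" yields ["movie","great","truly","great"]. One consequence of this simple rule is that contractions are split at the apostrophe, so a real pipeline might treat them separately.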

Questions Addressed

  • How can I effectively implement a Naive Bayes classifier in Haskell for text classification?
    This question focuses on applying Bayes' theorem to build a probabilistic model for sentiment analysis, one that combines prior class probabilities with the words observed in a review to make its prediction.
  • What preprocessing steps are necessary to prepare text data for analysis in a machine learning context?
    This question explores the importance of data cleaning techniques, including removing noise and irrelevant information, to enhance model accuracy.
  • How do I implement a command-line interface that facilitates user interaction with a machine learning model?
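
For the command-line question, one common shape is to parse the arguments into a small command type and dispatch on it. The sketch below is only an illustration: the command names (train, validate, predict), the Command type, and the usage string are all hypothetical, since the project's real commands are not listed here. It uses only getArgs from base.

```haskell
import System.Environment (getArgs)

-- Hypothetical commands; the project's real command names are not given.
data Command
  = Train FilePath
  | Validate FilePath
  | Predict String
  deriving (Show, Eq)

-- Pure argument parsing keeps the IO layer thin and easy to test.
parseArgs :: [String] -> Either String Command
parseArgs ["train", path]          = Right (Train path)
parseArgs ["validate", path]       = Right (Validate path)
parseArgs ("predict" : ws@(_ : _)) = Right (Predict (unwords ws))
parseArgs _ = Left "usage: (train|validate) FILE | predict REVIEW TEXT"

-- Read the arguments and dispatch; the handlers are placeholders.
runCli :: IO ()
runCli = do
  args <- getArgs
  case parseArgs args of
    Left usage -> putStrLn usage
    Right cmd  -> putStrLn ("Would run: " ++ show cmd)
```

Binding main = runCli turns this into an executable. Keeping parseArgs pure means the argument handling can be checked without touching the file system or the model.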

Technical Specifications

  1. Benefits: The ability to classify movie reviews simplifies the process of understanding audience sentiment and can be valuable for filmmakers and marketers.
  2. Functionality: The application provides capabilities for training, validating, and using a sentiment analysis model, with results based on user input.
  3. Technology Stack: Pure Haskell, leveraging its functional programming features and strong typing for reliable application development.
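
Validation of the kind listed above usually means holding back part of the data and measuring accuracy on it. The following is a rough sketch under stated assumptions, not the project's implementation: trainTestSplit, shuffle, nextSeed, and accuracy are hypothetical names, and a simple linear congruential generator stands in for a proper random source so that the example depends only on base.

```haskell
import Data.List (sortOn)

-- One step of a linear congruential generator (constants from Numerical
-- Recipes). Used only to keep the sketch dependency-free; it is not a
-- high-quality random number generator.
nextSeed :: Int -> Int
nextSeed s = (1664525 * s + 1013904223) `mod` 4294967296

-- Shuffle by pairing each element with a pseudo-random key and sorting
-- on that key. The same seed always gives the same order.
shuffle :: Int -> [a] -> [a]
shuffle seed xs = map snd (sortOn fst (zip keys xs))
  where
    keys = drop 1 (iterate nextSeed seed)

-- Split into (training, test), putting the given fraction of the
-- shuffled data into the training set.
trainTestSplit :: Int -> Double -> [a] -> ([a], [a])
trainTestSplit seed fraction xs = splitAt n (shuffle seed xs)
  where
    n = round (fraction * fromIntegral (length xs))

-- Fraction of labelled examples that a classifier gets right.
accuracy :: Eq label => (input -> label) -> [(input, label)] -> Double
accuracy classify examples
  | null examples = 0
  | otherwise     = fromIntegral correct / fromIntegral (length examples)
  where
    correct = length [() | (x, y) <- examples, classify x == y]
```

With a fraction of 0.8, a list of ten reviews is split into eight for training and two for testing. Passing the seed explicitly keeps the split reproducible between runs.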

Implementation Details

  1. Algorithm Overview: Using Bayes' theorem, the Naive Bayes classifier estimates the probability of each class from how often each word occurs in the reviews.
  2. Model Training: Reviews are selected at random for both the training and test sets, and the trained model's state can be saved for future predictions.
  3. Command-Line Interface: The CLI facilitates user interactions, allowing users to train models, validate them, and make predictions using simple commands.
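
The algorithm outlined above can be sketched as a small word-count model. This is a minimal illustration rather than the project's code: the Sentiment and Model types and the train, logScore, and classify functions are hypothetical names. It assumes add-one (Laplace) smoothing and log-probabilities, which are common choices but are not confirmed by this write-up, and it assumes the model has seen at least one review of each class.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (foldl', maximumBy)
import Data.Ord (comparing)

data Sentiment = Positive | Negative
  deriving (Show, Read, Eq, Ord)

-- Per-class review counts and per-class word counts.
data Model = Model
  { docCounts  :: Map.Map Sentiment Int
  , wordCounts :: Map.Map Sentiment (Map.Map String Int)
  } deriving (Show, Read, Eq)

-- Count reviews and words per class from tokenized, labelled reviews.
train :: [([String], Sentiment)] -> Model
train = foldl' step (Model Map.empty Map.empty)
  where
    step (Model docs ws) (tokens, label) =
      Model
        (Map.insertWith (+) label 1 docs)
        (Map.insertWith (Map.unionWith (+)) label
           (Map.fromListWith (+) [(t, 1) | t <- tokens]) ws)

-- Log of prior times word likelihoods for one class, with add-one
-- smoothing so unseen words do not zero out the score.
logScore :: Model -> [String] -> Sentiment -> Double
logScore (Model docs ws) tokens label =
  logPrior + sum (map logLikelihood tokens)
  where
    classWords = Map.findWithDefault Map.empty label ws
    totalWords = sum (Map.elems classWords)
    vocabSize  = Map.size (Map.unionsWith (+) (Map.elems ws))
    totalDocs  = sum (Map.elems docs)
    logPrior   = log (fromIntegral (Map.findWithDefault 0 label docs)
                      / fromIntegral totalDocs)
    logLikelihood t =
      log (fromIntegral (Map.findWithDefault 0 t classWords + 1)
           / fromIntegral (totalWords + vocabSize))

-- Pick the class with the higher score.
classify :: Model -> [String] -> Sentiment
classify model tokens =
  maximumBy (comparing (logScore model tokens)) [Negative, Positive]
```

Because Model derives Show and Read, one simple way to save and reload its state is writeFile path (show model) followed later by read on the file's contents. Whether the project persists its model this way is not stated here.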

Learning Outcomes

Through the development of this project, we gained practical experience in writing Haskell applications and managing Cabal projects. We deepened our understanding of input/output operations and explored various data structures (like Map and Set) and their functionalities. Debugging Haskell's cryptic compile errors also became more intuitive. We became adept at file handling, reading data from and writing data to files, and implementing machine learning algorithms in Haskell, demonstrating the potential of functional programming in data science.