Language Identification Analysis

About the Project

A machine learning system that identifies whether text written in Devanagari script is Hindi or Marathi. It combines multiple feature extraction techniques with classification models for a comparative study; both classifiers used are Naive Bayes models. This was my course project for CL2 (Computational Linguistics 2).
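To illustrate the general idea of Naive Bayes language identification, here is a minimal sketch of a multinomial Naive Bayes classifier over character bigrams. It is not the project's implementation: the class name NaiveBayesLangID, the add-one smoothing, the use of raw n-gram counts instead of TF-IDF, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesLangID:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                                # n-gram size
        self.alpha = alpha                        # additive smoothing constant
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * vocab_size
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch (not from the project's dataset).
toy_texts = [
    "मैं घर जा रहा हूँ",      # Hindi
    "यह किताब अच्छी है",     # Hindi
    "मी घरी जात आहे",       # Marathi
    "हे पुस्तक चांगले आहे",    # Marathi
]
toy_labels = ["hindi", "hindi", "marathi", "marathi"]

model = NaiveBayesLangID(n=2).fit(toy_texts, toy_labels)
print(model.predict("तो शाळेत जात आहे"))
```

With only four training sentences the prediction is not reliable; a real comparison would need the full dataset prepared by setup.sh.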

GitHub Report

Features

  • Supports Hindi and Marathi language identification
  • Multiple feature extraction methods:
    • Character frequency analysis
    • Word length statistics
    • Character class distribution (vowels, consonants, matras)
    • N-gram analysis
    • Morphological analysis
    • POS tagging features (optional)
    • TF-IDF features
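
As a rough sketch of how a few of the features listed above could be computed, the following uses only the Python standard library. The function names and the Unicode ranges chosen for each character class are assumptions for illustration, not the project's actual code; the ranges cover the core Devanagari letters and skip rarer extensions.

```python
from collections import Counter

# Approximate ranges inside the Devanagari block (U+0900 to U+097F).
VOWELS = range(0x0904, 0x0915)      # independent vowels, U+0904 to U+0914
CONSONANTS = range(0x0915, 0x093A)  # क through ह, U+0915 to U+0939
MATRAS = range(0x093E, 0x094D)      # dependent vowel signs, U+093E to U+094C


def char_class(ch):
    """Classify one character as vowel, consonant, matra, or other."""
    cp = ord(ch)
    if cp in VOWELS:
        return "vowel"
    if cp in CONSONANTS:
        return "consonant"
    if cp in MATRAS:
        return "matra"
    return "other"


def char_class_distribution(text):
    """Relative frequency of each character class, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    counts = Counter(char_class(c) for c in chars)
    total = len(chars) or 1  # avoid division by zero on empty input
    return {k: counts[k] / total
            for k in ("vowel", "consonant", "matra", "other")}


def word_length_stats(text):
    """Mean and maximum word length, measured in Unicode code points."""
    lengths = [len(w) for w in text.split()]
    if not lengths:
        return {"mean": 0.0, "max": 0}
    return {"mean": sum(lengths) / len(lengths), "max": max(lengths)}


def char_ngram_counts(text, n=2):
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


sample = "मी घरी जातो"
print(char_class_distribution(sample))
print(word_length_stats(sample))
print(char_ngram_counts(sample, 2).most_common(3))
```

Word length here counts code points, so a consonant with a matra counts as two; counting grapheme clusters instead would be a different design choice.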

Setup

Run setup.sh to install required packages and download necessary data files.

Usage

Run all cells in model_comparison.ipynb

If running the notebook does not create the <data_size> directory automatically, create it manually.
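
The directory can also be created from Python before running the notebook. This is a small sketch that assumes the directory lives under results/, as the Results section suggests; the helper name ensure_results_dir and the value 10000 are hypothetical placeholders, so substitute the data size you actually configured.

```python
from pathlib import Path


def ensure_results_dir(data_size, base="results"):
    """Create <base>/<data_size> if it is missing and return its path."""
    path = Path(base) / str(data_size)
    # parents=True also creates <base>; exist_ok=True makes reruns harmless.
    path.mkdir(parents=True, exist_ok=True)
    return path


results_dir = ensure_results_dir(10000)  # 10000 is a placeholder value
print(results_dir)
```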

Results

Results of the model comparison are written to results/<data_size>, where <data_size> is the configured combined size of the training and testing data used for the models.