Language Identification Analysis
About the Project
A machine learning system for identifying Hindi and Marathi text written in Devanagari script using multiple feature extraction techniques and classification models for a comparative study. The classifiers used were both Naive Bayes. This is my course project for the course CL2 (Computational Linguistics 2).
Features
- Supports Hindi and Marathi language identification
- Multiple feature extraction methods:
- Character frequency analysis
- Word length statistics
- Character class distribution (vowels, consonants, matras)
- N-gram analysis
- Morphological analysis
- POS tagging features (optional)
- TF-IDF features
Setup
Run setup.sh
to install required packages and download necessary data files.
Usage
Run all cells in model_comparison.ipynb
Create a directory named
<data_size>
in case it is not created automatically on running the notebook.
Results
Results of the model comparison are available in results/<data_size>
where <data_size>
is the configured size of the training + testing data used for the models.