World Bank Document Classification

A multi-label, multi-class classification challenge focused on identifying topics within World Bank publications.

Competition

A DrivenData machine learning competition with 18,660 training samples. The task was to predict labels for 18,738 test documents across categories such as Macroeconomics, Poverty Reduction, and Technology.

Approach

The winning solution addressed significant data quality issues and class imbalance through SMOTE oversampling and a custom vocabulary. A critical discovery was that approximately 4,000 training observations had been inadvertently included in the test data, which provided the competitive advantage.
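
As an illustration of how overlap between training and test documents might be detected, here is a minimal sketch that fingerprints normalized text and looks for exact matches. It is not the solution's actual code: the function names (normalize, fingerprint, find_overlap) and the sample documents are hypothetical, and a real pipeline would probably also need fuzzy matching for near-duplicates.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(text):
    """Hash the normalized text so documents can be compared cheaply."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_docs, test_docs):
    """Return (test_index, train_index) pairs whose normalized text is identical."""
    seen = {}
    for i, doc in enumerate(train_docs):
        # Keep the first training index for each distinct fingerprint.
        seen.setdefault(fingerprint(doc), i)
    pairs = []
    for j, doc in enumerate(test_docs):
        key = fingerprint(doc)
        if key in seen:
            pairs.append((j, seen[key]))
    return pairs


# Hypothetical sample data, for illustration only.
train_docs = ["Poverty reduction in rural areas.", "Macroeconomic outlook 2015"]
test_docs = ["Technology transfer report", "poverty   reduction in RURAL areas."]

overlap = find_overlap(train_docs, test_docs)
print(overlap)
```

Once matches are found, the known training labels can be looked up for the matching test rows, which is presumably how such a leak turns into an advantage.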

Technologies

Applied NLP techniques including TF-IDF, word embeddings (Word2Vec, GloVe, custom), stemming, and adversarial validation. Built LSTM/GRU neural networks with Keras and TensorFlow, complemented by LightGBM for ensembling.
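
To make the TF-IDF step concrete, here is a small standard-library sketch of the weighting. It assumes whitespace tokenization and a smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1. The tfidf function and the sample documents are illustrative only, and the original work most likely relied on a library implementation instead.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF (an assumption of this sketch, not taken from the source).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency is the share of the document's tokens.
        vectors.append({term: (count / total) * idf[term]
                        for term, count in counts.items()})
    return vectors


# Hypothetical sample documents.
docs = [
    "poverty reduction policy",
    "macroeconomics policy",
    "technology policy",
]
vectors = tfidf(docs)
print(vectors[0])
```

In this toy corpus "policy" appears in every document, so it receives a lower weight than "poverty", which appears in only one. That down-weighting of ubiquitous terms is what makes TF-IDF useful for topic classification.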
