Flight Delay & On-Time Prediction

A distributed machine learning system built on Azure Databricks that predicts flight delays by processing large-scale aviation and weather datasets, enabling proactive operational optimization.

Research, UC Berkeley

Skills

Machine Learning at Scale, Distributed Computing, Feature Engineering, Predictive Analytics

Tools

PySpark, Azure Databricks, XGBoost, TensorFlow, Python

Flight delays create cascading problems across the aviation industry — disrupted schedules, passenger inconvenience, and significant economic losses for airlines and airports alike. This project tackles the challenge of predicting delays before they occur, enabling proactive optimization of operations.

The Challenge

Aviation delays are influenced by a complex interplay of factors: weather conditions, airport congestion, aircraft availability, and seasonal patterns. Traditional reactive approaches leave airlines scrambling to manage disruptions after they’ve already begun. The goal was to build a predictive system capable of processing massive datasets to forecast delay probability with sufficient lead time for operational intervention.

Approach

Working with large-scale flight, weather, and station data, the team engineered features from multiple data sources and trained machine learning models using distributed computing on Microsoft Azure Databricks. The pipeline included:

  • Data Integration: Merging flight records with weather observations and airport station data
  • Feature Engineering: Creating temporal, geographic, and weather-derived predictors
  • Model Development: Implementing logistic regression baselines, PySpark ML models with cross-validation, and feed-forward neural networks
  • Evaluation: Rigorous cross-validation to assess generalization performance
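
The data integration and feature engineering steps above can be illustrated with a small single-machine sketch. It is not the project's PySpark code. It only shows one plausible join rule: attach to each flight the most recent weather observation taken at least a fixed lead time before scheduled departure, so that no information from after the prediction time leaks into the features. The two-hour lead time, the field names, the station code, and the sample readings are all assumptions made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical lead time: only weather observed at least this long before
# scheduled departure may be used as a predictor.
LEAD_TIME = timedelta(hours=2)


def latest_observation(observations, cutoff):
    """Return the most recent (timestamp, reading) at or before cutoff, or None.

    `observations` must be sorted by timestamp.
    """
    timestamps = [ts for ts, _ in observations]
    idx = bisect_right(timestamps, cutoff)
    return observations[idx - 1] if idx > 0 else None


def build_features(flight, weather_by_station):
    """Join one flight to its origin station's weather and derive predictors."""
    departure = flight["scheduled_departure"]
    obs = latest_observation(
        weather_by_station.get(flight["origin_station"], []),
        departure - LEAD_TIME,
    )
    reading = obs[1] if obs else {}
    return {
        "hour_of_day": departure.hour,        # temporal predictor
        "day_of_week": departure.weekday(),   # Monday == 0
        "wind_speed": reading.get("wind_speed"),   # weather-derived predictors
        "visibility": reading.get("visibility"),
    }


# Made-up sample data, for illustration only.
weather = {
    "KSFO": [
        (datetime(2019, 1, 7, 5, 0), {"wind_speed": 4.0, "visibility": 10.0}),
        (datetime(2019, 1, 7, 7, 0), {"wind_speed": 9.0, "visibility": 3.0}),
    ]
}
flight = {
    "origin_station": "KSFO",
    "scheduled_departure": datetime(2019, 1, 7, 8, 30),
}
features = build_features(flight, weather)
```

In this example the 07:00 observation is skipped because it falls inside the two-hour window before the 08:30 departure, so the 05:00 reading is used instead. At scale, the same rule would typically be expressed as a distributed join rather than a per-flight lookup.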

Technical Implementation

The project used PySpark for distributed data processing, enabling analysis at a scale impractical on a single machine. XGBoost and TensorFlow models were trained and evaluated, with hyperparameters tuned through cross-validation. Plotly, Seaborn, and Matplotlib supported exploratory analysis and the communication of results.
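
The write-up does not say how the cross-validation folds were constructed. As a minimal sketch, assuming the rows are sorted by time, one common option for flight data is an expanding-window split, where each fold validates on a block that is strictly later than everything it trained on. Randomly shuffled folds could otherwise let later flights inform predictions about earlier ones. The function name and the fold layout below are illustrative assumptions, not the team's implementation.

```python
def time_ordered_folds(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for rows sorted by time.

    The rows are cut into n_folds + 1 contiguous blocks. Fold k trains on
    the first k blocks and validates on the block that follows, so the
    validation rows are always later than the training rows.
    """
    if n_folds < 1 or n_rows < n_folds + 1:
        raise ValueError("need n_folds >= 1 and at least n_folds + 1 rows")
    block = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The final fold absorbs any remainder rows.
        valid_end = n_rows if k == n_folds else train_end + block
        yield range(0, train_end), range(train_end, valid_end)


folds = [
    (list(train), list(valid))
    for train, valid in time_ordered_folds(n_rows=10, n_folds=3)
]
```

Each candidate hyperparameter setting would then be scored on every fold's validation block, and the setting with the best average score kept.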

Team

This project was completed as part of UC Berkeley’s W261 (Machine Learning at Scale) course in collaboration with Ning (Louis) Li, Andrew Fogarty, and Siduo (Stone) Jiang.