Weather, Time and Taxi Ridership

The ability to predict taxi ridership could give city planners and taxi dispatchers valuable insight into questions such as where to position cabs so they are available where they are most needed, how many taxis to dispatch, and how ridership varies over time.

I decided to look into this by predicting the number of taxi pickups in Chicago from the time and the weather conditions.

I combined historical weather data from Weather Underground and taxi ridership data from the City of Chicago data portal. I used Python’s Beautiful Soup and Selenium to scrape hourly weather information. I downloaded the taxi ridership data in CSV format.

Both the weather data and the ridership data are time series. Before combining them, I sorted each dataset by date and time and aggregated it into equally spaced time bins. I then merged the two datasets by matching the time bins.
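As a rough illustration of the binning and merging step, here is a minimal sketch using pandas. The column names (`timestamp`, `temperature`, `trip_start`, `trip_id`), the toy values, and the 3-hour bin width are all assumptions made for the example, not the actual schema of the Weather Underground or City of Chicago data.

```python
import pandas as pd

# Toy stand-ins for the two datasets; column names and values are hypothetical.
weather = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2014-01-01 00:53", "2014-01-01 01:53", "2014-01-01 02:53",
        "2014-01-01 03:53", "2014-01-01 04:53", "2014-01-01 05:53",
    ]),
    "temperature": [10.0, 9.0, 8.0, 7.0, 7.0, 6.0],
})
trips = pd.DataFrame({
    "trip_start": pd.to_datetime([
        "2014-01-01 00:15", "2014-01-01 01:30", "2014-01-01 02:45",
        "2014-01-01 03:00", "2014-01-01 05:15",
    ]),
    "trip_id": [1, 2, 3, 4, 5],
})

BIN = pd.Timedelta(hours=3)  # assumed bin width

# Sort by time, then average the weather readings within each bin.
weather_binned = (
    weather.sort_values("timestamp")
    .set_index("timestamp")
    .resample(BIN)
    .mean()
)

# Sort by time, then count the pickups within each bin.
rides_binned = (
    trips.sort_values("trip_start")
    .set_index("trip_start")
    .resample(BIN)["trip_id"]
    .count()
    .rename("rides")
)

# Both indexes now hold bin start times, so an index join matches the bins.
merged = weather_binned.join(rides_binned, how="inner")
print(merged)
```

Averaging suits numeric readings such as temperature; a text field such as the weather description would need a different aggregation, for example the most frequent value in each bin.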

The predictor variables (features) in my model included temperature, humidity, pressure, wind speed, a description of the weather conditions (e.g. rain, snow, storm, drizzle), the time block (e.g. 12 am to 3 am), and the day of the week (e.g. Sunday, Monday).
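Categorical features such as the weather description, time block, and day of the week have to be turned into numbers before they can enter a regression. One common way is one-hot encoding, sketched below with pandas. The column names, the toy rows, and the choice of one-hot encoding itself are assumptions for illustration; the original analysis may have encoded these features differently.

```python
import pandas as pd

# Hypothetical binned rows; in practice these would come from the merged data.
binned = pd.DataFrame({
    "bin_start": pd.to_datetime([
        "2014-01-05 00:00", "2014-01-05 03:00", "2014-01-06 09:00",
    ]),
    "temperature": [-12.0, -13.5, -20.0],
    "conditions": ["snow", "clear", "rain"],
})

# Derive the time features from the bin start time.
binned["time_block"] = binned["bin_start"].dt.hour // 3   # 0 means 12 am to 3 am
binned["day_of_week"] = binned["bin_start"].dt.day_name()

# Expand each categorical column into one indicator column per category.
features = pd.get_dummies(
    binned.drop(columns="bin_start"),
    columns=["conditions", "time_block", "day_of_week"],
)
print(features.columns.tolist())
```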

I built one model for each month. As a first pass, I explored the correlation between individual variables such as temperature and the number of rides. I found little correlation between individual features and the target. The following plot shows an example from January 2014.

[Figure: correlation between individual features and the number of rides, January 2014]
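A first-pass check like this can be sketched with pandas by computing the Pearson correlation of each numeric feature with the ride count. The data below is synthetic, and the target is generated independently of the features, so it only demonstrates the mechanics and says nothing about the real Chicago data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features with hypothetical names and ranges.
df = pd.DataFrame({
    "temperature": rng.normal(-5, 8, n),
    "humidity": rng.uniform(40, 100, n),
    "wind_speed": rng.uniform(0, 30, n),
})
# Synthetic target drawn independently of the features.
df["rides"] = rng.poisson(500, n)

# Pearson correlation of each feature column with the target.
corr = df.drop(columns="rides").corrwith(df["rides"])
print(corr.sort_values())
```

Pearson correlation only measures linear association, so a low value does not rule out a nonlinear relationship between a feature and the target.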

Intuitively, a weather condition is itself a combination of several weather features, and these features may not have a linear relationship with the target. This motivated me to consider polynomial relationships between the features and the target.

I expanded the feature space with polynomial combinations of the numeric features up to degree three. A large set of features in a regression model can lead to overfitting, because many of the expanded features are correlated and some end up over-represented. Regularized regression algorithms such as Lasso and Ridge are appropriate in this situation. I used Python’s scikit-learn package for the regression tasks.
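The polynomial expansion followed by regularized regression can be sketched with a scikit-learn pipeline. Everything specific here is an assumption for the example: the synthetic data, the `alpha` values, the scaling step, and the single train/test split. The original models were fit on the real monthly data with settings that are not given in this post.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: three numeric features and a target with nonlinear structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] ** 2 - X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
regressors = [
    ("lasso", Lasso(alpha=0.1, max_iter=10000)),  # alpha chosen arbitrarily
    ("ridge", Ridge(alpha=1.0)),                  # alpha chosen arbitrarily
]
for name, reg in regressors:
    model = make_pipeline(
        # All polynomial combinations of the features up to degree 3.
        PolynomialFeatures(degree=3, include_bias=False),
        # Put the expanded features on one scale so the penalty treats them evenly.
        StandardScaler(),
        reg,
    )
    model.fit(X_train, y_train)
    # score() returns R^2, the fraction of variance explained, on held-out data.
    scores[name] = model.score(X_test, y_test)

print(scores)
```

In practice the regularization strength would be tuned by cross-validation, for example with scikit-learn's `LassoCV` and `RidgeCV`, rather than fixed by hand.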

The following plot shows the results of Lasso and Ridge regressions.

[Figure: results of the Lasso and Ridge regressions]

Although Ridge performs slightly better than Lasso, both models indicate that weather and time features can explain about 50% of the variation in ridership.

One known caveat of the model is that I analyzed the data month by month. Weather is a continuous process, so segmenting the weather data by intrinsic pattern (rather than by calendar month) may lead to better predictions.

It would also be interesting to see how Uber and Lyft have affected the city’s taxi ridership in recent years.

Written on April 28, 2017