Custis Trail Bike Count Forecaster
In this article, I will step through the process of forecasting the number of bicycles traveling on the Custis trail daily by using sophisticated machine learning techniques. Originally, I was planning on using data from the Bikeometer in Rosslyn, but it has been offline since March 25th. Thankfully, there are several other counters on this trail as you can see from my earlier article mapping Arlington’s bike and pedestrian counters.
The longest running and probably most reliable trail counter in Arlington is the Custis Rosslyn EcoMulti installed in October of 2009. It is located on the Custis trail near where westbound lane of Lee Highway crosses over I-66. Given its long time series of data and reliability, it is probably the best source of data from this project.
My first step in this project was to launch Anaconda and start this Python 3.5 Jupyter Notebook. Then I import the modules I plan to use for this project. They are as follows:
import pandas as pd
import requests
from xml.etree import ElementTree
import numpy as np
Using the requests module to retrieve data from Bike Arlington Web Services, I pulled all the daily counter data for bicycles crossing the Custis Rosslyn counter (counterid=4) for the October 1, 2009 to May 15, 2017 time period.
url = "http://webservices.commuterpage.com/counters.cfc?wsdl&method=GetCountInDateRange&counterid=4&startDate=10/01/2009&endDate=05/15/2017&direction=&mode=B&interval=d"
xmlfile = open('xml_bikeometer.xml', 'w')
xmldata = requests.get(url)
xmlfile.write(xmldata.text)
xmlfile.close()
xml_data = 'xml_bikeometer.xml'
The code below parses the xml data and stores it in a Pandas dataframe, which is kind of like an Excel table.
document = ElementTree.parse(xml_data)
date = []
count = []
direction = []
for day in document.findall('count'):
date.append(day.attrib['date'])
count.append(day.attrib['count'])
direction.append(day.attrib['direction'])
dfbikes = pd.DataFrame({'date' : date, 'count': count, 'direction': direction}, dtype = int)
dfbikes['date'] = pd.to_datetime(dfbikes.date)
dfbikes = dfbikes.groupby('date').sum()
dfbikes.tail()
Then I loaded in weather data that I downloaded from the NOAA National Centers for Environmental Information. This U.S. Local Climatological Data for the National Airport weather station ended up being somewhat problematic. The data structure was a bit difficult to work with and had large blocks of missing data for some weather measurements, such as relative humidity. Given these issues, I decided to only extract the snow data and instead get the rest of the historical weather data from Weather Underground.
weather = pd.read_csv('weather09t17.csv', usecols=['REPORTTPYE','DATE', 'DAILYMaximumDryBulbTemp','DAILYAverageRelativeHumidity',
'DAILYAverageWindSpeed', 'DAILYPrecip', 'DAILYSnowfall', 'DAILYSnowDepth'])
daily_weather = weather.loc[weather['REPORTTPYE'] == 'SOD', :]
daily_weather['date'] = pd.to_datetime(daily_weather.DATE, format='%Y-%m-%d').dt.date
daily_weather = daily_weather.set_index('date')
snow_weather = daily_weather[['DAILYSnowfall', 'DAILYSnowDepth']]
snow_weather.tail()
swcolumns = snow_weather.columns
swindex = pd.date_range('2017-05-14', periods = 2).date
sw_append = pd.DataFrame([[0.0, 0], [0.0, 0]], columns=swcolumns, index=swindex)
snow_weather = snow_weather.append(sw_append)
snow_weather.tail()
The weather_kdca0917.csv file contains the 2009 to May 15, 2017 Arlington, Virginia historical weather data from Weather Underground. The available weather fields are listed below.
df_wu = pd.read_csv('weather_kdca0917.csv', index_col='date', parse_dates=True)
df_wu.dtypes
The dfsunrise data is sunrise and sunset times for Arlington, Virginia. It obtained from the U.S. Naval Observatory.
dfsunrise = pd.read_pickle('sunrise120120102017.pkl')
I then concatenated the four previous tables by date.
bikedataframe = pd.concat([dfbikes, snow_weather, df_wu, dfsunrise], axis=1)
#bikedataframe = bikedataframe.drop(['DATE', 'REPORTTPYE'], axis=1)
bikedataframe.head()
Then I converted the data to floating point numbers in preparation for the machine learning modeling. I also calculated weekday versus weekends, converted the sunset and sunrise times to fractional hours, and then calculated the number of hours of daylight time. The year variable was also adjusted to be the number of years since 2000 for scaling purposes.
bikedataframe = bikedataframe.drop(['sunrise', 'sunset'], 1)
bikedataframe = bikedataframe.replace(to_replace='T', value=0.01)
bikedataframe = bikedataframe.astype(float)
bikedataframe["dayofweek"] = bikedataframe.index.dayofweek
bikedataframe["weekday"] = 0
bikedataframe['weekday'] = np.where(bikedataframe['dayofweek'] < 5, 1, 0)
bikedataframe['sunsethour'] = (bikedataframe['sunset_int'] // 100) + ((bikedataframe['sunset_int'] % 100) / 60)
bikedataframe['sunrisehour'] = (bikedataframe['sunrise_int'] // 100) + ((bikedataframe['sunrise_int'] % 100) / 60)
bikedataframe['day_hrs'] = bikedataframe['sunsethour'] - bikedataframe['sunrisehour']
bikedataframe['year'] = bikedataframe['year'] - 2000
bikedataframe.tail(10)
Holidays can also influence bike usage, especially commuter patterns. I used a built in Pandas function to get all the Federal holidays for 2009 to 2017.
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2009', '2018')
bikedataframe = bikedataframe.join(pd.Series(1, index=holidays, name='holiday'))
bikedataframe['holiday'].fillna(0, inplace=True)
holidays
I also noticed that Bike to Work Day also impacts bike counts, so I added in those dates as bike_events. Are there any other bike events that should be added to these?
bike_events = pd.DatetimeIndex(['2010-05-21', '2011-05-20', '2012-05-18', '2013-05-17', '2014-05-16', '2015-05-15', '2016-05-20', '2017-05-19'])
bike_events
bikedataframe = bikedataframe.join(pd.Series(1, index=bike_events, name='bike_events'))
bikedataframe['bike_events'].fillna(0, inplace=True)
I then did a bit of data cleaning. First, I dropped all rows with any missing data.
bikedataframe.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
Then I dropped any rows with counter data that did not have positive bike counts. This eliminated days when the counter might not have been working. (After making the bike counter model, we could go back and predict the bike counts for those days when the counter might not have been working correctly.)
bikedataframe = bikedataframe.loc[bikedataframe['count'] > 0]
bikedataframe.shape
bikedataframe.dtypes
I decided to merge the snow data into one snowtotal field that merges snowfall and snow depth measurements.
bikedataframe['snowtotal'] = np.where(bikedataframe['DAILYSnowfall'] > bikedataframe['DAILYSnowDepth'],
bikedataframe['DAILYSnowfall'], bikedataframe['DAILYSnowDepth'])
After some playing around, I decided to include the following fields in the daily bike counter model:
- Max TemperatureF
- Min TemperatureF
- Max Humidity
- Max Wind SpeedMPH
- PrecipitationIn
- weekday
- holiday
- day_hrs
- snowtotal
- Fog
- year
- sunsethour
- CloudCover
- bike_events
names = ['Max TemperatureF', 'Min TemperatureF', 'Max Humidity', 'Max Wind SpeedMPH', 'PrecipitationIn', 'weekday', 'holiday', 'day_hrs',
'snowtotal', 'Fog', 'year', 'sunsethour', 'CloudCover', 'bike_events']
Next I set the model X and Y variables.
X = bikedataframe[names]
X.shape
y = bikedataframe['count']
Now that the data is all cleaned up and selected, it is ready for modeling. However, I wasn’t sure which model to use and how to specify it. This process could have taken me days of research and I did try traditional linear regression and a few slightly more advanced methods before I decided to let the machine do the work. I found a python module called TPOT that automatically creates and optimizes machine learning pipelines using genetic programming.
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
I set aside 25 percent of the data as a testing dataset and trained the model with the remaining 75 percent of the data. This helps avoid overfitting issues and allows us to test the model on data that was not used to create the model.
X_train, X_test, y_train, y_test = train_test_split(X, y,
train_size=0.75, test_size=0.25, random_state=1)
After to TPOT runs for an hour or two on the training dataset, it produces the recommended machine learning code that works best for this dataset.
tpot = TPOTRegressor(verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('bikeometer_pipeline_custis4boost.py')
TPOT recommended a machine learning pipeline that uses RobustScaler that standardizes the dataset to be robust to outliers. Then a model is created using Gradient Boosting for regression.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
exported_pipeline = make_pipeline(
RobustScaler(),
GradientBoostingRegressor(alpha=0.8, loss="huber", max_depth=5, max_features=0.45, min_samples_leaf=1, min_samples_split=20)
)
exported_pipeline.fit(X_train, y_train)
results = exported_pipeline.predict(X_test)
Using the text data, it appears the model explains 85.6% of the variance between the predicted and actual bike count scores in the test data. The model seems to work very well.
from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score
print('R^2 test: %.3f' % (r2_score(y_test, results)))
print(mean_squared_error(y_test, results))
print(explained_variance_score(y_test, results))
Using all the data (test and training), the model explained variance score is 0.91. The best possible score is 1.0.
y_predit = exported_pipeline.predict(X)
print('R^2 train: %.3f' % (r2_score(y, y_predit)))
exported_pipeline.named_steps['gradientboostingregressor'].feature_importances_
Looking at importance of each of the variables in this model, we can see that maximum temperature, minimum temperature, year and sunset hour are the most important. Fog was the least important feature. I might even want to consider dropping that component in the future from the model.
feature_importance = exported_pipeline.named_steps['gradientboostingregressor'].feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
print(names)
print(feature_importance)
Below is a plot of the actual counts in blue versus the daily predicted bike counts in green. I recommend right clicking on the chart to open the image in a new browser tab to see more of the details. The predicted counts and actual counts appear to track very close to each other in most cases.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set()
bikedataframe['predicted'] = y_predit
bikedataframe[['count', 'predicted']].plot(figsize=(24,16),alpha=0.5)
Below you'll find a table that compares the actual count to predicted counts for the past 30 days. Most days the predictions are pretty close. The only item that jumps out at me is for May 6th. The prediction (789) was much higher than the actual count (319). I wonder if there was some disruption on the trail that day.
bikedataframe[['count', 'predicted']].tail(30)
Let me try to use this model to predict the daily bike counts for the Custis Rosslyn counter for the next few days. The weather forecast and other data for the next few days are listed below.
dfsunrise.loc['2017-05-17': '2017-05-20']
forecast_data = {'date': ['2017-05-17', '2017-05-18', '2017-05-19', '2017-05-20'],
'Max TemperatureF': [93.0, 92, 89, 70],
'Min TemperatureF': [71.0, 71, 59, 55],
'Max Humidity': [78.0, 64, 84, 71],
'Max Wind SpeedMPH': [11.0, 12, 11, 10],
'PrecipitationIn': [0.0, 0.0, 0.22, 0],
'weekday': [1, 1, 1, 0],
'holiday': [0, 0, 0, 0],
'day_hrs': [14.36, 14.39, 14.42, 14.45],
'snowtotal': [0, 0, 0, 0],
'Fog': [0, 0, 0, 0],
'year': [17.0, 17.0, 17.0, 17.0],
'sunsethour': [20.267, 20.283, 20.3, 20.317],
'CloudCover': [1, 3, 4, 3],
'bike_events': [0, 0, 1, 0]}
df_forecast = pd.DataFrame.from_dict(forecast_data)
df_forecast['date'] = pd.to_datetime(df_forecast.date)
df_forecast = df_forecast.set_index("date")
F = df_forecast[names]
y_forecast = exported_pipeline.predict(F)
y_forecast
print(int(y_forecast[0]), "bike counts forecast for May 17, 2017.")
print(int(y_forecast[1]), "bike counts forecast for May 18, 2017.")
print(int(y_forecast[2]), "bike counts forecast for May 19, 2017.")
print(int(y_forecast[3]), "bike counts forecast for May 20, 2017.")
I'll check back in the next few days and list the actual count data in the comments section below.
As you could imagine, a model of daily bike usage could be useful for planning purposes. You could also use a tool like this to monitor data quality from the counter sensors by detecting large deviations from predicted counts. It also might alert government staff when external issues might be impacting bike trail usage.
Comments
Comments powered by Disqus