The data was obtained from the UCI Machine Learning repository.
Variable | Description | Data type |
---|---|---|
instant | record index | Numeric |
dteday | Date | Datetime |
season | season (1:winter, 2:spring, 3:summer, 4:fall) | Categorical |
yr | year (0: 2011, 1:2012) | Categorical |
mnth | month (1 to 12) | Categorical |
holiday | whether the day is a holiday or not | Categorical, binary |
weekday | day of the week | Categorical |
workingday | 1 if the day is neither a weekend nor a holiday, otherwise 0 | Categorical, binary |
weathersit | weather situation (1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog) | Categorical |
temp | Normalized temperature in Celsius | Numerical |
atemp | Normalized feeling temperature in Celsius | Numerical |
hum | Normalized humidity | Numerical |
windspeed | Normalized wind speed | Numerical |
casual | Count of casual users | Numerical |
registered | Count of registered users | Numerical |
cnt | Count of total rental bikes including casual and registered (target) | Numerical |
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
bike_df = pd.read_csv("bike_rental_raw.csv")
bike_df
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 1518 | 1600 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
726 | 727 | 2012-12-27 | 1 | 1 | 12 | 0 | 4 | 1 | 2 | 0.254167 | 0.226642 | 0.652917 | 0.350133 | 247 | 1867 | 2114 |
727 | 728 | 2012-12-28 | 1 | 1 | 12 | 0 | 5 | 1 | 2 | 0.253333 | 0.255046 | 0.590000 | 0.155471 | 644 | 2451 | 3095 |
728 | 729 | 2012-12-29 | 1 | 1 | 12 | 0 | 6 | 0 | 2 | 0.253333 | 0.242400 | 0.752917 | 0.124383 | 159 | 1182 | 1341 |
729 | 730 | 2012-12-30 | 1 | 1 | 12 | 0 | 0 | 0 | 1 | 0.255833 | 0.231700 | 0.483333 | 0.350754 | 364 | 1432 | 1796 |
730 | 731 | 2012-12-31 | 1 | 1 | 12 | 0 | 1 | 1 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
731 rows × 16 columns
bike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instant     731 non-null    int64
 1   dteday      731 non-null    object
 2   season      731 non-null    int64
 3   yr          731 non-null    int64
 4   mnth        731 non-null    int64
 5   holiday     731 non-null    int64
 6   weekday     731 non-null    int64
 7   workingday  731 non-null    int64
 8   weathersit  731 non-null    int64
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64
 14  registered  731 non-null    int64
 15  cnt         731 non-null    int64
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB
bike_df.shape
(731, 16)
bike_df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
instant | 731.0 | 366.000000 | 211.165812 | 1.000000 | 183.500000 | 366.000000 | 548.500000 | 731.000000 |
season | 731.0 | 2.496580 | 1.110807 | 1.000000 | 2.000000 | 3.000000 | 3.000000 | 4.000000 |
yr | 731.0 | 0.500684 | 0.500342 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
mnth | 731.0 | 6.519836 | 3.451913 | 1.000000 | 4.000000 | 7.000000 | 10.000000 | 12.000000 |
holiday | 731.0 | 0.028728 | 0.167155 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
weekday | 731.0 | 2.997264 | 2.004787 | 0.000000 | 1.000000 | 3.000000 | 5.000000 | 6.000000 |
workingday | 731.0 | 0.683995 | 0.465233 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
weathersit | 731.0 | 1.395349 | 0.544894 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 |
temp | 731.0 | 0.495385 | 0.183051 | 0.059130 | 0.337083 | 0.498333 | 0.655417 | 0.861667 |
atemp | 731.0 | 0.474354 | 0.162961 | 0.079070 | 0.337842 | 0.486733 | 0.608602 | 0.840896 |
hum | 731.0 | 0.627894 | 0.142429 | 0.000000 | 0.520000 | 0.626667 | 0.730209 | 0.972500 |
windspeed | 731.0 | 0.190486 | 0.077498 | 0.022392 | 0.134950 | 0.180975 | 0.233214 | 0.507463 |
casual | 731.0 | 848.176471 | 686.622488 | 2.000000 | 315.500000 | 713.000000 | 1096.000000 | 3410.000000 |
registered | 731.0 | 3656.172367 | 1560.256377 | 20.000000 | 2497.000000 | 3662.000000 | 4776.500000 | 6946.000000 |
cnt | 731.0 | 4504.348837 | 1937.211452 | 22.000000 | 3152.000000 | 4548.000000 | 5956.000000 | 8714.000000 |
## Convert dteday to datetime
bike_df['dteday'] = pd.to_datetime(bike_df['dteday'])
bike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instant     731 non-null    int64
 1   dteday      731 non-null    datetime64[ns]
 2   season      731 non-null    int64
 3   yr          731 non-null    int64
 4   mnth        731 non-null    int64
 5   holiday     731 non-null    int64
 6   weekday     731 non-null    int64
 7   workingday  731 non-null    int64
 8   weathersit  731 non-null    int64
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64
 14  registered  731 non-null    int64
 15  cnt         731 non-null    int64
dtypes: datetime64[ns](1), float64(4), int64(11)
memory usage: 91.5 KB
# Convert season, mnth, yr, holiday, weekday, workingday and weathersit to categorical variables
bike_df['season'] = bike_df['season'].astype('category')
bike_df['mnth'] = bike_df['mnth'].astype('category')
bike_df['yr'] = bike_df['yr'].astype('category')
bike_df['holiday'] = bike_df['holiday'].astype('category')
bike_df['weekday'] = bike_df['weekday'].astype('category')
bike_df['workingday'] = bike_df['workingday'].astype('category')
bike_df['weathersit'] = bike_df['weathersit'].astype('category')
bike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instant     731 non-null    int64
 1   dteday      731 non-null    datetime64[ns]
 2   season      731 non-null    category
 3   yr          731 non-null    category
 4   mnth        731 non-null    category
 5   holiday     731 non-null    category
 6   weekday     731 non-null    category
 7   workingday  731 non-null    category
 8   weathersit  731 non-null    category
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64
 14  registered  731 non-null    int64
 15  cnt         731 non-null    int64
dtypes: category(7), datetime64[ns](1), float64(4), int64(4)
memory usage: 57.9 KB
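The per-column conversions can also be written as a single `astype` call that maps column names to dtypes. A minimal sketch on a toy frame; `toy_df` and its values are made up for illustration and only the column names mirror the dataset:

```python
import pandas as pd

# Toy frame with made-up values (hypothetical, not rows from the dataset).
toy_df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "yr": [0, 0, 1, 1],
    "temp": [0.20, 0.45, 0.70, 0.30],
})

# Convert several columns at once by passing a {column: dtype} mapping.
categorical_cols = ["season", "yr"]
toy_df = toy_df.astype({col: "category" for col in categorical_cols})

print(toy_df.dtypes)
```

Columns that are not listed in the mapping (here `temp`) keep their original dtype.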
bike_df.head(10)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 1518 | 1600 |
5 | 6 | 2011-01-06 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.089565 | 88 | 1518 | 1606 |
6 | 7 | 2011-01-07 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | 0.208839 | 0.498696 | 0.168726 | 148 | 1362 | 1510 |
7 | 8 | 2011-01-08 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.165000 | 0.162254 | 0.535833 | 0.266804 | 68 | 891 | 959 |
8 | 9 | 2011-01-09 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0.138333 | 0.116175 | 0.434167 | 0.361950 | 54 | 768 | 822 |
9 | 10 | 2011-01-10 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.150833 | 0.150888 | 0.482917 | 0.223267 | 41 | 1280 | 1321 |
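# Create correlation matrix for numerical variables
# numeric_only=True skips the datetime and category columns (required in pandas >= 2.0)
correlation_matrix = bike_df.corr(numeric_only=True)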
plt.figure(figsize=(12,10))
sns.heatmap(data=correlation_matrix, annot = True, fmt='.2f', linewidths=.5)
<AxesSubplot:>
# Look for outliers
fig = px.histogram(bike_df, x="cnt",
marginal="box",
hover_data=bike_df.columns)
fig.show()
# Bike hire by season
fig = px.violin(bike_df, y = "cnt", x = "season", color = "season", box = True, points = "all",
hover_data = bike_df.columns)
fig.update_layout(title_text = "Bike hire by season")
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Season")
fig.show()
# Bike hire by year
fig = px.violin(bike_df, y = "cnt", x = "yr", color = "yr", box = True, points = "all",
hover_data = bike_df.columns)
fig.update_layout(title_text = "Bike hire by year")
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Year")
fig.show()
# Bike hire by month
fig = px.violin(bike_df, y = "cnt", x = "mnth", color = "mnth", box = True, points = "all",
hover_data = bike_df.columns)
fig.update_layout(title_text = "Bike hire by month")
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Month")
fig.show()
# Bike hire by holiday
fig = px.violin(bike_df, y = "cnt", x = "holiday", color = "holiday", box = True, points = "all",
hover_data = bike_df.columns)
fig.update_layout(title_text = "Bike hire by holiday")
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Holiday")
fig.show()
# Bike hire by week day
fig = px.violin(bike_df, y = "cnt", x = "weekday", color = "weekday", box = True, points = "all",
hover_data = bike_df.columns)
fig.update_layout(title_text = "Bike hire by week day")
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Week day")
fig.show()
# Bike hire by work day
fig = px.violin(bike_df, y = "cnt", x = "workingday", color = "workingday", box = True, points = "all",
hover_data = bike_df.columns)
fig.update_layout(title_text = "Bike hire by work day")
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Work day")
fig.show()
# Bike hire by weather type
fig = px.violin(bike_df, y = "cnt", x = "weathersit", color = "weathersit", box = True, points = "all",
hover_data = bike_df.columns)
fig.update_layout(title_text = "Bike hire by weather")
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Weather")
fig.show()
# Bike hire by date
fig = px.line(bike_df, x="dteday", y="cnt", title='Bike hired by day', hover_data = bike_df.columns)
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Date")
fig.show()
# Linear relationship?
fig = px.scatter(bike_df, x="temp", y="cnt", title='Bike hired by temperature', hover_data = bike_df.columns)
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Temperature")
fig.show()
# Linear relationship?
fig = px.scatter(bike_df, x="hum", y="cnt", title='Bike hired by humidity', hover_data = bike_df.columns)
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Humidity")
fig.show()
# Linear relationship?
fig = px.scatter(bike_df, x="windspeed", y="cnt", title='Bike hired by wind speed', hover_data = bike_df.columns)
fig.update_yaxes(title = "Total number of hired bikes")
fig.update_xaxes(title = "Wind speed")
fig.show()
bike_df.set_index("dteday", inplace = True)
bike_df.head(10)
dteday | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011-01-01 | 1 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
2011-01-02 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
2011-01-03 | 3 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
2011-01-04 | 4 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
2011-01-05 | 5 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 1518 | 1600 |
2011-01-06 | 6 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.089565 | 88 | 1518 | 1606 |
2011-01-07 | 7 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | 0.208839 | 0.498696 | 0.168726 | 148 | 1362 | 1510 |
2011-01-08 | 8 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.165000 | 0.162254 | 0.535833 | 0.266804 | 68 | 891 | 959 |
2011-01-09 | 9 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0.138333 | 0.116175 | 0.434167 | 0.361950 | 54 | 768 | 822 |
2011-01-10 | 10 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.150833 | 0.150888 | 0.482917 | 0.223267 | 41 | 1280 | 1321 |
The total number of bikes hired (`cnt`) was moderately correlated with the temperature, the feeling temperature and the count of casual users, and highly correlated with the count of registered users. Since the total is the sum of the casual and registered counts, `casual` and `registered` were dropped from the dataset. Furthermore, `temp` and `atemp` were correlated with each other, so `atemp` was dropped from the dataset.
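One way to make this kind of redundancy check explicit is to list every pair of numeric columns whose absolute correlation exceeds a cutoff. The sketch below is illustrative only: the helper `correlated_pairs`, the 0.8 threshold and the synthetic data are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, |r|) for numeric column pairs with |r| > threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = float(corr.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Synthetic data (hypothetical): "a_noisy" is almost a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.05, size=200),
    "b": rng.normal(size=200),
})

pairs = correlated_pairs(toy, threshold=0.8)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal, in the same way `atemp` was removed in favour of `temp`.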
Bike rental is influenced by the season, year, month, holiday and weather type. In contrast, the week day and working day do not appear to influence the number of bikes hired. For this reason, `weekday` and `workingday` were deleted from the dataset.
The date (`dteday`) has an effect on the number of bike rentals, which shows up as differences between months. This effect is also captured by the `mnth` variable.
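A quick way to see that a month variable carries the seasonal part of the date is to derive the month from a datetime index and compare per-month averages. A minimal sketch with made-up daily counts (the frame `toy` is hypothetical and does not use values from the dataset):

```python
import pandas as pd

# Made-up daily counts over three months, indexed by date.
dates = pd.date_range("2011-01-01", "2011-03-31", freq="D")
toy = pd.DataFrame({"cnt": range(len(dates))}, index=dates)

# Derive the month from the datetime index, then average the counts per month.
toy["mnth"] = toy.index.month
monthly_mean = toy.groupby("mnth")["cnt"].mean()

print(monthly_mean)
```

If the monthly averages differ clearly, the month column already encodes much of the date-driven variation.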
The variable `instant` is only a record index, so it was removed to avoid overfitting of the model.
bike_df = bike_df.drop(columns = ["weekday", "workingday", "instant", "casual", "registered", "atemp"])
bike_df.columns
Index(['season', 'yr', 'mnth', 'holiday', 'weathersit', 'temp', 'hum', 'windspeed', 'cnt'], dtype='object')
#Check for missing values
bike_df.isnull().sum()
season 0
yr 0
mnth 0
holiday 0
weathersit 0
temp 0
hum 0
windspeed 0
cnt 0
dtype: int64
# Save clean dataframe
bike_df.to_csv("bike_rental_clean.csv", index=False)
# Separate target
features_name = bike_df.columns[:8]
X = bike_df[features_name]
y = bike_df["cnt"]
print("The features are", X.columns)
print("The target variable is", y.name)
The features are Index(['season', 'yr', 'mnth', 'holiday', 'weathersit', 'temp', 'hum', 'windspeed'], dtype='object')
The target variable is cnt
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=123)
print('Training dataset: X_train=', X_train.shape, ', y_train', y_train.shape)
print('Testing dataset: X_test=', X_test.shape, ', y_test', y_test.shape)
Training dataset: X_train= (548, 8) , y_train (548,)
Testing dataset: X_test= (183, 8) , y_test (183,)
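The 548/183 split follows from scikit-learn's default behaviour: when neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, rounding the test count up. A small arithmetic sketch of that rule (the variable names are illustrative):

```python
import math

n_samples = 731
test_size = 0.25  # scikit-learn's default when no size is specified

# The test count is rounded up; the training set receives the remaining rows.
n_test = math.ceil(test_size * n_samples)  # 182.75 -> 183
n_train = n_samples - n_test

print(n_train, n_test)
```

Passing an explicit `test_size` to `train_test_split` makes the split ratio visible in the code instead of relying on the default.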
baseline = np.mean(y_train)
y_baseline = np.repeat(baseline, len(y_test))
from sklearn.metrics import mean_squared_error
naive_MSE = mean_squared_error(y_test, y_baseline)
naive_RMSE=np.sqrt(naive_MSE)
print("Naive baseline RMSE: ", round(naive_RMSE, 2))
print("Naive baseline MSE: ", round(naive_MSE, 2))
Naive baseline RMSE: 1806.52
Naive baseline MSE: 3263499.33
from sklearn.dummy import DummyRegressor
dummy_baseline = DummyRegressor(strategy="median")
dummy_baseline.fit(X_train, y_train)
# Predict from the test features (not the test target); the dummy model ignores their values
prediction = dummy_baseline.predict(X_test)
dummy_MSE = mean_squared_error(y_test, prediction)
dummy_RMSE = np.sqrt(dummy_MSE)
print("Dummy regression baseline RMSE: ", round(dummy_RMSE, 2))
print("Dummy regression baseline MSE: ", round(dummy_MSE, 2))
Dummy regression baseline RMSE: 1800.18
Dummy regression baseline MSE: 3240664.92
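Both baselines predict a single constant for every day: the training mean in the first case and the training median in the second. The sketch below, using made-up numbers and a hypothetical helper `constant_rmse`, shows how the RMSE of a constant predictor is computed. On the data the constant is computed from, the mean always gives the lowest MSE; on a separate test set that guarantee no longer holds, which is how the median baseline can score slightly better than the mean baseline above.

```python
import numpy as np

def constant_rmse(y_true: np.ndarray, constant: float) -> float:
    """RMSE obtained by predicting the same constant for every sample."""
    return float(np.sqrt(np.mean((y_true - constant) ** 2)))

# Made-up, right-skewed targets (hypothetical, not values from the dataset).
y_toy = np.array([100.0, 200.0, 300.0, 400.0, 2000.0])

rmse_mean = constant_rmse(y_toy, float(y_toy.mean()))        # constant = 600
rmse_median = constant_rmse(y_toy, float(np.median(y_toy)))  # constant = 300

print(round(rmse_mean, 2), round(rmse_median, 2))
```

Any model trained later should beat both constants on the test set to be considered useful.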