[Python] 캐글 자전거 수요 예측

2022-03-22 10 분 소요

1. 모듈 import

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import os
from scipy import stats

%matplotlib inline

mpl.rcParams['axes.unicode_minus'] = False

2. 데이터 로드

train = pd.read_csv("train.csv", parse_dates=["datetime"])
test = pd.read_csv("test.csv", parse_dates=["datetime"])

train.shape, test.shape

((10886, 12), (6493, 9))

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    6493 non-null   datetime64[ns]
 1   season      6493 non-null   int64
 2   holiday     6493 non-null   int64
 3   workingday  6493 non-null   int64
 4   weather     6493 non-null   int64
 5   temp        6493 non-null   float64
 6   atemp       6493 non-null   float64
 7   humidity    6493 non-null   int64
 8   windspeed   6493 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(5)
memory usage: 456.7 KB

train.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

test.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
dtype: int64

#year, month, day, dayofweek(0~6), quarter, hour, minute, second 컬럼 생성
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['day'] = train['datetime'].dt.day
train['dayofweek'] = train['datetime'].dt.dayofweek
train['quarter'] = train['datetime'].dt.quarter
train['hour'] = train['datetime'].dt.hour
train['minute'] = train['datetime'].dt.minute
train['second'] = train['datetime'].dt.second
train.shape

(10886, 20)

test['year'] = test['datetime'].dt.year
test['month'] = test['datetime'].dt.month
test['day'] = test['datetime'].dt.day
test['dayofweek'] = test['datetime'].dt.dayofweek
test['quarter'] = test['datetime'].dt.quarter
test['hour'] = test['datetime'].dt.hour
test['minute'] = test['datetime'].dt.minute
test['second'] = test['datetime'].dt.second
test.shape

(6493, 17)

train.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count	year	month	day	dayofweek	quarter	hour
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16	2011	1	1	5	1	0
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40	2011	1	1	5	1	1
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32	2011	1	1	5	1	2
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13	2011	1	1	5	1	3
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1	2011	1	1	5	1	4

test.head()

	datetime	season	workingday	weather	temp	atemp	humidity	windspeed	year	month	day	dayofweek	quarter	hour
0	2011-01-20 00:00:00	1	1	1	10.66	11.365	56	26.0027	2011	1	20	3	1	0
1	2011-01-20 01:00:00	1	1	1	10.66	13.635	56	0.0000	2011	1	20	3	1	1
2	2011-01-20 02:00:00	1	1	1	10.66	13.635	56	0.0000	2011	1	20	3	1	2
3	2011-01-20 03:00:00	1	1	1	10.66	12.880	56	11.0014	2011	1	20	3	1	3
4	2011-01-20 04:00:00	1	1	1	10.66	12.880	56	11.0014	2011	1	20	3	1	4

3. EDA

fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3)
fig.set_size_inches(18, 8)

sns.barplot(data = train, x = 'year', y='count', ax=ax1)
sns.barplot(data = train, x = 'month', y='count', ax=ax2)
sns.barplot(data = train, x = 'day', y='count', ax=ax3)
sns.barplot(data = train, x = 'hour', y='count', ax=ax4)
sns.barplot(data = train, x = 'minute', y='count', ax=ax5)
sns.barplot(data = train, x = 'second', y='count', ax=ax6)

ax1.set(ylabel='Count')
ax2.set(ylabel='month')
ax3.set(ylabel='day')
ax4.set(ylabel='hour')

[Text(0, 0.5, 'hour')]

output_14_1

minute와 second는 값이 0밖에 없으므로 해당 Feature는 사용하지 않는다.

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)

sns.boxplot(data = train, y='count', orient="v", ax=axes[0][0])
sns.boxplot(data = train, y='count', x = "season", orient="v", ax=axes[0][1])
sns.boxplot(data = train, y='count', x = "hour", orient="v", ax=axes[1][0])
sns.boxplot(data = train, y='count', x = "workingday", orient="v", ax=axes[1][1])


axes[0][0].set(ylabel='Count')
axes[0][1].set(xlabel = "Season", ylabel='month')
axes[1][0].set(xlabel = "Hour of the Day", ylabel='day')
axes[1][1].set(xlabel = "Woking Day", ylabel='hour')

[Text(0.5, 0, 'Woking Day'), Text(0, 0.5, 'hour')]

output_15_1

hour의 너무 많은 값이 이상치로 나온다. hour를 여러 조건으로 나누어 살펴보자.

fig, ((ax1, ax2, ax3, ax4, ax5)) = plt.subplots(5)
fig.set_size_inches(18, 25)

sns.pointplot(data = train, x = 'hour', y='count', ax=ax1)
sns.pointplot(data = train, x = 'hour', y='count', hue="workingday", ax=ax2)
sns.pointplot(data = train, x = 'hour', y='count', hue="dayofweek", ax=ax3)
sns.pointplot(data = train, x = 'hour', y='count', hue="weather", ax=ax4)
sns.pointplot(data = train, x = 'hour', y='count', hue="season", ax=ax5)

<AxesSubplot:xlabel='hour', ylabel='count'>

output_16_1

주중과 주말, 요일별로 hour당 수요량이 다르다.

train[['temp', 'atemp', 'weather', 'count', 'casual', 'registered']].corr()

	temp	atemp	weather	count	casual	registered
temp	1.000000	0.984948	-0.055035	0.394454	0.467097	0.318571
atemp	0.984948	1.000000	-0.055376	0.389784	0.462067	0.314635
weather	-0.055035	-0.055376	1.000000	-0.128655	-0.135918	-0.109340
count	0.394454	0.389784	-0.128655	1.000000	0.690414	0.970948
casual	0.467097	0.462067	-0.135918	0.690414	1.000000	0.497250
registered	0.318571	0.314635	-0.109340	0.970948	0.497250	1.000000

plt.figure(figsize=(12,10))
sns.heatmap(abs(train.corr()), annot=True, cmap='RdPu')
plt.show()

output_18_0

def concatenate_year_month(datetime) :
    return "{0}-{1}".format(datetime.year, datetime.month)

train["year_month"] = train["datetime"].apply(concatenate_year_month)

print(train.shape)
train[["datetime", "year_month"]].head()

(10886, 21)

	datetime	year_month
0	2011-01-01 00:00:00	2011-1
1	2011-01-01 01:00:00	2011-1
2	2011-01-01 02:00:00	2011-1
3	2011-01-01 03:00:00	2011-1
4	2011-01-01 04:00:00	2011-1

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 4)

sns.barplot(data=train, x="year", y="count", ax=ax1)
sns.barplot(data=train, x="month", y="count", ax=ax2)

fig, ax3 = plt.subplots(1, 1)
fig.set_size_inches(18, 4)

sns.barplot(data=train, x="year_month", y="count", ax=ax3)

<AxesSubplot:xlabel='year_month', ylabel='count'>

output_20_1

output_20_2

#이상치 제거
trainWithoutOutliers = train[np.abs(train["count"] - train["count"].mean()) <= (3*train["count"].mean())]

print(train.shape)
print(trainWithoutOutliers.shape)

(10886, 21)
(10768, 21)

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)

sns.distplot(train["count"], ax=axes[0][0])
stats.probplot(train["count"], dist='norm', fit=True, plot=axes[0][1])

sns.distplot(np.log(trainWithoutOutliers["count"]), ax=axes[1][0])
stats.probplot(np.log(trainWithoutOutliers["count"]), dist='norm', fit=True, plot=axes[1][1])

C:\Users\yun70\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
C:\Users\yun70\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)





((array([-3.82886059, -3.6047202 , -3.48171232, ...,  3.48171232,
          3.6047202 ,  3.82886059]),
  array([0.        , 0.        , 0.        , ..., 6.63463336, 6.64118217,
         6.64118217])),
 (1.4123284761790171, 4.528748279449013, 0.9542628138734534))

output_22_2

4. 예측 준비

풍속 0값 예측

# widspeed 풍속에 0 값이 가장 많음 -> 잘못 기록된 데이터를 고쳐 줄 필요가 있음
fig, axes = plt.subplots(2)
fig.set_size_inches(18, 10)

plt.sca(axes[0])
plt.xticks(rotation=30, ha='right')
axes[0].set(ylabel='Count', title="train windspeed")
sns.countplot(data=train, x="windspeed", ax=axes[0])

plt.sca(axes[1])
plt.xticks(rotation=30, ha='right')
axes[1].set(ylabel='Count', title="test windspeed")
sns.countplot(data=test, x="windspeed", ax=axes[1])

<AxesSubplot:title={'center':'test windspeed'}, xlabel='windspeed', ylabel='count'>

output_25_1

trainWind0 = train.loc[train['windspeed']==0]
trainWindNot0 = train.loc[train['windspeed']!=0]
print(trainWind0.shape)
print(trainWindNot0.shape)

(1313, 21)
(9573, 21)

# 랜덤포레스트로 예측해서 풍속 넣기
from sklearn.ensemble import RandomForestClassifier

def predict_windspeed(data) :

    dataWind0 = data.loc[data['windspeed'] == 0]
    dataWindNot0 = data.loc[data['windspeed'] != 0]

    wCol = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]

    dataWindNot0["windspeed"] = dataWindNot0["windspeed"].astype("str")

    rfModel_wind = RandomForestClassifier()
    rfModel_wind.fit(dataWindNot0[wCol], dataWindNot0["windspeed"])
    wind0Values = rfModel_wind.predict(X = dataWind0[wCol])

    predictWind0 = dataWind0
    predictWindNot0 = dataWindNot0
    predictWind0["windspeed"] = wind0Values
    data = predictWindNot0.append(predictWind0)

    data["windspeed"] = data["windspeed"].astype("float")
    data.reset_index(inplace=True)
    data.drop('index', inplace=True, axis=1)

    return data

train = predict_windspeed(train)

fig, ax1 = plt.subplots()
fig.set_size_inches(18, 6)

plt.sca(ax1)
plt.xticks(rotation=30, ha='right')
ax1.set(ylabel='Count', title="train windspeed")
sns.countplot(data=train, x="windspeed", ax=ax1)

<AxesSubplot:title={'center':'train windspeed'}, xlabel='windspeed', ylabel='count'>

output_28_1

feature 정하기

categorical_feature_names = ["season", "holiday", "workingday", "weather", "dayofweek", "month", "year", "hour"]

for var in categorical_feature_names :
    train[var] = train[var]. astype("category")
    test[var] = test[var]. astype("category")

feature_names = ["season", "weather", "temp", "atemp", "humidity", "windspeed", "year", "hour", "dayofweek", "holiday", "workingday"]
feature_names

['season',
 'weather',
 'temp',
 'atemp',
 'humidity',
 'windspeed',
 'year',
 'hour',
 'dayofweek',
 'holiday',
 'workingday']

X_train = train[feature_names]

print(X_train.shape)
X_train.head()

(10886, 11)

	season	weather	temp	atemp	humidity	windspeed	year	hour	dayofweek
0	1	2	9.84	12.880	75	6.0032	2011	5	5
1	1	1	15.58	19.695	76	16.9979	2011	10	5
2	1	1	14.76	16.665	81	19.0012	2011	11	5
3	1	1	17.22	21.210	77	19.0012	2011	12	5
4	1	2	18.86	22.725	72	19.9995	2011	13	5

X_test = test[feature_names]

print(X_test.shape)
X_test.head()

(6493, 11)

	season	weather	temp	atemp	humidity	windspeed	year	hour	dayofweek	workingday
0	1	1	10.66	11.365	56	26.0027	2011	0	3	1
1	1	1	10.66	13.635	56	0.0000	2011	1	3	1
2	1	1	10.66	13.635	56	0.0000	2011	2	3	1
3	1	1	10.66	12.880	56	11.0014	2011	3	3	1
4	1	1	10.66	12.880	56	11.0014	2011	4	3	1

label_name = "count"

y_train = train[label_name]

print(y_train.shape)
y_train.head()

(10886,)

   1
  36
  56
  84
  94
Name: count, dtype: int64

RMSLE

과소평가에 패널티 줌
오차 제곱 평균의 제곱근
0에 가까울수록 정밀도 높음

from sklearn.metrics import make_scorer

def rmsle(predicted_values, actual_values) :

    predicted_values = np.array(predicted_values)
    actual_values = np.array(actual_values)

    log_predict = np.log(predicted_values + 1)
    log_actual = np.log(actual_values + 1)

    difference = log_predict - log_actual
    difference = np.square(difference)

    mean_difference = difference.mean()

    score = np.sqrt(mean_difference)

    return score

rmsle_scorer = make_scorer(rmsle)
rmsle_scorer

make_scorer(rmsle)

KFold 교차검증

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

5. 예측 모델

Linear Regression Model

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)

# 선형회귀 모델을 초기화
lModel = LinearRegression()

# 모델 학습
y_train_log = np.log1p(y_train)
lModel.fit(X_train, y_train_log)

# 예측, 정확도 평가
preds = lModel.predict(X_train)
print("RMSLE Value For Linear Regression: ", rmsle(np.exp(y_train_log), np.exp(preds)))

RMSLE Value For Linear Regression:  0.9798738624110362

Regularization Model - Ridge

가중치 모든 원소가 0에 가깝게 만들어 모든 Feature가 주는 영향을 최소화(기울기 작게 만듦)
max_iter : 반복 실행하는 최대 횟수

ridge_m_ = Ridge()
ridge_params_ = {'max_iter' : [3000], 'alpha' : [0.01, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000]}
rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)
grid_ridge_m = GridSearchCV(ridge_m_, ridge_params_, scoring=rmsle_scorer, cv=5)

y_train_log = np.log1p(y_train)
grid_ridge_m.fit(X_train, y_train_log)
preds = grid_ridge_m.predict(X_train)
print(grid_ridge_m.best_params_)
print("RMSLE Value For Ridge Regression: ", rmsle(np.exp(y_train_log), np.exp(preds)))

fig,ax = plt.subplots()
fig.set_size_inches(12, 5)
df = pd.DataFrame(grid_ridge_m.cv_results_)
df["alpha"] = df["params"].apply(lambda x:x["alpha"])
df["rmsle"] = df["mean_test_score"].apply(lambda x:-x)

plt.xticks(rotation=30, ha='right')
sns.pointplot(data=df, x="alpha", y="rmsle", ax=ax)

{'alpha': 0.01, 'max_iter': 3000}
RMSLE Value For Ridge Regression:  0.9798738604013573

<AxesSubplot:xlabel='alpha', ylabel='rmsle'>

output_43_2

여기서 ‘GridSearchCV’ object has no attribute ‘gridscores’ 라는 오류가 떴다.
구글에 쳐보니 sklearn의 이전 버전에는 ‘gridscores‘가 있고 새 버전은 ‘cvresults‘로 변경되었다고 한다.
그래서 df = pd.DataFrame(gridridge_m.cv_results) 고치고 했다.
고쳤더니 컬럼값도 바껴있다..
원하는 컬럼값을 찾아 바꿔주었다.
df[“parameters”] -> df[“params”]
df[“mean_validation_score”] -> df[“mean_test_score”]

Regularization Model - Lasso

계수를 0에 가깝게 만들려고 하며 이를 L1 규제라고 한다. 0인 계수는 완전히 제외하는 Feature가 생긴다는 의미
Feature 선택이 자동으로 이루어짐
alpha 값의 기본 값은 1.0이며, 과소 적합을 줄이기 위해서는 이 값을 줄여야한다.

lasso_m_ = Lasso()
alpha = 1/np.array([0.01, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000])
lasso_params_ = {'max_iter' : [3000], 'alpha' : alpha}

grid_lasso_m = GridSearchCV(lasso_m_, lasso_params_, scoring=rmsle_scorer, cv=5)
y_train_log = np.log1p(y_train)
grid_lasso_m.fit(X_train, y_train_log)
preds = grid_lasso_m.predict(X_train)
print(grid_lasso_m.best_params_)
print("RMSLE Value For Lasso Regression: ", rmsle(np.exp(y_train_log), np.exp(preds)))

fig,ax = plt.subplots()
fig.set_size_inches(12, 5)
df = pd.DataFrame(grid_ridge_m.cv_results_)
df["alpha"] = df["params"].apply(lambda x:x["alpha"])
df["rmsle"] = df["mean_test_score"].apply(lambda x:-x)

plt.xticks(rotation=30, ha='right')
sns.pointplot(data=df, x="alpha", y="rmsle", ax=ax)

{'alpha': 0.00125, 'max_iter': 3000}
RMSLE Value For Lasso Regression:  0.9798830685468796

<AxesSubplot:xlabel='alpha', ylabel='rmsle'>

output_46_2

Ensemble Model - RandomForest

from sklearn.ensemble import RandomForestRegressor
rfModel = RandomForestRegressor(n_estimators = 100)

y_train_log = np.log1p(y_train)
rfModel.fit(X_train, y_train_log)

preds = rfModel.predict(X_train)
score = rmsle(np.exp(y_train_log), np.exp(preds))
print("RMSLE Value For Random Forest: ", score)

RMSLE Value For Random Forest:  0.10687876851241544

Ensemble Model - Gradient Boost

이진 트리의 오차를 보완
무작위성X, 강력한 사전 가지치기

from sklearn.ensemble import GradientBoostingRegressor
gbm = GradientBoostingRegressor(n_estimators = 4000, alpha=0.01)

y_train_log = np.log1p(y_train)
gbm.fit(X_train, y_train_log)

preds = gbm.predict(X_train)
print("RMSLE Value For Gradient Boost: ", rmsle(np.exp(y_train_log), np.exp(preds)))

RMSLE Value For Gradient Boost:  0.20502659343932583

predsTest = rfModel.predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(12, 5)

sns.distplot(y_train, ax=ax1, bins=50)
sns.distplot(np.exp(predsTest), ax=ax2, bins=50)

C:\Users\yun70\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
C:\Users\yun70\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<AxesSubplot:ylabel='Density'>

output_51_2

6. Submit

submission = pd.read_csv("sampleSubmission.csv")
submission

submission["count"] = np.exp(predsTest)

print(submission.shape)
submission.head()

(6493, 2)

	datetime	count
0	2011-01-20 0:00	12.422343
1	2011-01-20 1:00	5.785711
2	2011-01-20 2:00	4.143073
3	2011-01-20 3:00	4.055910
4	2011-01-20 4:00	3.508064

submission.to_csv("Score_{0:.5f}_submission.csv" .format(score), index=False)

※ Kaggle에 제출할 때 “RandomForestRegressor(n_estimators = 100)” 처럼 어떤 파라미터를 썼는지 작성하고 제출하면 더 좋다.

Twitter Facebook LinkedIn

최진실

[Python] 캐글 자전거 수요 예측

1. 모듈 import

2. 데이터 로드

3. EDA

4. 예측 준비

풍속 0값 예측

feature 정하기

RMSLE

KFold 교차검증

5. 예측 모델

Linear Regression Model

Regularization Model - Ridge

Regularization Model - Lasso

Ensemble Model - RandomForest

Ensemble Model - Gradient Boost

6. Submit

공유하기

댓글남기기

참고

[JavaScript] 비동기

[Next.js] Next15에서 돈 안 쓰고 Sentry 알림 받기

[Next.js] Next15에서 Metadata 설정하기

[Next.js] Next15에서 Tanstack Query 사용하기

최진실

bike-sharing-demand

1. 모듈 import

2. 데이터 로드

3. EDA

4. 예측 준비

풍속 0값 예측

feature 정하기

RMSLE

KFold 교차검증

5. 예측 모델

Linear Regression Model

Regularization Model - Ridge

Regularization Model - Lasso

Ensemble Model - RandomForest

Ensemble Model - Gradient Boost

6. Submit

공유하기

댓글남기기

참고

[JavaScript] 비동기

[Next.js] Next15에서 돈 안 쓰고 Sentry 알림 받기

[Next.js] Next15에서 Metadata 설정하기

[Next.js] Next15에서 Tanstack Query 사용하기