<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    6493 non-null   datetime64[ns]
 1   season      6493 non-null   int64
 2   holiday     6493 non-null   int64
 3   workingday  6493 non-null   int64
 4   weather     6493 non-null   int64
 5   temp        6493 non-null   float64
 6   atemp       6493 non-null   float64
 7   humidity    6493 non-null   int64
 8   windspeed   6493 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(5)
memory usage: 456.7 KB
datetime 0
season 0
holiday 0
workingday 0
weather 0
temp 0
atemp 0
humidity 0
windspeed 0
casual 0
registered 0
count 0
dtype: int64
datetime 0
season 0
holiday 0
workingday 0
weather 0
temp 0
atemp 0
humidity 0
windspeed 0
dtype: int64
# Create year, month, day, dayofweek (0-6), quarter, hour, minute, and second columns
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['day'] = train['datetime'].dt.day
train['dayofweek'] = train['datetime'].dt.dayofweek
train['quarter'] = train['datetime'].dt.quarter
train['hour'] = train['datetime'].dt.hour
train['minute'] = train['datetime'].dt.minute
train['second'] = train['datetime'].dt.second
train.shape
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3)
fig.set_size_inches(18, 8)
sns.barplot(data = train, x = 'year', y='count', ax=ax1)
sns.barplot(data = train, x = 'month', y='count', ax=ax2)
sns.barplot(data = train, x = 'day', y='count', ax=ax3)
sns.barplot(data = train, x = 'hour', y='count', ax=ax4)
sns.barplot(data = train, x = 'minute', y='count', ax=ax5)
sns.barplot(data = train, x = 'second', y='count', ax=ax6)
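# The y-axis is the rental count in every panel; the x variable goes in the title
ax1.set(ylabel='Count', title='Count by year')
ax2.set(ylabel='Count', title='Count by month')
ax3.set(ylabel='Count', title='Count by day')
ax4.set(ylabel='Count', title='Count by hour')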
minute and second contain only the value 0, so these features are not used.
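The same conclusion can be reached without a plot by counting distinct values per derived column. The following is a minimal sketch on a hypothetical stand-in for `train` (the `sample` frame and its hourly timestamps are assumptions, not the competition data):

```python
import pandas as pd

# Hypothetical stand-in for `train`: 400 days of consecutive hourly timestamps
stamps = pd.Timestamp("2011-01-01") + pd.to_timedelta(list(range(24 * 400)), unit="h")
sample = pd.DataFrame({"datetime": stamps})

# Derive the same kind of datetime parts as in the post
for part in ["year", "month", "day", "hour", "minute", "second"]:
    sample[part] = getattr(sample["datetime"].dt, part)

# A column with a single distinct value carries no information for a model
n_unique = sample.drop(columns="datetime").nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

On data recorded once per hour, only `minute` and `second` should end up in `constant_cols`.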
fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)
sns.boxplot(data = train, y='count', orient="v", ax=axes[0][0])
sns.boxplot(data = train, y='count', x = "season", orient="v", ax=axes[0][1])
sns.boxplot(data = train, y='count', x = "hour", orient="v", ax=axes[1][0])
sns.boxplot(data = train, y='count', x = "workingday", orient="v", ax=axes[1][1])
axes[0][0].set(ylabel='Count')
axes[0][1].set(xlabel="Season", ylabel='Count')
axes[1][0].set(xlabel="Hour of the Day", ylabel='Count')
axes[1][1].set(xlabel="Working Day", ylabel='Count')
[Text(0.5, 0, 'Working Day'), Text(0, 0.5, 'Count')]
Broken down by hour, far too many values show up as outliers. Let's look at hour under several different conditions.
fig, ((ax1, ax2, ax3, ax4, ax5)) = plt.subplots(5)
fig.set_size_inches(18, 25)
sns.pointplot(data = train, x = 'hour', y='count', ax=ax1)
sns.pointplot(data = train, x = 'hour', y='count', hue="workingday", ax=ax2)
sns.pointplot(data = train, x = 'hour', y='count', hue="dayofweek", ax=ax3)
sns.pointplot(data = train, x = 'hour', y='count', hue="weather", ax=ax4)
sns.pointplot(data = train, x = 'hour', y='count', hue="season", ax=ax5)
<AxesSubplot:xlabel='hour', ylabel='count'>
Hourly demand differs between weekdays and weekends, and from one day of the week to another.
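The difference can also be read off as numbers by averaging the count per hour and per `workingday` value. This is only a sketch: the `sample` frame and every number in it are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up illustrative rows (NOT real data): two hours, working vs. non-working days
sample = pd.DataFrame({
    "hour":       [8, 8, 8, 8, 13, 13, 13, 13],
    "workingday": [1, 1, 0, 0, 1, 1, 0, 0],
    "count":      [400, 420, 90, 110, 150, 170, 300, 320],
})

# Mean count per hour, with one column per workingday value
hourly = (
    sample.groupby(["hour", "workingday"])["count"]
    .mean()
    .unstack("workingday")
)
print(hourly)
```

Applied to `train`, the same `groupby` gives the table behind the `hue="workingday"` point plot.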
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 4)
sns.barplot(data=train, x="year", y="count", ax=ax1)
sns.barplot(data=train, x="month", y="count", ax=ax2)
fig, ax3 = plt.subplots(1, 1)
fig.set_size_inches(18, 4)
sns.barplot(data=train, x="year_month", y="count", ax=ax3)
<AxesSubplot:xlabel='year_month', ylabel='count'>
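The `year_month` column used in the last plot is not created in the code shown. One way it could be built, assuming a label of the form `YYYY-M`, is sketched below; the helper name `concatenate_year_month` and the `sample` frame are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for `train` with only the columns needed here
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00:00",
        "2011-12-19 23:00:00",
        "2012-03-05 10:00:00",
    ]),
    "count": [16, 40, 120],  # made-up values
})

def concatenate_year_month(ts):
    """Return a 'YYYY-M' label, e.g. '2011-1', for a timestamp."""
    return "{0}-{1}".format(ts.year, ts.month)

sample["year_month"] = sample["datetime"].apply(concatenate_year_month)
print(sample[["datetime", "year_month"]])
```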
fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)
sns.distplot(train["count"], ax=axes[0][0])
stats.probplot(train["count"], dist='norm', fit=True, plot=axes[0][1])
sns.distplot(np.log(trainWithoutOutliers["count"]), ax=axes[1][0])
stats.probplot(np.log(trainWithoutOutliers["count"]), dist='norm', fit=True, plot=axes[1][1])
C:\Users\yun70\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
((array([-3.82886059, -3.6047202 , -3.48171232, ..., 3.48171232,
3.6047202 , 3.82886059]),
array([0. , 0. , 0. , ..., 6.63463336, 6.64118217,
6.64118217])),
(1.4123284761790171, 4.528748279449013, 0.9542628138734534))
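`trainWithoutOutliers` is used above but its definition is not shown. A common choice, assumed here rather than confirmed by the post, is to keep only rows whose `count` lies within three standard deviations of the mean and then take the log. The `sample` frame below is made up.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for train["count"]: moderate values plus one extreme value
sample = pd.DataFrame(
    {"count": [1, 5, 12, 30, 45, 60, 80, 100, 120, 150] * 5 + [5000]}
)

# Assumed rule: keep rows within 3 standard deviations of the mean
mean, std = sample["count"].mean(), sample["count"].std()
trainWithoutOutliers = sample[np.abs(sample["count"] - mean) <= 3 * std]

# Log-transform the remaining counts (all are >= 1 here, so np.log is defined)
log_count = np.log(trainWithoutOutliers["count"])
print(len(sample), len(trainWithoutOutliers))
```

The log transform compresses the long right tail, which is why the second pair of plots looks closer to a normal distribution than the first.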
# windspeed has 0 as its most frequent value -> these are likely mis-recorded and need to be corrected
fig, axes = plt.subplots(2)
fig.set_size_inches(18, 10)
plt.sca(axes[0])
plt.xticks(rotation=30, ha='right')
axes[0].set(ylabel='Count', title="train windspeed")
sns.countplot(data=train, x="windspeed", ax=axes[0])
plt.sca(axes[1])
plt.xticks(rotation=30, ha='right')
axes[1].set(ylabel='Count', title="test windspeed")
sns.countplot(data=test, x="windspeed", ax=axes[1])
<AxesSubplot:title={'center':'test windspeed'}, xlabel='windspeed', ylabel='count'>
# Fill in windspeed by predicting it with a random forest
from sklearn.ensemble import RandomForestClassifier

def predict_windspeed(data):
    # Split rows by whether windspeed was recorded as 0 (copies avoid chained-assignment issues)
    dataWind0 = data.loc[data['windspeed'] == 0].copy()
    dataWindNot0 = data.loc[data['windspeed'] != 0].copy()
    # Features used to predict windspeed
    wCol = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]
    # Treat each observed windspeed value as a class label
    dataWindNot0["windspeed"] = dataWindNot0["windspeed"].astype("str")
    rfModel_wind = RandomForestClassifier()
    rfModel_wind.fit(dataWindNot0[wCol], dataWindNot0["windspeed"])
    # Predict a windspeed for the rows recorded as 0
    dataWind0["windspeed"] = rfModel_wind.predict(X=dataWind0[wCol])
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    data = pd.concat([dataWindNot0, dataWind0])
    data["windspeed"] = data["windspeed"].astype("float")
    data.reset_index(drop=True, inplace=True)
    return data

train = predict_windspeed(train)
fig, ax1 = plt.subplots()
fig.set_size_inches(18, 6)
plt.sca(ax1)
plt.xticks(rotation=30, ha='right')
ax1.set(ylabel='Count', title="train windspeed")
sns.countplot(data=train, x="windspeed", ax=ax1)
<AxesSubplot:title={'center':'train windspeed'}, xlabel='windspeed', ylabel='count'>
categorical_feature_names = ["season", "holiday", "workingday", "weather", "dayofweek", "month", "year", "hour"]
for var in categorical_feature_names:
    train[var] = train[var].astype("category")
    test[var] = test[var].astype("category")
feature_names = ["season", "weather", "temp", "atemp", "humidity", "windspeed", "year", "hour", "dayofweek", "holiday", "workingday"]
feature_names
['season',
'weather',
'temp',
'atemp',
'humidity',
'windspeed',
'year',
'hour',
'dayofweek',
'holiday',
'workingday']
(10886,)
0 1
1 36
2 56
3 84
4 94
Name: count, dtype: int64
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)
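# Initialize the linear regression model
lModel = LinearRegression()
# Train the model on the log-transformed target
y_train_log = np.log1p(y_train)
lModel.fit(X_train, y_train_log)
# Predict on the training data and evaluate the error
# (np.exp is kept to match the reported output; np.expm1 is the exact inverse of np.log1p)
preds = lModel.predict(X_train)
print("RMSLE Value For Linear Regression: ", rmsle(np.exp(y_train_log), np.exp(preds)))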
RMSLE Value For Linear Regression: 0.9798738624110362
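The `rmsle` function is called here but its definition is not shown in the post. The sketch below is one possible version, assuming it takes actual and predicted values on the original count scale; the signature is a guess, not the author's code. It also shows how `np.log1p` relates to `np.expm1` and `np.exp`.

```python
import numpy as np

def rmsle(actual_values, predicted_values):
    """Root mean squared logarithmic error (hypothetical version)."""
    actual = np.asarray(actual_values, dtype=float)
    predicted = np.asarray(predicted_values, dtype=float)
    # Compare log(1 + x) of both sides so that zero counts are allowed
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return np.sqrt(np.mean(log_diff ** 2))

y = np.array([0.0, 9.0, 99.0])
y_log = np.log1p(y)

# np.expm1 undoes np.log1p exactly, while np.exp returns y + 1
print(np.allclose(np.expm1(y_log), y))
print(np.allclose(np.exp(y_log), y + 1))

# Identical inputs give an error of 0
print(rmsle(y, y))
```

Because `np.exp(np.log1p(y))` equals `y + 1`, a model trained on `np.log1p(y_train)` is usually mapped back with `np.expm1` when predictions on the original scale are needed.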
ridge_m_ = Ridge()
ridge_params_ = {'max_iter' : [3000], 'alpha' : [0.01, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000]}
rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)
grid_ridge_m = GridSearchCV(ridge_m_, ridge_params_, scoring=rmsle_scorer, cv=5)
y_train_log = np.log1p(y_train)
grid_ridge_m.fit(X_train, y_train_log)
preds = grid_ridge_m.predict(X_train)
print(grid_ridge_m.best_params_)
print("RMSLE Value For Ridge Regression: ", rmsle(np.exp(y_train_log), np.exp(preds)))
fig,ax = plt.subplots()
fig.set_size_inches(12, 5)
df = pd.DataFrame(grid_ridge_m.cv_results_)
df["alpha"] = df["params"].apply(lambda x:x["alpha"])
df["rmsle"] = df["mean_test_score"].apply(lambda x:-x)
plt.xticks(rotation=30, ha='right')
sns.pointplot(data=df, x="alpha", y="rmsle", ax=ax)
{'alpha': 0.01, 'max_iter': 3000}
RMSLE Value For Ridge Regression: 0.9798738604013573
<AxesSubplot:xlabel='alpha', ylabel='rmsle'>
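The line `df["rmsle"] = df["mean_test_score"].apply(lambda x:-x)` flips the sign because a scorer built with `greater_is_better=False` returns the negated error, so that a larger score always means a better model. A small sketch of that behaviour follows; the `rmsle` definition and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyRegressor

def rmsle(actual_values, predicted_values):
    # Hypothetical error function standing in for the post's `rmsle`
    log_diff = np.log1p(predicted_values) - np.log1p(actual_values)
    return np.sqrt(np.mean(log_diff ** 2))

rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)

# Toy data: a constant-mean model, so the error is clearly non-zero
X = np.zeros((4, 1))
y = np.array([1.0, 3.0, 7.0, 15.0])
model = DummyRegressor(strategy="mean").fit(X, y)

raw_error = rmsle(y, model.predict(X))  # positive error
score = rmsle_scorer(model, X, y)       # same magnitude, negative sign
print(raw_error, score)
```

`GridSearchCV` maximizes this negated score, which is equivalent to minimizing the error.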
lasso_m_ = Lasso()
alpha = 1/np.array([0.01, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000])
lasso_params_ = {'max_iter' : [3000], 'alpha' : alpha}
grid_lasso_m = GridSearchCV(lasso_m_, lasso_params_, scoring=rmsle_scorer, cv=5)
y_train_log = np.log1p(y_train)
grid_lasso_m.fit(X_train, y_train_log)
preds = grid_lasso_m.predict(X_train)
print(grid_lasso_m.best_params_)
print("RMSLE Value For Lasso Regression: ", rmsle(np.exp(y_train_log), np.exp(preds)))
fig,ax = plt.subplots()
fig.set_size_inches(12, 5)
df = pd.DataFrame(grid_lasso_m.cv_results_)  # use the Lasso search results, not the Ridge ones
df["alpha"] = df["params"].apply(lambda x:x["alpha"])
df["rmsle"] = df["mean_test_score"].apply(lambda x:-x)
plt.xticks(rotation=30, ha='right')
sns.pointplot(data=df, x="alpha", y="rmsle", ax=ax)
{'alpha': 0.00125, 'max_iter': 3000}
RMSLE Value For Lasso Regression: 0.9798830685468796
<AxesSubplot:xlabel='alpha', ylabel='rmsle'>
RMSLE Value For Random Forest: 0.10687876851241544
RMSLE Value For Gradient Boost: 0.20502659343932583
predsTest = rfModel.predict(X_test)
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(12, 5)
sns.distplot(y_train, ax=ax1, bins=50)
sns.distplot(np.exp(predsTest), ax=ax2, bins=50)
C:\Users\yun70\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:ylabel='Density'>
※ When submitting to Kaggle, it helps to note which parameters you used, e.g. "RandomForestRegressor(n_estimators = 100)", in the submission description.