This is a very popular dataset: house prices in California. Similarly to the flight ticket price prediction, we can predict house prices with machine learning. In this attempt I'm going to use regression.
About the data
The data is downloadable from Kaggle. It's called the "California Housing Prices" dataset, by Cam Nugent. Here's a link to the data set. The dataset itself is used in the second chapter of Aurélien Géron's book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'.
Strategy
After exploring the dataset, I found that almost every feature is an integer or a float. However, the histograms of several features, like total_rooms and total_bedrooms, are heavily skewed rather than normally distributed. In my opinion this makes the data a little harder to train on, so I log-transformed those features first.
import numpy as np

# Log-transform the skewed features (the +1 avoids taking log(0))
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1)
train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] + 1)
train_data['population'] = np.log(train_data['population'] + 1)
train_data['households'] = np.log(train_data['households'] + 1)
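As a quick sanity check that this transform actually reduces skew, here's a minimal sketch on synthetic right-skewed data (the data here is made up for illustration; only the transform itself is the one used above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed column, similar in shape to total_rooms
df = pd.DataFrame({"total_rooms": rng.lognormal(mean=7, sigma=1, size=1000)})

before = df["total_rooms"].skew()
df["total_rooms"] = np.log(df["total_rooms"] + 1)  # same transform as above
after = df["total_rooms"].skew()

print(before, after)  # the log-transformed column is far less skewed
```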
Also, ocean_proximity is a string (categorical) feature with 5 unique values, so I one-hot encoded it.
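A minimal sketch of that one-hot encoding step with pandas (the category values below are the ones from the dataset; get_dummies is one way to do it, not necessarily the exact call I used):

```python
import pandas as pd

df = pd.DataFrame({"ocean_proximity":
                   ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN", "ISLAND"]})
# One binary column per category replaces the original string column
encoded = pd.get_dummies(df, columns=["ocean_proximity"])
print(encoded.columns.tolist())
```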
Model
For this dataset I'm using LinearRegression from scikit-learn first.
from sklearn.linear_model import LinearRegression

X_train, y_train = train_data.drop(['median_house_value'], axis=1), train_data['median_house_value']
reg = LinearRegression()
reg.fit(X_train, y_train)
reg.score(X_test, y_test)
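For context, reg.score here returns R², the coefficient of determination, not accuracy. A quick sketch on toy data showing it's the same number as sklearn's r2_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

reg = LinearRegression().fit(X, y)
# .score returns R^2, identical to sklearn.metrics.r2_score
assert np.isclose(reg.score(X, y), r2_score(y, reg.predict(X)))
```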
The score was 67.94%. That's not a bad model, but it's not a good one either. So I tried a slightly different model, RandomForestRegressor.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()
forest.fit(X_train, y_train)
forest.score(X_test, y_test)
This time it scored much better: 81.97%. That's a good model. Still, I tried tuning the hyperparameters.
from sklearn.model_selection import GridSearchCV

forest = RandomForestRegressor()
param_grid = {
    "n_estimators": [30, 50, 100],
    "max_features": [8, 12, 20],
    "min_samples_split": [2, 4, 6, 8]
}
grid_search = GridSearchCV(forest, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X_train, y_train)
It improved the score, but not significantly: 82.06% now. A very slight improvement.
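After fitting, the winning hyperparameters and the refit model can be read straight off the grid search object. A minimal sketch on toy data (the features, targets, and shrunken parameter grid are made up so it runs quickly; only the GridSearchCV attributes are the real API):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=200)

param_grid = {"n_estimators": [10, 20], "min_samples_split": [2, 4]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=0),
                           param_grid, cv=3,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X, y)

print(grid_search.best_params_)            # winning hyperparameter combination
best_forest = grid_search.best_estimator_  # refit on all of X by default
print(best_forest.score(X, y))             # R^2 of the refit model
```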