This is a very popular dataset: house prices in California. Similarly to the flight ticket price prediction, we can predict house prices with machine learning. In this attempt I'm going to use regression.
About the data
The data is downloadable from Kaggle. It's called the "California Housing Prices" dataset, by Cam Nugent. Here's a link to the data set. The dataset itself is used in the second chapter of Aurélien Géron's book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'.
Strategy
After exploring the dataset, I found that almost every feature is an integer or a float. However, the histograms of several features, like total_rooms and total_bedrooms, are heavily skewed rather than normally distributed. In my opinion this makes the data a little harder to train on, so I log-transformed those features first.
import numpy as np

# Log-transform the skewed features (the +1 avoids taking log(0))
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1)
train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] + 1)
train_data['population'] = np.log(train_data['population'] + 1)
train_data['households'] = np.log(train_data['households'] + 1)
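As a quick sanity check that this transform actually reduces skew, here's a minimal sketch on synthetic right-skewed data (the data here is made up for illustration; only the transform itself is the one used above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed column, similar in shape to total_rooms
df = pd.DataFrame({"total_rooms": rng.lognormal(mean=7, sigma=1, size=1000)})

before = df["total_rooms"].skew()
df["total_rooms"] = np.log(df["total_rooms"] + 1)  # same transform as above
after = df["total_rooms"].skew()

print(before, after)  # the log-transformed column is far less skewed
```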
Also, ocean_proximity is a string (categorical) feature with 5 unique values, so I one-hot encoded it.
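A minimal sketch of that one-hot encoding step with pandas (the category values below are the ones from the dataset; get_dummies is one way to do it, not necessarily the exact call I used):

```python
import pandas as pd

df = pd.DataFrame({"ocean_proximity":
                   ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN", "ISLAND"]})
# One binary column per category replaces the original string column
encoded = pd.get_dummies(df, columns=["ocean_proximity"])
print(encoded.columns.tolist())
```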
Model
For this dataset I'm using LinearRegression from scikit-learn first.
from sklearn.linear_model import LinearRegression

X_train, y_train = train_data.drop(['median_house_value'], axis=1), train_data['median_house_value']
reg = LinearRegression()
reg.fit(X_train, y_train)
reg.score(X_test, y_test)
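For context, reg.score here returns R², the coefficient of determination, not accuracy. A quick sketch on toy data showing it's the same number as sklearn's r2_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

reg = LinearRegression().fit(X, y)
# .score returns R^2, identical to sklearn.metrics.r2_score
assert np.isclose(reg.score(X, y), r2_score(y, reg.predict(X)))
```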
The score was 67.94%. That's not a bad model, but it's not a good one either. So I tried a slightly different model, RandomForestRegressor.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()
forest.fit(X_train, y_train)
forest.score(X_test, y_test)
This time it scored much better: 81.97%. That's a good model. Still, I tried tuning the hyperparameters.
from sklearn.model_selection import GridSearchCV

forest = RandomForestRegressor()
param_grid = {
    "n_estimators": [30, 50, 100],
    "max_features": [8, 12, 20],
    "min_samples_split": [2, 4, 6, 8]
}
grid_search = GridSearchCV(forest, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X_train, y_train)
It improved the score, but not significantly: 82.06% now. A very slight improvement.
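After fitting, the winning hyperparameters and the refit model can be read straight off the grid search object. A minimal sketch on toy data (the features, targets, and shrunken parameter grid are made up so it runs quickly; only the GridSearchCV attributes are the real API):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=200)

param_grid = {"n_estimators": [10, 20], "min_samples_split": [2, 4]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=0),
                           param_grid, cv=3,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X, y)

print(grid_search.best_params_)            # winning hyperparameter combination
best_forest = grid_search.best_estimator_  # refit on all of X by default
print(best_forest.score(X, y))             # R^2 of the refit model
```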