Predicting flight prices is a small yet powerful project. It is designed to estimate the cost of a ticket based on various input features. Using machine learning algorithms, the model analyzes historical data and tries to predict the ticket price.
About the data
The data is downloadable from Kaggle: the "Flight Price Prediction" dataset by Shubham Bathwal. Here's a link to the dataset.
The features are airline, flight code, departure city, stops, departure time, arrival time, arrival city, class, duration, days left, and price. In other words, we want to predict a single feature, the price, using all the others.
Strategy
Of course, the first thing I need to do is explore the data: what the datatypes are, and how varied the values are.
After that, I knew I could drop the unnamed and flight number columns because they aren't really necessary for predicting the price. Also, features like airline, source city, departure and arrival time, and destination city have several categorical values that could be one-hot encoded.
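As a rough sketch of this preprocessing step, assuming pandas and column names along the lines of the Kaggle CSV (the exact names may differ), the drop and one-hot encoding could look like:

```python
import pandas as pd

# Toy rows standing in for the Kaggle data; column names are assumptions.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "flight": ["SG-8709", "6E-2046"],
    "airline": ["SpiceJet", "Indigo"],
    "source_city": ["Delhi", "Mumbai"],
    "class": ["Economy", "Business"],
    "price": [5953, 25612],
})

# Drop the index column and flight number -- neither helps predict price.
df = df.drop(columns=["Unnamed: 0", "flight"])

# One-hot encode the categorical features.
df = pd.get_dummies(df, columns=["airline", "source_city", "class"])
```

After this, every categorical value becomes its own 0/1 column (e.g. `airline_SpiceJet`), which tree-based models can consume directly.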
Since the target is a price, regression is likely a good fit for the data.
Model
Using the scikit-learn library, I imported RandomForestRegressor and train_test_split. The train/test split is the common 80:20.
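A minimal sketch of the 80:20 split, using synthetic stand-in data in place of the encoded flight features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded flight features and prices.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.random(100)

# The common 80:20 split: 80% for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```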
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_jobs=-1)  # n_jobs=-1 uses all CPU cores
reg.fit(X_train, y_train)
reg.score(X_test, y_test)  # R^2 on the held-out test set
I got a score of 0.9856, i.e. the model explains about 98.56% of the variance in price (for a regressor, `score` returns R²). That is very high, so the model is quite accurate.
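R² alone can hide how far off individual predictions are, so error metrics in actual price units are often worth checking too. A sketch with illustrative placeholder values standing in for `y_test` and the model's predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder values; in the project these would be y_test and reg.predict(X_test).
y_true = np.array([5953.0, 25612.0, 7425.0])
y_pred = np.array([6100.0, 24900.0, 7300.0])

mae = mean_absolute_error(y_true, y_pred)          # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses more
```

MAE reads directly as "on average, predictions are off by this many rupees", which is easier to communicate than R².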
Afterwards, I looked for the most important features in determining the price.
importances = dict(zip(reg.feature_names_in_, reg.feature_importances_))
sorted_importances = sorted(importances.items(), key=lambda x: x[1], reverse=True)
sorted_importances
As expected, the most important feature is indeed the class: a Business Class ticket costs significantly more than an Economy Class one.
To try to improve the model, I tuned the hyperparameters.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    'n_estimators': randint(10, 100),
    'max_depth': [10, 20],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 5),
    'max_features': [1.0, 'sqrt']
}

reg = RandomForestRegressor(n_jobs=-1)
random_search = RandomizedSearchCV(estimator=reg, param_distributions=param_dist,
                                   n_iter=2, cv=3, scoring='neg_mean_squared_error',
                                   verbose=2, random_state=10, n_jobs=-1)
random_search.fit(X_train, y_train)
best_regressor = random_search.best_estimator_
This gave me the best regressor the search could find. Alternatively, I could also tune several parameters manually, for example n_estimators or max_depth.
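Manual tuning of a single parameter can be sketched as a simple sweep; here on synthetic stand-in data rather than the actual flight features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in data; in the project this would be the encoded flight features.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Sweep n_estimators manually and keep the value with the best test R^2.
best_score, best_n = -float("inf"), None
for n in (10, 50, 100):
    reg = RandomForestRegressor(n_estimators=n, n_jobs=-1, random_state=0)
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    if score > best_score:
        best_score, best_n = score, n
```

The same loop pattern extends to max_depth or any other single parameter, though RandomizedSearchCV covers combinations more systematically.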
However, 98.5% is already a very good score. This may be because the data was already pre-cleaned and pre-processed.