This project works with a dataset of labeled emails. The dataset itself is from Kaggle: https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data
Since I'm working with raw text, I need the nltk (Natural Language Toolkit) library.
import string
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
corpus = []
stopwords_set = set(stopwords.words('english'))
for i in range(len(df)):
    text = df['text'].iloc[i].lower()
    text = text.translate(str.maketrans('', '', string.punctuation)).split()
    text = [stemmer.stem(word) for word in text if word not in stopwords_set]
    text = ' '.join(text)
    corpus.append(text)

Stemming reduces words to their root form, for example "running" becomes "run", which helps to simplify the tokens. The stopword set imported from NLTK contains common function words like "the", "is", etc. Finally, the loop collects each processed, stemmed email into a list named corpus.
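As a quick illustration of what this preprocessing does, the sketch below runs the same steps on a single made-up sentence. The sample sentence and the tiny hardcoded stopword set are stand-ins for illustration only; the real pipeline uses NLTK's full English stopword list.

```python
import string
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Tiny stand-in for NLTK's English stopword list (illustration only)
stopwords_set = {'the', 'is', 'a', 'to', 'and'}

sample = "The runner is running to a connected connection!"
# Same steps as the loop: lowercase, strip punctuation, split, filter, stem
text = sample.lower()
text = text.translate(str.maketrans('', '', string.punctuation)).split()
text = [stemmer.stem(word) for word in text if word not in stopwords_set]
print(' '.join(text))
```

Note how "running", "connected", and "connection" collapse toward shared roots, while the stopwords disappear entirely; this shrinks the vocabulary the vectorizer has to deal with later.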
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
y = df.label_num
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier(n_jobs=-1)
clf.fit(X_train, y_train)

Since the dataset is labeled with whether each email is spam or ham (not spam), I vectorized the emails and trained on them directly. The data is fitted to a RandomForestClassifier, and the accuracy on the test set is quite high, about 97%.
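To make the bag-of-words and training steps concrete, here is a minimal, self-contained sketch on a toy corpus. The example messages and labels are invented for illustration and are not from the Kaggle dataset; the real run uses the preprocessed corpus and df.label_num.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy corpus: made-up messages, 1 = spam, 0 = ham (illustration only)
toy_corpus = [
    "free money win prize now",
    "meeting scheduled noon today",
    "win free free prize money",
    "lunch plan today noon",
]
toy_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
# Each row is a message, each column counts one vocabulary word
X = vectorizer.fit_transform(toy_corpus).toarray()
print(sorted(vectorizer.vocabulary_))

clf = RandomForestClassifier(n_jobs=-1, random_state=0)
clf.fit(X, toy_labels)

# Vectorize new text with the SAME fitted vectorizer before predicting,
# so the columns line up with the training matrix
pred = clf.predict(vectorizer.transform(["free prize money"]).toarray())
print(pred)
```

The key detail is that new text must go through transform on the already-fitted vectorizer, not a fresh fit_transform, so the feature columns match what the classifier was trained on.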
