Detect AI-Generated Phishing Emails with BERT | Full Project Tut | Dissertation | Final year project

Introduction to AI Project on Phishing Email Detection

Overview of the Project

  • The project focuses on detecting phishing emails using BERT, a language representation model.
  • It consists of two main parts: theory and coding implementation.

Understanding Phishing Emails

  • Phishing emails are deceptive messages designed to trick victims into revealing sensitive information.
  • Attackers send these emails containing malicious links that lead to fake websites mimicking trusted sites.
  • Victims may unknowingly enter their credentials, which are then stolen by attackers for malicious purposes.

Background and Problem Statement

Current Challenges in Detecting Phishing Emails

  • The rise of AI-generated content complicates the detection of phishing emails as they can closely mimic legitimate messages.
  • The goal is to build a BERT-based model that automatically identifies phishing attempts, thereby protecting users from scams and data theft.

Project Architecture

Steps Involved in the Project

  • Data Loading: Initial step involves loading the dataset for analysis.
  • Data Exploration: Analyze statistics such as word repetition and common phrases used in phishing emails.
  • Data Processing: Includes lowercasing text, removing stop words, and preparing data for input into the BERT model. This will be detailed during coding implementation.

Data Splitting Strategy

  • The dataset will be split into 70% training, 10% validation, and 20% testing; these ratios can be adjusted based on experimentation needs.

Model Training and Evaluation

Training Process

  • Tokenization is necessary to convert data into a format suitable for BERT input before training begins.
  • Model training will involve continuous evaluation against validation data to ensure accuracy metrics meet desired thresholds (accuracy, precision, recall, F1 score).

Iterative Improvement

  • If initial results are unsatisfactory (low accuracy or poor metrics), retraining will occur until acceptable performance levels are achieved before saving the model for future use.

Understanding Dataset Characteristics

Importance of Data Analysis

  • A thorough understanding of the dataset is crucial; this includes knowing who collected it and what columns it contains (e.g., number of records).

Understanding the Dataset for Phishing Email Detection

Overview of Data Source

  • The dataset is sourced from a Kaggle account belonging to Nasir Abdullah, which contains comprehensive information about phishing emails.
  • The dataset's Kaggle usability score is 10 out of 10, indicating that all necessary information about the dataset has been provided.

Dataset Composition

  • The dataset consists of roughly 82,500 emails, including a large share of spam/phishing messages alongside legitimate ones, giving a reasonably balanced representation for analysis.
  • It is crucial to cite this dataset if used in publications or research work.

Selected Data File

  • The presenter selects a specific file related to phishing emails, which has a size of 106 MB and contains two columns: "text" (the email content) and "label".
  • The "label" column indicates whether an email is normal (0) or phishing (1), providing annotated data for training models.

Setting Up the Coding Environment

Environment Setup

  • Anaconda is used to create a new environment named 'torch', where all required libraries are installed for the project.
  • Users unfamiliar with creating environments can find tutorials on YouTube or request assistance through comments.

Project Structure

  • The coding section begins with an overview of what will be covered in the Jupyter notebook, including exploratory data analysis (EDA), data preprocessing, model training, evaluation, and saving models.

Implementing Code for Data Handling

Initial Steps in Coding

  • First tasks include creating a virtual environment in Anaconda and installing necessary libraries such as pandas and PyTorch for data manipulation and model training respectively.

Loading the Dataset

  • Using pandas to load the CSV file into a DataFrame (df), checking its shape reveals there are 82,797 records across two columns. This confirms successful loading of data into memory.
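The loading step can be sketched as follows. For a self-contained demo this snippet first writes a tiny two-column CSV in the same "text"/"label" shape as the Kaggle file; in the real project, `read_csv` would point at the downloaded dataset (the filename here is assumed):

```python
import pandas as pd

# Build a tiny CSV in the same two-column format ("text", "label")
# as the Kaggle file, then load it back. In the real project you would
# point read_csv at the downloaded Kaggle file instead.
demo = pd.DataFrame({
    "text": ["Your account is locked, click here", "Meeting at 3pm"],
    "label": [1, 0],
})
demo.to_csv("emails_demo.csv", index=False)

df = pd.read_csv("emails_demo.csv")
print(df.shape)    # (rows, columns) — confirms the load succeeded
print(df.head())   # preview of the "text" and "label" columns
```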

Sample Data Inspection

  • A preview of the first five rows using df.head() provides insight into how the dataset appears visually before further processing steps are taken. This helps confirm that it contains expected values and structure.

Understanding Data Sampling and Preprocessing in Machine Learning

Data Sampling for Computational Efficiency

  • The speaker discusses the need for high computational power when working with large datasets, specifically mentioning 82,000 rows. To avoid glitches during video recording, they opt to sample only 15,000 rows for explanation purposes.
  • The function .sample() is introduced as a method to randomly select records from a larger dataset. The speaker emphasizes that this selection is purely for simplicity in demonstration.
  • A seed number (random state 42) is used to ensure reproducibility of the random sampling process. The index is reset to start from zero after sampling.
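A minimal sketch of that sampling step, using a small demo frame in place of the 82k-row dataset:

```python
import pandas as pd

# Demo frame standing in for the full dataset. random_state=42 makes
# the random draw reproducible; reset_index(drop=True) renumbers the
# rows from zero after sampling, as described in the tutorial.
df = pd.DataFrame({"text": [f"email {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})

# In the project this would be df.sample(n=15000, ...) on 82k rows.
sampled = df.sample(n=10, random_state=42).reset_index(drop=True)
print(sampled.index[0])  # 0 — the index restarts after the reset
```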

Data Type Conversion and Preprocessing Steps

  • Upon inspecting the sampled data, it’s noted that some labels appear as floating-point numbers instead of integers. This discrepancy arises due to mixed data types in the original dataset.
  • The necessity of converting these floating-point labels into integers is highlighted as part of preprocessing before further analysis can be conducted.

Justification for Limited Preprocessing

  • Unlike typical NLP projects that involve extensive preprocessing (like removing URLs or special characters), the speaker explains their choice to limit preprocessing steps due to the nature of phishing emails.
  • They argue that retaining certain features such as URLs and special characters is crucial because these elements are often indicative of phishing tactics and help the model learn what fake emails look like.

Handling Null Values and Lowercasing Text

  • The speaker plans to remove null values from the dataset using df.dropna(), ensuring changes are made directly on the original dataframe by setting inplace=True.
  • A conversion step follows where label columns are updated from floating-point numbers back into integers, maintaining consistency across data types.
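Those two cleanup steps can be sketched like this, with a few fake rows (including a NaN) standing in for the real data:

```python
import numpy as np
import pandas as pd

# Labels read as floats, with an occasional NaN, are cleaned up as
# described: drop null rows in place, then cast the label column to int.
df = pd.DataFrame({"text": ["a", "b", "c", "d"],
                   "label": [1.0, 0.0, np.nan, 1.0]})

df.dropna(inplace=True)                 # remove rows with missing values
df["label"] = df["label"].astype(int)   # 1.0 -> 1, 0.0 -> 0

print(df["label"].tolist())  # [1, 0, 1]
```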

Function Implementation for Text Processing

  • A custom function named clean_text is defined to convert text entries into lowercase format. This standardization aids traditional models which typically perform better with lowercased inputs.
  • The .apply() function will be utilized on each email entry within a specified column, effectively applying the lowercase transformation across all records in one go.
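A minimal version of that function and its application, reflecting the tutorial's deliberately light preprocessing (URLs and special characters are kept as phishing cues):

```python
import pandas as pd

# Lowercase-only cleaning, applied to every email in the "text" column.
def clean_text(text: str) -> str:
    return text.lower().strip()

df = pd.DataFrame({"text": ["CLICK http://FAKE-BANK.com NOW!!!"],
                   "label": [1]})
df["text"] = df["text"].apply(clean_text)
print(df["text"][0])  # "click http://fake-bank.com now!!!"
```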

This structured approach ensures clarity in understanding how data sampling and preprocessing techniques are applied within machine learning contexts, particularly focusing on phishing email detection.

Data Exploration and Visualization Techniques

Understanding Data Frame Memory Utilization

  • The memory utilized by the data frame is presented in KB, providing insight into the dataset's size.
  • The shape function reveals the dimensions of the data frame, helping to understand its structure.
  • The head function displays the top rows of the dataset, allowing for a quick overview of its contents.

Analyzing Dataset with Graphical Representations

  • Introduction to graphical analysis as a method to explore data beyond statistical matrices.
  • Plans to plot graphs that illustrate various aspects such as email types, common words, and special character counts in emails.

Class Distribution Visualization

  • Using Seaborn's countplot function to visualize class distribution between legitimate and phishing emails.
  • The graph will show counts of unique values in specified columns, aiding in understanding email classifications.
  • Initial findings indicate approximately 7,000 legitimate cases versus around 8,000 phishing emails; overall dataset appears balanced.
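A sketch of that class-distribution plot on a demo frame mirroring the ~7k/8k split (the counts here are illustrative stand-ins):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Countplot of the "label" column: 0 = legitimate, 1 = phishing.
df = pd.DataFrame({"label": [0] * 7 + [1] * 8})

sns.countplot(x="label", data=df)
plt.title("Legitimate (0) vs Phishing (1)")
plt.savefig("class_distribution.png")

counts = df["label"].value_counts()
print(counts[0], counts[1])  # 7 8
```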

Special Character Analysis in Emails

  • A box plot will be created to compare special character usage between legitimate and phishing emails.
  • Code implementation involves filtering out alphanumeric characters and spaces from text when counting special characters.
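The counting logic can be sketched as below: a character is "special" if it is neither alphanumeric nor whitespace, and the per-class distributions are compared with a box plot (the sample emails are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "text": ["hello team, see you at 3", "$$$ WIN!!! click-> http://x.y"],
    "label": [0, 1],
})

# Count characters that are neither alphanumeric nor whitespace.
df["special_chars"] = df["text"].apply(
    lambda t: sum(1 for c in t if not c.isalnum() and not c.isspace())
)

df.boxplot(column="special_chars", by="label")
plt.savefig("special_chars.png")
print(df["special_chars"].tolist())  # [1, 12]
```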

Common Words in Phishing Emails

  • Focus on identifying frequently used words in phishing emails to understand tactics used for user engagement.
  • Utilizing NLTK library for natural language processing; stop words are filtered out during analysis for clarity.

Data Processing and Visualization of Phishing Emails

Filtering and Preparing Data

  • The process begins by selecting a data frame where the label is equal to one, focusing solely on phishing emails for analysis.
  • The text from these emails is converted to lowercase and split into individual words for further processing.
  • A filtration step is introduced to remove stop words and non-alphabetic characters, ensuring only relevant words are retained.
  • A loop iterates through each word in the list of phishing words, filtering out any that are not alphabetic or are considered stop words.
  • Valid words are stored in a dictionary, allowing for an organized collection of filtered terms.

Analyzing Word Frequency

  • A frequency count of the remaining valid words is conducted, identifying the top 20 most common terms used in phishing emails.
  • These terms and their counts are unpacked into two separate lists for plotting purposes, facilitating visual representation.
  • A bar plot is generated to display the most frequently used words in phishing emails, highlighting terms like "money" and "accounts."
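The frequency step above can be sketched as follows. The tutorial uses NLTK's stop-word list; a tiny hard-coded set stands in for it here so the example runs without downloads, and the sample emails are invented:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "to", "your", "is", "and", "in"}

phishing_texts = [
    "Your account is locked, verify your account now",
    "Claim the money in your account today",
]

words = []
for text in phishing_texts:
    for w in text.lower().split():
        w = "".join(c for c in w if c.isalpha())  # keep alphabetic only
        if w and w not in STOP_WORDS:
            words.append(w)

top = Counter(words).most_common(3)   # top-N terms with counts
labels, counts = zip(*top)            # unpack into two lists for plotting
print(top[0])  # ('account', 3)
```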

Visualizing Data with Word Clouds

  • The discussion transitions to creating a word cloud visualization that represents word frequency visually; larger words indicate higher usage rates.
  • Key terms such as "network," "daily," and "August" appear prominently in the word cloud due to their frequency in phishing communications.

Splitting Data for Model Training

  • After visualizations, attention shifts to splitting the dataset into training (70%), validation (10%), and testing (20%) sets for model development.
  • The train_test_split function from scikit-learn is used for this purpose; stratification ensures a balanced distribution of labels during data division.

Ensuring Balanced Distribution

  • The initial 80% of data will be further divided into 70% training data and 10% validation data while maintaining stratification across labels.
  • This technique helps prevent imbalanced distributions during model training, which can lead to biased results.
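The two-stage stratified split described above can be sketched like this: first hold out 20% for testing, then carve 12.5% of the remaining 80% (i.e., 10% of the original data) out as validation, leaving 70% for training:

```python
from sklearn.model_selection import train_test_split

texts = [f"email {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Stage 1: 80% temp / 20% test, stratified so the phishing/legitimate
# ratio is preserved in both halves.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42
)
# Stage 2: 12.5% of the 80% -> 10% of the original for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.125, stratify=y_tmp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 70 10 20
```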

Data Preparation and Tokenization Process

Splitting the Dataset

  • The split function takes 12.5% of the remaining 80% of the data (i.e., 10% of the full dataset) for validation, while the rest is used for training.
  • The validation data is structured as a list, with each email represented individually in this format.

Data Format Requirements

  • The current data format (list) is incompatible with the BERT model, which requires numerical input rather than text.
  • To prepare the data for BERT, it must be converted into numbers following a specific input format that BERT accepts.

Tokenization Steps

  • The tokenization process involves two main steps: normalization and formatting into BERT's required input structure.
  • Normalization converts all uppercase letters to lowercase and removes extra spaces from raw text.

Pre-tokenization and Model Understanding

  • Pre-tokenization breaks down sentences into smaller tokens; for example, "Hello" becomes one token.
  • A pre-trained model understands these tokens' meanings based on prior training on raw text data.

Post-processing Tokens

  • During post-processing, additional tokens are added: CLS (start of sentence) and SEP (end of sentence or separator between sentences).
  • This step ensures that models recognize sentence boundaries correctly during processing.

Utilizing Hugging Face's Tokenizer

  • A pre-trained bert-base-uncased tokenizer is employed to convert raw text into tokens effectively.
  • The tokenizer object processes training text data with parameters set for truncation and padding to maintain consistent input lengths across emails.

Handling Input Length Variations

  • Truncation shortens longer emails to a specified maximum token length (BERT accepts at most 512 tokens), while padding fills shorter emails with zeros up to that length.

Understanding Tokenization and Input Formatting in BERT Models

The Role of Tokenizers

  • The tokenizer processes raw data (e.g., emails) and converts it into a format suitable for the BERT model, specifically generating input IDs and attention masks.

Input IDs and Padding

  • Input IDs are the numerical values corresponding to the tokens in a sentence; a sequence typically begins with 101 (the ID of the [CLS] token) and ends with trailing zeros used as padding to match sentence lengths.

Attention Masks Explained

  • The attention mask informs the model which tokens to focus on during processing. Tokens marked with '1' are important, while those marked with '0' indicate padding that should be ignored.

Preparing Data for Model Training

  • While input IDs and attention masks are generated, labels are initially missing from the dataset. A function is being developed to include these labels in the correct format.

Class Structure for Dataset Management

  • A class named EmailDataset is created to manage training data formatting by taking encodings and labels as inputs. This class prepares data for both training and validation purposes.

Converting Data into Tensors

  • Within a loop, input IDs and attention masks are converted into PyTorch tensors. This conversion is essential for compatibility with the BERT model's requirements.

Finalizing Dataset Preparation

  • The dataset now includes input IDs, attention masks, and labels formatted as tensors. This structured dataset will be utilized for training the BERT model effectively.
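A minimal `EmailDataset` along the lines described, pairing encodings with labels and returning PyTorch tensors (the toy encodings below stand in for real tokenizer output):

```python
import torch
from torch.utils.data import Dataset

class EmailDataset(Dataset):
    """Wraps tokenizer encodings + labels in the format the
    Hugging Face Trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert each field (input_ids, attention_mask) to a tensor.
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy encodings standing in for real tokenizer output.
enc = {"input_ids": [[101, 2023, 102, 0], [101, 4931, 102, 0]],
       "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 0]]}
ds = EmailDataset(enc, labels=[1, 0])
print(len(ds), ds[0]["labels"].item())  # 2 1
```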

Training Strategies for BERT Models

Fine-Tuning vs. Training from Scratch

  • There are two approaches to train a BERT model: starting from scratch or fine-tuning an already pre-trained model. In this case, fine-tuning is chosen due to its efficiency.

Pre-Trained Model Advantages

  • The Hugging Face library provides a pre-trained BERT model that has been trained on extensive English datasets. Fine-tuning allows adaptation of this knowledge to specific tasks like email classification.

Characteristics of the Selected Model

  • The selected BERT model is "base" (indicating its size) and "uncased" (insensitive to letter casing), meaning it treats words identically regardless of capitalization, which makes it versatile across varied text inputs.

Understanding Fine-Tuning in Machine Learning Models

Overview of Model Parameters

  • The speaker discusses different model sizes, highlighting a base model with 110 million parameters and larger models with up to 340 million parameters.
  • Emphasizes the importance of fine-tuning a pre-trained model on specific datasets, using code provided for implementation.

Fine-Tuning Explained

  • The analogy of training a "parrot" (a complete language model) is used to illustrate how fine-tuning works by adapting an existing model to new data.
  • Fine-tuning allows leveraging existing knowledge from a pre-trained model while focusing on specific tasks, making it more efficient than training from scratch.

Training Process and Parameters

  • Discusses the necessity of defining labels for classification tasks, specifically mentioning two labels: normal emails and phishing emails.
  • Introduces evaluation metrics such as accuracy, precision, and recall that will be explained later in the context of assessing model performance.

Hyperparameters Configuration

  • Highlights the significance of hyperparameters in training models, including learning rate and number of epochs; encourages experimentation with these settings.
  • Describes the role of training arguments as configuration classes that dictate how models are trained, including output directories and batch sizes.
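A configuration sketch of those training arguments; every value here (learning rate, epochs, batch size, directory names) is illustrative and worth experimenting with, as the tutorial suggests:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./results",           # where checkpoints are saved
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",            # evaluate after every epoch
    # (older transformers versions call this `evaluation_strategy`)
    save_strategy="epoch",
    logging_dir="./logs",
)
```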

Utilizing Hugging Face Trainer

  • The Hugging Face Trainer is introduced as an engine that manages the entire training lifecycle, handling aspects like batching and tokenization.
  • Details how the trainer integrates various components such as models, datasets, tokenizers, and evaluation metrics into one cohesive framework for effective training.

Evaluation Metrics During Training

  • Explains how during training predictions are made against defined metrics which include precision, recall, F1 score, and accuracy calculations based on label predictions.
  • Concludes with an overview of how these evaluations inform adjustments during training to improve overall model performance before moving onto actual training.
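A `compute_metrics` function in the shape the Hugging Face Trainer expects, computing all four metrics mentioned above; the fake logits at the bottom are only a sanity check:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Receives (logits, labels); returns the metrics dict logged
    after each evaluation step."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # highest-scoring class per email
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

# Sanity check with fake logits for 4 emails (all predicted correctly).
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([1, 0, 1, 0])
print(compute_metrics((logits, labels)))  # all metrics 1.0
```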

Training a Model with Hugging Face Trainer

Overview of the Training Process

  • The training process begins with the trainer.train method, which initiates model training using predefined configurations and datasets.
  • This method loads the dataset and model, processes batches, computes losses, performs backpropagation to adjust model weights, and evaluates performance based on validation strategies set to epochs.
  • After each epoch, the model is evaluated; checkpoints are saved in a specified output directory along with logs of training metrics from validation data.

Training Duration and Results

  • The speaker trained a model on Google Colab for 60 epochs over approximately 1 hour and 38 minutes. Training time varies based on dataset size and computational resources.
  • Observations show that as epochs progress, training loss decreases while validation accuracy increases alongside F1 score, precision, and recall metrics.

Logging and Evaluation During Training

  • Logs were sent to Hugging Face's workspace during training for detailed tracking of metrics such as evaluation steps per second and precision calculations.
  • The website used for logging is wandb.ai (Weights & Biases), which provides comprehensive insights into the training process.

Post-training Evaluation Steps

  • After completing training, the next step involves evaluating the model against test data to understand how loss decreased over time across epochs.
  • The speaker extracts log history from the trainer to analyze evaluation loss and accuracy by saving relevant data points for further analysis.

Graphing Loss and Accuracy Metrics

  • A subplot is created to visualize both training loss and evaluation loss across epochs in one graph for comparative analysis.
  • Initial observations indicate that if loss does not decrease over epochs or increases instead, it suggests poor model performance. Conversely, decreasing loss indicates effective learning by the model.

Model Evaluation and Testing Process

Understanding Model Learning Progress

  • The model's learning effectiveness is demonstrated through its performance across different iterations, achieving a score of approximately 0.98 in the fourth iteration, indicating significant improvement.

Saving and Loading the Model

  • Instructions are provided for saving the trained model by uncommenting specific code lines, allowing users to save their progress easily. The process for loading the model is also outlined, requiring only the path to the folder.
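A save/load sketch using the Hugging Face `save_pretrained`/`from_pretrained` pattern. Here the raw pre-trained checkpoint stands in for the fine-tuned model (it is downloaded on first run), and "saved_model" is an assumed folder name:

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

# In the project this would be the fine-tuned model, not the raw one.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

model.save_pretrained("saved_model")      # weights + config
tokenizer.save_pretrained("saved_model")  # vocabulary

# Later: reload from the folder path alone.
reloaded = BertForSequenceClassification.from_pretrained("saved_model")
```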

Preparing Test Data

  • The next step involves testing the model on a 20% unseen dataset, which has not been previously encountered by the model. This raw data must be converted into a numerical format suitable for input into the BERT model using tokenization techniques.

Tokenization and Encoding

  • Tokenization is performed on text data with parameters such as truncation and padding set appropriately to generate encodings that include IDs and attention masks necessary for feeding into the model.

Evaluating Model Performance

  • Various libraries are imported to evaluate accuracy metrics including precision score, recall score, and classification report from sklearn metrics; these built-in functions simplify obtaining results after providing them with test data inputs.

Extracting Predictions

  • After running predictions on test data using the trainer function, true labels are compared against predicted labels to calculate accuracy scores along with precision, recall, F1 scores, and generating a comprehensive classification report showcasing overall performance metrics.

Confusion Matrix Analysis

  • A confusion matrix is generated to visualize actual versus predicted labels; it highlights true positives and true negatives, demonstrating high accuracy (around 98%) with only minor misclassifications.
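The evaluation described above can be sketched with scikit-learn's built-ins; the label arrays here are fake stand-ins for `trainer.predict` output on the test set:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # actual labels
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])  # predicted labels

print(accuracy_score(y_true, y_pred))    # 0.875 (7 of 8 correct)
print(confusion_matrix(y_true, y_pred))  # rows: actual, cols: predicted
print(classification_report(y_true, y_pred,
                            target_names=["legitimate", "phishing"]))
```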

Importance of F1 Score in Imbalanced Datasets

  • In cases where datasets are imbalanced between classes, relying solely on accuracy can be misleading; thus, utilizing F1 scores becomes crucial for evaluating models under such conditions effectively.

Final Remarks on Model Deployment

  • If performance metrics are satisfactory, with accuracy reaching up to 99%, the model can be saved for deployment in various applications or environments and loaded later for further testing on new, unseen data. Suggestions for custom projects or improvements are welcome at this stage.
Video description

Contact Information:
LinkedIn: https://www.linkedin.com/in/asimgulkhan/
Fiverr: https://www.fiverr.com/s/Q7LbNG2

With the rise of AI-generated scams, spotting fake emails is harder than ever. In this full tutorial, I’ll guide you step-by-step through building and deploying a Phishing Email Detection System using BERT, a powerful transformer-based model from Hugging Face.

💡 What You'll Learn:
1 - Loading and preprocessing an email dataset
2 - Exploratory Data Analysis (EDA)
3 - Tokenisation using BERT
4 - Fine-tuning BERT for binary classification (phishing vs. legitimate)
5 - Evaluating model performance (Accuracy, F1-score, Confusion Matrix)
6 - Saving and deploying the trained model
7 - Visualisations: WordCloud, special characters, and more

📌 This is perfect for students, researchers, final-year projects, dissertation ideas, or anyone curious about AI in cybersecurity!

Want a custom project in Data Science or programming? Contact me 👇
🔗 LinkedIn: https://www.linkedin.com/in/asimgulkhan/

Project links (blog + data and code): https://personal-tech-portfolio.onrender.com/blog/4
📁 GitHub: https://github.com/AsimGull/Data-Science-Projects/tree/c751d862cc31535d4e0d3077fb1febd0388e117a/Machine%20learning%20%26%20Deep%20learning/Cyber%20security%20with%20AI/Detect%20AI-Generated%20Phishing%20Emails%20with%20BERT

Book suggestions for you 🤩:
Malware Analysis: https://amzn.to/48Tpc1w
Malware Analysis with ML: https://amzn.to/3V7UJtu
Android Malware: https://amzn.to/3V7UJtu

#scam #scamemails #fraud #emailsecurity #emailscams #PhishingDetection #CyberSecurityAI #BERTModel #aiprojects #machinelearning #transformers #deeplearning #pythonprojects #nlp #huggingface #fulltutorial #finalyearprojects #studentprojectcenters #DissertationProject #cybersecurity #machinelearningprojects #academicprojects #UniversityProject #cybersecurityawareness #aiincybersecurity #HuggingFaceTransformers #researchprojects #computerscienceproject #deeplearningprojects