Projects

As part of the MEng (Structured) programme with a focus on Data Science, students must complete a final 60-credit data science research project in which they apply and consolidate the knowledge gained throughout the programme. Each student solves a real-world data science problem, provides solutions for each step of the data science project life cycle, and documents the work in a research assignment.

For these projects, we collaborate with industry and academic partners who are willing to propose a topic, provide the necessary data (if it is not publicly available), and act as domain mentors. The data set needs to be complete.

If you are interested in partnering with us for such a project, please contact [email protected] for further information about a short project proposal and deadlines.

Project proposals reviewed by the end of term 3 of a given year will be assigned to students for the following year.

Completed research assignments are listed below, grouped by year of graduation.

2026

March 2026 Graduation

Measuring and Evaluating Implicit Intent from User Action in Recommender Systems

Explicit feedback such as ratings gives clear preference signals but suffers from low participation and severe data sparsity, while implicit feedback captured through views, add-to-cart events, and browsing is abundant but ambiguous, since these actions don't directly equate to preference. This study first investigates how to map implicit signals to explicit ones to better understand user intent, then examines whether the resulting predicted preference signals can improve recommendation relevance.

Using the Neural Collaborative Filtering (NCF) framework on the Retailrocket e-commerce dataset (2.76 million events from 1.4 million users across 235,061 items), an aggressive quality-filtering strategy following CRISP-DM produced a curated dataset of 804 highly engaged users, 1,810 items, and 4,266 purchases. Five model variants were evaluated using leave-one-out evaluation under a challenging 1:99 positive-negative ranking protocol.

Baseline NCF models trained on sparse purchase data alone achieved only 16–19% Hit Rate at 10, barely beating random guessing. In contrast, a Feature-Enhanced NCF model incorporating views and add-to-cart actions achieved 97.98% HR@10 and 92.51% NDCG@10 — 5.1-fold and 10-fold improvements respectively — converging to a 97% hit rate after just one training epoch.
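
The ranking protocol described above can be made concrete with a minimal Python sketch: each user's held-out purchase is scored alongside 99 sampled negatives, and HR@10 and NDCG@10 are computed from the rank of that single positive. The function names and toy scores are illustrative assumptions, not the study's code. Under this protocol, a scorer that ranks uniformly at random would place the positive in the top 10 of 100 candidates about 10% of the time, which gives a reference point for the baseline figures.

```python
import math

def rank_of_positive(pos_score, neg_scores):
    # 0-based rank of the held-out item; ties count against it (a pessimistic assumption).
    return sum(1 for s in neg_scores if s >= pos_score)

def hr_and_ndcg_at_k(cases, k=10):
    # cases: one (positive_score, [negative_scores]) pair per user.
    hits, ndcg = 0, 0.0
    for pos_score, neg_scores in cases:
        rank = rank_of_positive(pos_score, neg_scores)
        if rank < k:
            hits += 1
            # With a single relevant item the ideal DCG is 1,
            # so NDCG reduces to the discounted gain at the positive's rank.
            ndcg += 1.0 / math.log2(rank + 2)
    return hits / len(cases), ndcg / len(cases)

# Toy data: the first user's positive outranks all 99 negatives,
# the second user's positive sits behind 20 negatives and misses the top 10.
toy_cases = [
    (0.9, [0.1] * 99),
    (0.5, [0.6] * 20 + [0.1] * 79),
]
hr10, ndcg10 = hr_and_ndcg_at_k(toy_cases)
```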

Property Tax Validation Using Automatic Building Footprint Extraction from Aerial Images

Accurate, automatic extraction of building footprints from aerial imagery matters for urban planning, municipal property tax validation, and infrastructure management. This study evaluates four deep learning semantic segmentation models — U-Net, DeepLabv3+, PSPNet, and Feature Pyramid Network (FPN) — for distinguishing formal from informal buildings in high-resolution aerial imagery, using an inherently imbalanced dataset where informal structures were significantly under-represented.

Performance was assessed using Intersection over Union, precision, recall, F1-score, and overall accuracy. All models converged well during training, with FPN outperforming the others on both building types, reaching 99.38% overall accuracy and F1-scores of 97.73% (formal) and 90.37% (informal).

U-Net and DeepLabv3+ performed strongly on formal buildings but noticeably weaker on informal ones, highlighting their sensitivity to data imbalance. PSPNet showed more balanced improvements across both classes, outperforming U-Net and DeepLabv3+ on informal buildings specifically, though it still fell short of FPN overall.
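
The evaluation metrics named above can be computed per class from pixel-level counts. The sketch below assumes flattened binary masks for a single class (for example, informal buildings); the function name and toy masks are illustrative and not taken from the study.

```python
def segmentation_metrics(pred, truth):
    # pred, truth: equally long sequences of 0/1 pixel labels for one class.
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / len(truth)
    return {"iou": iou, "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy masks: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
toy_pred = [1, 1, 0, 0, 1, 0]
toy_truth = [1, 0, 0, 0, 1, 1]
toy_scores = segmentation_metrics(toy_pred, toy_truth)
```

Because true negatives enter only the accuracy term, overall accuracy can stay high on an imbalanced dataset even when the F1-score of the rarer class is noticeably lower.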

Automated Characterization of Grapevine Bunches

This study develops a fully automated pipeline for grape bunch characterisation using computer vision, deep learning, and machine learning, replacing the inefficiency and subjectivity of manual characterisation by extracting and analysing grape bunch morphology from RGB images. The pipeline integrates object detection, segmentation, morphological feature extraction, and regression-based prediction in a single workflow.

A YOLOv11n model trained for grape bunch detection achieved precision and recall above 0.99, while a YOLOv11n-seg model used for berry segmentation achieved a mean average precision above 0.83. Extracted morphological features trained regression models for bunch weight, volume, and total berry count, with a random forest regressor performing best, achieving coefficients of determination above 0.84 for both weight and volume.

A self-organising map clustered grape bunches by cultivar and compactness based on extracted features, achieving an average cluster purity of 0.836 for cultivar classification, confirming that the extracted morphological features are strongly discriminative. The pipeline provides consistent, scalable analysis suitable for digital viticulture.
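
As a rough illustration of the cluster purity measure, the sketch below computes the share of each cluster taken up by its most common label and then averages across clusters. Whether the study averages per cluster or weights by cluster size is not stated, so the unweighted mean here is an assumption, and the cultivar names are placeholders.

```python
from collections import Counter, defaultdict

def cluster_purities(cluster_ids, labels):
    # Group the true labels by the cluster each sample was assigned to.
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        members[cid].append(label)
    # Purity of a cluster = fraction of its members carrying the majority label.
    return {cid: Counter(labs).most_common(1)[0][1] / len(labs)
            for cid, labs in members.items()}

def average_purity(cluster_ids, labels):
    # Unweighted mean over clusters (an assumption, see the note above).
    purities = cluster_purities(cluster_ids, labels)
    return sum(purities.values()) / len(purities)

# Toy example with placeholder cultivar names.
toy_clusters = [0, 0, 0, 1, 1]
toy_cultivars = ["A", "A", "B", "C", "C"]
toy_purity = average_purity(toy_clusters, toy_cultivars)
```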

Synthetic Data Generation for a Railway Defect Detection Model Using Nvidia Omniverse

This research evaluates whether a carefully controlled, simulation-based synthetic pipeline can replace large real-world datasets for exterior defect inspection on railway rolling stock, using visible graffiti on commuter train panels as a case study, given the shortage of well-annotated, repeatable imagery captured under real operating conditions.

A synthetic image generation pipeline built in NVIDIA Omniverse used 3D modelling, physically based rendering, and scripted variation in background, lighting, camera viewpoint, and clutter to model a modern electric commuter train, generating around 15,000 balanced colour images. A real validation set of 122 images and a real test set of 124 images assessed how well a model trained only on synthetic images generalised to reality.

An 18-layer residual network trained solely on synthetic images performed almost perfectly on synthetic validation data but near chance-level on real validation images. Adjusting sensor-aware augmentations, input resolution, and the volume of synthetic training images substantially improved real-world performance — the best configuration (all augmentations, 488×448 input, batch size 256) reached 80.7% accuracy, 83.9% recall, 78.8% precision, an F1 of 81.3%, and an AUC of 0.891 on the real test set. Saliency maps confirmed the model focused on graffiti regions rather than background clutter, showing that a well-designed synthetic-only pipeline paired with standard augmentation can substantially reduce reliance on large real datasets for railway exterior inspection.

Development of a Deep Learning-Based Diagnostic Tool for Identifying Paediatric Lymphobronchial Tuberculosis from Bronchoscopic Images

Tuberculosis is a leading global infectious cause of death, but paediatric diagnosis is challenging, and severe manifestations like lymphobronchial tuberculosis (LBTB) can cause airway compression that is often missed by external imaging and requires subjective, specialist-dependent bronchoscopic interpretation. This research develops a deep learning computer-aided diagnosis tool to objectively identify LBTB lymph node involvement and airway compression in children under five, aiming to meet the WHO's Target Product Profile for triage tests (sensitivity ≥90%, specificity >70%).

A novel multi-headed ensemble of five specialised convolutional neural networks, each trained on a specific anatomical region, was built alongside an image enhancement pipeline that uses CIE L*a*b* colour space decoupling and adaptive histogram equalisation; validation showed that this enhancement significantly improved sensitivity. A novel integer linear programming formulation curated balanced training datasets from imbalanced source data while preventing data leakage.

The final ensemble achieved a mean sensitivity of 93.90%, exceeding the WHO threshold, though mean specificity was limited to 42.50%. Interpretability analysis identified morphological confounds driving false positives, including shortcut learning on specular reflections and misinterpreting dark lumens as pathology. The work establishes a feasibility proof-of-concept for a high-sensitivity AI screening tool capable of ruling out severe LBTB, while showing that achieving human-level specificity requires further mitigation of optical artefacts.
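
The comparison against the triage thresholds quoted above (sensitivity of at least 90%, specificity above 70%) can be sketched as follows. The helper names and the confusion-matrix counts are illustrative assumptions; only the thresholds and the two reported means come from the abstract.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    # Sensitivity: share of diseased cases detected.
    # Specificity: share of healthy cases correctly cleared.
    return tp / (tp + fn), tn / (tn + fp)

def meets_triage_profile(sensitivity, specificity):
    # Thresholds as quoted in the abstract: sensitivity >= 90%, specificity > 70%.
    return sensitivity >= 0.90 and specificity > 0.70

# Illustrative counts only (not the study's data).
toy_sens, toy_spec = sensitivity_specificity(tp=90, fn=10, tn=40, fp=60)

# The reported means: sensitivity passes, specificity does not.
reported_passes = meets_triage_profile(0.9390, 0.4250)
```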

A Comparative Study of Shallow and Deep Ensemble Methods for Classification

Deep learning's growth since 2012 has fostered the assumption that deep ensemble (DE) methods should outperform shallow ensemble (SE) methods, though direct comparisons remain limited. This research asks whether SE methods can match DE predictive performance while requiring far less computational cost, restricting scope to supervised binary classification.

Six datasets across three benchmark domains were used to evaluate SEs (built via bagging, boosting, and stacking on a predefined set of base learners) against DEs (two literature-based ensembles plus a deep MLP bagging ensemble), using cross-validation and comparing accuracy, precision, recall, specificity, and F1 with Friedman and Nemenyi statistical tests.

No statistically significant performance difference was found between SE and DE methods. On the image classification dataset, the best SE reached 93.17% accuracy versus 89.68% for the best DE — notable since the SE used manually extracted features while the DE learned features automatically — though the difference wasn't statistically significant. SEs delivered roughly three times greater cost-efficiency (accuracy per unit training time), challenging the assumption that deeper ensembles consistently yield superior performance.

Multi-Objective Optimization in Recommender Systems

Recommender systems must balance competing goals — accuracy, diversity, novelty, user satisfaction, and revenue — that cannot be optimised independently, and traditional single-objective approaches fail to address these trade-offs. This study applies a non-dominated sorting genetic algorithm III (NSGA-III) to a hybrid recommender system, jointly optimising the hybrid filtering blend parameter α and the recommended item list across all five objectives.

Pre-optimisation results showed high α values reduced accuracy due to poor collaborative filtering, but NSGA-III optimisation produced Pareto-optimal solutions concentrated near α = 1.0, showing collaborative filtering excels when α and the recommendation list are co-evolved. The resulting Pareto fronts revealed trade-offs among the five objectives, with convergence confirmed via hypervolume and generational distance metrics stabilising after roughly 150 generations.

The findings show NSGA-III effectively discovers recommendation strategies balancing user experience and business value, underscoring the importance of multi-objective optimisation for recommender design and establishing a foundation for future work on personalised trade-offs and scalable deployment.
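
The notion of Pareto optimality used above can be illustrated with a minimal dominance filter. This is only the non-dominated check, not NSGA-III itself, which additionally relies on reference-point-based selection; all objectives are assumed to be maximised, and the two-objective toy points are hypothetical.

```python
def dominates(a, b):
    # a dominates b if it is at least as good on every objective
    # and strictly better on at least one (maximisation assumed).
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    # Keep every point that no other point dominates.
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical (accuracy, diversity) scores for four candidate recommendation lists.
toy_points = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]
toy_front = pareto_front(toy_points)
```

In this toy set, (0.4, 0.4) is dropped because (0.5, 0.5) is better on both objectives, while the remaining three points each represent a different trade-off.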

Bias Estimation and Evaluation in Recommender Systems

Recommender systems act as key gatekeepers of the digital economy, yet their reliability is often compromised by algorithmic biases that favour popular content over genuine user relevance, while traditional metrics like RMSE rely on observational correlations that can't distinguish true preference from systemic confounders. This thesis applies a causal inference framework to estimate and mitigate sociotechnical biases in a large-scale MovieLens dataset of 32 million ratings.

The research identifies User Scale Bias — the tendency of some users to rate generously or critically — as a bias artefact roughly ten times larger than the sociological biases of popularity or recency. A novel causal pipeline using Z-Score Normalisation to neutralise user scale, followed by Inverse Probability Weighting (IPW) for global bias estimation, reduced User Scale Bias by 95.1%, revealing a latent "true" preference signal.

The study also uncovers a "Novelty Inversion" (Simpson's Paradox), where the causal effect of niche content flips from negative to positive once user scale is controlled — showing that the "long tail" penalty often seen in recommender systems is a statistical illusion driven by critical niche audiences rather than actual content quality. A complexity analysis found Propensity Score Matching offers local precision but is intractable at production scale, while IPW is validated as a scalable solution capable of real-time bias mitigation without sacrificing directional accuracy.
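
The two steps of the pipeline described above can be sketched in plain Python under simplifying assumptions: ratings are z-scored per user to remove scale differences, and a normalised inverse-probability-weighted estimator compares treated and untreated outcomes. The `treated` indicator (for example, whether an item is niche) and the propensity scores are hypothetical inputs here; in practice the propensities would be estimated by a model, and the thesis may use a different estimator.

```python
from collections import defaultdict
from statistics import mean, pstdev

def zscore_by_user(ratings):
    # ratings: list of (user, rating). Each rating is centred and scaled
    # by that user's own mean and standard deviation.
    by_user = defaultdict(list)
    for user, rating in ratings:
        by_user[user].append(rating)
    stats = {u: (mean(v), pstdev(v)) for u, v in by_user.items()}
    return [(u, (r - stats[u][0]) / stats[u][1] if stats[u][1] > 0 else 0.0)
            for u, r in ratings]

def ipw_effect(outcomes, treated, propensity):
    # Normalised IPW estimate of the treated-minus-control mean outcome.
    # propensity[i] is the assumed probability that unit i is treated.
    w1 = [t / e for t, e in zip(treated, propensity)]
    w0 = [(1 - t) / (1 - e) for t, e in zip(treated, propensity)]
    mu1 = sum(w * y for w, y in zip(w1, outcomes)) / sum(w1)
    mu0 = sum(w * y for w, y in zip(w0, outcomes)) / sum(w0)
    return mu1 - mu0

# A generous and a critical user end up on the same scale after normalisation.
toy_ratings = [("generous", 5), ("generous", 4), ("critical", 2), ("critical", 1)]
toy_z = zscore_by_user(toy_ratings)

toy_effect = ipw_effect(outcomes=[2.0, 1.0, 4.0, 3.0],
                        treated=[1, 0, 1, 0],
                        propensity=[0.5, 0.5, 0.5, 0.5])
```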

Federated Learning with Few Clients: An Empirical Analysis of Heterogeneity Effects with Proportional Data Scaling

Valuable temporal data in domains like healthcare, finance, and sensor networks is often fragmented across organisational silos, and privacy regulations make it increasingly difficult to leverage collective expertise across organisations. Federated Learning (FL) enables collaborative learning without sharing sensitive data, but a persistent problem is statistical heterogeneity between participating organisations' datasets, which can degrade model performance and convergence stability — an issue little studied in the few-client setting, or where total training data scales with client count.

This thesis develops Proportional Pool Partitioning (PPP), where training data scales with client count, evaluating FL performance across two, five, and ten clients under varying heterogeneity conditions using GRU models on benchmark image datasets.

Results show moderate heterogeneity permits positive performance scaling with client count, but severe quantity and label skew causes two-client federations to outperform larger ones. Pairwise KL divergence showed strong predictive power within datasets, though performance stability degraded sharply under extreme heterogeneity. These findings establish PPP as a framework for few-client research and show that heterogeneity, task complexity, and federation size interact in ways that challenge conventional FL scaling assumptions.
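
One way to quantify label skew between clients is the KL divergence between their class distributions, averaged over all ordered client pairs. The sketch below is an assumption about how such a pairwise measure could be computed; the smoothing constant, the averaging scheme, and the toy client labels are not taken from the thesis.

```python
import math
from itertools import permutations

def label_distribution(labels, classes, eps=1e-9):
    # Class proportions with a small smoothing term so that absent classes
    # do not produce a division by zero in the KL computation.
    counts = [labels.count(c) + eps for c in classes]
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    # KL(p || q) in nats; note that it is not symmetric.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mean_pairwise_kl(client_labels, classes):
    # Average KL over all ordered pairs of clients.
    dists = [label_distribution(labels, classes) for labels in client_labels]
    pairs = list(permutations(dists, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# Two clients with identical label mixes versus two with disjoint labels.
homogeneous = mean_pairwise_kl([[0, 1], [0, 1]], classes=[0, 1])
skewed = mean_pairwise_kl([[0, 0], [1, 1]], classes=[0, 1])
```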

Optimisation of Hyperparameters in Deep Learning-Based Recommender Systems

This research evaluates whether multi-fidelity hyperparameter optimisation can speed up experimentation without sacrificing recommendation quality, using the Amazon Beauty dataset with two representative models — Neural Collaborative Filtering (NCF) and SASRec — optimised under both BCE and BPR loss functions, comparing BOHB (multi-fidelity) against single-fidelity Random Search and Bayesian Optimisation baselines.

An η sweep recomputed the Hyperband geometry for each η, keeping a consistent first-rung screening regime so η could be interpreted as a speed-quality control knob. Results show BOHB provides the strongest overall performance-efficiency trade-offs, with a Pareto analysis explicitly quantifying the speed-quality frontier — moderate η (6–7) tends to maximise iteration speed, while smaller η improves quality at the cost of longer optimisation time, with SASRec generally outperforming NCF in quality regimes.

The completed experiments show multi-fidelity BOHB delivers faster optimisation cycles while preserving strong recommendation quality, dominating single-fidelity baselines under combined speed-and-quality analysis, and providing a practical, controllable mechanism for accelerating deep recommender tuning without meaningfully reducing final ranking performance.
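
To show how η shapes the search schedule, the sketch below computes the standard Hyperband bracket geometry (the bracket structure that BOHB reuses) for a given maximum resource and η. It follows the published Hyperband formulas; the maximum resource of 81 is an illustrative value, not the study's budget, and how the study kept the first rung consistent across η values is not reproduced here.

```python
def hyperband_brackets(max_resource, eta):
    # Returns one list of (n_configs, resource_per_config) rungs per bracket.
    if eta < 2:
        raise ValueError("eta must be an integer >= 2")
    # Largest s with eta**s <= max_resource, found with integers
    # to avoid floating-point issues in a logarithm.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        # Initial number of configurations: ceil((s_max + 1) / (s + 1) * eta**s).
        n = -(-(s_max + 1) * eta ** s // (s + 1))
        # Initial resource per configuration in this bracket.
        r = max_resource / eta ** s
        # Each rung keeps roughly 1/eta of the configurations
        # and gives the survivors eta times more resource.
        rungs = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
        brackets.append(rungs)
    return brackets

# Illustrative budget: with eta = 3 there are five brackets,
# with eta = 6 only three, each pruning more aggressively.
brackets_eta3 = hyperband_brackets(max_resource=81, eta=3)
brackets_eta6 = hyperband_brackets(max_resource=81, eta=6)
```

A larger η discards a larger share of configurations at every rung, which shortens each optimisation cycle but gives weak early performers less chance to recover; this is consistent with reading η as a speed-quality control.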

Explainable Remaining Useful Life Predictions Using Ensemble Tree Models

Remaining useful life (RUL) models predict when an engineered asset will reach an unacceptable state, but most published work focuses only on deterministic predictions, neglecting the failure probabilities and interpretable explanations that risk-based decision-making requires. This study proposes ensemble tree models for RUL predictions suitable for decision-making, using the many estimators within the ensemble to obtain accurate confidence intervals, alongside a novel rule extraction method to explain low expected life.

On two benchmark datasets, a random forest regressor was compared against an LSTM network, a popular deep learning method for RUL prediction.

The random forest model achieved comparable predictive accuracy to the LSTM while providing better uncertainty estimates and lower computational cost, and offered interpretable rules distinguishing between domains of low, normal, and high expected useful life for an engineered asset.
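
The idea of using the individual estimators in an ensemble to express uncertainty can be sketched as follows: given one RUL prediction per tree for a single asset, an empirical interval and a failure probability are read off the spread of those predictions. This is one simple approach and not necessarily the study's method; the function names, percentile levels, and toy predictions are assumptions.

```python
def ensemble_interval(tree_predictions, lower_pct=5, upper_pct=95):
    # Empirical interval from per-tree predictions using nearest-rank percentiles.
    ordered = sorted(tree_predictions)
    n = len(ordered)

    def nearest_rank(pct):
        # Integer ceiling of pct * n / 100, converted to a 0-based index.
        idx = -(-pct * n // 100) - 1
        return ordered[min(n - 1, max(0, idx))]

    point_estimate = sum(ordered) / n
    return nearest_rank(lower_pct), point_estimate, nearest_rank(upper_pct)

def failure_probability(tree_predictions, horizon):
    # Share of trees predicting that the asset fails within the horizon.
    return sum(1 for p in tree_predictions if p <= horizon) / len(tree_predictions)

# Toy per-tree RUL predictions (in cycles) for one asset: 1, 2, ..., 100.
toy_predictions = list(range(1, 101))
toy_low, toy_mean, toy_high = ensemble_interval(toy_predictions)
toy_risk = failure_probability(toy_predictions, horizon=10)
```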

Training Neural Network Ensembles with Dynamic Multi-Modal Particle Swarm Optimisation to Address Concept Drift

Artificial neural networks struggle to maintain performance when the underlying data distribution changes over time (concept drift). Ensemble learning, which combines multiple models, offers one mitigation, and particle swarm optimisation (PSO) provides an alternative to gradient-based training. This research investigates multi-quantum swarm optimisation (MQSO), a dynamic multi-modal PSO variant, for training ANN ensembles under concept drift, evaluated against traditional gradient-based training across synthetic binary and multi-class classification problems of varying complexity.

Experiments across multiple drift scenarios showed algorithm suitability depends strongly on dataset characteristics and problem complexity: MQSO achieved superior performance on lower-dimensional problems with linear decision boundaries, while gradient-based methods with periodic retraining performed better on higher-dimensional, non-linear problems. Algorithms incorporating retraining consistently outperformed those without it, confirming the importance of adaptation under concept drift.

MQSO's multi-swarm architecture proved effective for tracking rapidly changing decision boundaries in simpler problems, though computational complexity limited its applicability to higher-dimensional datasets — providing empirical evidence for MQSO's effectiveness under specific conditions and practical guidance for algorithm selection based on dataset characteristics and drift patterns.

Agentic AI for Dynamic Travelling Salesman Problem with Drone Intercept: A Feasibility Study Against Metaheuristic Baselines

This research investigates whether a planner-validator-critic architecture driven by large language models (LLMs) can directly solve the dynamic travelling salesperson problem with drone intercept (DTSPDi), which models a truck-carried drone that launches and intercepts mid-route, including geometric interception constraints, vehicle synchronisation, and coordinate drift as customer locations change during operations.

A three-tier incremental protocol isolates where LLM capabilities break down, progressing from classical TSP to static TSPDi to full DTSPDi with coordinate drift. LLM planners propose routes via the Optimisation by PROmpting (OPRO) framework, while deterministic validators enforce constraints and critics provide feedback. The three tiers are benchmarked against Concorde, SaNSDE, and ACS-KT respectively, across twelve instances of 10–75 customers, and the final tier compares three mini LLMs (GPT-4o mini, GPT-4.1 mini, GPT-5 mini) against ACS-KT.

GPT-4.1 mini achieved mean delivery times 14.73% longer than ACS-KT at 91.0% feasibility; GPT-5 mini attained the highest feasibility rate (98.6%); GPT-4o mini offered the lowest cost but weaker feasibility and quality. The unified output schema influenced feasibility more than model choice, and the iterative planner-validator-critic loop enabled competitive performance within constrained budgets, though all models degraded beyond 50 customers. This indicates that LLM-based planners are not yet competitive end-to-end solvers for DTSPDi, but can serve as interpretable assistants or hybrid components alongside specialised algorithms.

A Neural Style Transfer Approach to Overcome Data Scarcity in Fruit Sorting Systems

South Africa's fresh fruit industry relies on automated sorting systems, but the deep learning models behind them need large labelled datasets, which are costly and impractical to collect for every fruit cultivar. This research investigates whether neural style transfer can address data scarcity by generating cultivar-specific visual characteristics for training fruit classification models, using peach images as content and a small set of plum images as style references to produce stylised synthetic plums via a pretrained CNN-based style transfer framework.

The synthetic images were used to train a classification model evaluated on real plum images. Neural style transfer produced visually plausible images but did not consistently preserve the fine defect details needed to distinguish conforming from non-conforming fruit, with the model achieving roughly 50% accuracy, 23% recall, and 0.60 AUC when transferring from synthetic to real images, and a confusion matrix showing a strong bias toward predicting conforming fruit.

A learning curve analysis showed neural style transfer works best in scenarios with limited labelled data — combining a small number of real labelled images with a larger set of synthetic ones improved performance over synthetic data alone; with only 25 labelled real images, adding synthetic images raised accuracy from about 50% to 59%. The research concludes that neural style transfer cannot replace real labelled data for fine-grained fruit quality classification but can serve as a complementary augmentation strategy when target-domain data is scarce.

Maritime AIS Anomaly Detection

This research develops an autoencoder-based anomaly detection system for maritime vessel surveillance using Automatic Identification System (AIS) data, addressing the challenge of automatically identifying anomalous vessel behaviours from large-scale AIS data streams while maintaining operational efficiency. Using over 2 million observations from a single day after quality filtering, a symmetric encoder-decoder architecture was built with feature engineering including cyclical encodings for directional and temporal variables.

The autoencoder achieved substantial training loss reduction with consistent generalisation and no overfitting. Using a percentile-based reconstruction error threshold, the system flagged approximately 1% of observations as anomalous, achieving over 20-fold separation in mean reconstruction error between normal and anomalous observations — substantially exceeding typical benchmarks for traditional statistical methods.

Feature-level analysis revealed interpretable anomaly patterns including directional deviations, speed variations, and temporal irregularities, and the system's processing efficiency validated its feasibility for near-real-time surveillance. Vessel-level aggregation identified over 200 vessels with persistent anomalous behaviours, providing actionable intelligence for targeted investigation while keeping alert volumes manageable for human operators.
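
Two ingredients of this pipeline lend themselves to a short sketch: cyclical encoding of directional or temporal variables, and a percentile-based threshold on reconstruction error. The code below is illustrative only; the 99th-percentile cut-off is assumed because roughly 1% of observations were flagged, and the error values are toy numbers rather than autoencoder output.

```python
import math

def cyclical_encode(value, period):
    # Map a periodic quantity (e.g. heading in degrees, hour of day)
    # onto the unit circle so that the ends of the range are neighbours.
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def percentile_threshold(errors, pct=99):
    # Nearest-rank percentile of the reconstruction errors.
    ordered = sorted(errors)
    idx = -(-pct * len(ordered) // 100) - 1
    return ordered[min(len(ordered) - 1, max(0, idx))]

def flag_anomalies(errors, pct=99):
    # An observation is anomalous if its error exceeds the threshold.
    threshold = percentile_threshold(errors, pct)
    return [e > threshold for e in errors]

# Headings of 359 and 1 degrees are 358 apart numerically
# but close together once encoded.
gap = math.dist(cyclical_encode(359, 360), cyclical_encode(1, 360))

# Toy reconstruction errors 1..100: only the largest one is flagged.
toy_errors = [float(e) for e in range(1, 101)]
toy_flags = flag_anomalies(toy_errors)
```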

Transparent Modelling Methods for Consensus of Efficiency Improvements

This report tests various interpretable, transparent, and accurate modelling techniques on data from a large industrial facility to model energy consumption at observed production levels, ultimately using these models to estimate the facility's Energy Efficiency (EE) by comparing baseline-period expectations against observed values during a claim period.

Rather than aiming for a single EE improvement value with minimal uncertainty, the study focuses on drawing consistent conclusions across multiple modelling methods, applying them to the full dataset, operational data, statistically cleaned data, and a dataset adjusted for operational boundaries.

This approach produced highly consistent results across all four datasets, highlighting the value of diverse modelling techniques in EE analysis for improving accuracy and confidence in conclusions. The study extends this further by testing an ensemble technique that combines traditional and transparent modelling approaches.

Transformer Models for Bone Fracture Detection and Classification from X-Rays

Accurate detection and classification of bone fractures from X-rays is essential for orthopaedic care, but radiographic interpretation remains challenging in busy clinics, particularly for subtle or complex fractures. While most existing automated systems rely on CNNs, this research evaluates transformer-based object detection models — ViTDet, Swin transformer, and RT-DETR — for fracture detection and classification, comparing them against a strong CNN baseline, YOLOv8-L, under limited clinical data conditions.

A structured experimental framework used annotated wrist radiograph datasets, training all four models under controlled conditions with variations in augmentation and class-imbalance handling, assessed using standard object detection metrics including mean average precision.

YOLOv8-L delivered the most reliable detection performance across all configurations. ViTDet was the strongest transformer-based detector and competitive under selected conditions but remained consistently below the CNN baseline. Transformer models showed pronounced sensitivity to class imbalance and limited data, contributing to greater variability, indicating that convolutional architectures currently preserve greater robustness for automated fracture detection in limited-data clinical settings.

To Go Deep, Or Not to Go Deep

Deep learning models have become the prevailing approach to image classification, often adopted without critically assessing whether such complexity is necessary. This research re-examines the assumption that deeper architectures are inherently superior, investigating whether shallow neural networks supported by informed preprocessing can match deep learning accuracy while remaining efficient and interpretable, using three datasets of increasing complexity: MNIST, Fashion MNIST, and CIFAR-10.

A structured, data-centric process combined exploratory data analysis, model-centric preprocessing, dataset-centric enhancement, and statistical validation, developing class-specific preprocessing pipelines and handcrafted features (edge, texture, colour descriptors) tailored to each dataset's visual attributes.

Across thirty independent runs, the optimised shallow networks achieved mean test accuracies of 98.67% on MNIST, 88.16% on Fashion MNIST, and 78.44% on CIFAR-10 — matching reported deep learning baselines on the first two, and remaining viable on CIFAR-10 when supported by targeted preprocessing, feature engineering, and ensembling. Statistical analysis confirmed these improvements were significant alongside reduced computational cost, challenging the presumption that depth is a prerequisite for high performance.

Incremental Class Learning Using Dynamic Meta-Heuristics

Incremental class learning (ICL) is a paradigm where disjoint groups of categorical information are sequentially presented to an optimisation algorithm over time, with the spatial severity and temporal frequency of these changes defining problem complexity. Static methods like stochastic gradient descent (SGD) work well in stationary environments but degrade under complex, dynamic optimisation problems, prompting research into population-based meta-heuristics — though little work has examined their performance specifically in ICL settings.

This study compares quantum particle swarm optimisation (QPSO), random immigrant genetic algorithms (RIGA), and dynamic evolutionary programming (DEP) as approaches to ICL, assessed using quantitative measures of complexity before and after environmental changes, with SGD at various reinitialisation ratios used as a baseline.

SGD with a 50% reinitialisation ratio produced superior stability after environmental changes but poor exploitation beforehand. QPSO achieved optimal stability and exploitation using a large, decaying quantum radius of 25. RIGA and DEP showed weaker stability but tracked optima effectively before each change — providing empirical evidence that ICL creates abrupt environments marked by large spatial shifts whenever a new category is introduced.

Utilizing Transformer Models for Analysing Unstructured Feedback in Recommender Systems

This study investigates how transformer-based embeddings and architectures affect review-based recommender systems that rely on unstructured text feedback, which is rich in context and sentiment but introduces methodological challenges due to noise and semantic complexity. The research compares static GloVe embeddings against contextual MiniLM-L6-v2 embeddings, and DeepCoNN against a lightweight transformer backbone (bert-tiny), on the Amazon All Beauty dataset across rating-prediction and three-class sentiment classification tasks.

MiniLM-L6-v2 consistently outperformed GloVe across all tasks, reflecting the value of contextual representations for nuanced and neutral sentiment. The best configuration, MiniLM paired with bert-tiny, achieved 82.30% classification accuracy and the lowest MAE (0.6673), while MiniLM paired with DeepCoNN achieved the lowest RMSE (1.0022), indicating stronger control of large errors — showing a trade-off between more consistent predictions (transformer) and better control of extreme deviations (CNN).

No configuration achieved balanced recall across negative, neutral, and positive classes: GloVe-based models behaved largely as polarity detectors, while MiniLM-based models improved neutral recognition but introduced other trade-offs, suggesting review ratings function more like a continuous scale than a discrete sentiment taxonomy. The study demonstrates the viability of lightweight transformer backbones under constrained computational resources for unstructured feedback analysis in recommender systems.
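
The difference between the two error metrics reported above can be seen in a small example: MAE averages absolute errors, while RMSE squares them first and therefore penalises occasional large misses more heavily. The two error lists are invented for illustration and do not represent either model in the study.

```python
import math

def mae(errors):
    # Mean absolute error: every unit of error counts equally.
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # Root mean squared error: large errors dominate because they are squared.
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Model A is always off by half a star; model B is usually exact
# but occasionally misses by 1.6 stars.
errors_a = [0.5, 0.5, 0.5, 0.5]
errors_b = [0.0, 0.0, 0.0, 1.6]

mae_a, rmse_a = mae(errors_a), rmse(errors_a)
mae_b, rmse_b = mae(errors_b), rmse(errors_b)
```

Model B has the lower MAE but the higher RMSE, which is the same pattern as a configuration that wins on MAE while another wins on RMSE.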

Leveraging Machine Learning in Learning Analytics Dashboard to Identify High-Performing Students in a Hybrid Online Master's Programme

As hybrid and online postgraduate learning grows, so does the need for data-driven tools supporting student success. Unlike most research that predicts at-risk students, this study takes a strengths-based approach — modelling high achievers' behaviours in a hybrid Master of Engineering Management programme to derive replicable best practices, drawing on engagement logs, assessment records, demographic attributes, and career-related information from a learning management system.

Feature engineering grouped over 40 event types into engagement dimensions such as assignment activity, discussion participation, content interaction, and reflection. A random forest classifier, chosen for interpretability, was tested across five configurations with different train-test splits and target definitions, using SMOTE to address class imbalance, including an academic-year-based split simulating real-world prediction of future cohorts from past patterns.

The best model achieved 81% accuracy and 90% recall in identifying high-performing students. Previous academic performance was the strongest predictor, followed by discussion participation, assignment engagement, and sustained content interaction, while demographic features like race and job title contributed minimally — suggesting engagement behaviour and academic history matter more than socio-demographic attributes. A proposed Learning Analytics Dashboard visualises these findings to support an ethical, behaviour-focused (rather than demographic-profiling) approach to predictive analytics in education.

A Particle Swarm Optimisation-Based Maximum-Margin Classifier

Training support vector machines conventionally relies on quadratic programming, which requires strict convexity and clean, balanced data — assumptions that real-world noisy, imbalanced, high-dimensional datasets often violate, increasing computational cost and reducing solver reliability. This research develops a particle-swarm-optimisation-based maximum-margin classifier as a solver-free alternative, restricting optimisation to Tomek link pairs near class boundaries to reduce effective problem dimensionality and improve margin estimation.

The framework extends to non-linear classification through kernel functions such as radial basis function and polynomial kernels, implicitly mapping data into higher-dimensional spaces to handle class overlap and noise.

Across seven benchmark and four synthetic datasets, the proposed classifier achieved classification accuracy and F1 scores comparable to, or exceeding, SVMs trained via quadratic programming. Although it requires more computation due to iterative fitness evaluations, it provides a flexible alternative that avoids assumptions of convexity, differentiability, and dual formulation.

Stock Selection with Machine Learning

This study evaluates whether modern multivariate time-series (MVTS) models can select U.S. equities for fixed holding periods and outperform a market proxy (the Vanguard S&P 500 Growth ETF, VOOG), framing stock selection as an MVTS classification task using historical accounting variables, engineered financial features, and market prices for roughly 480 S&P 500 constituents.

A task-aligned "long-reward" evaluation metric emphasises high precision subject to a minimum recall, penalising false positives among low performers to align model scoring with portfolio construction. Models underwent hyperparameter tuning and portfolio simulation across multiple random seeds, with statistical comparisons via Friedman/Iman-Davenport and Nemenyi tests.

Medium- to long-term horizons (8–24 quarters) showed stronger predictive signal than short horizons. TARNet, a task-aware reconstruction network, attained the best average rank overall and, deployed as a concentrated long-horizon satellite allocation overlaying a VOOG buy-and-hold core, improved central returns while remaining competitive on downside risk. The work contributes a transparent, task-aligned pipeline for evaluating MVTS models in equity selection, showing such models combined with simple benchmark exposures can deliver superior medium- to long-horizon performance.

Monitoring Explainability: A System to Detect Changes in Model Feature Importance and Underlying Relationships

Concept drift threatens machine learning models in financial environments, and traditional monitoring based on performance metrics or input distributions suffers from label latency, an inability to detect shifts in model reasoning, and insensitivity to subgroup fairness degradation. This study proposes an explainability-aware framework treating model explanations as primary drift indicators, integrating SHAP values with statistical detection across three engines: a SHAP-MMD detector for explanation stability, a concept drift monitor for feature relationships, and a fairness monitor for subgroup consistency.

Validated on Taiwan credit card data under synthetic drift scenarios, the SHAP detector achieved a 0.90-window mean detection delay — an 86.7% improvement over feature-based methods and 88.8% over performance-based approaches — while maintaining 68.6% precision.
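To make the MMD part of the detector more concrete, the following sketch computes a biased estimate of the squared maximum mean discrepancy between two windows of explanation vectors. The RBF kernel, the gamma value, the function names, and the toy vectors standing in for SHAP values are assumptions for illustration only; the study's actual detector may differ.

```python
from math import exp

def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length vectors."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_k(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)

# Hypothetical explanation vectors (one tuple per prediction).
reference = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1)]
same = list(reference)
shifted = [(1.1, 1.2), (1.0, 1.3), (1.2, 1.1)]

print(mmd2(reference, same))     # no change in explanations
print(mmd2(reference, shifted))  # larger value signals drift
```

In practice a threshold on this statistic (for example from a permutation test) would decide when a window counts as drifted; that step is omitted here.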

The fairness extension showed consistent sensitivity across demographic subgroups, with low variability in violation rates across sex and education attributes, confirming the framework avoids fairness blind spots during drift events. The work bridges explainable AI with operational monitoring, offering financial institutions a proactive tool for model governance and earlier, more actionable insight into model reasoning than conventional approaches.

The Evaluation of the Impact of Data Augmentation on the Performance of Fracture Detection and Classification Models

Deep learning increasingly supports medical image analysis, including detecting fractures in X-rays, but large labelled datasets are costly and hard to obtain, especially for rare conditions. This research addresses two challenges limiting wrist fracture detection: a scarce target dataset (774 images, half without fractures) and class imbalance across four fracture classes, where three classes make up half the fracture images, biasing models toward the majority class.

The study explores traditional data augmentation, mix-based augmentation (CutMix), and generative modelling (GANs and diffusion models) as solutions. The generative models produced images resembling wrist X-rays but lacking the fine anatomical detail needed to define fractures, misalignment, and displacement, making them unsuitable for object detection.

Traditional data augmentation and CutMix both improved YOLOv8 and YOLO11 object detector performance. Traditional augmentation achieved the highest mAP50 (56.51% on YOLO11, 53.55% on YOLOv8), while CutMix performed better on the stricter mAP50–95 metric (26.71% on YOLO11, 25.19% on YOLOv8), indicating better fracture localisation despite a lower mAP50.

Financial Fraud Detection in Credit Card Transactions with Ensembles and Explainable AI

Credit card fraud is escalating in scale and complexity, outpacing traditional rule-based detection and exposing financial institutions to operational and regulatory risk. This study develops a fraud detection framework integrating Ensemble Machine Learning with Explainable AI to achieve both strong predictive performance and transparent, defensible decisions, following the CRISP-DM methodology across two datasets: the imbalanced European Credit Card Fraud dataset and a simulated feature-rich transactional dataset for interpretability analysis.

Random Forest, Gradient Boosting, and a stacking meta-learner were trained and evaluated using precision, recall, F1-score, and ROC-AUC, with ensemble approaches consistently outperforming single-model baselines, yielding higher recall and fewer false positives — both critical for real-world fraud mitigation.

SHAP and LIME were incorporated to produce robust global and local explanations, enabling analysts, auditors, and regulators to validate model behaviour. The findings establish that a tightly coupled ensemble-XAI architecture is both technically superior and operationally viable for large-scale financial deployment.

ECG Arrhythmia Classification Using a Weighted Recurrent Neural Network Ensemble

Cardiovascular disease remains a leading global health burden, with arrhythmias a major contributor, making early, accurate detection from ECG signals essential — though class imbalance, signal noise, and inadequate temporal modelling continue to limit automated interpretation. This dissertation develops a reproducible pipeline for ECG arrhythmia classification using the Chapman-Shaoxing 12-lead dataset, combining advanced preprocessing, class-balancing, multiple deep learning architectures, and extensive comparative evaluation.

A multi-stage denoising strategy (Butterworth low-pass filtering, LOESS baseline correction, Non-Local Means smoothing) enhanced ECG morphology, with SMOTE-Tomek sampling and R-peak-aligned windowing addressing class imbalance. Several baseline and RNN architectures — CNNs, BiLSTMs, BiGRUs, and weighted ensembles — showed large gaps between accuracy and macro-F1 scores, revealing an inability to reliably discriminate minority arrhythmia classes under traditional architectures.

A Hybrid CNN-TCN-Attention model, combining local morphological feature extraction, long-range temporal receptive fields, and global salience weighting, substantially outperformed all earlier models, achieving 92.98% accuracy, 92% precision, 93% recall, and a 92% F1-score, and demonstrating better handling of imbalanced classes than the RNN ensemble approaches.

Automated Identification of Musical Instrument from Sound

Automatic musical instrument identification remains a challenging subfield of music information retrieval, hindered by limited dataset sizes and the underutilisation of hierarchical structures suited to this inherently hierarchical classification problem. This thesis combines two publicly available datasets into a larger, more diverse dataset for training and evaluation, then trains and evaluates a range of machine learning and deep learning models on it.

A hierarchical classification approach was proposed for traditional machine learning models, motivated by the prior success of binary classifiers in similar tasks.

Support vector machines benefited significantly from the hierarchical structure, achieving 98% precision, while the convolutional neural network surprisingly did not outperform the traditional models in this context — demonstrating the continued relevance of traditional machine learning alongside deep learning for instrument classification, with potential applicability to real-time audio classification systems.

Establishing Modern Data Architectures to Enhance Data Analytics and Data Science in South African Enterprises

Despite recognising data's strategic value, many South African enterprises remain constrained by fragmented, legacy-driven data environments marked by data silos, manual integration, and once-off analytics initiatives, limiting the long-term impact of analytics and inhibiting AI and ML operationalisation. While modern data architectures (cloud data lakes, lakehouse platforms, distributed analytics) offer clear potential, their adoption in South Africa has been uneven.

This study investigates the factors hindering South African organisations from establishing modern, scalable data architectures using a mixed-methods approach combining a literature review with an industry practitioner survey, applying binary clustering (Jaccard distance, agglomerative hierarchical clustering) to identify organisational profiles based on perceived barriers and enablers.
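The Jaccard distance used for binary clustering can be sketched as follows. This is a minimal illustration assuming each organisation is encoded as a 0/1 vector of reported barriers and enablers; the function name and the example responses are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero vectors are treated as identical.
    return 0.0 if either == 0 else 1 - both / either

# Hypothetical survey responses: 1 = factor reported, 0 = not reported.
org_a = [1, 1, 0, 1, 0]
org_b = [1, 0, 0, 1, 1]
d = jaccard_distance(org_a, org_b)
print(d)
```

A matrix of such pairwise distances is the usual input to agglomerative hierarchical clustering.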

Barriers such as legacy system dependence, data silos, skills shortages, governance immaturity, and budget constraints proved systemic and widely experienced, while enablers like leadership support and innovation orientation varied more. The results reveal a persistent gap between strategic intent and operational capability, and the study proposes a readiness-driven framework with tailored implementation roadmaps for entry-level, mid-level, and advanced organisations, favouring phased, context-aware modernisation over purely technology-led adoption.

A Critical Review and Analysis of Particle Swarm Optimisation Approaches for Multi- and Many-Objective Optimization

Particle swarm optimisation (PSO) has expanded into a diverse family of algorithms addressing multi- and many-objective optimisation problems. This research provides a critical review and empirical analysis of multi-objective PSO variants (dominance-based, decomposition-based, criterion-based, hybrid), alongside many-objective PSO algorithms including modified dominance formulations, reference-point-driven methods, and virtual Pareto-front modelling, evaluated on standard benchmark suites (ZDT, WFG for multi-objective; DTLZ, WFG for many-objective) using indicators like inverted generational distance and hypervolume.
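For readers unfamiliar with the concepts named above, the sketch below shows Pareto dominance and inverted generational distance (IGD) for minimisation problems. The function names and the three-point reference front are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from math import dist

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimisation)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def igd(reference_front, approximation):
    """Mean distance from each reference point to its nearest solution."""
    return sum(min(dist(r, p) for p in approximation)
               for r in reference_front) / len(reference_front)

# Hypothetical fronts: the approximation misses the middle of the front.
reference_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
approximation = [(0.0, 1.0), (1.0, 0.0)]

print(dominates((0.2, 0.2), (0.5, 0.5)))
print(igd(reference_front, approximation))
```

A lower IGD means the approximation is both close to and well spread along the reference front, which is why it is often reported alongside hypervolume.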

For multi-objective optimisation, PSO variants using competitive mechanisms or multiple complementary search strategies achieved the strongest convergence and diversity across both convex and complex Pareto-front geometries, while several decomposition-based algorithms showed sensitivity to problem structure and struggled with scalability.

For many-objective optimisation, algorithms integrating reference information, adaptive dominance modification, or virtual Pareto-front models delivered the most robust performance by preserving selection pressure as objectives increased. The findings offer a consolidated understanding of PSO behaviour across benchmark families and objective counts, providing practical guidance for algorithm selection.

Adaptive LLM Routing for Effective Agentic Workflows

Deploying large language models in enterprise settings raises challenges in balancing cost, latency, and data security, particularly for specialised agentic workflows like natural language to SQL (NL2SQL) query translation — high-capability proprietary models offer superior reasoning but at higher cost and privacy risk, while smaller open-source models reduce cost but may underperform on complex tasks. This research designs, develops, and evaluates a framework for dynamically routing and orchestrating multiple LLMs within an NL2SQL agentic workflow.

The framework accepts a natural language query, database connection parameters, and available tools, then uses three routing strategies to select a model from a curated multi-provider repository: a supervised classifier trained on historical performance, retrieval-augmented generation over similar past queries and their winning models, and a supervisory model reasoning over textual model descriptions. It outputs the query result alongside detailed cost, latency, and quality metrics.

The framework was validated on a curated dataset of natural language queries paired with verified database queries, using a comprehensive evaluation protocol combining automated LLM scoring and execution tracing, and is designed as a modular, extensible system to help domain experts and enterprise users obtain cost-effective, high-quality responses from agentic systems.

Inference of Soluble Solids in Pineapples Using Hyperspectral Imaging

In South Africa's pineapple juice concentrate industry, total soluble solids (TSS) content is assessed via destructive sampling from each delivery, a time-consuming, wasteful process based on a small, non-representative sample. This research investigates hyperspectral imaging (HSI) — which combines imaging with spectroscopy to capture spectral information per pixel — as a rapid, non-destructive alternative for estimating TSS in pineapples destined for concentrate production.

A total of 93 short-wave infrared and 82 visible near-infrared hyperspectral images of pineapple sections were acquired and pre-processed using five regimes to create ten datasets, comparing partial least squares regression (PLSR) against a one-dimensional convolutional neural network (1-D CNN) using cross-validation.

The best 1-D CNN, a spectral variant trained on the visible near-infrared dataset, achieved an RMSECV of 1.3308 °Brix and R² of 0.5640, while the optimal PLSR model on unprocessed short-wave infrared data achieved superior performance with an RMSECV of 0.9799 °Brix and R² of 0.7547. The results suggest that with a larger, more diverse training dataset, hyperspectral imaging could enable accurate, rapid, non-destructive TSS assessment in whole pineapples, addressing a key limitation in current quality evaluation practices.

2025

December 2025 Graduation

Evaluation of Self-Organising Maps for Defect Pattern Recognition and Quality Optimisation in Stainless Steel Manufacturing

Steel production involves complex manufacturing processes where defects often result from a combination of variable process parameters, and conventional monitoring approaches are limited in providing high-quality predictions and precise detection. While multimodal data integration has theoretical support, empirical validation in the steel industry is lacking. This study investigates whether adding text-based information from operator reports improves defect detection compared to using process parameters alone.

Several text representation methods — term frequency-inverse document frequency (TF-IDF), bag-of-words (BoW), and global vectors for word representation (GloVe) — were compared against a baseline without text features, using a dataset combining continuously captured numerical production parameters, unstructured text, and operator inspection data. A systematic, data-driven investigation used self-organising maps with multi-objective optimisation balancing topological preservation and classification performance.

The results indicate that including text features does not significantly improve classification performance, with no significant difference found between methods — reiterating the need to validate rather than assume performance improvements from a multimodal approach. A hybrid self-organising map (HybridSOM) analysis identified four process variables with the highest importance: continuous casting mould width, two secondary cooling flow zones, and the unground surface indicator. The results suggest industrial quality systems can prioritise sensor infrastructure over complex text analysis, since a focus on numerical process parameters provided sufficient robustness for defect detection without the added complexity of multimodal text integration.

Advancing Energy Forecasting: Evaluating and Enhancing Large Language Models for Individualized Consumption Prediction

Effective energy resource management requires strong forecasts able to handle changing and complex consumption patterns. In areas such as Tembisa, South Africa, where varied socioeconomic factors affect local consumption trends, traditional models struggle to fully capture complex dynamics and anomalies in time-series energy data. As IoT devices and meter manipulation grow more complex, large language models (LLMs) offer a promising alternative to conventional techniques. This study evaluates and improves the effectiveness of GPT-2 for energy consumption prediction, comparing it against a deep learning-based LSTM model and traditional forecasting models — ARIMA, support vector regression, and random forest.

Among all models, GPT-2 achieved the lowest mean squared error and the highest coefficient of determination (R²), indicating the best accuracy and generalisation. The benchmark models struggled with the Tembisa dataset's non-linear and temporal complexity: LSTM showed clear overfitting during unexpected consumption swings, and SVR and random forest generalised poorly. Data preparation handled missing values via linear interpolation and outliers via Z-score and interquartile range techniques. A custom PyTorch dataset class converted numerical time-series data into descriptive phrases to help the transformer model learn from the data more effectively, enabling GPT-2 to produce accurate forecasts across seasonal and temporal variations.
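The data preparation steps named above (linear interpolation, Z-score, and interquartile range) can be sketched with the standard library as follows. The function names, thresholds, and meter readings are hypothetical, and this is not the study's code.

```python
from statistics import mean, pstdev, quantiles

def interpolate_missing(series):
    """Fill interior None gaps linearly between known neighbours."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def zscore_outliers(values, threshold=3.0):
    """Indices whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [i for i, v in enumerate(values)
            if sigma and abs(v - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Indices outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical meter readings with gaps and one spike.
readings = [10.0, None, 12.0, 11.0, None, None, 14.0, 95.0, 12.0, 11.0]
filled = interpolate_missing(readings)
print(filled)
print(zscore_outliers(filled, threshold=2.5), iqr_outliers(filled))
```

The Z-score threshold is lowered to 2.5 in the call because, with the population standard deviation, a sample of ten points cannot exceed an absolute Z-score of 3.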

Refining GPT-2 with augmented datasets further improved its predictive adaptability, with MSE and R² improving on both validation and test datasets as training data was updated, showing that GPT-2 could adapt to changing consumption patterns through continuous learning. Using ARIMA, the study also examined anomaly identification via residual analysis and Z-score criteria, successfully identifying erratic consumption patterns that may indicate meter manipulation or unexpected consumption peaks. The results show the promise of transformer-based models like GPT-2 to outperform conventional approaches for complex and dynamic energy consumption datasets, offering an evolving method for enhancing energy management in areas such as Tembisa by tracking seasonal fluctuations and shifting consumption patterns.

March 2025 Graduation

Set-based Particle Swarm Optimization for Training Support Vector Machines

This research explores the application of set-based particle swarm optimization (SBPSO) to the training of support vector machines (SVMs), addressing challenges in hyperparameter tuning, noisy datasets, and computational efficiency. SVMs, celebrated for their classification precision, often face limitations due to their sensitivity to parameter selection and difficulties in handling high-dimensional or noisy data. SBPSO, an extension of traditional particle swarm optimization (PSO), is tailored for discrete optimization problems, making it a promising approach for optimizing SVM performance.

The study investigates two approaches: standard SBPSO-SVM training and SBPSO-SVM training with Tomek links preprocessing, which enhances data quality by reducing noise and refining decision boundaries. Experiments conducted on five benchmark datasets reveal that both methods significantly reduce the number of support vectors while maintaining competitive accuracy and F1 scores. However, training times were substantially longer than those of standard SVMs, highlighting a need for further optimization.

To address these challenges, dynamic control of SBPSO parameters was introduced, alongside advanced preprocessing techniques such as principal component analysis (PCA) with Gaussian mixture model (GMM) noise filtering and Wilson editing. While these enhancements improve training efficiency and performance for complex and noisy datasets, the algorithm still struggles to scale effectively to very noisy, large, and highly complex datasets.

This research contributes to the ongoing development of hybrid optimization frameworks, providing insights into balancing computational costs with classification performance. The findings underscore the potential of SBPSO-SVM as a robust tool for advancing machine learning applications in diverse, real-world scenarios.

Property tax validation using automatic building footprint extraction from aerial images

Property valuation is essential to determine the rates and taxes needed for municipal services. Property valuation depends on the number and size of buildings on a property. A tedious manual process is used to create the outlines of buildings and calculate the building area. Therefore, this project aims to develop a process to generate building outlines from unmanned aerial vehicle raster images with as little human intervention as possible. The solution developed uses semantic pixel classification to detect buildings and the building outlines. The outlines can then be used to validate property valuations.

To perform semantic pixel classification, a U-Net architecture was selected. Various experiments were conducted to find the optimal U-Net architecture. The output of the semantic pixel classification was used along with a contour extraction method to extract the building's outline. Similarly, experiments were conducted to select the optimal contour extraction method. The U-Net model and contour extraction method are combined to create a process capable of extracting building outlines from raster images.

Experiments were performed using a human-in-the-loop approach, a variant of active learning. The training results show accuracy, recall, precision, and intersection over union above 90%. Although the experiments showed excellent training and validation metrics, the project shows how critical the training data is to performance on test data and to the quality of image segmentation and building outline extraction. Finally, the process produces vector data that accurately represents 80 to 90% of buildings with an area error of less than one square meter.
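Intersection over union, one of the metrics reported above, can be illustrated on small binary masks. This is a minimal sketch with hypothetical 3x3 masks (1 = building pixel); the function name is an assumption.

```python
def iou(pred, truth):
    """Intersection over union of two equal-shape binary masks."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union

# Hypothetical predicted and ground-truth building masks.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 0],
         [0, 1, 0]]
score = iou(pred, truth)
print(score)
```

Here three pixels overlap out of five covered by either mask, so a handful of misplaced pixels on a small building already pulls the score well below 90%.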

Credit Scoring and Risk Assessment Using Machine Learning and Overdraft History

Credit scoring is a quantitative method by which lenders assess whether a borrower (either an individual or a business) is able to repay a debt if credit is granted. A credit score is typically generated at the end of the credit scoring process, and it is a fundamental element that influences an individual's access to credit. It acts as a gateway to financial resources such as loans and credit cards, highlighting the importance of fairness, non-discrimination, and ethical practices to ensure equitable access to credit, free from prejudice and bias.

Credit history is typically the key factor in traditional scoring methods, including the FICO score, logit models, and expert judgment-based models, among others. As a result, individuals who have never borrowed may be overlooked or subjected to high-interest rates. To address these limitations, this study leverages overdraft information to develop a dynamic, inclusive, and effective credit scoring framework. This framework integrates both traditional credit history data and overdraft data, which is often underutilized but can potentially serve as an indicator of good versus bad borrowers. Additionally, the literature does not identify which machine learning method is best suited for credit scoring tasks. To overcome this uncertainty, the following algorithms are trained: KNN, Naive Bayes, Decision Trees, ANN, and SVM, to predict bank customers that are likely to default or not default on credit using three distinct datasets: overdraft, credit history, and a combination of both. The performance of these algorithms is evaluated to determine the most accurate predictive method.

After hyperparameter tuning across the algorithms, the results of this study suggest that Naive Bayes is particularly effective when both credit history and overdraft data are available, as it demonstrated minimal misclassifications and robustness in classifying customers correctly. The algorithm performed best on all three tested datasets, achieving accuracy rates of 99.01% for the credit history dataset, 99.5% for the hybrid dataset, and 100% for the overdraft dataset. KNN also performed well, with accuracy rates of 98.93% for credit history, 99.3% for the hybrid dataset, and 99.97% for overdraft.

Additionally, a comparison of the overdraft credit scores versus credit history scores indicated that overdraft-based scores reflect a more optimistic distribution, with a significant reduction in the percentage of customers categorized as poor when using overdraft data alongside credit history. The combination of both datasets resulted in more accurate credit assessments, increasing the number of customers qualifying for credit approval. Specifically, 75% of customers qualified using the combined dataset, compared to 65% with overdraft data and 45% with credit history alone.

The results of this study offer new perspectives for financial institutions that traditionally rely solely on credit history data to profile individuals. The approach has the potential to shift lending and borrowing practices. If successfully adopted, it could benefit both lenders and borrowers: individuals often denied credit due to a lack of credit history would no longer be excluded, while lenders could improve their decision-making and potentially increase profitability.

Machine Unlearning of Convolutional Neural Networks to Address the Right to Be Forgotten

This research assignment examines whether personally identifiable information can be removed from a convolutional neural network using a machine unlearning algorithm and verified as removed to ensure compliance with the right to be forgotten as outlined in the General Data Protection Regulation. Machine unlearning examines whether data removal can be achieved while preserving model performance, without fully retraining the model.

In this research assignment a convolutional neural network is trained on facial images. The performance of the convolutional neural network before and after applying a machine unlearning algorithm is then established. The evaluation examines the extent of data required for machine unlearning, such as whether a single image, multiple images, or all images used during training are necessary to remove the presence of data associated with an individual.

Machine unlearning demonstrated effectiveness in removing specific data from the convolutional neural network, as measured by a membership inference attack. The machine unlearning algorithm, which utilises Kullback-Leibler divergence and weight regularisation, enabled the removal of data for a single individual as well as for a forget set composed of a sampled group of individuals without requiring full retraining. The study shows that unlearning can be successfully achieved while preserving the generalisation capabilities of a convolutional neural network.

Towards an automated medical image classification pipeline

Radiological departments have high demands for efficiency and diagnostic quality, and interpretation of radiographs is highly variable between radiographers. The process followed in a radiological department to support patients with health services can be made more efficient. Parts of the process, such as retrieving and processing data, can be automated with artificial intelligence to expedite the process and increase the quality of services offered.

Deep learning is a subfield of artificial intelligence, and transfer learning is a subfield of deep learning. Transfer learning can be applied to image classification tasks to improve predictive accuracy. Medical images cover several modalities such as X-rays, ultrasound, magnetic resonance imaging, and angiographs, amongst others. Several transfer learning methods are compared for the two components of a classification pipeline. The first component is a machine learning model that predicts the modality of a medical image. The second component is a machine learning model that predicts the body part shown in the image.

This research assignment covers the creation of a medical imaging dataset that is sourced from open source datasets. A variety of transfer learning models such as residual neural networks, dense neural networks, and efficient neural networks are evaluated on this dataset. The results of this research assignment show that lightweight transfer learning methods can successfully be applied to perform classification on medical imagery. The best performing models of both components are combined in a transfer learning classification pipeline. The transfer learning pipeline produced a predictive accuracy of 96.3034% on testing data.

Evolving Oblique Decision Trees

This study investigates the induction of classification oblique decision trees using genetic programming, with constraints imposed on the genetic operators and the fitness function. Additionally, the study examines the effect of introducing pre-defined genetic programs in the initial population of the evolutionary process on the performance of the genetic programs in solving classification tasks. The pre-defined individuals in the initial population were generated by leveraging clustering techniques and methodology inspired by the CLine decision tree.

The goals were achieved by developing constrained genetic programs to induce oblique decision trees. The results demonstrate that using genetic programming with applied constraints for classification purposes is feasible and results in decision trees that perform exceptionally well compared to standard axis-aligned and oblique decision trees, albeit at the cost of increased computational resources. Results from the experiment also highlight that the overall performance of genetic programming-based algorithms relies more heavily on the evolutionary process itself than on techniques that diversify the initial population.

Horticulture Supplier Delivery Forecast

Supermarket retailers rely on suppliers to meet customer demands, but suppliers often face disruptions that prevent them from delivering the agreed quantities. This is true in the horticultural sector, where weather and logistical challenges affect delivery reliability. Accurate forecasting of horticultural supplier deliveries is critical for supermarket retailers, as fresh fruit is a key source of revenue. This highlights the need for improved forecasting methods that use predictive analytics to improve forecast accuracy.

The main objective was to develop a predictive analytics solution to forecast deliveries from horticultural suppliers, focusing on fresh fruit. The research aims to help retailers align supply with demand, reducing stock shortages and managing variability in deliveries. The study employs machine learning models trained on 24 months of historical data, incorporating derived features that represent factors influencing delivery reliability. The models, including a baseline model, are evaluated over a 6-month period, using 69 exclusive suppliers and 32 product types.

The research assignment found that the majority of the models outperformed the baseline, with random forest and GRU models performing the best based on standard evaluation metrics. The baseline model achieved a mean absolute error (MAE) of 30.35, while the random forest model reduced the MAE to 0.47, demonstrating a significant improvement in forecasting accuracy. The findings show that the integration of predictive analytics and the incorporation of influential factors address key challenges faced by retailers, such as inconsistent supplier deliveries, and can improve forecast visibility and customer satisfaction. This study contributes to predictive analytics in the horticulture supply chain, highlighting the importance of integrating factors to optimise forecasting.

Behavioural Scorecards Development and Machine Learning

This study compares traditional behavioural scorecards based on logistic regression (LR) with machine learning (ML) for credit risk assessment. The study aims to improve predictive performance while maintaining model interpretability to comply with Basel regulatory standards. To achieve this, the study introduces the Bayesian Weight of Evidence Optimizer (BWOpt) for binning optimization in LR models and proposes the interpretable prepruned penalized logistic tree regression (P-PLTR) alongside RuleFit. It also explores the effects of sampling strategies (undersampling and oversampling) on model performance with imbalanced datasets.
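As background for the binning step, the quantity that scorecard binning is usually built around can be sketched as follows. This is a minimal illustration of Weight of Evidence (WoE) and information value for a fixed set of bins, not the BWOpt optimiser itself; the `ln(share of goods / share of bads)` sign convention, the 0.5 smoothing constant, and the function name are assumptions made for the example.

```python
import math


def weight_of_evidence(bins):
    """Return (WoE per bin, information value) for a list of (n_good, n_bad) counts.

    Assumed convention: WoE = ln(share of goods / share of bads) per bin.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    smoothing = 0.5  # assumption: avoids log(0) when a bin has no goods or no bads

    woe_values = []
    information_value = 0.0
    for good, bad in bins:
        share_good = (good + smoothing) / (total_good + smoothing * len(bins))
        share_bad = (bad + smoothing) / (total_bad + smoothing * len(bins))
        woe = math.log(share_good / share_bad)
        woe_values.append(woe)
        # Each term is non-negative, so the information value grows with separation.
        information_value += (share_good - share_bad) * woe
    return woe_values, information_value


# Hypothetical counts for three bins of one behavioural feature.
woe, iv = weight_of_evidence([(90, 10), (50, 50), (10, 90)])
```

A binning optimiser would search over bin boundaries and score each candidate binning with a criterion such as the information value computed here.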

Results show that traditional scorecards outperform ML models, particularly with oversampled data. While RuleFit and P-PLTR show competitive performance with undersampling, P-PLTR suffers from instability in rule sets. BWOpt-enhanced LR models outperform both ML methods, highlighting the value of feature engineering. These findings align with existing literature, which suggests that ML models do not significantly outperform statistical models such as LR in structured data, though ML may offer advantages with unstructured data. Given their balance of interpretability and predictive power, traditional scorecards remain well-suited for regulated environments.

Evaluating Heterogeneous Graph Embeddings for Product Substitute Identification with LLM-Generated Attributes

In the context of the food retail sector, the identification of product substitutes is crucial for several reasons, including the determination of the assortment of store products, the design of marketing campaigns, the promotion of items, and the avoidance of potential cannibalisation when introducing new products. Given the extensive range of products and categories, understanding product relationships and consumer purchasing behaviours is essential. Product relationships can be categorized as complements, substitutes, or irrelevant product pairs. This study seeks to investigate product substitutes through the process of product clustering. The first of three research objectives is to determine whether the use of product attributes leads to the formation of consistent and informative product groups. The second objective aims to determine whether usable and accurate product attribute values can be derived from product descriptions using large language models (LLMs). The final objective is to evaluate the impact of the LLM-generated product attribute values on the formation of substitute product groups.

To determine product attributes, a combination of structured and unstructured data sources is utilised from a prominent South African food retailer with the intention to elucidate product substitute relationships while integrating a degree of explainability derived from the product attributes. The framework known as Product Attribute Value Extraction (PAVE) offers a prompt engineering template as an efficient method for extracting explicit and implicit attributes from product descriptions with an accuracy level of up to 85% in this study depending on the chosen model. Although overall accuracy is high, it varies by attribute: some attributes have a significantly lower extraction accuracy for most of the models tested. The LLMs could be fine-tuned further for use-case-specific tasks, allowing for even higher accuracy.

In the pursuit of identifying product substitutes, product attributes are utilised in conjunction with transaction data to capture purchasing behaviours. Various graph embedding and graph clustering models are evaluated to identify a model that can fulfil the dual objectives of substitutability and explainability. A heterogeneous graph embedding is chosen for conducting the substitutability analysis, in combination with similarity-based and graph-based clustering algorithms. The heterogeneous model is selected due to its higher potential for offering context-specific explainability amidst the continuously evolving domain of product relationships. Experiments are conducted to attempt to cluster products into substitute categories that correspond with both the retailer's groupings and those of a baseline model.

The findings indicate that the use of product attributes does not constitute the most effective and scalable approach to achieve substitute product categorisation. This limitation arises from the inherent sensitivity of heterogeneous graphs to both configuration settings and input data, which may require tailored and context-specific model calibrations. Further investigations are warranted to explore the potential integration of product attributes into heterogeneous graph embeddings for substitute categorisation. Alternatives could include knowledge graphs and link prediction, or the adaptation of the PAVE framework to facilitate the extraction of product substitutes from a list, potentially enriched with external data.

Deriving an Agricultural Soil Quality Index from Soil Microbiome using Autoencoders

Soil quality plays a pivotal role in sustaining ecosystems, influencing climate change, and supporting agricultural productivity. Degradation of soil can severely threaten food security and exacerbate global warming. Current definitions and indices for assessing soil quality concentrate on a single soil function or fail to consider the important interrelationships and dynamics between soil properties. Principal component analysis is commonly used to establish a soil quality index through additive or weighted additive models. Principal component analysis is, however, inadequate when nonlinear relationships or high correlation exist among variables. Moreover, additive methods require prior knowledge of how specific soil properties impact quality without considering interdependencies among them. These limitations complicate the integration of the soil microbiome into a soil quality index. Given the complexity and diversity of microbial communities in soil, there are limited studies that define soil quality from a microbial perspective. Yet, the soil microbiome is essential for maintaining soil functionality and preventing degradation. Thus, there is a need to develop a soil quality index that incorporates microbial activity to enhance food security and promote sustainable agriculture.

This study proposes the use of autoencoders to develop a soil quality index derived from soil microbiome data. To address the high dimensionality of the microbiome dataset, four feature selection techniques — principal component analysis, Pearson correlation, agglomerative hierarchical clustering, and Louvain community detection — were implemented to generate minimum datasets which were used to train various autoencoder designs. The output from the autoencoder's bottleneck layer was used to derive a soil quality index, which was evaluated against microbial diversity indices.

The soil quality index showed a strong correlation with the Chao1 diversity index and moderate correlations with the Shannon and Simpson diversity indices. Among the minimum datasets used, the dataset generated using agglomerative hierarchical clustering produced a soil quality index with the highest correlations to microbial diversity indices. The soil quality index derived using a sparse autoencoder was particularly favoured due to its simplicity, as it reduces to a sigmoid function during inference, enhancing explainability and interpretability.
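The three diversity indices used for evaluation can be computed from a vector of per-taxon abundance counts. The sketch below is illustrative only: it assumes the bias-corrected form of Chao1 and the Gini-Simpson form of the Simpson index, and the abstract does not state which variants the study used.

```python
import math


def shannon_index(counts):
    """Shannon diversity: -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson diversity: 1 - sum(p^2). Assumed variant."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1_index(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical abundance counts for one soil sample.
sample = [5, 1, 1, 2, 0]
richness = chao1_index(sample)  # 4 observed taxa + 2 * 1 / (2 * 2) = 4.5
```

Chao1 estimates richness (how many taxa are present, including unseen ones), whereas Shannon and Simpson also reflect evenness, so different correlation strengths against the three indices are plausible.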

Incremental Feature Learning: A Constructive Approach to Training Neural Networks with Dynamic Particle Swarm Optimisation

Incremental feature learning (IFL) is a supervised machine learning (ML) paradigm for feedforward neural networks (NNs), where the input layer of the NN is incrementally constructed over time. The benefits of such a paradigm are twofold; the first is the ability afforded to a NN to dynamically incorporate new features as they become available over time without the need for retraining; the second is a reduction in overfitting behaviour and model complexity, and hence improved NN generalisation ability. A feature ranking approach based on feature importance is used to determine the order in which features are integrated into the model.

The incremental addition of features to a NN results in a dynamic optimisation problem (DOP); more specifically, a DOP with dimensionality expansion, where both the surface and the dimensionality of the search space evolve over time. Particle swarm optimisation (PSO) is an established method for training feedforward NNs, and has been shown in multiple studies to outperform traditional backpropagation (BP). Modified PSO algorithms have been developed to deal with dynamic environments, and have been successfully applied to train feedforward NNs in dynamic environments. This study adapts various dynamic PSO variants for use in DOPs with dimensionality expansion.
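To make dimensionality expansion concrete, the sketch below grows every particle by one or more coordinates and then runs a standard inertia-weight PSO step. It is a simplified illustration and not one of the adapted dynamic PSO variants from the study: the dictionary layout, the initialisation range for new coordinates, and the coefficient values are assumptions, and a sphere function stands in for the network training error.

```python
import random


def make_swarm(n_particles, n_dims, fitness, rng):
    """Create particles with random positions and zero velocities."""
    swarm = []
    for _ in range(n_particles):
        position = [rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
        swarm.append({
            "position": position,
            "velocity": [0.0] * n_dims,
            "best_position": position[:],
            "best_fitness": fitness(position),
        })
    return swarm


def expand_swarm(swarm, n_new, fitness, rng, init_range=0.5):
    """Append n_new dimensions to each particle and refresh stale personal bests."""
    for particle in swarm:
        new_coords = [rng.uniform(-init_range, init_range) for _ in range(n_new)]
        particle["position"].extend(new_coords)
        particle["velocity"].extend([0.0] * n_new)
        particle["best_position"].extend(new_coords)
        # The search space changed, so the stored best fitness must be re-evaluated.
        particle["best_fitness"] = fitness(particle["best_position"])


def pso_step(swarm, fitness, rng, w=0.729, c1=1.494, c2=1.494):
    """One synchronous global-best PSO update (lower fitness is better)."""
    global_best = min(swarm, key=lambda p: p["best_fitness"])["best_position"][:]
    for particle in swarm:
        for d in range(len(particle["position"])):
            r1, r2 = rng.random(), rng.random()
            particle["velocity"][d] = (
                w * particle["velocity"][d]
                + c1 * r1 * (particle["best_position"][d] - particle["position"][d])
                + c2 * r2 * (global_best[d] - particle["position"][d])
            )
            particle["position"][d] += particle["velocity"][d]
        value = fitness(particle["position"])
        if value < particle["best_fitness"]:
            particle["best_fitness"] = value
            particle["best_position"] = particle["position"][:]


def sphere(x):
    """Placeholder objective; in IFL this would be the NN training error."""
    return sum(v * v for v in x)


rng = random.Random(0)
swarm = make_swarm(n_particles=5, n_dims=2, fitness=sphere, rng=rng)
expand_swarm(swarm, n_new=1, fitness=sphere, rng=rng)  # a new feature arrives
pso_step(swarm, sphere, rng)
```

Re-evaluating the personal bests after expansion matters because fitness values stored before the change refer to a different search space.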

The adapted dynamic PSO variants are used to train incrementally constructed NNs (INNs) using the proposed IFL framework, and the results are compared to those of fully constructed NNs (FNNs) trained using traditional BP and standard PSO on a complete dataset. Experiments were conducted on fifteen diverse datasets spanning regression and classification tasks. The results show that IFL effectively enables NNs to dynamically incorporate new features as they become available over time, and that IFL provides desirable performance in terms of overfitting behaviour and can be used as a regularisation technique.

Grading Infrastructure Conditions through Machine Learning using Infrastructure Report Cards and Media Reports

Public infrastructure is of critical importance to advance job creation, equity, sustainable development, and economic growth, yet there is a lack of information regarding infrastructure conditions to enable informed infrastructure investment decisions in South Africa. The South African Institution of Civil Engineering publishes Infrastructure Report Cards where ratings are applied to different infrastructure sectors based on factors such as condition, capacity, and performance. A general lack of information, however, restricts the compilation of infrastructure report cards. Online news articles are targeted as an alternative data source to compile infrastructure report cards due to their availability, real-time and geographical coverage, as well as insights into the socio-political infrastructure issues which are not adequately captured in technical reports. However, online news articles do not rate infrastructure conditions explicitly, which makes it difficult to summarise and extract findings. In this research assignment, a machine learning model that automatically rates infrastructure conditions from online news articles is developed.

A cross-domain modelling approach was adopted where the knowledge gained from training machine learning models on the source domain was utilised to make predictions on the target domain. Label descriptions from The South African Institution of Civil Engineering and the American Society of Civil Engineers infrastructure report cards were collected and compiled to form a source domain, while extracted online news articles were used as a target domain. An in-domain modelling approach was adopted to determine the feasibility of the datasets. In the cross-domain modelling approach, six machine learning models were trained on the scorecard dataset and evaluated on an annotated sample of the online news article dataset. The six models included three ordinal regression models, a long short-term memory model, and two hybrid models where active learning and random sampling were combined with the long short-term memory model.

The logistic ordinal regression all-threshold model achieved the best mean squared error score of 1.255 on the test dataset, with the ordinal ridge regression model achieving the best mean absolute error of 0.788. These results suggest that the models in this research assignment can, on average, predict the article labels within a margin of less than one grade from the true label. The findings suggest that the models performed well in in-domain learning and are able to label news articles, but struggled with domain adaptation in cross-domain learning due to misalignment between the features and labels of the two datasets.

Automated Road Detection and Classification for Urban and Rural Areas Using Aerial Imagery

This research assignment presents an automated approach to digitise roads from aerial imagery using deep learning techniques, focusing on distinguishing between paved and gravel roads. This work addresses the need for efficient and accurate road mapping in geographical information systems, supporting applications in urban planning, autonomous driving, and infrastructure management.

The solution utilises a DeepLab model based on the EfficientNetV2M architecture to identify and extract roads from aerial images and perform road quality condition assessment on the extracted road. The DeepLab model developed achieved a mean Intersection over Union score of 0.87 and a mean F1 score of 0.91.
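For reference, the two reported segmentation metrics can be computed from a predicted and a ground-truth mask as sketched below. The example handles a single binary class with masks given as nested lists of 0/1 values (an assumption made to keep it dependency-free); the mean scores quoted above would average such per-class values.

```python
def iou_and_f1(predicted_mask, true_mask):
    """Return (IoU, F1) for one class from two binary masks of equal shape."""
    true_pos = false_pos = false_neg = 0
    for predicted_row, true_row in zip(predicted_mask, true_mask):
        for predicted, actual in zip(predicted_row, true_row):
            if predicted and actual:
                true_pos += 1
            elif predicted and not actual:
                false_pos += 1
            elif actual and not predicted:
                false_neg += 1
    union = true_pos + false_pos + false_neg
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    iou = true_pos / union
    f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
    return iou, f1


# Hypothetical 2x3 masks: 1 marks a road pixel.
predicted = [[1, 1, 0],
             [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
iou, f1 = iou_and_f1(predicted, truth)
```

For a single class the two metrics are linked by F1 = 2 * IoU / (1 + IoU), so F1 is never lower than IoU.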

After segmentation, the segmented masks are converted into polygons using image processing techniques. These are then compiled into geographical information system-compatible shapefiles with detailed attribute mapping for road type classification. The developed pipeline incorporates parallel processing and optimised contour detection algorithms to efficiently handle large datasets, along with error handling and logging mechanisms to maintain robustness.

This automated approach significantly reduces the manual effort required for road digitisation, offering a scalable solution for updating digital maps and enhancing geographical information system capabilities. This research assignment demonstrates the potential of deep learning in automating and improving the accuracy of spatial data extraction from aerial imagery, contributing to the fields of autonomous navigation and smart city infrastructure development.

Utilizing Unsupervised Machine Learning to Identify Patterns and Anomalies in the JSE Top 40 Equities

This study investigates using unsupervised machine learning to uncover hidden relationships and anomalies among the Johannesburg Stock Exchange (JSE) Top 40 equities. By transforming raw time-series data into informative metrics that include returns, volatility, average trading volume, and fundamental indicators such as earnings per share (EPS) and the price-to-earnings (P/E) ratio, the research aims to uncover patterns that can be used to inform investment management strategies. The data is analyzed as a snapshot, with the intention that this process can be continuously applied over different time frames to gain insights into opportunities and manage portfolio risk. This approach is not designed as a long-term buy-and-hold strategy but rather to spot changes in snapshot data.

Different clustering algorithms, namely K-Means, DBSCAN, and hierarchical clustering, were employed in combination with dimension reduction techniques such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). The models were evaluated using internal metrics, namely the Silhouette Score and Davies-Bouldin Index. Additionally, the JSE sector classifications served as an external ground truth for validation and to identify anomalies that could be leveraged for investment opportunities.
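The two internal metrics can be illustrated with a small standard-library sketch. It assumes Euclidean distance, at least two clusters, and a score of zero for points in singleton clusters; it is meant to show what the metrics measure rather than to reproduce the study's implementation.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points (higher is better, range -1 to 1)."""
    clusters = set(labels)
    scores = []
    for i, point in enumerate(points):
        own = [math.dist(point, other) for j, other in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        cohesion = sum(own) / len(own)
        # Mean distance to the nearest other cluster.
        separation = min(
            sum(math.dist(point, other) for j, other in enumerate(points)
                if labels[j] == cluster) / labels.count(cluster)
            for cluster in clusters if cluster != labels[i]
        )
        scores.append((separation - cohesion) / max(cohesion, separation))
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Mean worst-case ratio of within-cluster scatter to centroid distance (lower is better)."""
    clusters = sorted(set(labels))
    centroids, scatter = {}, {}
    for cluster in clusters:
        members = [p for p, label in zip(points, labels) if label == cluster]
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        centroids[cluster] = centroid
        scatter[cluster] = sum(math.dist(p, centroid) for p in members) / len(members)
    ratios = [
        max((scatter[a] + scatter[b]) / math.dist(centroids[a], centroids[b])
            for b in clusters if b != a)
        for a in clusters
    ]
    return sum(ratios) / len(ratios)


# Hypothetical 2-D embedding of four equities in two clusters.
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
cluster_labels = [0, 0, 1, 1]
sil = silhouette_score(embedding, cluster_labels)
dbi = davies_bouldin_index(embedding, cluster_labels)
```

The two metrics point in opposite directions: a good clustering has a silhouette score close to 1 and a Davies-Bouldin index close to 0.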

The results indicate that t-SNE combined with hierarchical clustering produced the most well-organised clusters, achieving a Silhouette Score of 0.5023 and a Davies-Bouldin Index of 0.5296. The analysis uncovered both expected sector groupings and notable anomalies, such as companies clustering outside their designated sectors due to similar financial characteristics. Shapley Additive Explanations (SHAP) analysis was used to provide insights into feature importance within clusters, enhancing the interpretability of the results.

In conclusion, the study demonstrates that unsupervised machine learning techniques are effective in detecting meaningful patterns and anomalies in stock market data. These insights offer practical implications for investment management by providing a data-driven approach to portfolio diversification and risk assessment. This research contributes to the financial literature by showcasing the utility of advanced clustering methods in the context of the South African equity market, which could guide future studies in emerging markets.

Advancing Distal Radius Fracture Classification Using Metric Learning: A Triplet Neural Network Approach

Recent advancements in computer vision and deep learning have enhanced distal radius fracture analysis, offering the potential to alleviate challenges in medical diagnostics in developing countries. This research investigates the application of metric learning architectures, particularly triplet neural networks, for the classification of distal radius fractures according to the AO/OTA fracture classification system. The study aims to address challenges associated with data scarcity and model generalisation while improving automated fracture detection and classification accuracy.

The research followed the cross-industry standard process for data mining (CRISP-DM), progressing through business understanding, data preparation, modelling, and evaluation phases. The GRAZPEDWRI-DX dataset was utilised as the source domain for transfer learning to perform fracture object detection on a small target distal radius dataset (DIRAD), alongside traditional data augmentation techniques to mitigate data limitations. The object detection model, based on YOLOv8, achieved a mean average precision (mAP) of 93.8% at 50% Intersection over Union (IoU) on the GRAZPEDWRI-DX dataset and 73.1% on the DIRAD dataset.

The feature extractor of a VGG19 convolutional neural network, alongside a custom embedding neural network, was employed as the foundation of the developed triplet neural network, which classified distal radius fractures according to the AO/OTA classification system. The triplet neural network incorporated the triplet margin loss function and a semi-hard triplet sampling strategy and was trained separately on posteroanterior (PA) and lateral radiograph projections.
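The loss and the sampling rule named above can be written down compactly. The sketch below uses plain Euclidean distance on embedding vectors and a margin of 1.0; both are assumptions for illustration, since the abstract does not give the study's margin or distance function.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def is_semi_hard(anchor, positive, negative, margin=1.0):
    """Semi-hard: the negative is farther than the positive, but still inside the margin."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return d_pos < d_neg < d_pos + margin


# Hypothetical 2-D embeddings.
anchor = (0.0, 0.0)
positive = (1.0, 0.0)        # same fracture class as the anchor
semi_hard_neg = (1.5, 0.0)   # different class, inside the margin
easy_neg = (3.0, 0.0)        # different class, already far away

loss_semi_hard = triplet_margin_loss(anchor, positive, semi_hard_neg)  # 0.5
loss_easy = triplet_margin_loss(anchor, positive, easy_neg)            # 0.0
```

Easy negatives contribute zero loss and therefore no gradient, which is why sampling strategies concentrate on triplets that still violate the margin.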

Despite the triplet neural network achieving high training F1-scores of up to 97% for the PA projection, the models exhibited limited generalisation, pressing the need for additional data or refined augmentation strategies. Independent evaluations of PA and lateral projection models revealed complementary strengths, which could be integrated into ensemble modelling strategies. The findings demonstrate the feasibility of triplet neural networks for distal radius fracture classification but emphasise the necessity of future work to address generalisation challenges, including generative adversarial networks for data synthesis and ensemble models for improved diagnostic accuracy.

Office Carbon Dioxide Level Prediction Using Model Confidence and Signal Resolution

Indoor air quality (IAQ) is considered to have major health and wellness implications, with indoor air pollution (IAP) estimated to have tenfold the negative impact on people relative to outdoor pollution. It is estimated that people spend an average of 80%-90% of their time indoors. It is therefore in the interest of overall population health to develop robust IAQ monitoring and control systems. This study focuses on the prediction of indoor CO2 concentrations using machine learning algorithms.

Indoor CO2 concentrations can change rapidly, and thus a real-time monitoring system is likely inadequate for maintaining healthy IAQ conditions. Wavelets can be used to filter signals, thereby removing noise and retaining essential information from the original signal, though the wrong filter at suboptimal levels could lead to signal distortion and information loss. To minimize this risk, dynamic signal resolution can be used in training. In this project, a method that implements various wavelets at varying decomposition levels is employed, using the outputs to train an ensemble of LSTMs and select the most confident models for prediction. For comparative analysis, a predictive model based on fixed signal resolution was also developed.

The dynamic signal resolution framework required nearly three times the execution time of the fixed resolution model, with no performance improvement. Instead, dynamic signal resolution resulted in limited prediction capability in areas of high CO2 concentrations, confirming the risk of information loss that comes with signal filtering. The fixed resolution model demonstrated superior performance, reporting an MAE, MAPE, RMSE and R2 of 1.02 ppm, 0.3%, 2.365 ppm and 0.99 respectively, while the dynamic signal resolution model's metrics deteriorated to 14.96 ppm, 2.7%, 27.48 ppm and 0.91.
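The four reported metrics follow their standard definitions, sketched below for a list of measured and predicted CO2 values. The sample readings are hypothetical and serve only to show the arithmetic.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MAPE (in percent), RMSE and R2 for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE assumes no actual value is zero, which holds for CO2 in ppm.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    sum_squared_error = sum(e * e for e in errors)
    rmse = math.sqrt(sum_squared_error / n)
    mean_actual = sum(actual) / n
    total_sum_squares = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - sum_squared_error / total_sum_squares
    return {"mae": mae, "mape": mape, "rmse": rmse, "r2": r2}


# Hypothetical CO2 readings in ppm.
measured = [400.0, 500.0, 600.0, 500.0]
forecast = [410.0, 490.0, 600.0, 520.0]
metrics = regression_metrics(measured, forecast)
# MAE = 10 ppm, MAPE = 2.125 %, RMSE = sqrt(150) ppm, R2 = 0.97
```

MAE and RMSE carry the unit of the signal (ppm), while MAPE is a percentage and R2 is dimensionless.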

Data-Driven Predictive Maintenance for Enhanced Reliability of Continuous Miners in Underground Coal Mining

The global coal mining industry faces increasing challenges due to deteriorating operational conditions and ageing machinery. As mining companies strive to optimise their processes, there is a significant opportunity to enhance maintenance strategies to ensure machines operate at the lowest possible cost. This research assignment explores the application of data science in analysing electrical data from continuous miners to identify anomalies and alert maintenance personnel of potential failures before they occur.

By employing both conventional machine learning and deep learning techniques, the aim is to determine the most effective approach for predictive maintenance. This study represents a pioneering effort in South Africa, focusing on the application of Markov chains for anomaly detection in the coal mining sector. By leveraging the Markov property and integrating it with the Mahalanobis distance, the research developed a robust framework that enhances anomaly identification. This dual approach not only enriches data science analytical capabilities but also introduces innovative perspectives in industrial maintenance, bridging traditional clustering techniques with advanced statistical methods and opening new avenues for enhanced anomaly detection.
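The two ingredients named above can each be sketched in a few lines. The abstract does not describe how the study combines them, so the example shows them separately: a first-order transition matrix estimated from a sequence of discretised machine states, and a Mahalanobis distance for two-dimensional readings. State names, the probability threshold, and the two-feature restriction are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate first-order Markov transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(successors.values()) for nxt, n in successors.items()}
        for current, successors in counts.items()
    }


def rare_transitions(states, probabilities, threshold=0.05):
    """Return indices whose incoming transition has probability below the threshold."""
    flagged = []
    for index, (current, following) in enumerate(zip(states, states[1:]), start=1):
        if probabilities.get(current, {}).get(following, 0.0) < threshold:
            flagged.append(index)
    return flagged


def mahalanobis_2d(point, samples):
    """Mahalanobis distance of a 2-D point from the sample mean and covariance."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / (n - 1)
    determinant = var_x * var_y - cov_xy ** 2
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # Quadratic form with the closed-form inverse of the 2x2 covariance matrix.
    squared = (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / determinant
    return math.sqrt(squared)


# Hypothetical discretised operating states of a continuous miner.
history = ["idle", "cut", "cut", "idle", "cut", "cut", "idle", "cut"]
model = transition_probabilities(history)
alerts = rare_transitions(["idle", "cut", "fault", "idle"], model)  # [2, 3]
```

The transition model captures order-dependent anomalies (an unusual sequence of states), while the Mahalanobis distance captures readings that are unusual given the correlation between features.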

Advancing the Argument for Shallow Models: A Comparative Analysis against Deep Learning Approaches

The rapid adoption of deep learning in various domains has led to an overreliance on complex architectures, often at the expense of simpler models that may be equally effective. This trend raises concerns about unnecessary computational costs, reduced interpretability, and increased carbon footprints, particularly in cases where shallow models could provide comparable results. This research assignment aims to evaluate the necessity of deep learning models by conducting a comparative analysis against shallow models, seeking to determine under what circumstances simpler models are preferable.

The research employs a mixed-methods approach, combining scientometric analysis, an extensive literature review, and selected case studies. The study critically assesses the performance of shallow versus deep models across various applications, focussing on criteria such as accuracy, computational efficiency, and scalability. The findings reveal that shallow models, when properly optimised, can achieve performance levels comparable to those of deep learning models in several contexts, while offering benefits in terms of lower computational demands and greater interpretability. The study concludes that deep learning is not always the best choice, advocating for a more thoughtful selection of models based on the specific needs of the application.

Data Science Approaches for Addressing Missing Values in the Transcriptome of Plasmodium falciparum

The development of new antimalarial drugs and vaccines relies heavily on understanding the genetics of Plasmodium falciparum. Transcriptomic data, a valuable resource for such insights, is often documented in 'omics datasets, but these datasets are often plagued by missing values, which significantly hinder downstream biological analysis. Accurate imputation of these missing values is imperative for the analysis of 'omics datasets and the discovery of novel antimalarial drugs.

This research assignment investigates missing value imputation techniques to identify a suitable method for accurate imputation of missing values in a transcriptomic dataset. Various approaches, including single imputation, multiple imputation, machine learning imputation, and deep learning imputation, are explored. Single imputation methods often fail to capture the complex relationships inherent in gene expression data, so advanced methods such as MICE, expectation maximisation, k-means, fuzzy c-means, KNN, self-organising maps (SOM), DBSCAN, feedforward neural networks, autoencoders, and generative adversarial imputation networks are investigated for their suitability. The selected imputation method is evaluated on datasets with varying percentages and mechanisms of missing data, assessed using RMSE and MAE, and its impact on downstream clustering is examined.

The SOM is selected as the imputation method. The imputation results consistently yield RMSE and MAE values lower than the standard deviation of the data, indicating the errors fall within the acceptable range given the natural variability of gene expression data. Subsequent k-means clustering performed on the imputed data showed that imputation did not affect the quality of the clusters, underscoring that SOM imputation adequately preserves the biological structure of the data.

An Automated Computer Vision System to Measure Excavator Productivity

This study develops an excavator productivity model to measure and optimise construction productivity using computer vision techniques, addressing inefficiencies in construction operations and improving excavator performance. The model analyses video input with computer vision, focusing on near-optimal real-time tracking techniques. Object detection algorithms, including YOLO and Faster R-CNN, were initially explored for accurate tracking of excavator movements on construction sites. Results indicated that YOLO offered superior generalisation and performance, yielding more accurate bounding box coordinates for tracking excavators.

The dataset developed included resized and labelled excavator videos, uniformly processed for consistent colour and format. Each video was divided into three-second intervals, annotated by activity. To measure productivity, a two-phase activity recognition model was developed. Initially, a VGG16 feature extractor combined with a simple LSTM model classified the excavator as static or moving, which achieved 100% accuracy in movement detection.

The second phase involved designing an advanced activity recognition model to classify specific excavator tasks, including soil pick-up, hauling, and drop-off, focusing on task duration analysis and process optimisation. With 300 to 400 labelled videos containing three-second activity segments, the accuracy of the model ranged between 80% and 100%, depending on the similarity of the test data to the training environment. Despite challenges such as lighting variations and insufficient data quality, the model demonstrates potential in tracking excavator activities, with future efforts aiming to expand the model to other machinery and enhance real-time performance.

2024

December 2024 Graduation

Active Learning in Bagging Ensembles

This study investigates the integration of dynamic pattern selection (DPS) and ensemble learning (EL) to enhance the performance of feed-forward neural networks trained with gradient descent backpropagation, particularly addressing the bias-variance dilemma while reducing computational complexity. DPS, introduced by Röbel (1994), is an active learning technique that incrementally adds patterns with the highest errors to the training data, aiming to achieve generalization results similar to those of standard backpropagation with less computational expense. Bagging-based EL combines the predictions of multiple models trained on resampled subsets of the original data to improve generalization performance, albeit with increased computational demands.
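The two building blocks can be sketched independently of any neural network library. The selection function below captures only the core idea of moving the highest-error candidate patterns into the training subset; when and how many patterns DPS adds is not described here, so `n_select` and the index-based bookkeeping are assumptions. The bagging helpers show bootstrap resampling and prediction averaging.

```python
import random


def grow_training_subset(train_idx, candidate_idx, errors, n_select=1):
    """Move the n_select candidates with the highest current error into the training subset.

    errors maps a pattern index to the current model error on that pattern.
    """
    ranked = sorted(candidate_idx, key=lambda i: errors[i], reverse=True)
    chosen = ranked[:n_select]
    remaining = [i for i in candidate_idx if i not in chosen]
    return train_idx + chosen, remaining


def bootstrap_sample(indices, rng):
    """Draw len(indices) pattern indices with replacement (one bag)."""
    return [rng.choice(indices) for _ in indices]


def bagged_prediction(member_predictions):
    """Average the outputs of the ensemble members for one input pattern."""
    return sum(member_predictions) / len(member_predictions)


# Hypothetical per-pattern errors from the current network.
current_errors = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7}
train, pool = grow_training_subset([], [0, 1, 2, 3], current_errors, n_select=2)
# train == [1, 3], pool == [0, 2]

bag = bootstrap_sample([0, 1, 2, 3], random.Random(0))
ensemble_output = bagged_prediction([0.2, 0.4, 0.6])
```

In a combined scheme, each ensemble member would run the selection step on its own bag, so the reduced training cost of active learning applies to every member.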

In this research, DPS and EL were applied independently and in combination to neural networks (NNs), evaluated on four classification problems and two regression problems. The experiments tested four scenarios: standard NNs, NNs with EL only, NNs with DPS only, and NNs combining both (referred to as EL AL NN). The results demonstrated that DPS achieved similar performance to standard backpropagation while reducing computational cost. Specifically, for the iris and hepatitis classification problems, DPS showed better generalization, possibly due to reduced overfitting.

EL improved generalization across all classification and regression problems, confirming its effectiveness despite higher computational complexity. When combining DPS with EL, the study found that for two of the four classification problems and both regression problems, the EL AL NN matched the generalization of EL while reducing computational complexity. However, for the iris and wine classification problems, the EL AL NN did not generalize as well, with a reduction in the generalization factor below one, suggesting overfitting as a possible cause.

Adaptive Machine Learning for the Optimization of a Water Treatment Clarification System

Digitalization is currently a major topic of discussion across industries, with the aim of providing descriptive, diagnostic, predictive and prescriptive feedback to all levels of business involvement. The water treatment industry is no exception; to ensure live responses to ever-changing feed water conditions, often under-utilized data sources can be incorporated into an intelligent system for optimal control.

The subject of this study was a clarification system employed as pre-treatment for the production of purified water from organically rich wastewater. The controls of the concerned system before the study were mostly static and linear in nature, with a large degree of human interaction required to reach a sub-optimal target of minimum overflow turbidity based on feed turbidity. It was therefore desired to develop a system that models the process and optimizes the overflow quality by adjusting the feed coagulant and flocculant dosages on a continuous basis as a prescriptive feedback system.

Features were selected based on expert knowledge, and R was used for data handling and analysis. Raw data was ingested from an MSSQL database and Excel files, assessed for quality, and combined into input features including dosages, tank level, turbidity, pH, temperature and flow rates across 4 parallel clarifiers. Of the 3 tree-based models tested, the random forest model was found to be optimal with a testing RMSE of 0.0761 units, compared to a median target of 0.22 ±0.09 units. An XGBoost model was then used to optimize the fitness function through grid search based on cost-benefit principles, yielding median relative improvements of 36.4 ±19.7% for overflow quality, 28.6 ±14.1% for coagulant dosage and 7.71 ±15.4% for flocculant dosage in simulation, verified through live testing as 49.1 ±32.3%, 28.6 ±6.34% and 8.52 ±10.4% respectively.

Online retraining and exploration were also tested in simulation. Retraining was triggered once predictive accuracy exceeded 0.1 units over a 1-day moving average, occurring on average once every 3.12 ±8.99 days. Exploration, performed by randomly selecting and mutating candidate solutions, reduced the existing linear correlation between dosages from 0.83 to 0.14 units, with a 50% increase in improvement variability, though only 29–62% of the exploited improvements were realised in practice.
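A retraining trigger of this kind can be sketched as a moving average over recent prediction errors. The example reads the 0.1-unit criterion as a threshold on the mean absolute prediction error, which is an interpretation rather than something the abstract states; the class name and the window length in samples are likewise hypothetical.

```python
from collections import deque


class RetrainTrigger:
    """Signal retraining when the moving-average absolute error exceeds a threshold."""

    def __init__(self, window, threshold=0.1):
        # window: number of samples covering the averaging period (e.g. one day).
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def update(self, actual, predicted):
        """Record one observation; return True if retraining should be triggered."""
        self.errors.append(abs(actual - predicted))
        window_full = len(self.errors) == self.errors.maxlen
        mean_error = sum(self.errors) / len(self.errors)
        return window_full and mean_error > self.threshold


# Hypothetical turbidity readings against model predictions, 3-sample window.
trigger = RetrainTrigger(window=3, threshold=0.1)
decisions = [
    trigger.update(0.20, 0.20),
    trigger.update(0.20, 0.25),
    trigger.update(0.20, 0.30),
    trigger.update(0.20, 0.50),
]
# decisions == [False, False, False, True]
```

Waiting for a full window before triggering avoids retraining on the first few noisy samples after a restart.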

Automated Screening of Chronic Sinusitis from Voice Recordings using Machine Learning

Chronic sinusitis is a common illness that affects millions of individuals worldwide. Currently, the screening of chronic sinusitis involves evaluating patient symptoms, conducting endoscopic examination, or using medical imaging methods such as computed tomography or magnetic resonance imaging. Symptoms-based diagnosis is often inaccurate, endoscopic examinations are invasive, and imaging is expensive and exposes patients to ionising radiation. Alternatively, chronic sinusitis can potentially be automatically screened for using voice recordings, but this has not been investigated before.

This research assignment proposes the use of machine learning to automatically distinguish between the speech of patients with chronic sinusitis and that of healthy individuals. The dataset comprises voice recordings of patients who underwent tonsillectomy, septoplasty, functional endoscopic sinus surgery, and minor unrelated surgeries. The data was down-sampled and denoised using a pre-emphasis filter, and unvoiced segments were removed. Audio features including Mel-frequency cepstral coefficients, spectral contrast, Mel-spectrogram, spectral centroid, flatness, bandwidth and roll-off were extracted, reduced in dimensionality with PCA, and used to train logistic regression, k-nearest neighbours, decision tree, XGBoost, random forest, support vector machine and deep neural network models.

The DNN model outperformed all others and was selected for further evaluation, achieving an accuracy of 0.67 ± 0.0089 on the train set and 0.63 ± 0.0089 on the test set — comparable with findings from other voice-based diagnosis studies. This demonstrates that chronic sinusitis can be detected from voice recordings using machine learning, though the moderate test accuracy suggests room for improvement through dataset size, preprocessing, feature selection, or model choice.

Predicting Patient Outcomes Based on Adverse Drug Events Using Graph Neural Networks

This research explores the application of graph neural networks (GNNs) in pharmacovigilance, particularly in predicting adverse drug events (ADEs) using data from the FDA adverse event reporting system (FAERS). The study begins with an in-depth analysis of the graph data model, representing complex relationships between patients, drugs, reactions, and outcomes. The GNN architecture, specifically a graph multi-layer perceptron (graph MLP), is configured and trained on this graph-structured data to enhance ADE prediction accuracy, with F1 score, precision, recall, and accuracy used to assess performance against baseline methods.

The results demonstrate that the GNN model, when properly configured, outperforms conventional approaches in several key metrics, offering deeper insights into drug safety and patient outcomes. The research highlights the potential of GNNs in improving clinical decision-making, strengthening regulatory frameworks, and advancing personalized medicine, while acknowledging limitations such as data quality and model interpretability. This study contributes to the growing body of knowledge on graph-based models in healthcare, showcasing their applicability in real-world pharmacovigilance practices.

March 2024 Graduation

Convolutional Neural Network Filter Selection Using Genetic Algorithms

Genetic algorithms and computer vision are two areas of machine learning that have shown great promise in solving complex problems. Convolutional neural networks (CNNs) are especially adept at computer vision tasks, but can consist of millions of parameters, mostly stored in filters, making model size a barrier to wider adoption. Filter selection and pruning assesses filters in a CNN and removes the least important ones to reduce model size. This project proposes using a genetic algorithm to optimise this process, allowing multiple filter selection methods to be applied concurrently and pruned adaptively.

When applying the proposed algorithm, 90.91% model compression was achieved at the cost of a 0.13 percentage-point accuracy drop for a network trained on audio data. On Fashion-MNIST, 91.37% compression was achieved with a 0.39 percentage-point accuracy drop. On CIFAR-10, 86.06% compression was achieved while accuracy actually increased by 2.37 percentage points.

These results show the utility of the algorithm and its ability to compress networks adaptively across different architectures and datasets. This study reveals that genetic algorithms can be applied successfully to prune filters from convolutional neural networks and provides the underpinnings for a comprehensive genetic algorithm capable of pruning filters from any given CNN architecture.

The Value of Zero-Rating Internet Services to Provide Essential Services to Low-Income Communities

This research assignment explored user interests and usage patterns on a zero-rated internet platform, MoyaApp, in South Africa to determine the value of zero-rating essential services in low-income communities. The study focused on how users interact with categories such as grants, education, jobs, and information services like weather and electricity, using temporal association rule mining and other statistical methods to analyze usage patterns.
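To make the idea of a temporal association rule concrete, the sketch below computes support and confidence for a rule of the form "a user who accesses category A later accesses category B". The function name and the toy usage histories are hypothetical; the abstract does not state which rule-mining algorithm or thresholds were used.

```python
def rule_metrics(histories, antecedent, consequent):
    """Support and confidence for 'antecedent is later followed by consequent'.

    histories: one time-ordered list of service categories per user.
    """
    n_users = len(histories)
    with_antecedent = 0
    with_both_in_order = 0
    for history in histories:
        if antecedent not in history:
            continue
        with_antecedent += 1
        first = history.index(antecedent)
        # The consequent must occur strictly after the first antecedent.
        if consequent in history[first + 1:]:
            with_both_in_order += 1
    support = with_both_in_order / n_users if n_users else 0.0
    confidence = with_both_in_order / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Made-up usage histories, for illustration only.
histories = [
    ["grants", "jobs", "weather"],
    ["grants", "education"],
    ["jobs", "grants"],
    ["weather"],
]
support, confidence = rule_metrics(histories, "grants", "jobs")
print(support, confidence)
```

In this toy data only the first user moves from grants to jobs in that order, so the rule has a support of one in four users and a confidence of one in three grant users.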

The findings revealed that many low-income users initially registered on MoyaApp to access grant services, then gradually explored other essential services over time and became regular platform users. The researcher proposed expanding the jobs category for varying education levels, targeting grant users with information services to encourage engagement, and using the study's findings to develop a recommendation engine for suggesting relevant services.

In conclusion, this research assignment demonstrated that providing zero-rated internet services, or reverse billing data, to low-income communities can be an effective strategy to enhance access to essential services and bridge the digital divide.

Intelli-Bone: Automated Fracture Detection and Classification in Radiographs Using Transfer Learning

Suspected fractures are one of the most common reasons for patients to visit the emergency department, where radiographs are often assessed by staff without specialised orthopaedic expertise, leading to a high rate of diagnostic errors. Given this problem, there is an opportunity to use AI to assist with fracture diagnosis. This research assignment uses the AO/OTA fracture classification system and evaluates Faster R-CNN, YOLOv8n, YOLOv8l, and RetinaNet object detection models for accurate fracture location and classification.

A secondary problem addressed is data scarcity: the target dataset (DIRAD) consists of only 776 images. Transfer learning was used to overcome this, pretraining models on larger datasets such as COCO and the GRAZPEDWRI-DX paediatric wrist X-ray dataset before training on the target dataset.

The research shows that pretraining on larger datasets leads to superior performance on scarce datasets, and pretraining on a dataset from a similar domain, such as GRAZPEDWRI-DX, leads to even better results — improving mean average precision at IoU 50 by an average of 33.6% compared to randomly initialised weights. The best performing model, YOLOv8l, achieved a mAP50 of 59.7% on the DIRAD dataset.
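For readers unfamiliar with the mAP50 metric, the sketch below shows the intersection-over-union (IoU) test that underlies it: a predicted box counts as a match only if its IoU with a ground-truth box is at least 0.5. The box format (x1, y1, x2, y2) and the function names are assumptions; a full mAP computation additionally matches classes, ranks predictions by confidence and averages precision over recall levels, none of which is shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, truth_box, threshold=0.5):
    """True if the predicted box overlaps the ground truth enough (IoU 50)."""
    return iou(pred_box, truth_box) >= threshold


truth = (0, 0, 10, 10)
print(is_match((2, 0, 12, 10), truth))  # overlap 80 / union 120
print(is_match((5, 0, 15, 10), truth))  # overlap 50 / union 150
```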

Evolutionary Multi-Objective Optimisation Algorithms for a Multi-Objective Truck and Drone Scheduling Problem

This research addresses the complexities of last-mile delivery, burdened by high costs, environmental concerns, and increasing consumer demand for quick service. By integrating drones with traditional truck delivery, this study explores optimizing delivery routes while reducing delivery times and operational costs, introducing a multi-objective travelling salesman problem with drone interception (TSPDi) that minimizes both delivery time and distance.

The non-dominated sorting genetic algorithm II (NSGA-II) and strength Pareto evolutionary algorithm 2 (SPEA2) were adapted for the TSPDi problem, with a custom population initialisation function, a heuristic mutation method, and a mechanism for ensuring unique solutions across parent and archive populations. Empirical results showed NSGA-II outperforming SPEA2 on larger datasets with many delivery nodes, while SPEA2 held a slight advantage on smaller datasets.

Further comparison against prior single-objective algorithms showed the new multi-objective approach performed similarly on smaller datasets (10–20 nodes) on the delivery time metric, but outperformed prior algorithms on larger datasets (50–500 nodes) for both delivery time and truck distance, as expected since single-objective algorithms were not designed to optimise both simultaneously.
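Both NSGA-II and SPEA2 rest on the notion of Pareto dominance between candidate solutions. The sketch below shows that building block for two minimised objectives (delivery time and truck distance); the function names and the candidate values are made up for illustration and are not taken from the study.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(solutions):
    """Return the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical (delivery_time, truck_distance) pairs for five routes.
candidates = [(10, 50), (12, 40), (11, 55), (15, 35), (15, 45)]
print(pareto_front(candidates))
```

Here (11, 55) is dominated by (10, 50) and (15, 45) by (12, 40), leaving three mutually non-dominated trade-offs between time and distance.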

Evolving Encapsulated Neural Network Blocks Using a Genetic Algorithm

The advent of deep learning has led to the ever-increasing size and complexity of neural networks. This project investigates the viability of using a genetic-based evolutionary algorithm to automate the discovery of reusable, modular subnetworks ("blocks") within CNNs for image classification, inspired by architectural elements in ResNet and GoogLeNet.

A framework to represent CNN architectures was developed, drawing on neuroevolution of augmenting topologies (NEAT), and a genetic algorithm using mutation, speciation and crossover was adapted to evolve a population of 100 CNN blocks over 30 generations, guided by a fitness function balancing complexity and performance. Five repetitions were compared against randomly generated blocks and manually designed blocks such as ResNet and Inception.

The genetic algorithm proved effective at producing highly optimal solutions compared to random procedures, and a small sample of the best evolved blocks proved highly competitive against manually designed counterparts. This study validates the concept of using evolutionary algorithms for neural network block generation, offering new avenues for neuroevolution and overcoming limitations in manual design processes.

Machine Learning for Aquaponic System Mortality Prediction and Planting Area Optimisation

Aquaponics is a sustainable farming method that combines aquaculture with hydroponics. Machine learning and the internet of things (IoT) can be used to improve the profitability and efficiency of aquaponic plants. This project proposes a machine learning-based IoT system for aquaponics that can predict fish mortality and optimize crop growing areas, collecting data on water quality, fish behaviour, and plant growth to train the underlying models.

The proposed system has the potential to improve the profitability and efficiency of aquaponic plants, which could lead to wider adoption of aquaponics as a sustainable farming method.

Spatio-Temporal Modelling of Road Traffic Fatalities in Western Cape

Road traffic accidents are a problem in South Africa. Responding to the World Health Organisation's Decade of Action for Road Safety, the Western Cape sought new techniques and initiated the application of data science and machine learning tools to act as a decision support system. This project seeks to develop a machine learning model capable of predicting, in time and space, the probability of a fatal road event. Relevant features are aggregated into an H3 grid, from which patterns in fatal events are learned.

Traditional machine learning techniques and deep learning techniques are used to learn the relationship between the aggregated features and fatal road events, with the aim of outperforming the historical average models currently used in industry. This is the first attempt at using machine learning techniques to model road traffic fatalities in South Africa and the Western Cape.

Using Tree-Based Machine Learning Models to Improve Upon the Least-Squares Method of Quantifying Mineralogy Using Bulk Chemical Compositional Data

Geometallurgy is an interdisciplinary science that utilises geological and metallurgical data to optimise ore-to-metal processing routes. Information about an ore body's chemistry and quantitative mineralogy is usually obtained through costly and time-consuming drill core logging. Element-to-mineral conversion (EMC) uses bulk rock compositional data to calculate mineral grade quantities, traditionally solved using a least-squares approach (LS-EMC), which is limited when there are more minerals than measurable elements.

This study investigated alternative data-science-based methods to LS-EMC. Three tree-based machine learning algorithms (Decision Tree, Random Forest, and Extra Trees) were trained to predict mineral grade quantities using positional and geochemical data from 135 observations sourced from the Kalahari Manganese Deposit. Their output was compared to LS-EMC estimates against quantitative X-ray diffraction (QXRD) measurements using the R² statistic.

The Extra Trees regressor correlated most strongly with QXRD measurements, achieving R² scores above 0.5 for six of the eight mineral groups, and outperformed the other tree-based models on ungrouped minerals. The results support the conclusion that tree-based machine learning algorithms can improve upon the shortcomings of LS-EMC.
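The comparison above hinges on an R² score between predicted and measured mineral grades. The sketch below assumes R² means the coefficient of determination (one minus the ratio of residual to total sum of squares); the abstract does not say whether that or the squared Pearson correlation was used, and the numbers are invented.

```python
def r_squared(measured, predicted):
    """Coefficient of determination of predictions against measurements."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Invented mineral grades (wt%) for illustration only.
qxrd = [1.0, 2.0, 3.0]
print(r_squared(qxrd, [1.0, 2.0, 3.0]))  # perfect agreement
print(r_squared(qxrd, [2.0, 2.0, 2.0]))  # no better than predicting the mean
```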

Optimisation Algorithms for a Dynamic Truck and Drone Scheduling Problem

With the increasing popularity of online shopping, the importance of last-mile delivery is growing, coming at a high cost to the retail industry and the environment. With advances in drone technology, combined truck-and-drone delivery strategies have become viable, and improving their routing and scheduling reduces the high cost of last-mile delivery.

This research assignment randomly changes customer node coordinates to simulate a dynamic environment, referring to the problem as the dynamic travelling salesperson problem with drone interception (DTSPDi). It solves the problem using the ant colony system (ACS), MAX-MIN ant system (MMAS), and a modified ACS (ACS-KT) that transfers pheromone knowledge between time slices, building on prior work on the static version of the problem. Using 30 datasets of different sizes and spatial patterns, ACS-KT outperformed both other algorithms in both time and distance, suggesting it is better suited to handling dynamic environmental changes for this problem.

Review of Big Data Clustering Methods

In an era defined by the challenges of processing vast and complex datasets, this study delves into the evolving landscape of big data clustering. It introduces a novel taxonomy categorizing clustering models into four distinct groups, offering a roadmap for understanding their scalability and efficiency in the face of increasing data volume and complexity.

The methodology is rooted in a series of experiments on chosen clustering methods, metrics, and datasets, followed by an extensive analysis breaking down the selected approaches into their algorithmic components to identify the origins of gains, losses, or trade-offs in performance.

Insights from this research highlighted the scalability and efficiency of models like parallel k-means and mini-batch k-means, both theoretically and empirically, marking them as exemplary for large-scale applications. Conversely, models like selective sampling-based scalable sparse subspace clustering (S5C) and purity-weighted consensus clustering (PWCC) showed limitations in scaling to big data. The project concludes with a comprehensive performance summary and lays the foundation for a centralized database for clustering research.

Clustering Free Text Procurement Data

Mining companies are increasingly leveraging advanced data analytics to inform process debottlenecking and improve throughput and operating costs. One company grapples with 50% of its group-wide procurement spend stored as unstructured text data, hindering in-depth cost analysis due to variations in describing the same items, which is impractical to aggregate manually given the volume of monthly records.

This research assignment explored techniques such as TF-IDF feature selection, LSA, and word embedding feature transformation, leveraging the company's procurement database. Comparing k-means and agglomerative hierarchical clustering (AHC), AHC performed better, yielding a high silhouette coefficient and passing validation by a domain expert. Clustering results were analysed in Power BI, leading to the conclusion that while traditional text clustering techniques are effective, modern feature selection and dimension reduction approaches are essential for optimal results.

Few-Shot Learning for Passive Acoustic Monitoring of Endangered Species

The Hainan gibbon is a primate from the Chinese island-province of Hainan whose population has declined due to poaching and which now faces extinction. Passive acoustic monitoring captures months of data, but experts spend significant time analysing and identifying bioacoustic signatures. Machine learning can automate this identification, but many algorithms require large amounts of data, which is scarce for endangered species. Few-shot learning aims to address this limitation.

This assignment explores image-based classification of audio converted to spectrograms under low data volumes, using a Siamese framework with contrastive-loss and triplet-loss architectures, data augmentation, transfer learning, and reduced image resolution.

The triplet-loss architecture produced the most accurate models, preferring lower-resolution images that reduce computation time and cost, and was not affected by low data volumes — unlike contrastive-loss models, which degraded significantly. The recommended "base CNN" triplet-loss network achieved an accuracy of 99.08% and an F1-score of 0.995, demonstrating a strong ability to identify the Hainan gibbon's bioacoustic signature.
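The two Siamese training objectives compared above can be written down compactly. The sketch below uses plain Euclidean distance on small embedding vectors; the margin of 1.0, the unsquared distance in the triplet loss and the absence of a 1/2 factor in the contrastive loss are assumptions, since several variants exist and the abstract does not specify which was used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative
    by at least `margin`; otherwise the size of the violation."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


def contrastive_loss(u, v, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs
    apart until they are at least `margin` away from each other."""
    d = euclidean(u, v)
    return d ** 2 if same_class else max(0.0, margin - d) ** 2


anchor = [0.0, 0.0]
print(triplet_loss(anchor, [0.0, 1.0], [3.0, 4.0]))  # well separated
print(triplet_loss(anchor, [3.0, 4.0], [0.0, 1.0]))  # violated
```

The triplet loss only constrains relative distances within each triplet, whereas the contrastive loss penalises each pair against an absolute target.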

Digitization of Test Pit Log Documents for Development of a Smart Digital Ground Investigation Companion

Geotechnical companies in South Africa have documented ground investigation observations from test pits in PDF format. Given recent technological advancements, there is a growing need to digitize these documents for thorough analysis. Manual digitization is laborious, time-consuming, and prone to errors, so this project explored an automated approach using object detection for document layout analysis and optical character recognition for extracting alphanumeric characters.

An object detection model was developed by fine-tuning a Faster R-CNN pre-trained model in the Detectron2 framework, using a blend of manually annotated and synthetically generated images. The best model (R-101) achieved mAR, mAP, and inference time of 74.3%, 71.0%, and 0.371 seconds/image respectively. Various OCR algorithms were then evaluated, with PaddleOCR outperforming others at a 96% word recognition rate, improved a further 1.2% through spelling correction.

An interactive application was developed for exploring the resulting dataset, including a word cloud visualisation and a semantic search feature fine-tuned using sentence transformers, achieving precision, recall and F1 scores of 68.3%, 65.7% and 67.0% respectively. Suggested further work includes deeper data analysis, improved spelling correction, collecting more documents, and training a question-and-answering model.

Comparison of Machine Learning Models on Financial Time Series Data

The efficient market hypothesis states that financial markets are efficient and investors cannot consistently make excess profits, though academia and investors have shown this does not always hold. This research assignment develops multiple machine learning models, combined with a financial trading strategy using technical indicators, to compare the performance of different algorithms on financial time series data.

Ten years of minute-level ticker data was collected for the USD/ZAR and ZAR/JPY exchange rates, the S&P 500 index, the FTSE 100 index, and the Brent crude oil index. After assessing data quality, a trading strategy combining the 20-period moving average, relative strength index, and average directional index was used for labelling. Twelve models were developed, including logistic regression, SVM, KNN, decision tree, random forest, several recurrent neural network variants (Elman, Jordan, Jordan-Elman), LSTM, time-delay neural network, and two feed-forward neural networks.

The support vector machine performed best overall, followed by the baseline logistic regression model, which outperformed all remaining machine learning models. Random forest and the resilient backpropagation feed-forward network performed third and fourth, with higher recall but lower accuracy than the top models. The recurrent neural network models performed poorly, with Elman and Jordan-Elman the weakest. Non-neural-network models were found to be less computationally complex and less dependent on balanced datasets.

Trends in Infrastructure Delivery from Media Reports

Investment in public infrastructure such as roads and electricity generally leads to economic growth, which in turn helps fight poverty and inequality — making it important to monitor infrastructure condition. The South African Institution of Civil Engineering (SAICE) publishes Infrastructure Report Cards (IRCs), but limited data availability for some sectors hampers their compilation. Online news articles are a promising alternative data source, and natural language processing can automate extracting information from a large volume of them.

In this research assignment, online news articles were collected from nine South African news websites, and topic modelling was applied to group articles about specific infrastructure issues (e.g. potholes, sewage spills) into topics. A large language model then generated a summary for each topic, and a dashboard was designed to visualise the topics and summaries for use by SAICE in identifying and monitoring infrastructure issues.

This research assignment concludes that it is feasible to apply topic modelling to South African news datasets to extract infrastructure-related topics and help address the lack of data in compiling SAICE IRCs, and that large language models can generate usable topic summaries, though these can be further improved.

Investigating Sales Forecasting in the Formal Liquor Market Using Deep Learning Techniques

This research assignment focuses on forecasting sales in the liquor industry, examining the effectiveness of deep learning techniques and a stacked ensemble approach. Time-series forecasting is a widely used technique in fields such as economics, finance, and operations research. A thorough analysis of datasets was conducted to understand their inherent structures, with various algorithms and evaluation metrics used to assess forecasting effectiveness.

The research assignment found that deep learning techniques and ensemble theory can successfully be applied to forecast sales in the liquor industry, with a stacked ensemble approach effective in improving overall performance. The findings have the potential to significantly improve current forecasting implementations while reducing computational complexity and expense, concluding that deep learning and ensemble models offer a promising, more time-efficient avenue for accurate sales forecasting compared to traditional methods.

Automated Localisation and Classification of Trauma Implants in Leg X-Rays Through Deep Learning

Revision surgery often requires orthopedic surgeons to pre-operatively identify failed implants to reduce the complexity and cost of surgery. Surgeons typically examine X-rays for this purpose, a method that is time-consuming and occasionally unsuccessful. This study investigates the use of deep learning to automate the identification of trauma implants in leg X-rays, assessing various object detection and classification models on a dataset of trauma implants, given challenges such as limited data, imbalanced classes, and multiple implants per image.

The results indicate that the optimal solution is a two-model pipeline employing a YOLO object detection model and a DenseNet classification model, where DenseNet classifies implants localised by YOLO. The pipeline achieves a mean average precision (IoU 0.5) of 0.967 for implant localisation and an accuracy of 73.7% for implant classification, providing proof that deep learning models are capable of identifying trauma implants and offering a solution for future related research.

Association Between the Features Used by a Convolutional Neural Network for Skin Cancer Diagnosis and the ABC-Criteria and 7-Point Skin Lesion Malignancy Checklist

Melanoma cases and the associated mortality rate are rising rapidly, making early detection crucial. Traditional dermatologist diagnosis methods are time-consuming and vulnerable to human error, and while convolutional neural networks (CNNs) show promise in improving accuracy, their lack of transparency prevents clinical application. For a CNN to be approved clinically, it must be shown that the features it uses correspond to established clinical indicators, namely the ABC-criteria and the 7-point skin lesion malignancy checklist.

This research assignment develops a methodology to evaluate whether the features used by a CNN correspond to these clinical checklists. A CNN model was developed, trained, and tested, with the association between checklist features and melanoma established using statistical methods, and the association between checklist features and CNN-extracted features determined using t-SNE and statistical tests. The importance of colour was evaluated using a grayscale dataset, and LIME was used to explain misclassifications and correct classifications.

The selected InceptionResNetV2 model with leaky ReLU activation showed a strong association between nearly all checklist features and melanoma diagnosis, and correspondingly between the CNN's extracted features and those same checklist features, with the exception of vascular structures and the colours brown, red and black. Reduced performance on the grayscale dataset confirmed colour as a feature the CNN uses to detect melanoma. The CNN was robust to general dataset issues but sensitive to hair and immersion fluid, suggesting a need for further preprocessing. The developed methodology successfully determined that the CNN uses ABC-criteria and 7-point checklist features to classify skin lesions.

2023

December 2023 Graduation

A Dynamic Optimisation Approach to Training Feed-Forward Neural Networks That Form Part of an Active Learning Paradigm

Active learning describes a paradigm of continually selecting the most informative patterns to train a model while training progresses. Literature indicates that the parameter search landscape of feed-forward neural networks (FFNNs) that form part of an active learning paradigm does not generalise to the parameter search landscape of FFNNs trained by a static training set, and is theorised to change while the search progresses.

This research assignment investigates the effect of changing the optimiser of an FFNN that forms part of an active learning paradigm from backpropagation to a dynamic optimisation algorithm. The cooperative quantum-behaved particle swarm optimisation (CQPSO) algorithm was implemented to train FFNNs that form part of two active learning paradigms, dynamic pattern selection (DPS) and sensitivity analysis selective learning (SASLA), across six datasets, with a novel hyperparameter tuning procedure used to ensure efficient optimiser performance for each problem set.

It was found that CQPSO located and tracked the global minimum more effectively than backpropagation on four of six problem sets under DPS, while backpropagation was more effective on four of six problem sets under SASLA. CQPSO's performance was found to depend on the dimensionality of the search space as well as the interdependence of the input training patterns.

Course Recommendation Based on Content Affinity with Browsing Behaviour

A recommender system filters and provides relevant content to a user based on factors such as their historic behaviour during interactions with a system. Physioplus is an online platform whose subscribers have specific educational needs and can benefit greatly from targeted recommendations, but its current search feature is limited to keywords, static recommendations, and elastic site search, without considering historic user visits.

This study builds a better course recommender for Physioplus, taking a user's recent Physiopedia browsing history and providing a tailored, rank-ordered list of the most relevant courses. The recommender is built using a collaborative filtering technique with item-based and user-based approaches, complemented by natural language processing and neighbourhood similarity methods.

Using a training and testing dataset from the real-world Physioplus system, and evaluating by comparing recommended versus completed courses, the results show a recall score of 76% and an accuracy rate of 53% in the offline experiment. Performance is expected to improve further once the system is integrated with the live Physioplus platform.
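The offline evaluation compares recommended courses against courses a user actually completed. A minimal sketch of recall over the top-k recommendations is given below; the function name, the cut-off k and the course identifiers are hypothetical, and the study's exact evaluation protocol is not described in the abstract.

```python
def recall_at_k(recommended, completed, k):
    """Fraction of completed courses found in the top-k recommendations.

    recommended: rank-ordered list of course identifiers.
    completed: courses the user actually completed.
    """
    completed_set = set(completed)
    if not completed_set:
        return 0.0
    hits = set(recommended[:k]) & completed_set
    return len(hits) / len(completed_set)


# Hypothetical course identifiers.
ranked = ["knee-rehab", "gait-analysis", "stroke-basics", "pain-science"]
done = ["gait-analysis", "pain-science", "ethics"]
print(recall_at_k(ranked, done, k=3))
print(recall_at_k(ranked, done, k=4))
```

Widening k from 3 to 4 picks up a second completed course, which raises recall from one third to two thirds.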

An Evolutionary Algorithm for the Vehicle Routing Problem with Drones with Interceptions

The use of trucks and drones to address last-mile delivery challenges is a promising research direction, with the variant where a drone can intercept the truck while moving or at the customer location forming the vehicle routing problem with drones with interception (VRPDi). This study proposes an evolutionary algorithm to solve the VRPDi, where multiple truck-and-drone pairs must be scheduled, leaving and returning to a depot together or separately, with the drone intercepting the truck after a delivery or meeting it at the next customer location.

The algorithm was executed on the travelling salesman problem with drones (TSPD) datasets by Bouman et al. (2015), benchmarked against the VRP results on the same dataset, showing improvements in total delivery time between 39% and 60%. Further analysis examined total delivery time, total distance, node delivery scheduling, and diversity during execution, benchmarking results against algorithms by Dillon et al. (2023) and Ernst (2024), the latter adding a maximum drone distance constraint.

The algorithm satisfactorily solved 50- and 100-node problems in reasonable time, with solutions better than those of Dillon et al. (2023) and Ernst (2024) for the same problems, though performance deteriorated considerably as the number of nodes increased, both in solution quality and computation time required.

Metaheuristics for Training Deep Neural Networks

Artificial neural networks (ANNs) are popular among researchers and in commercial settings, and the growing interest has led researchers to explore new ways to improve their performance, including the use of metaheuristics in training. This research assignment theoretically and empirically compares metaheuristics as an alternative to the traditional backpropagation with stochastic gradient descent (SGD) for training deep neural networks (DNNs), considering particle swarm optimisation (PSO), genetic algorithm (GA), and differential evolution (DE).

An in-depth analysis of SGD highlights potential disadvantages in the training process, motivating the exploration of metaheuristics as an alternative. Five experiments were conducted on an image dataset using a convolutional neural network (CNN) to empirically compare backpropagation SGD with the PSO, GA, and DE training algorithms.

The results show that SGD performs better than the metaheuristics considered in this study, and potential future work is discussed based on these findings.
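For orientation, the sketch below shows a basic global-best PSO, one of the three metaheuristics compared above, minimising a toy sphere function that stands in for a network's training loss. The inertia and acceleration coefficients, the swarm size, the search range and the sphere objective are all assumptions chosen for illustration; they are not the configuration used in the study, which trained a CNN.

```python
import random


def sphere(x):
    """Toy objective standing in for a training loss (minimum at 0)."""
    return sum(v * v for v in x)


def pso(objective, dim, n_particles=20, iterations=100,
        w=0.7, c1=1.4, c2=1.4, seed=0):
    """Global-best particle swarm optimisation (minimisation)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]             # personal best positions
    best_val = [objective(p) for p in pos]     # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < gbest_val:
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val


solution, loss = pso(sphere, dim=3)
print(solution, loss)
```

When training a network, each particle position would hold a full weight vector and the objective would be the loss on the training data, which is why the cost per iteration grows quickly with network size.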

Diversity Preservation for Decomposition Particle Swarm Optimization as Feed-Forward Neural Network Training Algorithm Under the Presence of Concept Drift

Time series forecasting is an important area of research, and dynamic particle swarm optimisation (PSO) algorithms have been shown to be a viable replacement for traditional backpropagation as a learning algorithm for feed-forward neural networks (FFNNs), outperforming simple recurrent neural networks in some cases. Cooperative PSO variants, such as the decomposition cooperative particle swarm optimisation algorithm, address credit assignment and variable dependency for larger problems. However, as particles converge, swarm diversity decays, making adaptation to dynamic environments difficult. Diversity preservation is therefore directly linked to the ability to adapt in the presence of concept drift.

This research project proposes diversity preservation techniques — random decomposition for dynamic decomposition cooperative PSO (DCPSO) and a diversity-based penalty function for regularization — tested on five nonstationary forecasting problems under various classes of dynamism, using both dynamic and static sub-swarm implementations of DCPSO.

The diversity-based penalty function showed superior performance on training and generalization error for dynamic DCPSO, though without a statistically significant effect on preserving swarm diversity itself. Random decomposition ranked highly across experiments when static PSO algorithms were used as sub-swarms, significantly impacting swarm diversity. Overall, the proposed techniques showed a trade-off between diversity preservation and performance for the dynamic DCPSO algorithms.

March 2023 Graduation

Adaptive Thresholding for Microplot Segmentation

Food security remains a global concern, with wheat making up a substantial share of food consumption and being particularly sensitive to rising temperatures from global warming. The Department of Genetics at Stellenbosch University runs a wheat pre-breeding programme monitoring hundreds of microplots per site using digital high-throughput phenotyping on drone-collected orthomosaic images, with microplot segmentation currently performed via a manually adjusted grid that is time-consuming and does not generalise well across collection iterations.

This research assignment developed the adaptive thresholding procedure (ATP), an automated microplot segmentation method using unsupervised learning that requires minimal user input and no prior knowledge of the microplot layout. The ATP was evaluated on thirteen orthomosaic images from four experimental sites and compared against two manual segmentation procedures using accuracy, intersection over union, and required user input as criteria.

The ATP outperformed the other two methods under favourable conditions, though it struggled to differentiate between vegetation, weeds, and non-vegetation in the presence of weeds. Despite this limitation, the ATP contributes an automated microplot segmentation method requiring minimal user input.
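The intersection over union criterion used in the evaluation can be illustrated with a short sketch. It assumes, purely for illustration, that a microplot is approximated by an axis-aligned rectangle `(x_min, y_min, x_max, y_max)`; the assignment may well have used pixel masks or polygons instead, and the `iou` helper is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Overlap extent along each axis; zero when the boxes do not overlap
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Predicted versus reference microplot, in arbitrary pixel units
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A score of 1 means the predicted and reference plots coincide exactly, and 0 means they do not overlap at all.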

Decision Support Guidelines for Selecting Modern Business Intelligence Platforms in Manufacturing to Support Business Decision Making

As data generation increases and global markets grow more competitive, the role of data analytics and business intelligence (BI) in manufacturing decision-making is significantly increasing, though the manufacturing industry often lags in digitisation and lacks the foundations needed to implement data tools. Selecting an appropriate BI tool is time-consuming and overwhelming given the wide variety of available software, each claiming distinctive, business-essential features.

This research assignment addresses the need for a useful approach to BI tool evaluation and selection, using a thematic analysis of semi-structured interviews with manufacturing professionals to gauge views on BI utilisation, data challenges, essential criteria, and selection approaches in practice.

The research revealed that BI plays a significant role in decision-making and task prioritisation in manufacturing, with respondents valuing different criteria and processes. The findings were used to propose guidelines that elucidate the dimensions to evaluate and provide a nine-step selection process to compare BI software.

An Investigation into the Automatic Behaviour Classification of the African Penguin

Climate change and biodiversity decline have renewed global focus on conservation, with digitization offering an opportunity to improve efforts such as animal behavioural studies, traditionally a manual endeavour requiring mounted sensors or continued human presence, often distorting the behaviour being studied. Modern computerised approaches, such as mounted video cameras, address these drawbacks in a less invasive manner.

This project investigates the applicability of deep learning to behaviour analysis in the endangered African penguin, aiming to develop a model for automatic behaviour classification as a foundation for improved passive monitoring and anomaly detection within a colony. Coordinates detailing animal movement are first extracted, then presented to a classifier, with three case studies considered: single penguins, two individuals, and three individuals.

The case evaluating three individuals on excitement versus normal behaviour achieved an AUC of 72.9%; the case evaluating two individuals on interaction versus no interaction achieved an AUC of 84.2%; and the case evaluating one individual across six behaviours (braying, flapping, preening, resting, standing, walking) achieved an AUC of 82.1%. These results provide a foundation for the design of a passive monitoring system to aid conservation efforts.

Set-Based Particle Swarm Optimization for Medoids-Based Clustering of Stationary and Non-Stationary Data

Data clustering groups similar instances together and is a highly studied field, with population-based algorithms such as particle swarm optimization (PSO) proven effective for it. Set-based particle swarm optimization (SBPSO) is a generic set-based PSO variant that substitutes vector-based mechanisms with set theory, and when applied to clustering, searches for an optimal set of medoids by optimising an internal cluster validation criterion.

This research assignment uses SBPSO to cluster fifteen datasets with diverse characteristics, tuning its hyperparameters for optimal performance and comparing it in depth against seven other tuned clustering algorithms, followed by a sensitivity analysis of the hyperparameters to determine their effect on swarm diversity and other measures.

SBPSO was found to be a viable clustering algorithm, ranking third among the algorithms evaluated, though less effective on datasets with more clusters. A significant trade-off between swarm diversity and clustering ability was discovered, and strategies to address these shortcomings were suggested.

An Extension of the CRISP-DM Framework to Incorporate Change Management to Improve the Adoption of Digital Projects

Digital transformation brings technologies like AI into core business operations, but is challenging to complete successfully — 45% of large digital projects run over budget, and only 44% ever achieve their predicted value, largely due to human factors such as difficulty accessing software and a lack of understanding of the technology.

This project compares five change management models to construct a generalised model, then identifies the change management gaps within the widely used CRISP-DM analytics framework, filling those gaps to construct an extended CRISP-DM framework. This extended framework is then validated against a real-world case study, where it identifies areas that, had they been addressed, would likely have improved the project's adoption.

All objectives of the research assignment were achieved, with the validation demonstrating the framework's potential to improve the success rate of digital projects and lower their risk of failure.

An Evaluation of State-of-the-Art Approaches to Short-Term Dynamic Forecasting

Order volume forecasting (OVF) is a strategic tool used by logistics companies to reduce operating costs and improve service delivery. While statistical models have historically been the standard, state-of-the-art (SOTA) approaches that leverage covariates to incorporate auxiliary information have shown promising results, which is critical for short-term forecasts that are inherently more stochastic than long-term ones.

This research paper compares a statistical forecasting approach to a SOTA approach for short-term order volume forecasting, developing an NBEATS model with various exogenous variables and comparing it to an Exponential Smoothing (ETS) model, both forecasting three hours ahead and evaluated using RMSE and MAE.

NBEATS provided a 36.01% improvement on RMSE and a 31.6% improvement on MAE over ETS. Comparing two variations of NBEATS — with and without covariates — showed that adding exogenous variables actually resulted in a 16.15% increase in RMSE and a 14.74% increase in MAE. The results suggest that SOTA approaches provide more consistent and accurate short-term forecasts overall.
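The error metrics and percentage improvements quoted above follow standard definitions, sketched below. The helper names and the toy numbers are illustrative assumptions and are not taken from the study's data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pct_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to a baseline (negative means worse)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Toy order volumes, for illustration only
actual = [1.0, 2.0, 3.0]
forecast = [1.0, 2.0, 5.0]
errors = (rmse(actual, forecast), mae(actual, forecast))
```

Because RMSE squares each error before averaging, it penalises large misses more heavily than MAE, which is why the two metrics can report different improvement percentages for the same pair of models.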

Cross-Camera Vehicle Tracking in an Industrial Plant Using Computer Vision and Deep Learning

Buy-back centres divert recyclable material away from landfills but face threats such as fraud, where the amount or grade of waste paper sold is misrepresented for greater income, affecting stock availability, sales volumes, and the sustainability of the recycling ecosystem. To facilitate fraud detection, this research assignment develops a multi-vehicle multi-camera tracking (MVMCT) framework to track vehicle movement throughout a South African paper buy-back centre.

The MVMCT framework helps estimate the amount of material expected at a loading bay prior to stocktaking, flagging suspicious vehicles for investigation when a large discrepancy is found. The Faster R-CNN and DeepSORT detector-tracker pair exhibited superior performance in terms of IDF1 scores, and a siamese network addresses vehicle re-identification across video sequences to manage global ID assignment.

The MVMCT framework achieved an IDF1 score of 0.58, multi-object tracking accuracy of 0.62, and multi-object tracking precision of 0.53, successfully tracking vehicles across all sequences except one with a top-down view, and showing reasonable accuracy counting stationary vehicles at a loading bay.

A Bagging Approach to Training Neural Networks Using Metaheuristics

Stochastic gradient descent is the go-to algorithm for training neural networks, but as networks and datasets grow larger, so does the computational cost. Metaheuristics have been used successfully to train neural networks and are more robust to noisy objective functions.

This research assignment investigates whether metaheuristics — genetic algorithms, differential evolution, evolutionary programming, and particle swarm optimisation — can train a neural network using only a subsample of the training set, proposing different bagging training approaches and evaluating their performance against SGD and metaheuristics trained on the entire dataset, using validation accuracy and generalisation factor to detect overfitting.

The results indicate a subsample of the training set can be used per iteration or generation with similar accuracy and similar or better overfitting performance than training on the complete set, with the best performance achieved using a bagging strategy with an equal sample size per class.

Link Prediction of Clients and Merchants in a Rewards Program Using Graph Neural Networks

Rewards programs help businesses increase client engagement and retention, with a host company acting as an intermediary connecting entities within the program, represented as a graph of interconnected entities. This investigation applies a graph neural network (GNN) technique to identify potential future relationships — a link prediction task — between clients and merchants in a bank's rewards program, using a GraphSAGE encoder and MLP decoder, with GraphSAGE selected for its inductive ability to generalise on unseen data.

A sensitivity analysis indicated the model is sensitive to the dropout and learning rate hyperparameters, reflecting the limited attributes and connections available. The fitted model, tested on unseen data, achieved a ROC AUC of 0.65, which is acceptable though a higher value is desirable. Precision-recall results highlighted the effects of the sparse network, with most correct predictions falling in the negative class.

Embedding visualisations revealed two distinct merchant groups and clusters of clients best represented in non-Euclidean space. Among correct positive predictions, female clients accounted for 99%, and the Homeware and Decor Store merchant type accounted for 100%. Overall, the GNN demonstrated the ability to learn representations and detect network topology in the rewards program, with opportunities identified to further enrich the graph.

Evaluating Active Learning Strategies to Reduce the Number of Labelled Medical Images Required to Train a CNN Classifier

CNNs have proven to provide human-comparable performance in computer vision, but rely heavily on large, labelled datasets, which are costly and time-consuming to produce. This study investigates how varied sizes of initially labelled medical images affect the effectiveness of CNN-based active learning, using a framework where data to be labelled is selected based on informativeness rather than randomly.

Two CNN architectures were run on a well-known chest X-ray pneumonia dataset from the Kaggle repository, using active learning based on uncertainty to measure informativeness, with eight simulations run on varying sizes of initial labelled training images and performance assessed using AUC-score metrics.

The simulations demonstrate how active learning can reduce the cost and time required for image labelling. Using DenseNet-121 with least confidence sampling reduced the number of labelled images required by 39% compared to the random sampling baseline.
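Least confidence sampling, the uncertainty strategy named above, can be sketched in a few lines. The function names and the toy probability vectors are assumptions made for illustration; in the study the probabilities would come from the CNN's softmax output on unlabelled images.

```python
def least_confidence(probs):
    """Uncertainty score: one minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_for_labelling(pool_probs, k):
    """Return indices of the k pool items the model is least confident about."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: least_confidence(pool_probs[i]),
                    reverse=True)
    return ranked[:k]


# Hypothetical predicted class probabilities for three unlabelled images
pool = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
to_label = select_for_labelling(pool, k=2)
```

The selected images are sent to an annotator, added to the labelled set, and the model is retrained before the next selection round.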

A Dynamic Optimization Approach to Active Learning in Neural Networks

Artificial neural networks are popular predictive models, and active learning aims to improve their performance through active selection of training instances, potentially also reducing training times. The training process is typically modelled as a static optimisation problem under fixed learning, but under an active training strategy where the training set continuously changes, it should instead be modelled as a dynamic optimisation problem.

This study investigates whether the performance of active learners can be improved using dynamic metaheuristics as learning algorithms, implementing a sensitivity analysis selective learning algorithm and an accelerated learning by active sample selection algorithm, comparing backpropagation, static particle swarm optimisation, and dynamic PSO variants across seven benchmark classification datasets from the UCI repository.

A dynamic metaheuristic used in an active learning setting improved the generalisation factor for three of the seven classification problems, though overall performance was similar across most configurations. The study concludes that dynamic metaheuristics cannot be said definitively to improve active learner performance, since improvements were not consistent across all problems and metrics.

Rule Extraction from Financial Time Series

The ability to predict future events is important across scientific fields, and data mining tools extract relationships among features to understand trends, producing rule sets usable for prediction. For many real-world applications, the shape of a time series can be as useful for prediction as its actual values, though rule induction and extraction techniques have historically had limited success with real-valued time series due to a lack of systematic effort to find relevant trends.

This study explores the benefits of rule extraction and rule induction specifically on financial time series data, reviewing existing approaches before developing and evaluating a rule extraction and induction framework.

The most important finding was the value of balanced data: models performed significantly better when class imbalance was minimised, while differences in predictive performance between the various rule extraction and induction algorithms were not statistically significant.

Binning Continuous-Valued Features Using Meta-Heuristics

Discretization, a widely used data preprocessing step, partitions continuous-valued features into bins, improving a dataset's interpretability and enabling the use of machine learning models that require discrete input data. This report proposes a new discretization algorithm that partitions multivariate classification problems into bins using swarm intelligence, applying particle swarm optimization to find optimal bin boundary values for each continuous-valued feature.

The classification accuracy of the naïve Bayes, C4.5 decision tree, and one-rule classifiers, resulting from the proposed discretizer, is compared against equal width binning, equal frequency binning, and the evolutionary cut-points selection for discretization algorithm on datasets with mixed data types.

The proposed discretizer was outperformed by the evolutionary cut-points selection algorithm when paired with the C4.5 decision tree classifier, and similarly outperformed by the equal-width binning discretizer when paired with the same classifier.
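The two baseline discretizers mentioned above, equal width and equal frequency binning, can be sketched as follows. The helper names are hypothetical, and the quantile rule used for the equal-frequency cut points is one simple convention among several.

```python
import bisect


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def assign_bin(value, edges):
    """Index of the bin containing value; a value equal to an edge goes to the upper bin."""
    return bisect.bisect_right(edges, value)


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]        # one outlier skews the range
width_edges = equal_width_edges(values, 3)     # cut points at 34.0 and 67.0
freq_edges = equal_frequency_edges(values, 3)  # cut points at 4 and 7
```

With the outlier present, equal width binning places eight of the nine values in the first bin, whereas equal frequency binning places three values in each. A PSO-based discretizer would instead treat the cut points themselves as the decision variables being optimised.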

A Genetic Algorithm Approach to Tree Bucking Using Mechanical Harvester Data

Bucking — crosscutting trees into timber logs — produces logs whose value depends on length and small-end diameter, and maximising the value of logs bucked from a tree can be viewed as an optimisation problem, historically solved with dynamic programming. This research assignment solves the problem using a genetic algorithm, asking whether existing bucking on a series of forest stands could have been done more optimally, using data from two mechanical harvesters.

The genetic algorithm outperformed the existing bucking in terms of value. Comparing against dynamic programming on a randomly selected set of trees, the genetic algorithm obtained very similar optimal bucking values. Its hyperparameters — population size, crossover probability, and mutation probability — were estimated using a particle swarm optimisation algorithm wrapped around the genetic algorithm, using another randomly selected set of trees.

The hyperparameters found were used to optimise the total value of each of five stands, and the total value of the optimised stands outperformed the existing bucking by a large margin.
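The dynamic programming baseline referred to above can be sketched as a recursion over cut positions along the stem. This is a simplified illustration under stated assumptions: the stem is discretised into integer length units, waste is only allowed at the top of the tree, and the taper and price table in `toy_log_value` are invented for the example rather than taken from the harvester data.

```python
def best_bucking(stem_length, log_lengths, log_value):
    """Return (maximum total value, list of log lengths cut from the butt upward)."""
    # best[p] is the best value obtainable from the stem section starting at position p
    best = [0.0] * (stem_length + 1)
    choice = [None] * (stem_length + 1)
    for p in range(stem_length - 1, -1, -1):
        for length in log_lengths:
            if p + length <= stem_length:
                value = log_value(p, length) + best[p + length]
                if value > best[p]:
                    best[p], choice[p] = value, length
    # Walk forward through the recorded choices to recover the cutting pattern
    cuts, p = [], 0
    while p < stem_length and choice[p] is not None:
        cuts.append(choice[p])
        p += choice[p]
    return best[0], cuts


def toy_log_value(start, length):
    """Hypothetical price: value per unit length depends on small-end diameter."""
    small_end_diameter = 40 - 2 * (start + length)  # assumed linear taper
    if small_end_diameter >= 26:
        price = 10
    elif small_end_diameter >= 18:
        price = 6
    else:
        price = 1
    return price * length


total_value, pattern = best_bucking(12, [3, 5], toy_log_value)
```

A genetic algorithm would search over cutting patterns such as `pattern` directly, which makes it easier to add constraints that break the optimal-substructure property this recursion relies on.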

Crop Recommendation System for Precision Farming: Malawi Use Case

Machine learning has scaled rapidly across industries including agriculture, with precision farming introducing ML-powered decision support systems that assist farmers with data-driven recommendations. These technologies have not yet been widely adopted in Malawi due to infrastructure and policy barriers, though recent policy changes and new data centres are drawing agricultural stakeholders toward ICT-based solutions.

This project created a crop recommendation system forecasting the best crop for farmland based on physical, chemical, and meteorological parameters, using unlabelled data from Malawian government departments, formatted three different ways and clustered using K-means into five centroids, labelled by an expert agronomist as conducive to maize, cassava, rice, beans, or sugarcane. Ten classifier algorithms were trained on the three formatted datasets and assessed via 5-fold cross-validation on F1 score and accuracy.

The third formatting technique (one-hot/label encoding, normalisation, and PCA) proved most conducive overall, with K-Nearest Neighbours outperforming other models at 99% F1 and accuracy, fast training speed, and a simple structure. The KNN model was integrated into a test web application as a proof of concept, though further development is required for real-time implementation.

Financial Time Series Modelling Using Gramian Angular Summation Fields

Gramian angular summation fields (GASF) and Markov transition fields (MTF) encode time series into images, enabling computer vision techniques to be applied to time series classification and imputation. This research assignment applies GASF and MTF to financial time series, first collecting and analysing a suitable real-world dataset, addressing data quality issues, then encoding the cleaned series into images and validating the mapping between series and image planes.

The financial time series' characteristics are used to guide the formulation of a modelling problem comparing GASF and MTF against conventional time series modelling and analysis techniques, considering four models spanning time series and image modelling approaches.

The results indicate that time series approaches are better suited to this specific modelling problem, though GASF and MTF do provide promising outcomes when used in combination — allowing a model to learn better features when combined with sequence-based approaches and improving model performance.
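The GASF encoding described above can be sketched in plain Python. The series is rescaled to [-1, 1], each value is mapped to an angle via the arccosine, and the image entry for a pair of time steps is the cosine of the summed angles. The `gasf` helper name and the toy input are assumptions for illustration.

```python
import math


def gasf(series):
    """Gramian angular summation field of a 1-D series, as a list of rows."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    # Rescale to [-1, 1] so that arccos is defined; clamp to guard against rounding
    scaled = [max(-1.0, min(1.0, (2 * x - hi - lo) / (hi - lo))) for x in series]
    angles = [math.acos(s) for s in scaled]
    # Entry (i, j) encodes the pairwise relation between time steps i and j
    return [[math.cos(a + b) for b in angles] for a in angles]


image = gasf([0.0, 1.0, 2.0])
```

The resulting matrix is symmetric, and its main diagonal carries the rescaled series itself (as cos 2φ), which is why the original values can be approximately recovered from the image.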

Machine Learning-Based Nitrogen Fertilizer Guidelines for Canola in Conservation Agriculture Systems

Soil degradation is a major problem facing South African agriculture, drawing significant policy attention. This research assignment uses machine learning to predict the amount of nitrogen to add to canola to achieve an approximate optimal yield, displayed as a fertiliser recommendation table for farmers, using algorithms including random forest regressor, extra trees regressor, artificial and deep neural networks, k-nearest neighbour, multiple linear regression, and multivariate adaptive regression splines.

Early detection of yield-limiting factors can aid productivity and profit, with yield prediction critical to crop management and economic decisions. The random forest regressor proved most accurate in forecasting yield, demonstrating that machine learning could potentially forecast canola production based on characteristics such as average rainfall, plantation year, residual soil nitrogen from the previous harvest, and monthly rainfall from planting to harvest.

The Use of Historical Tracking Data to Estimate or Predict Vehicle Travel Speeds

York Timbers, an integrated forestry company, maintains an expansive road network of 26,661 segments totalling approximately 10,000 km. To optimise timber delivery from plantations to mill sites, the travel speed of each road segment must be estimated, starting with matching GPS measurements to the self-owned road network based on Euclidean distance, then correcting connectivity errors introduced during map-matching, and calculating average travel speed per segment from matched GPS data.

Since the majority of road segments lack GPS measurements, five different predictive models were developed to estimate their travel speed. The best performance was achieved using a regression tree, reaching a mean absolute error of 10.02 km/h on unseen data.

To further improve accuracy, the study suggests increasing the amount of GPS data used, incorporating other influential data such as weather, and identifying dangerous portions of the road network before implementing a model in production.
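The distance-based map-matching step can be sketched as assigning each GPS point to the road segment with the smallest point-to-segment Euclidean distance. The sketch assumes planar (projected) coordinates and straight segments given as endpoint pairs; the segment identifiers and function names are hypothetical.

```python
import math


def point_segment_distance(point, seg_start, seg_end):
    """Euclidean distance from a point to a straight line segment."""
    px, py = point
    ax, ay = seg_start
    bx, by = seg_end
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: both endpoints coincide
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment's line, clamped to the segment's extent
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_to_segment(point, segments):
    """Return the id of the segment closest to the point."""
    return min(segments, key=lambda seg_id: point_segment_distance(point, *segments[seg_id]))


# Two hypothetical parallel road segments
roads = {"A": ((0, 0), (10, 0)), "B": ((0, 5), (10, 5))}
matched = match_to_segment((3, 1), roads)
```

Matching each point independently in this way can assign consecutive points to segments that are not connected, which is the kind of connectivity error the study corrects after map-matching.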

A Review and Analysis of Imputation Approaches

Missing data is a common challenge that affects the accuracy of any decision-making process, addressed during data cleaning through deletion or imputation. This research assignment investigates the performance of various statistical imputation methods (mean, hot deck, regression, maximum likelihood, Markov chain Monte Carlo, multiple imputation by chained equations, expectation-maximization with bootstrapping) and machine learning methods (k-nearest neighbor, k-means, self-organizing maps), using an empirical procedure across two experiments — one on clean datasets, one on datasets with outliers — evaluated using RMSE, MAE, percent bias, and predictive accuracy.

Across both experiments, Markov chain Monte Carlo (MCMC) imputation achieved the best overall performance with 75.71% accuracy, followed by kNN imputation at 69.85% accuracy, though kNN introduced a large percent bias into the imputed dataset.

This research concludes that single statistical imputation methods (mean, hot deck, regression) should not be used to replace missing data in any situation, while multiple imputation methods show consistent performance, with MCMC in particular offering high accuracy, low bias, and ease of use.

Crawler Detection Decision Support: A Neural Network with Particle Swarm Optimisation Approach

Website crawlers, first introduced in the early nineties, are used by search engines to collect information about other websites, and can be categorised as good (harmless) or bad (malicious), with bad crawlers potentially crashing websites or inflating traffic indicators when misidentified. Distinguishing human users from good and bad crawler sessions is therefore important for website traffic classification.

This research assignment designs and implements artificial neural networks, trained with particle swarm optimisers, to classify website traffic as human, good crawler, or bad crawler. The problem is first treated as a stationary classification problem assuming constant behavioural characteristics, then as a non-stationary problem exhibiting concept drift, solved using quantum-inspired particle swarm optimisation to account for crawlers changing behaviour over time.

The results demonstrate that the particle-swarm-optimised artificial neural networks can classify website traffic with reasonable success in both stationary and non-stationary environments.

A Comparative Study of Different Single-Objective Metaheuristics for Hyper-Parameter Optimisation of Machine Learning Algorithms

Machine learning has evolved into a technology with widespread commercial success, with deep learning's rise driving renewed interest in hyper-parameter optimisation, since derivative-based methods are generally unavailable for the black-box objective functions involved. While hyper-parameter optimisation is conventionally performed manually by domain experts, modern compute resources make algorithmic approaches such as grid search, random search, and Bayesian optimisation increasingly favourable.

This research assignment investigates metaheuristics — genetic algorithms, particle swarm optimisation, and estimation of distribution algorithms — as an alternative to traditional hyper-parameter optimisation techniques, comparing their efficiency on support vector machines, multi-layer perceptrons, and convolutional neural networks across a constructed test suite of datasets and algorithms.

Friedman omnibus tests were used to determine whether a difference in average rank existed across the techniques, with Nemenyi post hoc tests identifying pairwise differences upon rejection of the null hypothesis, alongside investigation of other solution quality metrics such as computational expenditure.

Predicting Employee Burnout Using Machine Learning Techniques

While artificial intelligence techniques are well understood in industries like life insurance or banking, applying them to human capital management has met with varying success, including pitfalls around managing inherent bias in the data and the ethical use of model outputs. Models assisting recruitment or predicting employee attrition have nonetheless been successfully implemented by many organisations.

This research assignment applies multiple classification models and machine learning algorithms to identify employees at risk of burnout, aiming to produce outputs that could ethically and proactively guide wellbeing-related interventions across a business.

The results show that none of the approaches met this objective: an artificial neural network was the most accurate of the models implemented, but no approach exceeded 50% accuracy.

2022

2022 Graduation

Comparison of Machine Learning Models for the Classification of Fluorescent Microscopy Images

Long COVID, the lasting health consequences of a COVID-19 infection, can bring severe and debilitating symptoms such as fatigue and brain fog. These are caused by microclots that form in the bloodstream, entangle with proteins, and limit oxygen exchange. Diagnosing and identifying individuals suffering from Long COVID is the first step toward alleviating or curing their symptoms, but current identification processes are manual and limited by available manpower.

This research assignment investigates whether machine learning algorithms can classify fluorescent microscopy images as indicative of Long COVID or not, training models on features extracted from the images using computer vision techniques, and comparing the performance of different algorithms.

It was found that logistic regression is a good choice as a classifier, showing strong performance in classifying both the positive and negative classes.

Anomaly Detection in Support of Predictive Maintenance of Coal Mills Using Supervised Machine Learning Techniques

As technology has advanced through successive industrial revolutions, so has the need to maintain machinery, with Industry 4.0 shifting the focus from preventive maintenance on a fixed schedule to predictive maintenance performed only when necessary. This research assignment studies predictive maintenance of coal mills through supervised machine learning, using coal mill data from a case study company.

The assignment identifies and addresses data quality issues in the dataset, prepares the data for machine learning, and builds a model aimed at predicting when failure is most likely to occur, evaluating the feasibility of building such a model with the given data and methodology.

The assignment draws conclusions from these findings and identifies opportunities for future research in predictive maintenance for coal mills.

Comparison of Unsupervised Machine Learning Models for Identification of Financial Time Series Regimes and Regime Changes

Financial markets move through periods of rising value (bull markets) and falling value (bear markets), broadly referred to as regimes, which can extend beyond simple bull-and-bear classifications to any sequence of data exhibiting correlated trends. Detecting when these regimes shift can be of great value to investors, helping improve investment decisions and strengthen portfolios.

This research reviews and compares the viability of different regime shift detection algorithms applied to multivariate financial time series data, using stocks from the Johannesburg Stock Exchange (JSE) to compare the algorithms' performance in terms of regime shift detection accuracy and the profitability of the regimes identified within selected investment strategies.

Detection of Chronic Kidney Disease Using Machine Learning Algorithms

Chronic Kidney Disease (CKD) affects roughly one in ten people globally, and often occurs alongside other chronic illnesses such as diabetes, heart disease, and hypertension, which can hinder its successful and early detection. CKD is clinically detected using laboratory tests such as the Glomerular Filtration Rate and albumin-creatinine ratio, with early detection critical to slowing disease progression and reducing complications.

In developing countries, especially in Africa, CKD prevalence is estimated at 3–4 times that of developed regions, and access to treatment in South Africa is skewed toward those with private healthcare, while roughly 84% of South Africans rely on under-resourced public health systems. This study reviews, develops, and recommends various machine learning classification models for the efficient detection of CKD, using two UCI Machine Learning Repository datasets and a PLOS ONE dataset on CKD in patients at high cardiovascular risk in the United Arab Emirates.

The final aim is to construct a high-performing machine learning model that effectively and accurately learns the hidden correlations in the symptoms exhibited by CKD patients.

Feature Engineering Approaches for Financial Time Series Forecasting Using Machine Learning

This research assignment investigates feature engineering methods for financial time series forecasting, aiming to overcome the noise and non-stationarity that make financial series difficult to forecast. A literature review identified suitable feature engineering and machine learning approaches, which were then tested via a case study comparing forecasting results with and without the identified techniques.

Methods investigated include differencing and log-transforms to address non-stationarity, and moving averages, exponentially weighted moving averages, Fourier transforms, and wavelet transforms to reduce noise. These were applied as preprocessing steps before training linear regression, support vector regression (SVR), multilayer perceptron (MLP), and LSTM neural network models to forecast the asset price a single day ahead from the previous ten days of prices, across four univariate time series.
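As a rough illustration of the preprocessing described above, the sketch below builds log returns (a log-transform followed by differencing), an exponentially weighted moving average, and ten-day input windows paired with the next day's value. The function names, the smoothing factor, and the toy price series are assumptions made for illustration and are not taken from the research assignment.

```python
import math


def log_returns(prices):
    """Log-transform then first-difference: r_t = ln(p_t) - ln(p_{t-1})."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]


def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def make_windows(series, lookback=10):
    """Pair each run of `lookback` values with the value that follows it."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return X, y


# Hypothetical toy price series; real data would come from a market feed.
prices = [100.0 + i for i in range(15)]

# Smooth first, then window. Whether the target should also be smoothed
# is a design choice that the abstract does not specify.
X, y = make_windows(ewma(prices, alpha=0.5), lookback=10)
print(len(X), len(X[0]))
```

Each row of `X` holds ten consecutive values and the matching entry of `y` is the value on the following day, which is the supervised layout a regression model would be trained on.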

No feature engineering method was found to be universally helpful. Denoising or smoothing improved performance for the SVR, MLP, and LSTM models, though the best technique varied by dataset, while differencing and log-transforms caused models to forecast values near the mean return, giving poor regression metrics but good directional accuracy. The study concludes that gains from feature engineering on price data alone are limited, and recommends future work explore alternative data sources with more predictive power.

Forecasting Armed Conflict Using Long Short-Term Memory Recurrent Neural Networks

Recent studies point to an optimistic future for data-driven social conflict forecasting, arriving alongside the big-data revolution, with models of interest to governments, NGOs, humanitarian agencies, and insurers seeking to reduce the severity of events or intervene before they escalate. This mini-dissertation applies LSTM recurrent neural network modelling to forecast armed conflict events in Afghanistan, using world news data from the Global Database of Events, Language, and Tone (GDELT) and georeferenced event data from the Uppsala Conflict Data Program (UCDP).

The results show that GDELT data can improve conventional baseline forecasting models to an extent, by incorporating actor and event attributes unique to the conflict, and that news media data can be consolidated with actual recorded deaths in the forecasting model, grounding predictions in real-world outcomes.

Comparison of Machine Learning Models on Different Financial Time Series

Although the efficient market hypothesis suggests market predictions are not consistently profitable, evidence against it has motivated substantial research into the behaviour of financial markets, with recent AI advancements offering new opportunities for forecasting models. This dissertation investigates the ability of different machine learning models to forecast future percentage change for the S&P 500 index, the US 10-year bond yield, the USD/ZAR currency pair, gold futures, and Bitcoin, using closing price data only.

Models investigated include linear regression, ARIMA, support vector regression, multilayer perceptron, recurrent neural network, LSTM, and gated recurrent unit, evaluated using mean square error and a directional accuracy metric, under both single out-of-sample and walk-forward validation techniques.
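The evaluation set-up described above can be sketched as follows, assuming an expanding training window for walk-forward validation and a sign-match definition of directional accuracy. Both are common conventions rather than details confirmed by the dissertation, and the function names are hypothetical.

```python
def walk_forward_splits(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) with an expanding training window."""
    start = initial_train
    while start + test_size <= n_samples:
        yield list(range(0, start)), list(range(start, start + test_size))
        start += test_size


def mean_squared_error(actual, predicted):
    """Average of the squared forecast errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def directional_accuracy(actual, predicted):
    """Share of forecasts whose sign matches the sign of the actual change.

    A change of exactly zero is treated as non-positive (an assumption).
    """
    hits = sum(1 for a, p in zip(actual, predicted) if (a > 0) == (p > 0))
    return hits / len(actual)


# Ten observations, four used for the first fit, two forecast per step.
for train_idx, test_idx in walk_forward_splits(10, initial_train=4, test_size=2):
    print(len(train_idx), test_idx)
```

Single out-of-sample validation corresponds to taking only one such split, with the model fitted once and evaluated on the final block.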

Linear regression, as the most parsimonious model, performed best overall when each validation technique was analysed individually. Walk-forward validation performed best in terms of MSE for the S&P 500 and US 10-year bond yield, with SVR achieving 52.94% directional accuracy on the S&P 500 and MLP achieving 51.26% on the bond yield. Single out-of-sample validation performed best for the USD/ZAR pair, gold futures, and Bitcoin, with MLP achieving 51.77% and 53.51% accuracy on the currency pair and gold futures respectively, and linear regression achieving 55.04% on Bitcoin.

Proximal Methods for Seedling Detection and Height Assessment Using RGB Photogrammetry and Machine Learning

Growing demand for planted forests has increased demand for nurseries cultivating well-suited seedlings, but nursery operators face laborious manual assessments based on statistical sampling of only a small percentage of stock. This study proposes a framework for proximal detection and height assessment of seedlings, using smartphone-captured RGB imagery to produce digital surface models and orthomosaic images via photogrammetry.

A RetinaNet object detection model, pre-trained on drone-derived RGB imagery, was retrained via transfer learning on a single seedling tray of 98 seedlings using orthomosaics from the photogrammetry process. Two approaches for sampling seedling height from the digital surface model were evaluated, along with several regression algorithms to refine the sampled height, with the ensemble-based AdaBoost regression algorithm achieving the best performance.

The proposed pipeline detected 98.97% of seedlings at an intersection over union of 76.93%, with only one missed instance, and achieved a final height-refinement RMSE of 17.26 mm against test data. This performance is sufficient to enable improved understanding of stock quantities and growth stage without manual intervention.
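For readers unfamiliar with the intersection over union (IoU) metric quoted above, the following minimal sketch computes it for two axis-aligned bounding boxes. The box layout `(x_min, y_min, x_max, y_max)` and the example coordinates are assumptions for illustration only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and annotated seedling boxes.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means the predicted and annotated boxes coincide exactly, while 0 means they do not overlap at all.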

Automated Tree Position Detection and Height Estimation from RGB Aerial Imagery Using a Combination of a Local-Maxima Based Algorithm, Deep Learning and Traditional Machine Learning Approaches

Forest mensuration is central to determining the biomass and fiscal value of forest plantations, but terrestrial measurement of tree attributes is laborious, and remote sensing via drones has made forest mensuration more accessible and affordable than airborne laser scanning. This study, focused on a KwaZulu-Natal stand of 4,968 Eucalyptus dunnii trees, used a local maxima algorithm as a baseline for detecting tree crown apexes and estimating heights, then built an ensemble of machine learning models to improve on it.

The hybrid approach integrates object detection (a RetinaNet model from the DeepForest package, further trained via transfer learning on hand-annotated trees), a support vector machine to filter misclassified tree positions, and a multi-layer perceptron to correct bias in height estimates sampled from the canopy height model.

The improvements were notable: tree position mean absolute error improved by 15.68% (from 0.3515 m to 0.2964 m), tree height RMSE improved by 25.30% (from 0.6435 m to 0.4807 m), and R² for height increased by 15.22% (from 0.6662 to 0.7676). While the proportion of trees detected dropped slightly by 3.33%, the number of dead or invalid tree positions detected decreased substantially, indicating a meaningful improvement in the quality of tree positions identified.
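The percentage improvements quoted above are relative changes with respect to the baseline value. The short check below reproduces them from the before and after figures in the text; the function name is hypothetical.

```python
def relative_change(before, after):
    """Absolute change expressed as a percentage of the baseline value."""
    return abs(after - before) / before * 100


# Figures taken from the paragraph above.
print(round(relative_change(0.3515, 0.2964), 2))  # tree position MAE (m)
print(round(relative_change(0.6435, 0.4807), 2))  # tree height RMSE (m)
print(round(relative_change(0.6662, 0.7676), 2))  # R² for height
```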

Fantasy Premier League Decision Support: A Meta-Learner Approach

In the Fantasy Premier League, managers construct dream-teams of English Premier League players, aiming to maximise points accumulated over a 38-gameweek season while navigating strict team-formulation and transfer constraints. The dream-team formulation problem can be split into an initial team-selection sub-problem and a subsequent player-transfer sub-problem, both expressible as systems of linear equations that, given expected player performance, can be solved via linear programming.
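To make the team-selection sub-problem concrete, the sketch below picks the squad with the highest expected points subject to a budget and per-position quotas. It uses exhaustive search over a tiny hypothetical player pool so that it stays dependency-free; it is not a linear programming solver, and the players, costs, quotas, and budget are invented for illustration rather than taken from the actual game rules or the project.

```python
from itertools import combinations

# Hypothetical players: (name, position, cost, expected_points).
PLAYERS = [
    ("A", "GK", 5.0, 4.0),
    ("B", "GK", 4.5, 3.5),
    ("C", "DEF", 6.0, 5.0),
    ("D", "DEF", 4.5, 3.0),
    ("E", "FWD", 9.0, 7.5),
    ("F", "FWD", 7.0, 6.0),
]
QUOTA = {"GK": 1, "DEF": 1, "FWD": 1}  # assumed toy quotas
BUDGET = 18.0                          # assumed toy budget


def best_team(players, quota, budget):
    """Return (team, points) maximising expected points within the constraints."""
    size = sum(quota.values())
    best, best_points = None, float("-inf")
    for team in combinations(players, size):
        # Count how many players were picked per position.
        counts = {}
        for _, pos, _, _ in team:
            counts[pos] = counts.get(pos, 0) + 1
        if counts != quota:
            continue  # position quotas not met
        if sum(p[2] for p in team) > budget:
            continue  # over budget
        points = sum(p[3] for p in team)
        if points > best_points:
            best, best_points = team, points
    return best, best_points


team, points = best_team(PLAYERS, QUOTA, BUDGET)
print([p[0] for p in team], points)
```

Exhaustive search grows combinatorially with the size of the player pool, which is why a full-sized problem of this shape is normally handed to an optimisation solver instead.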

This project designs and implements machine learning algorithms to forecast expected player points for upcoming fixtures, feeding a decision support system that suggests an initial dream-team and subsequent player transfers. Five algorithms, each from a distinct family (linear regression, kernel-based, neural network, decision tree ensemble, and nearest-neighbour), were considered, along with a stacked meta-learner combining their predictions.

A case study on the 2020/21 Fantasy Premier League season validated the quality of the suggested transfers, with the resulting decision support system performing favourably: its suggested transfers would have placed in the top 5.98% of eight million real-world managers that season.

2021

2021 Graduation

Requirements for 3D Stock Assessment of Timber on Landings and Terminals

This project addresses the unreliability of stock assessment systems in the timber supply chain, which leads to inaccurate volume estimations for log piles. The system developed needed to satisfy the practical constraints of the supply chain, including a low-tech data capturing process suited to the vast rural areas the timber supply chain covers, while producing frequent and accurate results.

The method identified was terrestrial structure from motion (SfM), using a consumer-grade camera or smartphone to generate point clouds, which were supplemented with Unity-generated data to increase the volume of data available. A K-means clustering algorithm, using neighbourhood statistics from feature engineering alongside the original point cloud features, was developed to distinguish log pile from terrain within the point cloud. Once the log pile had been extracted, an alpha shape generated from the point cloud was used to predict the final log pile volume.

The results show that the developed methodology predicts volumes at a level of accuracy acceptable for the intended use case, providing evidence for the benefit of computer vision in performing accurate stock assessments in the timber supply chain, while acknowledging that further work is needed to improve accuracy and implement the system.

A Predictive Model for Precision Tree Measurements Using Applied Machine Learning

Accurately determining biological asset values is important for forestry enterprises, yet currently only 5–20% of forest areas are enumerated as a representative sample for an entire compartment, with timber volume and growth projections based on these limited, error-prone statistics. Diameter at breast height (DBH) is the most common measurement used, traditionally captured via laser scanning, an accurate but expensive and non-scalable approach.

This thesis employs monocular depth estimation techniques to extract tree data features from video recordings captured on an ordinary smartphone, working with the South African Forestry Company SOC Limited (SAFCOL) to access a suitable plantation. Fieldwork collected standardised "ground truth" DBH measurements, and video files were processed to extract tree segment patterns, which were then used to train and test various machine learning models.

The models achieved a relative root mean squared error between 14.1% and 18.3%, with a relative bias between 0.08% and 1.13%, indicating a consistent but imperfect prediction result. The resulting spatial representation of tree coordinates closely resembled the ground truth data upon visual inspection, suggesting the proposed computer vision and machine learning workflow can produce DBH estimations that approximate real-world values with a fair degree of accuracy.
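The two error measures quoted above can be sketched as follows, assuming the common convention of dividing the RMSE and the mean error by the mean of the observed values. The thesis may define them slightly differently, and the DBH values below are hypothetical.

```python
import math


def relative_rmse(observed, predicted):
    """RMSE expressed as a percentage of the mean observed value."""
    n = len(observed)
    mean_obs = sum(observed) / n
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / mean_obs * 100


def relative_bias(observed, predicted):
    """Mean signed error expressed as a percentage of the mean observed value."""
    n = len(observed)
    mean_obs = sum(observed) / n
    bias = sum(p - o for o, p in zip(observed, predicted)) / n
    return bias / mean_obs * 100


# Hypothetical DBH measurements and predictions in centimetres.
observed = [20.0, 20.0]
predicted = [22.0, 18.0]
print(relative_rmse(observed, predicted), relative_bias(observed, predicted))
```

A small relative bias alongside a larger relative RMSE, as reported above, indicates errors that largely cancel out on average but remain noticeable for individual trees.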