Current Projects
- biplotEZ
- Biplots for Industry
- EEG Simulations
- Applying biplotEZ
- Generalised Singular Value Decomposition for Three-Way Data
- Fault Diagnosis in Multivariate Statistical Process Monitoring
- Biplots for linguistic patterns
- Biplots for Missing Data
- Biplots for Text and Sentiment Visualisation
- Exploding Biplots R Package
- The correspondence analysis of ordered categorical variables
- The role of symmetric and asymmetric association in correspondence analysis
- Strategies for dealing with overdispersion in contingency tables when performing correspondence analysis
- On the construction of biplots for the visualisation of ordered categorical variables
- The impact of power transformations to reciprocal averaging, canonical correlation analysis and correspondence analysis
- Correspondence analysis and the Cressie-Read family of divergence statistics
- Explainable Fraud Detection Models
- Log-ratio Analysis and Compositional Biplots of Milk Fatty Acids
- moveEZ
- Biplots for Climate studies
Sugnet Lubbe, Johané Nienkemper-Swanepoel, Niël le Roux, Raeesa Ganey, Ruan Buys, Zoë-Mae Adams, Peter Manefeldt
Biplots have proved to be valuable visualisation tools in exploratory data analysis. To date, the use of biplots in interdisciplinary applications has been limited, since current implementation tools are restricted to expert users. The biplotEZ R package was published on CRAN as a user-friendly package to enable and empower practitioners and researchers of varying skills to apply biplots more widely in many disciplines, especially in the current era of big data. Currently the package makes provision for Principal Component Analysis biplots, Canonical Variate Analysis biplots, Correspondence Analysis biplots and Regression biplots (with either linear regression axes or spline-based axes). Work is continuing on Multiple Correspondence Analysis biplots, Analysis of Distance biplots and more.
Niël le Roux, Roelof Coetzer (NWU), Ruan Rossouw (SASOL)
A multivariate reactor performance index (RPI) is developed for complex Multivariate Process Monitoring. The newly proposed RPI integrates subject-matter knowledge with a data driven approach for real time performance monitoring. A new approach to process deviation monitoring on many variables is presented based on the confidence value (α) at a specified -value. This methodology is proposed as a general data driven performance index as it is objective, and very little prior knowledge of the system is required. A performance index visualized on an appropriate and interactive graph is invaluable in the monitoring of multiple similar production processes, as it makes it easy to visually identify production processes not performing as expected.
Niël le Roux, Pieter Schoonees (Erasmus University)
This research focuses on methods for assessing the similarity of brain responses within and across subjects (individuals). Typically, the data come from fMRI or EEG studies and concern spatiotemporal measures of brain activity while the subject is exposed to some stimulus. A particular focus of these studies is on naturalistic stimuli, which typically means video content such as television and films. This is an important departure from traditional neuroimaging studies, where subjects perform simple tasks multiple times in a highly controlled setting. fMRI offers high spatial resolution by dividing the brain into many voxels, but this comes at the cost of lower temporal resolution, as it takes roughly two seconds to complete a single scan of the brain. In contrast, EEG trades spatial resolution for high temporal resolution. In EEG, a limited number of electrodes (e.g., 64) are placed on the scalp to measure activity, but by sacrificing spatial resolution in this way measurements can be made many times per second (typically 256 or 512 times). Our focus is on the statistical analysis of EEG data. To this end we developed an extensive R-based statistical simulation model for generating EEG data under a wide variety of controlled conditions. This allows us to evaluate statistical procedures currently in use in the analysis of EEG data.
Sugnet Lubbe, Johané Nienkemper-Swanepoel, Niël le Roux, Raeesa Ganey
This project builds upon the biplotEZ project. The focus of this project is the application of multi-dimensional data visualisations in a broad range of fields. Collaborative projects in the following application fields have been considered: archaeology, agricultural sciences, chemometrics, industrial applications, finance, sensory profiling, microbiology and wood science. Experience has shown that a collaborative project with ongoing support from MuViSU contributes to deeper insights and interpretation in the application. The knowledge-flow goes both ways: application necessitates new theoretical developments; new theoretical developments lead to better understanding.
Sugnet Lubbe, Raeesa Ganey (WITS)
Where the singular value decomposition decomposes a single matrix into three components, the generalised singular value decomposition decomposes two matrices simultaneously, with a single matrix of right singular vectors. This could be useful to visually represent more than one data matrix simultaneously. A paper is being finalised for submission.
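As a sketch of the idea, one common textbook formulation of the generalised singular value decomposition is given below. The notation is an assumption made for illustration and is not taken from the paper being finalised. For two matrices A (m × p) and B (n × p) that share the same p columns:

```latex
A = U \, C \, X^{\top}, \qquad
B = V \, S \, X^{\top}, \qquad
C^{\top} C + S^{\top} S = I_p
```

Here U and V have orthonormal columns, C and S are diagonal with non-negative entries, and the nonsingular matrix X is shared by both decompositions. It is this shared X that would allow both data matrices to be represented against a common set of axes.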
Sugnet Lubbe, Roelof Coetzer (NWU)
While Prof Coetzer was at SASOL, he and Prof Lubbe co-supervised Dr André Mostert's PhD at UCT. Dr Mostert sadly passed away from COVID-19 in June 2021. Prof Coetzer and Prof Lubbe plan to publish at least two papers from his thesis. Prof Coetzer is organising a session, “Methodologies for process monitoring and fault detection in complex industrial processes”, at the International Conference of Computational Methods in Science and Engineering (ICCMSE). Since the conference is in hybrid format, Prof Lubbe will present a paper on this work remotely.
Raeesa Ganey, Johané Nienkemper-Swanepoel
This project was inspired by The Economist article, "What is the world's loveliest language?", and aims to show how a variety of multivariate visualisations, specifically biplots, enhance the interpretation of complex interactions between linguistic features and aesthetic perceptions.
Johané Nienkemper-Swanepoel, Mokgeseng Ramaisa
A paper by Johané, currently under review, provides guidance to users on choosing appropriate imputation strategies based on the underlying data characteristics.
Mokgeseng is completing his Master's degree under the supervision of Dr Nienkemper-Swanepoel. He has developed methodology to extend GPAbin biplots from categorical data to incomplete continuous data using principal component analysis biplots. GPAbin biplots combine the visualisations obtained from multiple imputations into a single unified display.
Zoë-Mae Adams, Johané Nienkemper-Swanepoel
The overarching aim of this dissertation is to gain insight from text data by summarising the content and visualising the results of sentiment classification. This can be achieved by developing suitable visualisation tools for the optimal representation of sentiment classification.
The following research objectives will be investigated in pursuit of the overarching aim:
- Review of sentiment visualisation literature
- Application of an adaptive sentiment lexicon to improve sentiment classification accuracy
- Enhancement of the interactive EW-MCA biplot tool
- The inspection of the ordinal nature of sentiment classification categories
- The visualisation of topic modelling results
Ruan Buys
Ruan Buys published the R package bipl5 on CRAN as part of his Master's research. The package provides for reactive biplots rendered in HTML. The traditional biplot view is enhanced by automated translation of the axes and by superimposing interclass kernel densities on the axes. Work is continuing on integrating bipl5 with biplotEZ.
Eric Beh and Rosaria Lombardo
For the past 25 years or so we have published extensively on the role of orthogonal polynomials in a range of issues concerned with correspondence analysis. These issues include the construction and interpretation of low-dimensional visual depictions of the association, as well as the partition of popular measures of association, and correlation and association models. This is because orthogonal polynomials provide an excellent, simple and flexible means of incorporating the structure of ordinal categorical variables: all they require is an a priori chosen set of initial scores to reflect the ordinal structure of a variable and a three-term recurrence formula to generate the polynomials. Orthogonal polynomials also enable one to determine “generalised correlations”, which include as special cases the traditional linear-by-linear correlation coefficient (that everyone should be familiar with) and sources of non-linear association that may exist between the ordinal variables. Alternative approaches involve scaling the categories such that the resulting scores (obtained from reciprocal averaging or by other means) are “forced” to be ordered. Unfortunately, such approaches only consider ordered scores along a single dimension, and the resulting visual representation of the association may not properly reflect the nature of the association.
This ongoing project examines the impact of orthogonal polynomials on the structure of the association between two or more categorical variables. Methods of three-way and higher-way decomposition using orthogonal polynomials are very much linked to the Tucker3 decomposition and, more generally, to the suite of decomposition methods that are now part of higher-order singular value decomposition (HOSVD). This project also examines the impact on the interpretation of visual summaries of the association obtained by performing correspondence analysis, where the traditional correspondence plot or biplot may be constructed.
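The construction of orthogonal polynomials from a priori scores can be sketched in a few lines. The sketch below is an illustration only: it uses weighted Gram-Schmidt orthogonalisation of the monomials rather than the three-term recurrence formula mentioned above (the two give the same polynomials up to sign), and the function name, scores and weights are hypothetical.

```python
import math


def orthogonal_polynomials(scores, weights, degree):
    """Return polys[k][j]: the degree-k orthonormal polynomial at scores[j].

    Orthonormality is with respect to `weights` (for example marginal
    proportions that sum to 1), so that
    sum_j weights[j] * polys[u][j] * polys[v][j] is 1 if u == v, else 0.
    `degree` must be smaller than the number of distinct scores.
    """
    def inner(f, g):
        return sum(w * a * b for w, a, b in zip(weights, f, g))

    polys = []
    for k in range(degree + 1):
        # Start from the monomial s^k and remove its projections on the
        # polynomials of lower degree (modified Gram-Schmidt).
        p = [s ** k for s in scores]
        for q in polys:
            c = inner(p, q)
            p = [a - c * b for a, b in zip(p, q)]
        norm = math.sqrt(inner(p, p))
        polys.append([a / norm for a in p])
    return polys


# Hypothetical example: four ordered categories with natural scores 1..4.
scores = [1, 2, 3, 4]
weights = [0.1, 0.4, 0.3, 0.2]  # illustrative marginal proportions
polys = orthogonal_polynomials(scores, weights, degree=2)
```

With weights that sum to one, `polys[0]` is the constant 1, `polys[1]` is the standardised score (the linear component) and `polys[2]` is the quadratic component.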
Eric Beh and Rosaria Lombardo
Typically, for the analysis of a two-way contingency table, the association between the variables is assumed to be structured such that they are both predictor variables. This is because such a structure allows for Pearson's chi-squared statistic to be used to assess the statistical significance of the association. However, there are times when (for practical reasons) it is more reasonable to treat one variable as the predictor variable and the second variable as the response variable. Such an asymmetric association structure can be formally assessed using the Goodman-Kruskal tau index and visually assessed using non-symmetrical correspondence analysis (NSCA). Many of the features of NSCA remain the same as in the traditional “symmetric” approach that has the Pearson chi-squared statistic at its foundations, with the interpretation of a correspondence plot, or biplot, being slightly different owing solely to the asymmetric structure of the variables. This ongoing project examines the features of NSCA, in particular for nominal and ordinal variables, as well as variations of correspondence analysis that expand the technique to the analysis of the association between multiple categorical variables.
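To make the asymmetry concrete, here is a minimal sketch of the Goodman-Kruskal tau index. It assumes, purely as a convention for this example, that the rows hold the response variable and the columns the predictor, and that no column total is zero. The function name and the table are hypothetical.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau for predicting the row category from the column.

    `table` is a list of rows of counts. The index is 0 when the column
    gives no help in predicting the row, and 1 when it predicts it perfectly.
    """
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    row_m = [sum(row) for row in p]        # response (row) margins
    col_m = [sum(col) for col in zip(*p)]  # predictor (column) margins
    numerator = sum(
        col_m[j] * (p[i][j] / col_m[j] - row_m[i]) ** 2
        for i in range(len(row_m))
        for j in range(len(col_m))
    )
    denominator = 1.0 - sum(r ** 2 for r in row_m)
    return numerator / denominator


# Hypothetical table: the column determines the row exactly,
# but the row does not determine the column.
table = [[10, 10, 0],
         [0, 0, 20]]
transposed = [list(col) for col in zip(*table)]

tau_row_given_col = goodman_kruskal_tau(table)
tau_col_given_row = goodman_kruskal_tau(transposed)
```

Swapping the roles of the two variables changes the value of the index, which is the sense in which the association is asymmetric. Pearson's chi-squared statistic, by contrast, is unchanged when the table is transposed.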
Eric Beh and Rosaria Lombardo
When a correspondence analysis is applied to a two-way contingency table, it is performed by first decomposing a matrix of standardised residuals using singular value decomposition. The advantage of doing this is that the sum-of-squares of these residuals, and of the squared singular values, is equivalent to Pearson's classic chi-squared statistic. Such residuals, which are treated as being asymptotically normally distributed, arise by assuming that the cell frequencies of the contingency table are Poisson random variables; doing so means that their expectation and variance are equal. However, there is clear evidence in the statistics literature that suggests that the variance of these residuals exceeds their expectation. Thus, we observe overdispersion in the table. This project therefore investigates various strategies for dealing with overdispersion, including assuming that the cell counts are from a generalised Poisson, Conway-Maxwell Poisson or negative binomial distribution. Variance stabilising strategies can also be included, such as considering the adjusted standardised residual and the Freeman-Tukey residual. As part of this project, adopting such strategies means that one needs to examine their impact on how to quantify the overall association between the variables, and on the interpretation of the low-dimensional visual display that can be generated. Extensions to examining this issue for multiple categorical variables are also under consideration.
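The residuals mentioned above can be sketched as follows. This is an illustration under stated assumptions: the residuals are written on the count scale, so their sum of squares is Pearson's statistic itself (on the proportion scale it is the statistic divided by the sample size); the Freeman-Tukey residual is shown in one common form and may not be the variant used in this project. Function names and the table are hypothetical, and the singular value decomposition step is omitted.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """Standardised (Pearson) residuals (o - e) / sqrt(e)."""
    exp = expected_counts(table)
    return [[(o - e) / math.sqrt(e) for o, e in zip(orow, erow)]
            for orow, erow in zip(table, exp)]


def adjusted_residuals(table):
    """Adjusted standardised residuals, rescaled to unit asymptotic variance."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    exp = expected_counts(table)
    return [[(table[i][j] - exp[i][j])
             / math.sqrt(exp[i][j] * (1 - row_p[i]) * (1 - col_p[j]))
             for j in range(len(col_p))]
            for i in range(len(row_p))]


def freeman_tukey_residuals(table):
    """Freeman-Tukey residuals sqrt(o) + sqrt(o + 1) - sqrt(4e + 1)."""
    exp = expected_counts(table)
    return [[math.sqrt(o) + math.sqrt(o + 1) - math.sqrt(4 * e + 1)
             for o, e in zip(orow, erow)]
            for orow, erow in zip(table, exp)]


table = [[10, 20], [30, 40]]  # hypothetical counts
chi_squared = sum(r ** 2 for row in pearson_residuals(table) for r in row)
```

Because the sum of the squared singular values of any matrix equals the sum of its squared entries, decomposing the matrix of Pearson residuals partitions this same chi-squared value across the dimensions of the display.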
Eric Beh and Rosaria Lombardo
For more than 20 years, variants of correspondence analysis have been developed that accommodate the structure of ordinal categorical variables using orthogonal polynomials. When the visual display from this analysis is the biplot, projections linking the origin to the standard coordinate of each category are a common feature. When a column variable, say, consists of ordered categories, the biplot can be constructed so that their standard coordinates are determined using orthogonal polynomials, which require a set of a priori scores that reflect the ordered structure of the categories. When the first two polynomials are used to construct the biplot, they produce a configuration of standard coordinates that appears to be parabolic in shape. This project explores the exact nature of this parabolic relationship and examines the various features of this configuration of points. In particular, simple formulae can be derived to determine the focus, vertex, intercepts and directrix of this relationship. Since the use of orthogonal polynomials requires choosing a priori scores to reflect the ordinal nature of the categories of a variable, this project also explores the impact of different scores on these features. Ongoing research in this area means that this project includes examining the relationship between the first-order and higher-order polynomials and the impact such a relationship has on the interpretation of the biplot.
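A short derivation indicates why the configuration is parabolic. It is a sketch under assumptions that are not taken from the project: the plotted point for category j is taken to be (a_1(j), a_2(j)), the first two orthonormal polynomials evaluated at the score s_j, with weights p_j that sum to one and with a_1 chosen to increase with the scores.

```latex
\begin{aligned}
\mu &= \sum_j p_j s_j, \qquad
\sigma^2 = \sum_j p_j (s_j - \mu)^2, \qquad
\mu_k = \sum_j p_j (s_j - \mu)^k, \\[4pt]
a_1(j) &= \frac{s_j - \mu}{\sigma}, \\[4pt]
a_2(j) &= \frac{(s_j - \mu)^2 - \dfrac{\mu_3}{\sigma^2}(s_j - \mu) - \sigma^2}
               {\sqrt{\mu_4 - \mu_3^2 / \sigma^2 - \sigma^4}}
        = \frac{\sigma^2 \, a_1(j)^2 - \dfrac{\mu_3}{\sigma} \, a_1(j) - \sigma^2}
               {\sqrt{\mu_4 - \mu_3^2 / \sigma^2 - \sigma^4}} .
\end{aligned}
```

Since a_2 is an exact quadratic function of a_1, every point lies on one parabola. Under these assumptions its vertex sits at a_1 = \mu_3 / (2\sigma^3), so scores and weights that are symmetric about their mean place the vertex on the vertical axis. The formulae derived in the project itself may be expressed differently.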
Eric Beh and Rosaria Lombardo
The role of transformations has gained wide attention in the correspondence analysis literature. In particular, such transformations have focused on the profiles of a two-way contingency table, largely owing to the impact of the work undertaken by Michael Greenacre over a decade ago. While his work examined the impact of a power transformation of the elements of a contingency table and of a profile, the results from this approach can also be obtained by considering the same power transformations from a reciprocal averaging and canonical correlation perspective. A few questions arise though. For example, what possible range of transformations exists that ensures that the correspondence analysis depicts an association between categorical variables that remains statistically significant? Also, what happens if transformations other than a power transformation, such as a log transformation or a trigonometric transformation, are considered? This project expands the role of power transformations in correspondence analysis and its related areas, including the impact of such transformations on the interpretation of the resulting low-dimensional visualisations that can be obtained from them.
Eric Beh and Rosaria Lombardo
The foundations of correspondence analysis rest with Pearson's famous chi-squared statistic, which provides the numerical groundwork for visualising how categorical variables are associated. It has recently been shown that the Freeman-Tukey statistic can also play an important role, confirming the advantages of the Hellinger distance that have long been advocated in the literature. Pearson's and the Freeman-Tukey statistics are two of five commonly used special cases of the Cressie-Read family of divergence statistics. Therefore, correspondence analysis can be expanded so that this family lies at the heart of how the association is quantified and visualised. The advantage of using the Cressie-Read family of divergence statistics when performing correspondence analysis is that it includes as special cases two variants that have gained some attention in the literature: the Hellinger distance decomposition (HDD) method and log-ratio analysis (LRA). Expanding correspondence analysis in this way also enables some general features to be obtained (such as coordinate systems, models of association/correlation, and distance measures) and allows flexibility when defining the “best” and “worst” possible visualisation of the association. This project therefore examines the role of the Cressie-Read family of divergence statistics in the correspondence analysis of a two-way contingency table. Possible extensions to this project include expanding it to the analysis of a multi-way contingency table, examining the impact on the visual display (such as the traditional correspondence plot, or the biplot) and exploring whether asymmetric associations can be incorporated into this framework.
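The way Pearson's and the Freeman-Tukey statistics sit inside the Cressie-Read family can be checked numerically. The sketch below computes the divergence statistic for a two-way table of counts under independence; the function names and the table are hypothetical, and cells are assumed positive whenever the parameter is -1 or smaller.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def cressie_read(table, lam):
    """Cressie-Read divergence statistic with parameter `lam`.

    lam = 1 gives Pearson's chi-squared, lam = -1/2 the Freeman-Tukey
    statistic, and lam = 0 and lam = -1 are handled as limiting cases
    (the likelihood-ratio statistic and its modified version).
    """
    cells = [(o, e)
             for orow, erow in zip(table, expected_counts(table))
             for o, e in zip(orow, erow)]
    if lam == 0:
        return 2.0 * sum(o * math.log(o / e) for o, e in cells if o > 0)
    if lam == -1:
        return 2.0 * sum(e * math.log(e / o) for o, e in cells)
    # o * ((o / e) ** lam - 1), rearranged so that zero counts are safe
    # whenever lam > -1.
    total = sum(o ** (lam + 1) * e ** (-lam) - o for o, e in cells)
    return 2.0 * total / (lam * (lam + 1))


table = [[10, 20], [30, 40]]  # hypothetical counts
pearson = cressie_read(table, 1)
freeman_tukey = cressie_read(table, -0.5)
likelihood_ratio = cressie_read(table, 0)
```

Here `pearson` equals the sum of (o - e)² / e over the cells, and `freeman_tukey` equals four times the sum of (√o - √e)², which is the link to the Hellinger distance noted above.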
Ruan Buys, Sugnet Lubbe
Fraud detection is an important application of classification methodology, typically on big data and in the presence of highly unbalanced classes. In this project we investigate the use of multi-dimensional visualisation for fraud detection.
Susan Laurens & Raeesa Ganey
This study applies log-ratio analysis to a compositional dataset of 74 milk fatty acids (FAs), aiming to reveal insights about their relationships. Visualising the milk FA data with compositional biplots offers valuable insights into their interactions and illustrates how different milk FAs contribute differently across various feeding regimes of the dairy cow.
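Log-ratio methods work with ratios between the parts of a composition rather than with the parts themselves. As an illustration only, the sketch below shows the centred log-ratio (clr) transformation, a common first step before constructing a compositional biplot. The study's actual preprocessing (for example weighting, or the handling of zero parts) is not described here, and the function name and values are hypothetical.

```python
import math


def clr(composition):
    """Centred log-ratio transform of one composition with positive parts.

    Each part is expressed as the log of its ratio to the geometric mean
    of all parts, so the transformed values always sum to zero.
    """
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical three-part composition, in percentages and in proportions.
as_percent = clr([10.0, 20.0, 70.0])
as_proportion = clr([0.1, 0.2, 0.7])
```

The two results agree, because the transformation depends only on ratios between parts and not on the units in which the composition is expressed.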
Raeesa Ganey and Johané Nienkemper-Swanepoel
This package builds on biplotEZ and includes dynamic biplot alternatives in the form of interactive and animated biplots.
To enhance interpretability and engagement, we incorporate dynamic plotting techniques to animate the evolution of biplots over time. These animations sequentially display annual changes in coordinate structures and axis orientations. Such dynamic visualisations support intuitive pattern recognition and serve as a powerful communication tool for conveying complex, multivariate changes in data conditions to both technical and non-technical audiences.
Raeesa Ganey, Johané Nienkemper-Swanepoel, Roelof Coetzer
Climate change manifests as evolving patterns in climate conditions over time. This study focuses on historical observed climate data from African regions, using biplot visualisations to capture monthly measurements and reveal inter-variable associations. A baseline biplot (either a specific year or an average representation) is constructed and subsequent annual biplots are aligned to it using Generalised Orthogonal Procrustes Analysis (GOPA). This enables tracking of changes in coordinate density and axis configuration, offering insights into temporal shifts in climatic relationships. The method facilitates comparative exploration of multivariate climate trends spanning multiple decades.
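The alignment step can be illustrated with a deliberately simplified sketch: a single two-dimensional configuration is centred and rotated onto a baseline by ordinary orthogonal Procrustes analysis, with no scaling or reflection. Generalised Orthogonal Procrustes Analysis handles several configurations and more general transformations, so this is not the method used in the study. All names and coordinates are hypothetical.

```python
import math


def centre(points):
    """Translate a list of (x, y) points so that their centroid is the origin."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(x - mx, y - my) for x, y in points]


def rotate(points, theta):
    """Rotate (x, y) points anticlockwise by `theta` radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def align_to_baseline(points, baseline):
    """Centre both configurations and rotate `points` onto `baseline`.

    Returns the aligned points and the rotation angle that minimises the
    sum of squared distances between matching points.
    """
    x = centre(points)
    y = centre(baseline)
    a = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(x, y))
    b = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(x, y))
    theta = math.atan2(b, a)
    return rotate(x, theta), theta


# Hypothetical baseline, and a later "year" that is the same shape
# rotated by 0.7 radians and shifted.
baseline = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
year = [(x + 5.0, y - 2.0) for x, y in rotate(baseline, 0.7)]
aligned, theta = align_to_baseline(year, baseline)
```

After alignment, any remaining differences between a year's configuration and the baseline reflect changes in shape rather than arbitrary orientation, which is what makes the annual biplots comparable.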