A Toy Car that Listens: Using Speech Technology to Explore Early Math Assessment

Robert Kellerman

Content Creator

24 March 2026

A final-year mechatronic engineering project at Stellenbosch University explored how speech technology and toy design can work together in early numeracy assessment. The result was an interactive toy car that asks simple math questions, listens to children’s spoken answers, and responds in real time using a low-resource digit recognition system built for child speech.

What if a toy could help us understand how young children are developing early math skills? That question sits at the center of a final-year mechatronic engineering project by Camryn Ylonde Abrahamson, supervised by Professor Herman Kamper. The project brought together speech technology, embedded systems, and mechanical design to build an interactive toy car that asks children simple math questions, listens to their spoken answers, and responds in real time.

The educational problem behind the project is serious. In South Africa, many children struggle with foundational numeracy, and the difficulty is often not picked up early. Large class sizes can make individual attention difficult, while formal assessments and specialist evaluations are often expensive, stressful, or only used once a problem has already become obvious.

This project explored whether a speech-based toy could offer a more accessible and child-friendly way to support early math assessment. It wasn’t designed to replace teachers or formal clinical assessment. Instead, it was developed as a research prototype that could help make early screening more interactive and easier to use in low-resource settings.

Why Child Speech is Hard for Machines to Process

Building such a toy sounds simple until speech recognition enters the picture. Most automatic speech recognition systems, which convert speech into text, work best when they’ve been trained on very large amounts of transcribed audio. That creates two problems here. First, many South African languages don’t have large speech datasets. Second, child speech is harder for standard systems to process than adult speech: children speak with higher pitch, more variation, and less stable articulation, and collecting child speech data comes with clear ethical and practical limits. The report describes this as a double low-resource problem, since both the speakers and the languages come with limited training data.

A Focused Speech Task Instead of Full Speech-To-Text

To deal with that, Abrahamson didn’t build a full speech-to-text system. She narrowed the task to something more focused and realistic for the setting: recognising spoken digits from 0 to 9. The system follows a template-based approach. In plain terms, it compares a child’s recording with a small set of example recordings and predicts which number sounds most similar. Before that comparison happens, the audio is cleaned up and standardised. Recordings are resampled to a common format, trimmed with voice activity detection, which identifies where speech starts and stops, and normalised so that the system pays less attention to differences in loudness or recording conditions.
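The clean-up steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the article doesn't say which resampler, voice activity detector, or normalisation method was used, so linear interpolation, a simple energy threshold, and peak normalisation stand in here. The function names, the 16 kHz target rate, and the threshold values are all assumptions.

```python
import math

def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (a crude stand-in for a proper resampler)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, int(round(len(samples) * dst_rate / src_rate)))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the input
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def trim_silence(samples, rate, frame_ms=20, threshold=0.02):
    """Energy-based VAD (assumed): keep the span from the first to the last loud frame."""
    frame = max(1, int(rate * frame_ms / 1000))
    active = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        if rms > threshold:
            active.append(start)
    if not active:
        return []                               # no speech found
    return samples[active[0]:active[-1] + frame]

def peak_normalise(samples, target=0.9):
    """Scale so the loudest sample has magnitude `target` (one of several possible schemes)."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [x * target / peak for x in samples]

def preprocess(samples, src_rate, dst_rate=16000):
    """Resample, trim to the speech region, then normalise loudness."""
    x = resample_linear(samples, src_rate, dst_rate)
    x = trim_silence(x, dst_rate)
    return peak_normalise(x)
```

Calling `preprocess(samples, src_rate=8000)` on a recording with silence at both ends would return only the spoken part, at 16 kHz and at a consistent peak level.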

The project then tested different ways of representing and comparing speech. One option used mel-frequency cepstral coefficients, or MFCCs, which are compact measurements that capture the shape of a speech signal in a way that roughly reflects how humans hear sound. Another used WavLM embeddings, which are learned speech features produced by a large pre-trained model. After comparing alternatives, the strongest system used WavLM Base+ embeddings, dynamic time warping, and k-nearest neighbors. Dynamic time warping is a method that lines up two speech signals even when they’re spoken at different speeds. K-nearest neighbors then classifies the input based on the closest matching examples. In this case, the best result came from using three neighbors: the three stored examples closest to the child’s answer vote on which digit was spoken.
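To make the matching step concrete, here is a small sketch of dynamic time warping combined with a three-neighbor vote. It works on any sequence of feature frames, so the toy one-dimensional frames in the usage note stand in for real WavLM embeddings or MFCCs. The Euclidean frame distance, the length normalisation, and the function names are assumptions for illustration, not details taken from the report.

```python
import math
from collections import Counter

def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_cost(seq_a, seq_b):
    """Length-normalised DTW alignment cost between two sequences of frames."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)

def knn_predict(query, templates, k=3):
    """templates: list of (label, sequence). Majority vote among the k lowest DTW costs.

    Ties in the vote go to the label seen first, i.e. the one with the nearest template.
    """
    ranked = sorted(templates, key=lambda t: dtw_cost(query, t[1]))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

As a usage example, with templates `[("one", [[0], [1], [2]]), ("one", [[0], [0], [1], [2], [2]]), ("two", [[5], [4], [3]]), ("two", [[5], [5], [4], [3]])]`, the query `[[0], [1], [1], [2]]` is a time-stretched version of the "one" pattern, so both "one" templates align at zero cost and win the vote.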

Fig. Photos of the final car assembly

How the Toy Car Works

The toy car itself was more than a casing for the software. It was designed as the physical interface for the whole system. The car was 3D-printed and built around a Raspberry Pi Zero 2 W, a small single-board computer. It included an OLED display that acted as the car’s face, headlights for visual feedback, and steering and movement mechanisms. A computer-side graphical user interface generated age-appropriate questions such as basic addition, greater-than comparisons, and counting tasks.
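Since the recogniser only handles the digits 0 to 9, a question generator for this kind of setup would plausibly need to keep every answer within that range. The sketch below shows one way that could look. The question wording, the "what comes after" form of the counting task, and the function name are all hypothetical; the article doesn't describe how the project's interface built its questions.

```python
import random

def make_question(rng):
    """Return (question_text, answer) where the answer is always a single digit 0-9."""
    kind = rng.choice(["addition", "greater_than", "counting"])
    if kind == "addition":
        a = rng.randint(0, 9)
        b = rng.randint(0, 9 - a)           # keep the sum within 0-9
        return f"What is {a} plus {b}?", a + b
    if kind == "greater_than":
        a, b = rng.sample(range(10), 2)     # two different digits
        return f"Which number is bigger, {a} or {b}?", max(a, b)
    n = rng.randint(0, 8)                   # counting: answer is n + 1, at most 9
    return f"What number comes after {n}?", n + 1
```

Passing in a seeded `random.Random` instance keeps a session reproducible, which can help when the same question set needs to be replayed.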

The question was converted to speech and played aloud, and the child’s answer was then recorded. When speech stopped for half a second, the system processed the response. If the answer was correct, the computer sent a command to the car to react happily. If it was wrong, the car responded sadly. The result was a complete interaction loop linking spoken language, machine recognition, and a physical toy.
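The half-second rule can be illustrated with a short end-of-speech detector. This is a sketch under assumptions: it uses 20 ms frames and an energy threshold, and it waits for speech to begin before counting silence, none of which is specified in the article. The "happy" and "sad" strings are placeholder command names, not the project's actual protocol.

```python
import math

def frame_rms(frame):
    """Root-mean-square level of one audio frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0

def record_until_silence(frames, frame_ms=20, silence_ms=500, threshold=0.02):
    """Collect frames until speech has been heard and is followed by silence_ms of quiet.

    `frames` can be any iterable, for example a generator reading from a microphone.
    """
    needed = math.ceil(silence_ms / frame_ms)   # quiet frames that make up half a second
    captured, quiet_run, heard_speech = [], 0, False
    for frame in frames:
        captured.append(frame)
        if frame_rms(frame) > threshold:
            heard_speech = True
            quiet_run = 0                       # speech resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                break                           # answer is complete, hand it to the recogniser
    return captured

def react(recognised_digit, expected_answer):
    """Pick the (hypothetical) command to send to the car."""
    return "happy" if recognised_digit == expected_answer else "sad"
```

Waiting for speech before counting silence means a child who takes a few seconds to think isn't cut off before answering.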

What the Results Showed

The results were promising. The final digit recognition system reached 79.41% accuracy on English child speech and 76.84% on Afrikaans child speech. The Whisper baselines used for comparison reached 56.86% on English and 22.11% on Afrikaans, so the new system was roughly 23 percentage points ahead on English and 55 points ahead on Afrikaans.

The Afrikaans result is especially interesting, since the system was tuned mainly on English child speech and still transferred well. The report’s Monte Carlo analysis, which repeated the training-size experiment several times to test consistency, found that performance improved quickly as more training speakers were added and then leveled off at around four speakers. That suggests the method could be adapted to another language with a fairly small number of child recordings.
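A training-size experiment of this kind can be sketched as a loop that draws random subsets of speakers, scores each subset, and averages the scores per subset size. The version below is only an outline of the general idea: the `evaluate` callback, the number of trials, and the function name are assumptions, and the report's actual procedure may differ.

```python
import random
from statistics import mean, stdev

def monte_carlo_curve(speakers, sizes, evaluate, trials=20, seed=0):
    """For each training-set size, average accuracy over random speaker subsets.

    speakers: list of speaker identifiers available for training.
    evaluate: caller-supplied function that builds templates from the given
              speakers and returns test accuracy (hypothetical interface).
    Returns {size: (mean_accuracy, standard_deviation)}.
    """
    rng = random.Random(seed)                   # fixed seed keeps the experiment repeatable
    curve = {}
    for size in sizes:
        scores = [evaluate(rng.sample(speakers, size)) for _ in range(trials)]
        spread = stdev(scores) if len(scores) > 1 else 0.0
        curve[size] = (mean(scores), spread)
    return curve
```

Plotting the mean against the subset size gives the learning curve, and the spread across trials shows how much the result depends on which speakers happened to be chosen.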

Final Thoughts

What this project shows, above all, is that a carefully scoped system can do useful work where data is limited. Instead of chasing a general-purpose speech model, the project focused on a narrow classroom task and built the whole chain around it, from signal processing to toy design.

The result is a working prototype that demonstrates how speech technology might support early numeracy assessment in a more natural way for children. Future work could expand the vocabulary beyond single digits, improve the system’s handling of unexpected speech, and move more processing directly onto the toy itself. For now, the project stands as a strong example of engineering shaped by a real educational constraint and carried through into a working device.