Mapping Real-World Objects into Virtual Reality to Facilitate Interaction using 6DoF Pose Estimation

Robert Kellerman
Content Creator
15 April 2026
  • This research explores how real-world objects can be mapped into virtual reality using 6DoF pose estimation and deep learning to create a more immersive, tactile VR experience. By combining convolutional neural networks with low-cost cameras and real-time tracking, the system allows users to see and physically interact with the actual objects they hold inside a virtual environment—bridging the gap between visual and physical perception.
  • Through testing on various objects and comparing different tracking methods, the study demonstrates strong accuracy and practical feasibility, while also identifying key challenges such as object symmetry, occlusion, and data limitations. The findings highlight a scalable, cost-effective approach to enhancing VR immersion, with applications in education, training, and simulation.

Virtual reality (VR) has been studied for decades, but only recently has it reached the kind of mainstream visibility that fuels intensive R&D. In this thesis, Stefan Pelser—under the supervision of Dr Rensu Theart—investigated how convolutional neural networks (CNNs) can translate real-world objects into the virtual domain. 

The core objective of this research is to deliver a VR experience in which sight and touch finally agree. CNN-driven object mapping promises to bridge that gap, letting users see the very items they are holding so that the texture, weight, and resistance they feel match what they see.


Figure 1: Meta Quest 2.

 

Methodology, Datasets, and Pose-Estimation Pipeline

The project started with a suite of synthetically generated datasets that followed the LineMOD format, augmented to cover varied lighting, backgrounds, and occlusions. These datasets fed EfficientPose, a state-of-the-art 6DoF pose-estimation network. At run time, the headset’s own spatial-mapping sensors captured the surrounding space, while two Logitech C270 webcams supplied RGB frames to EfficientPose in real time.
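
To make the run-time flow concrete, the sketch below shows one way such an inference loop could be structured in Python. It is illustrative only: the PoseEstimator class and its predict method are hypothetical stand-ins for the trained EfficientPose model, not the thesis code or the EfficientPose repository’s actual API; only the OpenCV capture calls are standard.

    import cv2
    import numpy as np

    class PoseEstimator:
        """Hypothetical stand-in for a trained 6DoF pose-estimation model."""

        def __init__(self, weights_path):
            self.weights_path = weights_path  # placeholder: a real model loads here

        def predict(self, frame_bgr):
            # Placeholder: a real model returns one (rvec, tvec) pair per
            # detected object, i.e. its rotation and translation relative
            # to the camera.
            return []

    def run_loop(camera_index=0):
        estimator = PoseEstimator("efficientpose_weights.h5")  # assumed filename
        cap = cv2.VideoCapture(camera_index)  # e.g. one Logitech C270 webcam
        try:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                for rvec, tvec in estimator.predict(frame):
                    # Each 6DoF pose would be handed to the VR engine so the
                    # virtual twin tracks the real object.
                    print("rvec:", np.ravel(rvec), "tvec:", np.ravel(tvec))
        finally:
            cap.release()

    if __name__ == "__main__":
        run_loop()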

To benchmark the deep-learning approach, an ArUco-marker pipeline was implemented in parallel. Both pipelines relayed their pose outputs to the VR engine, which rendered each tracked object in the user’s field of view with minimal latency.
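
The marker-based baseline is easier to pin down, because OpenCV’s aruco module (shipped with opencv-contrib-python) provides the detection step directly. The sketch below assumes the OpenCV 4.7+ detector API; the marker size and camera intrinsics are placeholder values rather than the calibration actually used in the thesis.

    import cv2
    import numpy as np

    MARKER_SIZE = 0.05  # assumed marker side length in metres

    # Placeholder intrinsics; in practice these come from calibrating the webcam.
    camera_matrix = np.array([[800.0,   0.0, 320.0],
                              [  0.0, 800.0, 240.0],
                              [  0.0,   0.0,   1.0]])
    dist_coeffs = np.zeros(5)

    # 3D corners of a square marker centred on its own origin
    # (top-left, top-right, bottom-right, bottom-left).
    h = MARKER_SIZE / 2.0
    object_points = np.array([[-h,  h, 0.0], [ h,  h, 0.0],
                              [ h, -h, 0.0], [-h, -h, 0.0]], dtype=np.float32)

    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

    def marker_poses(frame_bgr):
        """Return a 4x4 camera-to-marker transform for each detected marker."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        corners, ids, _ = detector.detectMarkers(gray)
        poses = {}
        if ids is None:
            return poses
        for marker_corners, marker_id in zip(corners, ids.ravel()):
            ok, rvec, tvec = cv2.solvePnP(object_points,
                                          marker_corners.reshape(4, 2),
                                          camera_matrix, dist_coeffs)
            if not ok:
                continue
            rot, _ = cv2.Rodrigues(rvec)      # rotation vector -> 3x3 matrix
            transform = np.eye(4)
            transform[:3, :3] = rot
            transform[:3, 3] = tvec.ravel()   # translation in metres
            poses[int(marker_id)] = transform
        return poses

Whichever pipeline produces it, this rotation-plus-translation pair is the quantity the VR engine consumes, which is what makes the two approaches directly comparable.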

 

Tactile Immersion—Why Mapping Matters

Despite efforts to make VR as close to reality as possible, a significant gap remains: current technology does not let users physically feel the virtual environment, and that absence undermines immersion. Visual fidelity alone cannot sustain presence if your hands never feel as though they belong in the scene.

Plastic controllers always remind you of their own shape, and controller-free hand tracking offers no haptic cues at all. Mapping the actual mug, wrench, or surgical tool you hold into the headset view restores the missing tactile channel, delivering:

  • Texture and grain—feel the wood of a hammer handle or the ridges of a game prop.
  • Weight and inertia—lift a real dumb-bell and see its twin move identically in VR.
  • Temperature cues—sense a cold metal scalpel during medical simulation.
     

Because this approach uses inexpensive cameras and household objects, it avoids the cost and complexity of specialized haptic gloves or bespoke peripherals.

 
