From cumbersome manual to interactive living room: how AI translates our physical tasks into Mixed Reality.

Bram Verstappen

Imagine: you come home from IKEA and want to assemble your new electric standing desk. Instead of constantly leafing through a manual with vague pictures, a panel floats right above the desk leg you need at that moment, while another panel points to the exact opening where the leg slides in. No more confusion, no more flipping back and forth to work out which screw goes into which hole.

SPATIAL COMPUTING

Whether it is assembling a piece of furniture, repairing a bicycle or running complex industrial processes: we still depend on two-dimensional manuals, on paper or on a screen.

Modern technologies such as Augmented Reality can create experiences in which the boundary with the real world seems to disappear entirely. Yet even on the newest spatial computing devices, such as the Apple Vision Pro, valuable knowledge and information is still shown in floating, two-dimensional PDF windows.

When we carry out tasks this way, we force our brains to make constant context switches: our attention keeps moving between the physical world, the tool in our hands and the text on the screen. This raises cognitive load, increases the chance of errors and lengthens the time needed to complete a task.

MEET EMBEDLLM

To break through that barrier and make the most of Spatial Computing, we investigated EmbedLLM. It is the very first system that, thanks to AI, can automatically adapt both the content and the visual presentation of a static document (such as a PDF or website) to the physical environment and the goal of the end user.

EmbedLLM works as a three-step pipeline. First, the SceneGraphBuilder maps the user's physical environment using computer vision. The raw 3D data is represented as a semantic scene graph, a graph structure that captures relations between objects, for example that the 'coffee mug' stands on the 'desk'.
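To make the idea of a semantic scene graph concrete, the following minimal Python sketch stores objects as nodes and spatial relations as labelled edges. The class and method names (SceneGraph, add_object, relate, related) are hypothetical and not taken from EmbedLLM, and the graph is filled by hand here, where a real system would derive it from computer vision output.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Tiny semantic scene graph: named nodes plus (subject, relation, target) edges."""

    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_object(self, name, position):
        # Position is an assumed (x, y, z) tuple in meters.
        self.nodes[name] = {"kind": "object", "position": position}

    def relate(self, subject, relation, target):
        if subject not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.append((subject, relation, target))

    def related(self, subject, relation):
        # All targets that `subject` points to through `relation`.
        return [t for s, r, t in self.edges if s == subject and r == relation]


scene = SceneGraph()
scene.add_object("desk", (0.0, 0.75, 0.0))
scene.add_object("coffee mug", (0.2, 0.80, 0.1))
scene.relate("coffee mug", "on", "desk")
print(scene.related("coffee mug", "on"))  # ['desk']
```

Storing relations as explicit triples keeps queries such as "what does the mug stand on?" simple, which is the kind of knowledge the text attributes to the scene graph.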

Next, the ArchitectAgent, powered by a Large Language Model, analyzes both the source manual and the environment. The AI divides the manual into logical steps and determines exactly which information should be linked to which physical object. It does so by adding new nodes and relations to the graph of the environment.
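The step of adding nodes and relations to the environment graph could look roughly like the sketch below. It is an illustration under assumptions: the graph layout, the relation names (anchored_to, followed_by) and the attach_steps helper are hypothetical, and the language model's answer is replaced by a hard-coded plan.

```python
# Hypothetical scene graph as plain data: object nodes plus (subject, relation, target) edges.
scene = {
    "nodes": {"desk leg": {"kind": "object"}, "desk frame": {"kind": "object"}},
    "edges": [("desk leg", "next_to", "desk frame")],
}

# Stand-in for the language model's answer. In a real system this would be
# parsed from the model's response instead of being hard-coded.
plan = [
    {"id": "step-1", "text": "Pick up the desk leg.", "anchor": "desk leg"},
    {"id": "step-2", "text": "Slide the leg into the frame opening.", "anchor": "desk frame"},
]


def attach_steps(scene, plan):
    """Add one instruction node per step, anchor it to an object, and chain the steps."""
    skipped = []
    previous = None
    for step in plan:
        if step["anchor"] not in scene["nodes"]:
            # The plan refers to an object that is not in the scene graph.
            skipped.append(step["id"])
            continue
        scene["nodes"][step["id"]] = {"kind": "instruction", "text": step["text"]}
        scene["edges"].append((step["id"], "anchored_to", step["anchor"]))
        if previous is not None:
            scene["edges"].append((previous, "followed_by", step["id"]))
        previous = step["id"]
    return skipped


skipped = attach_steps(scene, plan)
print(len(scene["nodes"]), skipped)  # 4 []
```

Checking every anchor against the existing nodes is a design choice in this sketch: it keeps an instruction from being attached to an object the scene does not contain.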

Finally, the Interpreter uses a force-based layout algorithm to make the digital panels float neatly in 3D space, without overlapping each other or blocking the view.
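A force-based layout can be sketched in a few lines of Python. The version below is a generic illustration, not EmbedLLM's actual algorithm: every panel is pulled by a spring toward a rest position just above its anchor and pushed away from panels that come closer than a minimum gap. The function name and all parameter values (hover, min_gap, spring, repulsion, steps) are assumptions chosen for the example.

```python
import math


def layout(anchors, hover=0.2, min_gap=0.4, spring=0.1, repulsion=0.05, steps=200):
    """Place one panel per anchor by iterating simple spring and repulsion forces."""
    # Each panel's rest position hovers a little above its anchor.
    rest = [(x, y + hover, z) for x, y, z in anchors]
    # Start at the rest position, nudged slightly in x so that panels with
    # identical anchors do not start on exactly the same point.
    panels = [[x + 0.01 * i, y, z] for i, (x, y, z) in enumerate(rest)]
    for _ in range(steps):
        forces = [[0.0, 0.0, 0.0] for _ in panels]
        for i, p in enumerate(panels):
            # Spring force: pull the panel back toward its rest position.
            for k in range(3):
                forces[i][k] += spring * (rest[i][k] - p[k])
            # Repulsion: push away from every panel closer than min_gap.
            for j, q in enumerate(panels):
                if i == j:
                    continue
                d = math.dist(p, q)
                if d < min_gap:
                    push = repulsion * (min_gap - d) / max(d, 1e-6)
                    for k in range(3):
                        forces[i][k] += push * (p[k] - q[k])
        # Apply all forces at once so the update does not depend on panel order.
        for i, f in enumerate(forces):
            for k in range(3):
                panels[i][k] += f[k]
    return panels


anchors = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 0.0, 0.0)]
panels = layout(anchors)
print(round(math.dist(panels[0], panels[1]), 3))
```

With these values the two panels whose anchors lie 0.1 apart settle farther apart than their anchors, while the distant panel stays at its rest position. The sketch only balances closeness to the anchor against separation between panels; it does not check whether a panel blocks the user's view.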

TESTED IN PRACTICE

We examined the effectiveness of the system in two user studies. In a first preference study, EmbedLLM's automatic spatial integration scored significantly higher than a traditional smartphone approach. Participants found it a relief that digital instruction cards were attached directly to the relevant bicycle parts or kitchen appliances.

"It was much easier to understand how I had to carry out the task."

The in-depth practical study that followed, however, produced a surprising result. Participants carried out repair tasks on a physical 3D printer using EmbedLLM in Virtual Reality. Although they subjectively strongly preferred the system over the standard approach, the measured data showed that both cognitive load and completion time were higher with the AI-driven variant.

"I found the AI system very interactive, but I did not know all the words."

Not all participants had the same prior knowledge of the task at hand, namely 3D printing. In addition, some participants felt compelled to follow the steps in strict chronological order, rather than occasionally glancing ahead at the next panel. According to them, that took away some flexibility and autonomy.

THE FUTURE OF CONTEXT-AWARE SPATIAL COMPUTING

Our research shows that automatically transforming traditional, static documents into context-aware XR instructions is technically feasible. Users also welcome such alternative representations. At the same time, the research exposes a crucial point for improvement: a good spatial computing interface must not only understand the environment, but also preserve the flexibility that supports human autonomy.

Future work will extend EmbedLLM with real-time updates, so that human autonomy and flexibility remain safeguarded. Research into the use of 'text-to-3D' models is another interesting direction. Users could then see dynamic 3D animations that clarify a task visually and directly. This might well be the beginning of a future without that awkward paper manual.



University or University College

Universiteit Hasselt

Thesis year

2026

Supervisor(s) and advisors

Prof. Dr. Raf Ramakers, Prof. Dr. Kris Luyten, Mr. Maties Claesen

Theme(s)

Computer science, knowledge technology and ICT

Keywords

Mixed Reality, Artificial Intelligence, Virtual Reality, computer science, information visualization, Human Computer Interaction