TY - JOUR
T1 - A Human Intention and Motion Prediction Framework for Applications in Human-Centric Digital Twins
AU - Asad, Usman
AU - Khalid, Azfar
AU - Lughmani, Waqas Akbar
AU - Rasheed, Shummaila
AU - Khan, Muhammad Mahabat
PY - 2025/10/1
Y1 - 2025/10/1
AB - In manufacturing settings where humans and machines collaborate, understanding and predicting human intention is crucial for enabling the seamless execution of tasks. This knowledge is the basis for creating an intelligent, symbiotic, and collaborative environment. However, current foundation models often fall short in directly anticipating complex tasks and producing contextually appropriate motion. This paper proposes a modular framework that investigates strategies for structuring task knowledge and engineering context-rich prompts to guide Vision–Language Models in understanding and predicting human intention in semi-structured environments. Our evaluation, conducted across three use cases of varying complexity, reveals a critical tradeoff between prediction accuracy and latency. We demonstrate that a Rolling Context Window strategy, which uses a history of frames and the previously predicted state, achieves a strong balance of performance and efficiency. This approach significantly outperforms single-image inputs and computationally expensive in-context learning methods. Furthermore, incorporating egocentric video views yields a substantial 10.7% performance increase in complex tasks. For short-term motion forecasting, we show that the accuracy of joint position estimates is enhanced by using historical pose, gaze data, and in-context examples.
UR - https://www.open-access.bcu.ac.uk/16669/
DO - 10.3390/biomimetics10100656
M3 - Article
SN - 2313-7673
VL - 10
JO - Biomimetics
JF - Biomimetics
IS - 10
ER -