SDART: Spatial Dart AR Simulation with Hand-Tracked Input

ACM Multimedia 2025

October 27 – October 31, 2025

Dublin, Ireland

[PDF]

Milad Ghanbari (AAU, Austria), Wei Zhou (Cardiff, UK), Cosmin Stejerean (Meta, US), Christian Timmerer (AAU, Austria), Hadi Amirpour (AAU, Austria)

Abstract: We present a physics-driven 3D dart-throwing interaction system for Apple Vision Pro (AVP), developed with the Unity 6 engine and running in augmented reality (AR) mode on the device. The system uses the PolySpatial and Apple ARKit software development kits (SDKs) for hand input and tracking, allowing users to intuitively spawn, grab, and throw virtual darts much like real ones. The application combines physics simulation with AVP’s controller-free input system to manipulate objects realistically in an unbounded spatial volume. By implementing spatial distance measurement, scoring logic, and user performance recording, the system enables user studies on quality of experience in interactive applications. To evaluate the perceived quality and realism of the interaction, we conducted a subjective study with 10 participants using a structured questionnaire. The study measured various aspects of the user experience, including visual and spatial realism, control fidelity, depth perception, immersiveness, and enjoyment. Results indicate high mean opinion scores (MOS) across key dimensions. Link to video: Link
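
As a rough illustration of the distance-based scoring logic mentioned in the abstract (the actual system is built in Unity on AVP; the function name, ring radii, and ring scores below are hypothetical placeholders), a dart's landing point can be mapped to a score roughly as follows:

```python
import math

def dart_score(hit_point, board_center,
               ring_radii=(0.02, 0.05, 0.10, 0.17),
               ring_scores=(50, 25, 10, 5)):
    """Toy scoring-by-distance sketch: radii (meters) and scores are
    illustrative placeholders, not the values used in SDART."""
    dx, dy, dz = (h - c for h, c in zip(hit_point, board_center))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    for radius, score in zip(ring_radii, ring_scores):
        if distance <= radius:
            return score
    return 0  # dart landed outside the scored area

# Example: a dart landing 3 cm from the board centre
print(dart_score((0.03, 0.0, 0.0), (0.0, 0.0, 0.0)))  # -> 25
```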


VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results

ICCV VQualA 2025

October 19 – October 23, 2025

Hawai’i, USA

[PDF]

Hadi Amirpour (AAU, Austria), et al.

Abstract: This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.

VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results

ICCV VQualA 2025

October 19 – October 23, 2025

Hawai’i, USA

[PDF]

MohammadAli Hamidi (University of Cagliari, Italy), Hadi Amirpour (AAU, Austria), et al.

Abstract: Face images have become integral to various applications, but real-world capture conditions often lead to degradations such as noise, blur, compression artifacts, and poor lighting. These degradations negatively impact image quality and downstream tasks. To promote advancements in face image quality assessment (FIQA), we introduce the VQualA 2025 Challenge on Face Image Quality Assessment, part of the ICCV 2025 Workshops. Participants developed efficient models (≤0.5 GFLOPs, ≤5M parameters) predicting Mean Opinion Scores (MOS) under realistic degradations. Submissions were rigorously evaluated using objective metrics and human perceptual judgments. The challenge attracted 127 participants, resulting in 1519 valid final submissions. Detailed methodologies and results are presented, contributing to practical FIQA solutions.


A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss

ICCV VQualA 2025

October 19 – October 23, 2025

Hawai’i, USA

 

MohammadAli Hamidi (University of Cagliari, Italy), Hadi Amirpour (AAU, Austria), Luigi Atzori (University of Cagliari, Italy), Christian Timmerer (AAU, Austria)

Abstract: Face image quality assessment (FIQA) plays a critical role in face recognition and verification systems, especially in uncontrolled, real-world environments. Although several methods have been proposed, general-purpose no-reference image quality assessment techniques often fail to capture face-specific degradations. Meanwhile, state-of-the-art FIQA models tend to be computationally intensive, limiting their practical applicability. We propose a lightweight and efficient method for FIQA, designed for the perceptual evaluation of face images in the wild. Our approach integrates an ensemble of two compact convolutional neural networks, MobileNetV3-Small and ShuffleNetV2, with prediction-level fusion via simple averaging. To enhance alignment with human perceptual judgments, we employ a correlation-aware loss (MSECorrLoss), combining mean squared error (MSE) with a Pearson correlation regularizer. Our method achieves a strong balance between accuracy and computational cost, making it suitable for real-world deployment. Experiments on the VQualA FIQA benchmark demonstrate that our model achieves a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, remaining within competition efficiency constraints.
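
As a minimal sketch of the correlation-aware loss and prediction-level fusion described above (written in PyTorch; the weighting factor lam, the epsilon term, and the batch-level Pearson formulation are assumptions for illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn

class MSECorrLoss(nn.Module):
    """Illustrative correlation-aware loss: MSE plus a Pearson-correlation
    regularizer computed over the batch (the weighting `lam` is hypothetical)."""
    def __init__(self, lam: float = 0.5, eps: float = 1e-8):
        super().__init__()
        self.lam = lam
        self.eps = eps
        self.mse = nn.MSELoss()

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        pred, target = pred.flatten(), target.flatten()
        # Pearson correlation between predicted and ground-truth MOS in the batch
        p, t = pred - pred.mean(), target - target.mean()
        corr = (p * t).sum() / (p.norm() * t.norm() + self.eps)
        # Penalize low correlation alongside the squared error
        return self.mse(pred, target) + self.lam * (1.0 - corr)


def ensemble_score(mobilenet_pred: torch.Tensor, shufflenet_pred: torch.Tensor) -> torch.Tensor:
    """Prediction-level fusion of the two backbones by simple averaging."""
    return 0.5 * (mobilenet_pred + shufflenet_pred)
```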


Depth-Enabled Inspection of Medical Videos

ACM Multimedia 2025

October 27 – October 31, 2025

Dublin, Ireland

Hadi Amirpour (AAU, Austria), Doris Putzgruber-Adamitsch (AAU, Austria), Yosuf El-Shabrawi (Kabeg, Austria), Klaus Schoeffmann (AAU, Austria)

Abstract: Cataract surgery is the most frequently performed surgical procedure worldwide, involving the replacement of a patient’s clouded eye lens with a synthetic intraocular lens to restore visual acuity. Although typically brief, the operation consists of distinct phases that demand precision and extensive training, traditionally constrained by the limitations of real-time observation under a microscope. To enhance learning and procedural accuracy, modern advancements in stereoscopic video capture and head-mounted displays (HMDs) offer a promising solution. This paper demonstrates the application of stereoscopic cataract surgery videos, visualized through Apple Vision Pro (AVP) and Meta Quest 3, to provide immersive 3D perspectives that enhance depth perception and spatial awareness. An expert evaluation study with experienced surgeons indicates that stereoscopic visualization significantly improves comprehension of spatial relationships and procedural maneuvers, suggesting its potential to revolutionize surgical education and real-time guidance in ophthalmic surgery. Demo video: Link

Real-Time AI-Driven Avatar Generation for Sign Language in HTTP Adaptive Streaming

The 3rd ACM SIGCOMM Workshop on Emerging Multimedia Systems (ACM EMS 2025)

https://conferences.sigcomm.org/sigcomm/2025/workshop/ems/

8 September 2025 // Coimbra, Portugal

 

Daniele Lorenzi (AAU, Austria), Emanuele Artioli (AAU, Austria), Farzad Tashtarian (AAU, Austria), Christian Timmerer (AAU, Austria)

Abstract: As digital media consumption over the Internet surges globally, ensuring accessibility for all users becomes paramount. For people with hearing impairments, this means providing inclusion beyond classic captioning, which does not convey the full emotional and contextual depth of spoken content. This work addresses this accessibility gap by exploring the use of AI-generated avatars capable of translating speech into sign language in real time. After defining the multifaceted challenges in this domain, we propose a novel AI-driven task partitioning approach to animate avatars for accurate and expressive sign language interpretations in live streaming.


Unlocking Implicit Motion for Evaluating Image Complexity

Displays

 

Yixiao Li (Beihang University, China), Xiaoyuan Yang (Beihang University, China), Yanda Meng (University of Exeter, UK), Hadi Amirpour (AAU, Austria), Jiang Liu (Cardiff University, UK), Yuqing Luo (Cardiff University, UK), Hantao Liu (Cardiff University, UK), and Wei Zhou (Cardiff University, UK)

Abstract: Image complexity (IC) plays a critical role in both cognitive science and multimedia computing, influencing visual aesthetics, emotional responses, and tasks such as image classification and enhancement. However, defining and quantifying IC remains challenging due to its multifaceted nature, which encompasses both objective attributes (e.g., detail, structure) and subjective human perception. While traditional methods rely on entropy-based or multidimensional approaches, and recent advances employ machine learning and shallow neural networks, these techniques often fail to fully capture the subjective aspects of IC. Inspired by the fact that the human visual system inherently perceives implicit motion in static images, we propose a novel approach to address this gap by explicitly incorporating hidden motion into IC assessment. We introduce the motion-inspired image complexity assessment metric (MICM) as a new framework for this purpose. MICM introduces a dual-branch architecture: One branch extracts spatial features from static images, while the other generates short video sequences to analyze latent motion dynamics. To ensure meaningful motion representation, we design a hierarchical loss function that aligns video features with text prompts derived from image-to-text models, refining motion semantics at both local (i.e., frame and word) and global levels. Experiments on three public image complexity assessment (ICA) databases demonstrate that our approach, MICM, significantly outperforms state-of-the-art methods, validating its effectiveness. The code will be publicly available upon acceptance of the paper.
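
A minimal sketch of the dual-branch idea described above (PyTorch, with placeholder encoders, layer sizes, and fusion head; the image-to-clip generation step and the hierarchical text-alignment loss are omitted, so this does not reflect MICM's actual architecture):

```python
import torch
import torch.nn as nn

class DualBranchICM(nn.Module):
    """Illustrative dual-branch complexity model: a 2D branch for the static
    image and a 3D branch for a short clip standing in for latent motion."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.motion = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.head = nn.Linear(2 * feat_dim, 1)  # scalar complexity score

    def forward(self, image: torch.Tensor, clip: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); clip: (B, 3, T, H, W), assumed to be produced
        # from the image by a separate image-to-video model
        fused = torch.cat([self.spatial(image), self.motion(clip)], dim=1)
        return self.head(fused)
```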

 


Authors: Ahmed Telili (TII, UAE), Wassim Hamidouche (TII, UAE), Brahim Farhat (TII, UAE), Hadi Amirpour (AAU, Austria), Christian Timmerer (AAU, Austria), Ibrahim Khadraoui (TII, UAE), Jiajie Lu (Politecnico di Milano, Italy), The Van Le (IVCL, South Korea), Jeonneung Baek (IVCL, South Korea), Jin Young Lee (IVCL, South Korea), Yiying Wei (AAU, Austria), Xiaopeng Sun (Meituan Inc., China), Yu Gao (Meituan Inc., China), JianCheng Huang (Meituan Inc., China), and Yujie Zhong (Meituan Inc., China)

Journal: Signal Processing: Image Communication

Abstract: Omnidirectional (360-degree) video is rapidly gaining popularity due to advancements in immersive technologies like virtual reality (VR) and extended reality (XR). However, real-time streaming of such videos, particularly in live mobile scenarios such as unmanned aerial vehicles (UAVs), is hindered by limited bandwidth and strict latency constraints. While traditional methods such as compression and adaptive resolution are helpful, they often compromise video quality and introduce artifacts that diminish the viewer’s experience. Additionally, the unique spherical geometry of 360-degree video, with its wide field of view, presents challenges not encountered in traditional 2D video. To address these challenges, we initiated the 360-degree Video Super Resolution and Quality Enhancement challenge. This competition encourages participants to develop efficient machine learning (ML)-powered solutions to enhance the quality of low-bitrate compressed 360-degree videos, under two tracks focusing on 2× and 4× super-resolution (SR). In this paper, we outline the challenge framework, detailing the two competition tracks and highlighting the SR solutions proposed by the top-performing models. We assess these models within a unified framework, considering (i) quality enhancement, (ii) bitrate gain, and (iii) computational efficiency. Our findings show that lightweight single-frame models can effectively balance visual quality and runtime performance under constrained conditions, setting strong baselines for future research. These insights offer practical guidance for advancing real-time 360-degree video streaming, particularly in bandwidth-limited immersive applications.

 

Authors: Kurt Horvath, Dragi Kimovski, Radu Prodan

Venue: 2025 IEEE International Conference on Edge Computing and Communications (IEEE EDGE 2025), July 7–12, 2025, Helsinki, Finland

Abstract: Scheduling services within the computing continuum is complex due to the dynamic interplay of the Edge, Fog, and Cloud resources, each offering distinct computational and networking advantages. This paper introduces SCAREY, a user location-aided service lifecycle management framework based on state machines. SCAREY addresses critical service discovery, provisioning, placement, and monitoring challenges by providing unified dynamic state machine-based lifecycle management, allowing instances to transition between discoverable and non-discoverable states based on demand. It incorporates a scalable service deployment algorithm to adjust the number of instances and employs network measurements to optimize service placement, ensuring minimal latency and enhancing sustainability. Real-world evaluations demonstrate a 73% improvement in service discovery and acquisition times, 45% lower operating costs, and over 57% less power consumption and lower CO2 emissions compared to existing related methods.
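
As a toy illustration of the demand-driven state-machine concept mentioned in the abstract (the state names, threshold rule, and class below are hypothetical and do not reproduce SCAREY's actual lifecycle model):

```python
from enum import Enum, auto

class ServiceState(Enum):
    # Hypothetical lifecycle states; the paper's exact state set may differ.
    DISCOVERABLE = auto()
    NON_DISCOVERABLE = auto()

class ServiceInstance:
    """Toy lifecycle manager: the instance is discoverable only while
    observed demand meets an illustrative threshold."""
    def __init__(self, demand_threshold: int = 1):
        self.state = ServiceState.NON_DISCOVERABLE
        self.demand_threshold = demand_threshold

    def update(self, current_demand: int) -> ServiceState:
        # Transition between discoverable and non-discoverable based on demand
        if current_demand >= self.demand_threshold:
            self.state = ServiceState.DISCOVERABLE
        else:
            self.state = ServiceState.NON_DISCOVERABLE
        return self.state
```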


 

 

We are happy to announce that our paper “EnergyLess: An Energy-Aware Serverless Workflow Batch Orchestration on the Computing Continuum” (by Reza Farahani and Radu Prodan) has been accepted for IEEE CLOUD 2025, which will take place in Helsinki, Finland, in July 2025.

Venue: IEEE International Conference on Cloud Computing 2025 (IEEE CLOUD 2025)

Abstract: Serverless cloud computing is increasingly adopted for workflow management, optimizing resource utilization for providers while lowering costs for customers. Integrating edge computing into this paradigm enhances scalability and efficiency, enabling seamless workflow distribution across geographically dispersed resources on the computing continuum. However, existing serverless workflow orchestration methods on the computing continuum often prioritize time and cost objectives, neglecting energy consumption and carbon footprint. This paper introduces EnergyLess, a multi-objective concurrent serverless workflow batch orchestration service for the computing continuum. EnergyLess decomposes workflow functions within a batch into finer-grained sub-functions and schedules either the original or sub-function versions to appropriate regions and instances on the continuum, improving energy consumption, carbon footprint, economic cost, and completion time while considering individual workflow requirements and resource constraints. We formulate the problem as a mixed-integer nonlinear programming (MINLP) model and propose three lightweight heuristic algorithms for function decomposition and scheduling. Evaluations on a large-scale computing continuum testbed with realistic workflows, spanning AWS Lambda, Google Cloud Functions (GCF), and 325 fog and edge instances across six regions, demonstrate that EnergyLess improves cost efficiency by 75%, completion time by 6%, energy consumption by 15%, and CO2 emissions by 20% for a batch size of 300, compared to three baseline methods.