The listener head generation (LHG) task aims to generate natural nonverbal listener responses based on the speaker's multimodal cues. Prior work either relies on limited modalities (e.g., audio or facial information) or employs autoregressive approaches that suffer from limitations such as accumulated prediction errors. To address these limitations, we propose DiffListener, a discrete diffusion based approach for non-autoregressive listener head generation. Our model takes the speaker's facial information, audio, and text as inputs, and additionally incorporates facial differential information to represent the temporal dynamics of expressions and movements. With this explicit modeling of facial dynamics, DiffListener can generate coherent reaction sequences in a non-autoregressive manner. In comprehensive experiments, DiffListener achieves state-of-the-art performance in both quantitative and qualitative evaluations. A user study further shows that DiffListener generates natural, context-aware listener reactions that are well synchronized with the speaker.
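The "facial differential information" mentioned above can be thought of as frame-to-frame changes in the speaker's facial coefficients. Below is a minimal sketch (not the authors' code) of one way to compute such differential features from a 3DMM coefficient sequence; the tensor shapes and zero-padding choice are assumptions for illustration only.

```python
import torch

def differential_features(coeffs: torch.Tensor) -> torch.Tensor:
    """coeffs: (T, D) 3DMM coefficients over T frames.
    Returns (T, D) frame-to-frame differences, zero-padded at the first
    frame so the output keeps the same length as the input."""
    diff = coeffs[1:] - coeffs[:-1]        # (T-1, D) temporal deltas
    pad = torch.zeros_like(coeffs[:1])     # keep sequence length T
    return torch.cat([pad, diff], dim=0)

# Example: 64 frames of 73-dimensional 3DMM coefficients (dimensions assumed).
speaker_coeffs = torch.randn(64, 73)
speaker_diff = differential_features(speaker_coeffs)  # (64, 73)
```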
Our model is composed of three main components: a fusion network, a discrete diffusion model, and a facial decoder. The fusion network integrates the speaker's 3DMM coefficients, audio features, differential 3DMM features, and textual information to derive a speaker representation. Conditioned on this representation, the discrete diffusion model generates the listener's codebook index sequence in a non-autoregressive manner. Finally, the facial decoder maps the generated index sequence to the listener's facial information.
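The following is a hypothetical, simplified sketch of how the three components described above could fit together (fusion network → discrete diffusion → facial decoder). Module names, feature dimensions, and the stand-in sampler are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Fuses speaker 3DMM, differential 3DMM, audio, and text features."""
    def __init__(self, d_3dmm=73, d_audio=128, d_text=768, d_model=512):
        super().__init__()
        self.proj = nn.Linear(2 * d_3dmm + d_audio + d_text, d_model)

    def forward(self, coeffs, diff, audio, text):
        # All inputs are (T, feature_dim), aligned along the time axis.
        fused = torch.cat([coeffs, diff, audio, text], dim=-1)
        return self.proj(fused)  # (T, d_model) speaker representation

class DiscreteDiffusion(nn.Module):
    """Placeholder for the non-autoregressive discrete diffusion model that
    predicts the listener's codebook index sequence, conditioned on the
    speaker representation."""
    def __init__(self, d_model=512, codebook_size=256):
        super().__init__()
        self.head = nn.Linear(d_model, codebook_size)

    @torch.no_grad()
    def sample(self, speaker_repr):
        # A real sampler would iteratively denoise discrete tokens;
        # an argmax over logits is used here purely as a stand-in.
        return self.head(speaker_repr).argmax(dim=-1)  # (T,) codebook indices

class FacialDecoder(nn.Module):
    """Decodes codebook indices back into listener facial coefficients."""
    def __init__(self, codebook_size=256, d_3dmm=73):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, d_3dmm)

    def forward(self, indices):
        return self.codebook(indices)  # (T, d_3dmm) listener facial output

# Usage: fuse speaker cues, sample listener token indices, decode to 3DMM.
T = 64
fusion, diffusion, decoder = FusionNetwork(), DiscreteDiffusion(), FacialDecoder()
speaker_repr = fusion(torch.randn(T, 73), torch.randn(T, 73),
                      torch.randn(T, 128), torch.randn(T, 768))
listener_coeffs = decoder(diffusion.sample(speaker_repr))  # (T, 73)
```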
@inproceedings{Jung2025DiffListener,
  author    = {Siyeol Jung and Taehwan Kim},
  title     = {DiffListener: Discrete Diffusion Model for Listener Generation},
  booktitle = {ICASSP},
  year      = {2025},
}