Paper List

Status: stable | Last reviewed: 2026-05-09

Key papers referenced in the ITM project. The existence of every entry has been verified via WebFetch.

Turn-taking foundations

arXiv ID	Title	Authors	Venue	Notes
2205.09812	Voice Activity Projection: Self-supervised Learning of Turn-taking Events	Ekstedt, Skantze	Interspeech 2022	Original VAP paper
2010.10874	TurnGPT: a Transformer-based Language Model for Predicting Turn-taking	Ekstedt, Skantze	EMNLP 2020 Findings	Text-based
2401.04868	Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection	Inoue et al.	IWSDS 2024	Real-time VAP
2403.06487	Multilingual Turn-taking Prediction Using Voice Activity Projection	Inoue et al.	LREC-COLING 2024	English, Chinese, Japanese
2410.15929	Yeah, Un, Oh: Backchannel Prediction with Fine-tuning of VAP	Inoue et al.	NAACL 2025	Backchannel
2401.14717	Turn-taking and Backchannel Prediction with Acoustic and LLM Fusion	Wang et al.	ICASSP 2024	Amazon
2507.07518	Triadic Multi-party Voice Activity Projection	Elmers et al.	Interspeech 2025	Three-party conversation
2509.23938	Easy Turn: Integrating Acoustic and Linguistic Modalities	—	2025	4-state classification
2603.08216	DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining	Shangeth Rajaa	2026	220 ms early prediction

arXiv ID / DOI	Title	Authors	Venue	Notes
2506.03980	Voice Activity Projection Model with Multimodal Encoders	Saga, Pelachaud	2025	Related to MM-VAP
10.1587/transinf.2024HCP0002	Multimodal Voice Activity Projection for Turn-taking	Inoue et al.	IEICE 2024	Original MM-VAP paper
2505.21043	Visual Cues Enhance Predictive Turn-Taking	—	ACL 2025 Findings	FAU + gaze + head
2505.12654	MM-F2F: Predicting Turn-Taking and Backchannel	—	ACL 2025 Findings	Three-modality fusion
2505.13688	Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations	Heo et al.	Interspeech 2025	Gaze
10.1145/3577190.3614154	Video-based Respiratory Waveform Estimation in Dialogue (VRWE)	Obi, Funakoshi	ICMI 2023	Most important prior work

arXiv ID	Title	Notes
2503.23439	Speculative End-Turn Detector	Speculative two-stage inference
—	Smart Turn v3 (pipecat-ai)	8M, BSD-2, CPU 12ms

arXiv ID	Title	Notes
2410.00037	Moshi: a speech-text foundation model for real-time dialogue	Kyutai, 7B, OSS

arXiv ID	Title	Notes
2405.18014	Coupled Mamba: Enhanced Multi-modal Fusion with Coupled SSM	NeurIPS 2024
2502.13145	Multimodal Mamba: Decoder-only Multimodal SSM via Quadratic to Linear Distillation	—
2409.12031	PhysMamba: Efficient Remote Physiological Measurement	rPPG
2503.10898	Trajectory Mamba: Efficient Attention-Mamba Forecasting	—
2504.07654	ms-Mamba: Multi-scale Mamba for Time-Series	—

arXiv ID	Title	Notes
2506.09985	V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning	Meta
2603.14482	V-JEPA 2.1: Unlocking Dense Features in Video SSL	—
2506.03885	Video, How Do Your Tokens Merge?	token merging

arXiv ID	Title	Notes
2410.14045	Human Action Anticipation: A Survey	—
2507.07734	EEvAct: Early Event-Based Action Recognition	spiking NN
2507.13425	CaTFormer: Causal Temporal Transformer for Driving Intention	Note: originally cited as CaSTFormer, but the official name is CaTFormer
2510.09200	Towards Safer and Understandable Driver Intention Prediction	The abbreviation DAAD-X has not been confirmed in the abstract
2603.19533	Pedestrian Crossing Intent Prediction via Psychological Features	—
2603.10061	Decision-Aware Uncertainty Evaluation of VLM-Based Early Action Anticipation for HRI	—
2604.05843	EEG-MFTNet: Cross-Session Motor Imagery Decoding	EEG

Source	Title	Notes
Interspeech 2016	Respiratory Turn-Taking Cues	Włodarczak, Heldner
Sci Rep 2025 (s41598-025-15776-1)	CNS control of breathing in turn-taking	fMRI, 200 ms lead
2006.03790	MTTS-CAN: On-Device Vitals	rPPG, mobile
2111.12082	PhysFormer	rPPG, CVPR 2022

arXiv ID	Title	Notes
2510.00982	Spiralformer: Low Latency Encoder for Streaming Speech	—
2504.02302	Causal Self-supervised Pretrained Frontend with Predictive Code	speech separation
2503.04721	Full-Duplex-Bench	Evaluation benchmark
2509.14515	From Turn-Taking to Synchronous Dialogue: Full-Duplex Survey	—

Source	Title	Notes
ACL Anthology 2025.iwsds-1.27	A Survey of Recent Advances on Turn-taking Modeling	Castillo-López et al.
Computer Speech & Language 2021	Turn-taking in conversational systems and HRI: a review	Skantze