Turn Detection and Interruptions

Source: LiveKit Agents Documentation. Roadmap: Learning Roadmap

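What (what this covers)

This page covers conversation turn management in AgentSession as a whole. It has two axes: turn detection, which determines when the user has finished speaking, and interruption handling, which deals with the user cutting in while the agent is speaking.

Why (why it matters)

Natural conversation in voice AI takes more than simple silence detection. If the user says "Let me think for a second..." and goes quiet, the turn has not ended. The system needs context-aware decisions, and it must avoid mistaking backchannels (short acknowledgments such as "uh-huh") for interruptions.

How (how it works)

Turn detection flow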

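graph LR
    A["Audio input"] --> B["VAD\nspeech / no speech"]
    B --> C["STT\ntranscription + phrase endpointing"]
    C --> D["turn detector model\njudges end of turn from context"]
    D --> E["AgentSession\nstarts the response"]

Phrase endpointing decides whether a given silence or moment marks the end of a phrase. It is a layer on top of VAD, which only reports whether sound is present. The decision comes either from a signal returned by the STT provider or from the turn detector model.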
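
To make the layering concrete, here is a minimal sketch of how a silence duration reported by VAD could be combined with an end-of-utterance probability from a text model. The function name, the 0.5 threshold, and the delay values are illustrative assumptions, not LiveKit's actual implementation.

```python
def should_end_turn(
    silence_s: float,
    eou_probability: float,
    threshold: float = 0.5,  # hypothetical cutoff, not a LiveKit default
    min_delay: float = 0.5,  # hypothetical short wait, in seconds
    max_delay: float = 3.0,  # hypothetical long wait, in seconds
) -> bool:
    """Return True once the silence has lasted long enough to end the turn."""
    # If the transcript looks finished, a short silence is enough.
    # If it looks unfinished, hold off until the longer delay has passed.
    required = min_delay if eou_probability >= threshold else max_delay
    return silence_s >= required


print(should_end_turn(0.8, 0.9))  # text looks complete, short pause suffices
print(should_end_turn(0.8, 0.1))  # text looks incomplete, keep waiting
print(should_end_turn(3.2, 0.1))  # long silence ends the turn regardless
```

The point of the sketch is that VAD alone supplies only `silence_s`; the context-aware layer changes how much silence is required.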

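The 5 turn detection modes

Wrap the setting in TurnHandlingOptions and pass it to AgentSession through the turn_handling parameter.

Mode	Description	Best suited for
`turn_detector_model` (EOU Model)	An open-weights model decides using conversational context	STT-LLM-TTS pipelines (recommended)
`realtime_llm`	Detection built into the realtime model	OpenAI Realtime API and similar
`"vad"`	Decides on silence alone	Multilingual support, simple setups
`"stt"`	Uses the STT provider's endpointing	Supported providers such as AssemblyAI
`"manual"`	Automatic detection off, controlled from code	Push-to-talk and similar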

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit.plugins import silero
 
session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),  # or EnglishModel()
    ),
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)

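About the turn detector model (EOU Model):

Two variants: MultilingualModel (14 languages) and EnglishModel
Open weights: the weight files are public and inference runs locally (no API call, low latency)
Base model: Qwen2.5-0.5B-Instruct, 396 MB, inference latency 50 to 160 ms
Requires STT (the model operates on text input)
When deployed to LiveKit Cloud, an optimized inference service is used automatically

VAD-only configuration: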

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="vad",
    ),
    vad=silero.VAD.load(),
)

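STT endpointing configuration:

from livekit.plugins import assemblyai, silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    stt=assemblyai.STT(),  # AssemblyAI is the recommended provider
    vad=silero.VAD.load(),  # also add VAD so interruptions are detected promptly
)

In "stt" mode alone, interruption detection lags, so pairing it with VAD is recommended: STT decides when the turn ends, and VAD detects interruptions.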

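Endpointing settings:

The endpointing key of TurnHandlingOptions controls how long the session waits before treating a pause as the end of the turn.

"fixed" (default): always waits a fixed min_delay
"dynamic" (Python only): adjusts the wait automatically within the min_delay to max_delay range, based on pause statistics from the session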

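The difference between the two endpointing modes can be sketched as follows. The source does not say which statistic the dynamic mode uses, so the mean of observed pauses below is purely an assumption, as are the function name and the default delay values.

```python
from statistics import mean


def endpointing_delay(
    mode: str,
    pauses_s: list[float],
    min_delay: float = 0.5,  # hypothetical value, in seconds
    max_delay: float = 3.0,  # hypothetical value, in seconds
) -> float:
    """Return how long to wait after silence before ending the turn."""
    if mode == "fixed" or not pauses_s:
        # Fixed mode (or no data yet): always use min_delay.
        return min_delay
    if mode != "dynamic":
        raise ValueError(f"unknown endpointing mode: {mode!r}")
    # Assumed statistic: average pause length seen so far in the session,
    # clamped into the [min_delay, max_delay] range.
    estimate = mean(pauses_s)
    return max(min_delay, min(max_delay, estimate))


print(endpointing_delay("fixed", [1.0, 2.0]))
print(endpointing_delay("dynamic", [1.0, 2.0]))
print(endpointing_delay("dynamic", [10.0]))
```

A speaker who pauses often gets a longer wait, but never beyond max_delay; a fast speaker never gets less than min_delay.
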
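Manual turn control (push-to-talk)

Setting turn_detection="manual" disables automatic detection entirely; turns are then controlled with the following methods.

Method	Meaning
`session.commit_user_turn()`	Declares the user turn finished, so the agent starts responding
`session.interrupt()`	Forcibly stops the agent's speech
`session.clear_user_turn()`	Discards the buffered audio

In a push-to-talk implementation, the frontend calls these through RPC.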

sequenceDiagram
    participant F as Frontend
    participant S as AgentSession

    F->>S: RPC start_turn
    S->>S: interrupt() + clear_user_turn()
    S->>S: set_audio_enabled(True)
    Note over F,S: User speaks
    F->>S: RPC end_turn
    S->>S: set_audio_enabled(False)
    S->>S: commit_user_turn()
    S->>S: LLM → TTS → response starts

Official sample: https://github.com/livekit/agents/blob/main/examples/voice_agents/push_to_talk.py
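
The call order in the sequence diagram can be sketched with a stand-in object. StubSession below is a hypothetical recorder, not the real AgentSession, and placing set_audio_enabled directly on the session is an assumption taken from the diagram; in the real API the audio toggle may live on a different object, so check the official sample for the exact calls.

```python
class StubSession:
    """Hypothetical stand-in that records calls instead of driving audio."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def interrupt(self) -> None:
        self.calls.append("interrupt")

    def clear_user_turn(self) -> None:
        self.calls.append("clear_user_turn")

    def commit_user_turn(self) -> None:
        self.calls.append("commit_user_turn")

    def set_audio_enabled(self, enabled: bool) -> None:
        self.calls.append(f"set_audio_enabled({enabled})")


def start_turn(session: StubSession) -> None:
    # Button pressed: stop the agent, drop stale audio, open the microphone.
    session.interrupt()
    session.clear_user_turn()
    session.set_audio_enabled(True)


def end_turn(session: StubSession) -> None:
    # Button released: close the microphone, then hand the turn to the agent.
    session.set_audio_enabled(False)
    session.commit_user_turn()


session = StubSession()
start_turn(session)
end_turn(session)
print(session.calls)
```

Closing the microphone before committing means no trailing audio leaks into the committed turn.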

Noise Cancellation

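LiveKit Cloud only. Adding Enhanced Noise Cancellation to the room options improves VAD and STT accuracy, which makes turn detection more stable.

Interruptions (interruption handling)

When the user starts speaking while the agent is talking, the framework stops the agent's speech immediately and automatically trims the conversation history to the portion the user actually heard. This keeps the LLM from replying as if the user had heard the part that was never played.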
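
The trimming idea can be illustrated with a small sketch. It assumes each word of the agent's utterance carries the time at which it finished playing; the data shape and the function name are hypothetical, and the framework's own synchronization mechanism is not described in the source.

```python
def trim_to_heard(words: list[tuple[str, float]], playback_position_s: float) -> str:
    """Keep only the words that finished playing before the interruption.

    words: (word, end_time_s) pairs in spoken order (hypothetical shape).
    """
    heard = [word for word, end_s in words if end_s <= playback_position_s]
    return " ".join(heard)


utterance = [
    ("Your", 0.2),
    ("order", 0.5),
    ("ships", 0.9),
    ("on", 1.1),
    ("Friday", 1.6),
]

# The user cut in 1.0 s into playback, so only the first three words were heard.
print(trim_to_heard(utterance, 1.0))
```

Storing the trimmed text in the history means the next LLM turn does not assume the user already knows the shipping day.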

Interruption mode

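Controlled through the interruption key of TurnHandlingOptions.

enabled

True (default): user speech can interrupt the agent
False: interruptions are disabled. Calling session.interrupt() directly still always works

mode (only applies when enabled=True)

Mode	Description
`"adaptive"`	Default on LiveKit Cloud with a supported STT. Distinguishes backchannels from genuine interruptions
`"vad"`	Decides on voice activity alone. Simple, but prone to treating backchannels as interruptions

Adaptive interruption handling keeps backchannels such as "uh-huh" and "I see" from stopping the agent. That is why it is the default on LiveKit Cloud.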

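False Interruptions

A false interruption occurs when VAD detects sound and stops the agent, but STT produces no text (a cough, background noise, and so on).

Parameter	Description
`false_interruption_timeout`	If no text arrives within this time (seconds), the interruption is judged false. `None` disables the check
`resume_false_interruption`	If `True`, speech resumes after a false interruption (default `True`)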
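
How the two parameters interact can be sketched as a pure decision function. The function name, the returned labels, and the idea of passing the transcript delay as a number are assumptions made for illustration; only the two parameter names come from the table above.

```python
from typing import Optional


def resolve_interruption(
    transcript_delay_s: Optional[float],
    false_interruption_timeout: Optional[float],
    resume_false_interruption: bool = True,
) -> str:
    """Classify what happens after VAD has stopped the agent.

    transcript_delay_s: seconds until the first STT text arrived,
    or None if no text arrived at all (hypothetical input).
    """
    if false_interruption_timeout is None:
        # The check is disabled, so nothing is ever judged false.
        return "not_checked"
    if (
        transcript_delay_s is not None
        and transcript_delay_s <= false_interruption_timeout
    ):
        return "real_interruption"
    # No text within the timeout: treat it as noise.
    return "resume_speech" if resume_false_interruption else "stay_silent"


print(resolve_interruption(0.5, 2.0))          # text arrived in time
print(resolve_interruption(None, 2.0))         # cough or background noise
print(resolve_interruption(None, 2.0, False))  # noise, resuming disabled
```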

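Additional filter parameters

Parameter	Description
`discard_audio_if_uninterruptible`	Discards audio buffered while the agent is uninterruptible
`min_duration`	Speech shorter than this many seconds is not treated as an interruption
`min_words`	Fewer than this many words is not treated as an interruption (only when STT is enabled)

min_duration filters at the audio level and min_words at the text level. Together they form layered defenses against false positives.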
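
The layering of the two filters can be sketched as below. The function and its signature are hypothetical; only min_duration and min_words are names from the source, and splitting on whitespace is a simplification that does not suit every language.

```python
def passes_interruption_filters(
    speech_duration_s: float,
    transcript: str,
    min_duration: float,
    min_words: int,
    stt_enabled: bool = True,
) -> bool:
    """Return True if the user speech should count as an interruption."""
    # Layer 1 (audio level): ignore very short bursts of sound.
    if speech_duration_s < min_duration:
        return False
    # Layer 2 (text level): only applies when a transcript is available.
    if stt_enabled and len(transcript.split()) < min_words:
        return False
    return True


print(passes_interruption_filters(0.2, "stop please", 0.5, 2))  # too short
print(passes_interruption_filters(0.8, "stop", 0.5, 2))         # too few words
print(passes_interruption_filters(0.8, "stop please", 0.5, 2))  # passes both
```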

Session Events

Interruption events

@session.on("user_interruption_detected")
def on_interruption(ev):
    print(f"User interrupted at: {ev.timestamp}")
    print(f"Interruption probability: {ev.probability}")
 
@session.on("agent_false_interruption")
def on_false_interruption(ev):
    print("False interruption detected, resuming speech")

Turn-taking events

from livekit.agents import UserStateChangedEvent, AgentStateChangedEvent
 
@session.on("user_state_changed")
def on_user_state_changed(ev: UserStateChangedEvent):
    # speaking / listening / away
    if ev.new_state == "away":
        print("User is not present")
 
@session.on("agent_state_changed")
def on_agent_state_changed(ev: AgentStateChangedEvent):
    # initializing / idle / listening / thinking / speaking
    if ev.new_state == "thinking":
        print("Agent is processing")

UserState:

State	Meaning
`speaking`	VAD detected the start of user speech
`listening`	VAD detected the end of user speech
`away`	No response for a set period (default 15 seconds)
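
The three user states can be read as a function of VAD activity and idle time. This is a simplified sketch: the function name and inputs are hypothetical, and the real framework may take other conditions into account before marking a user as away.

```python
def user_state(
    vad_speaking: bool,
    idle_s: float,
    away_timeout_s: float = 15.0,  # matches the default stated in the table
) -> str:
    """Derive the user state from VAD output and time since last activity."""
    if vad_speaking:
        return "speaking"
    if idle_s >= away_timeout_s:
        return "away"
    return "listening"


print(user_state(True, 0.0))
print(user_state(False, 3.0))
print(user_state(False, 20.0))
```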

AgentState:

State	Meaning
`initializing`	Starting up
`idle`	Idle
`listening`	Waiting for user input
`thinking`	Generating a response with the LLM
`speaking`	Speaking

Key Concepts

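Term	Description
VAD	Voice Activity Detection. Detects whether sound is present
Phrase endpointing	Decides whether a silence or moment marks the end of a phrase
EOU Model	End of Utterance Model. LiveKit's open-weights turn detection model
False interruption	A false positive where sound was detected but STT produced no text
Adaptive interruption	Interruption detection that distinguishes backchannels from genuine interruptions
TurnHandlingOptions	Configuration object that groups turn detection and interruption settings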
