The model ID to use. Supports stable models like gemini-3.1-flash-live, preview models, and even undocumented alpha/beta variants. You can type any model ID, and the system will attempt to connect regardless.
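Because any ID is accepted, the client only has to normalize what the user types before sending it. This is a minimal sketch. It assumes the Live API setup message expects a `models/<id>` resource name, and `to_model_resource` is a hypothetical helper, not part of this app.

```python
def to_model_resource(model_id: str) -> str:
    """Return the model ID in the 'models/<id>' form (assumed setup-message format)."""
    model_id = model_id.strip()
    if not model_id:
        raise ValueError("model ID must not be empty")
    # Accept IDs that are already prefixed, so "models/x" and "x" behave the same.
    return model_id if model_id.startswith("models/") else f"models/{model_id}"


print(to_model_resource("gemini-3.1-flash-live"))  # models/gemini-3.1-flash-live
```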
The voice persona for speech output. 30 voices are available, ranging from Puck (upbeat) to Gacrux (mature). Each voice has a distinct personality, pitch, and cadence. You can type any voice name, including experimental ones.
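The chosen voice name ends up as a single string nested in the session's speech settings. This is a sketch only. The camelCase field names follow the Live API's JSON setup format as we understand it and should be treated as an assumption. `speech_config` is a hypothetical helper.

```python
def speech_config(voice_name: str) -> dict:
    """Build the speech settings for a prebuilt voice (field names assumed)."""
    if not voice_name.strip():
        raise ValueError("voice name must not be empty")
    # Any name is passed through unchanged, which is what lets experimental voices be tried.
    return {"voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name.strip()}}}


print(speech_config("Puck"))
```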
Instructions that shape the model's behavior and personality. This is sent once at session start and guides every response. Keep it concise for voice — overly long prompts can add latency.
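Because the instructions travel only in the one-time setup message, their size is paid once per session rather than per turn. The sketch below assumes a `systemInstruction` object made of text `parts`. That layout is an assumption about the wire format, and `setup_message` is a hypothetical helper.

```python
import json


def setup_message(model: str, system_prompt: str) -> str:
    """Serialize a minimal setup message carrying the system instruction (layout assumed)."""
    setup = {"model": model}
    prompt = system_prompt.strip()
    if prompt:
        # Omit the field entirely when the prompt is blank.
        setup["systemInstruction"] = {"parts": [{"text": prompt}]}
    return json.dumps({"setup": setup})


print(setup_message("models/gemini-3.1-flash-live", "You are a concise voice assistant."))
```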
Controls randomness in the model's output. 0.0 = near-deterministic and focused. 2.0 = maximum creativity and variation. For voice assistants, 0.5 to 0.9 is the sweet spot: natural but not chaotic.
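Temperature divides the model's raw scores (logits) before they are turned into probabilities. Low values sharpen the distribution toward the top token, and high values flatten it. This is an illustrative sketch of the standard technique using made-up logits. It is not the server's implementation.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy: all probability on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.7, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```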
Limits sampling to the smallest set of most likely tokens whose combined probability reaches Top P (nucleus sampling). 0.1 = only the most likely tokens. 0.95 = broad vocabulary. Works alongside temperature: a lower Top P constrains output even at higher temperatures.
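A minimal sketch of nucleus (Top P) filtering over a toy distribution. Tokens are taken in order of probability until their running total reaches `top_p`, and the survivors are then renormalized. The probabilities are invented for illustration.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest high-probability set whose cumulative mass reaches top_p, then renormalize."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.1))  # only "the" survives
print(top_p_filter(probs, 0.9))  # "zebra" is dropped
```

The size of the kept set adapts to the distribution. A confident model keeps few tokens and an uncertain one keeps many.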
Limits sampling to the K most likely tokens at each step. 1 = greedy (always pick the most likely token). 40 = moderate diversity. Unlike Top P, the cutoff is a fixed number of tokens no matter how the probability is spread, which makes it useful for preventing rare or odd tokens.
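For comparison with the Top P sketch, Top K keeps a fixed number of candidates regardless of their probabilities. This is an illustrative implementation on a toy distribution, not the model's internal code.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most likely tokens and renormalize; k=1 is greedy decoding."""
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
print(top_k_filter(probs, 1))  # greedy: {'the': 1.0}
print(top_k_filter(probs, 2))  # always exactly two candidates
```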
Caps the length of each model response in tokens. Leave empty for the model's default. Lower values force shorter, snappier answers — ideal for voice where brevity matters.
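Leaving the field empty should mean the key is omitted altogether, so the model's default applies. This sketch assembles the sampling fields above into one generation config. The camelCase key names and the default values shown are assumptions for illustration, and `generation_config` is a hypothetical helper.

```python
from typing import Optional


def generation_config(temperature: float = 0.7, top_p: float = 0.95, top_k: int = 40,
                      max_output_tokens: Optional[int] = None) -> dict:
    """Build generation settings; an empty max length is left out entirely (keys assumed)."""
    config = {"temperature": temperature, "topP": top_p, "topK": top_k}
    if max_output_tokens is not None:
        if max_output_tokens <= 0:
            raise ValueError("max_output_tokens must be positive")
        config["maxOutputTokens"] = max_output_tokens
    return config


print(generation_config())                       # no maxOutputTokens key: model default
print(generation_config(max_output_tokens=150))  # short, snappy voice replies
```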
Input: The sample rate of your microphone audio sent to the model. 16000 Hz is standard for speech. Output: The sample rate of audio returned by the model. 24000 Hz gives higher quality playback.
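The sample rate directly sets how much raw audio is streamed. This arithmetic sketch assumes 16-bit (2-byte) mono PCM, a common format for speech audio. The 20 ms chunk size is an arbitrary illustrative choice.

```python
def pcm_chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes in one chunk of raw PCM audio (16-bit mono by default)."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


print(pcm_chunk_bytes(16000, 20))  # 640 bytes per 20 ms of microphone input
print(pcm_chunk_bytes(24000, 20))  # 960 bytes per 20 ms of model output
```

At these rates, input is 32,000 bytes per second and output is 48,000 bytes per second.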
When enabled, the user can interrupt the model mid-speech. The model will stop talking and listen. When disabled, the model finishes speaking before processing new input — useful for long-form explanations.
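The toggle reduces to a single enum value in the realtime input settings. This is a sketch. The enum names `START_OF_ACTIVITY_INTERRUPTS` and `NO_INTERRUPTION` are taken from the Live API's activity-handling option as we understand it and should be verified against the API reference. `activity_handling` is a hypothetical helper.

```python
def activity_handling(allow_interruptions: bool) -> str:
    """Map the UI toggle to an activity-handling enum string (names assumed)."""
    return "START_OF_ACTIVITY_INTERRUPTS" if allow_interruptions else "NO_INTERRUPTION"


realtime_input_config = {"activityHandling": activity_handling(True)}
print(realtime_input_config)
```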
High: Catches speech quickly, even soft starts — but may false-trigger on background noise. Low: More conservative, ignores faint sounds — better for noisy environments.
High: Responds immediately when you stop talking — snappier but may cut you off mid-pause. Low: Waits longer to confirm you're done — more patient, handles long pauses better.
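The two High/Low choices above correspond to separate start-of-speech and end-of-speech settings. This sketch maps them to enum strings. The `START_SENSITIVITY_*` and `END_SENSITIVITY_*` names and the surrounding keys are assumptions about the automatic activity detection config, and `vad_sensitivity` is a hypothetical helper.

```python
def vad_sensitivity(start: str, end: str) -> dict:
    """Map the UI's High/Low choices to activity-detection enum strings (names assumed)."""
    levels = {"high": "HIGH", "low": "LOW"}
    try:
        s, e = levels[start.lower()], levels[end.lower()]
    except KeyError as exc:
        raise ValueError(f"sensitivity must be 'High' or 'Low', got {exc.args[0]!r}") from None
    return {
        "startOfSpeechSensitivity": f"START_SENSITIVITY_{s}",
        "endOfSpeechSensitivity": f"END_SENSITIVITY_{e}",
    }


# Noisy room: ignore faint sounds, and wait patiently before ending the turn.
print(vad_sensitivity("Low", "Low"))
```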
Milliseconds of silence required before the model starts responding. 200ms = ultra-fast (may interrupt you). 1000ms = very patient. Default 500ms balances speed and turn-taking.
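To see why short thresholds can cut a speaker off, here is a simplified client-side model of silence-based turn ending. Each frame is labeled as speech or silence. Any speech resets the timer, and the turn ends once the accumulated silence reaches the threshold. This illustrates the idea only and is not the server's actual voice activity detection algorithm.

```python
from typing import Optional


def end_of_turn_frame(is_speech: list[bool], frame_ms: int,
                      silence_duration_ms: int) -> Optional[int]:
    """Return the index of the frame where the turn is judged finished, or None."""
    heard_speech = False
    silent_ms = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_duration_ms:
                return i
    return None


# 100 ms frames: speech, a 300 ms pause, more speech, then a long silence.
frames = [True] * 3 + [False] * 3 + [True] * 2 + [False] * 10
print(end_of_turn_frame(frames, 100, 200))  # 4: the mid-sentence pause ends the turn
print(end_of_turn_frame(frames, 100, 500))  # 12: waits through the pause
```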
Milliseconds of audio captured before speech is actually detected. This "look-back" ensures the beginning of words isn't clipped. 200ms is a safe default.
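The look-back described above can be modeled as a ring buffer holding the most recent audio. When speech is detected, the buffered frames are sent ahead of the live audio so that word onsets survive. This sketch illustrates the concept as described here. The frame size is an assumption, and `PrefixBuffer` is a hypothetical class, not the app's or the API's implementation.

```python
from collections import deque


class PrefixBuffer:
    """Keep the last `prefix_ms` of audio frames for prepending once speech starts."""

    def __init__(self, prefix_ms: int, frame_ms: int):
        self.frames = deque(maxlen=prefix_ms // frame_ms)

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)  # the oldest frame drops off automatically

    def flush(self) -> list[bytes]:
        out = list(self.frames)
        self.frames.clear()
        return out


buf = PrefixBuffer(prefix_ms=200, frame_ms=20)  # holds 10 frames
for n in range(25):
    buf.push(bytes([n]))  # stand-in for 20 ms of PCM audio
look_back = buf.flush()  # frames 15..24: the 200 ms just before detection
```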
v1beta = latest features and models. v1alpha = bleeding-edge, possibly unstable. v1 = stable production API. Some alpha/beta models require specific API versions to work.
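The API version selects which WebSocket endpoint the session connects to. This is a sketch. The URL pattern below follows the Gemini Live API's documented WebSocket endpoint as we recall it and should be treated as an assumption. `live_endpoint` is a hypothetical helper.

```python
from urllib.parse import urlencode

VALID_VERSIONS = {"v1", "v1beta", "v1alpha"}


def live_endpoint(api_version: str, api_key: str) -> str:
    """Build the bidirectional streaming URL for an API version (pattern assumed)."""
    if api_version not in VALID_VERSIONS:
        raise ValueError(f"unknown API version: {api_version}")
    base = ("wss://generativelanguage.googleapis.com/ws/"
            f"google.ai.generativelanguage.{api_version}.GenerativeService.BidiGenerateContent")
    return f"{base}?{urlencode({'key': api_key})}"


print(live_endpoint("v1alpha", "YOUR_API_KEY"))
```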
Raw JSON objects merged into the WebSocket setup message and its generation config. Use these for undocumented features, alpha flags, or any parameter not exposed in the UI. For power users and bleeding-edge testing.
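The text does not say whether the overrides are merged shallowly or deeply. This sketch assumes a recursive merge: nested objects combine, and any other value replaces the existing one. The `someAlphaFlag` key is purely hypothetical.

```python
import copy
import json


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides merged in; nested dicts merge, other values replace."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


setup = {"model": "models/gemini-3.1-flash-live",
         "generationConfig": {"temperature": 0.7, "responseModalities": ["AUDIO"]}}
extra_generation = json.loads('{"temperature": 0.4, "someAlphaFlag": true}')  # hypothetical flag
setup["generationConfig"] = deep_merge(setup["generationConfig"], extra_generation)
print(json.dumps(setup))
```

A recursive merge lets a power user override one nested field without having to restate the entire object.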

Configure, test, and deploy voice sessions
Session Config
No saved agents yet. Configure above and click "Save Agent".