# minimax-speech

This document describes the input and output parameters of the `minimax` text-to-speech (TTS) `hd` and `turbo` series models, for reference when calling the API.

---

## Synchronous Synthesis

### Endpoint

`https://api-us-ca.umodelverse.ai/v1/t2a_v2`

### Input
| Parameter | Type | Required | Description |
| :-------- | :--- | :------- | :---------- |
| model | string | Yes | Requested model version. Available options:<br>`speech-2.8-hd`<br>`speech-2.6-hd`<br>`speech-02-hd`<br>`speech-2.8-turbo`<br>`speech-2.6-turbo`<br>`speech-02-turbo` |
| text | string | Yes | Text to be synthesized into speech. Length must be less than `10000` characters. If the text exceeds `3000` characters, streaming output is recommended.<br>• Use newline characters to mark paragraph breaks<br>• Pause control: supports custom pause duration between spoken text segments. Insert a `<#x#>` tag in the text, where `x` is the pause duration in seconds, ranging from `[0.01, 99.99]`, with up to two decimal places. Pause tags must be placed between two pronounceable text segments, and multiple pause tags cannot be used consecutively<br>• Filler word / vocalization tags: only supported when using `speech-2.8-hd` or `speech-2.8-turbo`. Supported tags include: `(laughs)` (laughter), `(chuckle)` (chuckle), `(coughs)` (cough), `(clear-throat)` (throat clearing), `(groans)` (groan), `(breath)` (normal breathing), `(pant)` (panting), `(inhale)` (inhale), `(exhale)` (exhale), `(gasps)` (gasp), `(sniffs)` (sniff), `(sighs)` (sigh), `(snorts)` (snort), `(burps)` (burp), `(lip-smacking)` (lip smacking), `(humming)` (humming), `(hissing)` (hissing), `(emm)` (emm), `(sneezes)` (sneeze) |
| stream | boolean | No | Controls whether to enable streaming output. Default is `false`, meaning streaming is disabled |
| stream_options | object | No | Options for streaming output |
| stream_options.<br>exclude_aggregated_audio | boolean | No | Whether the final `chunk` should exclude the aggregated audio `hex` data. Default is `false`, meaning the final `chunk` contains the complete aggregated audio `hex` data |
| voice_setting | object | No | Voice settings |
| voice_setting.voice_id | string | No | Voice ID of the synthesized audio. If mixed voices are needed, set `timber_weights` instead and leave this parameter empty. Supports three types: system voices, cloned voices, and generated voices. Voice IDs can be found in the [System Voice List](https://platform.minimaxi.com/docs/faq/system-voice-id) |
| voice_setting.speed | float | No | Speech speed of the synthesized audio. Higher values result in faster speech. Range: `[0.5,2]`, default is `1.0` |
| voice_setting.vol | float | No | Volume of the synthesized audio. Higher values result in louder audio. Range: `(0,10]`, default is `1.0` |
| voice_setting.pitch | int | No | Pitch of the synthesized audio. Range: `[-12,12]`, default is `0`, where `0` means the original voice pitch |
| voice_setting.emotion | enum<string> | No | Controls the emotion of the synthesized speech. Available options: `["happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm", "fluent", "whisper"]`, corresponding to 9 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral/calm, vivid/fluent, whisper<br>`fluent` and `whisper` are only effective for `speech-2.6-hd` and `speech-2.6-turbo`; `speech-2.8-hd` and `speech-2.8-turbo` do not support `whisper` |
| voice_setting.text_normalization | boolean | No | Whether to enable Chinese and English text normalization. Enabling this can improve number reading performance, but slightly increases latency. Default is `false` |
| voice_setting.latex_read | boolean | No | Controls whether to read `latex` formulas aloud. Default is `false`<br>• Chinese only. When enabled, `language_boost` will be set to `Chinese`<br>• Each formula in the request must be wrapped with `$$` at both the beginning and the end<br>• Any `"\"` in a formula must be escaped as `"\\"` |
| audio_setting | object | No | Audio settings |
| audio_setting.sample_rate | int | No | Sample rate of the generated audio. Available options: `[8000,16000,22050,24000,32000,44100]`, default is `32000` |
| audio_setting.bitrate | int | No | Bitrate of the generated audio. Available options: `[32000,64000,128000,256000]`, default is `128000`. This parameter only applies to audio in mp3 format |
| audio_setting.format | enum<string> | No | Format of the generated audio. Default is `mp3`<br>• `wav` is only supported in non-streaming output<br>• Available options: `mp3, pcm, flac, wav` |
| audio_setting.channel | int | No | Number of audio channels. Available options: `[1,2]`, where `1` means mono and `2` means stereo. Default is `1` |
| audio_setting.force_cbr | boolean | No | Controls whether to use constant bitrate (CBR) encoding. Available values: `false`, `true`. When set to `true`, audio will be encoded using constant bitrate<br>Note: this parameter only takes effect when streaming output is enabled and the audio format is `mp3` |
| pronunciation_dict | object | No | Pronunciation settings |
| pronunciation_dict.tone | string[] | No | Defines custom phonetic or pronunciation replacement rules for specific characters or symbols. In Chinese text, tones are represented by numbers:<br>First tone = 1, second tone = 2, third tone = 3, fourth tone = 4, neutral tone = 5<br>Example:<br>`["燕少飞/(yan4)(shao3)(fei1)", "omg/oh my god"]` |
| timber_weights | object[] | No | Voice mixing weights |
| timber_weights.voice_id | string | Yes | Voice ID of the synthesized audio. Must be provided together with the `weight` parameter. Supports system voices, cloned voices, and generated voices. See the full list in the [System Voice List](https://platform.minimaxi.com/docs/faq/system-voice-id) |
| timber_weights.weight | int | Yes | Weight of each voice in the synthesized audio. Must be provided together with `voice_id`. Range: `[1, 100]`. Up to `4` voices can be mixed. A higher weight for a voice means the synthesized result will sound more similar to that voice |
| language_boost | enum<string> | No | Whether to enhance recognition for a specified low-resource language or dialect. Default is `null`; can be set to `auto` to let the model decide automatically. Available options: `Chinese`, `Chinese,Yue`, `English`, `Arabic`, `Russian`, `Spanish`, `French`, `Portuguese`, `German`, `Turkish`, `Dutch`, `Ukrainian`, `Vietnamese`, `Indonesian`, `Japanese`, `Italian`, `Korean`, `Thai`, `Polish`, `Romanian`, `Greek`, `Czech`, `Finnish`, `Hindi`, `Bulgarian`, `Danish`, `Hebrew`, `Malay`, `Persian`, `Slovak`, `Swedish`, `Croatian`, `Filipino`, `Hungarian`, `Norwegian`, `Slovenian`, `Catalan`, `Nynorsk`, `Tamil`, `Afrikaans`, `auto` |
| voice_modify | object | No | Voice effect settings. Supported audio formats:<br>Non-streaming: mp3, wav, flac<br>Streaming: mp3 |
| voice_modify.pitch | int | No | Pitch adjustment (deeper / brighter). Range: `[-100,100]`. Values closer to `-100` make the voice deeper; values closer to `100` make the voice brighter |
| voice_modify.intensity | int | No | Intensity adjustment (stronger / softer). Range: `[-100,100]`. Values closer to `-100` make the voice stronger and more forceful; values closer to `100` make the voice softer |
| voice_modify.timbre | int | No | Timbre adjustment (richer / crisper). Range: `[-100,100]`. Values closer to `-100` make the voice fuller; values closer to `100` make the voice crisper |
| voice_modify.sound_effects | enum<string> | No | Sound effect setting. Only one can be selected per request. Available values:<br>`spacious_echo` (spacious echo)<br>`auditorium_echo` (auditorium broadcast)<br>`lofi_telephone` (telephone distortion)<br>`robotic` (electronic/robotic voice) |
| subtitle_enable | boolean | No | Controls whether to enable subtitle service. Default is `false`. This parameter is only effective in non-streaming output scenarios |
| output_format | enum<string> | No | Controls the output result format. Available values: `[url, hex]`, default is `hex`. This parameter only takes effect in non-streaming scenarios; streaming scenarios only support `hex`. The returned `url` is valid for 24 hours |
| aigc_watermark | boolean | No | Controls whether to append an audio rhythm watermark at the end of the synthesized audio. Default is `false`. This parameter is only effective for non-streaming synthesis |

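The pause-tag rules for `text` can be checked on the client before a request is sent. The following is a minimal sketch; the helper names `pause_tag` and `has_consecutive_pauses` are hypothetical and not part of the API, and treating tags separated only by whitespace as "consecutive" is an assumption. It does not verify that each tag sits between two pronounceable segments.

```python
import re

# Pattern for a single pause tag such as <#0.5#> or <#12.25#>.
PAUSE_TAG = r"<#\d+(?:\.\d+)?#>"


def pause_tag(seconds: float) -> str:
    """Format a pause tag, enforcing the documented [0.01, 99.99] range."""
    value = round(seconds, 2)  # at most two decimal places are allowed
    if not 0.01 <= value <= 99.99:
        raise ValueError(f"pause must be within [0.01, 99.99] seconds, got {seconds}")
    return f"<#{value:.2f}#>"


def has_consecutive_pauses(text: str) -> bool:
    """Return True if two pause tags are separated by nothing but whitespace."""
    return re.search(PAUSE_TAG + r"\s*" + PAUSE_TAG, text) is not None


text = f"Hello there.{pause_tag(1.5)}Nice to meet you."
print(text)                          # Hello there.<#1.50#>Nice to meet you.
print(has_consecutive_pauses(text))  # False
```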

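Voice mixing uses a `timber_weights` list in place of `voice_setting.voice_id`. The sketch below builds that list and enforces the limits stated in the table (at most `4` voices, integer weights in `[1, 100]`). The helper name `build_timber_weights` is hypothetical, and `<ANOTHER_VOICE_ID>` is a placeholder to be replaced with a real voice ID.

```python
import json


def build_timber_weights(weights: dict) -> list:
    """Turn a {voice_id: weight} mapping into the `timber_weights` list."""
    if not 1 <= len(weights) <= 4:
        raise ValueError("between 1 and 4 voices can be mixed")
    for voice_id, weight in weights.items():
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError(f"weight for {voice_id!r} must be an integer in [1, 100]")
    return [{"voice_id": v, "weight": w} for v, w in weights.items()]


payload = {
    "model": "speech-2.8-hd",
    "text": "Mixed voices example.",
    # voice_setting.voice_id is left out because mixed voices are used instead.
    "timber_weights": build_timber_weights(
        {"male-qn-qingse": 70, "<ANOTHER_VOICE_ID>": 30}
    ),
}
print(json.dumps(payload, indent=2))
```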
### Request Example

```shell
curl --location --globoff 'https://api-us-ca.umodelverse.ai/v1/t2a_v2' \
--header 'Authorization: Bearer <YOUR_API_KEY>' \
--header 'Content-Type: application/json' \
--data '{
  "model": "speech-2.8-hd",
  "text": "Are you happy today (laughs)? Of course!",
  "stream": false,
  "voice_setting": {
    "voice_id": "male-qn-qingse",
    "speed": 1,
    "vol": 1,
    "pitch": 0,
    "emotion": "happy"
  },
  "audio_setting": {
    "sample_rate": 32000,
    "bitrate": 128000,
    "format": "mp3",
    "channel": 1
  },
  "pronunciation_dict": {
    "tone": [
      "处理/(chu3)(li3)",
      "危险/dangerous"
    ]
  },
  "subtitle_enable": false
}'
```

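The same request can also be issued from Python. This is a minimal sketch using only the standard library; the function names `build_request` and `synthesize` and the 60-second timeout are assumptions for illustration, not part of the API.

```python
import json
import urllib.request

ENDPOINT = "https://api-us-ca.umodelverse.ai/v1/t2a_v2"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST request with the same headers as the curl example."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, payload: dict, timeout: float = 60.0) -> dict:
    """Send a non-streaming request and return the parsed JSON response."""
    request = build_request(api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = {
    "model": "speech-2.8-hd",
    "text": "Are you happy today (laughs)? Of course!",
    "stream": False,
    "voice_setting": {"voice_id": "male-qn-qingse", "speed": 1, "vol": 1, "pitch": 0, "emotion": "happy"},
    "audio_setting": {"sample_rate": 32000, "bitrate": 128000, "format": "mp3", "channel": 1},
    "subtitle_enable": False,
}
```

Calling `synthesize("<YOUR_API_KEY>", payload)` performs the network request and returns the response as a dictionary.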
### Output

| Parameter                           | Type         | Description |
| :---------------------------------- | :----------- | :---------- |
| data                                | object       | Returned synthesis data object. It may be `null`, so a null check is required |
| data.audio                          | string       | Synthesized audio data, encoded in `hex`. The format is consistent with the output format specified in the request |
| data.subtitle_file                  | string       | Download link for the synthesized subtitle file. The subtitles are aligned to the audio at sentence level (each sentence no more than 50 characters), with timestamps in milliseconds, and are provided in `json` format |
| data.status                         | int          | Current audio stream status: `1` means synthesis in progress, `2` means synthesis completed |
| trace_id                            | string       | ID of the current session, used to help locate issues during consultation or feedback |
| extra_info                          | object       | Additional information about the audio |
| extra_info.audio_length             | int          | Audio duration (milliseconds) |
| extra_info.audio_sample_rate        | int          | Audio sample rate |
| extra_info.audio_size               | int          | Audio file size (bytes) |
| extra_info.bitrate                  | int          | Audio bitrate |
| extra_info.audio_format             | enum<string> | Format of the generated audio file. Available values: `[mp3, pcm, flac]` |
| extra_info.audio_channel            | int          | Number of audio channels generated: `1` for mono, `2` for stereo |
| extra_info.invisible_character_ratio| number       | Ratio of invalid characters. If invalid characters do not exceed 10% (including 10%), the audio will still be generated normally and the ratio will be returned; if it exceeds 10%, an error will be reported |
| extra_info.usage_characters         | int          | Number of billable characters |
| extra_info.word_count               | int          | Count of pronounced characters, including Chinese characters, digits, and letters, but excluding punctuation |
| base_resp                           | object       | Status code and details of the current request |
| base_resp.status_code               | int          | Status code.<br>`0`: Request successful<br>`1000`: Unknown error<br>`1001`: Timeout<br>`1002`: Rate limit triggered<br>`1004`: Authentication failed<br>`1039`: TPM rate limit triggered<br>`1042`: Invalid characters exceed 10%<br>`2013`: Invalid input parameter information |
| base_resp.status_msg                | string       | Status details |

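Because `data` may be `null` and failures are reported through `base_resp`, a client would typically check both before decoding the audio. A minimal sketch, assuming a non-streaming response with `output_format` set to `hex`; `TTSError` and `extract_audio` are hypothetical names:

```python
class TTSError(RuntimeError):
    """Raised when the API reports a non-zero status code or returns no audio."""


def extract_audio(response: dict) -> bytes:
    """Validate a parsed response and return the decoded audio bytes."""
    base = response.get("base_resp") or {}
    code = base.get("status_code")
    if code != 0:
        raise TTSError(f"status_code={code}: {base.get('status_msg')}")
    data = response.get("data")  # may be null, so check before indexing
    if data is None or not data.get("audio"):
        raise TTSError("response contains no audio data")
    return bytes.fromhex(data["audio"])


sample = {
    "data": {"audio": "494433", "status": 2},  # hex for the ASCII bytes "ID3"
    "base_resp": {"status_code": 0, "status_msg": "success"},
}
audio = extract_audio(sample)
print(audio)  # b'ID3'
```

The returned bytes can be written to a file opened in binary mode, with an extension matching `audio_setting.format`.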

### Response Example

```json
{
  "data": {
    "audio": "<hex>",
    "status": 2
  },
  "extra_info": {
    "audio_length": 9900,
    "audio_sample_rate": 32000,
    "audio_size": 160323,
    "bitrate": 128000,
    "word_count": 52,
    "invisible_character_ratio": 0,
    "usage_characters": 26,
    "audio_format": "mp3",
    "audio_channel": 1
  },
  "trace_id": "01b8bf9bb7433cc75c18eee6cfa8fe21",
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}
```

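As a rough sanity check on the values in the example above, the size of a constant-bitrate stream is approximately bitrate × duration / 8. The sketch below applies this to the `extra_info` fields, assuming `bitrate` is expressed in bits per second. The small gap to the reported `audio_size` is presumably header or container overhead, which is an assumption rather than documented behavior.

```python
extra_info = {"audio_length": 9900, "bitrate": 128000, "audio_size": 160323}


def estimated_size_bytes(audio_length_ms: int, bitrate_bps: int) -> int:
    """Estimate the audio size from duration (ms) and bitrate (bits per second)."""
    return audio_length_ms * bitrate_bps // (1000 * 8)


estimate = estimated_size_bytes(extra_info["audio_length"], extra_info["bitrate"])
print(estimate)                             # 158400
print(extra_info["audio_size"] - estimate)  # 1923
```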