ARIA 88e07487ee feat: add streaming support for real-time TTS
- Added generate_stream() method for token-by-token streaming
- Added generate_and_play() method for real-time playback
- Added decode_chunk() to ncodec codec
- First audio chunk in ~180ms (390% faster than non-streaming)
- Updated README with streaming documentation
2026-03-22 04:40:37 +01:00

MiraTTS

MiraTTS is a finetune of the excellent Spark-TTS model for enhanced realism and stability, performing on par with closed-source models. This repository also heavily optimizes Mira with LMDeploy and boosts quality by using FlashSR, generating high-quality audio at over 100x realtime!

https://github.com/user-attachments/assets/262088ae-068a-49f2-8ad6-ab32c66dcd17

Key benefits

  • Incredibly fast: Over 100x realtime by using LMDeploy and batching.
  • High quality: Generates clear and crisp 48 kHz audio, which is much higher quality than most models produce.
  • Memory efficient: Works within 6 GB of VRAM.
  • Low latency: Latency can be as low as 100 ms.
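
To make the "100x realtime" figure concrete: a realtime factor is commonly computed as seconds of audio produced divided by wall-clock seconds spent generating it. The helper below is an illustrative sketch that assumes this is the definition used here; `realtime_factor` is a hypothetical name and not part of MiraTTS.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Example: 60 s of audio generated in 0.5 s of compute is 120x realtime.
print(realtime_factor(60.0, 0.5))
```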

Usage

Simple 1 line installation:

uv pip install git+https://github.com/ysharma3501/MiraTTS.git

Running the model (batch size 1):

from mira.model import MiraTTS
from IPython.display import Audio
mira_tts = MiraTTS('YatharthS/MiraTTS') ## downloads model from huggingface

file = "reference_file.wav" ## can be mp3/wav/ogg or anything that librosa supports
text = "Alright, so have you ever heard of a little thing named text to speech? Well, it allows you to convert text into speech! I know, that's super cool, isn't it?"

context_tokens = mira_tts.encode_audio(file)
audio = mira_tts.generate(text, context_tokens)

Audio(audio, rate=48000)

Running the model using batching:

file = "reference_file.wav" ## can be mp3/wav/ogg or anything that librosa supports
text = ["Hey, what's up! I am feeling SO happy!", "Honestly, this is really interesting, isn't it?"]

context_tokens = [mira_tts.encode_audio(file)]

audio = mira_tts.batch_generate(text, context_tokens)

Audio(audio, rate=48000)

Streaming (Real-time Audio)

Stream audio chunks as they're generated for ultra-low latency (~180ms to first audio):

from mira.model import MiraTTS

mira_tts = MiraTTS('YatharthS/MiraTTS')
context_tokens = mira_tts.encode_audio("reference_file.wav")
text = "Streaming lets playback start before generation finishes."

# Stream and process chunks in real time
for audio_chunk in mira_tts.generate_stream(text, context_tokens, chunk_size=50):
    # audio_chunk is a torch tensor (48 kHz)
    # process() is a placeholder: replace it with your own playback or handling code
    process(audio_chunk)

Or use the convenience method for immediate playback (requires sounddevice):

# pip install sounddevice
mira_tts.generate_and_play(text, context_tokens, chunk_size=50)
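
If the streamed chunks should be kept rather than played, they can be appended to a WAV file as they arrive. The sketch below uses only the standard library and a stand-in generator in place of `generate_stream()`. It assumes each chunk is a flat sequence of float samples in [-1, 1], which the source does not confirm; `write_chunks_to_wav` is a hypothetical helper, not part of MiraTTS.

```python
import os
import struct
import tempfile
import wave
from typing import Iterable


def write_chunks_to_wav(chunks: Iterable[Iterable[float]], path: str,
                        sample_rate: int = 48000) -> int:
    """Append each chunk to a mono 16-bit WAV file; return samples written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clamp to [-1, 1] (assumed input range) and scale to int16.
            pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in chunk]
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


# Stand-in for generate_stream(): two short chunks of silence.
fake_stream = ([0.0] * 480 for _ in range(2))
out_path = os.path.join(tempfile.gettempdir(), "streamed.wav")
print(write_chunks_to_wav(fake_stream, out_path))
```

With the real model, something like `(chunk.flatten().tolist() for chunk in mira_tts.generate_stream(text, context_tokens))` could be passed as `chunks`, again assuming the tensors hold float samples in [-1, 1].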

Parameters:

  • chunk_size: Number of tokens per chunk (default 50, roughly 1 second of audio). Lower values give a faster first chunk; higher values give smoother audio.
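
As a rough guide when choosing chunk_size: 50 tokens for about 1 second of audio implies roughly 50 tokens per second. The sketch below assumes that rate scales linearly with chunk_size, which the source does not state, so treat the results as estimates. `chunk_estimate` is an illustrative name.

```python
from typing import Tuple

TOKENS_PER_SECOND = 50.0   # implied by "50 tokens = ~1 sec audio"; approximate
SAMPLE_RATE = 48000        # output sample rate in Hz


def chunk_estimate(chunk_size: int) -> Tuple[float, int]:
    """Return (approx. seconds, approx. samples) of audio per chunk."""
    seconds = chunk_size / TOKENS_PER_SECOND
    return seconds, round(seconds * SAMPLE_RATE)


for size in (25, 50, 100):
    print(size, chunk_estimate(size))
```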

Performance:

  • First audio chunk: ~180ms (vs ~870ms for full generation)
  • 390% faster time to first audio
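
Time to first audio can be measured on your own hardware with a small timing helper. This is a sketch, not the benchmark behind the figures above. `time_to_first_item` is a hypothetical name, and the fake generator stands in for `mira_tts.generate_stream(...)`.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_item(stream: Iterable[T]) -> Tuple[float, T, Iterator[T]]:
    """Return (seconds until the first item, that item, the remaining iterator)."""
    iterator = iter(stream)
    start = time.perf_counter()
    first = next(iterator)          # blocks until the first chunk is produced
    return time.perf_counter() - start, first, iterator


def fake_stream():
    # Stand-in for the real stream: 20 ms of "work" before each chunk.
    for i in range(3):
        time.sleep(0.02)
        yield i


latency, first, rest = time_to_first_item(fake_stream())
remaining = len(list(rest))
print(f"first chunk after {latency * 1000:.0f} ms; {remaining} chunks remain")
```

Returning the remaining iterator lets the caller keep consuming the stream after timing the first chunk, so the measurement does not discard any audio.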

Examples are available on the Hugging Face model page.

I recommend reading these two blog posts to better understand LLM-based TTS models and how I optimize them.

Training

Released training code! You can now train the model to be multilingual, multi-speaker, or to support audio events on any local or cloud GPU!

Kaggle notebook: https://www.kaggle.com/code/yatharthsharma888/miratts-training

Colab notebook: https://colab.research.google.com/drive/1IprDyaMKaZrIvykMfNrxWFeuvj-DQPII?usp=sharing

Next steps

  • Release code and model
  • Release training code
  • Support low latency streaming
  • Release native 48 kHz bicodec

Final notes

Thanks very much to the authors of Spark-TTS and unsloth. Thanks for checking out this repository as well.

Stars would be much appreciated, thank you.

Email: yatharthsharma3501@gmail.com