feat: add streaming support for real-time TTS

- Added generate_stream() method for token-by-token streaming
- Added generate_and_play() method for real-time playback
- Added decode_chunk() to ncodec codec
- First audio chunk in ~180ms (390% faster than non-streaming)
- Updated README with streaming documentation
2026-03-22 04:40:37 +01:00
commit 88e07487ee
16 changed files with 4671 additions and 0 deletions
@@ -0,0 +1,89 @@
 Metadata-Version: 2.4
 Name: FastNeuTTS
 Version: 0.0.11
 Summary: High quality and Fast TTS with MiraTTS
 Author-email: Yatharth Sharma <yatharthsharma3501@gmail.com>
 Project-URL: Homepage, https://github.com/ysharma3501/MiraTTS
 Project-URL: Issues, https://github.com/ysharma3501/MiraTTS/issues
 Classifier: Programming Language :: Python :: 3
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Operating System :: OS Independent
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 Requires-Dist: lmdeploy
 Requires-Dist: librosa
 Requires-Dist: fastaudiosr @ git+https://github.com/ysharma3501/FlashSR.git
 Requires-Dist: ncodec @ git+https://github.com/ysharma3501/FastBiCodec.git
 Requires-Dist: einops
 Requires-Dist: onnxruntime-gpu
 # MiraTTS
 [MiraTTS](https://huggingface.co/YatharthS/MiraTTS) is a finetune of the excellent [Spark-TTS](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) model for enhanced realism and stability, performing on par with closed-source models.
 This repository also heavily optimizes Mira with [Lmdeploy](https://github.com/InternLM/lmdeploy) and boosts quality by using [FlashSR](https://github.com/ysharma3501/FlashSR) to generate high quality audio at over **100x** realtime!
 https://github.com/user-attachments/assets/262088ae-068a-49f2-8ad6-ab32c66dcd17
 ## Key benefits
 - Incredibly fast: Over 100x realtime by using Lmdeploy and batching.
 - High quality: Generates clear, crisp 48kHz audio, which is much higher quality than most models.
 - Memory efficient: Works within 6 GB of VRAM.
 - Low latency: Latency can be as low as 100ms.
 ## Usage
 Simple 1 line installation:
 ```
 uv pip install git+https://github.com/ysharma3501/MiraTTS.git
 ```
 Running the model(bs=1):
 ```python
 from mira.model import MiraTTS
 from IPython.display import Audio
 mira_tts = MiraTTS('YatharthS/MiraTTS') ## downloads model from huggingface
 file = "reference_file.wav" ## can be mp3/wav/ogg or anything that librosa supports
 text = "Alright, so have you ever heard of a little thing named text to speech? Well, it allows you to convert text into speech! I know, that's super cool, isn't it?"
 context_tokens = mira_tts.encode_audio(file)
 audio = mira_tts.generate(text, context_tokens)
 Audio(audio, rate=48000)
 ```
 Running the model using batching: 
 ```python
 file = "reference_file.wav" ## can be mp3/wav/ogg or anything that librosa supports
 text = ["Hey, what's up! I am feeling SO happy!", "Honestly, this is really interesting, isn't it?"]
 context_tokens = [mira_tts.encode_audio(file)]
 audio = mira_tts.batch_generate(text, context_tokens)
 Audio(audio, rate=48000)
 ```
 Examples can be seen in the [huggingface model](https://huggingface.co/YatharthS/MiraTTS)
 I recommend reading these two blog posts to better understand LLM TTS models and how I optimize them:
 - How they work: https://huggingface.co/blog/YatharthS/llm-tts-models
 - How to optimize them: https://huggingface.co/blog/YatharthS/making-neutts-200x-realtime
 ## Training
 Released training code! You can now train the model to be multilingual, multi-speaker, or support audio events on any local or cloud gpu!
 Kaggle notebook: https://www.kaggle.com/code/yatharthsharma888/miratts-training
 Colab notebook: https://colab.research.google.com/drive/1IprDyaMKaZrIvykMfNrxWFeuvj-DQPII?usp=sharing
 ## Next steps
 - [x] Release code and model
 - [x] Release training code
 - [ ] Support low latency streaming
 - [ ] Release native 48khz bicodec
 ## Final notes
 Thanks very much to the authors of Spark-TTS and unsloth. Thanks for checking out this repository as well.
 Stars would be well appreciated, thank you.
 Email: yatharthsharma3501@gmail.com
@@ -0,0 +1,10 @@
 README.md
 pyproject.toml
 FastNeuTTS.egg-info/PKG-INFO
 FastNeuTTS.egg-info/SOURCES.txt
 FastNeuTTS.egg-info/dependency_links.txt
 FastNeuTTS.egg-info/requires.txt
 FastNeuTTS.egg-info/top_level.txt
 mira/__init__.py
 mira/model.py
 mira/utils.py
@@ -0,0 +1 @@
@@ -0,0 +1,6 @@
 lmdeploy
 librosa
 fastaudiosr @ git+https://github.com/ysharma3501/FlashSR.git
 ncodec @ git+https://github.com/ysharma3501/FastBiCodec.git
 einops
 onnxruntime-gpu
@@ -0,0 +1 @@
 mira
@@ -0,0 +1,101 @@
 # MiraTTS
 [MiraTTS](https://huggingface.co/YatharthS/MiraTTS) is a finetune of the excellent [Spark-TTS](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) model for enhanced realism and stability, performing on par with closed-source models.
 This repository also heavily optimizes Mira with [Lmdeploy](https://github.com/InternLM/lmdeploy) and boosts quality by using [FlashSR](https://github.com/ysharma3501/FlashSR) to generate high quality audio at over **100x** realtime!
 https://github.com/user-attachments/assets/262088ae-068a-49f2-8ad6-ab32c66dcd17
 ## Key benefits
 - Incredibly fast: Over 100x realtime by using Lmdeploy and batching.
 - High quality: Generates clear, crisp 48kHz audio, which is much higher quality than most models.
 - Memory efficient: Works within 6 GB of VRAM.
 - Low latency: Latency can be as low as 100ms.
 ## Usage
 Simple 1 line installation:
 ```
 uv pip install git+https://github.com/ysharma3501/MiraTTS.git
 ```
 Running the model(bs=1):
 ```python
 from mira.model import MiraTTS
 from IPython.display import Audio
 mira_tts = MiraTTS('YatharthS/MiraTTS') ## downloads model from huggingface
 file = "reference_file.wav" ## can be mp3/wav/ogg or anything that librosa supports
 text = "Alright, so have you ever heard of a little thing named text to speech? Well, it allows you to convert text into speech! I know, that's super cool, isn't it?"
 context_tokens = mira_tts.encode_audio(file)
 audio = mira_tts.generate(text, context_tokens)
 Audio(audio, rate=48000)
 ```
 Running the model using batching: 
 ```python
 file = "reference_file.wav" ## can be mp3/wav/ogg or anything that librosa supports
 text = ["Hey, what's up! I am feeling SO happy!", "Honestly, this is really interesting, isn't it?"]
 context_tokens = [mira_tts.encode_audio(file)]
 audio = mira_tts.batch_generate(text, context_tokens)
 Audio(audio, rate=48000)
 ```
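The batching example passes a one-element `context_tokens` list alongside two texts. In `mira/model.py`, `batch_generate` pairs them with `zip(prompts, cycle(context_tokens))`, so a shorter context list is reused round-robin. A minimal sketch of that pairing, using plain strings as hypothetical stand-ins for real prompts and context tokens:

```python
from itertools import cycle


def pair_prompts_with_contexts(prompts, context_tokens):
    """Pair each prompt with a context, reusing contexts round-robin."""
    # zip stops at the end of `prompts`; cycle repeats the contexts as needed.
    return list(zip(prompts, cycle(context_tokens)))


# Hypothetical placeholder values, not real model inputs.
prompts = ["first sentence", "second sentence", "third sentence"]

one_voice = pair_prompts_with_contexts(prompts, ["voice_a"])
two_voices = pair_prompts_with_contexts(prompts, ["voice_a", "voice_b"])

print(one_voice)
print(two_voices)
```

With one context every prompt gets the same voice; with two, the voices alternate. An empty context list yields no pairs at all, so at least one set of context tokens is needed.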
 ## Streaming (Real-time Audio)
 Stream audio chunks as they're generated for ultra-low latency (~180ms to first audio):
 ```python
 from mira.model import MiraTTS
 mira_tts = MiraTTS('YatharthS/MiraTTS')
 context_tokens = mira_tts.encode_audio("reference_file.wav")
 text = "Streaming lets playback start before the whole sentence has been generated."
 # Stream and process chunks in real-time
 for audio_chunk in mira_tts.generate_stream(text, context_tokens, chunk_size=50):
    # audio_chunk is a torch tensor (48kHz)
    # Process/play each chunk as it arrives
    process(audio_chunk)  # placeholder: replace with your own playback or buffering code
 ```
 Or use the convenience method for immediate playback (requires `sounddevice`):
 ```python
 # pip install sounddevice
 mira_tts.generate_and_play(text, context_tokens, chunk_size=50)
 ```
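If the chunks should be saved rather than played, they can be written to a file as they arrive. The sketch below is an illustration only: `fake_stream` is a hypothetical stand-in for `generate_stream` that yields lists of floats, and `write_stream_to_wav` is not part of this repository. It uses only the standard library and assumes mono samples in the range [-1, 1].

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 48000  # matches the 48kHz output described above


def fake_stream(num_chunks=3, samples_per_chunk=4800):
    """Hypothetical stand-in for generate_stream: yields chunks of a 440 Hz tone."""
    for c in range(num_chunks):
        start = c * samples_per_chunk
        yield [
            0.5 * math.sin(2 * math.pi * 440 * (start + n) / SAMPLE_RATE)
            for n in range(samples_per_chunk)
        ]


def write_stream_to_wav(chunks, target, sample_rate=SAMPLE_RATE):
    """Write float chunks to a 16-bit mono WAV as they arrive; return samples written."""
    total = 0
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            clipped = [max(-1.0, min(1.0, s)) for s in chunk]
            pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
            wav.writeframes(pcm)
            total += len(clipped)
    return total


buffer = io.BytesIO()  # an in-memory file; a path such as "out.wav" also works
n_samples = write_stream_to_wav(fake_stream(), buffer)
print(n_samples / SAMPLE_RATE, "seconds written")
```

With real output, each tensor would first need converting to a flat sequence of floats, for example with `audio_chunk.cpu().numpy().flatten()` as `generate_and_play` does.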
 **Parameters:**
 - `chunk_size`: Tokens per chunk (default 50 = ~1 sec audio). Lower = faster first chunk, higher = smoother audio.
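To make the trade-off concrete, here is a rough sketch of how `chunk_size` relates to chunk duration and to the way tokens are grouped before decoding. It assumes the ~20 ms per token figure from the `generate_stream` docstring; `chunk_duration_seconds` and `group_tokens` are illustrative helpers, not functions from this repository.

```python
MS_PER_TOKEN = 20  # assumption taken from the generate_stream docstring (~20 ms/token)


def chunk_duration_seconds(chunk_size, ms_per_token=MS_PER_TOKEN):
    """Approximate audio duration covered by one chunk."""
    return chunk_size * ms_per_token / 1000


def group_tokens(token_batches, chunk_size):
    """Yield lists of chunk_size tokens, then a final shorter remainder if any."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    pending = []
    for batch in token_batches:
        pending.extend(batch)
        # Emit every complete chunk that is available so far.
        while len(pending) >= chunk_size:
            yield pending[:chunk_size]
            pending = pending[chunk_size:]
    if pending:
        yield pending


# Tokens arrive in uneven batches, as they might from a streaming generator.
batches = [[1, 2, 3], [4, 5, 6, 7], [8]]
chunks = list(group_tokens(batches, chunk_size=3))

print(chunks)
print(chunk_duration_seconds(50), chunk_duration_seconds(25))
```

Under that assumption, halving `chunk_size` from 50 to 25 halves the audio that must be generated before the first chunk can be decoded, at the cost of twice as many decode calls.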
 **Performance:**
 - First audio chunk: ~180ms (vs ~870ms for full generation)
 - 390% faster time to first audio
 Examples can be seen in the [huggingface model](https://huggingface.co/YatharthS/MiraTTS)
 I recommend reading these two blog posts to better understand LLM TTS models and how I optimize them:
 - How they work: https://huggingface.co/blog/YatharthS/llm-tts-models
 - How to optimize them: https://huggingface.co/blog/YatharthS/making-neutts-200x-realtime
 ## Training
 Released training code! You can now train the model to be multilingual, multi-speaker, or support audio events on any local or cloud gpu!
 Kaggle notebook: https://www.kaggle.com/code/yatharthsharma888/miratts-training
 Colab notebook: https://colab.research.google.com/drive/1IprDyaMKaZrIvykMfNrxWFeuvj-DQPII?usp=sharing
 ## Next steps
 - [x] Release code and model
 - [x] Release training code
 - [x] Support low latency streaming
 - [ ] Release native 48khz bicodec
 ## Final notes
 Thanks very much to the authors of Spark-TTS and unsloth. Thanks for checking out this repository as well.
 Stars would be well appreciated, thank you.
 Email: yatharthsharma3501@gmail.com
@@ -0,0 +1 @@
@@ -0,0 +1 @@
@@ -0,0 +1,209 @@
 import gc
 import re
 import torch
 from itertools import cycle
 from ncodec.codec import TTSCodec
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 from mira.utils import clear_cache, split_text
 class MiraTTS:
    def __init__(
        self,
        model_dir="YatharthS/MiraTTS",
        tp=1,
        enable_prefix_caching=True,
        cache_max_entry_count=0.2,
        default_chunk_size=50,
    ):
        backend_config = TurbomindEngineConfig(
            cache_max_entry_count=cache_max_entry_count,
            tp=tp,
            dtype="bfloat16",
            enable_prefix_caching=enable_prefix_caching,
        )
        self.pipe = pipeline(model_dir, backend_config=backend_config)
        self.gen_config = GenerationConfig(
            top_p=0.95,
            top_k=50,
            temperature=0.8,
            max_new_tokens=1024,
            repetition_penalty=1.2,
            do_sample=True,
            min_p=0.05,
        )
        self.codec = TTSCodec()
        self.default_chunk_size = default_chunk_size
        # Decoder warm-up is deferred: warmup_decoder() runs on the first generate_stream() call to reduce TTFA
        self._decoder_warmed = False
    def set_params(
        self,
        top_p=0.95,
        top_k=50,
        temperature=0.8,
        max_new_tokens=1024,
        repetition_penalty=1.2,
        min_p=0.05,
    ):
        """sets sampling parameters for the llm"""
        self.gen_config = GenerationConfig(
            top_p=top_p,
            top_k=top_k,
            temperature=temperature,
            max_new_tokens=max_new_tokens,
            repetition_penalty=repetition_penalty,
            min_p=min_p,
            do_sample=True,
        )
    def c_cache(self):
        clear_cache()
    def split_text(self, text):
        return split_text(text)
    def encode_audio(self, audio_file):
        """encodes audio into context tokens"""
        context_tokens = self.codec.encode(audio_file)
        return context_tokens
    def warmup_decoder(self, context_tokens=None):
        """Warm up the decoder once to reduce time to first audio (TTFA) when streaming."""
        if self._decoder_warmed:
            return
        # Compare against None so non-string context tokens never hit an ambiguous truth test
        if context_tokens is None:
            # No reference supplied: fall back to placeholder context tokens
            context_tokens = "".join([f"<|context_token_{i}|>" for i in range(10)])
        dummy_tokens = "<|speech_token_0|><|speech_token_1|>"
        _ = self.codec.decode_chunk(dummy_tokens, context_tokens)
        self._decoder_warmed = True
    def generate(self, text, context_tokens):
        """generates speech from input text"""
        formatted_prompt = self.codec.format_prompt(text, context_tokens, None)
        response = self.pipe(
            [formatted_prompt], gen_config=self.gen_config, do_preprocess=False
        )
        audio = self.codec.decode(response[0].text, context_tokens)
        return audio
    def generate_stream(self, text, context_tokens, chunk_size=None):
        """
        Generates speech from input text with streaming output.
        Args:
            text: Input text to synthesize
            context_tokens: Reference audio context tokens
            chunk_size: Number of tokens to decode before yielding audio (default from __init__ or 50 = ~1 sec at 20ms/token)
        Yields:
            Audio chunks as torch tensors (48kHz)
        """
        if chunk_size is None:
            chunk_size = self.default_chunk_size
        self.warmup_decoder(context_tokens)
        formatted_prompt = self.codec.format_prompt(text, context_tokens, None)
        responses = self.pipe.stream_infer(
            [formatted_prompt],
            gen_config=self.gen_config,
            do_preprocess=False,
            stream_response=True,
        )
        if chunk_size < 1:
            raise ValueError("chunk_size must be at least 1")
        accumulated_tokens = []
        for response in responses:
            # Each streamed response is assumed to carry only newly generated text
            new_tokens = re.findall(r"speech_token_(\d+)", response.text)
            accumulated_tokens.extend(int(t) for t in new_tokens)
            # Decode and yield every complete chunk that is available
            while len(accumulated_tokens) >= chunk_size:
                chunk_tokens = accumulated_tokens[:chunk_size]
                accumulated_tokens = accumulated_tokens[chunk_size:]
                token_str = "".join(f"<|speech_token_{t}|>" for t in chunk_tokens)
                yield self.codec.decode_chunk(token_str, context_tokens)
            if response.finish_reason:
                break
        # Flush the remaining tokens as a final, shorter chunk
        if accumulated_tokens:
            token_str = "".join(f"<|speech_token_{t}|>" for t in accumulated_tokens)
            yield self.codec.decode_chunk(token_str, context_tokens)
    def batch_generate(self, prompts, context_tokens):
        """
        Generates speech for a batch of prompts.
        Args:
            prompts (list): Input texts for the TTS model
            context_tokens (list): Reference audio context tokens, cycled if shorter than prompts
        """
        formatted_prompts = []
        for prompt, context_token in zip(prompts, cycle(context_tokens)):
            formatted_prompt = self.codec.format_prompt(prompt, context_token, None)
            formatted_prompts.append(formatted_prompt)
        responses = self.pipe(
            formatted_prompts, gen_config=self.gen_config, do_preprocess=False
        )
        generated_tokens = [response.text for response in responses]
        audios = []
        for generated_token, context_token in zip(
            generated_tokens, cycle(context_tokens)
        ):
            audio = self.codec.decode(generated_token, context_token)
            audios.append(audio)
        audios = torch.cat(audios, dim=0)
        return audios
    def generate_and_play(
        self, text, context_tokens, chunk_size=None, samplerate=48000
    ):
        """
        Generates and plays audio in real-time using streaming.
        Requires sounddevice: pip install sounddevice
        Args:
            text: Input text to synthesize
            context_tokens: Reference audio context tokens
            chunk_size: Number of tokens per chunk (default from __init__ or 50 = ~1 sec)
            samplerate: Audio sample rate (default 48000)
        """
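        try:
            import sounddevice as sd
        except ImportError as err:
            raise ImportError(
                "sounddevice required for playback. Install with: pip install sounddevice"
            ) from err
        # sd.play() stops any playback already in progress, so calling it once per chunk
        # would cut earlier chunks short. Writing to one output stream queues them instead.
        with sd.OutputStream(samplerate=samplerate, channels=1, dtype="float32") as stream:
            for audio_chunk in self.generate_stream(
                text, context_tokens, chunk_size=chunk_size
            ):
                samples = audio_chunk.detach().float().cpu().numpy().reshape(-1, 1)
                stream.write(samples)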
@@ -0,0 +1,11 @@
 import re
 import gc
 import torch
 def split_text(text):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return sentences
 def clear_cache():
    gc.collect()
    torch.cuda.empty_cache()
@@ -0,0 +1,30 @@
 [build-system]
 requires = ["setuptools>=61.0", "wheel"]
 build-backend = "setuptools.build_meta"
 [project]
 name = "FastNeuTTS"
 version = "0.0.11"
 authors = [
  { name="Yatharth Sharma", email="yatharthsharma3501@gmail.com" },
 ]
 description = "High quality and Fast TTS with MiraTTS"
 readme = "README.md"
 requires-python = ">=3.10"
 classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
 ]
 dependencies = [
    "lmdeploy",
    "librosa",
    "fastaudiosr @ git+https://github.com/ysharma3501/FlashSR.git",
    "ncodec @ git+https://github.com/ysharma3501/FastBiCodec.git",
    "einops",
    "onnxruntime-gpu"
 ]
 [project.urls]
 Homepage = "https://github.com/ysharma3501/MiraTTS"
 Issues = "https://github.com/ysharma3501/MiraTTS/issues"