Microsoft VibeVoice is excellent for creating podcasts, even by cloning our own voice
Ctrl+Alt+Future · 3 Sep 2025

VibeVoice is a novel framework for generating expressive, lifelike, long-form, multi-speaker audio, such as podcasts, from text. It aims to solve three significant challenges of traditional text-to-speech (TTS) systems: scalability, speaker consistency, and natural turn-taking.

The capabilities and special features of the VibeVoice model are as follows:

- Capable of synthesizing conversations with up to four different speakers and generating up to 90 minutes of speech, which exceeds the typical limitations of many previous models.

- Excellent for creating podcasts and similar long-form audio content.

- Supports voice cloning from voice samples. Samples should be clean, with minimal background noise, and at least 3-10 seconds long; 30 seconds is recommended for better quality.

- Text file loading: Scripts can be loaded from .txt files.

- Flexible configuration: Adjustable with parameters such as temperature, sampling, and guidance scale (cfg_scale).
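To make the limits in the list above concrete, here is a minimal sketch of checking a script before synthesis. It does not call VibeVoice itself. The `Speaker N: text` line format, the helper names (`parse_script`, `rate_voice_sample`), and the exact thresholds used to label sample lengths are assumptions for illustration; only the four-speaker limit and the 3-second and 30-second figures come from the notes above.

```python
import re
from dataclasses import dataclass

MAX_SPEAKERS = 4  # limit stated in the show notes

# Assumed script line format: "Speaker 1: Hello and welcome."
LINE_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


@dataclass
class Turn:
    speaker: int
    text: str


def parse_script(script: str) -> list[Turn]:
    """Split a plain-text script into speaker turns and check the speaker limit."""
    turns: list[Turn] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        match = LINE_RE.match(line)
        if match:
            turns.append(Turn(int(match.group(1)), match.group(2).strip()))
        elif turns:
            # A line without a speaker label continues the previous turn.
            turns[-1].text += " " + line
        else:
            raise ValueError(f"Script must start with a speaker label: {line!r}")
    speakers = {turn.speaker for turn in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, at most {MAX_SPEAKERS} supported")
    return turns


def rate_voice_sample(seconds: float) -> str:
    """Label a voice sample by length (hypothetical labels and cut-offs)."""
    if seconds < 3:
        return "too short"
    if seconds < 30:
        return "usable"
    return "recommended"


if __name__ == "__main__":
    demo = "Speaker 1: Welcome to the show.\nSpeaker 2: Glad to be here."
    for turn in parse_script(demo):
        print(turn.speaker, turn.text)
    print(rate_voice_sample(12.0))
```

In practice the script text would be read from a .txt file, for example with `pathlib.Path("script.txt").read_text(encoding="utf-8")`, and passed to `parse_script`.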


Two model options:


- VibeVoice-1.5B: Provides faster inference and has a download size of approximately 5 GB, ideal for single speakers and rapid prototyping.


- VibeVoice-7B-Preview: Provides higher-quality output, especially for multi-speaker conversations, at the cost of slower inference and a download size of approximately 17 GB.


- Technological innovation: A core innovation is the use of continuous speech tokenizers (acoustic and semantic) that operate at an extremely low frame rate of 7.5 Hz. These tokenizers achieve a compression ratio of 3200x while maintaining audio fidelity, which greatly improves computational efficiency when processing long sequences.


- LLM-based next-token diffusion framework: The model uses a large language model (LLM, e.g. Qwen2.5) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
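The tokenizer figures above can be put into perspective with a little arithmetic. The sketch below assumes the 3200x ratio means "raw audio samples per tokenizer frame"; under that assumption the two stated numbers imply a 24 kHz input sample rate. The function names are illustrative and not part of VibeVoice.

```python
FRAME_RATE_HZ = 7.5       # tokenizer frame rate from the notes
COMPRESSION_RATIO = 3200  # compression ratio from the notes
MAX_MINUTES = 90          # maximum generation length from the notes


def implied_sample_rate_hz(frame_rate_hz: float, ratio: float) -> float:
    """Sample rate implied if `ratio` counts raw samples per frame (assumption)."""
    return frame_rate_hz * ratio


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return minutes * 60 * frame_rate_hz


sample_rate = implied_sample_rate_hz(FRAME_RATE_HZ, COMPRESSION_RATIO)
frames = frames_for(MAX_MINUTES)
raw_samples = MAX_MINUTES * 60 * sample_rate

print(sample_rate)  # 24000.0
print(frames)       # 40500.0
print(raw_samples)  # 129600000.0
```

Under these assumptions, a full 90-minute session is about 40,500 frames rather than roughly 130 million raw samples, which is what makes such long sequences tractable for a language model.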


Results and performance: The VibeVoice-7B model outperforms most state-of-the-art models in long-form conversational speech generation, on both subjective and objective measures, and is rated higher for realism, richness, and overall preference.


Note that the model works best with English and Chinese text. The VibeVoice model itself is intended for research purposes and is subject to Microsoft’s license terms.


Links

- Microsoft VibeVoice: https://microsoft.github.io/VibeVoice/

- Technical Report: https://arxiv.org/pdf/2508.19205

- GitHub: https://github.com/microsoft/VibeVoice

- Google Colab: https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/VibeVoice_colab.ipynb

- Hugging Face VibeVoice-1.5B: https://huggingface.co/microsoft/VibeVoice-1.5B

- Hugging Face VibeVoice-7B-Large: https://huggingface.co/WestZhang/VibeVoice-Large-pt

- ComfyUI: https://github.com/Enemyx-net/VibeVoice-ComfyUI

- Audacity: https://www.audacityteam.org/

Episodes (15)

OpenAI gpt-oss: OpenAI's latest development in open source AI models

We’d like to introduce OpenAI’s latest development in open source AI models: the gpt-oss series. These two open-weight language models, gpt-oss-120b and gpt-oss-20b, have been tested by OpenAI to deli...

3 Sep 2025 · 51 min

Qwen-Image-Edit: Image editing with artificial intelligence. No need for Photoshop anymore?

Today, we will look at an AI model that simplifies image editing: Qwen-Image-Edit. This model builds on the foundation of the original, high-performance Qwen-Image, and brings amazing capabilities in ...

3 Sep 2025 · 27 min

ByteDance Seed-OSS-36B, a large language model specifically for long context understanding and reasoning

Seed-OSS is a set of open-source large-scale language models developed by ByteDance Seed Team, designed to provide powerful capabilities in long-context understanding, reasoning, and agentic tasks. It...

3 Sep 2025 · 39 min

Deep Cogito - Cogito v2: Free model. Using a unique, iterative self-learning method (IDA)

According to developer Deep Cogito, Cogito v2 is one of the world’s most powerful open-source AI models, available in sizes ranging from 70B to 671B parameters. Thanks to its unique, iterative self-le...

3 Sep 2025 · 47 min

Mastering Prompt Tricks with Large Language Models

In this episode, we dive deep into the art of crafting effective prompts for large language models. Join our hosts as they explore essential techniques to optimize outputs, enhance creativity, and imp...

26 Sep 2024 · 10 min

AI in Enterprise

The rapid development of AI has outpaced the ability of many organisations to adapt. This discrepancy presents both challenges and opportunities. While there is growing pressure to utilize AI for its...

13 Sep 2024 · 4 min