In the rapidly evolving field of artificial intelligence, one critical gap became apparent to me as I was helping my friend’s startup locate data for training their multimodal model: the scarcity of high-quality conversational audio datasets.

Existing datasets often fall short in several key areas:

  1. Content of conversations: Most available datasets consist of assistant-user exchanges that don’t capture the complexity and subject matter of real-world dialogues.
  2. Audio-text alignment: Few datasets provide both high-quality audio and accurately transcribed text that are precisely aligned.
  3. Diversity of speakers: Many datasets use a limited number of voices, which can lead to AI models that don’t generalize well across different speakers.
  4. Scalability: The labor-intensive nature of creating human-recorded conversational audio data limits the size and scope of many existing datasets.

It was in response to these specific challenges that I embarked on the journey to create Daily Yap. This dataset aims to provide a starting point to address the shortage of comprehensive conversational audio data, offering researchers and developers a resource for training real-time conversational audio models.

In the following sections, I’ll detail the process of developing Daily Yap, from its conceptual roots to its final form as a dataset of nearly 10,000 audio samples with matching transcripts.

Choosing a Foundation

After extensive research, I selected the Daily Dialog dataset as the foundation for this project. Daily Dialog offered a solid base of conversational topics, providing a springboard from which to build. However, I quickly realized that to create a truly valuable resource, significant enhancements would be necessary.

Refining the Raw Material

The first step in transforming Daily Dialog into Daily Yap involved leveraging the power of GPT-4. I tasked this advanced language model with three primary objectives:

  1. Correcting grammatical and spelling errors in the original transcripts.
  2. Reformatting the text to be more compatible with text-to-speech (TTS) engines. This included expanding abbreviations (e.g., changing “Mr.” to “Mister”) to ensure clearer audio output.
  3. Extending conversations that were deemed too brief, adding depth and complexity to the dialogues.

This process was crucial in preparing the data for the next stage: audio generation.
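
As a concrete illustration, a cleanup pass like this can be scripted against the OpenAI chat API. The snippet below is a minimal sketch, not the exact code used for Daily Yap; the prompt wording, temperature, and turn format are my own simplifications.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLEANUP_PROMPT = (
    "You will receive a short dialogue, one line per turn. "
    "Fix grammar and spelling, expand abbreviations so a TTS engine reads "
    "them naturally (e.g. 'Mr.' becomes 'Mister'), and if the conversation "
    "is very short, extend it by a few natural turns. "
    "Return only the revised dialogue, one line per turn."
)

def clean_dialogue(turns: list[str]) -> list[str]:
    """Rewrite one Daily Dialog conversation into TTS-friendly turns."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CLEANUP_PROMPT},
            {"role": "user", "content": "\n".join(turns)},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content.strip().splitlines()

if __name__ == "__main__":
    print(clean_dialogue(["Hi, Mr. Smith! How r you?", "I'm good, thx."]))
```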

Bringing Conversations to Life

Selecting the right text-to-speech engine was a critical decision. After experimenting with several options, including ChatTTS, I ultimately chose XTTSv2. This engine stood out for its superior quality, producing more natural-sounding speech than other open-source options.

To enhance the dataset’s versatility, I used eight distinct voices: four male and four female. These were generated by mixing latent representations of my own voice and those of some friends. This diversity was intentional, aimed at creating a balanced dataset that would allow AI models to generalize better across different speaker characteristics.
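
For anyone who wants to reproduce a similar pipeline, the Coqui TTS package exposes XTTSv2 through a high-level API. The sketch below synthesizes each turn of a dialogue with a cloned reference voice; the reference clips, file layout, and dialogue are placeholders, and the latent-mixing step used to blend the eight voices is not shown here, since it operates on the model’s conditioning latents rather than this interface.

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the XTTSv2 multilingual model (weights are downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Placeholder reference clips: each voice is cloned from a short WAV sample.
voices = {
    "speaker_a": "refs/male_voice_1.wav",
    "speaker_b": "refs/female_voice_1.wav",
}

dialogue = [
    ("speaker_a", "Hello, Mister Smith! How have you been?"),
    ("speaker_b", "Pretty well, thanks. I just got back from a trip."),
]

# Synthesize each turn with the corresponding cloned voice.
for i, (speaker, text) in enumerate(dialogue):
    tts.tts_to_file(
        text=text,
        speaker_wav=voices[speaker],
        language="en",
        file_path=f"out/turn_{i:03d}_{speaker}.wav",
    )
```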

The Final Product

After weeks of development, refinement, and quality assurance, Daily Yap emerged as a robust dataset consisting of 9,758 samples. With approximately 90 hours of audio and an average sample length of 33 seconds, it offers a rich resource for researchers and developers working on conversational AI, speech recognition, and natural language processing tasks.

Each sample in the dataset includes a JSON-formatted transcript paired with a dual-channel WAV file, allowing for easy speaker separation and multimodal alignment.
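
Because each speaker sits on their own channel, separating them takes only a few lines. The sketch below uses the soundfile package with a placeholder file name; consult the dataset card for the exact file layout and transcript schema.

```python
import soundfile as sf

# Placeholder file name: each Daily Yap sample is a stereo WAV with one
# speaker per channel, so indexing the channels separates the speakers.
audio, sample_rate = sf.read("daily_yap_sample.wav")  # shape: (frames, 2)

speaker_a = audio[:, 0]  # left channel
speaker_b = audio[:, 1]  # right channel

sf.write("speaker_a.wav", speaker_a, sample_rate)
sf.write("speaker_b.wav", speaker_b, sample_rate)
```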

Looking to the Future

While Daily Yap represents a significant step forward, I see it as just the beginning. Future iterations could potentially include fully synthetic dialogues, moving beyond the constraints of the original Daily Dialog dataset. This approach could allow for even greater scalability and diversity in the conversations.

Additionally, as text-to-speech technology continues to advance, I plan to explore upgrading the audio generation process to incorporate the latest breakthroughs in synthetic speech.

Conclusion

The creation of Daily Yap was driven by a desire to address the lack of high-quality conversational audio datasets and to gain a better understanding of multimodal models. The project taught me a lot about the architecture of audio models and familiarized me with recent work in the field.

Daily Yap represents a significant step forward in providing researchers and developers with the tools they need to create more natural, more engaging, and more capable conversational AI systems. I’m excited to see how the community will leverage this resource to push the boundaries of what’s possible in multimodal AI and speech recognition.

For those interested in exploring or using the Daily Yap dataset, you can find it on HuggingFace: https://huggingface.co/datasets/jakeBoggs/DailyYap

If any academics want to cite this in a paper, I would be honored and extremely amused. Seeing “Daily Yap” in a works cited section would give me a good laugh and make all of this worth it.