Sam Witteveen: Gemini 2.5 Pro: Mastering Audio Transcription, Diarization & More

Unleash the power of Gemini 2.5 Pro for audio tasks! Discover how it revolutionizes audio transcription, diarization, and question-answering. With its ability to generate 64,000 tokens, it can handle up to 2 hours of audio. Learn about the pricing, supported audio types, and tricks for longer audio. Dive into the code and see how to upload audio, generate transcripts, and summarize podcasts. Don't miss out on this game-changing technology!

Quick Takeaways:

Gemini 2.5 Pro can generate up to 64,000 output tokens, enough for a transcript of roughly 2 hours of audio.
It's a multimodal model that can handle various audio types.
Pricing is reasonable, and there are tricks for longer audio.
The model can do audio diarization and generate timestamps.
Use code to upload audio, generate transcripts, and summarize podcasts.

Introduction

This video covers using the Gemini models, specifically the new Gemini 2.5, for audio tasks: generating transcripts, performing diarization, and answering questions over audio. These techniques are useful for summarizing podcasts when time is limited, or for running automated questioning at scale to extract quotes and key information.

How Gemini 2.5 Changes the Game

Initial Lack of Mention: Neither Google nor DeepMind prominently featured audio capabilities in their announcements. The launch blog post barely mentioned audio, simply stating that Gemini is a high-quality multimodal model.
Multimodal Nature: The Gemini models have been multimodal from the start, and last year, it was possible to pass audio into them without issues.
Token Generation Increase: The significant improvement with Gemini 2.5 Pro is its ability to generate 64,000 tokens, compared to the earlier models' 8,000 tokens. This is crucial because transcribing about 15 minutes of audio requires roughly 8,000 tokens. While the previous models could perform some tasks, the main challenge was generating enough tokens for a full podcast transcript. With 64,000 tokens, it's possible to generate transcripts for approximately 2 hours of audio.
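The output-token arithmetic above can be sketched as a quick estimate. This is a rough calculation that assumes the video's figure of about 8,000 output tokens per 15 minutes of speech holds on average; real transcripts vary with speaking speed and formatting, and the function name is only illustrative.

```python
# Figure from the text: roughly 8,000 output tokens per 15 minutes of speech.
TOKENS_PER_15_MIN = 8_000


def max_transcript_minutes(output_token_limit: int) -> float:
    """Estimate how many minutes of audio fit in one transcript response."""
    return output_token_limit * 15 / TOKENS_PER_15_MIN


print(max_transcript_minutes(8_000))   # 15.0  (earlier models)
print(max_transcript_minutes(64_000))  # 120.0 (Gemini 2.5 Pro, about 2 hours)
```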

Audio Types and Technical Details

Supported Audio Types: The model supports various audio types, such as WAV files, MP3s, AAC format, and FLAC files.
Token Conversion: Each second of audio is equivalent to 32 input tokens in the Gemini model, which means 1,920 tokens per minute or about 115,200 tokens per hour of audio. This is important to consider when looking at pricing, especially if the token count exceeds 200,000. In such cases, it may be necessary to build a pipeline to split the audio.
Audio Processing: The model downsamples audio to a 16 Kbps data resolution and converts stereo sources to a single channel. This is not a problem for most users, but it could be an issue if the analysis requires information about stereo positioning.
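The token conversion above works out as follows. This is a minimal sketch using the 32-tokens-per-second rate and the 200,000-token threshold quoted in the text; the helper names are hypothetical, and the prompt's own tokens are ignored.

```python
TOKENS_PER_SECOND = 32            # rate quoted in the text
LONG_CONTEXT_THRESHOLD = 200_000  # token count above which the text suggests splitting


def audio_input_tokens(duration_seconds: float) -> int:
    """Input tokens consumed by an audio clip of the given length."""
    return int(duration_seconds * TOKENS_PER_SECOND)


def needs_split(duration_seconds: float) -> bool:
    """True if the clip alone would exceed the 200,000-token threshold."""
    return audio_input_tokens(duration_seconds) > LONG_CONTEXT_THRESHOLD


print(audio_input_tokens(60))        # 1920 tokens per minute
print(audio_input_tokens(3600))      # 115200 tokens per hour
print(needs_split(2 * 3600))         # True: 2 hours is 230,400 tokens
print(LONG_CONTEXT_THRESHOLD // TOKENS_PER_SECOND)  # 6250 seconds, about 104 minutes
```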

Uploading Audio Files

Upload Options: There are two ways to upload audio files. One is to include the file in the prompt, but this is limited to a maximum of 20 megabytes per call. The other option is to use the upload API, which allows for uploading a single file of up to 2 GB and using it in a single call. Multiple files can also be uploaded and used together.
Transcript Generation: Once the audio file is uploaded, it can be passed as part of the contents for generating new content. The model is proficient in generating timestamps during the transcription process. For audio longer than 2 hours, it's possible to specify the start and end times for the transcript.
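The two upload routes might look something like the sketch below. It is not the code from the video: it assumes the `google-genai` Python SDK, an API key in a `GEMINI_API_KEY` environment variable, and the model name `gemini-2.5-pro`, all of which should be checked against the current documentation. The byte limits interpret the 20 MB and 2 GB figures from the text as binary units, and `choose_upload_route` and `transcribe` are hypothetical helper names.

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # 20 MB per call (from the text; unit is an assumption)
FILES_API_LIMIT_BYTES = 2 * 1024 ** 3   # 2 GB per uploaded file (from the text)


def choose_upload_route(size_bytes: int) -> str:
    """Pick 'inline' for small files and 'files_api' for larger ones."""
    if size_bytes > FILES_API_LIMIT_BYTES:
        raise ValueError("file exceeds the 2 GB upload limit; split it first")
    return "inline" if size_bytes <= INLINE_LIMIT_BYTES else "files_api"


def transcribe(path: str, prompt: str, model: str = "gemini-2.5-pro",
               mime_type: str = "audio/mp3") -> str:
    """Send an audio file plus a prompt to Gemini and return the text reply."""
    # Imported here so the size check above works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    if choose_upload_route(os.path.getsize(path)) == "inline":
        # Small file: embed the raw bytes directly in the request.
        with open(path, "rb") as f:
            audio = types.Part.from_bytes(data=f.read(), mime_type=mime_type)
    else:
        # Larger file: upload once, then reference it in the call.
        audio = client.files.upload(file=path)
    response = client.models.generate_content(model=model, contents=[prompt, audio])
    return response.text
```

Because the inline limit applies to the whole request, files close to 20 MB are safer sent through the upload route.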

Code Implementation

Code Setup: The code is based on examples from the Gemini team and has been modified to suit specific requirements. It starts by obtaining the key and setting up a client.
Prompt Customization: The original prompt from a Gemini developer is used as a starting point, but it can be customized according to the user's needs. The author believes in creating or modifying a generic prompt and then using a language model to enhance it.
Audio Diarization: The model is capable of performing audio diarization, which involves determining which speaker is saying what. In podcasts with multiple speakers, the model can often work out the names of the speakers based on how they address each other. If the speaker names are not mentioned, they can be provided as a list, and the model will try to identify them.
Podcast Download: Finding and downloading an MP3 of a podcast can be challenging. The author used Podbay FM to download an MP3 of the My First Million podcast. The MP3 is then uploaded to Google Colab.
Transcript Generation: After uploading the file, the actual call to generate the transcript is relatively simple. The prompt and the uploaded file are passed to the Gemini 2.5 Pro model, and a raw transcript is returned.
Transcript Processing: The raw transcript may not be in the most useful format, so the author created code to process it. The processed transcript includes speaker information and timestamps at specific intervals or when the speaker changes.
Summary Generation: Once the transcript is processed, a prompt can be used to generate a summary in the form of bullet points with timestamps. This summary can be used to quickly review the key points of the podcast.
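The transcript-processing step could be sketched as below. The author's actual code is not shown in this summary, so the raw line format (`[MM:SS] Speaker: text` or `[HH:MM:SS] Speaker: text`), the 60-second interval, and all names here are assumptions. The sketch merges consecutive lines from the same speaker and starts a new timestamped entry when the speaker changes or the interval has elapsed.

```python
import re
from dataclasses import dataclass

# Assumed raw format: "[MM:SS] Speaker: text" or "[HH:MM:SS] Speaker: text".
LINE_RE = re.compile(r"^\[(?:(\d+):)?(\d{1,2}):(\d{2})\]\s*([^:]+):\s*(.*)$")


@dataclass
class Turn:
    start: int      # seconds from the start of the audio
    speaker: str
    text: str


def parse_line(line: str):
    """Parse one raw transcript line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    hours, minutes, seconds, speaker, text = m.groups()
    start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return Turn(start, speaker.strip(), text.strip())


def process(raw: str, interval: int = 60) -> list:
    """Merge same-speaker lines; break on speaker change or after `interval` seconds."""
    turns = []
    for line in raw.splitlines():
        turn = parse_line(line)
        if turn is None:
            continue  # skip blank or malformed lines
        last = turns[-1] if turns else None
        if last and last.speaker == turn.speaker and turn.start - last.start < interval:
            last.text += " " + turn.text
        else:
            turns.append(turn)
    return turns


def format_turn(turn: Turn) -> str:
    minutes, seconds = divmod(turn.start, 60)
    return f"[{minutes:02d}:{seconds:02d}] {turn.speaker}: {turn.text}"


raw = """[00:00] Host A: Hi there.
[00:05] Host A: Welcome back.
[00:12] Host B: Thanks.
[01:30] Host B: Next topic."""
for t in process(raw):
    print(format_turn(t))
```

The processed turns can then be joined into a single string and passed back to the model with a summary prompt.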

Handling Long Audio

Token Limitations: For podcasts longer than 2 hours, the token limit may prevent generating a full transcript. In such cases, the prompt can be modified to specify the start and end times for the transcript.
Overlap and Joining: To handle the transition between segments, a small overlap can be used. The second segment can start a few minutes before the first segment ends, and the two transcripts can then be joined using fuzzy matching on the overlapping text.
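One way to sketch the overlap-and-join idea is shown below. This is an illustration under assumptions rather than the author's implementation: it uses Python's standard `difflib.SequenceMatcher` for the fuzzy matching, treats each transcript as a list of lines, and the window sizes, the 0.8 similarity threshold, and the function names are arbitrary choices.

```python
from difflib import SequenceMatcher


def segment_windows(total_seconds: int, window_seconds: int, overlap_seconds: int) -> list:
    """Return (start, end) windows where each window starts before the previous one ends."""
    if overlap_seconds >= window_seconds:
        raise ValueError("overlap must be shorter than the window")
    windows, start = [], 0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        windows.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap_seconds
    return windows


def join_transcripts(first: list, second: list, min_ratio: float = 0.8, search: int = 20) -> list:
    """Join two line lists by finding the most similar pair of lines in the overlap."""
    tail = first[-search:]
    best_ratio, best_i, best_j = 0.0, None, None
    for i, a in enumerate(tail):
        for j, b in enumerate(second[:search]):
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_i, best_j = ratio, i, j
    if best_ratio < min_ratio:
        return first + second  # no confident match: keep everything
    cut = len(first) - len(tail) + best_i
    # Keep the first transcript up to the matched line, then continue with the second.
    return first[:cut] + second[best_j:]


# A 3-hour file split into 100-minute windows with a 2-minute overlap.
print(segment_windows(3 * 3600, 100 * 60, 120))  # [(0, 6000), (5880, 10800)]
```

Each `(start, end)` pair would be written into the prompt as the start and end times for that segment's transcript.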

Conclusion

Benefits of Gemini 2.5 Pro: Gemini 2.5 Pro offers several advantages for audio processing, including better diarization than other models such as Whisper. It also allows free calls through Google AI Studio, making it accessible for a range of applications.
Future Possibilities: The combination of Gemini 2.5 Pro with text-to-speech systems opens up possibilities for creating podcasts about podcasts or abbreviated versions. However, legal considerations need to be taken into account.
Upcoming Video: In an upcoming video, the author will explore using the same techniques for YouTube videos.

If you found this video useful, please click like and subscribe.
