I'm trying to use the openai-whisper Python module to transcribe already-recorded audio files, which can be large (30 minutes to 2 or 3 hours). But I'm facing an issue: the audio isn't fully transcribed with the large-v3 model. For instance, I'm working on a 30-minute audio file and nothing is transcribed from 2:00 to 15:00, even though I checked and there are people talking in that span.
I saw in other posts and blogs that Whisper's performance gets worse as the audio gets longer, so I split the audio into 15-minute segments with pydub using the following code:
```python
import math
import os

from pydub import AudioSegment


def segment_audio_duration(audio_file: str, millisecond_duration: int, output_folder, format: str) -> tuple[int, str]:
    """
    Segment the audio into segments lasting _millisecond_duration_ and directly export them to _output_folder_.
    Returns a tuple containing the number of segments created and the name of a file without the number.
    """
    sound = AudioSegment.from_file(audio_file)
    duration = len(sound)
    num_chunks = math.ceil(duration / millisecond_duration)
    basename = audio_file.split(os.sep)[-1]
    filename = basename.split('.')[0]
    ext = basename.split('.')[-1]
    for i in range(num_chunks):
        temp = sound[i * millisecond_duration:(i + 1) * millisecond_duration]
        temp.export(f"{output_folder}{os.sep}{filename}_part{i + 1}.{format}", format=format)
    return (num_chunks, f"{filename}_part")


def transcript_audio(audio_path: str, model, language: str = "fr", gpu_usable: bool = False) -> str:
    """
    Simple auxiliary function to the _get_transcription_ function.

    :param audio_path: path of the audio file
    :type audio_path: str
    :param model: model used to transcribe
    :param language: language of the audio
    :type language: str
    :return: raw content transcription
    :rtype: str
    """
    try:
        result = model.transcribe(
            audio_path,
            temperature=0.0,
            language=language,
            fp16=gpu_usable
        )
    except Exception as e:
        raise Exception(f"Unable to retrieve the transcription of {audio_path} ({e})")
    return result["text"]
```

I can't use anything other than this Python module to transcribe my audio files. For the moment:
- I managed to fully transcribe the audio with the other large models using 15-minute segments
- with the large-v3 model, I only get full transcriptions when the segments last less than 5 minutes (4.9 minutes is okay, but 5 is not)
- I already checked whether all my audio segments were transcribed, and they all are
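For reference, the segment boundaries the code above slices at can be sanity-checked independently of pydub; `chunk_bounds` below is a hypothetical helper (not part of my code) that reproduces the same `math.ceil`-based chunking arithmetic:

```python
import math


def chunk_bounds(duration_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Return the (start, end) millisecond boundaries of each segment.

    The last segment is simply shorter when the duration is not an exact
    multiple of chunk_ms; pydub's slicing clamps to the end the same way.
    """
    num_chunks = math.ceil(duration_ms / chunk_ms)
    return [(i * chunk_ms, min((i + 1) * chunk_ms, duration_ms))
            for i in range(num_chunks)]


# A 30-minute file cut into 5-minute pieces gives six equal segments,
# while a 16:40 file cut the same way ends with one short segment.
print(chunk_bounds(30 * 60_000, 5 * 60_000))
print(chunk_bounds(1_000_000, 5 * 60_000))
```

So the chunking itself covers the whole file with no gaps; the missing text only appears at the transcription step.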
Does anyone have an idea why I'm having this issue, and how to solve it? And why does it work with the other large models but not large-v3?
Thanks in advance !
(PS: please don't judge my code too harshly, I'm still pretty new)
