I'm trying to use the openai-whisper Python module to transcribe already-recorded audio files, which can be large (30 minutes to 2 or 3 hours). But I'm facing an issue: the audio isn't fully transcribed with the large-v3 model. For instance, I'm working on a 30-minute audio file and nothing is transcribed from 2:00 to 15:00, even though I checked and there are people talking in that span.
I saw in other posts and blogs that Whisper's performance gets worse as the audio gets longer, so I split the audio into 15-minute segments with pydub using the following code:
```python
import math
import os

from pydub import AudioSegment


def segment_audio_duration(audio_file: str, millisecond_duration: int, output_folder, format: str) -> tuple[int, str]:
    """
    Segment the audio into segments lasting _millisecond_duration_ and directly export them to _output_folder_.
    Returns a tuple containing the number of segments created and the name of a file without the number.
    """
    sound = AudioSegment.from_file(audio_file)
    duration = len(sound)
    num_chunks = math.ceil(duration / millisecond_duration)
    basename = audio_file.split(os.sep)[-1]
    filename = basename.split('.')[0]
    ext = basename.split('.')[-1]
    for i in range(num_chunks):
        temp = sound[i * millisecond_duration:(i + 1) * millisecond_duration]
        temp.export(f"{output_folder}{os.sep}{filename}_part{i + 1}.{format}", format=format)
    return (num_chunks, f"{filename}_part")


def transcript_audio(audio_path: str, model, language: str = "fr", gpu_usable: bool = False) -> str:
    """
    Simple auxiliary function to the _get_transcription_ function.

    :param audio_path: path of the audio file
    :type audio_path: str
    :param model: model used to transcribe
    :param language: language of the audio
    :type language: str
    :return: raw content transcription
    :rtype: str
    """
    try:
        result = model.transcribe(
            audio_path,
            temperature=0.0,
            language=language,
            fp16=gpu_usable
        )
    except Exception as e:
        raise Exception(f"Unable to retrieve the transcription of {audio_path} ({e})")
    return result["text"]
```

I can't use anything other than this Python module to transcribe my audio files. For the moment:
- I managed to fully transcribe the audio with the other large models using 15-minute segments
- with the large-v3 model, I only get full transcriptions when the segments last less than 5 minutes (4.9 minutes is okay, but 5 is not)
- I already checked whether all my audio segments were transcribed, and they all are
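For reference, the segment boundaries the code above slices at can be sanity-checked independently of pydub; `chunk_bounds` below is a hypothetical helper (not part of my code) that reproduces the same `math.ceil`-based chunking arithmetic:

```python
import math


def chunk_bounds(duration_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Return the (start, end) millisecond boundaries of each segment.

    The last segment is simply shorter when the duration is not an exact
    multiple of chunk_ms; pydub's slicing clamps to the end the same way.
    """
    num_chunks = math.ceil(duration_ms / chunk_ms)
    return [(i * chunk_ms, min((i + 1) * chunk_ms, duration_ms))
            for i in range(num_chunks)]


# A 30-minute file cut into 5-minute pieces gives six equal segments,
# while a 16:40 file cut the same way ends with one short segment.
print(chunk_bounds(30 * 60_000, 5 * 60_000))
print(chunk_bounds(1_000_000, 5 * 60_000))
```

So the chunking itself covers the whole file with no gaps; the missing text only appears at the transcription step.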
Does anyone have an idea why I'm having this issue, and how to solve it? And why does it work with the other large models but not large-v3?
Thanks in advance !
(PS: please don't judge my code too harshly, I'm still pretty new)
