Pages can open a Cyrillic .docx, but docx2txt, docx2python, mammoth and pypandoc can't. How should I read it?

I'm trying to open .docx files that contain Cyrillic text. I'm working on an M1 Mac, and I scraped these files from the web. Pages reads them correctly, but when I try to open and read them with Python libraries, I get essentially the same error from each one.

    import docx2txt

    text = docx2txt.process("test1.docx")
    print(text)

zipfile.BadZipFile: File is not a zip file

    import pypandoc

    text = pypandoc.convert_file('your_file.docx', 'plain')
    print(text)

RuntimeError: Pandoc died with exitcode "63" during conversion: couldn't unpack docx container: Did not find end of central directory signature

    import docx

    def read_cyrillic_docx(file_path):
        doc = docx.Document(file_path)
        full_text = [para.text for para in doc.paragraphs]
        return '\n'.join(full_text)

    text = read_cyrillic_docx('test1.docx')
    print(text)

docx.opc.exceptions.PackageNotFoundError: Package not found at 'test1.docx'
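
All three errors seem to say the same thing: the libraries do not find a ZIP container, and a real .docx is a ZIP archive. So the files may be something else saved with a .docx extension (a legacy .doc, RTF, an HTML page returned by the server) or a truncated download. Below is a minimal sketch for checking that, using only the standard library. The helper name `sniff_format` and the labels it returns are hypothetical, and the signature list is an assumption covering only a few common cases.

```python
import zipfile

# Leading byte signatures for a few formats that sometimes carry a .docx name.
# This list is illustrative, not exhaustive.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc and other OLE2 files
    (b"{\\rtf", "rtf"),
]


def sniff_format(path):
    """Return a rough label for what the file at `path` actually contains."""
    # A valid ZIP has an end-of-central-directory record; is_zipfile checks for it.
    if zipfile.is_zipfile(path):
        return "zip"

    with open(path, "rb") as f:
        head = f.read(512)

    # Starts like a ZIP but failed the check above: likely an incomplete download.
    if head.startswith(b"PK"):
        return "truncated-zip"

    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label

    # Skip a UTF-8 BOM and leading whitespace before looking for HTML markers.
    text_start = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if text_start.startswith((b"<!doctype html", b"<html")):
        return "html"

    return "unknown"
```

Calling `sniff_format("test1.docx")` would then show whether the file is really a ZIP. If it returns anything other than "zip", the .docx libraries above cannot be expected to read it as-is.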

How can I open these files using Python? Thanks in advance!
