Pages can open Cyrillic .docx files, but docx2txt, docx2python, mammoth and pypandoc can't. How should I read them?

I'm trying to open .docx files that contain Cyrillic text. I'm working on an M1 Mac, and I scraped these files myself. Pages reads them correctly, but when I try to open them with Python libraries I get essentially the same error from each one.

    import docx2txt

    text = docx2txt.process("test1.docx")
    print(text)

zipfile.BadZipFile: File is not a zip file

    import pypandoc

    text = pypandoc.convert_file('your_file.docx', 'plain')
    print(text)

RuntimeError: Pandoc died with exitcode "63" during conversion: couldn't unpack docx container: Did not find end of central directory signature

    import docx

    def read_cyrillic_docx(file_path):
        doc = docx.Document(file_path)
        full_text = [para.text for para in doc.paragraphs]
        return '\n'.join(full_text)

    text = read_cyrillic_docx('test1.docx')
    print(text)

docx.opc.exceptions.PackageNotFoundError: Package not found at 'test1.docx'

How can I open these files using Python? Thanks in advance!
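
All three errors complain about the container rather than the text inside it: a genuine .docx is a zip archive, and each library fails while trying to unpack one. That suggests (though it does not prove) that the scraped files are some other format saved under a .docx name. A minimal diagnostic sketch follows, assuming the file starts with one of a few common signatures. The helper name `sniff_format` and the list of candidate formats are illustrative assumptions, not part of any of the libraries above.

```python
# Hypothetical helper: guess what a file really is from its first bytes.
# The signature list is an assumption about likely candidates, not exhaustive.
SIGNATURES = [
    (b"PK\x03\x04", "zip container (what a real .docx starts with)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .doc)"),
    (b"{\\rtf", "RTF"),
    (b"%PDF", "PDF"),
]


def sniff_format(path):
    """Return a rough label for the file format based on its leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(512)

    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label

    # Scrapers often save an HTML error or login page under the expected name.
    # Strip a UTF-8 byte order mark and leading whitespace before checking.
    text_start = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if text_start.startswith(b"<!doctype html") or text_start.startswith(b"<html"):
        return "HTML page"

    return "unknown"
```

Calling `sniff_format("test1.docx")` and printing the result shows which case applies. If it reports anything other than a zip container, the file needs a reader for that format (or needs to be downloaded again), and no encoding setting in the .docx libraries would help.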
