python – convert documents (doc, docx, odt, pdf) to plain text without Libreoffice


I recently needed to convert some resumes to plain text, and there are plenty of other use cases for extracting readable text from binary formats. So here is a code snippet to do just that. I’m using some non-Python Linux programs along with a couple of Python libraries. Notably absent is LibreOffice, which would take care of a ton of formats. LibreOffice is, however, heavyweight and clunky to use, and these programs will convert much faster. First, let’s get some dependencies.

PDF – pdfminer. http://stackoverflow.com/a/20905381/443457
doc – antiword
docx – python-docx
odt – odt2txt
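
Before converting anything, it can help to confirm that the tools above are actually installed. The following is a minimal sketch, not part of the original snippet: the function name missing_dependencies is hypothetical, and it assumes the two libraries are importable as pdfminer and docx. Note that it cannot tell the old docx package from the new python-docx, since both are imported as docx.

```python
import importlib.util
import shutil


def missing_dependencies(commands=("antiword", "odt2txt"),
                         modules=("pdfminer", "docx")):
    """Return the names of required programs and modules that cannot be found."""
    # Command line tools have to be on PATH to be callable.
    missing = [name for name in commands if shutil.which(name) is None]
    # Python libraries have to be importable under their top-level module name.
    missing += [name for name in modules
                if importlib.util.find_spec(name) is None]
    return missing


if __name__ == "__main__":
    print("Missing:", missing_dependencies())
```

An empty list means all four dependencies were found.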

Notice – Newer versions of python-docx removed the function this snippet relies on for docx files. Make sure to pip install docx, not the new python-docx.
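
One way the four tools could be tied together is sketched below: pick a converter by file extension and return the text. This is an illustration under stated assumptions rather than a definitive implementation. All function names here are hypothetical. The docx branch assumes the legacy docx package with its opendocx and getdocumenttext functions (it may not run on current Python versions), and the PDF branch assumes pdfminer.six and its high-level extract_text helper rather than the older pdfminer API. Decoding tool output as UTF-8 is also an assumption about your locale.

```python
import os
import subprocess


def run_tool(args):
    """Run an external converter and return what it printed, as text."""
    result = subprocess.run(args, stdout=subprocess.PIPE, check=True)
    # Assumption: the tool writes UTF-8; undecodable bytes are dropped.
    return result.stdout.decode("utf-8", "ignore")


def doc_to_text(path):
    return run_tool(["antiword", path])


def odt_to_text(path):
    return run_tool(["odt2txt", path])


def docx_to_text(path):
    # Assumption: legacy `docx` package (pip install docx), not python-docx.
    from docx import opendocx, getdocumenttext
    return "\n\n".join(getdocumenttext(opendocx(path)))


def pdf_to_text(path):
    # Assumption: pdfminer.six, which provides pdfminer.high_level.
    from pdfminer.high_level import extract_text
    return extract_text(path)


# Map each supported extension to the function that handles it.
CONVERTERS = {
    ".doc": doc_to_text,
    ".docx": docx_to_text,
    ".odt": odt_to_text,
    ".pdf": pdf_to_text,
}


def document_to_text(path):
    """Convert a supported document to plain text based on its extension."""
    extension = os.path.splitext(path)[1].lower()
    try:
        converter = CONVERTERS[extension]
    except KeyError:
        raise ValueError("unsupported file type: %r" % extension)
    return converter(path)
```

Calling document_to_text("resume.pdf") would then return the extracted text as a string. The library imports sit inside the functions so that a missing library only affects the format that needs it.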

Comments

2 responses to “python – convert documents (doc, docx, odt, pdf) to plain text without Libreoffice”

June 23, 2014

David Hubbard

I just wanted to say thank you for this example. It is the only pdf to text function I could find online using the pdfminer library that works! Many thanks! If you are ever in Chicago and you drink, email me, I will meet you and buy as many drinks as you like!

October 3, 2014

Nursultan

Hello! I am facing a problem: ImportError: No module named lxml. Help me please.
