Python – convert documents (doc, docx, odt, pdf) to plain text without LibreOffice
I recently needed to convert some resumes to plain text, and there are plenty of other reasons to want readable text out of binary formats, so here is a code snippet that does just that. It relies on a few non-Python Linux programs plus some Python libraries. Notably absent is LibreOffice, which would take care of a ton of formats on its own. LibreOffice is, however, heavyweight and clunky to use, and these programs convert much faster. First, let’s get some dependencies.
PDF – pdfminer. http://stackoverflow.com/a/20905381/443457
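For the PDF case, here is a minimal sketch of what the conversion could look like. It assumes the maintained pdfminer.six fork, which provides `pdfminer.high_level.extract_text`; the linked answer uses the older, more verbose pdfminer API instead. The function name `pdf_to_text` and its `extract` parameter are my own, added only so the cleanup step can be tried without a real PDF.

```python
def pdf_to_text(path, extract=None):
    """Return the text of a PDF as a plain string.

    `extract` is a hypothetical hook: any callable that takes a path and
    returns raw text. By default it is pdfminer.six's extract_text.
    """
    if extract is None:
        # Assumption: pdfminer.six is installed (pip install pdfminer.six).
        # Imported lazily so the module still loads without it.
        from pdfminer.high_level import extract_text as extract

    raw = extract(path)

    # pdfminer marks page breaks with form feeds; turn them into newlines
    # and strip trailing whitespace so the result is clean plain text.
    lines = [line.rstrip() for line in raw.replace("\f", "\n").splitlines()]
    return "\n".join(lines).strip()
```

With pdfminer.six installed, `pdf_to_text("resume.pdf")` should return the document's text, where `resume.pdf` stands in for whatever file you are converting.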