As undesirable as it might be, useful information is often embedded in Word documents, PowerPoint presentations, PDFs, and similar files (so-called “dark data”) that would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats individually, this package provides a single interface for extracting content from any supported type of file, without any irrelevant markup.

This package provides two primary ways to do this: the command line interface

textract path/to/file.extension

or the Python package

# some python file
import textract
text = textract.process("path/to/file.extension")
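
The same call can be applied to several files at once. The following is only a minimal sketch: `extract_many` and its `process` parameter are hypothetical names introduced for illustration and are not part of this package, and the sketch assumes that `textract.process` returns the extracted text as bytes that can be decoded as UTF-8.

```python
from typing import Callable, Dict, Iterable, Optional, Union


def extract_many(
    paths: Iterable[str],
    process: Optional[Callable[[str], Union[bytes, str]]] = None,
) -> Dict[str, str]:
    """Return a {path: text} mapping for every path in `paths`.

    `process` defaults to textract.process; passing another callable
    (hypothetical hook, handy for testing) avoids needing real files.
    """
    if process is None:
        # Imported lazily so the helper can be loaded without textract installed.
        import textract
        process = textract.process

    results: Dict[str, str] = {}
    for path in paths:
        raw = process(str(path))
        # Assumption: the extractor returns bytes; decode them to str.
        # If it already returned str, keep the value unchanged.
        results[str(path)] = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return results
```

Called as `extract_many(["path/to/file.extension"])`, this would hand each path to `textract.process` in turn and collect the decoded text by path.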

Currently supporting

Please recommend other file types by either mentioning them on the issue tracker or by contributing
