As undesirable as it might be, more often than not there is extremely useful information embedded in Word documents, PowerPoint presentations, PDFs, and similar files (so-called “dark data”) that would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats on their own, this package provides a single interface for extracting content from any supported type of file, without any irrelevant markup.
This package provides two primary facilities for doing this: the command line interface

    textract path/to/file.extension

or the python package
    # some python file
    import textract
    text = textract.process("path/to/file.extension")
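As a minimal sketch of how the Python interface might be applied to several files at once, the helper below calls an extractor on each path and decodes the result. The name `extract_many` and its `extractor` parameter are illustrative and not part of textract. The sketch assumes `textract.process` returns bytes (which we believe is the case on Python 3), and it defers the import so the helper can also be exercised with a stand-in extractor.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_many(
    paths: Iterable[str],
    extractor: Optional[Callable[[str], bytes]] = None,
) -> Dict[str, str]:
    """Map each path to its extracted text (hypothetical helper, not textract API)."""
    if extractor is None:
        # Deferred import: only needed when no stand-in extractor is supplied.
        import textract

        extractor = textract.process

    results: Dict[str, str] = {}
    for path in paths:
        raw = extractor(path)
        # Assumption: the extractor returns bytes; decode leniently to str.
        if isinstance(raw, bytes):
            results[path] = raw.decode("utf-8", errors="replace")
        else:
            results[path] = str(raw)
    return results
```

Passing a custom `extractor` keeps the loop testable without any of the external tools installed; in normal use the argument would simply be omitted.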
textract supports a growing list of file types for text extraction. If you don’t see your favorite file type here, please recommend it by either mentioning it on the issue tracker or by contributing a pull request.
.csv via python builtins
.eml via python builtins
.json via python builtins
.mp3 via sox, SpeechRecognition, and pocketsphinx
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
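The list above can be sketched as a simple lookup, which may be handy for checking a path before attempting extraction. The table `BACKENDS` and the function `backend_for` are hypothetical names introduced for illustration. The entries are copied by hand from the extensions listed here rather than read from textract, so they cover only those extensions.

```python
import os
from typing import Optional

# Hand-copied from the list above; not queried from textract itself.
BACKENDS = {
    ".csv": "python builtins",
    ".eml": "python builtins",
    ".json": "python builtins",
    ".mp3": "sox, SpeechRecognition, and pocketsphinx",
    ".odt": "python builtins",
    ".ogg": "sox, SpeechRecognition, and pocketsphinx",
    ".txt": "python builtins",
    ".wav": "SpeechRecognition and pocketsphinx",
}


def backend_for(path: str) -> Optional[str]:
    """Return the listed backend for a path's extension, or None if unlisted."""
    extension = os.path.splitext(path)[1].lower()
    return BACKENDS.get(extension)
```

A `None` result only means the extension does not appear in this hand-copied table, not that textract cannot handle it.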