textract
As undesirable as it might be, more often than not there is extremely useful information embedded in Word documents, PowerPoint presentations, PDFs, etc. (so-called "dark data") that would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats on their own, this package provides a single interface for extracting content from any type of file, without any irrelevant markup.
This package provides two primary ways of doing this: the command line interface
textract path/to/file.extension
or the Python package
# some python file
import textract
text = textract.process("path/to/file.extension")
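For further textual analysis it is usually convenient to work with a decoded string. The following is a minimal sketch, assuming that textract.process returns UTF-8 encoded bytes (this may differ between versions). The names extract_text and process are hypothetical and introduced here for illustration; they are not part of textract.

```python
def extract_text(path, process=None):
    """Return the extracted text of `path` as a str.

    `process` is a hypothetical injection point for illustration;
    when omitted, textract.process is used.
    """
    if process is None:
        # Imported lazily so the helper can be exercised without textract installed.
        import textract
        process = textract.process
    raw = process(str(path))
    # Assumption: the result is UTF-8 encoded bytes; a str is passed through unchanged.
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    return raw
```

With textract installed, extract_text("path/to/file.extension") would then yield a string ready for tokenizing or searching.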
Currently supporting
textract supports a growing list of file types for text extraction. If you don't see your favorite file type here, please recommend it by either mentioning it on the issue tracker or by contributing a pull request.
.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd
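When processing a directory of mixed files, it can help to check whether an extension appears in the list above before attempting extraction. The sketch below simply transcribes that list into a dictionary. BACKENDS and backend_for are hypothetical names for illustration, not part of textract's API, and the table would need updating as the supported list grows.

```python
from pathlib import Path

# Extension -> backend, transcribed from the list above (illustrative only).
BACKENDS = {
    ".csv": "python builtins",
    ".doc": "antiword",
    ".docx": "python-docx2txt",
    ".eml": "python builtins",
    ".epub": "ebooklib",
    ".gif": "tesseract-ocr",
    ".jpg": "tesseract-ocr",
    ".jpeg": "tesseract-ocr",
    ".json": "python builtins",
    ".html": "beautifulsoup4",
    ".htm": "beautifulsoup4",
    ".mp3": "sox, SpeechRecognition, and pocketsphinx",
    ".msg": "msg-extractor",
    ".odt": "python builtins",
    ".ogg": "sox, SpeechRecognition, and pocketsphinx",
    ".pdf": "pdftotext (default) or pdfminer.six",
    ".png": "tesseract-ocr",
    ".pptx": "python-pptx",
    ".ps": "ps2text",
    ".rtf": "unrtf",
    ".tiff": "tesseract-ocr",
    ".tif": "tesseract-ocr",
    ".txt": "python builtins",
    ".wav": "SpeechRecognition and pocketsphinx",
    ".xlsx": "xlrd",
    ".xls": "xlrd",
}


def backend_for(path):
    """Return the backend listed for the file's extension, or None if unlisted."""
    # Extensions are compared case-insensitively, so "REPORT.PDF" matches ".pdf".
    return BACKENDS.get(Path(path).suffix.lower())
```

A caller could skip any file for which backend_for returns None rather than passing it to textract and handling the resulting error.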