textract
As undesirable as it might be, extremely useful information is more often than not embedded in Word documents, PowerPoint presentations, PDFs, and similar files. This so-called “dark data” would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats individually, this package provides a single interface for extracting content from any supported type of file, without any irrelevant markup.
This package provides two primary interfaces for doing this: the command line interface
textract path/to/file.extension
or the Python package
# some python file
import textract
text = textract.process("path/to/file.extension")
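When several files need to be processed, one unreadable or unsupported file should not abort the whole run. The following is a minimal sketch of one way to wrap the call shown above. The helper name `extract_all` and its `extractor` parameter are hypothetical and not part of textract. The sketch assumes the extractor may return bytes and decodes them as UTF-8; check this against your installed version.

```python
def extract_all(paths, extractor=None):
    """Return a dict mapping each path to its extracted text, or None on failure.

    `extractor` is a hypothetical hook: any callable that takes a path string
    and returns text or bytes. It defaults to textract.process.
    """
    if extractor is None:
        import textract  # third-party package, imported lazily
        extractor = textract.process

    results = {}
    for path in paths:
        key = str(path)
        try:
            raw = extractor(key)
        except Exception:
            # Unsupported extension, missing file, or a failing external tool:
            # record the failure and carry on with the remaining files.
            results[key] = None
            continue
        # Assumption: output may be bytes; decode to str, replacing bad bytes.
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8", errors="replace")
        results[key] = raw
    return results

# Example call (requires textract and its external tools to be installed):
# texts = extract_all(["path/to/file.extension"])
```

Catching the broad `Exception` keeps the sketch independent of textract's own exception classes; narrow it once you know which errors you want to handle.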
Currently supporting
- .doc via antiword
- .docx via python-docx
- .eml via python builtins
- .epub via ebooklib
- .gif via tesseract-ocr
- .jpg and .jpeg via tesseract-ocr
- .json via python builtins
- .html via beautifulsoup4
- .odt via python builtins
- .pdf via pdftotext (default) or pdfminer
- .png via tesseract-ocr
- .pptx via python-pptx
- .ps via ps2text
- .txt via python builtins
- .xls via xlrd
- .xlsx via xlrd
Please recommend other file types by either mentioning them on the issue tracker or by contributing.
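The supported-format list above can be mirrored in a small lookup table, which is handy for filtering a directory before handing files to textract. This is only an illustrative sketch: the `BACKENDS` table and the `backend_for` helper are hypothetical, are copied from the list on this page rather than read from textract itself, and may drift out of date as formats are added.

```python
import os

# Extension -> backend, transcribed from the list on this page (an assumption,
# not something queried from the textract package).
BACKENDS = {
    ".doc": "antiword",
    ".docx": "python-docx",
    ".eml": "python builtins",
    ".epub": "ebooklib",
    ".gif": "tesseract-ocr",
    ".jpg": "tesseract-ocr",
    ".jpeg": "tesseract-ocr",
    ".json": "python builtins",
    ".html": "beautifulsoup4",
    ".odt": "python builtins",
    ".pdf": "pdftotext (default) or pdfminer",
    ".png": "tesseract-ocr",
    ".pptx": "python-pptx",
    ".ps": "ps2text",
    ".txt": "python builtins",
    ".xls": "xlrd",
    ".xlsx": "xlrd",
}

def backend_for(path):
    """Return the backend listed for this file's extension, or None if unlisted."""
    _, ext = os.path.splitext(path)
    # Compare case-insensitively so "REPORT.PDF" matches ".pdf".
    return BACKENDS.get(ext.lower())
```

For example, `backend_for("slides.pptx")` looks up `.pptx` in the table, and a path with an unlisted extension such as `archive.zip` yields `None`.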