textract

As undesirable as it might be, more often than not there is extremely useful information embedded in Word documents, PowerPoint presentations, PDFs, and similar files (so-called “dark data”) that would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats individually, this package provides a single interface for extracting content from any supported file type, without any irrelevant markup.

This package provides two primary ways of doing this: the command line interface

textract path/to/file.extension

or the Python package

# some python file
import textract
text = textract.process("path/to/file.extension")
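
As a minimal sketch of how the Python call above might be wrapped in practice, the helper below returns a decoded string. The name extract_text and its extractor parameter are hypothetical, introduced only for illustration. It assumes that textract.process returns bytes, which may depend on the installed version, so a str result is passed through unchanged.

```python
def extract_text(path, extractor=None, encoding="utf-8", **options):
    """Return the text content of `path` as a str.

    `extractor` is a hypothetical hook: any callable that takes a path and
    returns bytes or str. It defaults to textract.process.
    """
    if extractor is None:
        # Imported lazily so the helper can be exercised without textract installed.
        import textract
        extractor = textract.process

    # Extra keyword arguments are forwarded untouched to the extractor.
    raw = extractor(str(path), **options)

    # Assumption: textract.process returns bytes; tolerate str as well.
    return raw.decode(encoding) if isinstance(raw, bytes) else raw
```

Calling extract_text("path/to/file.extension") would then behave like the snippet above, with the decoding step made explicit. Passing a stand-in extractor is one way to test calling code without the external tools installed.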

Currently supporting

.doc via antiword
.docx via python-docx
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html via beautifulsoup4
.odt via python builtins
.pdf via pdftotext (default) or pdfminer
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.txt via python builtins
.xls via xlrd
.xlsx via xlrd
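
The list above can be mirrored as a small lookup table, for example to check whether a file is worth handing to textract before calling it. This is an illustrative sketch only: BACKENDS and backend_for are hypothetical names, not part of the textract API, and the table simply restates the list rather than querying the installed package.

```python
from pathlib import Path

# Hypothetical table copied from the list above; not read from textract itself.
BACKENDS = {
    ".doc": "antiword",
    ".docx": "python-docx",
    ".eml": "python builtins",
    ".epub": "ebooklib",
    ".gif": "tesseract-ocr",
    ".jpg": "tesseract-ocr",
    ".jpeg": "tesseract-ocr",
    ".json": "python builtins",
    ".html": "beautifulsoup4",
    ".odt": "python builtins",
    ".pdf": "pdftotext (default) or pdfminer",
    ".png": "tesseract-ocr",
    ".pptx": "python-pptx",
    ".ps": "ps2text",
    ".txt": "python builtins",
    ".xls": "xlrd",
    ".xlsx": "xlrd",
}


def backend_for(path):
    """Return the backend listed for the file's extension, or None if unlisted."""
    # Extensions are compared case-insensitively, so "Report.PDF" matches ".pdf".
    return BACKENDS.get(Path(path).suffix.lower())
```

A result of None would indicate an extension that is not in the list, which is the kind of file type worth recommending on the issue tracker.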

Please recommend other file types, either by mentioning them on the issue tracker or by contributing.
