As undesirable as it might be, more often than not there is extremely useful information embedded in Word documents, PowerPoint presentations, PDFs, and other formats (so-called “dark data”) that would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats on its own, this package provides a single interface for extracting content from any type of file, without any irrelevant markup.
This package provides two primary facilities for doing this: the command line interface and the python package.

```python
# some python file
import textract

text = textract.process("path/to/file.extension")
```
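Note that the extracted text typically comes back from `process` as raw bytes, so a decode step usually follows before any textual analysis. A minimal sketch of that pattern, using placeholder bytes in place of a real extraction and assuming UTF-8 output:

```python
# Placeholder standing in for the bytes returned by textract.process(...)
raw = b"extracted text"

# Decode to str before feeding the text into analysis or visualization tools
text = raw.decode("utf-8")
print(text)
```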
- .doc via antiword
- .docx via python-docx
- .eml via python builtins
- .html via beautifulsoup4
- .pptx via python-pptx
- .pdf via pdftotext (default) or pdfminer
- .txt via python builtins
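The list above amounts to a mapping from file extension to a backend parser, hidden behind a single entry point. A minimal sketch of that dispatch idea, with hypothetical names that are illustrative only and not the package's actual internals:

```python
import os


def parse_txt(path):
    # Trivial backend: read the file's raw bytes
    with open(path, "rb") as f:
        return f.read()


# One parser callable per supported extension; other formats (.pdf, .docx,
# etc.) would map to thin wrappers around their respective libraries or tools
PARSERS = {
    ".txt": parse_txt,
}


def process(path):
    # Dispatch on the (case-insensitive) file extension
    _, ext = os.path.splitext(path)
    try:
        parser = PARSERS[ext.lower()]
    except KeyError:
        raise ValueError("no parser registered for %r" % ext)
    return parser(path)
```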
This package is built on top of several python packages and other open source libraries. In particular, it depends on lxml, which in turn requires a few system libraries to be installed. On Ubuntu/Debian, you will need to run:
```shell
apt-get install python-dev libxml2-dev libxslt1-dev antiword poppler-utils
pip install textract
```