.. textract documentation master file, created by
   sphinx-quickstart on Fri Jul 4 11:09:09 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

textract
================================

As undesirable as it might be, more often than not there is extremely
useful information embedded in Word documents, PowerPoint
presentations, PDFs, etc. (so-called "dark data") that would be
valuable for further textual analysis and visualization. While
:ref:`several packages ` exist for extracting content from each of
these formats on their own, this package provides a single interface
for extracting content from any type of file, without any irrelevant
markup.

This package provides two primary facilities for doing this: the
:ref:`command line interface `

.. code-block:: bash

    textract path/to/file.extension

or the :ref:`python package `

.. code-block:: python

    # some python file
    import textract
    text = textract.process("path/to/file.extension")

.. _supporting:

Currently supporting
--------------------

textract supports a growing list of file types for text extraction. If
you don't see your favorite file type here, please recommend it by
either mentioning it on the `issue tracker `_ or by
:ref:`contributing a pull request `.


* ``.csv`` via python builtins
* ``.tsv`` and ``.tab`` via python builtins
* ``.doc`` via `antiword`_
* ``.docx`` via `python-docx2txt`_
* ``.eml`` via python builtins
* ``.epub`` via `ebooklib`_
* ``.gif`` via `tesseract-ocr`_
* ``.jpg`` and ``.jpeg`` via `tesseract-ocr`_
* ``.json`` via python builtins
* ``.html`` and ``.htm`` via `beautifulsoup4`_
* ``.mp3`` via `sox`_, `SpeechRecognition`_, and `pocketsphinx`_
* ``.msg`` via `msg-extractor`_
* ``.odt`` via python builtins
* ``.ogg`` via `sox`_, `SpeechRecognition`_, and `pocketsphinx`_
* ``.pdf`` via `pdftotext`_ (default) or `pdfminer.six`_
* ``.png`` via `tesseract-ocr`_
* ``.pptx`` via `python-pptx`_
* ``.ps`` via `ps2ascii`_
* ``.rtf`` via `unrtf`_
* ``.tiff`` and ``.tif`` via `tesseract-ocr`_
* ``.txt`` via python builtins
* ``.wav`` via `SpeechRecognition`_ and `pocketsphinx`_
* ``.xlsx`` via `xlrd `_
* ``.xls`` via `xlrd `_

.. this is a list of all the packages that textract uses for extraction

.. _antiword: http://www.winfield.demon.nl/
.. _beautifulsoup4: http://beautiful-soup-4.readthedocs.org/en/latest/
.. _ebooklib: https://github.com/aerkalov/ebooklib
.. _msg-extractor: https://github.com/mattgwwalker/msg-extractor
.. _pdfminer.six: https://github.com/goulu/pdfminer
.. _pdftotext: http://poppler.freedesktop.org/
.. _pocketsphinx: https://github.com/cmusphinx/pocketsphinx/
.. _ps2ascii: https://www.ghostscript.com/doc/current/Use.htm
.. _python-docx2txt: https://github.com/ankushshah89/python-docx2txt
.. _python-pptx: https://python-pptx.readthedocs.org/en/latest/
.. _SpeechRecognition: https://pypi.python.org/pypi/SpeechRecognition/
.. _sox: http://sox.sourceforge.net/
.. _tesseract-ocr: https://code.google.com/p/tesseract-ocr/
.. _unrtf: http://www.gnu.org/software/unrtf/

.. _related-projects:

Related projects
----------------

Of course, textract isn't the first project with the aim to provide a
simple interface for extracting text from any document.
But this is, to the best of my knowledge, the only project that is
written in Python (a language commonly chosen by the natural language
processing community) and is :ref:`method agnostic about how content
is extracted `. I'm sure there are other similar projects out there,
but here is a small sample:

* `Apache Tika `_ has `very similar, if not identical, aims as
  textract `_ and has impressive coverage of a wide range of file
  formats. It is written in Java.

* `textract (node.js) `_ has similar aims to this textract package
  (including an identical name! great minds...). It is written in
  Node.js.

* `pandoc `_ is intended to be a document conversion tool (a much
  more difficult task!), but it does have `the ability to convert to
  plain text `_. It is written in Haskell.

Contents:

.. toctree::
   :maxdepth: 2

   command_line_interface
   python_package
   installation
   contributing
   changelog


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`