Python package
This package is organized to make it as easy as possible to add new extensions and support the continued growth and coverage of textract. For almost all applications, you will just have to do something like this:
import textract
text = textract.process('path/to/file.extension')
to obtain text from a document.
For completeness, we also include here the documentation for the specific file extension parsers, as well as a few other essential bits in the textract.exceptions and textract.shell modules.
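The parser modules documented on this page are named after the file extension they handle (doc_parser, docx_parser, and so on). As a minimal sketch of how such a naming convention can map a filename to a module path, the helper below derives the dotted name from the extension. The function `parser_module_name` is hypothetical and is not part of the documented textract API; it only illustrates the naming pattern.

```python
import os


def parser_module_name(filename):
    """Return the dotted module path that would handle `filename`.

    Hypothetical helper for illustration only; it mirrors the
    `<extension>_parser` naming pattern of the modules on this page.
    """
    _, ext = os.path.splitext(filename)
    ext = ext.lower().lstrip(".")
    if not ext:
        raise ValueError("no file extension in %r" % filename)
    return "textract.parsers.%s_parser" % ext


print(parser_module_name("report.DOCX"))
```

Under this assumption, supporting a new extension amounts to adding one more module that follows the same naming pattern.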
textract.parsers.doc_parser module
- class textract.parsers.doc_parser.Parser
Bases: textract.parsers.utils.ShellParser
Extract text from doc files using antiword.
textract.parsers.docx_parser module
- class textract.parsers.docx_parser.Parser
Bases: textract.parsers.utils.BaseParser
Extract text from docx files using python-docx.
textract.parsers.eml_parser module
- class textract.parsers.eml_parser.Parser
Bases: textract.parsers.utils.BaseParser
Extract text from email messages in .eml format. This gets the subject and all text from the contents.
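As a rough illustration of collecting the subject and all text content from a message, the sketch below uses only the standard library `email` package. The function name `eml_text` and the inline sample message are assumptions made for this example; the parser's own implementation may differ.

```python
from email import message_from_string


def eml_text(raw):
    """Return the subject followed by the payload of every text/* part."""
    message = message_from_string(raw)
    pieces = [message["subject"] or ""]
    # walk() visits the message itself and, for multipart mail, each sub-part
    for part in message.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)  # bytes
            charset = part.get_content_charset() or "utf-8"
            pieces.append(payload.decode(charset, errors="replace"))
    return "\n".join(pieces)


sample = "Subject: Hello\nContent-Type: text/plain\n\nBody text here.\n"
print(eml_text(sample))
```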
textract.parsers.epub_parser module
- class textract.parsers.epub_parser.Parser
Bases: textract.parsers.utils.BaseParser
Extract text from epub files using the Python epub library.
textract.parsers.gif_parser module
textract.parsers.html_parser module
textract.parsers.jpg_parser module
textract.parsers.json_parser module
- class textract.parsers.json_parser.Parser
Bases: textract.parsers.utils.BaseParser
Extract all of the string values of a json file (no keys as those are, in some sense, markup). This is useful for parsing content from mongodb dumps, for example.
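The idea of keeping string values while discarding keys can be sketched as a recursive walk over the decoded JSON. The generator `string_values` and the sample document are hypothetical and stand in for whatever the parser does internally.

```python
import json


def string_values(node):
    """Yield every string value in a decoded JSON structure, skipping keys."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from string_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from string_values(item)
    # numbers, booleans and null carry no text and are ignored


document = json.loads(
    '{"title": "Intro", "tags": ["a", "b"], "meta": {"n": 3, "note": "ok"}}'
)
text = " ".join(string_values(document))
print(text)
```

Keys such as "title" and "meta" never appear in the output, which matches the description of keys as markup.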
textract.parsers.odt_parser module
textract.parsers.pdf_parser module
- class textract.parsers.pdf_parser.Parser
Bases: textract.parsers.utils.ShellParser
Extract text from pdf files using either the pdftotext method (default) or the pdfminer method.
textract.parsers.png_parser module
textract.parsers.pptx_parser module
- class textract.parsers.pptx_parser.Parser
Bases: textract.parsers.utils.BaseParser
Extract text from pptx files using python-pptx.
textract.parsers.ps_parser module
- class textract.parsers.ps_parser.Parser
Bases: textract.parsers.utils.ShellParser
Extract text from PostScript files using the pstotext command.
textract.parsers.tesseract module
Process an image file using tesseract.
- class textract.parsers.tesseract.Parser
Bases: textract.parsers.utils.ShellParser
Extract text from various image file formats using tesseract-ocr.
textract.parsers.txt_parser module
- class textract.parsers.txt_parser.Parser
Bases: textract.parsers.utils.BaseParser
Parse .txt files.
textract.parsers.utils module
This module includes several convenience base classes that are reused in many of the other parser modules.
- class textract.parsers.utils.BaseParser
Bases: object
The BaseParser abstracts out some common functionality that is used across all document formats. Specifically, it owns the responsibility of handling all unicode and byte-encoding problems.
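The talk linked above recommends decoding bytes to text as early as possible and encoding back to bytes as late as possible. A minimal sketch of that pattern follows; the function names `decode`, `encode` and `process`, and the UTF-8 default, are assumptions for illustration and should not be read as the BaseParser interface.

```python
def decode(data, encoding="utf-8"):
    """Turn incoming bytes into text; pass text through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding, errors="replace")
    return data


def encode(text, encoding="utf-8"):
    """Turn outgoing text back into bytes at the boundary."""
    return text.encode(encoding)


def process(raw, transform=str.strip):
    """Decode on the way in, work on text, encode on the way out."""
    return encode(transform(decode(raw)))


print(process(b"  caf\xc3\xa9  "))
```

Keeping all intermediate work on text means individual format parsers never have to reason about byte encodings themselves.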
Inspiration from http://nedbatchelder.com/text/unipain.html
- class textract.parsers.utils.ShellParser
Bases: textract.parsers.utils.BaseParser
The ShellParser extends the BaseParser to make it easy to run external programs from the command line with Fabric-like behavior.
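Running an external program and failing loudly on a non-zero exit code might look like the sketch below. The `run` function and the local `ShellError` class are stand-ins written for this example (the class only borrows the four-argument signature documented under textract.exceptions); they are not the library's own code.

```python
import subprocess
import sys


class ShellError(Exception):
    """Stand-in for illustration, with the four documented fields."""

    def __init__(self, command, exit_code, stdout, stderr):
        super().__init__("%r exited with code %d" % (command, exit_code))
        self.command = command
        self.exit_code = exit_code
        self.stdout = stdout
        self.stderr = stderr


def run(args):
    """Run `args` (a list of strings) and return (stdout, stderr) as bytes."""
    pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = pipe.communicate()
    if pipe.returncode != 0:
        raise ShellError(args, pipe.returncode, stdout, stderr)
    return stdout, stderr


# The Python interpreter itself serves as a portable external command here.
out, _ = run([sys.executable, "-c", "print('hello')"])
print(out)
```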
textract.parsers.xls_parser module
textract.parsers.xlsx_parser module
- class textract.parsers.xlsx_parser.Parser
Bases: textract.parsers.utils.BaseParser
Extract text from Excel files (.xls/.xlsx).
textract.cli module
Use argparse to handle command-line arguments.
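A small argparse setup of the kind this module describes could look as follows. Only the --method option is mentioned on this page; the positional `filename` argument, the `-m` short flag, the help strings and the `build_parser` name are assumptions made for the sketch.

```python
import argparse


def build_parser():
    """Build a parser with one positional argument and one option."""
    parser = argparse.ArgumentParser(description="Extract text from a document.")
    parser.add_argument("filename", help="path to the document")
    parser.add_argument(
        "-m", "--method", default="", help="extraction method to use"
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["report.pdf", "--method", "pdfminer"])
print(args.filename, args.method)
```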
textract.exceptions module
- exception textract.exceptions.CommandLineError
Bases: exceptions.Exception
The traceback of every CommandLineError is suppressed when the error occurs on the command line, to provide a useful command line interface.
- exception textract.exceptions.ExtensionNotSupported(ext)
Bases: textract.exceptions.CommandLineError
This error is raised when the file extension is not supported.
- exception textract.exceptions.MissingFileError(filename)
Bases: textract.exceptions.CommandLineError
This error is raised when the file cannot be located at the specified path.
- exception textract.exceptions.ShellError(command, exit_code, stdout, stderr)
Bases: textract.exceptions.CommandLineError
This error is raised when shell.run returns a non-zero exit code (meaning the command failed).
- exception textract.exceptions.UnknownMethod(method)
Bases: textract.exceptions.CommandLineError
This error is raised when the method specified with --method on the command line is unknown.
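Because every exception in this module derives from CommandLineError, a command line entry point can catch that single base class and print a short message instead of a traceback. The sketch below defines local stand-in classes with the same names; the error message text, the `extract` worker and the `main` wrapper are all hypothetical.

```python
import sys


class CommandLineError(Exception):
    """Local stand-in for the documented base exception."""


class UnknownMethod(CommandLineError):
    """Local stand-in; the message format is an assumption."""

    def __init__(self, method):
        super().__init__("unknown method: %s" % method)
        self.method = method


def extract(method):
    """Hypothetical worker that only knows two method names."""
    if method not in ("pdftotext", "pdfminer"):
        raise UnknownMethod(method)
    return "text via %s" % method


def main(method):
    """Report CommandLineError subclasses as a one-line message."""
    try:
        print(extract(method))
        return 0
    except CommandLineError as error:
        sys.stderr.write("%s\n" % error)
        return 1


exit_code = main("ocr")
```

Library callers, by contrast, can catch the specific subclass they care about and still inspect attributes such as `method`.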