Python package
This package is organized to make it as easy as possible to add new extensions and support the continued growth and coverage of textract. For almost all applications, you will just have to do something like this:
import textract
text = textract.process('path/to/file.extension')
to obtain text from a document.
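When processing many files, it can help to wrap the call so that one bad file does not stop the whole run. The following is a minimal sketch, not part of textract: the helper name `extract_or_none` is hypothetical, and it assumes that `textract.process` raises subclasses of `textract.exceptions.CommandLineError` (documented further down this page) for unsupported extensions and similar failures.

```python
import sys


def extract_or_none(path):
    """Return the text extracted from `path`, or None if textract reports an error.

    Hypothetical helper for illustration only; it is not part of textract.
    """
    # Imported lazily so this module can be loaded even where textract
    # is not installed.
    import textract
    from textract.exceptions import CommandLineError

    try:
        return textract.process(path)
    except CommandLineError as error:
        # Assumption: the failures described on this page derive from
        # CommandLineError, so one except clause covers all of them.
        sys.stderr.write("skipping %s: %s\n" % (path, error))
        return None
```

A caller can then loop over a list of paths and keep only the results that are not None.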
For completeness, this page also documents the parsers for specific file extensions, along with a few other essential pieces in the textract.exceptions and textract.shell modules.
textract.parsers.doc_parser module
textract.parsers.docx_parser module
textract.parsers.eml_parser module
textract.parsers.gif_parser module
textract.parsers.html_parser module
textract.parsers.jpg_parser module
textract.parsers.json_parser module
textract.parsers.odt_parser module
textract.parsers.pdf_parser module
- textract.parsers.pdf_parser.extract(filename, method='', **kwargs)
Extract text from PDF files using the specified method.
textract.parsers.png_parser module
textract.parsers.pptx_parser module
textract.parsers.ps_parser module
textract.parsers.tesseract module
textract.parsers.txt_parser module
textract.cli module
textract.exceptions module
- exception textract.exceptions.CommandLineError
Bases: exceptions.Exception
The traceback of every CommandLineError is suppressed when the error occurs on the command line, so that the command line interface shows a useful message instead of a stack trace.
- exception textract.exceptions.ExtensionNotSupported(ext)
Bases: textract.exceptions.CommandLineError
This error is raised when the file extension is not supported.
- exception textract.exceptions.MissingFileError(filename)
Bases: textract.exceptions.CommandLineError
This error is raised when the file cannot be located at the specified path.
- exception textract.exceptions.ShellError(command, exit_code)
Bases: textract.exceptions.CommandLineError
This error is raised when a shell.run call returns a non-zero exit code (meaning the command failed).
- exception textract.exceptions.UnknownMethod(method)
Bases: textract.exceptions.CommandLineError
This error is raised when the specified --method on the command line is unknown.
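Because every exception above derives from CommandLineError, a handler can catch the base class and still tell specific failures apart. The sketch below illustrates that pattern with local stand-in classes that mirror the documented names and constructor arguments. They are not textract's implementation: the message formats, the stored attribute names, and the `describe` helper are assumptions made for the example.

```python
# Stand-in classes mirroring the documented hierarchy. These are NOT
# textract's own classes; messages and attributes are assumptions.
class CommandLineError(Exception):
    """Stand-in for the base class of errors reported on the command line."""


class ExtensionNotSupported(CommandLineError):
    def __init__(self, ext):
        CommandLineError.__init__(self, "unsupported extension: %s" % ext)
        self.ext = ext


class ShellError(CommandLineError):
    def __init__(self, command, exit_code):
        CommandLineError.__init__(
            self, "command %r exited with code %d" % (command, exit_code)
        )
        self.command = command
        self.exit_code = exit_code


def describe(error):
    """Classify an exception, checking the most specific type first."""
    if isinstance(error, ShellError):
        return "external command failed"
    if isinstance(error, CommandLineError):
        return "textract error"
    return "unexpected error"
```

Checking the subclass before the base class matters here: a ShellError is also a CommandLineError, so reversing the two tests would hide the more specific case.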