Python package

This package is organized to make it as easy as possible to add new extensions and support the continued growth and coverage of textract. For almost all applications, you will just have to do something like this:

import textract
text = textract.process('path/to/file.extension')

to obtain text from a document.

For completeness, we also include here the documentation for the specific file extension parsers, as well as a few other essential bits in the textract.exceptions and textract.shell modules.
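The parser modules documented on this page are mostly named after the file extension they handle (doc_parser, docx_parser, pdf_parser, and so on). As a minimal sketch of that naming pattern, the helper below derives a module name from a filename. The name parser_module_name is hypothetical and exists only for illustration; it is not part of textract, and textract's own lookup may work differently.

```python
import os


def parser_module_name(filename):
    """Return the dotted module name matching the extension of ``filename``.

    Hypothetical helper for illustration only; not part of textract.
    """
    _, ext = os.path.splitext(filename)
    ext = ext.lower().lstrip(".")
    if not ext:
        raise ValueError("filename has no extension: %r" % filename)
    return "textract.parsers.%s_parser" % ext


print(parser_module_name("path/to/file.docx"))  # textract.parsers.docx_parser
```

Under this pattern, supporting a new extension amounts to adding one module with an extract(filename, **kwargs) function.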

textract.parsers.doc_parser module

textract.parsers.doc_parser.extract(filename, **kwargs)

Extract text from doc files using antiword.

textract.parsers.docx_parser module

textract.parsers.docx_parser.extract(filename, **kwargs)

Extract text from a docx file using python-docx.

textract.parsers.eml_parser module

textract.parsers.eml_parser.extract(filename, **kwargs)

Extract text from email messages in .eml format. This gets the subject and all text from the contents.
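To illustrate what "the subject and all text from the contents" can look like, here is a minimal sketch built on Python's standard email package. It is not textract's implementation: the function name eml_text is hypothetical, it reads the message from a string rather than a file, and it assumes that only text/plain parts count as text.

```python
from email import message_from_string, policy


def eml_text(raw_message):
    """Return the subject followed by every text/plain part of a message.

    Hypothetical helper for illustration only; not part of textract.
    """
    msg = message_from_string(raw_message, policy=policy.default)
    parts = [str(msg["subject"] or "")]
    for part in msg.walk():
        # Assumption: only plain-text parts are treated as content.
        if part.get_content_type() == "text/plain":
            parts.append(part.get_content())
    return "\n".join(p.strip() for p in parts if p.strip())


raw = (
    "Subject: Meeting notes\n"
    "From: someone@example.com\n"
    "Content-Type: text/plain\n"
    "\n"
    "See you at ten.\n"
)
print(eml_text(raw))
```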

textract.parsers.gif_parser module

textract.parsers.html_parser module

textract.parsers.html_parser.extract(filename, **kwargs)

Extract text from an html file using beautifulsoup4. Filter the text to only show the visible parts of the page. Inspiration from here.
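The idea of keeping only the visible parts of a page can be sketched without any third-party dependency. The example below swaps beautifulsoup4 for the standard library's html.parser, so it is not how textract does it. The names VisibleTextParser and visible_text are hypothetical, and the set of hidden tags is an assumption.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside a non-visible element."""

    # Assumption: these tags never contribute visible text.
    HIDDEN = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Comments arrive through handle_comment, so they never reach here.
        if not self._hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, separated by spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><title>T</title><style>p {}</style></head>"
    "<body><p>Hello</p><script>var x;</script><!-- note --><p>world</p></body></html>"
)
print(visible_text(page))  # Hello world
```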

textract.parsers.jpg_parser module

textract.parsers.json_parser module

textract.parsers.json_parser.extract(filename, **kwargs)

Extract all of the string values of a json file (no keys as those are, in some sense, markup). This is useful for parsing content from mongodb dumps, for example.

textract.parsers.json_parser.get_text(deserialized_json)

Recursively get text from subcomponents of a deserialized json. To enforce the same order on the documents, make sure to read keys of deserialized_json in a consistent (alphabetical) order.
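The recursion described above can be sketched as follows. This is an illustration of the technique, not the source of get_text: the name collect_strings is hypothetical, and joining the pieces with a single space is an assumption.

```python
import json


def collect_strings(deserialized_json):
    """Recursively gather string values, reading dict keys alphabetically.

    Hypothetical helper for illustration only; not part of textract.
    """
    if isinstance(deserialized_json, dict):
        # Sorting the keys makes the output order reproducible.
        values = [deserialized_json[key] for key in sorted(deserialized_json)]
        return " ".join(t for t in map(collect_strings, values) if t)
    if isinstance(deserialized_json, list):
        return " ".join(t for t in map(collect_strings, deserialized_json) if t)
    if isinstance(deserialized_json, str):
        return deserialized_json
    return ""  # numbers, booleans and null carry no text


document = json.loads('{"title": "Hello", "body": {"text": "world", "count": 3}}')
print(collect_strings(document))  # world Hello
```

Keys are dropped and "body" sorts before "title", which is why "world" is printed first.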

textract.parsers.odt_parser module

class textract.parsers.odt_parser.OpenDocumentTextFile(filepath)

textToString(element)

toString()

Converts the document to a string.

textract.parsers.odt_parser.extract(filename, **kwargs)

Extract text from open document files.
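An OpenDocument text file is a zip archive whose body lives in content.xml, so the extraction can be sketched with the standard library alone. The function odt_to_text below is hypothetical and is not the OpenDocumentTextFile class documented here. It assumes that only text:p and text:h elements matter and that they are not nested inside one another.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument "text" elements.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_to_text(file_or_path):
    """Return one line per paragraph or heading found in content.xml.

    Hypothetical helper for illustration only; not part of textract.
    """
    with zipfile.ZipFile(file_or_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = {"{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS}
    lines = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also picks up text inside nested spans.
            lines.append("".join(element.itertext()))
    return "\n".join(lines)
```

Calling odt_to_text("path/to/file.odt") returns the document body as plain text; a file-like object works as well, since zipfile.ZipFile accepts both.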

textract.parsers.pdf_parser module

textract.parsers.pdf_parser.extract(filename, method='', **kwargs)

Extract text from pdf files using method.
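The method argument selects which backend does the work. Below is a minimal sketch of such a dispatch, using placeholder backends so that it runs on its own. The method names 'pdftotext' and 'pdfminer' are inferred from the function names in this module, and treating the empty default as 'pdftotext' is an assumption; neither is confirmed by this page. The classes and functions here are stand-ins, not textract's code.

```python
class UnknownMethod(Exception):
    """Stand-in for textract.exceptions.UnknownMethod (illustration only)."""

    def __init__(self, method):
        super().__init__("unknown method: %r" % (method,))
        self.method = method


def extract_pdftotext(filename):
    return "pdftotext output for %s" % filename  # placeholder backend


def extract_pdfminer(filename):
    return "pdfminer output for %s" % filename  # placeholder backend


# Assumed method names, inferred from the function names above.
_BACKENDS = {"pdftotext": extract_pdftotext, "pdfminer": extract_pdfminer}


def extract(filename, method="", **kwargs):
    """Dispatch to the backend named by ``method``."""
    chosen = method or "pdftotext"  # assumption: empty means pdftotext
    if chosen not in _BACKENDS:
        raise UnknownMethod(method)
    return _BACKENDS[chosen](filename)


print(extract("paper.pdf", method="pdfminer"))
```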

textract.parsers.pdf_parser.extract_pdfminer(filename)

Extract text from pdfs using pdfminer.

textract.parsers.pdf_parser.extract_pdftotext(filename)

Extract text from pdfs using the pdftotext command line utility.

textract.parsers.png_parser module

textract.parsers.pptx_parser module

textract.parsers.pptx_parser.extract(filename, **kwargs)

Extract text from a pptx file using python-pptx.

textract.parsers.ps_parser module

textract.parsers.ps_parser.extract(filename, **kwargs)

Extract text from postscript files using the pstotext command.

textract.parsers.tesseract module

textract.parsers.tesseract.extract(filename, **kwargs)

Extract text from various image file formats using tesseract-ocr.

textract.parsers.txt_parser module

textract.parsers.txt_parser.extract(filename, **kwargs)

Extract text from a .txt file.

textract.cli module

textract.cli.get_parser()

Initialize the parser for the command line interface and bind the autocompletion functionality.

textract.exceptions module

exception textract.exceptions.CommandLineError

Bases: exceptions.Exception

The traceback of every CommandLineError is suppressed when the error occurs on the command line, so that the command line interface stays readable.

render(msg)

exception textract.exceptions.ExtensionNotSupported(ext)

Bases: textract.exceptions.CommandLineError

This error is raised when the file extension is not supported.

exception textract.exceptions.MissingFileError(filename)

Bases: textract.exceptions.CommandLineError

This error is raised when the file cannot be located at the specified path.

exception textract.exceptions.ShellError(command, exit_code)

Bases: textract.exceptions.CommandLineError

This error is raised when shell.run returns a non-zero exit code (meaning the command failed).

failed_message()

is_uninstalled()

uninstalled_message()

exception textract.exceptions.UnknownMethod(method)

Bases: textract.exceptions.CommandLineError

This error is raised when the specified --method on the command line is unknown.
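Because every exception in this module derives from CommandLineError, a command line entry point can catch the base class once and print a short message instead of a traceback. The sketch below shows that pattern with stand-in classes. The error message text and the run_cli wrapper are hypothetical and are not taken from textract.

```python
import sys


class CommandLineError(Exception):
    """Stand-in for textract.exceptions.CommandLineError (illustration only)."""


class ExtensionNotSupported(CommandLineError):
    """Stand-in subclass; the message wording is an assumption."""

    def __init__(self, ext):
        super().__init__("extension %r is not supported" % (ext,))
        self.ext = ext


def run_cli(action):
    """Call ``action`` and turn any CommandLineError into an exit code."""
    try:
        action()
    except CommandLineError as error:
        # Print the message only; the traceback is deliberately dropped.
        print(error, file=sys.stderr)
        return 1
    return 0


def unsupported():
    raise ExtensionNotSupported(".xyz")


exit_code = run_cli(unsupported)
print(exit_code)  # 1
```

Exceptions that are not command line errors still propagate with a full traceback, which keeps genuine bugs visible.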

textract.shell module

textract.shell.run(command)

Run the specified shell command using Fabric-like behavior.
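As a rough illustration of running a command and failing loudly on a non-zero exit code, here is a sketch based on the standard subprocess module. It is not the implementation of textract.shell.run: it takes a list of arguments rather than a shell string, returns standard output as text, and uses a stand-in ShellError whose message wording is an assumption.

```python
import subprocess
import sys


class ShellError(Exception):
    """Stand-in for textract.exceptions.ShellError (illustration only)."""

    def __init__(self, command, exit_code):
        super().__init__("%r failed with exit code %d" % (command, exit_code))
        self.command = command
        self.exit_code = exit_code


def run(command):
    """Run ``command`` (a list of arguments) and return its standard output."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str
    )
    if completed.returncode != 0:
        raise ShellError(command, completed.returncode)
    return completed.stdout


# The Python interpreter itself serves as a portable example command.
output = run([sys.executable, "-c", "print('hello')"])
print(output.strip())  # hello
```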