Python package

This package is organized to make it as easy as possible to add new extensions and support the continued growth and coverage of textract. For almost all applications, you will just have to do something like this:

import textract
text = textract.process('path/to/file.extension')

to obtain text from a document.

For completeness, we also include here the documentation for specific file extension parsers as well as a few other essential bits in the textract.exceptions and textract.shell modules.

textract.parsers.doc_parser module

class textract.parsers.doc_parser.Parser

Bases: textract.parsers.utils.ShellParser

Extract text from doc files using antiword.

extract(filename, **kwargs)

textract.parsers.docx_parser module

class textract.parsers.docx_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from docx file using python-docx.

extract(filename, **kwargs)

textract.parsers.eml_parser module

class textract.parsers.eml_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from email messages in .eml format. This gets the subject and all text from the contents.

extract(filename, **kwargs)
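As a rough illustration of what "the subject and all text from the contents" could look like, the sketch below uses only the standard library email package. The helper name extract_eml_text is hypothetical, and it takes the raw message as a string rather than a filename; this is not necessarily how the textract eml parser is implemented.

```python
import email
from email import policy


def extract_eml_text(raw_message):
    """Return the subject followed by every text/* part of an .eml message.

    Hypothetical helper for illustration; not textract's actual code.
    """
    message = email.message_from_string(raw_message, policy=policy.default)
    pieces = [str(message["subject"] or "")]
    for part in message.walk():
        # Multipart containers have no text of their own; keep text/* parts.
        if part.get_content_maintype() == "text":
            pieces.append(part.get_content())
    return "\n".join(pieces)


sample = "Subject: Hello\nContent-Type: text/plain\n\nBody text here.\n"
text = extract_eml_text(sample)
```

Walking every part means that both the text/plain and text/html alternatives of a message would be collected, which may or may not be what a given application wants.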

textract.parsers.epub_parser module

class textract.parsers.epub_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from epub files using the python epub library.

extract(filename, **kwargs)

textract.parsers.gif_parser module

textract.parsers.html_parser module

class textract.parsers.html_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from html files using beautifulsoup4. Filter text to only show the visible parts of the page. Inspiration from here.

extract(filename, **kwargs)
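The idea of keeping only the visible parts of a page can be sketched as follows. Note that this sketch swaps in the standard library html.parser module instead of beautifulsoup4 so that it runs without extra packages; the class name, the helper name, and the set of tags treated as invisible are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside tags a browser does not render."""

    # Assumed set of non-visible containers; a real filter may differ.
    HIDDEN_TAGS = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only while outside every hidden container.
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello</p><script>var x;</script><p>world</p></body></html>"
)
result = visible_text(page)
```

HTML comments never reach handle_data, so they are excluded without any extra handling.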

textract.parsers.jpg_parser module

textract.parsers.json_parser module

class textract.parsers.json_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract all of the string values of a json file (no keys as those are, in some sense, markup). This is useful for parsing content from mongodb dumps, for example.

extract(filename, **kwargs)
get_text(deserialized_json)

Recursively get text from subcomponents of a deserialized json. To enforce the same order on the documents, make sure to read keys of deserialized_json in a consistent (alphabetical) order.
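The recursion described above can be sketched as follows. This is a minimal illustration rather than the textract source: the helper iter_strings is hypothetical, and joining the collected strings with a single space is an assumption.

```python
import json


def iter_strings(node):
    """Yield string values depth-first, visiting dict keys in sorted order."""
    if isinstance(node, dict):
        # Keys are treated as markup and skipped; sorting keeps order stable.
        for key in sorted(node):
            yield from iter_strings(node[key])
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    elif isinstance(node, str):
        yield node
    # Numbers, booleans and null carry no text in this sketch.


def get_text(deserialized_json):
    return " ".join(iter_strings(deserialized_json))


document = json.loads(
    '{"title": "Report", "body": {"text": "Hello", "pages": 3}, "tags": ["a", "b"]}'
)
result = get_text(document)  # "Hello a b Report"
```

Because keys are read alphabetically ("body", "tags", "title"), two dumps of the same data with different key order produce the same text.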

textract.parsers.odt_parser module

class textract.parsers.odt_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from open document files.

extract(filename, **kwargs)
text_to_string(element)
to_string()

Converts the document to a string.

textract.parsers.pdf_parser module

class textract.parsers.pdf_parser.Parser

Bases: textract.parsers.utils.ShellParser

Extract text from pdf files using either the pdftotext method (default) or the pdfminer method.

extract(filename, method='', **kwargs)
extract_pdfminer(filename)

Extract text from pdfs using pdfminer.

extract_pdftotext(filename)

Extract text from pdfs using the pdftotext command line utility.
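One plausible way for extract to choose between the two methods is a simple dispatch on the method argument, sketched below. The class PdfParserSketch and the local UnknownMethod are stand-ins so the example runs on its own, and the two extract_* bodies are placeholders rather than real extraction code.

```python
class UnknownMethod(Exception):
    """Stand-in for textract.exceptions.UnknownMethod."""


class PdfParserSketch:
    """Illustrative dispatch only; not the textract implementation."""

    def extract(self, filename, method="", **kwargs):
        # An empty method string falls back to the default, pdftotext.
        if method in ("", "pdftotext"):
            return self.extract_pdftotext(filename)
        if method == "pdfminer":
            return self.extract_pdfminer(filename)
        raise UnknownMethod(method)

    def extract_pdftotext(self, filename):
        return "pdftotext:" + filename  # placeholder for the shell call

    def extract_pdfminer(self, filename):
        return "pdfminer:" + filename  # placeholder for the pdfminer call


parser = PdfParserSketch()
default_result = parser.extract("a.pdf")
miner_result = parser.extract("a.pdf", method="pdfminer")
```

From calling code, the method would typically be selected by passing it as a keyword argument, for example textract.process('path/to/file.pdf', method='pdfminer').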

textract.parsers.png_parser module

textract.parsers.pptx_parser module

class textract.parsers.pptx_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from pptx files using python-pptx.

extract(filename, **kwargs)

textract.parsers.ps_parser module

class textract.parsers.ps_parser.Parser

Bases: textract.parsers.utils.ShellParser

Extract text from postscript files using the pstotext command.

extract(filename, **kwargs)

textract.parsers.tesseract module

Process an image file using tesseract.

class textract.parsers.tesseract.Parser

Bases: textract.parsers.utils.ShellParser

Extract text from various image file formats using tesseract-ocr.

extract(filename, **kwargs)

textract.parsers.txt_parser module

class textract.parsers.txt_parser.Parser

Bases: textract.parsers.utils.BaseParser

Parse .txt files.

extract(filename, **kwargs)

textract.parsers.utils module

This module includes a bunch of convenient base classes that are reused in many of the other parser modules.

class textract.parsers.utils.BaseParser

Bases: object

The BaseParser abstracts out some common functionality that is used across all document formats. Specifically, it owns the responsibility of handling all unicode and byte-encoding problems.

Inspiration from http://nedbatchelder.com/text/unipain.html

decode(text)

Decode text using the chardet package.

encode(text, encoding)

Encode the text in encoding byte-encoding. This ignores code points that can’t be encoded in byte-strings.

extract(filename, **kwargs)
process(filename, encoding, **kwargs)

Process filename and encode byte-string with encoding.
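The division of labour between extract, decode, encode and process can be sketched as below. BaseParserSketch is a stand-in written for this example, not the textract class. In particular, decode here simply assumes UTF-8 instead of detecting the encoding with chardet, so that the sketch runs without third-party packages; InMemoryParser is a hypothetical subclass showing that a new format only needs to supply extract.

```python
class BaseParserSketch:
    """Stand-in for textract.parsers.utils.BaseParser (illustrative only)."""

    def extract(self, filename, **kwargs):
        raise NotImplementedError("subclasses return the raw document text")

    def decode(self, text):
        # Assumption: UTF-8 input. textract detects the encoding via chardet.
        if isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        return text

    def encode(self, text, encoding):
        # Code points the target encoding cannot represent are dropped.
        return text.encode(encoding, "ignore")

    def process(self, filename, encoding, **kwargs):
        raw = self.extract(filename, **kwargs)
        return self.encode(self.decode(raw), encoding)


class InMemoryParser(BaseParserSketch):
    """Hypothetical parser that ignores the file and returns fixed bytes."""

    def extract(self, filename, **kwargs):
        return "café".encode("utf-8")


ascii_bytes = InMemoryParser().process("ignored.txt", "ascii")
utf8_bytes = InMemoryParser().process("ignored.txt", "utf_8")
```

With an ASCII target the "é" cannot be represented and is silently dropped, which matches the "ignores code points" behaviour described for encode.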

class textract.parsers.utils.ShellParser

Bases: textract.parsers.utils.BaseParser

The ShellParser extends the BaseParser to make it easy to run external programs from the command line with Fabric-like behavior.

run(command)

Run command and return the subsequent stdout and stderr.

temp_filename()

Return a unique tempfile name.
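A minimal sketch of these two helpers, built on the standard library subprocess and tempfile modules, might look like this. It assumes the command is passed as a list of arguments (the real run may accept a different form), and ShellError here is a local stand-in for textract.exceptions.ShellError.

```python
import os
import subprocess
import sys
import tempfile


class ShellError(Exception):
    """Stand-in for textract.exceptions.ShellError."""

    def __init__(self, command, exit_code, stdout, stderr):
        super().__init__(command, exit_code, stdout, stderr)
        self.command = command
        self.exit_code = exit_code
        self.stdout = stdout
        self.stderr = stderr


def run(command):
    """Run command, returning (stdout, stderr); raise on a non-zero exit."""
    completed = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if completed.returncode != 0:
        raise ShellError(
            command, completed.returncode, completed.stdout, completed.stderr
        )
    return completed.stdout, completed.stderr


def temp_filename():
    """Create an empty temporary file and return its unique path."""
    handle, path = tempfile.mkstemp()
    os.close(handle)
    return path


# Use the current Python interpreter as a portable external program.
stdout, stderr = run([sys.executable, "-c", "print('hello')"])
```

Raising on a non-zero exit code, rather than returning it, lets every parser that shells out report failures in the same way.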

textract.parsers.xls_parser module

textract.parsers.xlsx_parser module

class textract.parsers.xlsx_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from Excel files (.xls/.xlsx).

extract(filename, **kwargs)

textract.cli module

Use argparse to handle command-line arguments.

textract.cli.get_parser()

Initialize the parser for the command line interface and bind the autocompletion functionality.
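The following sketch shows the general shape such a function could take with argparse. The argument names and defaults are illustrative assumptions rather than the actual textract command line, and autocompletion is bound through the optional third-party argcomplete package only when it is installed.

```python
import argparse


def get_parser():
    """Build a command-line parser (flag names here are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a document."
    )
    parser.add_argument("filename", help="path to the document")
    parser.add_argument(
        "-e", "--encoding", default="utf_8", help="output encoding"
    )
    parser.add_argument(
        "-m", "--method", default="", help="extraction method"
    )
    try:
        # Optional dependency: hooks shell tab-completion into the parser.
        import argcomplete
        argcomplete.autocomplete(parser)
    except ImportError:
        pass
    return parser


args = get_parser().parse_args(["report.pdf", "--method", "pdfminer"])
```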

textract.exceptions module

exception textract.exceptions.CommandLineError

Bases: exceptions.Exception

The traceback of every CommandLineError is suppressed when the error occurs on the command line, so that the command line interface shows a useful message instead.

render(msg)

exception textract.exceptions.ExtensionNotSupported(ext)

Bases: textract.exceptions.CommandLineError

This error is raised for unsupported file extensions.

exception textract.exceptions.MissingFileError(filename)

Bases: textract.exceptions.CommandLineError

This error is raised when the file cannot be located at the specified path.

exception textract.exceptions.ShellError(command, exit_code, stdout, stderr)

Bases: textract.exceptions.CommandLineError

This error is raised when a shell.run returns a non-zero exit code (meaning the command failed).

failed_message()
is_uninstalled()
uninstalled_message()

exception textract.exceptions.UnknownMethod(method)

Bases: textract.exceptions.CommandLineError

This error is raised when the method specified with --method on the command line is unknown.
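Since every exception above derives from CommandLineError, a command-line entry point can catch that single base class and print a message in place of a traceback. The sketch below uses local stand-in classes so that it runs on its own; the behaviour of render, the message text, and the main helper are assumptions made for illustration.

```python
import sys


class CommandLineError(Exception):
    """Stand-in base class; render() is assumed to fill in a template."""

    def render(self, msg):
        return msg % vars(self)


class ExtensionNotSupported(CommandLineError):
    def __init__(self, ext):
        self.ext = ext

    def __str__(self):
        # Hypothetical message text, not copied from textract.
        return self.render(
            "The filename extension %(ext)s is not yet supported."
        )


def main(action):
    """Run action, reporting CommandLineError without a traceback."""
    try:
        action()
    except CommandLineError as error:
        sys.stderr.write(str(error) + "\n")
        return 1
    return 0


def failing_action():
    raise ExtensionNotSupported(".xyz")


exit_code = main(failing_action)
```

Library callers can still catch the specific subclasses, such as ExtensionNotSupported, when they need to react differently to each failure.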