Python package

This package is organized so that adding support for new file extensions is as easy as possible, which lets textract keep extending its coverage. For almost all applications, you only need something like this:

import textract
text = textract.process('path/to/file.extension')

to obtain text from a document. You can also pass keyword arguments to textract.process, for example, to use a particular method for parsing a pdf like this:

import textract
text = textract.process('path/to/a.pdf', method='pdfminer')

or to specify a particular output encoding (input encodings are inferred using chardet):

import textract
text = textract.process('path/to/file.extension', encoding='ascii')

Additional options

Some parsers accept additional options, which are passed as keyword arguments to the textract.process function. The table below lists the options available for each parser:

parser	option	description
gif	language	Specify the language for OCR-ing text with tesseract
jpg	language	Specify the language for OCR-ing text with tesseract
png	language	Specify the language for OCR-ing text with tesseract
pdf	language	For use when `method='tesseract'`, specify the language
tiff	language	Specify the language for OCR-ing text with tesseract

As an example of using these additional options, you can extract text from a Norwegian PDF using Tesseract OCR like this:

import textract
text = textract.process(
    'path/to/norwegian.pdf',
    method='tesseract',
    language='nor',
)
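To make the role of the language option more concrete, here is a minimal sketch of how such a keyword argument could be turned into a tesseract command line. The function name tesseract_args and the argument layout are assumptions made for illustration; this page does not show how textract builds the command internally. The -l flag itself is tesseract's own language option.

```python
def tesseract_args(filename, language=None):
    """Build an argument list for the tesseract command line tool.

    Hypothetical helper for illustration; not part of textract's API.
    """
    # "stdout" asks tesseract to print the recognized text instead of
    # writing it to an output file.
    args = ["tesseract", filename, "stdout"]
    if language is not None:
        # -l selects the OCR language, e.g. "nor" for Norwegian.
        args += ["-l", language]
    return args
```

When language is omitted, the flag is left off and tesseract falls back to its default language.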

A look under the hood

When textract.process('path/to/file.extension') is called, it looks for a module named textract.parsers.extension_parser, where extension is the file's extension, and expects that module to define a Parser class.

textract.parsers.process(filename, encoding='utf_8', **kwargs): This is the core function used for extracting text. It routes the filename to the appropriate parser and returns the extracted text as a byte-string encoded with encoding.
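The naming convention can be sketched in a few lines. This is a simplified illustration, not textract's actual routing code: the helper name parser_module_name is hypothetical, and lower-casing the extension is an assumption.

```python
import os


def parser_module_name(filename):
    """Return the dotted module name expected for a filename's extension.

    Hypothetical helper for illustration; not part of textract's API.
    """
    # os.path.splitext('path/to/a.PDF') returns ('path/to/a', '.PDF').
    _, ext = os.path.splitext(filename)
    # Drop the leading dot; lower-casing is an assumption of this sketch.
    ext = ext.lstrip(".").lower()
    return "textract.parsers.%s_parser" % ext
```

The resulting name could then be handed to importlib.import_module, after which the module's Parser class would be instantiated and asked to process the file.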

Importantly, the textract.parsers.extension_parser.Parser class must inherit from textract.parsers.utils.BaseParser.

class textract.parsers.utils.BaseParser

Bases: object

The BaseParser abstracts out common functionality that is used across all document Parsers. In particular, it is responsible for all unicode decoding and byte-encoding.

decode(text): Decode text using the chardet package.

encode(text, encoding): Encode text as a byte-string in the given encoding. Code points that cannot be represented in that encoding are ignored.

extract(filename, **kwargs): This method must be overridden by child classes to extract raw text from filename. It can return either a byte-encoded string or unicode.

process(filename, encoding, **kwargs): Process filename and return the extracted text as a byte-string encoded with encoding. This method is called by textract.parsers.process() and wraps the BaseParser.extract() method in a delicious unicode sandwich: whatever extract() returns is decoded to unicode, then encoded once on the way out.
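The division of labor between these four methods might look roughly like the sketch below. It is a dependency-free stand-in rather than textract's implementation: SketchBaseParser and PlainFileParser are hypothetical names, and decode() simply assumes UTF-8 input where textract detects the encoding with chardet.

```python
class SketchBaseParser(object):
    """Illustrative stand-in; not textract's actual BaseParser."""

    def extract(self, filename, **kwargs):
        # Child classes supply the format-specific extraction.
        raise NotImplementedError("child classes must override extract()")

    def decode(self, text):
        # Already unicode: nothing to do.
        if isinstance(text, str):
            return text
        # Assumption: UTF-8 input. textract infers the encoding with chardet.
        return text.decode("utf-8", "ignore")

    def encode(self, text, encoding):
        # Silently drop code points the target encoding cannot represent.
        return text.encode(encoding, "ignore")

    def process(self, filename, encoding, **kwargs):
        raw = self.extract(filename, **kwargs)  # bytes or unicode
        text = self.decode(raw)                 # bytes -> unicode
        return self.encode(text, encoding)      # unicode -> bytes


class PlainFileParser(SketchBaseParser):
    """Hypothetical parser that returns a file's raw bytes."""

    def extract(self, filename, **kwargs):
        with open(filename, "rb") as stream:
            return stream.read()
```

A child class only has to implement extract(); the decoding and encoding on either side of it are inherited, so individual parsers never deal with output encodings themselves.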

Many of the parsers rely on command line utilities to do some of the parsing. The textract.parsers.utils.ShellParser class provides convenience methods that streamline access to the command line.

class textract.parsers.utils.ShellParser

Bases: textract.parsers.utils.BaseParser

The ShellParser extends the BaseParser to make it easy to run external programs from the command line with Fabric-like behavior.

run(command): Run command and return the resulting stdout and stderr as a tuple. If the command fails, this raises a textract.exceptions.ShellError.

temp_filename(): Return a unique tempfile name.
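As a rough sketch of what these two helpers do, the functions below reproduce the described behavior with the standard library only. SketchShellError stands in for textract.exceptions.ShellError. Taking the command as a list of arguments and treating a non-zero exit status as failure are both assumptions of this sketch.

```python
import os
import subprocess
import tempfile


class SketchShellError(Exception):
    """Stand-in for textract.exceptions.ShellError."""


def run(args):
    """Run args (a list of strings) and return (stdout, stderr) as bytes."""
    pipe = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    stdout, stderr = pipe.communicate()
    # Assumption: any non-zero exit status counts as failure.
    if pipe.returncode != 0:
        raise SketchShellError(
            "%r exited with status %d: %r" % (args, pipe.returncode, stderr)
        )
    return stdout, stderr


def temp_filename():
    """Create an empty temporary file and return its path."""
    handle, filename = tempfile.mkstemp()
    # Close the OS-level handle; the caller only needs the name.
    os.close(handle)
    return filename
```

A shell-based parser's extract() method could then reduce to a single run() call on the external tool followed by returning its stdout.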

A few specific examples

There are quite a few parsers included with textract. Rather than elaborating on all of them, here are a few that demonstrate how parsers work.

class textract.parsers.epub_parser.Parser

Bases: textract.parsers.utils.BaseParser

Extract text from epub files using a Python epub library.

extract(filename, **kwargs)

class textract.parsers.doc_parser.Parser

Bases: textract.parsers.utils.ShellParser

Extract text from doc files using antiword.

extract(filename, **kwargs)