Python package
This package is organized to make it as easy as possible to add new extensions and support the continued growth and coverage of textract. For almost all applications, you will just have to do something like this:
import textract
text = textract.process('path/to/file.extension')
to obtain text from a document. You can also pass keyword arguments to textract.process, for example, to use a particular method for parsing a PDF like this:
import textract
text = textract.process('path/to/a.pdf', method='pdfminer')
or to specify a particular output encoding (input encodings are inferred using chardet):
import textract
text = textract.process('path/to/file.extension', encoding='ascii')
Additional options
Some parsers also enable additional options which can be passed in as keyword arguments to the textract.process function. Here is a quick table of the options available to the different types of parsers:
parser | option | description |
---|---|---|
gif | language | Specify the language for OCR-ing text with tesseract |
jpg | language | Specify the language for OCR-ing text with tesseract |
png | language | Specify the language for OCR-ing text with tesseract |
pdf | language | For use when method='tesseract', specify the language |
tiff | language | Specify the language for OCR-ing text with tesseract |
As an example of using these additional options, you can extract text from a Norwegian PDF using Tesseract OCR like this:
text = textract.process(
    'path/to/norwegian.pdf',
    method='tesseract',
    language='nor',
)
A look under the hood
When textract.process('path/to/file.extension') is called, textract.process looks for a module called textract.parsers.extension_parser that contains a Parser class.
- textract.parsers.process(filename, encoding='utf_8', **kwargs)
  This is the core function used for extracting text. It routes the filename to the appropriate parser and returns the extracted text as a byte-string encoded with encoding.
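The lookup described above can be sketched roughly as follows. This is an illustration of the naming convention only, not textract's actual routing code: the helper name parser_module_name is hypothetical, and lower-casing the extension is an assumption.

```python
import os


def parser_module_name(filename):
    """Return the dotted module name expected to hold the Parser for filename.

    Hypothetical helper: it only mirrors the
    textract.parsers.<extension>_parser naming convention.
    """
    # Split off the extension, e.g. 'report.PDF' -> '.PDF'
    _, ext = os.path.splitext(filename)
    # Drop the leading dot and lower-case it (lower-casing is an assumption)
    ext = ext.lstrip('.').lower()
    if not ext:
        raise ValueError('no file extension in %r' % filename)
    return 'textract.parsers.%s_parser' % ext


print(parser_module_name('path/to/a.pdf'))
```

A name built this way could then be handed to importlib.import_module, after which the module's Parser class would be instantiated and asked to process the file.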
Importantly, the textract.parsers.extension_parser.Parser class must inherit from textract.parsers.utils.BaseParser.
- class textract.parsers.utils.BaseParser
  Bases: object
  The BaseParser abstracts out some common functionality that is used across all document Parsers. In particular, it has the responsibility of handling all unicode and byte-encoding.
  - encode(text, encoding)
    Encode the text in encoding byte-encoding. This ignores code points that can't be encoded in byte-strings.
  - extract(filename, **kwargs)
    This method must be overwritten by child classes to extract raw text from a filename. This method can return either a byte-encoded string or unicode.
  - process(filename, encoding, **kwargs)
    Process filename and encode byte-string with encoding. This method is called by textract.parsers.process() and wraps the BaseParser.extract() method in a delicious unicode sandwich.
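To make the "unicode sandwich" concrete, here is a minimal stand-in that follows the same extract / process / encode split. It is a sketch written for this page, not textract's source: the class names SketchBaseParser and FixedTextParser are hypothetical, and the decode step assumes UTF-8 input, whereas textract infers input encodings with chardet.

```python
class SketchBaseParser(object):
    """Simplified stand-in for the BaseParser pattern (not textract's code)."""

    def extract(self, filename, **kwargs):
        # Child classes must replace this; it may return bytes or unicode.
        raise NotImplementedError('extract must be overwritten by child classes')

    def decode(self, text):
        # Assumption: byte input is UTF-8. textract infers this with chardet.
        if isinstance(text, bytes):
            return text.decode('utf-8', 'ignore')
        return text

    def encode(self, text, encoding):
        # 'ignore' drops code points that the target encoding cannot represent.
        return text.encode(encoding, 'ignore')

    def process(self, filename, encoding, **kwargs):
        # Bytes or unicode in, unicode in the middle, bytes out.
        raw = self.extract(filename, **kwargs)
        return self.encode(self.decode(raw), encoding)


class FixedTextParser(SketchBaseParser):
    """Hypothetical child parser that ignores the file and returns fixed text."""

    def extract(self, filename, **kwargs):
        return u'na\u00efve caf\u00e9'


print(FixedTextParser().process('ignored.txt', 'ascii'))
```

With the 'ascii' encoding the two accented characters are dropped, so the call returns b'nave caf'; with 'utf_8' the text round-trips unchanged. The child class only has to supply extract, which is the point of the design.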
Many of the parsers rely on command line utilities to do some of the parsing. The textract.parsers.utils.ShellParser class includes some convenience methods for streamlining access to the command line.
- class textract.parsers.utils.ShellParser
  Bases: textract.parsers.utils.BaseParser
  The ShellParser extends the BaseParser to make it easy to run external programs from the command line with Fabric-like behavior.
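As a rough idea of what such a convenience method might look like, the sketch below wraps the standard subprocess module. The function name run_command and its error handling are assumptions made for illustration; they are not taken from ShellParser itself.

```python
import subprocess
import sys


def run_command(args):
    """Run an external program and return its (stdout, stderr) as bytes.

    Hypothetical helper, not ShellParser's API.
    """
    pipe = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = pipe.communicate()
    # Treat a non-zero exit status as a failure instead of returning bad text.
    if pipe.returncode != 0:
        raise RuntimeError(
            'command %r exited with status %d' % (args, pipe.returncode)
        )
    return stdout, stderr


# Use the running Python interpreter as a command that is sure to exist.
out, err = run_command([sys.executable, '-c', 'print("hello")'])
print(out.strip())
```

A shell-backed parser could call a helper like this from its extract method and return the captured standard output, leaving the encoding work to the base class.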
A few specific examples
There are quite a few parsers included with textract. Rather than elaborating all of them, here are a few that demonstrate how parsers work.
- class textract.parsers.epub_parser.Parser
  Bases: textract.parsers.utils.BaseParser
  Extract text from epub using python epub library.
- class textract.parsers.doc_parser.Parser
  Bases: textract.parsers.utils.ShellParser
  Extract text from doc files using antiword.