Command line interface

textract

Command line tool for extracting text from any document.

usage: textract [-h]
                [-e {aliases,ascii,base64_codec,big5,big5hkscs,bz2_codec,charmap,cp037,cp1006,cp1026,cp1140,cp1250,cp1251,cp1252,cp1253,cp1254,cp1255,cp1256,cp1257,cp1258,cp424,cp437,cp500,cp720,cp737,cp775,cp850,cp852,cp855,cp856,cp857,cp858,cp860,cp861,cp862,cp863,cp864,cp865,cp866,cp869,cp874,cp875,cp932,cp949,cp950,euc_jis_2004,euc_jisx0213,euc_jp,euc_kr,gb18030,gb2312,gbk,hex_codec,hp_roman8,hz,idna,iso2022_jp,iso2022_jp_1,iso2022_jp_2,iso2022_jp_2004,iso2022_jp_3,iso2022_jp_ext,iso2022_kr,iso8859_1,iso8859_10,iso8859_11,iso8859_13,iso8859_14,iso8859_15,iso8859_16,iso8859_2,iso8859_3,iso8859_4,iso8859_5,iso8859_6,iso8859_7,iso8859_8,iso8859_9,johab,koi8_r,koi8_u,latin_1,mac_arabic,mac_centeuro,mac_croatian,mac_cyrillic,mac_farsi,mac_greek,mac_iceland,mac_latin2,mac_roman,mac_romanian,mac_turkish,mbcs,palmos,ptcp154,punycode,quopri_codec,raw_unicode_escape,rot_13,shift_jis,shift_jis_2004,shift_jisx0213,string_escape,tactis,tis_620,undefined,unicode_escape,unicode_internal,utf_16,utf_16_be,utf_16_le,utf_32,utf_32_be,utf_32_le,utf_7,utf_8,utf_8_sig,uu_codec,zlib_codec}]
                [-m METHOD] [-o OUTPUT] [-O OPTION] [-v]
                filename
Positional arguments:
filename Path of the file to extract text from.
Options:
-e=utf_8, --encoding=utf_8 Specify the encoding of the output (default: utf_8).

Possible choices: aliases, ascii, base64_codec, big5, big5hkscs, bz2_codec, charmap, cp037, cp1006, cp1026, cp1140, cp1250, cp1251, cp1252, cp1253, cp1254, cp1255, cp1256, cp1257, cp1258, cp424, cp437, cp500, cp720, cp737, cp775, cp850, cp852, cp855, cp856, cp857, cp858, cp860, cp861, cp862, cp863, cp864, cp865, cp866, cp869, cp874, cp875, cp932, cp949, cp950, euc_jis_2004, euc_jisx0213, euc_jp, euc_kr, gb18030, gb2312, gbk, hex_codec, hp_roman8, hz, idna, iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_2004, iso2022_jp_3, iso2022_jp_ext, iso2022_kr, iso8859_1, iso8859_10, iso8859_11, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_5, iso8859_6, iso8859_7, iso8859_8, iso8859_9, johab, koi8_r, koi8_u, latin_1, mac_arabic, mac_centeuro, mac_croatian, mac_cyrillic, mac_farsi, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish, mbcs, palmos, ptcp154, punycode, quopri_codec, raw_unicode_escape, rot_13, shift_jis, shift_jis_2004, shift_jisx0213, string_escape, tactis, tis_620, undefined, unicode_escape, unicode_internal, utf_16, utf_16_be, utf_16_le, utf_32, utf_32_be, utf_32_le, utf_7, utf_8, utf_8_sig, uu_codec, zlib_codec

-m=, --method= Specify a method of extraction for formats that support it.
-o=-, --output=- Write the extracted text to this file. The default, -, conventionally stands for standard output.
-O, --option Pass arbitrary options of the form KEYWORD=VALUE to the various parsers. A full list of available KEYWORD options is available at http://bit.ly/textract-options
-v, --version Show the program’s version number and exit.
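
The options above can also be assembled programmatically when textract is driven from a script. The following is a minimal sketch, not part of textract itself: the helper names build_textract_command and extract_text are hypothetical, and it assumes that -O/--option may be repeated once per KEYWORD=VALUE pair and that the textract executable is on the PATH.

```python
import subprocess


def build_textract_command(filename, encoding=None, method=None,
                           output=None, options=None):
    """Assemble an argument list for the textract command line tool.

    Hypothetical helper: only the flags documented above are used.
    """
    command = ["textract"]
    if encoding is not None:
        command += ["--encoding", encoding]
    if method is not None:
        command += ["--method", method]
    if output is not None:
        command += ["--output", output]
    # Assumption: --option can be given several times, one KEYWORD=VALUE each.
    for keyword, value in (options or {}).items():
        command += ["--option", f"{keyword}={value}"]
    # The positional filename goes last, as in the usage line.
    command.append(filename)
    return command


def extract_text(filename, **kwargs):
    """Run textract and return the bytes it writes to standard output."""
    command = build_textract_command(filename, **kwargs)
    # check=True raises CalledProcessError if textract exits with an error.
    return subprocess.run(command, check=True, capture_output=True).stdout
```

For instance, build_textract_command("report.pdf", encoding="ascii", output="report.txt") produces the argument list for `textract --encoding ascii --output report.txt report.pdf`. Passing a list rather than a single shell string avoids quoting problems with filenames that contain spaces.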

Note

To make the command line interface as usable as possible, textract supports autocompletion of its available options through @kislyuk’s amazing argcomplete package. Follow the argcomplete instructions for enabling global autocompletion and you should be all set. As an example, this is also configured in the virtual machine provisioning for this project.
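
For readers unfamiliar with argcomplete, the sketch below shows the general pattern by which an argparse-based tool opts in to completion. It is a hypothetical re-creation of a subset of the interface documented here, not textract's actual source: build_parser and parse_arguments are illustrative names, and the import is guarded so the script still works when argcomplete is not installed. Global completion in argcomplete relies on the PYTHON_ARGCOMPLETE_OK marker near the top of the script and on the shell hook installed by argcomplete's activate-global-python-argcomplete command.

```python
# PYTHON_ARGCOMPLETE_OK
import argparse

try:
    import argcomplete  # optional; completion is simply skipped without it
except ImportError:
    argcomplete = None


def build_parser():
    """Hypothetical re-creation of part of the documented interface."""
    parser = argparse.ArgumentParser(
        prog="textract",
        description="Command line tool for extracting text from any document.",
    )
    parser.add_argument("filename", help="Filename to extract text.")
    parser.add_argument("-e", "--encoding", default="utf_8",
                        help="Specify the encoding of the output.")
    parser.add_argument("-m", "--method", default="",
                        help="Specify a method of extraction for formats "
                             "that support it")
    parser.add_argument("-o", "--output", default="-",
                        help="Output raw text in this file")
    return parser


def parse_arguments(argv=None):
    parser = build_parser()
    if argcomplete is not None:
        # Returns immediately unless the shell is requesting completions.
        argcomplete.autocomplete(parser)
    return parser.parse_args(argv)
```

The call to argcomplete.autocomplete must come after all arguments have been added and before parse_args, so that every option is visible to the completer.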