# Project: textract
# Version: 1.5.0
textract-v1.5.0/index.html
As undesirable as it might be, more often than not there is extremely
useful information embedded in Word documents, PowerPoint
presentations, PDFs, etc. (so-called “dark data”) that would be
valuable for further textual analysis and visualization. While
several packages exist for extracting content from
each of these formats on their own, this package provides a single
interface for extracting content from any type of file, without any
irrelevant markup.
textract supports a growing list of file types for text extraction. If
you don’t see your favorite file type here, please recommend it
by either mentioning it on the issue tracker or
contributing a pull request.
Of course, textract isn’t the first project with the aim to provide a
simple interface for extracting text from any document. But this is,
to the best of my knowledge, the only project that is written in
python (a language commonly chosen by the natural language processing
community) and is method agnostic about how content is extracted. I’m sure that there are other similar projects out
there, but here is a small sample of similar projects:
Specify a method of extraction for formats that support it
-o=-, --output=-
Output raw text in this file
-O, --option
Add arbitrary options to various parsers of the form KEYWORD=VALUE. A full list of available KEYWORD options is available at http://bit.ly/textract-options
-v, --version
show program’s version number and exit
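As a rough illustration of how KEYWORD=VALUE pairs given with -O could be turned into keyword arguments, here is a minimal argparse sketch. It is not textract's actual command line code: build_parser and options_to_kwargs are hypothetical names, and only the -O/--option flag from the list above is modelled.

```python
import argparse


def build_parser():
    """Build a tiny parser that only knows a filename and the -O/--option flag."""
    parser = argparse.ArgumentParser(prog="textract-sketch")
    parser.add_argument("filename", help="file to extract text from")
    parser.add_argument(
        "-O", "--option", action="append", default=[], metavar="KEYWORD=VALUE",
        help="parser option; may be given several times",
    )
    return parser


def options_to_kwargs(pairs):
    """Turn ['language=nor'] into {'language': 'nor'}; values stay strings."""
    kwargs = {}
    for pair in pairs:
        keyword, separator, value = pair.partition("=")
        if not separator or not keyword:
            raise ValueError("expected KEYWORD=VALUE, got %r" % pair)
        kwargs[keyword] = value
    return kwargs


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["scan.pdf", "-O", "language=nor"])
print(args.filename, options_to_kwargs(args.option))
```

Using action="append" lets -O be repeated, so each pair stays a separate, easily validated string.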
Note
To make the command line interface as usable as possible,
autocompletion of available options with textract is enabled by
@kislyuk’s amazing argcomplete package. Follow
instructions to enable global autocomplete
and you should be all set. As an example, this is also configured
in the virtual machine provisioning for this project.
This package is organized to make it as easy as possible to add new
extensions and support the continued growth and coverage of
textract. For almost all applications, you will just have to do
something like this:
to obtain text from a document. You can also pass keyword arguments to
textract.process, for example, to use a particular method for
parsing a pdf like this:
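A minimal sketch of both calls, assuming textract is installed. The wrapper extract_text is a hypothetical helper, not part of textract, and the paths used with it are placeholders. It relies only on textract.process accepting a filename, an encoding and extra keyword arguments, as described in this documentation.

```python
def extract_text(path, encoding="utf-8", **options):
    """Return the text of `path` as a str, forwarding `options` to textract.process."""
    # Imported lazily so this sketch can be loaded without textract installed.
    import textract

    # textract.process returns a byte-string encoded with `encoding`.
    raw = textract.process(path, encoding=encoding, **options)
    return raw.decode(encoding)
```

With textract installed, extract_text('path/to/file.extension') covers the basic case, and extract_text('path/to/a.pdf', method='tesseract') selects a particular method for a PDF.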
Some parsers also accept additional options, which can be passed as keyword
arguments to the textract.process function. Here is a quick table of
the options available to the different types of parsers:
parser  option    description
gif     language  Specify the language for OCR-ing text with tesseract
jpg     language  Specify the language for OCR-ing text with tesseract
pdf     language  For use when method='tesseract', specify the language
pdf     layout    With method='pdftotext' (default), preserve the layout
png     language  Specify the language for OCR-ing text with tesseract
tiff    language  Specify the language for OCR-ing text with tesseract
As an example of using these additional options, you can extract text from a
Norwegian PDF using Tesseract OCR like this:
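The option table can also be mirrored as a small lookup to catch misspelled or unsupported options before calling textract. This is only a sketch: PARSER_OPTIONS and check_options are hypothetical names that do not exist in textract, and 'nor' is assumed to be the Tesseract language code for Norwegian.

```python
# Options per parser, copied from the table in this section.
PARSER_OPTIONS = {
    "gif": {"language"},
    "jpg": {"language"},
    "pdf": {"language", "layout"},
    "png": {"language"},
    "tiff": {"language"},
}


def check_options(extension, **options):
    """Return the option names that the table does not list for this extension."""
    allowed = PARSER_OPTIONS.get(extension.lower().lstrip("."), set())
    return sorted(set(options) - allowed)


options = {"language": "nor"}
print(check_options("pdf", **options))  # an empty list means every option is listed

# With textract installed, the Norwegian example would then read:
# textract.process("path/to/norwegian.pdf", method="tesseract", **options)
```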
When textract.process('path/to/file.extension') is called,
textract.process looks for a module called
textract.parsers.extension_parser that also contains a Parser.
This is the core function used for extracting text. It routes the
filename to the appropriate parser and returns the extracted
text as a byte-string encoded with encoding.
Importantly, the textract.parsers.extension_parser.Parser class
must inherit from textract.parsers.utils.BaseParser.
The BaseParser abstracts out some common functionality
that is used across all document Parsers. In particular, it has
the responsibility of handling all unicode and byte-encoding.
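The lookup and the division of labour described above can be sketched as follows. This is a simplified stand-in, not textract's code: SketchBaseParser only imitates the role of textract.parsers.utils.BaseParser, parser_module_name is a hypothetical helper, and the UTF-8 decoding step is an assumption made to keep the example short.

```python
import os


def parser_module_name(filename):
    """Derive the parser module name from the file extension."""
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    return "textract.parsers.%s_parser" % extension


class SketchBaseParser(object):
    """Stand-in for BaseParser (NOT the real class): owns decoding and encoding."""

    def extract(self, filename, **kwargs):
        raise NotImplementedError("subclasses implement extract()")

    def process(self, filename, encoding, **kwargs):
        raw = self.extract(filename, **kwargs)
        # Assumption: raw bytes are UTF-8; the real class may detect the encoding.
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        return text.encode(encoding)


class Parser(SketchBaseParser):
    """Hypothetical parser for plain-text files; it only has to return content."""

    def extract(self, filename, **kwargs):
        with open(filename, "rb") as stream:
            return stream.read()


print(parser_module_name("report.PDF"))
```

Keeping all encoding logic in the base class means an individual Parser only implements extract.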
Many of the parsers rely on command line utilities to do some of the
parsing. The textract.parsers.utils.ShellParser
class includes some convenience methods for streamlining access to the
command line.
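To give a feel for what such a convenience method might look like, here is a sketch built on the standard subprocess module. The function run_command is hypothetical and is not the ShellParser API; the running Python interpreter stands in for an external tool so that the example works on any machine.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of strings; return (stdout, stderr) as bytes."""
    pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = pipe.communicate()
    if pipe.returncode != 0:
        raise RuntimeError("%r exited with status %d" % (args, pipe.returncode))
    return stdout, stderr


# The current interpreter is a portable stand-in for a real extraction utility.
stdout, _ = run_command([sys.executable, "-c", "print('extracted text')"])
print(stdout.decode("utf-8").strip())
```

Raising on a non-zero exit status keeps a failed external tool from being mistaken for an empty document.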
One of the main goals of textract is to make it as easy as possible to
start using textract (meaning that installation should be as quick and
painless as possible). This package is built on top of several python
packages and other source libraries. Assuming you are using pip or
easy_install to install textract, the python packages
are all installed by default with textract. The source libraries are a
separate matter though and largely depend on your operating system.
There are two steps required to run this package on
Ubuntu/Debian. First you must install some system packages using the
apt-get
package manager before installing textract from pypi.
These steps rely on you having homebrew installed
as well as the cask plugin (brew install caskroom/cask/brew-cask). The basic idea is to install
XQuartz first, then a set of system packages, and finally textract from
pypi.
pstotext is
not currently a part of homebrew so .ps extraction must be
enabled by manually installing from source.
Note
Depending on how you have python configured on your system with
homebrew, you may also need to install the python
development header files for textract to properly install.
Don’t see your operating system installation instructions here?
My apologies! Installing system packages is a bit of a drag and it’s
hard to anticipate all of the different environments that need to be
accommodated (wouldn’t it be awesome if there were a system-agnostic
package manager or, better yet, if python could install these system
dependencies for you?!?!). If your operating system doesn’t have
documentation about how to install the textract dependencies, please
contribute a pull request with:
A new section in here with the appropriate details about how to
install things. In particular, please give instructions for how to
install the following libraries before running pip install textract:
tesseract-ocr
is required by the .jpg, .png and .gif parser.
sox
is required by the .mp3 and .ogg parser.
You need to install ffmpeg, lame, libmad0 and libsox-fmt-mp3,
before building sox, for these filetypes to work.
Add a requirements file to the requirements directory
of the project with the lower-cased name of your operating system
(e.g. requirements/windows) so we can try to keep these things
up to date in the future.
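When writing such instructions, a quick check for the command line tools named above can help confirm that the system packages were installed. The sketch below uses the standard shutil.which. It assumes that the executables are called tesseract and sox on the target system, and missing_tools is a hypothetical helper, not part of textract.

```python
import shutil

# Executable names are assumptions; package names differ between systems.
REQUIRED_TOOLS = {
    "tesseract": "needed by the .jpg, .png and .gif parsers",
    "sox": "needed by the .mp3 and .ogg parsers",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return sorted(name for name in tools if shutil.which(name) is None)


for name in missing_tools():
    print("missing: %s (%s)" % (name, REQUIRED_TOOLS[name]))
```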
The overarching goal of this project is to make it as easy as possible
to extract raw text from any document for the purposes of most natural
language processing tasks. In practice, this means that this project
should preferentially provide tools that produce output with the
words in the correct order; whitespace between words,
formatting, etc. are considered irrelevant. As the various parsers mature,
I fully expect the output to become more readable to support
additional use cases, like extracting text to appear in web pages.
Importantly, this project is committed to being as agnostic about how
the content is extracted as it is about the means by which the text is
analyzed downstream. This means that textract should support
multiple modes of extracting text from any document and provide
reasonably good defaults (defaulting to tools that tend to produce the
correct word sequence).
Another important aspect of this project is that we want to have
extremely good documentation. If you notice a typo, error, confusing
statement, etc., please fix it!
Contribute! There are several open issues that provide
good places to dig in. Check out the contribution guidelines
and send pull requests; your help is greatly appreciated!
Depending on your development preferences, there are lots of ways to
get started developing with textract:
./provision/travis-mock.sh
./provision/debian.sh
# optionally run some of the steps in these scripts, but you
# may want to be selective about what you do as they alter global
# environment states
./provision/python.sh
./provision/development.sh
On the virtual machine, make sure everything is working by running
the suite of functional tests:
nosetests
These functional tests are designed to be run on an Ubuntu 12.04
LTS server, just like the virtual machine and the server that runs
the travis-ci test suite. There are some other tests that have been
added along the way in the Travis configuration. For
your convenience, you can run all of these tests with:
This project uses semantic versioning to
track version numbers, where backwards incompatible changes
(highlighted in bold) bump the major version of the package.
')
.appendTo($('#searchbox'));
}
},
/**
* init the domain index toggle buttons
*/
initIndexTable : function() {
var togglers = $('img.toggler').click(function() {
var src = $(this).attr('src');
var idnum = $(this).attr('id').substr(7);
$('tr.cg-' + idnum).toggle();
if (src.substr(-9) == 'minus.png')
$(this).attr('src', src.substr(0, src.length-9) + 'plus.png');
else
$(this).attr('src', src.substr(0, src.length-8) + 'minus.png');
}).css('display', '');
if (DOCUMENTATION_OPTIONS.COLLAPSE_INDEX) {
togglers.click();
}
},
/**
* helper function to hide the search marks again
*/
hideSearchWords : function() {
$('#searchbox .highlight-link').fadeOut(300);
$('span.highlighted').removeClass('highlighted');
},
/**
* make the url absolute
*/
makeURL : function(relativeURL) {
return DOCUMENTATION_OPTIONS.URL_ROOT + '/' + relativeURL;
},
/**
* get the current relative url
*/
getCurrentURL : function() {
var path = document.location.pathname;
var parts = path.split(/\//);
$.each(DOCUMENTATION_OPTIONS.URL_ROOT.split(/\//), function() {
if (this == '..')
parts.pop();
});
var url = parts.join('/');
return path.substring(url.lastIndexOf('/') + 1, path.length - 1);
}
};
// quick alias for translations
_ = Documentation.gettext;
$(document).ready(function() {
Documentation.init();
});
/*
* websupport.js
* ~~~~~~~~~~~~~
*
* sphinx.websupport utilities for all documentation.
*
* :copyright: Copyright 2007-2016 by the Sphinx team, see AUTHORS.
* :license: BSD, see LICENSE for details.
*
*/
(function($) {
$.fn.autogrow = function() {
return this.each(function() {
var textarea = this;
$.fn.autogrow.resize(textarea);
$(textarea)
.focus(function() {
textarea.interval = setInterval(function() {
$.fn.autogrow.resize(textarea);
}, 500);
})
.blur(function() {
clearInterval(textarea.interval);
});
});
};
$.fn.autogrow.resize = function(textarea) {
var lineHeight = parseInt($(textarea).css('line-height'), 10);
var lines = textarea.value.split('\n');
var columns = textarea.cols;
var lineCount = 0;
$.each(lines, function() {
lineCount += Math.ceil(this.length / columns) || 1;
});
var height = lineHeight * (lineCount + 1);
$(textarea).css('height', height);
};
})(jQuery);
(function($) {
var comp, by;
function init() {
initEvents();
initComparator();
}
function initEvents() {
$(document).on("click", 'a.comment-close', function(event) {
event.preventDefault();
hide($(this).attr('id').substring(2));
});
$(document).on("click", 'a.vote', function(event) {
event.preventDefault();
handleVote($(this));
});
$(document).on("click", 'a.reply', function(event) {
event.preventDefault();
openReply($(this).attr('id').substring(2));
});
$(document).on("click", 'a.close-reply', function(event) {
event.preventDefault();
closeReply($(this).attr('id').substring(2));
});
$(document).on("click", 'a.sort-option', function(event) {
event.preventDefault();
handleReSort($(this));
});
$(document).on("click", 'a.show-proposal', function(event) {
event.preventDefault();
showProposal($(this).attr('id').substring(2));
});
$(document).on("click", 'a.hide-proposal', function(event) {
event.preventDefault();
hideProposal($(this).attr('id').substring(2));
});
$(document).on("click", 'a.show-propose-change', function(event) {
event.preventDefault();
showProposeChange($(this).attr('id').substring(2));
});
$(document).on("click", 'a.hide-propose-change', function(event) {
event.preventDefault();
hideProposeChange($(this).attr('id').substring(2));
});
$(document).on("click", 'a.accept-comment', function(event) {
event.preventDefault();
acceptComment($(this).attr('id').substring(2));
});
$(document).on("click", 'a.delete-comment', function(event) {
event.preventDefault();
deleteComment($(this).attr('id').substring(2));
});
$(document).on("click", 'a.comment-markup', function(event) {
event.preventDefault();
toggleCommentMarkupBox($(this).attr('id').substring(2));
});
}
/**
* Set comp, which is a comparator function used for sorting and
* inserting comments into the list.
*/
function setComparator() {
// If the first three letters are "asc", sort in ascending order
// and remove the prefix.
if (by.substring(0,3) == 'asc') {
var i = by.substring(3);
comp = function(a, b) { return a[i] - b[i]; };
} else {
// Otherwise sort in descending order.
comp = function(a, b) { return b[by] - a[by]; };
}
// Reset link styles and format the selected sort option.
$('a.sel').attr('href', '#').removeClass('sel');
$('a.by' + by).removeAttr('href').addClass('sel');
}
/**
* Create a comp function. If the user has preferences stored in
* the sortBy cookie, use those, otherwise use the default.
*/
function initComparator() {
by = 'rating'; // Default to sort by rating.
// If the sortBy cookie is set, use that instead.
if (document.cookie.length > 0) {
var start = document.cookie.indexOf('sortBy=');
if (start != -1) {
start = start + 7;
var end = document.cookie.indexOf(";", start);
if (end == -1) {
end = document.cookie.length;
}
// Read the value whether or not sortBy is the last cookie.
by = unescape(document.cookie.substring(start, end));
}
}
setComparator();
}
/**
* Show a comment div.
*/
function show(id) {
$('#ao' + id).hide();
$('#ah' + id).show();
var context = $.extend({id: id}, opts);
var popup = $(renderTemplate(popupTemplate, context)).hide();
popup.find('textarea[name="proposal"]').hide();
popup.find('a.by' + by).addClass('sel');
var form = popup.find('#cf' + id);
form.submit(function(event) {
event.preventDefault();
addComment(form);
});
$('#s' + id).after(popup);
popup.slideDown('fast', function() {
getComments(id);
});
}
/**
* Hide a comment div.
*/
function hide(id) {
$('#ah' + id).hide();
$('#ao' + id).show();
var div = $('#sc' + id);
div.slideUp('fast', function() {
div.remove();
});
}
/**
* Perform an ajax request to get comments for a node
* and insert the comments into the comments tree.
*/
function getComments(id) {
$.ajax({
type: 'GET',
url: opts.getCommentsURL,
data: {node: id},
success: function(data, textStatus, request) {
var ul = $('#cl' + id);
var speed = 100;
$('#cf' + id)
.find('textarea[name="proposal"]')
.data('source', data.source);
if (data.comments.length === 0) {
ul.html('<li>No comments yet.</li>');
ul.data('empty', true);
} else {
// If there are comments, sort them and put them in the list.
var comments = sortComments(data.comments);
speed = data.comments.length * 100;
appendComments(comments, ul);
ul.data('empty', false);
}
$('#cn' + id).slideUp(speed + 200);
ul.slideDown(speed);
},
error: function(request, textStatus, error) {
showError('Oops, there was a problem retrieving the comments.');
},
dataType: 'json'
});
}
/**
* Add a comment via ajax and insert the comment into the comment tree.
*/
function addComment(form) {
var node_id = form.find('input[name="node"]').val();
var parent_id = form.find('input[name="parent"]').val();
var text = form.find('textarea[name="comment"]').val();
var proposal = form.find('textarea[name="proposal"]').val();
if (text == '') {
showError('Please enter a comment.');
return;
}
// Disable the form that is being submitted.
form.find('textarea,input').attr('disabled', 'disabled');
// Send the comment to the server.
$.ajax({
type: "POST",
url: opts.addCommentURL,
dataType: 'json',
data: {
node: node_id,
parent: parent_id,
text: text,
proposal: proposal
},
success: function(data, textStatus, error) {
// Reset the form.
if (node_id) {
hideProposeChange(node_id);
}
form.find('textarea')
.val('')
.add(form.find('input'))
.removeAttr('disabled');
var ul = $('#cl' + (node_id || parent_id));
if (ul.data('empty')) {
$(ul).empty();
ul.data('empty', false);
}
insertComment(data.comment);
var ao = $('#ao' + node_id);
ao.find('img').attr({'src': opts.commentBrightImage});
if (node_id) {
// if this was a "root" comment, remove the commenting box
// (the user can get it back by reopening the comment popup)
$('#ca' + node_id).slideUp();
}
},
error: function(request, textStatus, error) {
form.find('textarea,input').removeAttr('disabled');
showError('Oops, there was a problem adding the comment.');
}
});
}
/**
* Recursively append comments to the main comment list and children
* lists, creating the comment tree.
*/
function appendComments(comments, ul) {
$.each(comments, function() {
var div = createCommentDiv(this);
ul.append($(document.createElement('li')).html(div));
appendComments(this.children, div.find('ul.comment-children'));
// To avoid stagnating data, don't store the comments children in data.
this.children = null;
div.data('comment', this);
});
}
/**
* After adding a new comment, it must be inserted in the correct
* location in the comment tree.
*/
function insertComment(comment) {
var div = createCommentDiv(comment);
// To avoid stagnating data, don't store the comments children in data.
comment.children = null;
div.data('comment', comment);
var ul = $('#cl' + (comment.node || comment.parent));
var siblings = getChildren(ul);
var li = $(document.createElement('li'));
li.hide();
// Determine where in the parents children list to insert this comment.
for (var i = 0; i < siblings.length; i++) {
if (comp(comment, siblings[i]) <= 0) {
$('#cd' + siblings[i].id)
.parent()
.before(li.html(div));
li.slideDown('fast');
return;
}
}
// If we get here, this comment rates lower than all the others,
// or it is the only comment in the list.
ul.append(li.html(div));
li.slideDown('fast');
}
function acceptComment(id) {
$.ajax({
type: 'POST',
url: opts.acceptCommentURL,
data: {id: id},
success: function(data, textStatus, request) {
$('#cm' + id).fadeOut('fast');
$('#cd' + id).removeClass('moderate');
},
error: function(request, textStatus, error) {
showError('Oops, there was a problem accepting the comment.');
}
});
}
function deleteComment(id) {
$.ajax({
type: 'POST',
url: opts.deleteCommentURL,
data: {id: id},
success: function(data, textStatus, request) {
var div = $('#cd' + id);
if (data == 'delete') {
// Moderator mode: remove the comment and all children immediately
div.slideUp('fast', function() {
div.remove();
});
return;
}
// User mode: only mark the comment as deleted
div
.find('span.user-id:first')
.text('[deleted]').end()
.find('div.comment-text:first')
.text('[deleted]').end()
.find('#cm' + id + ', #dc' + id + ', #ac' + id + ', #rc' + id +
', #sp' + id + ', #hp' + id + ', #cr' + id + ', #rl' + id)
.remove();
var comment = div.data('comment');
comment.username = '[deleted]';
comment.text = '[deleted]';
div.data('comment', comment);
},
error: function(request, textStatus, error) {
showError('Oops, there was a problem deleting the comment.');
}
});
}
function showProposal(id) {
$('#sp' + id).hide();
$('#hp' + id).show();
$('#pr' + id).slideDown('fast');
}
function hideProposal(id) {
$('#hp' + id).hide();
$('#sp' + id).show();
$('#pr' + id).slideUp('fast');
}
function showProposeChange(id) {
$('#pc' + id).hide();
$('#hc' + id).show();
var textarea = $('#pt' + id);
textarea.val(textarea.data('source'));
$.fn.autogrow.resize(textarea[0]);
textarea.slideDown('fast');
}
function hideProposeChange(id) {
$('#hc' + id).hide();
$('#pc' + id).show();
var textarea = $('#pt' + id);
textarea.val('').removeAttr('disabled');
textarea.slideUp('fast');
}
function toggleCommentMarkupBox(id) {
$('#mb' + id).toggle();
}
/** Handle when the user clicks on a sort by link. */
function handleReSort(link) {
var classes = link.attr('class').split(/\s+/);
for (var i=0; iThank you! Your comment will show up '
+ 'once it has been approved by a moderator.');
}
// Prettify the comment rating.
comment.pretty_rating = comment.rating + ' point' +
(comment.rating == 1 ? '' : 's');
// Make a class (for displaying not yet moderated comments differently)
comment.css_class = comment.displayed ? '' : ' moderate';
// Create a div for this comment.
var context = $.extend({}, opts, comment);
var div = $(renderTemplate(commentTemplate, context));
// If the user has voted on this comment, highlight the correct arrow.
if (comment.vote) {
var direction = (comment.vote == 1) ? 'u' : 'd';
div.find('#' + direction + 'v' + comment.id).hide();
div.find('#' + direction + 'u' + comment.id).show();
}
if (opts.moderator || comment.text != '[deleted]') {
div.find('a.reply').show();
if (comment.proposal_diff)
div.find('#sp' + comment.id).show();
if (opts.moderator && !comment.displayed)
div.find('#cm' + comment.id).show();
if (opts.moderator || (opts.username == comment.username))
div.find('#dc' + comment.id).show();
}
return div;
}
/**
* A simple template renderer. Placeholders such as <%id%> are replaced
* by context['id'] with items being escaped. Placeholders such as <#id#>
* are not escaped.
*/
function renderTemplate(template, context) {
var esc = $(document.createElement('div'));
function handle(ph, escape) {
var cur = context;
$.each(ph.split('.'), function() {
cur = cur[this];
});
return escape ? esc.text(cur || "").html() : cur;
}
return template.replace(/<([%#])([\w\.]*)\1>/g, function() {
return handle(arguments[2], arguments[1] == '%' ? true : false);
});
}
/** Flash an error message briefly. */
function showError(message) {
$(document.createElement('div')).attr({'class': 'popup-error'})
.append($(document.createElement('div'))
.attr({'class': 'error-message'}).text(message))
.appendTo('body')
.fadeIn("slow")
.delay(2000)
.fadeOut("slow");
}
/** Add a link the user uses to open the comments popup. */
$.fn.comment = function() {
return this.each(function() {
var id = $(this).attr('id').substring(1);
var count = COMMENT_METADATA[id];
var title = count + ' comment' + (count == 1 ? '' : 's');
var image = count > 0 ? opts.commentBrightImage : opts.commentImage;
var addcls = count == 0 ? ' nocomment' : '';
$(this)
.append(
$(document.createElement('a')).attr({
href: '#',
'class': 'sphinx-comment-open' + addcls,
id: 'ao' + id
})
.append($(document.createElement('img')).attr({
src: image,
alt: 'comment',
title: title
}))
.click(function(event) {
event.preventDefault();
show($(this).attr('id').substring(2));
})
)
.append(
$(document.createElement('a')).attr({
href: '#',
'class': 'sphinx-comment-close hidden',
id: 'ah' + id
})
.append($(document.createElement('img')).attr({
src: opts.closeCommentImage,
alt: 'close',
title: 'close'
}))
.click(function(event) {
event.preventDefault();
hide($(this).attr('id').substring(2));
})
);
});
};
var opts = {
processVoteURL: '/_process_vote',
addCommentURL: '/_add_comment',
getCommentsURL: '/_get_comments',
acceptCommentURL: '/_accept_comment',
deleteCommentURL: '/_delete_comment',
commentImage: '/static/_static/comment.png',
closeCommentImage: '/static/_static/comment-close.png',
loadingImage: '/static/_static/ajax-loader.gif',
commentBrightImage: '/static/_static/comment-bright.png',
upArrow: '/static/_static/up.png',
downArrow: '/static/_static/down.png',
upArrowPressed: '/static/_static/up-pressed.png',
downArrowPressed: '/static/_static/down-pressed.png',
voting: false,
moderator: false
};
if (typeof COMMENT_OPTIONS != "undefined") {
opts = jQuery.extend(opts, COMMENT_OPTIONS);
}
var popupTemplate = '\
\ Sort by:\ best rated\ newest\ oldest\
\\
Add a comment\ (markup):
\``code``
, \ code blocks:::
and an indented block after blank line