The overarching goal of this project is to make it as easy as possible to extract raw text from any document for most natural language processing tasks. In practice, this means the project should favor tools that produce output with the words in the correct order; whitespace between words, formatting, and similar details are irrelevant. As the various parsers mature, I fully expect the output to become more readable to support additional use cases, like extracting text to appear in web pages.
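One way to make "correct word order, whitespace irrelevant" concrete is to compare two extractions only by their word sequences. The sketch below is an illustration, not part of textract; the helper names `word_sequence` and `same_words` and the sample strings are assumptions made up for this example.

```python
from typing import List


def word_sequence(text: str) -> List[str]:
    """Return the words of `text` in order, ignoring whitespace differences."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading/trailing whitespace.
    return text.split()


def same_words(a: str, b: str) -> bool:
    """True when two extractions contain the same words in the same order."""
    return word_sequence(a) == word_sequence(b)


# Two hypothetical extractions of the same document that differ only in layout.
tidy = "The quick brown fox\njumps over the lazy dog."
messy = "The   quick brown\n\n  fox jumps\tover the lazy dog."
print(same_words(tidy, messy))
```

Under this reading, an extractor that scrambles line breaks is still acceptable, while one that reorders words is not.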
Importantly, this project is committed to being as agnostic about how the content is extracted as it is about how the text is analyzed downstream. This means that textract should support multiple modes of extracting text from any document and provide reasonably good defaults (defaulting to tools that tend to produce the correct word sequence).
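The idea of several extraction modes with a sensible default can be sketched as a small lookup table. This is a minimal illustration under stated assumptions, not textract's implementation: `EXTRACTORS`, `DEFAULT_METHOD`, `extract`, and the two decoder functions are hypothetical names invented for this example.

```python
from typing import Callable, Dict

# All names below are hypothetical and exist only for this illustration;
# they are not textract's API.
Extractor = Callable[[bytes], str]


def decode_utf8(data: bytes) -> str:
    # Replace undecodable bytes instead of failing, so word order survives.
    return data.decode("utf-8", errors="replace")


def decode_latin1(data: bytes) -> str:
    # latin-1 maps every byte to a character, so this never raises.
    return data.decode("latin-1")


# Several extraction methods per file extension...
EXTRACTORS: Dict[str, Dict[str, Extractor]] = {
    ".txt": {"utf8": decode_utf8, "latin1": decode_latin1},
}
# ...and one default method per extension.
DEFAULT_METHOD: Dict[str, str] = {".txt": "utf8"}


def extract(data: bytes, extension: str, method: str = "") -> str:
    """Extract text with the requested method, or the default if none is given."""
    if extension not in EXTRACTORS:
        raise ValueError(f"unsupported extension: {extension!r}")
    methods = EXTRACTORS[extension]
    chosen = method or DEFAULT_METHOD[extension]
    if chosen not in methods:
        raise ValueError(f"unknown method {chosen!r} for {extension!r}")
    return methods[chosen](data)


default_text = extract("café au lait".encode("utf-8"), ".txt")
explicit_text = extract("café au lait".encode("latin-1"), ".txt", method="latin1")
print(default_text == explicit_text)
```

Callers who do not care how text is extracted get the default, while callers who do care can name a method explicitly.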
Another important aspect of this project is that we want to have extremely good documentation. If you notice a typo, an error, a confusing statement, etc., please fix it!
Fork and clone the project:
git clone https://github.com/YOUR-USERNAME/textract.git
Depending on your development preferences, there are lots of ways to get started developing with textract:
Developing in a native Ubuntu environment
Install all the necessary system packages:
./provision/travis-mock.sh
./provision/debian.sh

# optionally run some of the steps in these scripts, but you
# may want to be selective about what you do as they alter global
# environment states
./provision/python.sh
./provision/development.sh
On the virtual machine, make sure everything is working by running the suite of functional tests:
These functional tests are designed to be run on an Ubuntu 12.04 LTS server, just like the virtual machine and the server that runs the travis-ci test suite. There are some other tests that have been added along the way in the Travis configuration. For your convenience, you can run all of these tests with:
Developing with Vagrant virtual machine
vagrant plugin install iniparse
vagrant up && vagrant provision
After running vagrant ssh to log in to the virtual machine, note that the PATH environment variables have been altered in this virtual machine so that any changes you make to textract in development are automatically incorporated into the textract command.
See step 4 in the Ubuntu development environment.