Contributing
The overarching goal of this project is to make it as easy as possible to extract raw text from any document for the purposes of most natural language processing tasks. In practice, this means the project should prefer tools that produce output with words in the correct order; whitespace between words, formatting, etc. are irrelevant. As the various parsers mature, I fully expect the output to become more readable to support additional use cases, like extracting text to appear in web pages.
Importantly, this project is committed to being as agnostic about how
the content is extracted as it is about how the text is
analyzed downstream. This means that textract should support
multiple modes of extracting text from any document and provide
reasonably good defaults (defaulting to tools that tend to produce the
correct word sequence).
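To make the "correct word order, whitespace irrelevant" goal concrete, here is a minimal sketch of how two extractions could be compared. The helper names are hypothetical and are not part of textract; the sketch only illustrates the idea that output is judged by its word sequence.

```python
def word_sequence(text):
    """Split text into words, discarding every whitespace difference."""
    return text.split()


def same_word_sequence(extracted, expected):
    """True when both texts contain the same words in the same order."""
    return word_sequence(extracted) == word_sequence(expected)


if __name__ == "__main__":
    # Different spacing and line breaks, same words in the same order.
    print(same_word_sequence("the  quick\nbrown fox", "the quick brown fox"))
```

Under this view, a parser that reflows paragraphs is still acceptable, while one that shuffles columns or drops words is not.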
Another important aspect of this project is that we want to have extremely good documentation. If you notice a typo, error, confusing statement, etc., please fix it!
Quick start
Fork and clone the project:
git clone https://github.com/YOUR-USERNAME/textract.git
Install dependencies and run tests:
make sync
uv run pytest
Install system dependencies for your platform. See the installation guide for platform-specific instructions (macOS, Ubuntu/Debian, Windows, FreeBSD).
Contribute! There are several open issues that provide good places to dig in and send pull requests; your help is greatly appreciated!
Contribution workflow
Any and all contributions are welcome and appreciated. To keep things organized, this project follows the general guidelines of the fork-branch-pull request model for GitHub. Briefly, this means:
Make sure your fork’s master branch is up to date:
git remote add deanmalmgren https://github.com/deanmalmgren/textract.git
git checkout master
git pull deanmalmgren master
Start a feature branch with a descriptive name about what you’re trying to accomplish:
git checkout -b csv-support
Make commits to this feature branch (csv-support, in this case) in a way that other people can understand with good commit messages to explain the changes you’ve made:
code textract/parsers/csv_parser.py
git add textract/parsers/csv_parser.py
git commit -m 'feat: added csv_parser'
Open a PR on GitHub with your changes, explain the motivation for the PR, any manual testing, and link any relevant issues:
git push origin csv-support
chrome http://github.com/deanmalmgren/textract/compare
Common contributions: support for new file type
This project has really taken off, much more so than I would have
thought (thanks everybody!). To help new contributors, I thought I’d
jot down some notes for one of the more common contributions, namely how to
add support for a hitherto unsupported file type, .abc123:
write a Parser class in textract/parsers/abc123_parser.py that inherits from textract.parsers.utils.BaseParser or textract.parsers.utils.ShellParser and implements the extract(self, filename, **kwargs) method.
add a test file in tests/abc123/raw_text.abc123 and generate the expected output. For most file types that use pure Python libraries, run textract on it:
textract tests/abc123/raw_text.abc123 > tests/abc123/raw_text.txt
For file types that require external tools (PDF OCR, image OCR, PostScript), add the file to tests/Makefile and run:
cd tests && make
Then add the basic test suite by creating a file called tests/test_abc123.py with content that looks something like this:
# tests/test_abc123.py
import unittest
import base
class Abc123TestCase(unittest.TestCase, base.BaseParserTestCase):
    extension = 'abc123'
now you should be able to run tests on your parser with uv run pytest tests/test_abc123.py or the tests for every parser with uv run pytest.
if your package relies on any external sources, be sure to add them in pyproject.toml (for Python packages) or document system dependencies and update the installation documentation accordingly in docs/installation.rst.
add documentation about the awesome new file format that is being supported in docs/index.rst
finally, make sure the entire test suite passes by running uv run pytest and fix any lingering problems.
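As a rough illustration of the first step above, a parser module might look something like the sketch below. Two things here are assumptions made purely for illustration: the imaginary .abc123 format is treated as plain UTF-8 text, and a stand-in BaseParser is defined so the snippet runs on its own. A real parser would import the base class from textract.parsers.utils instead.

```python
# textract/parsers/abc123_parser.py (hypothetical sketch)


class BaseParser:
    """Stand-in base class so this sketch runs without textract installed.

    In the real module, replace this class with:
        from textract.parsers.utils import BaseParser
    """


class Parser(BaseParser):
    """Extract text from .abc123 files (imaginary format, assumed UTF-8 text)."""

    def extract(self, filename, **kwargs):
        # Return the document's text; word order matters, whitespace does not.
        with open(filename, encoding="utf-8") as stream:
            return stream.read()
```

A format that needs an external command-line tool would presumably inherit from ShellParser instead, as the first step mentions.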
Style guidelines
As a general rule of thumb, the goal of this package is to be as readable as possible to make it easy for novices and experts alike to contribute to the source code in meaningful ways. Pull requests that favor cleverness or optimization over readability are less likely to be incorporated.
To make this notion of “readability” more concrete, here are a few stylistic guidelines, inspired by other projects, that we generally recommend:
write functions and methods that can fit on a screen or two of a standard terminal — no more than approximately 40 lines.
unless it makes code less readable, adhere to PEP 8 style recommendations — use an appropriate amount of whitespace.
code comments should be about *what* is being done, not *how* it is being done — that should be self-evident from the code itself.
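As a small illustration of these guidelines (an invented example, not code from this project), the function below is short, follows PEP 8, and carries a comment that says what is being done and leaves the how to the code.

```python
def strip_page_numbers(lines):
    """Return the lines of a document without bare page-number lines."""
    # Drop lines that contain nothing but a page number.
    return [line for line in lines if not line.strip().isdigit()]
```

A comment such as "loop over the lines and call isdigit" would only restate the code and adds nothing for the reader.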
Development Setup
Install dependencies:
make sync
This runs uv sync and, on macOS, re-signs any compiled C extensions that
may have an invalid code signature due to Python 3.14 build tooling. See
apple codesigning and scikit-build-core
for upstream context.
Install system dependencies (macOS, using Homebrew):
brew install antiword tesseract ghostscript poppler sox unrtf
Install system dependencies (Ubuntu/Debian):
apt-get install antiword tesseract-ocr ghostscript poppler-utils sox libsox-fmt-mp3 unrtf
Install system dependencies (Windows, using Chocolatey):
choco install tesseract ghostscript sox.portable poppler -y
Note
The canonical list of system dependencies is in .github/actions/setup/action.yml.
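To check quickly whether the system dependencies are visible on your PATH, something like the following sketch may help. The executable names are assumptions about what the packages above install (for example, gs for ghostscript and pdftotext for poppler), and they can differ by platform, so adjust the list as needed.

```python
import shutil

# Assumed executable names for the packages listed above; adjust per platform.
REQUIRED_TOOLS = ["antiword", "tesseract", "gs", "pdftotext", "sox", "unrtf"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses shutil.which from the standard library.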
Releasing
Versioning follows semantic versioning. The version is declared in two
places: pyproject.toml (version) and textract/__init__.py (VERSION). Use the
helper script to update both atomically:
python scripts/bump_version.py 1.7.0
Update the changelog in docs/changelog.rst — add a new section above the previous release
with a short summary of notable changes. See existing entries for the format.
Tag and push:
git add pyproject.toml textract/__init__.py docs/changelog.rst
git commit -m "chore: release 1.7.0"
git tag v1.7.0
git push --follow-tags
CI runs tests, creates a GitHub Release with auto-generated notes from merged PRs, and publishes to PyPI.
