.. _contributing: Contributing ============ The overarching goal of this project is to make it as easy as possible to extract raw text from any document for the purposes of most natural language processing tasks. In practice, this means that this project should preferentially provide tools that correctly produce output that has words in the correct order but that whitespace between words, formatting, etc is totally irrelevant. As the various parsers mature, I fully expect the output to become more readable to support additional use cases, like `extracting text to appear in web pages `_. Importantly, this project is committed to being as agnostic about how the content is extracted as it is about the means in which the text is analyzed downstream. This means that ``textract`` should support multiple modes of extracting text from any document and provide reasonably good defaults (defaulting to tools that tend to produce the correct word sequence). Another important aspect of this project is that we want to have extremely good documentation. If you notice a type-o, error, confusing statement etc, please fix it! .. _contributing-quick-start: Quick start ----------- 1. `Fork `_ and clone the project: .. code-block:: bash git clone https://github.com/YOUR-USERNAME/textract.git 2. Install dependencies and run tests: .. code-block:: bash make sync uv run pytest 3. Install system dependencies for your platform. See the :ref:`installation ` guide for platform-specific instructions (macOS, Ubuntu/Debian, Windows, FreeBSD). 4. Contribute! There are several `open issues `_ that provide good places to dig in and send pull requests; your help is greatly appreciated! Current build status: |Build Status| .. |Build Status| image:: https://travis-ci.org/deanmalmgren/textract.png :target: https://travis-ci.org/deanmalmgren/textract Contribution workflow --------------------- Any and all contributions are welcome and appreciated. To make it easy to keep things organized, this project uses the `general guidelines `_ for the fork-branch-pull request model for github. Briefly, this means: 1. Make sure your fork's ``master`` branch is up to date: .. code-block:: bash git remote add deanmalmgren https://github.com/deanmalmgren/textract.git git checkout master git pull deanmalmgren/master 2. Start a feature branch with a descriptive name about what you're trying to accomplish: .. code-block:: bash git checkout -b csv-support 3. Make commits to this feature branch (``csv-support``, in this case) in a way that other people can understand with good commit messages to explain the changes you've made: .. code-block:: bash code textract/parsers/csv_parser.py git add textract/parsers/csv_parser.py git commit -m 'feat: added csv_parser' 4. Open a PR on GitHub with your changes, explain the motivation for the PR, any manual testing, and link any relevant issues: .. code-block:: bash git push origin csv-support chrome http://github.com/deanmalmgren/textract/compare Common contributions: support for new file type ------------------------------------------------ This project has really taken off, much more so than I would have thought (thanks everybody!). To help new contributors, I thought I'd jot down some notes for one of the more common contributions---how to add support for hitherto unsupported file type ``.abc123``: * write a ``Parser`` class in ``textract/parsers/abc123_parser.py`` that inherits from ``textract.parsers.utils.BaseParser`` or ``textract.parsers.utils.ShellParser`` and implements the ``extract(self, filename, **kwargs)`` method. * add a test file in ``tests/abc123/raw_text.abc123`` and generate the expected output. For most file types that use pure Python libraries, run textract on it: .. code-block:: shell textract tests/abc123/raw_text.abc123 > tests/abc123/raw_text.txt For file types that require external tools (PDF OCR, image OCR, PostScript), add the file to ``tests/Makefile`` and run: .. code-block:: shell cd tests && make Then add the basic test suite by creating a file called ``tests/test_abc123.py`` with content that looks something like this: .. code-block:: python # tests/test_abc123.py import unittest import base class Abc123TestCase(unittest.TestCase, base.BaseParserTestCase): extension = 'abc123' now you should be able to run tests on your parser with ``uv run pytest tests/test_abc123.py`` or the tests for every parser with ``uv run pytest``. * if your package relies on any external sources, be sure to add them in ``pyproject.toml`` (for python packages) or document system dependencies and update the installation documentation accordingly in ``docs/installation.rst``. * add documentation about the awesome new file format this is being supported in ``docs/index.rst`` * finally, make sure the entire test suite passes by running ``uv run pytest`` and fix any lingering problems. Style guidelines ---------------- As a general rule of thumb, the goal of this package is to be as readable as possible to make it easy for novices and experts alike to contribute to the source code in meaningful ways. Pull requests that favor cleverness or optimization over readability are less likely to be incorporated. To make this notion of "readability" more concrete, here are a few stylistic guidelines that are inspired by other projects and we generally recommend: - write functions and methods that can `fit on a screen or two of a standard terminal `_ --- no more than approximately 40 lines. - unless it makes code less readable, adhere to `PEP 8 `_ style recommendations --- use an appropriate amount of whitespace. - `code comments should be about *what* is being done, not *how* it is being done `_ --- that should be self-evident from the code itself. Development Setup ----------------- Install dependencies:: make sync This runs ``uv sync`` and, on macOS, re-signs any compiled C extensions that may have an invalid code signature due to Python 3.14 build tooling. See `apple codesigning and scikit-build-core `_ for upstream context. Install system dependencies (macOS, using `Homebrew `_):: brew install antiword tesseract ghostscript poppler sox unrtf Install system dependencies (Ubuntu/Debian):: apt-get install antiword tesseract-ocr ghostscript poppler-utils sox libsox-fmt-mp3 unrtf Install system dependencies (Windows, using `Chocolatey `_):: choco install tesseract ghostscript sox.portable poppler -y .. note:: The canonical list of system dependencies is in `.github/actions/setup/action.yml `_. Releasing --------- Versioning follows `semantic versioning `_. The version is declared in two places: ``pyproject.toml`` (``version``) and ``textract/__init__.py`` (``VERSION``). Use the helper script to update both atomically: .. code-block:: bash python scripts/bump_version.py 1.7.0 **Update the changelog** in ``docs/changelog.rst`` — add a new section above the previous release with a short summary of notable changes. See existing entries for the format. **Tag and push:** .. code-block:: bash git add pyproject.toml textract/__init__.py docs/changelog.rst git commit -m "chore: release 1.7.0" git tag v1.7.0 git push --follow-tags CI runs tests, creates a GitHub Release with auto-generated notes from merged PRs, and publishes to PyPI.