Installation
One of the main goals of textract is to make it as easy as possible to
start using textract (meaning that installation should be as quick and
painless as possible). This package is built on top of several python
packages and other source libraries. Python packages are installed
automatically when using pip or uv. The source libraries are a
separate matter though and largely depend on your operating system.
Modern Python tooling options:
# One-off execution
uvx textract path/to/file.pdf
# Install as tool
uv tool install textract
Ubuntu / Debian
First install system packages using apt-get, then install textract from PyPI:
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils ghostscript tesseract-ocr \
flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
pip install textract
# or with uv
uv pip install textract
Note
It may also be necessary to install zlib1g-dev on Docker
instances of Ubuntu. See issue #19 for details
macOS
These steps rely on you having Homebrew installed. First install XQuartz and system packages, then install textract from PyPI:
brew install --cask xquartz
brew install antiword ghostscript poppler sox tesseract unrtf swig
pip install textract
# or with uv
uv pip install textract
Note
ghostscript provides ps2ascii for .ps file extraction.
Note
Depending on how you have python configured on your system with homebrew, you may also need to install the python development header files for textract to properly install.
Note
pocketsphinx and Python 3.14+ on macOS: No prebuilt wheel exists yet
for Python 3.14 on macOS ARM64. When installed from source,
scikit-build-core’s post-build steps invalidate the code signature, causing
macOS to kill the process with SIGKILL (Code Signature Invalid) when the
extension loads. If you see a RuntimeError mentioning an invalid code
signature when using method="sphinx", re-sign the extension:
python -c "
import subprocess
from pathlib import Path
import importlib.util
spec = importlib.util.find_spec('pocketsphinx')
for loc in (spec.submodule_search_locations or []):
for so in Path(loc).glob('_pocketsphinx.cpython-*-darwin.so'):
subprocess.run(['codesign', '-s', '-', '-f', str(so)], check=True)
print(f'Re-signed: {so}')
"
This is a known upstream issue; once a Python 3.14 macOS wheel is published to PyPI the workaround will no longer be needed.
Windows
Install Chocolatey then install system packages:
choco install tesseract ghostscript sox.portable poppler -y
pip install textract
# or with uv
uv pip install textract
Note
Two parsers are not supported on Windows:
.mp3/.ogg: sox.portable does not includelibmad. SoX dynamically loadslibmad.dllfor MP3 decoding but does not ship it due to patent restrictions — see the SoX mailing list discussion. Use Linux or macOS wherelibsox-fmt-mp3/madare available..rtf: unrtf is a GNU project with no Windows port and no Chocolatey package. RTF extraction is only available on Linux (apt install unrtf) and macOS (brew install unrtf).
FreeBSD
First install system packages using pkg, then install textract from PyPI:
pkg install lang/python38 devel/py-pip textproc/libxml2 textproc/libxslt textproc/antiword textproc/unrtf \
graphics/poppler print/pstotext graphics/tesseract audio/flac multimedia/ffmpeg audio/lame audio/sox \
graphics/jpeg-turbo
pip install textract
# or with uv
uv pip install textract
Reference: CI System Dependencies
The canonical list of system dependencies is maintained in the GitHub Actions workflow at .github/actions/setup/action.yml. This is what CI uses and is kept up-to-date with each platform’s requirements.
Don’t see your operating system installation instructions here?
My apologies! Installing system packages is a bit of a drag and its hard to anticipate all of the different environments that need to be accommodated (wouldn’t it be awesome if there were a system-agnostic package manager or, better yet, if python could install these system dependencies for you?!?!). If you’re operating system doesn’t have documentation about how to install the textract dependencies, please contribute a pull request with:
A new section in here with the appropriate details about how to install things. In particular, please give instructions for how to install the following libraries before running
pip install textract:libxml2 2.6.21 or later is required by the
.docxparser which uses lxml via python-docx.libxslt 1.1.15 or later is required by the
.docxparser which users lxml via python-docx.python header files are required for building lxml.
antiword is required by the
.docparser (note: no longer actively maintained).pdftotext (part of poppler) is optionally required by the
.pdfparser (there is a pure python fallback that works if pdftotext isn’t installed).ps2ascii (part of ghostscript) is required by the
.psparser.tesseract-ocr is required by the
.jpg,.pngand.gifparser.sox is required by the
.mp3and.oggparser. You need to install ffmpeg, lame, libmad0 and libsox-fmt-mp3, before building sox, for these filetypes to work.
Add a requirements file to the requirements directory of the project with the lower-cased name of your operating system (e.g.
requirements/windows) so we can try to keep these things up to date in the future.