Change Log

This project uses semantic versioning, where backwards-incompatible changes (highlighted in bold) bump the major version of the package.
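
As a rough illustration of the versioning rule above, the sketch below checks whether an upgrade crosses a major version and may therefore contain backwards-incompatible changes. The helper names `parse_version` and `is_breaking_upgrade` are hypothetical and not part of textract, and the sketch assumes plain `MAJOR.MINOR.PATCH` version strings like the ones listed in this change log.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_breaking_upgrade(old: str, new: str) -> bool:
    """Return True when `new` has a higher major version than `old`."""
    return parse_version(new)[0] > parse_version(old)[0]


if __name__ == "__main__":
    # Hypothetical usage with version numbers taken from this change log.
    print(is_breaking_upgrade("1.6.5", "2.0.0"))
    print(is_breaking_upgrade("1.6.4", "1.6.5"))
```

A check like this can be useful before loosening a dependency pin: an upgrade within the same major version should be safe under semantic versioning, while a major bump deserves a read of the entries below.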

NEXT RELEASE

TBD

2.0.0

  • Major modernization of textract, absorbing the textract-py3 fork (#543 by @KyleKing)

    • Implement GitHub Actions CI/CD and improve test coverage

    • Fix minor inconsistencies on Windows

    • Resolve dependency specification errors by migrating to uv

    • Update some minimum dependency versions and set the minimum Python version to 3.9

    • Migrate to PyPI Trusted Publisher for releases

  • Multiple Fixes:

  • Documentation Updates

1.6.5

  • switched epub parsing to MIT license compatible package (#411 by @jhale1805)

1.6.4

  • several bug fixes, including:

1.6.1

  • several bug fixes, including:

    • fixing the readthedocs build (#150)

1.6.0

  • Let the user provide file extension as an argument when the file name has no extension (#148 by @motazsaad)

  • Added ability to parse audio with pocketsphinx (#122 by @barrust)

  • Added ability to parse .psv and .tsv files (#141)

  • several bug fixes, including:

    • checking for the importability of a parser rather than the presence of the file (#136 by @AusIV)

    • manage versions with bumpversion (#146)

    • properly reporting on missing external dependencies (#139 by @AusIV)

    • pin chardet to version 2.1.1 to avoid decode errors (#107)

    • avoid unicode decode error with html parser (#147 by @suned)

    • enabling autocomplete and improving error handling (#149)

1.5.0

1.4.0

  • added layout preservation option for pdftotext pdf extractor (#93 by @ankushshah89)

  • added simple support for extensionless filenames, treating them as plain .txt files (#85)

  • several bug fixes, including:

    • now extracting the text in tables from docx files at the end of the text extraction (#92 by @jsmith-mploir)

    • faster testing framework by only rebuilding test data when needed (#90)

    • fixed .html and .epub parsers to deal with beautifulsoup4 upgrades

    • using official msg-extractor now that it has a native setup.py

    • updated tests for .html, .ogg, .wav, and .mp3 file types to be consistent with more recent versions of the underlying packages

1.3.0

1.2.0

  • support for .tiff files (#81)

  • added support for other languages for tesseract (#76 by @anderser)

  • added --option/-O flag to pass arbitrary arguments for things like languages into textract

  • several bug fixes, including:

    • fix bug with doing OCR on multi-page pdfs and removing temporary directory (#82 by @pudo)

    • correctly accounting for whitespace in .odt documents (#79 by @evfredericksen)

    • standardizing testing environment to be compatible with different versions of third-party command line tools (#78)

1.1.0

  • support for .wav, .mp3, and .ogg files (#56 and #62 by @arvindch)

  • support for .csv files (#64)

  • support for scanned .pdf files with tesseract (#66 by @pudo)

  • support for .htm files (#69)

  • several bug fixes, including:

    • .odt parser now correctly extracts text in order (#61 by @levivm)

    • fixed Docker development environment compatibility with the Vagrant VM environment (#73 by @ShawnMilo)

  • several internal improvements, including:

    • improvements in the Python documentation (#70)

    • improved html output with reduced whitespace around inline elements in output text (#58 by @eiotec)

1.0.0

  • standardized encoding of output with -e/--encoding option (#39)

  • support for .xls and .xlsx files (#42 and #55 by @levivm)

  • support for .epub files (#40 by @kokxx)

  • several bug fixes, including:

    • removing tesseract version info from output of image parsers (#48)

    • problems with spaces in filenames (#53)

    • concurrency problems with tesseract (#44 by @ShawnMilo, #41 by @christomitov)

  • several internal improvements, including:

    • switching to using class-based parsers to abstract away the common functionality between different parser classes (#39)

    • switching to a Python-based test suite and adding standardized text tests to make sure output is consistent across file types (#49)

    • including support for Docker-based testing (#46 by @ShawnMilo)

0.5.1

  • several bug fixes, including:

    • documentation fixes

    • shell commands hanging on large files (#33)

0.5.0

  • support for .json files (#13 by @anthonygarvan)

  • support for .odt files (#29 by @christomitov)

  • support for .ps files (#25)

  • support for .gif, .jpg, .jpeg, and .png files (#30 by @christomitov)

  • several bug fixes, including:

    • improved fallback handling in .pdf parser if the pdftotext command line utility isn’t installed (#26)

    • improved documentation for installation instructions on non-Ubuntu operating systems (#21, #26)

  • several internal improvements, including:

    • cleaned up implementation of extension parsers to avoid magic

0.4.0

  • support for .html files (#7)

  • support for .eml files (#4)

  • automated the documentation for the Python package using sphinx-apidoc in docs/Makefile (#9)

0.3.0

  • support for .txt files, haha (#8)

  • fixed installation bug with not properly including requirements files in the manifest

0.2.0

  • support for .doc files (#2)

  • support for .pdf files (#3)

  • several bug fixes, including:

    • fixing tab-completion bug on file paths (#6)

    • fixing tests to make sure they work properly on travis-ci

0.1.0

  • Initial release, support for .docx and .pptx