Change Log

This project uses semantic versioning to track version numbers, where backwards-incompatible changes (highlighted in bold) bump the major version of the package.
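As a rough illustration of the versioning rule stated above, the sketch below computes the next version number from the kind of change being released. The `bump` helper and its `breaking` and `feature` parameters are hypothetical names chosen for this example; they are not part of textract or of any release tooling the project uses.

```python
def bump(version: str, breaking: bool = False, feature: bool = False) -> str:
    """Return the next version string under semantic versioning.

    Hypothetical helper for illustration only: assumes a plain
    MAJOR.MINOR.PATCH string with no pre-release or build suffix.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if breaking:
        # Backwards-incompatible change: bump major, reset minor and patch.
        return f"{major + 1}.0.0"
    if feature:
        # Backwards-compatible feature: bump minor, reset patch.
        return f"{major}.{minor + 1}.0"
    # Backwards-compatible bug fix: bump patch only.
    return f"{major}.{minor}.{patch + 1}"
```

Under these assumptions, a breaking change released on top of 1.6.5 yields 2.0.0, while a bug-fix release on top of 1.6.4 yields 1.6.5.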

NEXT RELEASE

TBD

Minor nitpick to remove unused line (`#559`_ by `@KyleKing`_)

2.0.0

Major modernization of textract absorbing the textract-py3 fork (`#543`_ by `@KyleKing`_)
- Implement GitHub Actions CI/CD and improve test coverage
- Fix minor inconsistencies on Windows
- Resolve dependency specification errors by migrating to uv
- Update some minimum dependency versions and set the minimum Python version to 3.9
- Migrate to PyPI Trusted Publisher for releases
Multiple Fixes:
- Return null on empty stream (`#422`_ by `@TheElementalOfDestruction`_)
- Enable encoding detection for the txt parser (`#456`_ by `@LoicGrobol`_)
- Catch ShellError from pdf2txt.py (`#495`_ by `@dhrim`_)
- Support Python 3.12 (`#502`_ by `@branchv`_)
Documentation Updates:
- Add colons to issue template for consistency (`#520`_ by `@mcp292`_)
- Fix a few typos (`#430`_ by `@timgates42`_)

1.6.5

switched epub parsing to an MIT-license-compatible package (#411 by @jhale1805)

1.6.4

several bug fixes, including:
- fixing dependency declarations (#162 by @lillypad)

1.6.1

several bug fixes, including:
- fixing the readthedocs build (#150)

1.6.0

Let the user provide file extension as an argument when the file name has no extension (#148 by @motazsaad)
Added ability to parse audio with pocketsphinx (#122 by @barrust)
Added ability to parse .psv and .tsv files (#141)
several bug fixes, including:
- checking for the importability of a parser rather than the presence of the file (#136 by @AusIV)
- manage versions with bumpversion (#146)
- properly reporting on missing external dependencies (#139 by @AusIV)
- pin chardet to version 2.1.1 to avoid decode errors (#107)
- avoid unicode decode error with html parser (#147 by @suned)
- enabling autocomplete and improving error handling (#149)

1.5.0

Added python 3 support, including pdfminer (#104 by @sirex via #126)
Python 3 support for pdfminer using pdfminer.six (#116 by @jaraco via #126)
fixed security vulnerability by properly using subprocess.call (#114 by @pierre-ernst)
updating to tesseract 3.03 (#127)
adding a .tif synonym for .tiff files (#113 by @onionradish)
improved .docx support using docx2txt (#100 by @ankushshah89)
several bug fixes, including:
- including all requirements for Pillow (#119 by @akoumjian)

1.4.0

added layout preservation option for pdftotext pdf extractor (#93 by @ankushshah89)
added simple support for extensionless filenames, treating them as plain .txt files (#85)
several bug fixes, including:
- now extracting the text in tables from docx files at the end of the text extraction (#92 by @jsmith-mploir)
- faster testing framework by only rebuilding test data when needed (#90)
- fixed .html and .epub parsers to deal with beautifulsoup4 upgrades
- using official msg-extractor now that it has a native setup.py
- updated tests for .html, .ogg, .wav, and .mp3 file types to be consistent with more recent versions of the underlying packages

1.3.0

support for .rtf files (#84)
support for .msg files (#87 and #17 by @anthonygarvan)

1.2.0

support for .tiff files (#81)
added support for other languages for tesseract (#76 by @anderser)
added --option/-O flag to pass arbitrary arguments for things like languages into textract
several bug fixes, including:
- fix bug with doing OCR on multi-page pdfs and removing temporary directory (#82 by @pudo)
- correctly accounting for whitespace in .odt documents (#79 by @evfredericksen)
- standardizing testing environment to be compatible with different versions of third-party command line tools (#78)

1.1.0

support for .wav, .mp3, and .ogg files (#56 and #62 by @arvindch)
support for .csv files (#64)
support for scanned .pdf files with tesseract (#66 by @pudo)
support for .htm files (#69)
several bug fixes, including:
- .odt parser now correctly extracts text in order (#61 by @levivm)
- fixed Docker development environment compatibility with the Vagrant VM environment (#73 by @ShawnMilo)
several internal improvements, including:
- improvements in the python documentation (#70)
- improved html output with reduced whitespace around inline elements in output text (#58 by @eiotec)

1.0.0

standardized encoding of output with -e/--encoding option (#39)
support for .xls and .xlsx files (#42 and #55 by @levivm)
support for .epub files (#40 by @kokxx)
several bug fixes, including:
- removing tesseract version info from output of image parsers (#48)
- problems with spaces in filenames (#53)
- concurrency problems with tesseract (#44 by @ShawnMilo, #41 by @christomitov)
several internal improvements, including:
- switching to using class-based parsers to abstract away the common functionality between different parser classes (#39)
- switching to using a python-based test suite and adding standardized text tests to make sure output is consistent across file types (#49)
- including support for Docker-based testing (#46 by @ShawnMilo)

0.5.1

several bug fixes, including:
- documentation fixes
- shell commands hanging on large files (#33)

0.5.0

support for .json files (#13 by @anthonygarvan)
support for .odt files (#29 by @christomitov)
support for .ps files (#25)
support for .gif, .jpg, .jpeg, and .png files (#30 by @christomitov)
several bug fixes, including:
- improved fallback handling in .pdf parser if the pdftotext command line utility isn’t installed (#26)
- improved documentation for installation instructions on non-Ubuntu operating systems (#21, #26)
several internal improvements, including:
- cleaned up implementation of extension parsers to avoid magic

0.4.0

support for .html files (#7)
support for .eml files (#4)
automated the documentation for the python package using sphinx-apidoc in docs/Makefile (#9)

0.3.0

support for .txt files, haha (#8)
fixed installation bug with not properly including requirements files in the manifest

0.2.0

support for .doc files (#2)
support for .pdf files (#3)
several bug fixes, including:
- fixing tab complete bug no file paths (#6)
- fixing tests to make sure they work properly on travis-ci

0.1.0

Initial release, support for .docx and .pptx