Change Log
This project uses semantic versioning to track version numbers, where backwards-incompatible changes (highlighted in bold) bump the major version of the package.
NEXT RELEASE
TBD
Minor nitpick to remove unused line (`#559`_ by `@KyleKing`_)
2.0.0
Major modernization of textract absorbing the textract-py3 fork (`#543`_ by `@KyleKing`_)
Implement GitHub Actions CI/CD and improve test coverage
Fix minor inconsistencies on Windows
Resolve dependency specification errors by migrating to uv
Update some minimum dependency versions and set minimum Python to 3.9
Migrate to PyPI Trusted Publisher for releases
Multiple Fixes:
Return null on empty stream (`#422`_ by `@TheElementalOfDestruction`_)
Enable encoding detection for the txt parser (`#456`_ by `@LoicGrobol`_)
Support Python 3.12 (`#502`_ by `@branchv`_)
Documentation Updates
Add colons to issue template for consistency (`#520`_ by `@mcp292`_)
Fix a few typos (`#430`_ by `@timgates42`_)
1.6.5
switched epub parsing to MIT license compatible package (#411 by @jhale1805)
1.6.4
1.6.1
several bug fixes, including:
fixing the readthedocs build (#150)
1.6.0
Let the user provide file extension as an argument when the file name has no extension (#148 by @motazsaad)
Added ability to parse audio with pocketsphinx (#122 by @barrust)
Added ability to parse .psv and .tsv files (#141)
several bug fixes, including:
checking for the importability of a parser rather than the presence of the file (#136 by @AusIV)
manage versions with bumpversion (#146)
properly reporting on missing external dependencies (#139 by @AusIV)
pin chardet to version 2.1.1 to avoid decode errors (#107)
avoid unicode decode error with html parser (#147 by @suned)
enabling autocomplete and improving error handling (#149)
1.5.0
Added Python 3 support, including pdfminer (#104 by @sirex via #126)
Python 3 support for pdfminer using pdfminer.six (#116 by @jaraco via #126)
fixed security vulnerability by properly using subprocess.call (#114 by @pierre-ernst)
updating to tesseract 3.03 (#127)
adding a .tif synonym for .tiff files (#113 by @onionradish)
improved .docx support using docx2txt (#100 by @ankushshah89)
several bug fixes, including:
including all requirements for Pillow (#119 by @akoumjian)
1.4.0
added layout preservation option for pdftotext pdf extractor (#93 by @ankushshah89)
added simple support for extensionless filenames, treating them as plain .txt files (#85)
several bug fixes, including:
now extracting the text in tables from docx files at the end of the text extraction (#92 by @jsmith-mploir)
faster testing framework by only rebuilding test data when needed (#90)
fixed .html and .epub parsers to deal with beautifulsoup4 upgrades
using official msg-extractor now that it has a native setup.py
updated tests for .html, .ogg, .wav, and .mp3 file types to be consistent with more recent versions of the underlying packages.
1.3.0
support for .rtf files (#84)
support for .msg files (#87 and #17 by @anthonygarvan)
1.2.0
support for .tiff files (#81)
added support for other languages for tesseract (#76 by @anderser)
added --option/-O flag to pass arbitrary arguments for things like languages into textract
several bug fixes, including:
fix bug with doing OCR on multi-page pdfs and removing temporary directory (#82 by @pudo)
correctly accounting for whitespace in .odt documents (#79 by @evfredericksen)
standardizing testing environment to be compatible with different versions of third-party command line tools (#78)
1.1.0
support for .wav, .mp3, and .ogg files (#56 and #62 by @arvindch)
support for .csv files (#64)
support for scanned .pdf files with tesseract (#66 by @pudo)
support for .htm files (#69)
several bug fixes, including:
.odt parser now correctly extracts text in order (#61 by @levivm)
fixed Docker development environment compatibility with the Vagrant VM environment (#73 by @ShawnMilo)
several internal improvements, including:
1.0.0
standardized encoding of output with -e/--encoding option (#39)
several bug fixes, including:
removing tesseract version info from output of image parsers (#48)
problems with spaces in filenames (#53)
concurrency problems with tesseract (#44 by @ShawnMilo, #41 by @christomitov)
several internal improvements, including:
switching to using class-based parsers to abstract away the common functionality between different parser classes (#39)
switching to using a python-based test suite and added standardized text tests to make sure output is consistent across file types (#49)
including support for Docker-based testing (#46 by @ShawnMilo)
0.5.1
several bug fixes, including:
documentation fixes
shell commands hanging on large files (#33)
0.5.0
support for .json files (#13 by @anthonygarvan)
support for .odt files (#29 by @christomitov)
support for .ps files (#25)
support for .gif, .jpg, .jpeg, and .png files (#30 by @christomitov)
several bug fixes, including:
several internal improvements, including:
cleaned up implementation of extension parsers to avoid magic
0.4.0
0.3.0
support for .txt files, haha (#8)
fixed installation bug with not properly including requirements files in the manifest
0.2.0
0.1.0
Initial release, support for .docx and .pptx