Change Log ========== This project uses `semantic versioning `_ to track version numbers, where backwards incompatible changes (highlighted in **bold**) bump the major version of the package. NEXT RELEASE ------------------- .. THANKS FOR CONTRIBUTING; ADD YOUR UNRELEASED CHANGES HERE! TBD ------------------- * Minor nitpick to remove unused line (`#559`_ by `@KyleKing`_) 2.0.0 ------------------- * Major modernization of textract absorbing the `textract-py3` fork (`#543`_ by `@KyleKing`_) * Implement GitHub Actions CI/CD and improve test coverage * Fix minor inconsistencies on Windows * Resolve dependency specification errors by migrating to uv * Update some minimum dependency versions and sets minimum Python to 3.9 * Migrate to PyPi Trusted Publisher for releases * Multiple Fixes: * Return null on empty stream (`#422`_ by `@TheElementalOfDestruction`_) * Enable encoding detection for the txt parser (`#456`_ by `@LoicGrobol`_) * Catch ShellError from pdf2txt.py (`#495`_ by `@dhrim`_) * Support python3.12 (`#502`_ by `@branchv`_) * Documentation Updates * Add colons to issue template for consistency (`#520`_ by `@mcp292`_) * Fix a few typos (`#430`_ by `@timgates42`_) 1.6.5 ------------------- * switched epub parsing to MIT license compatible package (`#411`_ by `@jhale1805`_) 1.6.4 ------------------- * several bug fixes, including: * fixing dependency declarations (`#162`_ by `@lillypad`_) 1.6.1 ------------------- * several bug fixes, including: * fixing the readthedocs build (`#150`_) 1.6.0 ------------------- * Let the user provide file extension as an argument when the file name has no extension (`#148`_ by `@motazsaad`_) * Added ability to parse audio with ``pocketsphinx`` (`#122`_ by `@barrust`_) * Added ability to parse ``.psv`` and ``.tsv`` files (`#141`_) * several bug fixes, including: * checking for the importability of a parser rather than the presense of the file (`#136`_ by `@AusIV`_) * manage versions with `bumpversion `_ (`#146`_) * properly reporting on missing external dependencies (`#139`_ by `@AusIV`_) * pin `chardet` to version 2.1.1 to avoid decode errors (`#107`_) * avoid unicode decode error with html parser (`#147`_ by `@suned`_) * enabling autocomplete and improving error handling (`#149`_) 1.5.0 ----- * Added python 3 support, including pdfminer (`#104`_ by `@sirex`_ via `#126`_) * Python 3 support for ``pdfminer`` using ``pdfminer.six`` (`#116`_ by `@jaraco`_ via `#126`_) * fixed security vulnerability by properly using subprocess.call (`#114`_ by `@pierre-ernst`_) * updating to ``tesseract`` 3.03 (`#127`_) * adding a ``.tif`` synonym for ``.tiff`` files (`#113`_ by `@onionradish`_) * improved ``.docx`` support using ``docx2txt`` (`#100`_ by `@ankushshah89`_) * several bug fixes, including: * including all requirements for ``Pillow`` (`#119`_ by `@akoumjian`_) 1.4.0 ----- * added layout preservation option for pdftotext pdf extractor (`#93`_ by `@ankushshah89`_) * added simple support for extensionless filenames, treating them as plain ``.txt`` files (`#85`_) * several bug fixes, including: * now extracting the text in tables from docx files at the end of the text extraction (`#92`_ by `@jsmith-mploir`_) * faster testing framework by only rebuilding test data when needed (`#90`_) * fixed ``.html`` and ``.epub`` parsers to deal with beautifulsoup4 upgrades * using official ``msg-extractor`` now that it has a native ``setup.py`` * updated tests for ``.html``, ``.ogg``, ``.wav``, and ``.mp3`` file types to be consistent with more recent versions of the underlying packages. 1.3.0 ----- * support for ``.rtf`` files (`#84`_) * support for ``.msg`` files (`#87`_ and `#17`_ by `@anthonygarvan`_) 1.2.0 ----- * support for ``.tiff`` files (`#81`_) * added support for other languages for tesseract (`#76`_ by `@anderser`_) * added ``--option/-O`` flag to pass arbitrary arguments for things like languages into textract * several bug fixes, including: * fix bug with doing OCR on multi-page pdfs and removing temporary directory (`#82`_ by `@pudo`_) * correctly accounting for whitespace in ``.odt`` documents (`#79`_ by `@evfredericksen`_) * standardizing testing environment to be compatible with different versions of third-party command line tools (`#78`_) 1.1.0 ----- * support for ``.wav``, ``.mp3``, and ``.ogg`` files (`#56`_ and `#62`_ by `@arvindch`_) * support for ``.csv`` files (`#64`_) * support for scanned ``.pdf`` files with tesseract (`#66`_ by `@pudo`_) * support for ``.htm`` files (`#69`_) * several bug fixes, including: * ``.odt`` parser now correctly extracts text in order (`#61`_ by `@levivm`_) * fixed Docker development environment compatability with the Vagrant VM environment (`#73`_ by `@ShawnMilo`_) * several internal improvements, including: * improvements in the python documentation (`#70`_) * improved html output with reduced whitespace around inline elements in output text (`#58`_ by `@eiotec`_) 1.0.0 ----- * **standardized encoding of output with** ``-e/--encoding`` **option** (`#39`_) * support for ``.xls`` and ``.xlsx`` files (`#42`_ and `#55`_ by `@levivm`_) * support for ``.epub`` files (`#40`_ by `@kokxx`_) * several bug fixes, including: * removing tesseract version info from output of image parsers (`#48`_) * problems with spaces in filenames (`#53`_) * concurrancy problems with tesseract (`#44`_ by `@ShawnMilo`_, `#41`_ by `@christomitov`_) * several internal improvements, including: * switching to using class-based parsers to abstract away the common functionality between different parser classes (`#39`_) * switching to using a python-based test suite and added standardized text tests to make sure output is consistent across file types (`#49`_) * including support for Docker-based testing (`#46`_ by `@ShawnMilo`_) 0.5.1 ----- * several bug fixes, including: * documentation fixes * shell commands hanging on large files (`#33`_) 0.5.0 ----- * support for ``.json`` files (`#13`_ by `@anthonygarvan`_) * support for ``.odt`` files (`#29`_ by `@christomitov`_) * support for ``.ps`` files (`#25`_) * support for ``.gif``, ``.jpg``, ``.jpeg``, and ``.png`` files (`#30`_ by `@christomitov`_) * several bug fixes, including: * improved fallback handling in ``.pdf`` parser if the ``pdftotext`` command line utility isn't installed (`#26`_) * improved documentation for installation instructions on non-Ubuntu operating systems (`#21`_, `#26`_) * several internal improvements, including: * cleaned up implementation of extension parsers to avoid magic 0.4.0 ----- * support for ``.html`` files (`#7`_) * support for ``.eml`` files (`#4`_) * automated the documentation for the python package using sphinx-apidoc in docs/Makefile (`#9`_) 0.3.0 ----- * support for ``.txt`` files, haha (`#8`_) * fixed installation bug with not properly including requirements files in the manifest 0.2.0 ----- * support for ``.doc`` files (`#2`_) * support for ``.pdf`` files (`#3`_) * several bug fixes, including: * fixing tab complete bug no file paths (`#6`_) * fixing tests to make sure the work properly on travis-ci 0.1.0 ----- * Initial release, support for ``.docx`` and ``.pptx`` .. list of contributors that are linked to above. putting links here .. to make the text above relatively clean .. _@akoumjian: https://github.com/akoumjian .. _@anthonygarvan: https://github.com/anthonygarvan .. _@anderser: https://github.com/anderser .. _@ankushshah89: https://github.com/ankushshah89 .. _@arvindch: https://github.com/arvindch .. _@barrust: https://github.com/barrust .. _@AusIV: https://github.com/AusIV .. _@christomitov: https://github.com/christomitov .. _@eiotec: https://github.com/eiotec .. _@evfredericksen: https://github.com/evfredericksen .. _@jaraco: https://github.com/jaraco .. _@jhale1805: https://github.com/jhale1805 .. _@jsmith-mploir: https://github.com/jsmith-mploir .. _@kokxx: https://github.com/Kokxx .. _@levivm: https://github.com/levivm .. _@lillypad: https://github.com/lillypad .. _@motazsaad: https://github.com/motazsaad .. _@onionradish: https://github.com/onionradish .. _@pierre-ernst: https://github.com/pierre-ernst .. _@pudo: https://github.com/pudo .. _@ShawnMilo: https://github.com/ShawnMilo .. _@sirex: https://github.com/sirex .. _@suned: https://github.com/suned .. list of issues that have been resolved. putting links here to make .. the text above relatively clean .. _#2: https://github.com/deanmalmgren/textract/issues/2 .. _#3: https://github.com/deanmalmgren/textract/issues/3 .. _#4: https://github.com/deanmalmgren/textract/issues/4 .. _#6: https://github.com/deanmalmgren/textract/issues/6 .. _#7: https://github.com/deanmalmgren/textract/issues/7 .. _#8: https://github.com/deanmalmgren/textract/issues/8 .. _#9: https://github.com/deanmalmgren/textract/issues/9 .. _#13: https://github.com/deanmalmgren/textract/issues/13 .. _#17: https://github.com/deanmalmgren/textract/issues/17 .. _#21: https://github.com/deanmalmgren/textract/issues/21 .. _#25: https://github.com/deanmalmgren/textract/issues/25 .. _#26: https://github.com/deanmalmgren/textract/issues/26 .. _#29: https://github.com/deanmalmgren/textract/issues/29 .. _#30: https://github.com/deanmalmgren/textract/issues/30 .. _#33: https://github.com/deanmalmgren/textract/issues/33 .. _#39: https://github.com/deanmalmgren/textract/issues/39 .. _#40: https://github.com/deanmalmgren/textract/issues/40 .. _#41: https://github.com/deanmalmgren/textract/issues/41 .. _#42: https://github.com/deanmalmgren/textract/issues/42 .. _#44: https://github.com/deanmalmgren/textract/issues/44 .. _#46: https://github.com/deanmalmgren/textract/issues/46 .. _#48: https://github.com/deanmalmgren/textract/issues/48 .. _#49: https://github.com/deanmalmgren/textract/issues/49 .. _#53: https://github.com/deanmalmgren/textract/issues/53 .. _#55: https://github.com/deanmalmgren/textract/issues/55 .. _#56: https://github.com/deanmalmgren/textract/issues/56 .. _#58: https://github.com/deanmalmgren/textract/issues/58 .. _#61: https://github.com/deanmalmgren/textract/issues/61 .. _#62: https://github.com/deanmalmgren/textract/issues/62 .. _#64: https://github.com/deanmalmgren/textract/issues/64 .. _#66: https://github.com/deanmalmgren/textract/issues/66 .. _#69: https://github.com/deanmalmgren/textract/issues/69 .. _#70: https://github.com/deanmalmgren/textract/issues/70 .. _#73: https://github.com/deanmalmgren/textract/issues/73 .. _#76: https://github.com/deanmalmgren/textract/issues/76 .. _#78: https://github.com/deanmalmgren/textract/issues/78 .. _#79: https://github.com/deanmalmgren/textract/issues/79 .. _#81: https://github.com/deanmalmgren/textract/issues/81 .. _#82: https://github.com/deanmalmgren/textract/issues/82 .. _#84: https://github.com/deanmalmgren/textract/issues/84 .. _#85: https://github.com/deanmalmgren/textract/issues/85 .. _#87: https://github.com/deanmalmgren/textract/issues/87 .. _#90: https://github.com/deanmalmgren/textract/issues/90 .. _#92: https://github.com/deanmalmgren/textract/issues/92 .. _#93: https://github.com/deanmalmgren/textract/issues/93 .. _#100: https://github.com/deanmalmgren/textract/issues/100 .. _#104: https://github.com/deanmalmgren/textract/issues/104 .. _#107: https://github.com/deanmalmgren/textract/issues/107 .. _#113: https://github.com/deanmalmgren/textract/issues/113 .. _#114: https://github.com/deanmalmgren/textract/issues/114 .. _#116: https://github.com/deanmalmgren/textract/issues/116 .. _#119: https://github.com/deanmalmgren/textract/issues/119 .. _#126: https://github.com/deanmalmgren/textract/issues/126 .. _#122: https://github.com/deanmalmgren/textract/issues/122 .. _#127: https://github.com/deanmalmgren/textract/issues/127 .. _#136: https://github.com/deanmalmgren/textract/issues/136 .. _#139: https://github.com/deanmalmgren/textract/issues/139 .. _#141: https://github.com/deanmalmgren/textract/issues/141 .. _#146: https://github.com/deanmalmgren/textract/issues/146 .. _#147: https://github.com/deanmalmgren/textract/issues/147 .. _#148: https://github.com/deanmalmgren/textract/issues/148 .. _#149: https://github.com/deanmalmgren/textract/issues/149 .. _#150: https://github.com/deanmalmgren/textract/issues/150 .. _#162: https://github.com/deanmalmgren/textract/issues/162 .. _#411: https://github.com/deanmalmgren/textract/issues/411