Change Log
This project uses semantic versioning to track version numbers, where backwards incompatible changes (highlighted in bold) bump the major version of the package.
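As a rough illustration of the versioning rule above, the following sketch checks whether an upgrade crosses a major version and may therefore contain backwards incompatible changes. The helper names `parse_version` and `is_breaking_upgrade` are hypothetical and are not part of this package; the sketch also assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffixes.

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_breaking_upgrade(old, new):
    """Return True when moving from old to new bumps the major version."""
    return parse_version(new)[0] > parse_version(old)[0]


print(is_breaking_upgrade("0.5.1", "1.0.0"))  # True
print(is_breaking_upgrade("1.3.0", "1.4.0"))  # False
```

Only the first component is compared here, so minor and patch releases are treated as safe upgrades.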
latest changes in development for next release
1.4.0
- added layout preservation option for pdftotext pdf extractor (#93 by @ankushshah89)
- added simple support for extensionless filenames, treating them as plain .txt files (#85)
- several bug fixes, including:
  - now extracting the text in tables from docx files at the end of the text extraction (#92 by @jsmith-mploir)
  - faster testing framework by only rebuilding test data when needed (#90)
  - fixed .html and .epub parsers to deal with beautifulsoup4 upgrades
  - using official msg-extractor now that it has a native setup.py
  - updated tests for .html, .ogg, .wav, and .mp3 file types to be consistent with more recent versions of the underlying packages.
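The extensionless-filename behaviour added in 1.4.0 can be pictured with a minimal sketch like the one below. This is an assumption about the general approach, not the package's actual code: `DEFAULT_EXTENSION` and `effective_extension` are hypothetical names introduced only for illustration.

```python
import os

# Hypothetical fallback used when a filename has no extension.
DEFAULT_EXTENSION = ".txt"


def effective_extension(filename):
    """Return the lowercased extension, falling back to .txt when absent."""
    _, ext = os.path.splitext(filename)
    return ext.lower() if ext else DEFAULT_EXTENSION


print(effective_extension("README"))      # .txt
print(effective_extension("report.PDF"))  # .pdf
```

Only the final suffix is considered, so a name such as `archive.tar.gz` would be dispatched on `.gz` in this sketch.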
1.3.0
- support for .rtf files (#84)
- support for .msg files (#87 and #17 by @anthonygarvan)
1.2.0
- support for .tiff files (#81)
- added support for other languages for tesseract (#76 by @anderser)
- added --option/-O flag to pass arbitrary arguments for things like languages into textract
- several bug fixes, including:
  - fix bug with doing OCR on multi-page pdfs and removing temporary directory (#82 by @pudo)
  - correctly accounting for whitespace in .odt documents (#79 by @evfredericksen)
  - standardizing testing environment to be compatible with different versions of third-party command line tools (#78)
1.1.0
- support for .wav, .mp3, and .ogg files (#56 and #62 by @arvindch)
- support for .csv files (#64)
- support for scanned .pdf files with tesseract (#66 by @pudo)
- support for .htm files (#69)
- several bug fixes, including:
  - .odt parser now correctly extracts text in order (#61 by @levivm)
  - fixed Docker development environment compatibility with the Vagrant VM environment (#73 by @ShawnMilo)
- several internal improvements, including:
1.0.0
- standardized encoding of output with -e/--encoding option (#39)
- support for .xls and .xlsx files (#42 and #55 by @levivm)
- support for .epub files (#40 by @kokxx)
- several bug fixes, including:
  - removing tesseract version info from output of image parsers (#48)
  - problems with spaces in filenames (#53)
  - concurrency problems with tesseract (#44 by @ShawnMilo, #41 by @christomitov)
- several internal improvements, including:
  - switching to using class-based parsers to abstract away the common functionality between different parser classes (#39)
  - switching to using a python-based test suite and added standardized text tests to make sure output is consistent across file types (#49)
  - including support for Docker-based testing (#46 by @ShawnMilo)
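To illustrate how class-based parsers can share common functionality such as standardized output encoding, here is a minimal sketch. It is not the package's real class hierarchy: `BaseParser`, `FixedTextParser`, and the method names `extract` and `process` are assumptions made for this example.

```python
class BaseParser:
    """Hypothetical base class: subclasses only implement extract()."""

    def extract(self, filename):
        raise NotImplementedError("subclasses must implement extract()")

    def process(self, filename, encoding="utf-8"):
        # Shared step: every parser's text is encoded the same way.
        text = self.extract(filename)
        return text.encode(encoding)


class FixedTextParser(BaseParser):
    """Toy subclass standing in for a real format-specific parser."""

    def extract(self, filename):
        # A real parser would read and convert the file here.
        return "caf\u00e9 from " + filename


parser = FixedTextParser()
print(parser.process("menu.demo"))                      # b'caf\xc3\xa9 from menu.demo'
print(parser.process("menu.demo", encoding="latin-1"))  # b'caf\xe9 from menu.demo'
```

Keeping the encoding step in the base class means each format-specific subclass only has to return text, which is one way to get consistent output across file types.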
0.5.1
- several bug fixes, including:
  - documentation fixes
  - shell commands hanging on large files (#33)
0.5.0
- support for .json files (#13 by @anthonygarvan)
- support for .odt files (#29 by @christomitov)
- support for .ps files (#25)
- support for .gif, .jpg, .jpeg, and .png files (#30 by @christomitov)
- several bug fixes, including:
- several internal improvements, including:
  - cleaned up implementation of extension parsers to avoid magic
0.4.0
0.3.0
- support for .txt files, haha (#8)
- fixed installation bug with not properly including requirements files in the manifest
0.2.0
0.1.0
- Initial release, support for .docx and .pptx