Change Log

This project uses semantic versioning to track version numbers, where backwards-incompatible changes (highlighted in bold) bump the major version of the package.
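
As a rough illustration of the versioning rule above, the following sketch computes the next version number for a given kind of change. The function name `bump` and the change labels `"breaking"`, `"feature"`, and `"fix"` are hypothetical names chosen for this example; they are not part of this package.

```python
def bump(version, change):
    """Return the next semantic version for a given kind of change.

    Hypothetical helper for illustration only.
    `change` is one of "breaking", "feature", or "fix".
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        # backwards-incompatible change: bump major, reset minor and patch
        return "{}.0.0".format(major + 1)
    if change == "feature":
        # backwards-compatible feature: bump minor, reset patch
        return "{}.{}.0".format(major, minor + 1)
    if change == "fix":
        # backwards-compatible bug fix: bump patch only
        return "{}.{}.{}".format(major, minor, patch + 1)
    raise ValueError("unknown change type: {!r}".format(change))
```

Under this rule, a bug-fix release after 1.6.0 is numbered 1.6.1, while a backwards-incompatible release would be numbered 2.0.0.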

latest changes in development for next release

1.6.1

  • several bug fixes, including:
    • fixing the readthedocs build (#150)

1.6.0

  • Let the user provide file extension as an argument when the file name has no extension (#148 by @motazsaad)
  • Added ability to parse audio with pocketsphinx (#122 by @barrust)
  • Added ability to parse .psv and .tsv files (#141)
  • several bug fixes, including:
    • checking for the importability of a parser rather than the presence of the file (#136 by @AusIV)
    • manage versions with bumpversion (#146)
    • properly reporting on missing external dependencies (#139 by @AusIV)
    • pin chardet to version 2.1.1 to avoid decode errors (#107)
    • avoid unicode decode error with html parser (#147 by @suned)
    • enabling autocomplete and improving error handling (#149)
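
The extension argument introduced in 1.6.0 (#148) can be pictured with the minimal sketch below. It is not the package's actual code: `resolve_extension` is a hypothetical helper, and the precedence it applies (the file name's own extension first, then the user-supplied one, then plain .txt as in #85) is an assumption read off the wording of the changelog entries.

```python
import os


def resolve_extension(filename, extension=None, default=".txt"):
    """Pick the extension used to choose a parser.

    Hypothetical helper for illustration only; the precedence is an
    assumption, not taken from the library's source.
    """
    _, ext = os.path.splitext(filename)
    if ext:
        # the file name carries its own extension
        return ext.lower()
    if extension:
        # no extension in the name: use the caller's, with or without a dot
        return "." + extension.lstrip(".").lower()
    # nothing to go on: treat the file as plain text
    return default
```

With this sketch, a file named `notes` passed with `extension="docx"` would be routed to the .docx parser, and the same file with no extension argument would be treated as plain text.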

1.5.0

1.4.0

  • added layout preservation option for pdftotext pdf extractor (#93 by @ankushshah89)
  • added simple support for extensionless filenames, treating them as plain .txt files (#85)
  • several bug fixes, including:
    • now extracting the text in tables from docx files at the end of the text extraction (#92 by @jsmith-mploir)
    • faster testing framework by only rebuilding test data when needed (#90)
    • fixed .html and .epub parsers to deal with beautifulsoup4 upgrades
    • using official msg-extractor now that it has a native setup.py
    • updated tests for .html, .ogg, .wav, and .mp3 file types to be consistent with more recent versions of the underlying packages

1.3.0

1.2.0

  • support for .tiff files (#81)
  • added support for other languages for tesseract (#76 by @anderser)
  • added --option/-O flag to pass arbitrary arguments for things like languages into textract
  • several bug fixes, including:
    • fix bug with doing OCR on multi-page pdfs and removing temporary directory (#82 by @pudo)
    • correctly accounting for whitespace in .odt documents (#79 by @evfredericksen)
    • standardizing testing environment to be compatible with different versions of third-party command line tools (#78)

1.1.0

  • support for .wav, .mp3, and .ogg files (#56 and #62 by @arvindch)
  • support for .csv files (#64)
  • support for scanned .pdf files with tesseract (#66 by @pudo)
  • support for .htm files (#69)
  • several bug fixes, including:
    • .odt parser now correctly extracts text in order (#61 by @levivm)
    • fixed Docker development environment compatibility with the Vagrant VM environment (#73 by @ShawnMilo)
  • several internal improvements, including:
    • improvements in the python documentation (#70)
    • improved html output with reduced whitespace around inline elements in output text (#58 by @eiotec)

1.0.0

  • standardized encoding of output with -e/--encoding option (#39)
  • support for .xls and .xlsx files (#42 and #55 by @levivm)
  • support for .epub files (#40 by @kokxx)
  • several bug fixes, including:
    • removing tesseract version info from output of image parsers (#48)
    • problems with spaces in filenames (#53)
    • concurrency problems with tesseract (#44 by @ShawnMilo, #41 by @christomitov)
  • several internal improvements, including:
    • switching to using class-based parsers to abstract away the common functionality between different parser classes (#39)
    • switching to using a python-based test suite and added standardized text tests to make sure output is consistent across file types (#49)
    • including support for Docker-based testing (#46 by @ShawnMilo)

0.5.1

  • several bug fixes, including:
    • documentation fixes
    • shell commands hanging on large files (#33)

0.5.0

  • support for .json files (#13 by @anthonygarvan)
  • support for .odt files (#29 by @christomitov)
  • support for .ps files (#25)
  • support for .gif, .jpg, .jpeg, and .png files (#30 by @christomitov)
  • several bug fixes, including:
    • improved fallback handling in .pdf parser if the pdftotext command line utility isn’t installed (#26)
    • improved documentation for installation instructions on non-Ubuntu operating systems (#21, #26)
  • several internal improvements, including:
    • cleaned up implementation of extension parsers to avoid magic

0.4.0

  • support for .html files (#7)
  • support for .eml files (#4)
  • automated the documentation for the python package using sphinx-apidoc in docs/Makefile (#9)

0.3.0

  • support for .txt files, haha (#8)
  • fixed installation bug with not properly including requirements files in the manifest

0.2.0

  • support for .doc files (#2)
  • support for .pdf files (#3)
  • several bug fixes, including:
    • fixing tab-completion bug on file paths (#6)
    • fixing tests to make sure they work properly on travis-ci

0.1.0

  • Initial release, support for .docx and .pptx