Saturday, December 31, 2011

Apache Tika - a content analysis toolkit

Apache Tika 1.0 has been released. This release removes all deprecated pre-1.0 API methods and makes several configuration and OSGi improvements.

Tika provides a single API for extracting data and detecting language from arbitrary input formats, such as PDFs, images, text documents, and spreadsheets. Even audio and video formats are supported to a certain degree.
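
To give a feel for what "a single API" means in practice, here is a minimal sketch built on the `org.apache.tika.Tika` facade class. It assumes the `tika-core` and `tika-parsers` jars are on the classpath. The class name `TikaQuickStart` and its helper methods are made up for this illustration. Language detection is left out, because the classes for it have moved between Tika releases.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class for this post; only the Tika facade calls are real API.
final class TikaQuickStart {

    // One facade instance with the default configuration.
    private static final Tika TIKA = new Tika();

    // Guess the media type from the file name alone (no I/O involved).
    static String detectByName(String fileName) {
        return TIKA.detect(fileName);
    }

    // Guess the media type from the file's content and name.
    static String detectType(Path file) throws IOException {
        return TIKA.detect(file.toFile());
    }

    // Extract plain text, whatever the underlying format is.
    static String extractText(Path file) throws IOException, TikaException {
        return TIKA.parseToString(file.toFile());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TikaQuickStart <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println("Type: " + detectType(file));
        System.out.println("Text: " + extractText(file));
    }
}
```

The same two calls work for a PDF, a spreadsheet, or a plain text file. Note that `parseToString` is a convenience method and, as far as I know, truncates very long documents by default, so larger jobs would probably want the lower-level parser classes instead.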
