Extracting text from Office 2007 and Office 2010 documents using Tika

We used to rely on a different library for each format (Word, PDF, PowerPoint, Excel) to extract text, and that was tricky to maintain. Our CTO found the Apache Tika project, which made our life much easier: a single API now handles text extraction for all of these document types. The beauty of the Tika library is that it detects the MIME type and other metadata automatically, so we do not have to tell it what kind of file it is getting. Here is sample code to extract text using Tika:

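    // Assumes `logger` and `logIdentifier` are fields of the enclosing class.
    @Override
    public String getText(InputStream stream, int maxSize) {
        Tika tika = new Tika();
        // Cap the extracted text at maxSize characters.
        tika.setMaxStringLength(maxSize);
        try {
            // Tika detects the document type itself and picks the matching parser.
            return tika.parseToString(stream);
        } catch (Throwable t) {
            // Catch everything so that one bad document cannot break the caller.
            logger.error("Error extracting text from document of type " + logIdentifier, t);
            return " ";
        }
    }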
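
To give a feel for what "detecting the type automatically" means, here is a minimal sketch of the basic idea of looking at the first bytes of a file. This is not Tika's implementation, and `MagicSniffer`, `Kind` and `sniff` are names we made up for illustration. It only relies on three well-known signatures: PDF files start with `%PDF-`, Office 2007/2010 files (docx, xlsx, pptx) are ZIP containers, and the older doc/xls/ppt files are OLE2 compound files. Tika's real detection goes much further than this, for example to tell a docx apart from any other ZIP file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): guesses a container type from leading bytes. */
final class MagicSniffer {

    enum Kind { PDF, ZIP_CONTAINER, OLE2_CONTAINER, UNKNOWN }

    // Well-known file signatures ("magic numbers").
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04}; // "PK" 0x03 0x04
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns the kind whose signature matches the start of {@code head}. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, PDF)) return Kind.PDF;
        if (startsWith(head, ZIP)) return Kind.ZIP_CONTAINER;   // docx, xlsx, pptx, or any other zip
        if (startsWith(head, OLE2)) return Kind.OLE2_CONTAINER; // legacy doc, xls, ppt
        return Kind.UNKNOWN;
    }

    // True when data is at least as long as prefix and begins with it.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(pdfHead));
        System.out.println(sniff(zipHead));
    }
}
```

A sniffer like this only narrows things down to the container format, which is exactly why we were happy to hand the whole job over to a library instead of maintaining such checks ourselves.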

Programming fun at startup
