Tika supported document types

Tika is a library for extracting text from documents. We wrote a remote document processor service that, given a streamed document, extracts its text and returns it in the response. We stream documents because we didn't want to mount every filer on that box: the filers keep changing, and we didn't want ops people to forget to add a new filer to the box and cause issues.

I needed a way to find out whether Tika can extract text from a document before sending a request to the document processor. I had to look into the code, but if you are using the default AutoDetectParser, here is a way to find out:

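    import org.apache.tika.Tika;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AutoDetectParser;

    private static final AutoDetectParser parser = new AutoDetectParser();
    private static final Tika tika = new Tika();

    public static boolean canExtractText(String extension) {
        // Detect the MIME type from a dummy file name that carries the extension.
        String mimeType = tika.detect("a." + extension);
        // In Tika 1.x the map is keyed by MediaType, so a String key would never match.
        return parser.getParsers().containsKey(MediaType.parse(mimeType));
    }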
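
Since `canExtractText` takes an extension rather than a file name, the caller still has to derive the extension before deciding whether a request is worth sending. Here is a minimal sketch of that pre-check, using only the JDK. The class name `ExtractionGate` and both method names are made up for illustration, and the support check is passed in as a `Predicate<String>` so the sketch does not depend on Tika itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical helper, not part of Tika or of the original service.
public class ExtractionGate {

    // Returns the lower-cased extension of a file name, or "" when there is none.
    public static String extensionOf(String fileName) {
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        int dot = fileName.lastIndexOf('.');
        // A dot in a directory name, a leading dot (".bashrc"), or a trailing dot
        // does not count as an extension.
        if (dot <= slash + 1 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Keeps only the file names whose extension passes the supplied check.
    // Files without an extension are skipped, which is an assumption of this sketch.
    public static List<String> worthSending(List<String> fileNames, Predicate<String> canExtract) {
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            String ext = extensionOf(name);
            if (!ext.isEmpty() && canExtract.test(ext)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

With the method above in scope, the call would look something like `ExtractionGate.worthSending(names, ext -> canExtractText(ext))`. Keep in mind that this check only looks at the file name, so a mislabelled file can still fail on the document processor side.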

