Tika is a library for extracting text from documents. We wrote a remote document processor service that, given a streamed document, extracts its text and returns it in the response. We stream documents because we didn't want to mount every filer on that box: the set of filers keeps changing, and we didn't want ops people to forget to add a new filer to the box and cause problems.
I needed a way to find out whether Tika can extract text from a document before sending a request to the document processor. I had to look into the code, but if you are using the default AutoDetectParser, here is a way to find out:
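import org.apache.tika.Tika;
import org.apache.tika.parser.AutoDetectParser;

private static final AutoDetectParser parser = new AutoDetectParser();
private static final Tika tika = new Tika();

// Detects the MIME type from a dummy file name, then checks for a registered parser.
// If your Tika version keys getParsers() by MediaType, look up MediaType.parse(mimeType) instead.
public static boolean canExtractText(String extension) {
    String mimeType = tika.detect("a." + extension);
    return parser.getParsers().containsKey(mimeType);
}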
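In practice the caller usually has a file name rather than a bare extension, so a small pre-flight helper can sit in front of the check. The following is only a sketch under assumptions: PreflightCheck, extensionOf and shouldSend are hypothetical names, not part of Tika, and the extension check is passed in as a Predicate so the example runs without Tika on the classpath.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper, not part of Tika: decides from a file name alone
// whether a document is worth streaming to the remote processor.
final class PreflightCheck {

    // Stand-in for canExtractText(extension); injected so the sketch runs without Tika.
    private final Predicate<String> canExtractText;

    PreflightCheck(Predicate<String> canExtractText) {
        this.canExtractText = canExtractText;
    }

    // Assumes a bare file name with no directory part. Returns the lower-cased
    // text after the last dot, or empty when there is no usable extension
    // ("README", ".bashrc", "report.").
    static Optional<String> extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // True only when the name has an extension and the injected check accepts it.
    boolean shouldSend(String fileName) {
        return extensionOf(fileName).filter(canExtractText).isPresent();
    }
}
```

With Tika available, the predicate would presumably be a method reference to canExtractText, and a document would be streamed only when shouldSend returns true for its name.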