We used to rely on several different libraries to extract text from Word, PDF, PowerPoint, and Excel files, and maintaining them was tricky. Our CTO found the Apache Tika project, which made our lives much easier. Extracting text from all kinds of documents is now a piece of cake. The beauty of the Tika library is that it detects the MIME type and other metadata automatically. Here is some sample code that extracts text using Tika:
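@Override
public String getText(InputStream stream, int maxSize) {
    Tika tika = new Tika();
    // Cap the amount of text extracted from a single document.
    tika.setMaxStringLength(maxSize);
    try {
        // Tika detects the document type and picks a matching parser.
        return tika.parseToString(stream);
    } catch (Throwable t) {
        // Catch broadly so one bad document does not fail the caller.
        logger.error("Error extracting text from document of type " + logIdentifier, t);
        return " ";
    }
}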
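The automatic MIME type and metadata detection mentioned above can be tried on its own. The following is only a minimal sketch, assuming `tika-core` plus Tika's parser modules are on the classpath. The class name `TikaDetectSketch` and its helper methods are hypothetical and not part of our code; `Tika.detect(byte[])` and `Tika.parseToString(InputStream, Metadata)` are the facade calls it relies on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper class for illustration only.
class TikaDetectSketch {

    // Returns the MIME type Tika guesses from the leading bytes of the content.
    static String detectType(byte[] content) {
        return new Tika().detect(content);
    }

    // Extracts text and fills the supplied Metadata object as a side effect.
    static String extract(InputStream stream, Metadata metadata)
            throws IOException, TikaException {
        return new Tika().parseToString(stream, metadata);
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println("Detected type: " + detectType(sample));

        Metadata metadata = new Metadata();
        try (InputStream in = new ByteArrayInputStream(sample)) {
            String text = extract(in, metadata);
            System.out.println("Text: " + text.trim());
        }

        // Print whatever metadata the selected parser reported.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

Which metadata keys show up depends on the document type and on the parsers available at runtime, so treat the printed names as something to explore rather than a fixed contract.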