
Tika supported document types

Tika is a library for extracting text from documents. We wrote a remote document processor service that takes a streamed document, extracts its text, and returns that text in the response. We stream the documents because we didn't want to mount every filer on that box: the set of filers keeps changing, and we didn't want the ops people to forget to add a new filer to the box and cause issues.

I needed a way to figure out whether Tika can extract text from a document before sending a request to the document processor. I had to look into the code, but if you are using the default auto-detecting parser (AutoDetectParser), here is a way to find out:

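    private static final AutoDetectParser parser = new AutoDetectParser();
    private static final Tika tika = new Tika();

    public static boolean canExtractText(String extension) {
        String mimeType = tika.detect("a." + extension);
        // Recent Tika releases key this map by org.apache.tika.mime.MediaType;
        // very old releases used the MIME type String directly.
        return parser.getParsers().containsKey(MediaType.parse(mimeType));
    }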
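
As a rough sketch of how such a check might be used before calling the document processor, the helper below filters a batch of file names by extension. `ExtractableFilter`, `extensionOf`, and `extractable` are hypothetical names, not part of Tika or of the service described here. The check itself is passed in as a `Predicate<String>`, so a method like `canExtractText` can be supplied as a method reference and the sketch needs nothing beyond the JDK.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper; none of these names come from Tika or the original service.
final class ExtractableFilter {

    private ExtractableFilter() {
    }

    // Returns the lower-cased extension of a file name, or "" when there is none
    // (no dot, only a leading dot as in ".bashrc", or a trailing dot).
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Keeps the file names whose extension passes the supplied check,
    // for example a method reference to canExtractText.
    static List<String> extractable(List<String> fileNames, Predicate<String> canExtractText) {
        return fileNames.stream()
                .filter(name -> canExtractText.test(extensionOf(name)))
                .collect(Collectors.toList());
    }
}
```

If `canExtractText` lived in a class called, say, `TikaSupport` (an assumed name), the call would look like `ExtractableFilter.extractable(fileNames, TikaSupport::canExtractText)`, and only the returned names would be streamed to the document processor.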


Popular posts from this blog

RabbitMQ java clients for beginners

Here is a sample consumer and producer for RabbitMQ. The steps are:

1. Download Erlang.
2. Download the RabbitMQ server.
3. Download the RabbitMQ Java client jars.
4. Compile and run the two classes below, and you are done.

This sample creates a durable exchange, queue, and message. The first time around, you will have to start the consumer before you start the producer.

For more information on AMQP, exchanges, and queues, read this excellent tutorial.

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.*;

    public class RabbitMQProducer {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setUsername("guest");
            factory.setPassword("guest");
            factory.setVirtualHost("/");
            factory.setHost("");
            factory.setPort(5672);
            Conne…

Spring query timeout or transaction timeout

If you are using Spring to manage transactions, you can specify a default transaction timeout using

    <bean id="transactionManager"
          class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
        <property name="dataSource" ref="dataSource" />
        <property name="defaultTimeout" value="30" /> <!-- 30 seconds -->
    </bean>

or you can override the timeout in the annotation

    @Transactional(readOnly = false, timeout=30)

or, if you are using programmatic transactions, you can set it on the transaction manager

    DataSourceTransactionManager transactionManager = new DataSourceTransactionManager(dataSource);
    transactionManager.setDefaultTimeout(30); // seconds

or you can override the timeout for one particular transaction

    TransactionTemplate transactionTemplate = new TransactionTemplate();
    transactionTemplate.setTimeout(30); // seconds

Python adding pid file

I have a thumbnail generator that launches multiple processes, and the correct way to shut it down is to send kill -HUP to the parent process. To automate this, I had to write a pid file from Python, which was a piece of cake:

    import os

    def writePidFile():
        pid = str(os.getpid())
        # supply the pid file path here
        with open('', 'w') as f:
            f.write(pid)