I do text annotations (POS tagging, NER, chunking, synset annotation) using a specific annotation tool for Natural Language Processing. I would like to produce the same annotations with different tools in order to compare their performance.
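To make concrete what I mean by "the same annotations on different tools", here is a minimal sketch (just an illustration, not what I currently use) that runs one sentence through spaCy and NLTK and lines up the POS tags side by side:

```python
# Minimal sketch: the same sentence POS-tagged by spaCy and by NLTK.
# Assumes the en_core_web_sm model and the standard NLTK tagger/tokenizer
# resources have already been downloaded.
import spacy
import nltk

text = "The customer asked about the refund policy."

# spaCy pipeline (Penn Treebank tags via token.tag_)
nlp = spacy.load("en_core_web_sm")
spacy_tags = [(tok.text, tok.tag_) for tok in nlp(text)]

# NLTK pipeline
nltk_tags = nltk.pos_tag(nltk.word_tokenize(text))

# Side-by-side comparison (works here because both tokenizations coincide;
# in general the two token streams would need to be aligned first)
for (tok_s, tag_s), (tok_n, tag_n) in zip(spacy_tags, nltk_tags):
    print(f"{tok_s:<10} spaCy={tag_s:<6} NLTK={tag_n}")
```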
Furthermore, since I found several logical and linguistic errors in the way the algorithm was previously trained, I would like to measure how such anomalies affect the intelligence of the chatbot (that is, its ability to understand customers' questions and answers when the sentences are structured in a certain way) by comparing its results with those produced by other NLP engines. In other words, I would like to collect some benchmarks to get an idea of the level at which the NLP algorithm developed by the company I work for performs.
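The kind of "benchmark" I have in mind would be something like scoring each engine's output against a small gold-annotated set. A rough sketch of that idea (the gold and predicted spans below are made-up placeholders for my own data):

```python
# Rough sketch: entity-level precision/recall/F1 of two hypothetical engines
# against a tiny gold-annotated set. Spans are (start, end, label) tuples.

def prf(gold, predicted):
    """Entity-level precision, recall and F1 over sets of (start, end, label) spans."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 4, "ORG"), (20, 28, "DATE")}
engine_a = {(0, 4, "ORG"), (20, 28, "DATE")}       # placeholder output of one engine
engine_b = {(0, 4, "PERSON"), (20, 28, "DATE")}    # placeholder output of another engine

for name, pred in [("engine A", engine_a), ("engine B", engine_b)]:
    p, r, f = prf(gold, pred)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```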
Are there any tools (open-source annotation tools based on other NLP algorithms, tools for collecting benchmarks, etc.) that might help me perform such a task?