NLP Evaluation


Natural Language Processing techniques can be used for a wide range of applications based on several capabilities:

      • Information retrieval: finding relevant material within document (textual) repositories that fulfills a certain condition.
      • Sentiment analysis: understanding the subjective component of the context, identifying underlying sentiments, or categorizing opinions, comments or feedback.
      • Machine translation: translating texts from one language into another.
      • Question answering: given a question, finding responses within a large repository of texts.
      • Information extraction: finding specific facts of interest in a document repository (unstructured or semi-structured data).

Concrete examples of these and other applications were illustrated in this previous post. In practice, these models aim to replicate, or even improve on, how humans process language. In that post, we saw how 2018 was a key year in which NLP-based applications became commercial products in fields as diverse as recruitment, customer-support chatbots and targeted marketing. But what can be done to verify, validate and evaluate their outputs? How can we compare model performance with that of an expert? These automatic, or semi-automatic, tools undoubtedly offer the enormous advantage of processing large amounts of text in a matter of minutes or even seconds. However, to ensure that these techniques are suitable for a concrete application, it is necessary to assess the accuracy, or relevance, of the output they provide.

We should start by reviewing the differences between the terms verification, validation and evaluation as applied to intelligent systems. Although there is no unanimous definition of these terms, the following are the most commonly used:

      • Verification: ensuring that the performance of the model satisfies its requirements.
      • Validation: ensuring that the system satisfies user requirements.
      • Evaluation: assessing the overall performance of the system, its adequacy to its requirements, its behaviour when failures occur (diagnostics) and even a general prediction of the expected performance of the system in actual use.

To understand whether the behaviour of the system matches an expert's outcome for a given input, a verification and validation process, an evaluation process, or both are required. Accordingly, an NLP model needs to be measured against its specific application, and the evaluation technique therefore needs to be tailored to each case, comparing the model's output with the expert's for the same inputs.

We face this challenge in developing the CAMERA project. CAMERA has applied NLP techniques with three objectives:

1) From the whole list of H2020 and FP7 projects, select those with a thematic (i.e. semantic) relation to the CAMERA scope: mobility. This exercise is carried out iteratively.

2) Classify the projects selected in step 1) by thematic similarity. The model is asked to extract topics from the list of projects automatically (i.e. unsupervised) and to give, for every project, a probability distribution over the extracted topics: air transport, urban mobility, travel information systems, etc.

3) Classify the projects selected in step 1) into a number of pre-defined topics. While the labels in the previous exercise were determined automatically by the model, in this exercise the topics are given to the model as input (i.e. semi-supervised). This requires first training the model on labelled texts so that it “learns” each label.

To validate the results of these exercises, we reviewed validation and evaluation techniques to find the most suitable ones. While most previous work applied a somewhat “black-box” approach, we found expert-based evaluation (also referred to as “clerical reviews”) to be the most suitable for the CAMERA models.

These “clerical reviews” were performed during the CAMERA Workshop on the 9th of July to evaluate the performance of the models in steps 1 and 2 above. This evaluation was conducted with the help of experts following two guidelines:

1) Every expert was given a set of projects, described by their title and summary taken from the CORDIS database (ref), which is the same input given to the model, and asked to determine each project's relevance to the mobility field. The sample for this exercise included both projects identified by the model as CAMERA-relevant and projects rejected by the model as not CAMERA-relevant in the last iteration (the NLP algorithm applied in this step converges on a list of projects after a number of iterations).

2) Experts were asked to form groups of 3-4 people and pool the projects they had individually marked as CAMERA-relevant. Each group was then asked to distribute 9 points (dots) among the 9 topics automatically identified by the algorithm, assigning the dots to the topics that the projects relate to, based on the scope of each project.
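One possible way to compare the dot-voting outcome of the second exercise with the model's output is to turn a group's dots into a probability distribution over topics and measure how far it is from the model's topic distribution. The sketch below is only an illustration under several assumptions: the function names are hypothetical, the dots are assumed to be recorded as counts per topic, total variation distance is our choice of measure (not necessarily the one used in CAMERA), and the numbers are made up rather than workshop results. Only three of the nine topics are shown for brevity.

```python
from typing import Dict, Sequence


def dots_to_distribution(dots: Dict[str, int], topics: Sequence[str]) -> Dict[str, float]:
    """Normalise dot counts into a probability distribution over the given topics."""
    total = sum(dots.get(topic, 0) for topic in topics)
    if total == 0:
        raise ValueError("no dots were assigned to the listed topics")
    # Topics that received no dots get probability 0.
    return {topic: dots.get(topic, 0) / total for topic in topics}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0 means identical distributions, 1 means disjoint."""
    topics = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in topics)


# Illustrative values only (not taken from the workshop).
topics = ["air transport", "urban mobility", "travel information systems"]
group_dots = {"air transport": 6, "urban mobility": 3}  # 9 dots in total
model_topics = {
    "air transport": 0.5,
    "urban mobility": 0.3,
    "travel information systems": 0.2,
}

expert_topics = dots_to_distribution(group_dots, topics)
distance = total_variation(expert_topics, model_topics)
print(expert_topics)
print(f"distance between experts and model: {distance:.2f}")
```

A small distance suggests that the experts and the model weight the topics similarly; a large one flags projects or topics worth a closer look.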

The results of these exercises were used not only to validate the results of the NLP exercises, but also to refine and improve them. For instance, if the results of step 1 showed that the algorithm had missed a number of CAMERA-relevant projects, we would reduce the number of iterations performed in order to capture a larger number of projects. If, on the contrary, a number of projects initially labelled as CAMERA-relevant were judged irrelevant by the experts, we would perform an additional iteration.
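As a rough illustration of this feedback loop, the sketch below compares the model's relevance flags with the experts' judgments on the same projects and then suggests in which direction to adjust the number of iterations. It is a minimal sketch under stated assumptions: the function names are hypothetical, the labels are made up, Cohen's kappa is added here as a common chance-corrected agreement measure (the post does not say which measures were used), and the simple "compare the two error counts" rule is our simplification of the reasoning above.

```python
from typing import Dict, Sequence


def review_relevance(model: Sequence[bool], expert: Sequence[bool]) -> Dict[str, float]:
    """Compare the model's relevance flags with expert judgments, project by project."""
    if len(model) != len(expert) or len(model) == 0:
        raise ValueError("expected two non-empty label lists of equal length")

    pairs = list(zip(model, expert))
    tp = sum(1 for m, e in pairs if m and e)          # both say relevant
    fp = sum(1 for m, e in pairs if m and not e)      # model relevant, experts disagree
    fn = sum(1 for m, e in pairs if not m and e)      # relevant project missed by the model
    tn = sum(1 for m, e in pairs if not m and not e)  # both say not relevant
    n = len(pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {"precision": precision, "recall": recall, "kappa": kappa,
            "missed": fn, "false_relevant": fp}


def suggest_iteration_change(missed: int, false_relevant: int) -> int:
    """Hypothetical rule: -1 = run fewer iterations, +1 = run one more, 0 = keep as is."""
    if missed > false_relevant:
        return -1  # too many relevant projects were lost
    if false_relevant > missed:
        return 1   # too many irrelevant projects were kept
    return 0


# Illustrative labels only (True = CAMERA-relevant), not workshop data.
model_flags = [True, True, True, False, False, False]
expert_flags = [True, True, False, True, False, False]

report = review_relevance(model_flags, expert_flags)
print(report)
print(suggest_iteration_change(report["missed"], report["false_relevant"]))
```

In practice the decision would also depend on how costly each kind of error is for the project, so a real rule would likely need thresholds agreed with the experts.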

If you are interested in the expert-based validation results, you can now read them here!