
I am building the training dataset for a named entity recognition (NER) model with two tags, Name and Category, using a pre-trained spaCy model.

Given a document, the model needs to extract the name and category of several items.

However, the name or category can be missing, or can contain a placeholder value that means it is absent. In this case, should I tag these values or ignore them?

For example, given the following document:

The first item's name is: X and category is: N/A
The second item's name is: --- and category is <specifiy-item-category>
The third item's category is: Y

should I also tag N/A as the Category (first line), or --- as the Name and <specifiy-item-category> as the Category (second line)?

Specifically, I am not interested in empty values, and in the end I am going to ignore them anyway. Should the model be aware of these placeholder values or not?

desertnaut
mrang

1 Answer


NER models are usually trained on non-overlapping spans of tokens. That is pretty much the only rule to follow, and even it is not strictly required, although breaking it risks complicating training and hurting performance.

In your specific case I see a couple of possibilities:

  1. If "N/A", "---" and "<specifiy-item-category>" are systematic values used in the data whenever a name or a category is not given, you could implement a preprocessing step that identifies them and excludes them from both training and inference, since you can already detect them with simple rules.

  2. If you don't know in advance all the possible values people will use to mark something as unknown (for both name and category), you can skip labelling them altogether. Image segmentation works in a similar way: you don't have to label the background. You label all areas of interest, and the model learns to treat everything else as background. In your case, the model will learn only real names and categories as entities and will ignore everything else (or label everything else as 'non entity', depending on your implementation).
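Both options come down to leaving placeholder values unlabelled when you build the training data. Here is a minimal sketch of that idea in plain Python, under several assumptions: the `PLACEHOLDERS` set, the `FIELD_PATTERNS` regexes, the `annotate` helper and the `NAME`/`CATEGORY` label names are all hypothetical and only fit the example document from the question. The output uses the character-offset `(start, end, label)` style that spaCy training examples commonly use, but check it against the spaCy version you are on.

```python
import re

# Assumed placeholder values taken from the question; extend for your own data.
PLACEHOLDERS = {"N/A", "---", "<specifiy-item-category>"}

# Hypothetical patterns that only fit the phrasing of the example document.
FIELD_PATTERNS = {
    "NAME": re.compile(r"name is:?\s*(\S+)"),
    "CATEGORY": re.compile(r"category is:?\s*(\S+)"),
}


def annotate(text):
    """Return (text, {"entities": [(start, end, label), ...]}).

    Values listed in PLACEHOLDERS are skipped, so they stay unlabelled.
    """
    entities = []
    for label, pattern in FIELD_PATTERNS.items():
        for match in pattern.finditer(text):
            if match.group(1) in PLACEHOLDERS:
                continue  # leave placeholder values without an entity label
            entities.append((match.start(1), match.end(1), label))
    return text, {"entities": sorted(entities)}


doc = (
    "The first item's name is: X and category is: N/A\n"
    "The second item's name is: --- and category is <specifiy-item-category>\n"
    "The third item's category is: Y"
)
_, annotations = annotate(doc)

# Show which spans ended up labelled.
for start, end, label in annotations["entities"]:
    print(repr(doc[start:end]), label)
```

With this input only `X` and `Y` receive a label. If you cannot list the placeholders in advance (option 2), the same principle applies to manual annotation: just don't mark them.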

desertnaut
Edoardo Guerriero