I am building the training dataset for a named entity recognition model with two tags, Name and Category, and I am using a pre-trained spaCy model.
Given a document, the model needs to extract the name and category of several items.
However, the name or category can be missing, or it can contain a placeholder value that means it is absent. In that case, should I tag these placeholder values as entities or leave them untagged?
For example, given the following document:
The first item's name is: X and category is: N/A
The second item's name is: --- and category is <specifiy-item-category>
The third item's category is: Y
should I also tag N/A as the Category (first line), or --- as the Name and <specifiy-item-category> as the Category (second line)?
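Specifically, I am not interested in empty values, and in the end I am going to ignore them anyway. Should the model learn that these values mean "empty" and leave them untagged, or should it be agnostic to their meaning and tag them like any other value?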
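To make the two options concrete, here is a minimal sketch of how the example document could be annotated either way. It uses the character-offset layout `(text, {"entities": [(start, end, label)]})` that is commonly used for spaCy NER training data. The `annotate` helper and the `PLACEHOLDERS` set are hypothetical names made up for this illustration, and the placeholder list is only an assumption based on the three lines above.

```python
# Assumed set of values that mean "empty" in my documents (illustrative only).
PLACEHOLDERS = {"N/A", "---", "<specifiy-item-category>"}


def annotate(text, spans, skip_placeholders):
    """Build one training example with character offsets.

    spans: list of (value, label) pairs in order of appearance in text.
    skip_placeholders: if True, placeholder values are left untagged.
    """
    entities = []
    cursor = 0
    for value, label in spans:
        start = text.index(value, cursor)  # locate the value after the previous one
        end = start + len(value)
        cursor = end
        if skip_placeholders and value in PLACEHOLDERS:
            continue  # leave the placeholder unlabeled
        entities.append((start, end, label))
    return text, {"entities": entities}


lines = [
    ("The first item's name is: X and category is: N/A",
     [("X", "Name"), ("N/A", "Category")]),
    ("The second item's name is: --- and category is <specifiy-item-category>",
     [("---", "Name"), ("<specifiy-item-category>", "Category")]),
    ("The third item's category is: Y",
     [("Y", "Category")]),
]

# Option 1: the model is agnostic; tag everything and filter placeholders afterwards.
tag_all = [annotate(text, spans, skip_placeholders=False) for text, spans in lines]

# Option 2: the model must learn what "empty" means; placeholders stay untagged.
tag_real_only = [annotate(text, spans, skip_placeholders=True) for text, spans in lines]

for text, annotations in tag_real_only:
    print(text, annotations)
```

With option 1 I would drop any predicted entity whose text is in `PLACEHOLDERS` after prediction. With option 2 the second line contributes no entities at all.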