
I have transaction data and would like to extract the merchant from each transaction description. I am new to this, but I just came across Named Entity Recognition and spaCy. I have hundreds of thousands of different merchants.

Some questions that I have:

  • How much labelling do I need to do given the number of merchants I need to extract?

  • How many different instances of the same merchant do I need to label to get decent results?

nbro
Unicorn07
  • Could you provide an example of a transaction description (with a fake name, of course)? Unless the descriptions are really extensive, I find it hard to believe that spaCy's pretrained model performs badly and requires further training. – Edoardo Guerriero Nov 26 '21 at 09:32

3 Answers


There is no specific number of labels that is "enough". For simple cases you can start with a few hundred examples, but normally you'll want several thousand.

Since you have a large number of classes, your problem might be a harder one. On the other hand, it could be easy if most of your text looks like "This merchant is called XXX".
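To illustrate the easy case: if the descriptions really did follow a fixed template, even a regular expression would recover the merchant span, and its output could be used to pre-label data before training an NER model. This is only a minimal sketch. The template, the sample strings, and the `extract_merchant` helper are made up for illustration, and real transaction descriptions are rarely this regular.

```python
import re

# Hypothetical template: "This merchant is called <NAME>".
# This pattern is an assumption for illustration, not taken from real data.
TEMPLATE = re.compile(r"This merchant is called (?P<merchant>.+?)\s*$")


def extract_merchant(description):
    """Return (start, end, text) of the merchant span, or None if no match."""
    match = TEMPLATE.search(description)
    if match is None:
        return None
    return match.start("merchant"), match.end("merchant"), match.group("merchant")


samples = [
    "This merchant is called ACME COFFEE",
    "POS 1234 ACME COFFEE LONDON GB",  # does not follow the template
]

# Character offsets like these are the usual shape of NER span annotations.
spans = [extract_merchant(s) for s in samples]
for text, span in zip(samples, spans):
    print(repr(text), "->", span)
```

The second sample shows where the rule stops working: once descriptions stop following the template, you are back to needing labelled examples and a trained model.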

polm23

In my experience with NER in spaCy, and as @polm23 rightly mentioned (though disagreeing with this Stack Overflow solution), you need several thousand samples per entity type before the model reliably predicts those entities. Otherwise spaCy just assigns them to its default entity types (mainly 'work-of-art').


It depends on your workflow and the language of the text. The official Prodigy NER flowchart is here: https://github.com/explosion/assets/blob/main/Prodigy/Prodigy_NER_flowchart_v2_0_0_light.pdf . It gives a sense of how much data is enough; you will see figures in it such as 4000, 25%, and 2000.

Vy Do