
I have a dataset of texts, each labelled with an ID number. For each new incoming text, I would like to predict the best-matching ID. I am not sure that multi-class text classification is the right approach, since most IDs have only one text, which would leave me without a test set. Can up-sampling help? Or is there an approach other than classification for this kind of problem?

The data set looks like this:

id1 'text1', id2 'text2', id3 'text3', id3 'text4', id3 'text5', id4 'text6', ..., id200 'text170'

I would appreciate any guidance on finding the best approach for this problem.

Fara
  • Classification won't work if there are too many distinct IDs. – Dee Mar 11 '21 at 02:40
  • And in theory, classification won't work if the data are single sample → multiple IDs; it has to be many samples → single ID. – Dee Mar 11 '21 at 02:43
  • If your texts are simple (sharing the same words, phrases, or subparts), approaches based on edit distance (Levenshtein distance or similar) might work. If they are more complex, such as different words with similar meaning, you can use a pretrained model like BERT to get embeddings and classify based on the distance between embeddings. – SajanGohil Nov 01 '22 at 12:16
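The edit-distance idea from the last comment can be sketched as a nearest-neighbour lookup: compute the Levenshtein distance between the new text and every stored text, then return the ID of the closest one. This is only a minimal sketch, not a tested solution for the asker's data. The names `levenshtein` and `best_match`, the length normalisation, the lower-casing, and the toy complaint texts are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def best_match(new_text, labelled):
    """Return (id, normalised distance) of the stored text closest to new_text.

    Distances are divided by the longer length so that long and short
    texts are comparable (an assumption, not part of the original advice).
    """
    def normalised(a, b):
        longest = max(len(a), len(b)) or 1
        return levenshtein(a, b) / longest

    scored = [
        (normalised(new_text.lower(), text.lower()), text_id)
        for text_id, text in labelled
    ]
    distance, text_id = min(scored)
    return text_id, distance


# Hypothetical toy data in the same (id, text) shape as the question.
labelled = [
    ("id1", "parcel arrived damaged"),
    ("id2", "refund not received"),
    ("id3", "cannot log in to account"),
    ("id3", "login to my account fails"),
]
print(best_match("refund not recieved", labelled))
```

Because this needs no training, it also works for IDs that have a single text. For texts that differ in wording but not in meaning, the same lookup could in principle run on embedding distances instead of edit distances, as the comment suggests.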

1 Answer


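Siamese networks may be useful in your case: they learn a similarity function between pairs of inputs rather than one class per ID, so they can cope with IDs that have only a single example (one-shot learning).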

http://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf

https://en.wikipedia.org/wiki/Siamese_neural_network

https://link.springer.com/protocol/10.1007%2F978-1-0716-0826-5_3
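Siamese networks are trained on pairs of inputs labelled as "same" or "different", so a first step with data like yours would be turning the (ID, text) list into such pairs. The following is a rough sketch of that step only; the encoder and the loss are not shown. The function name `build_pairs`, the `negatives_per_positive` parameter, and the sampling scheme are assumptions for illustration, not something taken from the linked papers.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_pairs(labelled, negatives_per_positive=1, seed=0):
    """Return (text_a, text_b, label) triples: 1 = same ID, 0 = different ID."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for text_id, text in labelled:
        by_id[text_id].append(text)

    pairs = []
    for text_id, texts in by_id.items():
        # Texts belonging to any other ID are candidates for negative pairs.
        others = [t for oid, ts in by_id.items() if oid != text_id for t in ts]
        # Positive pairs exist only for IDs with at least two texts.
        for text_a, text_b in combinations(texts, 2):
            pairs.append((text_a, text_b, 1))
            # Balance each positive pair with randomly drawn negatives.
            k = min(negatives_per_positive, len(others))
            for other_text in rng.sample(others, k):
                pairs.append((text_a, other_text, 0))
    return pairs


# The example data from the question, as (id, text) tuples.
labelled = [
    ("id1", "text1"), ("id2", "text2"), ("id3", "text3"),
    ("id3", "text4"), ("id3", "text5"), ("id4", "text6"),
]
pairs = build_pairs(labelled)
for pair in pairs:
    print(pair)
```

Note that only IDs with several texts (here `id3`) contribute positive pairs, so the amount of usable training signal depends on how many such IDs the dataset contains.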

user31264
  • Thank you very much! This looks really useful. I hope it also works on my text data, since the texts in my dataset are rather messy complaint data. – Fara Feb 08 '21 at 18:55