
I have a database of books. Each book has a list of categories that describes the genre/topics of the book (I use Python models).

Most of the time, the categories in the list are composed of 1-3 words.

Examples of book category lists:

['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'],
["Children's stories", 'Christian life'],
['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'],
['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', 'Cruelty']

I want to create/use an algorithm that compares two books and finds the similarity between them using NLP/machine learning models.

The categories are not well defined and tend to vary. For example, there can be a category called 'story' and another called 'stories' (the system does not store a fixed set of categories; they are entered through an open text box).

So far I have tried two algorithms:

  • Cosine similarity with WordNet: split each category into a bag of words and check whether each word has a synonym in the other book's category list.
  • Similarity using the NLP model of the spaCy library (Python): its built-in vector distance.

I used the WordNet model from the nltk package and spaCy's model, but both algorithms had problems: when comparing categories that contain 2 or 3 words, the results were not accurate, and each approach had its own specific issues. A rough sketch of what I mean by each approach is below.
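These are simplified sketches, not my exact project code; they assume the nltk WordNet corpus is downloaded and the spaCy `en_core_web_md` model (which has word vectors) is installed:

```python
import spacy
from nltk.corpus import wordnet as wn

# --- Approach 1: bag of words + WordNet synonyms ---
def synonyms(word):
    """All lemma names from every WordNet synset of the word, plus the word itself."""
    return {lemma.name().lower() for syn in wn.synsets(word) for lemma in syn.lemmas()} | {word.lower()}

def categories_match(cat_a, cat_b):
    """True if any word of one category has a synonym among the words of the other."""
    return any(w_b in synonyms(w_a)
               for w_a in cat_a.lower().split()
               for w_b in cat_b.lower().split())

print(categories_match("Children", "Child labor"))            # word-level match
print(categories_match("Conduct of life", "Christian life"))  # matches only because of 'life'

# --- Approach 2: spaCy vector similarity ---
nlp = spacy.load("en_core_web_md")  # model with word vectors
print(nlp("Children's stories").similarity(nlp("Child labor")))
# similarity() averages the token vectors of each phrase, which is where
# 2-3 word categories start giving inaccurate scores.
```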

Which algorithms and models (in Python) can I use to compare the books in a way that handles strings of 2 or 3 words?

By the way, this is the first time I am asking here. If you need more details about the database or what I have done so far, please tell me.

  • One thing is not clear to me. Do you want to compute the similarity of books by only considering these lists of categories? From your description, this seems to be the case. So, what is your actual question? – nbro Mar 06 '22 at 22:39
  • Hey. First of all, thanks for your comment. As I described, the categories are not fixed, since they were entered as open text. I need a tool that compares and finds the similarity between those strings (we can also treat them as short phrases). For example, there can be categories like `child`, `children`, and `children story`. All of them are about children, but the algorithms I mentioned before can't compare strings of 2-3 words well. So I am looking for techniques to find the similarity between / compare those strings. – Eitan Rosati Mar 07 '22 at 23:17

1 Answer


You can use a model that creates rich embeddings, for example Sentence Transformers, and then use cosine similarity from sklearn with a threshold (at least 0.6) to create clusters of semantically close categories/documents.
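A minimal sketch of what I mean, assuming the `sentence-transformers` and `scikit-learn` packages are installed (the model name and the 0.6 threshold are just illustrative choices):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # example pretrained model

book_a = ['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life']
book_b = ["Children's stories", 'Christian life']

# Embed every category of both books.
emb_a = model.encode(book_a)
emb_b = model.encode(book_b)

# Pairwise cosine similarity between all category pairs.
sim = cosine_similarity(emb_a, emb_b)  # shape (len(book_a), len(book_b))

# Treat pairs above the threshold as semantically close, and score the
# books by the average of each category's best match.
threshold = 0.6
print((sim >= threshold).astype(int))        # which category pairs are close
print("book similarity:", sim.max(axis=1).mean())
```

From there you could also run a clustering method (e.g. sklearn's AgglomerativeClustering) on the category embeddings to group near-duplicates like 'story' and 'stories' into a single category before comparing books.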

Saeron X