Match two paragraphs of text

Question

I'm building a friend finder app and I need to match people based on a paragraph of text. Here is an example of what I mean:

Person A: "I love walking and going to the beach, I also love reading and keeping active. I'm very allergic to dogs, so I don't have any pets and have no intention of having any. I used to swim in college as a d1 athlete. My favourite movie is Finding Nemo, I hate horror films..."

Person B: "I'm a dog lover with 3 Labradors. I'm an extremely active person who loves to swim many days out of the week. I love going to the theatre and watching movies in IMAX. My job is a civil engineer, but I like to code games in my spare time..."

Match: 56%

What is the best way to calculate a score that accurately matches the similarity of interests/hobbies etc. between the two individuals?

I have looked at OpenAI embeddings API and storing them in a Pinecone database for retrieval. I have also looked at creating a 'Profile' using some kind of classification model that would give each user a score for various traits such as {active: 0.3, social: 0.9, ...} and then comparing results.

Many thanks for any help.

This sounds like a good project. The question is also reasonable but it does seem a bit like you are asking the community to do your homework for you. Perhaps you could indicate some of the ideas you have to start with. — Bruce Adams, Jul 20 '23 at 11:22
Hi @BruceAdams I'm definitely not trying to get someone to do my homework! This was an extracurricular activity that doesn't even count to my final grade. You are right, I should have added more context to some of the ideas I have explored already. I have looked at OpenAI embeddings API and storing them in a Pinecone database for retrieval. I have also looked at creating a 'Profile' using some kind of classification model that would give each user a score for various traits such as {active: 0.3, social: 0.9, ...} and then comparing results. — Dom, Jul 20 '23 at 11:50
It would be better to edit the question to say that. You are also contradicting yourself by saying "for my MSc project" vs extracurricular. — Bruce Adams, Jul 20 '23 at 12:10
You might also consider trying to match the text with "dating relevant" questions. Look at okcupid or similar for some examples. — Bruce Adams, Jul 20 '23 at 12:22
I have followed your advice @BruceAdams. I don't believe I am contradicting myself as what I am working on will be looked at by my tutor (as part of my MSc), but it's completely optional and I'm doing it for fun + learning! Thanks — Dom, Jul 20 '23 at 12:24

score 2 · Answer 1 · answered Jul 20 '23 at 10:57

There are several ways to do this. The most straightforward would be to encode the two paragraphs as vectors (also called text embeddings) using a pretrained language model.

The idea is that the vectors encode the "semantic content" or underlying "meaning" of each paragraph, and the encoder model is trained so that a similarity measure such as the dot product or cosine similarity between two vectors reflects how close the two texts are in meaning. The sentence-transformers library has several implementations and tutorials for this.
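As a rough illustration of the embedding approach, here is a minimal sketch. The helper names (`cosine_similarity`, `match_score`, `sbert_encoder`) and the choice of the `all-MiniLM-L6-v2` model are my own assumptions for the example, not a recommendation tuned to friend matching. The encoder is passed in as a function so you can swap models without touching the scoring code.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_score(text_a: str, text_b: str, encode: Callable[[str], Vector]) -> float:
    """Encode both paragraphs with the same encoder and compare the two vectors."""
    return cosine_similarity(encode(text_a), encode(text_b))


def sbert_encoder(model_name: str = "all-MiniLM-L6-v2") -> Callable[[str], Vector]:
    """Build an encoder backed by sentence-transformers.

    The model name is an assumption for this sketch; the import is lazy so the
    scoring helpers above work without the library installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda text: model.encode(text).tolist()


# Usage (downloads the model on first run):
#   encode = sbert_encoder()
#   print(match_score(person_a_text, person_b_text, encode))
```

Note that the raw cosine value is not a percentage. If you want to show something like "Match: 56%", you would need to decide how to map the score onto that scale, ideally by checking it against pairs you consider good and bad matches.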

One caveat is that the type of similarity captured by the models may not apply to your specific use case as they may be trained on a different distribution of data (these models are often trained for information retrieval purposes). In this case you may want to fine-tune the model or choose a model that was trained similarly to your target use-case.

Another similar method is to use cross-encoders. Instead of encoding each paragraph separately and then comparing them with a cheap metric, cross-encoders are models that take both paragraphs as input and output a similarity score. This may give you better accuracy (again, depending on how it was trained), but it comes at the cost of speed, since you need an expensive forward pass for every pair of paragraphs rather than one per paragraph.
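To make the cost difference concrete, here is a minimal sketch of ranking candidates with a pairwise scorer. `rank_candidates` and `cross_encoder_scorer` are hypothetical helper names, and `cross-encoder/stsb-roberta-base` is just one publicly available model I am assuming for the example. The scorer is called once per pair, which is exactly where the extra cost comes from.

```python
from typing import Callable, List, Tuple

PairScorer = Callable[[str, str], float]


def rank_candidates(
    profile: str, candidates: List[str], score_pair: PairScorer
) -> List[Tuple[float, str]]:
    """Score `profile` against every candidate (one scorer call per pair), best first."""
    scored = [(score_pair(profile, other), other) for other in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def cross_encoder_scorer(
    model_name: str = "cross-encoder/stsb-roberta-base",
) -> PairScorer:
    """Build a pair scorer backed by a sentence-transformers CrossEncoder.

    The model name is an assumption for this sketch; the import is lazy so
    rank_candidates can be tried with any other scoring function.
    """
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    # predict() takes a list of (text, text) pairs and returns one score per pair.
    return lambda a, b: float(model.predict([(a, b)])[0])


# Usage (downloads the model on first run):
#   scorer = cross_encoder_scorer()
#   for score, text in rank_candidates(my_profile, other_profiles, scorer):
#       print(round(score, 3), text[:60])
```

A common compromise is to shortlist candidates with the cheaper embedding comparison first and only run the cross-encoder on that shortlist.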

I would also consider a more interpretable method of first extracting "properties" from each description, then comparing each of them separately. For example, it'd be reasonable to assume that most personal descriptions have hobbies, favorite movies, etc. You can try extracting spans of text corresponding to each of these properties separately, then use one of the methods described previously (or even just basic token-matching) for comparisons.
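A minimal sketch of the comparison step, using basic token matching (Jaccard overlap) per property. It assumes the extraction has already happened and each profile is a dict from property name to extracted text; the property names and the hand-picked values below are made up from the example profiles in the question, and the function names are hypothetical.

```python
import re
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def compare_profiles(
    props_a: Dict[str, str], props_b: Dict[str, str]
) -> Dict[str, float]:
    """Return one overlap score per property that both profiles contain."""
    shared = props_a.keys() & props_b.keys()
    return {
        name: jaccard(tokens(props_a[name]), tokens(props_b[name]))
        for name in sorted(shared)
    }


# Hand-extracted, illustrative values (not produced by any model).
person_a = {"hobbies": "walking beach reading swimming", "movies": "Finding Nemo"}
person_b = {
    "hobbies": "swimming coding games",
    "movies": "IMAX theatre",
    "pets": "Labradors",
}

print(compare_profiles(person_a, person_b))
```

Per-property scores like these are easy to explain to users ("you both swim"), and you can replace `jaccard` with an embedding comparison for properties where exact word overlap is too strict.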

Hi @Alexander, thanks for taking the time to write this out. There is lots for me to unpack and get started in the right direction. — Dom, Jul 20 '23 at 12:04
