I am trying to write a program in which an AI can detect whether a conversation is occurring. The AI does not need to transcribe words or extract any meaning from the conversation, only determine whether one is taking place. For this purpose, a conversation can simply be defined as audio containing more than one speaker.
Anyway, while searching for past research on the subject, I came across the field of speaker diarization, where a model is trained to determine who is speaking when, and by extension how many speakers there are. That seemed perfect for my task. However, my implementation ran into a few problems. First of all, it just wasn't accurate. I used this tutorial: https://medium.com/saarthi-ai/who-spoke-when-build-your-own-speaker-diarization-module-from-scratch-e7d725ee279 to write a simple program for the task, but it was unreliable at telling whether a clip had one speaker or two, and the timestamps it gave for the speaker changes were way off.
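For reference, the core of my attempt boils down to something like the sketch below: sliding-window speaker embeddings followed by clustering, with the number of clusters taken as the number of speakers. This is a simplified paraphrase rather than the tutorial's exact code, and the window rate and distance threshold are values I picked by hand (I'm using Resemblyzer for the embeddings and scikit-learn for the clustering):

```python
# Simplified sketch of my pipeline: sliding-window speaker embeddings + clustering.
# The window rate and the distance threshold below are hand-picked guesses.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from sklearn.cluster import AgglomerativeClustering


def estimate_num_speakers(path, distance_threshold=0.6):
    """Return an estimate of how many distinct speakers appear in an audio file."""
    wav = preprocess_wav(path)  # resample to 16 kHz, normalize, trim long silences
    encoder = VoiceEncoder()

    # Continuous d-vector embeddings over sliding ~1.6 s windows, 16 windows/sec.
    _, partial_embeds, _ = encoder.embed_utterance(wav, return_partials=True, rate=16)

    # Cluster the window embeddings; each cluster should correspond to one speaker.
    # With n_clusters=None, the distance threshold decides how many clusters emerge.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,  # cosine distance; needs tuning
        metric="cosine",                        # "affinity" in scikit-learn < 1.2
        linkage="average",
    ).fit(np.asarray(partial_embeds))
    return clustering.n_clusters_


if __name__ == "__main__":
    n = estimate_num_speakers("clip.wav")
    print("conversation detected" if n > 1 else "single speaker", f"({n} speakers)")
```

I went with agglomerative clustering here because it can derive the number of clusters from a distance threshold instead of needing the speaker count up front, which is exactly the unknown I'm trying to find.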
It occurred to me that speaker diarization may not be the best approach for this problem, so I decided to ask here whether it is the right solution, or whether there are better ones out there. If it is the right approach, I would love some insight into why it wasn't working for me. Is the tutorial simply not good enough? I tested on 45-second to 1-minute clips of either just myself speaking or other people speaking with me, and as I said, it did not work well at all.