
I have some free text (think: blog articles, interview transcripts, chat comments), and would like to explore the text data by analysing the proper nouns it contains.

I know of many ways to simply look the text up against a 'list' of proper nouns. The problem with that approach is that it produces many false positives and false negatives. It also splits one proper noun (e.g. "John Allen") into two ("John" and "Allen"), and it struggles with long or unusual proper nouns (e.g. "the Gulf of Carpentaria", a single proper noun containing the word "of", or long names like "Joost van der Westhuizen"). These longer, non-conformist proper nouns tend to really trip up grep-style proper noun identification.

Does anyone know if any AI available to the public can more accurately identify proper nouns in free text?

stevec

1 Answer


This is a hard problem, unless you have a list of proper nouns you want to recognise. If John Allen is in this list, then you can easily use a longest match to prefer it over John or Allen. The same applies to the other examples you give.
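A minimal sketch of the longest-match idea, assuming you already have a list of known names. The function name `find_known_names` and the simple word tokeniser are illustrative choices, not a standard API:

```python
import re

def find_known_names(text, known_names):
    """Return known names found in text, preferring the longest match at each position."""
    # Crude tokeniser (an assumption): words, allowing inner apostrophes and hyphens.
    tokens = re.findall(r"\w+(?:['-]\w+)*", text)
    # Store every known name as a tuple of its words for fast lookup.
    names = {tuple(name.split()) for name in known_names}
    max_len = max((len(name) for name in names), default=0)

    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest possible span first, then shrink it.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in names:
                match = candidate
                break
        if match:
            found.append(" ".join(match))
            i += len(match)  # skip past the matched name
        else:
            i += 1
    return found

known = ["John", "Allen", "John Allen", "the Gulf of Carpentaria"]
print(find_known_names("John Allen met Allen near the Gulf of Carpentaria.", known))
```

Because the tokeniser drops punctuation, "John, Allen" would also match "John Allen"; a real implementation would probably want to stop matches at commas and sentence boundaries.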

Capitalisation on its own is not very reliable, as words at the beginnings of sentences are also capitalised, and sometimes technical terms or emphasis are expressed By Capitalising Them In Mid-Sentence. It's not what I would do, but you really have to expect anything in free text.

You could go some way by looking for sequences of proper nouns with "of", "de", "van" etc. between them. There is probably a reasonably small list of those connectors.
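As a rough illustration of the connector idea, here is a sketch using a regular expression. The connector list and the names `CONNECTORS` and `find_candidates` are assumptions made for the example; it treats any capitalised word as a possible name part, so it will also pick up sentence-initial words:

```python
import re

# Assumed, deliberately short list of lower-case connectors; extend as needed.
CONNECTORS = ["of", "de", "van", "der", "von", "la", "du"]

CAP = r"[A-Z][\w'-]*"                      # one capitalised word
CONN = r"(?:%s)" % "|".join(CONNECTORS)    # one connector word

# A capitalised word, followed by any number of
# (optional connectors + another capitalised word).
PATTERN = re.compile(rf"\b{CAP}(?:\s+(?:{CONN}\s+)*{CAP})*")

def find_candidates(text):
    """Return candidate proper nouns: capitalised sequences joined by connectors."""
    return PATTERN.findall(text)

text = "Joost van der Westhuizen sailed to the Gulf of Carpentaria with John Allen."
print(find_candidates(text))
```

A connector is only absorbed when a capitalised word follows it, so "King of the hill" yields just "King".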

You can set up a grammar to capture complex names, but by far the most reliable way is to have a list. Unfortunately there is no easy solution for this. I would approach it iteratively, i.e. process a segment of text, tidy up the results, generate a list from them, and repeat with more text. Everything that is in the list you accept; other candidates you vet before adding them to the list. Over time you should see fewer and fewer candidates that are not already in the list.
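The iterative procedure could be sketched roughly as follows. The helper names `triage` and `process_batch`, the crude capitalisation-based candidate finder, and the `approve` callback (which stands in for the manual vetting step) are all hypothetical:

```python
import re

def triage(text, accepted):
    """Split capitalised candidates into already-accepted names and ones to vet."""
    # Crude candidate finder (an assumption): runs of capitalised words.
    candidates = re.findall(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)*", text)
    known = [c for c in candidates if c in accepted]
    to_vet = sorted({c for c in candidates if c not in accepted})
    return known, to_vet

def process_batch(text, accepted, approve):
    """Process one segment of text and grow the accepted list in place."""
    known, to_vet = triage(text, accepted)
    for candidate in to_vet:
        # `approve` stands in for a human decision on each new candidate.
        if approve(candidate):
            accepted.add(candidate)
    return known, to_vet

accepted = {"John Allen"}
batches = ["We saw John Allen and Mary Smith.", "Later, Mary Smith left."]
for batch in batches:
    known, to_vet = process_batch(batch, accepted, approve=lambda c: c not in {"We", "Later"})
    print(known, to_vet)
```

In the second batch "Mary Smith" is already in the list, so only the sentence-initial "Later" is left to vet, which is the kind of noise the manual step is there to filter out.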

You can probably kick-start your list with a gazetteer, i.e. a list of names and places. There are several of those on the web.

In general this procedure is referred to as Named Entity Recognition (NER). There are many ways to solve it, and my suggestion here is a pragmatic approach, which works reasonably well with not too much effort.

Oliver Mason
  • This does not appear to be an AI answer, but more of a hack-ish approach. You're correct that small words like "van" are an indication of proper nouns, as is capitalization, but neither is perfect. AIs, however, are very good at statistics and context. Our biggest problem might be deciding whether "Ford Transit Van" is a proper noun! Two AIs might reach different conclusions, and I couldn't fault either. – MSalters Apr 26 '21 at 14:04
  • You might want to mention that this problem is called [Named-Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition). – Graipher Apr 26 '21 at 16:08
  • @MSalters I would prefer to call it a "pragmatic" approach; there is nothing hack-ish about it. There are several heuristics that one can make use of, but as it's a non-trivial task there is no clear-cut 100% answer. – Oliver Mason Apr 27 '21 at 14:08
  • @Graipher Yes, thank you. Will add that. – Oliver Mason Apr 27 '21 at 14:08