I am using Stanford's Stanza pipeline to perform Named Entity Recognition on news articles.

For every NE span of type PERSON, I try to link the corresponding DBpedia entity (Named Entity Linking). Naturally, a query can return more than one entity (homonyms), and sometimes many entries containing the name, especially when only the last name is used.

Here is an example in Python:

import re
from SPARQLWrapper import SPARQLWrapper, JSON

PERSON_STRING = "Musk"
PERSON_STRING = re.sub(
    r"\s+", "_", PERSON_STRING
)  # the DBpedia query breaks if the name contains a space

QUERY = f"""
SELECT DISTINCT ?uri 
WHERE {{ 
   ?uri a foaf:Person. 
   ?uri ?p ?person_full_name. 
   FILTER(?p IN(dbo:birthName,dbp:birthName ,dbp:fullname,dbp:name)). 
   ?uri rdfs:label ?person_name . 
   ?person_name bif:contains "{PERSON_STRING}" .  
   FILTER(langMatches(lang(?person_full_name), "en")) .
}} 
LIMIT 100
"""

# Specify the DBpedia endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

# Run the query
result = sparql.query().convert()

# Just print the (list of) DBpedia URI(s)
for link in result["results"]["bindings"]:
    print(link["uri"]["value"])

In this example, the code produces the following output:

http://dbpedia.org/resource/Maye_Musk
http://dbpedia.org/resource/Jack_Musk
http://dbpedia.org/resource/El_Ligero
http://dbpedia.org/resource/Elon_Musk
http://dbpedia.org/resource/Justine_Musk
http://dbpedia.org/resource/Kimbal_Musk
http://dbpedia.org/resource/Tosca_Musk

Are there techniques to rank these URIs based on other properties, so that Elon Musk emerges as the most likely "Musk"?

I will of course also try other heuristics when I know the topic of the article (e.g. Politics, USA, or SpaceX) to get the most probable link.
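
For the article-topic heuristic, a minimal sketch of what I have in mind is below. It assumes each candidate URI comes with some descriptive text (for example an abstract fetched separately) and scores candidates by keyword overlap with the article. The function names, the keyword list, and the placeholder descriptions are all hypothetical and are not real DBpedia data.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase a text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_candidates(
    article_keywords: List[str],
    candidate_texts: Dict[str, str],
) -> List[Tuple[str, int]]:
    """Rank candidate URIs by keyword overlap with the article, best first."""
    keywords = tokenize(" ".join(article_keywords))
    scores = [
        (uri, len(keywords & tokenize(text)))
        for uri, text in candidate_texts.items()
    ]
    # sorted() is stable, so tied candidates keep their input order
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder descriptions (assumption): NOT real DBpedia data. In practice
# these would be fetched per candidate, e.g. from an abstract property.
candidates = {
    "http://dbpedia.org/resource/Maye_Musk": "placeholder text about a model",
    "http://dbpedia.org/resource/Elon_Musk": "placeholder text about SpaceX and rockets",
}

# Hypothetical topic keywords for an article known to be about SpaceX
ranking = rank_candidates(["SpaceX", "rockets"], candidates)
print(ranking[0][0])
```

This is only a bag-of-words overlap, so it ignores word frequency and synonyms; it is meant to illustrate the kind of context-based scoring I am considering, not a finished linker.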

Any suggestions are very welcome. Thanks in advance.

PS: I find the "El Ligero" URI particularly amusing ;)
