What’s in a Name?
A conversation with Monocl’s disambiguation expert Jiaji Wu about the challenge of finding the right John Smith. Jiaji Wu is one of our many talented engineers working with data science and disambiguation.
Chances are, when you last read Shakespeare’s Romeo and Juliet you didn’t think about disambiguation. Nor did I, but bear with me through a minute of literary analysis.
“What’s in a name? That which we call a rose
By any other name would smell as sweet.”
- Romeo & Juliet, Act 2 Scene 2
Juliet argues that names are meaningless and don’t capture the essence of a person. On a philosophical level that sounds reasonable, but on a practical level, in science, research, medicine and clinical practice, your name is everything. Your name is associated with your publications, patents, board memberships, talks and social media presence. Your name is how people find and connect with you, and it carries the summary of your career and accomplishments.
Most of us, though, have to share our names with other people. If you have a common last name and a popular first name, you might have to share with quite a few: there are almost 48,000 people called John Smith in the US alone, and Mary Smith has almost 38,500 namesakes.
As a scientific expert you aren’t likely to be confused with a baker in Nashville or a facilities manager in Suffolk, but how about other scientists, physicians or researchers? A quick look in the Monocl database shows 120 experts worldwide with the name John Smith; Mary Smith only has 30 others to contend with.
And now we come back to disambiguation, specifically: how do we know that there are 120 John Smiths and not 119 or 122? How do we know which John Smith published what and how do we make sure that John Smith 13 isn’t credited with a publication authored by John Smith 89?
This is where our disambiguation expert Jiaji comes in. In my conversation with him we talked about the challenges of attributing activities such as publications to the right John Smith, making sure we don’t “split” one John into two or lump two into one, how naming conventions in different countries confound him every day and how academic institutions also make life hard for him.
Here is an overview of what keeps him up at night and how he solves these issues so that the Monocl platform accurately reflects the accomplishments of all the experts included. This post only discusses disambiguation from a publications perspective. Monocl uses a wide range of sources, and this is just one (simple) example.
Scientific publications and their changing naming conventions
Let’s just stick with John and Mary to illustrate that point: John can be John Smith, or John R. Smith (he just acquired the middle name Reginald), or John Reginald Smith, maybe even John Reginald Smith, Jr. or Johnny Smith. Mary Jane Smith has similar permutations, plus a few extra ones should she have married and added a new last name. To make matters worse, sometimes titles are used, resulting in John Smith, M.D. and Mary Smith, Ph.D.
And, Jiaji tells me, these naming conventions have changed over time so older papers might have a different format than newer ones.
But, it’s not just English names that are confusing.
Julia van de Berg and Maria di Martino also create their share of problems because, somehow, one needs to deal with those particles or noble titles. Are they part of the last name or a suffix? As so often, the answer is that it depends, for example on the nationality and on whether the particle is capitalized. No wonder, then, that different publications might treat them differently, adding to the confusion.
Spanish and Portuguese names are challenging because they can consist of multiple last names. A person typically carries both their father’s and their mother’s last name and sometimes compound names are created through marriage. These names might not get used consistently in publications. So Pedro García-Carrión Martínez can be just that or Pedro Martinez or Pedro García-Carrión and probably a couple of other permutations.
And we haven’t even mentioned Asian names, e.g. the challenge of sorting out Korean experts when close to half the country has one of three last names: Kim, Lee or Park.
For our disambiguation expert the challenge is to develop a system that can deal with these ambiguities in a systematic manner. He and his colleagues need to develop software that is robust and flexible enough for him to add naming conventions on a country-by-country basis as he develops them, without the new additions breaking the entire system.
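To make the naming problem concrete, here is a minimal sketch of how name variants might be reduced to a rough grouping key. It is not Monocl’s method: the `TITLES` and `PARTICLES` lists and the `normalize_name` function are hypothetical and far too small for real use, and the key only proposes candidates that might be the same person.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lists for illustration only.
TITLES = {"md", "phd", "jr", "sr"}
PARTICLES = {"van", "de", "der", "den", "di", "da", "del", "la"}


def strip_accents(text):
    """Remove diacritics so that 'Martínez' and 'Martinez' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(raw):
    """Return (first_initial, last_name) as a crude candidate-grouping key."""
    cleaned = strip_accents(raw).lower()
    # Split on commas and whitespace, then drop periods ("M.D." -> "md").
    tokens = [t.replace(".", "") for t in re.split(r"[,\s]+", cleaned)]
    tokens = [t for t in tokens if t and t not in TITLES]
    if len(tokens) < 2:
        raise ValueError(f"cannot split into first and last name: {raw!r}")
    # The last name is the final token plus any particles directly before it.
    start = len(tokens) - 1
    while start - 1 > 0 and tokens[start - 1] in PARTICLES:
        start -= 1
    return tokens[0][0], " ".join(tokens[start:])


print(normalize_name("John Reginald Smith, Jr."))
print(normalize_name("Julia van de Berg"))
```

With these toy rules, all the John Smith variants above collapse to the same key, while Pedro García-Carrión Martínez and Pedro García-Carrión do not. That is exactly the kind of case where a country-specific rule would have to be added.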
Who published what?
With the input sorted, the names now need to be disambiguated. This task is tackled using machine learning algorithms and applying the “classical” process:
- A set of features is determined that allows the algorithm to differentiate between different experts
- The algorithm is trained using a training set of existing data
- A validation set confirms that the algorithm is working properly (or not, in which case more/different features, more training and more validation are needed)
- The validated algorithm is applied to new data
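The four steps above can be sketched with a toy stand-in for the real algorithm. Everything here is an assumption made for illustration: a single made-up similarity score per pair of papers, invented training and validation data, and a one-parameter threshold "model" in place of whatever Monocl actually uses.

```python
def accuracy(examples, threshold):
    """Fraction of (score, same_person) pairs the threshold classifies correctly."""
    correct = sum((score >= threshold) == label for score, label in examples)
    return correct / len(examples)


def fit_threshold(examples):
    """Training step: pick the observed score that best separates the labels."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted({score for score, _ in examples}):
        candidate_accuracy = accuracy(examples, candidate)
        if candidate_accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, candidate_accuracy
    return best_threshold


def predict(score, threshold):
    """Apply step: decide whether two papers belong to the same expert."""
    return score >= threshold


# Invented data: (similarity score, do the two papers share an author?)
training_set = [(0.9, True), (0.7, True), (0.6, True),
                (0.3, False), (0.1, False), (0.0, False)]
validation_set = [(0.8, True), (0.65, True), (0.4, False), (0.2, False)]

threshold = fit_threshold(training_set)               # train
validation_accuracy = accuracy(validation_set, threshold)  # validate
if validation_accuracy >= 0.9:                        # only then apply to new data
    print(predict(0.75, threshold))
```

If the validation accuracy were too low, the loop would go back to the first step: other features, more training data, another round of validation.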
One of the most critical parts in this process is to determine the right “features” that will allow the algorithm to learn how to assign an activity such as a publication or talk at a conference to the correct John Smith. Jiaji is using a combination of five or six different features to make the determination. While the exact number and nature of the features is not something he talks about publicly, he mentions one critical feature: co-authorship.
At a high level, this is straightforward: if two papers written by John Smith both have some of the same co-authors, chances are we are dealing with the same expert. Co-authorship is only one of many things to consider; it is singled out here for simplicity’s sake.
Of course, reality is always more complicated, which is why he uses several additional features to fine-tune expert identification and then applies post-processing steps to fix issues that might have crept in.
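As an illustration of the co-authorship idea, one common way to turn "some of the same co-authors" into a number is the Jaccard overlap of the two co-author lists. This is a sketch under assumptions, not a description of Monocl’s feature; the function name and the sample author lists are made up.

```python
def coauthor_overlap(authors_a, authors_b, target):
    """Jaccard overlap of two papers' co-authors, ignoring the target name itself."""
    coauthors_a = set(authors_a) - {target}
    coauthors_b = set(authors_b) - {target}
    if not coauthors_a or not coauthors_b:
        return 0.0  # single-author papers give no co-authorship evidence
    shared = coauthors_a & coauthors_b
    return len(shared) / len(coauthors_a | coauthors_b)


# Hypothetical author lists for three papers signed by a "John Smith".
paper_1 = ["John Smith", "A. Jones", "B. Lee"]
paper_2 = ["John Smith", "B. Lee", "C. Kim"]
paper_3 = ["John Smith", "D. Park"]

print(coauthor_overlap(paper_1, paper_2, "John Smith"))  # one shared co-author
print(coauthor_overlap(paper_1, paper_3, "John Smith"))  # none shared
```

A higher score suggests, but does not prove, that the two papers belong to the same John Smith, which is why it would be combined with other features.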
Harvard Medical School or Harvard School of Medicine?
Affiliation with an institution is a useful feature to help sort out experts, but institutions have their own set of issues and inconsistent names. Not only are institutions referred to by different names – in addition to Harvard Medical School and Harvard School of Medicine you will also find Graduate School of Medicine at Harvard and probably a few more – but there are also different institutes associated with one university, which might also have changed their names over time.
Sometimes publications list the departments in addition to the medical schools and universities, or refer to institutes by name with or without disclosing the affiliation with a particular school and/or department. The following affiliation is an example of that issue. It contains four distinct pieces of information about the author’s affiliation, all of which might or might not be listed in a specific publication:
Division of Pharmacology and Toxicology, College of Pharmacy, The University of Texas at Austin, Dell Pediatric Research Institute.
In short, it gets messy fairly quickly!
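A minimal sketch of one way to tame such strings: split the affiliation into its parts and map known name variants to a single canonical form. The `ALIASES` table and both function names are hypothetical, and the table only contains the variants mentioned in this post.

```python
import re

# Hypothetical alias table built from the variants named above.
ALIASES = {
    "harvard school of medicine": "harvard medical school",
    "graduate school of medicine at harvard": "harvard medical school",
}


def canonical_institution(name):
    """Lower-case, collapse whitespace, drop a leading 'the', then apply aliases."""
    key = re.sub(r"\s+", " ", name.strip().lower())
    if key.startswith("the "):
        key = key[len("the "):]
    return ALIASES.get(key, key)


def split_affiliation(affiliation):
    """Split a comma-separated affiliation string into canonical parts."""
    parts = [part.strip() for part in affiliation.rstrip(". ").split(",")]
    return [canonical_institution(part) for part in parts if part]


example = ("Division of Pharmacology and Toxicology, College of Pharmacy, "
           "The University of Texas at Austin, Dell Pediatric Research Institute.")
for part in split_affiliation(example):
    print(part)
```

Splitting on commas is itself an assumption that fails for institution names containing commas, which is one more way this gets messy.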
The value of disambiguation
For the users of the Monocl platform, disambiguation, especially when based on co-authorship of publications, has two distinct advantages:
- While Jiaji can’t guarantee that he gets all the John Smiths sorted 100% correctly, the Monocl platform does an extremely good job at sorting out the John Smiths and assigning activities such as publications, talks or clinical trials to the correct expert.
- All that analysis of co-authorship is also reflected in each expert’s profile. There, a user can find a list of people their expert has collaborated with and the number of collaborations the expert has undertaken with each co-author, creating a detailed network map.
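The collaborator list in the second point boils down to counting co-authors once the papers have been assigned to the right expert. A minimal sketch, assuming disambiguation has already happened and each name identifies exactly one person; the function name and sample data are invented:

```python
from collections import Counter


def collaborator_counts(papers, expert):
    """Count how many of the expert's papers each co-author appears on."""
    counts = Counter()
    for authors in papers:
        if expert in authors:
            # set() guards against a name listed twice on the same paper.
            counts.update(a for a in set(authors) if a != expert)
    return counts


# Invented author lists; the last paper does not involve the expert.
papers = [
    ["John Smith", "A. Jones", "B. Lee"],
    ["John Smith", "B. Lee"],
    ["C. Kim", "B. Lee"],
]

for name, shared_papers in collaborator_counts(papers, "John Smith").most_common():
    print(name, shared_papers)
```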
This relationship information can be very useful when engaging an expert. If you know a frequent collaborator of an important expert you are trying to connect with you might be able to get a referral. Maybe just mentioning your existing long-term relationship is enough to get the attention of the important KOL.
Finding out who that very happy customer of yours regularly works with gives you a great list of leads: peer influence is powerful and no marketing asset, however well designed and carefully crafted, beats a recommendation by a collaborator.
A lot’s in a name!
Disambiguation is a continuing process that has the important goal of capturing every medical, clinical and life science expert in their own right and assigning all the relevant activities they have undertaken to them – not more, not less.
For interpreting Shakespeare there are experts, too. Let’s leave any further pondering about the question “What’s in a name?” to them.