Distillations/Constellations #3: name matching
Recently, I've been helping a team that is building a name matching algorithm and evaluation mechanism for use in anti-corruption work. My part is thinking through the ethical implications in practice – and, most excitingly, actually helping to build it too.
Name matching algorithms are processes for comparing and matching names that might have slight differences in spelling or format. In this project, we're looking at name matching across language spaces – so, for example, how the name محمد might be written in English as Mohammed, Muhammed, Mohamed, Muhamad... the list goes on.
The project has led me to all sorts of interesting spaces. This journal, Names: A Journal of Onomastics, for example. (What's onomastics, you ask? The study of the history and origin of proper names, especially personal names.)
In an ever-so-slight tangent to the actual task at hand, the range of topics covered in this journal is fascinating, and makes me deeply appreciate the niches that one can get into in academia. This paper examines the 'namework' in Ursula K. Le Guin's novels. This one explores cat naming (yes, cat naming) practices in Saudi Arabia. (Why, you ask? I actually have no idea. A piece of trivia from that paper though: apparently women cat owners tend to assign non-Arabic foreign names to their cats, while men prefer traditional Arabic ones.)
OK, back from the tangent. So, name matching. There are a number of existing algorithms and approaches that attempt to do name matching, and each has its pros and cons depending on the context. Looking into their historical origins reveals a lot. Take phonetic algorithms such as Soundex, which was patented in 1918 and later used to index US federal census records, starting with the 1880 census. It's based on English language pronunciation, and was meant to make it easier to find a particular name even if it's spelled differently - so "surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and are filed together." Because their rules encode how English speakers pronounce letters, phonetic algorithms like Soundex tend not to do well at matching names across cultures or languages.
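To make the SMITH/SMYTH example concrete, here is a minimal Python sketch of the classic American Soundex rules as they are commonly described. It is not the code from the project I'm advising on; the function name `soundex` and the extra sample names are my own choices for illustration.

```python
# Letter groups used by the commonly described American Soundex scheme.
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if the name has no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                  # the first letter is always kept as-is
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W are skipped and do not split a run
        code = CODES.get(ch, "")         # vowels (and Y) have no code
        if code and code != previous:
            result += code
        previous = code                  # a vowel resets the run of repeated codes
    return (result + "000")[:4]          # pad with zeros, cut to four characters


print(soundex("SMITH"), soundex("SMYTH"))   # S530 S530
for spelling in ["Mohammed", "Muhammed", "Mohamed", "Muhamad"]:
    print(spelling, soundex(spelling))      # all four give M530
print(soundex("Qasim"), soundex("Kasim"))   # Q250 K250
```

In this sketch the four English spellings of محمد from earlier happen to share a code, but Qasim and Kasim (two spellings I picked as an illustration) do not, because Soundex always keeps the first letter. And none of this helps at all with a name that is still written in Arabic script.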
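Other approaches look at the 'distance' between two names. The Levenshtein distance counts the minimum number of single-character edits (insertions, deletions or substitutions) required to transform one name into another, while the Jaro-Winkler similarity scores two names by the characters they share and how many of those are out of order, giving extra weight to a matching prefix.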
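As a rough illustration of the edit-distance idea, here is a small Python sketch of the Levenshtein distance using the standard dynamic programming approach. The function name `levenshtein` is my own, and this is a teaching sketch rather than what the project uses (Jaro-Winkler is not covered here).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Mohammed", "Muhammed"))  # 1
print(levenshtein("Mohammed", "Mohamed"))   # 1
print(levenshtein("Mohammed", "Muhamad"))   # 3
```

A matching system would still need to decide how many edits count as 'close enough', and that threshold is a judgment call: set it too loosely and different people get merged, too strictly and genuine variants are missed.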
This is all to say – it's hard to get right. And it really matters which approach or algorithm you use. But beyond the technical details of the project, one question that I've been thinking a lot about is: is it necessarily a good thing to make it easier for names to be matched across cultures and languages?
In the specific use case of the project at hand – i.e. anti-corruption work – I can pretty confidently say, yes, it would be helpful for fighting corruption (i.e. better identifying people engaging in corrupt practices). But there are many, many other occasions where technical systems are used to 'identify' people based on name, for purposes that I would not agree with.
For example: Frontex, the European Union border police, might want to identify a person who has moved from one country to another without authorisation. In their eyes, the reasons for that person moving from country to country don't matter – even if they're trying to find their family or find a job. If the name matching algorithm that they currently use isn't great and means that they take longer to 'identify' that person – I'm okay with that. I fundamentally believe in freedom of movement, and that Frontex should be abolished. And that means that I don't want to contribute to anything that might help them do their job.
But the project I'm advising on is an open source project, meaning that theoretically, anybody could take it and use it.
Ethical licensure
One way that limitations could be placed on this use is via licensing, that is, specifying who is allowed to use the code, and who is not.
I was excited to find licenses like this: the Hippocratic License, which "aims to confront the potential harms and abuses technology can have on fundamental human rights." It draws upon globally agreed upon human rights standards, and offers options to customise the license depending upon the human rights issues that users are most concerned about (eg. fossil fuel divestment, supply chain, or ecocide.)
I love this idea. And I'm even happier to discover that the field of 'ethical licensure' is an emerging space of work, with quite a few different licenses already available from the Ethical Licensure Incubator. Some questions that I currently have include: how would you know if someone is violating these terms? Issues like supply chain rights are difficult to track down – surely finding that out for every particular use of a popular code repository would be prohibitively difficult? And if someone is violating these terms, would these licenses hold up in court? (Could this potentially be another angle for strategic litigation against Frontex or human rights violating agencies?)
I don't have the answers to these questions yet, but I'm looking forward to following along (and who knows, maybe contributing to) the field of ethical licensure.
More fascinating onomastics articles
- Gendering Urban Namescapes: The Gender Politics of Street Names in an Eastern European City, by Mihai S. Rusu
- “A Change of Name during Sickness”: Surveying the Widespread Practice of Renaming in Response to Physical Illness, by Russell Fielding - on "the special relationship between personal names and physical health in a wide variety of world cultures."
- WeChat Usernames: An Exploratory Study of Users’ Selection Practices, by Xing Xu, He Huang, Ting Jiang, Yuanpeng Zou – on "how the interplay of online discourse, acquaintance networks, and Chinese culture contribute to the development of this important onomastic phenomenon."