I've tried to write a blog post on tag systems for years now. Literally years, I think I first started drafting it out in 2018 or so? The problem is that there's just so much to them, so many different approaches and models and concerns that trying to be comprehensive and rigorous is an exercise in madness.
So screw it. These are my noncomprehensive, poorly-researched thoughts on tag systems, thrown on the newsletter. This is not about implementation of tag systems, just their design.
What is a Tag
A tag is a metadata label associated with content. The tag name is also the id: two tags with the same name are the same tag. Tags appear in almost all large systems:
- Social media #hashtags.
- Wikipedia categories.
- Blog post/CMS labels.
- AWS infrastructure tags.
- ACM digital library index terms.
Tags are primarily used for querying (discovering information based on tags) and grouping (organizing content for processing). They are sometimes also used for business logic, like "all articles tagged covid are free even without a subscription" or "devs can upload to any s3 bucket with the qa tag".
Tags are primarily client-driven. You do not need to make changes to the codebase to add or remove a tag. This is in contrast with the content structure, which requires dev intervention to change.
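To make "the tag name is also the id" concrete, here's a minimal Python sketch of a tag index covering all three uses: querying, grouping, and business logic. The names (`TagIndex`, `query`, `group`, `is_free`) are made up for illustration and it ignores storage entirely.

```python
from collections import defaultdict


class TagIndex:
    """Maps tag names to item ids. The tag name doubles as the tag id."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)

    def tag(self, item, name):
        self._items_by_tag[name].add(item)

    def untag(self, item, name):
        self._items_by_tag[name].discard(item)

    def query(self, name):
        """Querying: every item carrying the tag."""
        return set(self._items_by_tag.get(name, set()))

    def group(self, items):
        """Grouping: bucket the given items by the tags they carry."""
        wanted = set(items)
        groups = {}
        for name, tagged in self._items_by_tag.items():
            if tagged & wanted:
                groups[name] = tagged & wanted
        return groups


index = TagIndex()
index.tag("article-1", "covid")
index.tag("article-2", "covid")
index.tag("article-2", "economy")


def is_free(article):
    # Business logic keyed on a tag: covid articles skip the paywall.
    return article in index.query("covid")
```

Adding or removing a tag here is just data: nobody has to touch the code to start tagging things with something new.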
Relationships between Tags
In the simplest system, all tags are uniquely identified by their name, so horse and horses are separate tags. This is easy to implement and reason about, but it's also really annoying for taggers. Why should I have to tag everything with both horse and horses if they clearly mean the same thing?
The simplest relationship we could add is "tag aliases": if A is aliased to B, then querying A is identical to querying B. While things could be tagged horse or horses, the internal system only "knows" horse, and automatically converts searches for horses into searches for horse.
The best example of tag aliases is the fanfiction repository Archive of Our Own (AO3). Fanfics use a lot of jargon to refer to various character pairings, which makes querying difficult. Teams of volunteers comb through stories and manually add aliases to tags, so that stories tagged snarry show up under the corresponding root tag.
The additional tag structure adds expressiveness. But it also raises use-case questions, like "should users be able to query a specific tag alias?" I.e., can I search specifically for stories tagged with the alias snarry? There's no correct choice here. AO3 went with "no".
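Here's a rough sketch of one way aliases could work, assuming the alias is resolved to its canonical tag when an item is tagged. `AliasedTags` and its methods are hypothetical names, not how AO3 or anyone else actually does it.

```python
class AliasedTags:
    """Tag store where aliases collapse into one canonical tag at write time."""

    def __init__(self):
        self._alias_of = {}   # alias -> tag it points to
        self._items = {}      # canonical tag -> set of item ids

    def resolve(self, name):
        # Follow alias links until we reach a tag that is not an alias.
        while name in self._alias_of:
            name = self._alias_of[name]
        return name

    def add_alias(self, alias, target):
        root = self.resolve(target)
        if root == alias:
            raise ValueError(f"{alias!r} cannot be an alias of itself")
        self._alias_of[alias] = root
        # Move anything already filed under the alias to the canonical tag.
        moved = self._items.pop(alias, set())
        if moved:
            self._items.setdefault(root, set()).update(moved)

    def tag(self, item, name):
        self._items.setdefault(self.resolve(name), set()).add(item)

    def query(self, name):
        return set(self._items.get(self.resolve(name), ()))


tags = AliasedTags()
tags.tag("story-1", "horses")
tags.add_alias("horses", "horse")
tags.tag("story-2", "horse")
```

Because this sketch throws away the alias at tagging time, it bakes in the "no" answer: there is no way to ask for only the things tagged horses. Supporting that query would mean storing the raw name alongside the canonical one.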
In a subtag system, tags are placed into a hierarchy of parent tags and subtags. Anything tagged with the subtag also counts for the parent tag. For example, quantum physics could be a subtag of physics, which could be a subtag of science. A query for science would then include content on quantum physics. A tag with no subtags is a leaf tag. This is arguably the most common kind of tag structure.
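A minimal sketch of the science example, assuming each tag has at most one parent and the hierarchy has no cycles. `TagTree` and the `include_subtags` flag are illustrative names I made up.

```python
class TagTree:
    """Forest of tags: each tag has at most one parent. Assumes no cycles."""

    def __init__(self):
        self.parent = {}   # subtag -> parent tag
        self.items = {}    # tag -> items tagged with it directly

    def add_subtag(self, child, parent):
        self.parent[child] = parent

    def tag(self, item, name):
        self.items.setdefault(name, set()).add(item)

    def ancestors(self, name):
        """Every tag above `name`, nearest first."""
        chain = []
        while name in self.parent:
            name = self.parent[name]
            chain.append(name)
        return chain

    def query(self, name, include_subtags=True):
        result = set(self.items.get(name, ()))
        if include_subtags:
            # Naive transitive query: walk upward from every tagged tag.
            for tag, tagged in self.items.items():
                if name in self.ancestors(tag):
                    result |= tagged
        return result


tree = TagTree()
tree.add_subtag("quantum physics", "physics")
tree.add_subtag("physics", "science")
tree.tag("paper-1", "quantum physics")
tree.tag("paper-2", "science")
```

Querying science returns both papers. Passing `include_subtags=False` returns only the one tagged science directly.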
Implementation considerations: transitive queries are expensive, and you need to prevent cycles in the tag hierarchy (A's parent is B, B's parent is A). They also have major usage considerations:
- Can things be tagged with root tags or just leaf tags?
- Can things be tagged with something and its subtag?
- How do users search for X without its subtags?
- How do set operations work? Do a book on particle physics and a book on elephant biology share the science tag?
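The usage questions don't have mechanical answers, but cycle prevention does. Here's one way to do it, assuming parents are stored in a plain dict of child to parent: before adding an edge, walk up from the proposed parent and refuse if you reach the child. `set_parent` is a hypothetical helper.

```python
def set_parent(parents, child, new_parent):
    """Record child -> new_parent, refusing edges that would create a cycle.

    `parents` maps each subtag to its single parent and is assumed to be
    acyclic before the call.
    """
    node = new_parent
    while node is not None:
        if node == child:
            raise ValueError(f"making {new_parent!r} the parent of {child!r} creates a cycle")
        node = parents.get(node)
    parents[child] = new_parent


parents = {}
set_parent(parents, "A", "B")   # fine: A's parent is B
# set_parent(parents, "B", "A") would raise ValueError
```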
Most subtag systems form a forest (set of disjoint trees), where each tag has at most one parent. Wikipedia categories, on the other hand, can have multiple parents. "American Male Novelists" is a subtag of "American Male Writers" and "American Novelists", both of which are subtags of "American Writers". The tag system forms a directed acyclic graph, so let's call these DAG tags.
DAG tags make preventing cycles harder and add "redundancy" as a structural design question. If A is the child of B, can C be the child of both A and B? Wikipedia sort of has it both ways, with "diffusing" vs "nondiffusing" subtags. Querying also gets complicated. How do we handle the query "all items with tag X but not tag Y" when an item has tag Z, which has as parents both X and Y? Is that even something you should let users query?
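To show why "X but not Y" is ambiguous, here's a small sketch with two possible readings. It assumes parents are stored as a dict from tag to a set of parent tags. The `inherit_exclusion` flag is invented for this example; I'm not claiming any real system exposes it.

```python
def ancestors(parents, tag):
    """All tags reachable by following parent edges upward from `tag`."""
    seen = set()
    stack = list(parents.get(tag, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def effective_tags(parents, direct_tags):
    """The tags an item carries directly plus everything they inherit."""
    result = set(direct_tags)
    for tag in direct_tags:
        result |= ancestors(parents, tag)
    return result


def has_x_not_y(parents, direct_tags, x, y, inherit_exclusion):
    """Does the item match 'tag x but not tag y'?

    inherit_exclusion=True:  y excludes the item even if only inherited.
    inherit_exclusion=False: y excludes the item only if tagged directly.
    """
    inherited = effective_tags(parents, direct_tags)
    excluded = inherited if inherit_exclusion else set(direct_tags)
    return x in inherited and y not in excluded


parents = {"Z": {"X", "Y"}}
item_tags = {"Z"}
strict = has_x_not_y(parents, item_tags, "X", "Y", inherit_exclusion=True)
loose = has_x_not_y(parents, item_tags, "X", "Y", inherit_exclusion=False)
```

Under the strict reading the item is excluded; under the loose one it matches. Both are defensible, which is the problem.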
Notice that the more richness in structure we add, the more ambiguous our queries become. "Every article that shares a tag with article X that's not also a tag on article Y" is unambiguous with simple tags, less so with a tag tree, even less so with a tag DAG. This is the standard "generality vs tractability" problem you see everywhere in CS, from static analysis to constraint solving to aliasing. Constraints liberate and liberties constrain.
So far all the tags have been "simple": the set of tagged items is the set specifically assigned to that tag. This is made more complicated with aliases and parent tags but the core idea of explicit assignment is there. "Smart tags" are instead based on a rule: anything matching that rule counts as having that tag. For example, any recipe that doesn't have any nut ingredients is automatically tagged nut-free.
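A smart tag is basically a named predicate. Here's a minimal sketch of the nut-free example, assuming recipes are dicts with an `ingredients` list and an optional `tags` list. The nut list and every name in the code are made up for illustration.

```python
NUTS = {"almond", "walnut", "pecan", "cashew"}  # illustrative, not exhaustive


def nut_free(recipe):
    """Rule for the nut-free smart tag: no ingredient is a known nut."""
    return not (set(recipe["ingredients"]) & NUTS)


SMART_TAGS = {"nut-free": nut_free}


def tags_for(recipe):
    """Explicitly assigned tags plus every smart tag whose rule matches."""
    explicit = set(recipe.get("tags", ()))
    derived = {name for name, rule in SMART_TAGS.items() if rule(recipe)}
    return explicit | derived


brownies = {"ingredients": ["flour", "cocoa", "walnut"], "tags": ["dessert"]}
pancakes = {"ingredients": ["flour", "egg", "milk"], "tags": ["breakfast"]}
```

Nobody ever assigns nut-free by hand here. The pancakes get it and the brownies don't, purely from the rule.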
Smart tags are very expressive. They also play hell with any other kind of structure and even themselves if you're not careful. One mobile device manager (MDM) I worked with allowed for smart tags like "every iPad assigned to a classroom in Ferndale High School that's not tagged ABC". As a curious and extremely dumb engineer, I added two smart tags:
- A: "Anything tagged B"
- B: "Anything not tagged A"
And then the MDM crashed.
The lesson: don't let users encode logical paradoxes. Also, don't try making logical paradoxes on a production server.
Smart tags are computationally expensive. Any change to any item can potentially require a large set of re-computations. Between this and the logical paradoxes thing they're a niche solution to specific problems. Which is a shame, I really like the idea of them.
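One conservative way to block the A/B paradox is to refuse any smart tag definitions whose rules depend on each other in a loop. The sketch below assumes each rule declares which tags it mentions. `find_dependency_cycle` is a hypothetical helper, and rejecting every cycle is stricter than necessary, but it's easy to reason about.

```python
def find_dependency_cycle(deps):
    """Return a list of tags forming a cycle, or None if there is none.

    `deps` maps each smart tag to the set of tags its rule mentions,
    whether negated ("not tagged A") or not ("tagged B").
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(tag, path):
        state[tag] = IN_PROGRESS
        path.append(tag)
        for dep in deps.get(tag, ()):
            if state.get(dep, UNSEEN) == IN_PROGRESS:
                # We walked back into a tag we're still evaluating.
                return path[path.index(dep):] + [dep]
            if state.get(dep, UNSEEN) == UNSEEN:
                cycle = visit(dep, path)
                if cycle:
                    return cycle
        path.pop()
        state[tag] = DONE
        return None

    for tag in deps:
        if state.get(tag, UNSEEN) == UNSEEN:
            cycle = visit(tag, [])
            if cycle:
                return cycle
    return None


paradox = {"A": {"B"}, "B": {"A"}}         # A: tagged B; B: not tagged A
harmless = {"ferndale-ipads": {"ABC"}}     # depends only on a plain tag
```

Running this check whenever a smart tag is created or edited would have rejected my two tags before the MDM ever tried to evaluate them.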
Hashtags are a form of tagging where the tag is embedded as part of the content of the object. The archetypical example is Twitter: a tweet containing the string #foo is considered to have the foo tag. Tag membership is tested by searching the text of the content. In practice, many systems will opt to cache the tags associated with the content for performance reasons. Hashtags are very popular on social media because you don't need to expose any additional input functionality to the user: they use the same textbox for both content and tags.
Hashtags have two limitations compared to other tagging systems. First, they have to be parsed as part of the content. This means Twitter cannot have the hashtag "#this tag", because the parser interprets the space as the end of the tag. It also means there's no easy way to refer to the tag without including it. I can't say "everybody who uses the #qanon tag is insane" without also tagging my post with it. Second, there is no place for additional metadata. You can't do tag aliases or subtags or anything.
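Both parsing limitations fall straight out of how hashtags get extracted. Here's a toy extractor using Python's `re` module. The pattern is a simplifying assumption; real platforms have much fussier rules about what counts as a hashtag.

```python
import re

# Assumed rule: a hashtag is "#" followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the set of tag names embedded in the text."""
    return set(HASHTAG.findall(text))


spaced = extract_hashtags("trying to use #this tag")
mention = extract_hashtags("everybody who uses the #qanon tag is insane")
```

The space cuts the first tag down to this, and the second post ends up tagged qanon whether the author wanted that or not.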
Instead of a tag being a single label, the tag carries an additional semantic value that's not the tag id. Then you can query either the raw tag or the tag's value. This is used a lot in technical tooling. A sprint planner might tag some things as "priority: 1", in AWS you can tag resources as "env: production", etc.
For the most part I've only seen this done with string or enumerated values. While it's plausible to tag something like "due date: 2023-01-30" so you can query by date range, in practice that's usually lifted to built-in content structure instead of the ad-hoc tagging structure.
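A small sketch of the two kinds of query, "has this tag at all" and "has this tag with this value", assuming each resource's tags are a plain dict of strings. The resource names and helper functions are invented for the example and aren't any real AWS API.

```python
resources = {  # hypothetical inventory
    "web-1": {"env": "production", "team": "storefront"},
    "web-2": {"env": "staging", "team": "storefront"},
    "scratch": {"owner": "alice"},
}


def with_tag(tagged, key):
    """Query the raw tag: everything that has the key, whatever its value."""
    return {name for name, tags in tagged.items() if key in tags}


def with_value(tagged, key, value):
    """Query the tag's value: everything where key is set to exactly value."""
    return {name for name, tags in tagged.items() if tags.get(key) == value}
```

With string values, equality is the only comparison that really makes sense. That's part of why something like a due date tends to get promoted to a real field.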
Other Thoughts on Tags
Who creates the tags?
All tag systems have an audience, who the content itself is for, and a set of taggers, who are doing the tagging. These can be the same people or completely different groups. Some possible breakdowns:
- Individuals tagging for themselves: macOS Finder tagging. Personal photo tagging. Adding tags to bookmarks.
- A group tagging for themselves: AWS tagging. Internal documentation. Bills.
- Public content with private taggers: Archive of Our Own. Wikipedia(ish). Newspapers. Library of Congress.
- Tagging for a distributed resource: Semantic web. Social media. Amazon Customer Stores.
All of these have different design and usage concerns.
The Instagram problem
The last case, where everybody is tagging their own public content, poses two unique concerns. The obvious one is bad actors. In a distributed system everybody is competing with everybody else for attention, so there's an incentive to add unrelated tags.
"Metadata exists in a competitive world. Suppliers compete to sell their goods, cranks compete to convey their crackpot theories (mea culpa), artists compete for audience. … That's why a search for any commonly referenced term at a search-engine like Altavista will often turn up at least one porn link in the first ten results." (Cory Doctorow)
The other problem is what I call "The Instagram Problem". Instagram uses hashtags. Nobody coordinates the tags, so there's no way of knowing which tags people will look at. Here's a picture of a friend's horse:[1]
In order to be sure that other horse girls on Instagram saw the picture, she had to tag it #horsesofinstagram. At some point "horse" stops being a word.
Tags vs Structure
I sometimes see tags used as a replacement for first-class structure. Instead of having an "assigned to" field in a record, the record might be tagged with an "assigned to" property tag. This rankles me but I can't quite pin down why. People do this because it's more expressive and a lot easier to set up. But you lose the advantages of having that structure made explicit.
Library scientists have written a lot about tagging systems. The book Foundations of Library and Information Science is a good intro to this (and other librarian topics). In my experience, they tend to focus on metadata strictly controlled by professionals for the purpose of knowledge categorization, which is not entirely in line with how a lot of software systems use tagging. Even so, if you really want to get deep into the nuances of information classification, library science is the way to go.
[1] It's okay to be a horse girl if you literally own a horse.