Well, I’ve made it to two! Thanks for joining me on this little experiment.
There was a fun thread on Twitter last week that started when a well-known web developer pointed out that if you set the
maxlength on a web form input field to one, you can’t type emoji. They suggested it was to do with the number of bytes but this is not quite right.
Having just written a chapter on Unicode for a book for digital classicists, I thought I’d briefly explain on Twitter (and now here).
What we think of as a single glyph could be one or more characters (each represented by a number called a code point) and each code point might end up being one or more code units in a particular encoding (a code unit being a byte in UTF-8 or two bytes in UTF-16).
I constructed the following table to illustrate the difference between a glyph, how it’s represented as character code points and then encoded in either UTF-8 or UTF-16 code units.
Relevant to the
maxlength issue is that emoji (and decomposed characters with diacritics) need more than one code-unit in UTF-16 (although for different reasons). I suspect it’s UTF-16 code-units that browsers are using when determining “length”. Not bytes. But not characters either. And certainly not what we think of as single glyphs. And that’s what leads to the mismatch.
The reason the fair-skinned, blond-haired emoji above that I sometimes represent myself with takes two characters (and eight bytes in both UTF-8 and UTF-16) is that it is a combination of a character indicating a person with blond hair and a character indicating the previous emoji has a Fitzpatrick Type 3 skill colour. This separation allows for richer combinations with fewer characters needed in the coded character set than if every combination were represented individually.
Now the compositionality of emojis can get pretty involved.
That’s 7 code points, 25 UTF-8 bytes, 11 UTF-16 code-units (22 bytes)!
That’s because each person is a separate character with a joining character between each so a single glyph is constructed.
The Welsh flag emoji is also 7 code points although NOT for the same reason. It’s not really compositional. Instead, it’s made up of the following characters:
And no, GBWLS is not a Welsh word, despite trying hard to be. It’s a five-letter region code for Great Britain — Wales.
Sadly my Unicode chapter for digital classicists had zero room to discuss emoji. I do plan a supplementary webpage, though, where I can go wild with more information about this sort of thing.
Last year, after almost twelve years, the engine on my first MINI Cooper S, nicknamed “Max” (Max MINI, get it?) died and it was going to cost more to fix it than the car was worth. Even though I’d already ordered a Tesla, they were making the registration and financing a pain and, in my heart, I think I really just wanted another MINI. And so at the start of this year, my wife and I ordered a new MINI and I canceled my Tesla order.
MINIs are highly configurable and if you don’t pick the right options (e.g. automatic transmission in a popular colour) they have to be built to order in Oxford. And so it took six weeks or so to arrive but last Thursday we picked it up!
Meet Gandalf the Grey…
I absolutely LOVE driving it and have not once regretted cancelling the Tesla.
For my Old English class at Signum University, we have to do a translation project where we do a translation of a text of our own choice with notes. I really wanted to do the letter that King Alfred the Great wrote to Bishop Wærferth (which serves as a preface to his own translation from Latin of Gregory’s Pastoral Care). It’s a wonderful letter from the ninth century about how England could restore itself through more people reading.
The problem is we translated a good part of it in class and we’re not supposed to pick a text for our project that we’ve already translated. But I negotiated a compromise: let me do this text but I’ll do extra analysis. I’ll not only translate it but take a subset of it and do a lemmatisation, a morphosyntactic and syntactic dependency analysis, and translation alignment.
This wasn’t part of my pitch, but my goal is to eventually “publish” it in the Scaife Viewer, maybe as a dedicated mini-site for Old English learners.
I’ve been speculating a lot on Twitter about the Amazon Prime Lord of the Rings map reveals. I went through pretty much all 16 places they’ve shown on the map so far and talked about what their significance might be. Some of their choices have been pretty unexpected and they’ve held off on revealing obvious names that would indicate a late setting close to the time of the LotR books themselves.
For this reason, along with the inclusion of some older names, there has been suggestions it will take place much much earlier, like at the time Sauron first forged the rings. I think it unlikely it would be set just then. More likely, I suspect, is a story told with multiple timelines. There has to (in my opinion) be something that links us to the characters people know from the movies, hence my long-running suggestion to cover the life of Young Aragorn (which could also act as a framing story for the earlier stuff).
But stay tuned for a more thorough analysis on Digital Tolkien of where the place names that are mentioned appear in various texts.
Last week I found out about a workshop in Milan in June that’s part of the LiLa: Linking Latin project, focused on taking a Linked Open Data approach to Latin linguistic information.
Even though most of my own work is on Greek, I’ve had a long involvement in Linked Open Data (going back to implementing the first RDF library for Python in 1999) and its application to language data. Very recently, I’ve been involved with the Ontolex group looking at the modelling of morphological information in the Linguistic Linked Open Data world.
So I thought the workshop would be really useful to attend so I could apply to Greek what was discussed for Latin.
I just found out that Professor Crane, who I now work with extensively on digital classical philology projects after following his work from afar for 25 years, had been asked to speak but can likely not attend. He suggested to the organisers that perhaps I could speak in his place and they agreed!
So I’ll be talking mostly about Scaife and how it can be a platform for viewing Linguistic Linked Open Data (and in particular, all the Latin data other people at the workshop are producing).
Supposedly the top three books on descriptive bibliography:
and a couple of other books that looked interesting on library / information science in general:
I also got Composing Music for Games: The Art, Technology and Business of Video Game Scoring: Chance Thomas which looks particularly wonderful. It’s written by the person who scored Lord of the Rings Online!
That’s it for this one! Let me know if you have any feedback or if there are any topics you’d like to see more of.