Emoji Haikus, and the deepest secrets of the unicode specification || fabian writes.

About two and a half years ago, I built Emoji Haiku, a nifty little webpage that generates haikus out of the official unicode descriptions of emojis.

⏰ 🔨
👵🏽 🏐 🐋
🐫 👻
alarm clock hammer
old woman volleyball whale
two-hump camel ghost

A lot has changed about emoji since the initial version of Emoji Haiku! Here’s a few of the more notable things:

It’s possible to change the skin color of emoji (👋🏻👋🏼👋🏽👋🏾👋🏿)¹!
We’ve got a bunch of new jobs 👩‍🚀 and people-doing-things ⛷ emoji, which support gendering as “woman” and “man”, and usually default to a gender-neutral representation.
Apple changed the PISTOL 🔫 emoji to a water gun. Microsoft changed theirs to a space blaster, and Twitter and Google followed suit a couple of years later.
People started to become more observant of how emojis sound when read out loud by automated screen reading software, thanks to CarPlay / Android Auto / Bluetooth functionality taking off more (’👍🏻’ = “light skin tone thumbs up”), and Kai on Twitter. Spoiler alert – it’s bad:

A screen reader recording of this mini thread when my name was this:

🏳️‍🌈 Kai 💖 🧡 💛 💚 💙 💜https://t.co/2TsaBxdk9Z https://t.co/aiJ1rJurlM
— Kai (@kai_wanders) July 1, 2018

Emoji are political, because technology is political.

We’re finally starting to understand and observe the impact that technology has upon the power relationships in society, and are realising that the act of making technology is therefore fundamentally political. 2016 was probably the year that this concept entered the mainstream, with conversations about Fake News, the influence of Russia upon the 2016 election, and the impact of tech upon gentrification and marginalisation all significantly present in the political discourse.

Changes to emoji, which had historically been considered by many to be “politically neutral”, turned out to catalyse this discussion. 2016 was the year when Apple changed the pistol emoji to a water gun, which somehow turned out to be super controversial:

In the history of running Emojipedia, I have never seen an emoji change so poorly received… The combination of “Apple” and “Gun” in the same headline seems to have proven irresistible.

Since that change was first made, most other vendors have followed suit.

The response to skin-color modifiers in the same year was mostly positive, but still generated a lot of controversy. There’s plenty of great commentary by more astute individuals than me, so here’s a super brief literature review:

Paige Tutt in the Washington Post notes that it adds race to conversations that didn’t need it and shouldn’t have it, and that the changes are superficial. It’s worth noting that:

this is especially the case because many keyboard implementations retain your skin color preference and you actively have to make an effort to change this once you’ve got it set. I wouldn’t be surprised if many users never figured out how to change it again afterwards, given some of the stories that Joe Clark writes about here.
this is exacerbated in the text to speech representation, where ‘👋🏼’ is rendered as ‘waving hand: medium-light skin tone’. This is especially weird given that the user’s intent in choosing a skin color is often not explicit, given the fact that keyboards save it by default.

Andrew McGill in the Atlantic notes that proportionally, white people tend to use color-neutral representations (👩‍🔬) instead of white representations (👩🏻‍🔬), partially because the neutral emoji still look like white people, just with yellow skin, and partially because:

The folks I talked to before writing this story said it felt awkward to use an affirmatively white emoji; at a time when skin-tone modifiers are used to assert racial identity, proclaiming whiteness felt uncomfortably close to displaying “white pride,” with all the baggage of intolerance that carries. At the same time, they said, it feels like co-opting something that doesn’t exactly belong to white people—weren’t skin-tone modifiers designed so people of color would be represented online? ²

Aditya Mukerjee wrote, in Model View Culture, about the fact that he’s unable to write his name in Bengali, his native script, because unicode never got around to including it. He writes:

Perhaps I wouldn’t mind that the emoji world now literally has “colored” people, if it weren’t for the timing. Instead, what could have been a meaningless, empty gesture becomes an outright insult. You can’t write your name in your native language, but at least you can tweet your frustration with an emoji face that’s the same shade of brown as yours!

Fast forward to 2018, where researchers at the University of Edinburgh found:

…the vast majority of skin tone usage matches the color of a user’s profile photo - i.e., tones represent the self, rather than the other. In the few cases where users do use opposite-toned emoji, we find no evidence of negative racial sentiment. Thus, the introduction of skin tones seems to have met the goal of better representing human diversity.

Good news.

Let’s talk about unicode, again

v1 of Emoji Haiku was pretty straightforward to build - I just grabbed an HTML file of emoji names and descriptions³, and worked with that. v2 is significantly more complicated, and parses the plain text unicode spec files.

The unicode spec is super super complicated. Here’s a few things that I learned along the way:

A vast number of emoji (e.g. ✏️ ‘pencil’) come in both text mode (✏) and in emoji (✏️) mode. Some of them default to text, some of them default to emoji. So whenever you see ✏️ written, what your software sees is ✏ (lo-fi text-version pencil) + U+FE0F⁴ (the invisible “actually, please make that look like an emoji” character).
Skin colors work on some emoji, but not others. For example:
- ⛺️ (tent) obviously doesn’t support skin colors
- 👋 (waving hand) + 🏽 (medium skin tone) = 👋🏽. Seriously! You just stick the two different characters next to each other and they magically combine ✨.
- 🏂 (snowboarder) technically does support skin colors, but you can’t see any skin anyway.
- 🧛🏼‍♀️ Vampires do support skin color, but this is explicitly noted in the specification as ‘unusual default skin tone’, with the implication that your results will probably vary. On Android, their skin is always the same shade of grey, on iOS, they go from light grey to dark grey. Here are vampires in the default + all other skin tones, see how they render on your phone / computer! 🧛‍♂️ 🧛🏻‍♂️ 🧛🏼‍♂️ 🧛🏽‍♂️ 🧛🏾‍♂️ 🧛🏿‍♂️
If you do choose a skin color, it automatically implies that you want the character rendered as an emoji instead of text, and, according to the specification, you Should Not™️ add the magic U+FE0F “please make that look like an emoji” character. There is, however, an explicit acknowledgement in the spec that it’s hard to get this right and that everyone’s done it wrong for years, so… I guess we’re stuck with this now?
A vast number of emoji, which look like single characters, are actually composed of other emoji, joined together. We already saw this a little bit with the waving hand example (👋 + 🏽 = 👋🏽), but did you know that a woman astronaut 👩‍🚀 is actually a woman 👩 + a ZWJ + a rocket 🚀? The ZWJ is another invisible character - it stands for ‘Zero Width Joiner’, and you can think of it like a thin layer of glue between the characters on either side of it. I find this combining mechanism (woman 👩 + rocket 🚀 = astronaut 👩‍🚀) endearing and poetic⁵, but it’s a pain to work with in code because the meaning of a character can change based on its context⁶.
There’s two different ways of constructing gendered emoji, and while it looks like they’re similar, they’re really, really not.
- The aforementioned woman 👩 + rocket 🚀 = woman astronaut 👩‍🚀 approach
- Or, the person running 🏃‍ + male sign ♂ = man running 🏃‍♂️ approach.
Once you consider all of these things together, you can end up with some really complicated sequences. A man pilot: medium skin tone 👨🏽‍✈️ is actually 👨 (man) + 🏽 (medium skin tone) + ZWJ (invisible glue) + ‍✈ (lo-fi airplane) + U+FE0F (please spritz up my lo-fi airplane into a shiny emoji-style one). That is 5 characters, all magically smooshed together. ✨
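If you want to poke at these sequences yourself, here is a minimal Python sketch (not the code Emoji Haiku actually uses) that takes the pilot apart with the standard `unicodedata` module. The constant and function names are my own, and the exact names printed depend on the Unicode version bundled with your Python.

```python
import unicodedata

ZWJ = "\u200D"    # ZERO WIDTH JOINER, the invisible glue
VS16 = "\uFE0F"   # VARIATION SELECTOR-16, "please render this as an emoji"

# man + medium skin tone + ZWJ + lo-fi airplane + VS16
pilot = "\U0001F468" + "\U0001F3FD" + ZWJ + "\u2708" + VS16


def describe(sequence):
    """Return a (codepoint, name) pair for every codepoint in the string."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in sequence
    ]


for codepoint, name in describe(pilot):
    print(codepoint, name)

# The same building blocks cover the other cases in the list above.
pencil_emoji = "\u270F" + VS16                       # text pencil + VS16
waving_hand = "\U0001F44B" + "\U0001F3FD"            # hand + skin tone, no glue
astronaut = "\U0001F469" + ZWJ + "\U0001F680"        # woman + ZWJ + rocket
man_running = "\U0001F3C3" + ZWJ + "\u2642" + VS16   # runner + ZWJ + male sign + VS16

# Python counts codepoints, not what you see on screen.
print(len(pilot))
```

Note that `len()` counts codepoints rather than visible symbols, which is exactly the kind of mismatch that makes these sequences awkward to handle in code.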

Unicode and encoding schemes

To make things even more complicated, programmers usually can’t just think about characters - there’s also a bunch of different ways that you can convert those characters into bits and bytes that the computer can read. The method used to convert these characters into bytes is called an encoding scheme, and lots of stuff these days uses one called UTF-8. Encoding schemes are the sort of thing that you don’t notice until something is wrong, but stuff goes wrong frequently enough that you’ll probably know what I’m talking about.

Have you ever mentioned your ‘resumé’ on a website and discovered that it’s been repeated back to you as ‘resumÃ©’? Or copy-pasted some text from Microsoft Word and found that all the apostrophes have been turned to ‘â��’? These are encoding issues - the programmer either forgot to think about it, or took a guess at what the encoding of your input was, and guessed wrong⁷.
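You can reproduce this kind of damage on purpose. The Python sketch below is an illustration of the wrong-guess scenario, not a diagnosis of any particular website: it encodes text as UTF-8 and then decodes the bytes with a single-byte encoding instead. I am assuming Latin-1 and Windows-1252 as the wrong guesses, since those are common culprits.

```python
# Encode correctly as UTF-8, then decode with the wrong encoding.
resume = "resum\u00e9"  # 'resumé'
garbled_resume = resume.encode("utf-8").decode("latin-1")
print(garbled_resume)  # resumÃ©

apostrophe = "\u2019"  # the curly apostrophe that word processors insert
garbled_apostrophe = apostrophe.encode("utf-8").decode("cp1252")
print(garbled_apostrophe)  # â€™

# If you know which wrong guess was made, the damage is reversible.
repaired = garbled_resume.encode("latin-1").decode("utf-8")
print(repaired == resume)  # True
```

The one ‘é’ becomes two characters because UTF-8 stores it as two bytes, and the single-byte encoding reads each byte as its own character.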

Did you ever notice (back when people paid per SMS) that if you typed certain characters, suddenly your Nokia 3310 would tell you you only had 67 characters to work with instead of the standard 160? Turns out you can check this on an iPhone too! Turn on Settings → Messages → Character Count. Start typing an SMS (you need the green bubble instead of the blue one, which is iMessage), add an emoji, and watch the character count change suddenly!

Type 🦄, watch the number of remaining characters drop!

That’s because some characters aren’t representable by GSM-7, the standard SMS encoding, and so your phone switches to UCS-2, a different encoding which can represent more characters, but also uses 2 bytes per character. Under UCS-2, your phone can’t fit as much text into the same amount of data, because each character takes up more space – and so you can’t send as much text per SMS.

Doing the numbers:

A normal SMS is 160 characters, using GSM-7, which uses 7 bits per character. There’s 8 bits in a byte, so the size of a normal SMS message is 160 characters × 7 bits per character ÷ 8 bits per byte = 140 bytes.
A ✨ Fancy UCS-2-encoded SMS 🦄 is 70 characters, using UCS-2, which is 2 bytes (or 16 bits) per character. The size of a UCS-2-encoded SMS is therefore 70 characters × 2 bytes per character = 140 bytes.

To the computer, they’re exactly the same size!
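The same arithmetic as a few lines of Python, in case you want to check it. The function name is made up for this example, and it only models the payload sizes described above, not the rest of the SMS protocol.

```python
BITS_PER_BYTE = 8


def sms_payload_bytes(characters, bits_per_character):
    """Size in bytes of a message with the given character count and width."""
    return characters * bits_per_character // BITS_PER_BYTE


gsm7_bytes = sms_payload_bytes(160, 7)   # GSM-7: 7 bits per character
ucs2_bytes = sms_payload_bytes(70, 16)   # UCS-2: 16 bits per character

print(gsm7_bytes, ucs2_bytes)  # 140 140
```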

Remember how I said earlier that lots of stuff uses UTF-8 encoding? Well, UTF-8 is even quirkier than the encoding schemes used for SMSes, and different characters can take up different numbers of bytes. For example:

The letter ‘A’ requires 1 byte.
The letter ‘ü’ requires 2 bytes.
The character ‘€’ requires 3 bytes.
The character ‘🁗’ (DOMINO-TILE-HORIZONTAL-05-03⁸) requires 4 bytes.

This is called a variable-length encoding. Emoji are pretty late to the unicode specification (and presumably used less frequently than, y’know, letters), so they’re all towards the back of the list and require more bytes to represent. For example, our man pilot: medium skin tone 👨🏽‍✈️ from before is comprised of the unicode codepoints 1F468 1F3FD 200D 2708 FE0F, which is (4 + 4 + 3 + 3 + 3) bytes = 17 bytes! There are even unicode sequences where fewer bytes are required to write out the official description than the actual character – ‘👩‍⚖️’ is 13 bytes, but the textual description ‘woman judge’ clocks in at only 11. What a world we live in!
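These byte counts are easy to verify by encoding the strings and counting. A small Python sketch follows; the helper name `utf8_len` is mine, and the characters are written as escapes so that the example survives editors that mangle emoji.

```python
def utf8_len(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# 'A', 'ü', '€' and the domino tile: 1, 2, 3 and 4 bytes respectively.
for ch in ["A", "\u00fc", "\u20ac", "\U0001F057"]:
    print(f"U+{ord(ch):04X}", utf8_len(ch))

# man + medium skin tone + ZWJ + airplane + VS16
pilot = "\U0001F468\U0001F3FD\u200D\u2708\uFE0F"
print(utf8_len(pilot))  # 4 + 4 + 3 + 3 + 3 = 17

# woman + ZWJ + scales + VS16, compared with its plain-text description
judge = "\U0001F469\u200D\u2696\uFE0F"
print(utf8_len(judge))  # 4 + 3 + 3 + 3 = 13
print(utf8_len("woman judge") < utf8_len(judge))  # True
```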

Lots of software is still really bad at handling unicode.

Given how complex all of this emoji stuff is, it’s super hard to get these details right. In re-building Emoji Haikus, I’ve discovered and read about bugs in lots of places:

The Microsoft EdgeHTML Renderer (RIP) had issues with displaying a number of Emoji back in September 2017. What I particularly like about this bug is that it was marked as fixed, and then discovered to be not really fixed - it says something about the complexity of the system or of the implementation that it’s hard to tell whether a bug was actually fixed or not.
Chrome on OSX (and probably on other platforms) doesn’t correctly render emoji with ZWJ (‘invisible glue’), unless you set the font to a specific emoji font:
Emoji rendering on Google Chrome and Apple Safari, respectively.
Adding a MEN WITH RABBIT EARS emoji 👯‍♂️ to my Twitter handle broke the code that embeds tweets into my website due to an encoding issue. Wow. I don’t even know who is at fault for this. Is it Twitter? Is it the software library I’m using for putting tweets in my blog? In the end, I just removed the emoji until I’d finished writing this article 😅
It’s been an absolute nuisance editing this article with the text editor I’m using, because it treats each character from those 5-character magic glue sequences as separate characters – even though they’re invisible – which means I have to press → five times just to skip past our pilot buddy 👨🏽‍✈️ from before.

Ok wait, so… why a rewrite?

I decided at some point that it’s probably time to incorporate all the new shiny emoji stuff into Emoji Haikus, and after interviewing a candidate who was using Serverless to deploy code to AWS Lambda recently, thought that it could be a good thing to learn about. The initial version was running on AWS Lambda too, but it used a patchwork of hand-rolled stuff and wasn’t really maintainable.

Did this warrant a from-the-ground-up rewrite? … Probably not, but that’s what I did 🤷

So what’s new?

Emoji Haiku 2.0 now allocates:

Skin color, to any emoji that supports it
Gender, to any emoji that supports it.

This means that, when the text says “ASTRONAUT”, you could get any of 👩🏻‍🚀👩🏼‍🚀👩🏽‍🚀👩🏾‍🚀👩🏿‍🚀👨🏻‍🚀👨🏼‍🚀👨🏽‍🚀👨🏾‍🚀👨🏿‍🚀. Super cool!

By default it does this randomly, but you can also choose specific options on the page and have them apply to everyone. It’s tucked away under ‘Display Options’.
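To give a feel for what that involves, here is a rough Python sketch of building the astronaut variants. It is an illustration under my own naming, not the code that runs the site: the dictionaries and function names are hypothetical, and it only handles the “person + object” style of gendered emoji.

```python
import random

ZWJ = "\u200D"             # ZERO WIDTH JOINER
ROCKET = "\U0001F680"

PEOPLE = {"woman": "\U0001F469", "man": "\U0001F468"}
SKIN_TONES = {
    "light": "\U0001F3FB",
    "medium-light": "\U0001F3FC",
    "medium": "\U0001F3FD",
    "medium-dark": "\U0001F3FE",
    "dark": "\U0001F3FF",
}


def astronaut(gender, tone):
    """person + skin tone modifier + ZWJ + rocket."""
    return PEOPLE[gender] + SKIN_TONES[tone] + ZWJ + ROCKET


def all_astronauts():
    """Every gender and skin tone combination, women first."""
    return [astronaut(g, t) for g in PEOPLE for t in SKIN_TONES]


def random_astronaut(rng=random):
    """Pick one variant at random, like the default behaviour described above."""
    return rng.choice(all_astronauts())


print(" ".join(all_astronauts()))
print(random_astronaut())
```

Passing a fixed `gender` and `tone` to `astronaut()` corresponds to choosing specific options instead of leaving it to chance.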

So, go and generate yourself an Emoji Haiku here (they make for super cheap Christmas presents! 🎄🎁🎁) or follow @emojihaikus for one every 6 hours. If you’re a code person, you can examine some of my questionable choices on Github.

I think skin color was actually around when I wrote the first version of Emoji Haiku, but it wasn’t as widely supported at the time. ↩︎
I personally find this extremely relatable - we had a similar conversation about the protocol on this at my place of work, and it’s literally only white people still using the yellow emoji - myself included. ↩︎
It was an earlier version of this 42.5 MB weapon of a webpage. ↩︎
The U+FE0F format is called a ‘unicode codepoint’ - basically, in unicode, every letter, every symbol, and a bunch of weird non-printable characters is represented by a specific number, defined by the specification. ‘Codepoint’ here is a fancy word for ‘specification-allocated number’. I tend to just use the word ‘character’ in this article – ‘codepoint’ is the more-precise, but less commonly used, name. ↩︎
Other great examples include 👨 man + 🍳 cooking = man cook 👨‍🍳 (surprise!), and 👩 woman + 🎤 microphone = 👩‍🎤 female david bowie. ↩︎
In particular, the code I wrote was naïve about this, and so I had to spend a bunch of time later untangling that. In the end, I hacked it in, rather than redesigning what I already had, because the hack doesn’t make things too much worse. My old manager at Google (Hi Michael!) had a ‘three hacks’ rule - don’t bother re-engineering or redesigning anything until you’ve already made three quick-and-dirty fixes to the existing code. Waiting until you get to your ‘third hack’ means that you have a better idea of what the code actually needs to do - if you build it after one hack, you’re going to swing a lot in one direction, and then have to swing back a lot next time you change it as well. On the other hand, after you’ve already hacked a few things in, you’re probably not learning much new every time after, and the cost of working in increasingly hacky code is getting higher and higher. ↩︎
This is almost always unintentional - it’s only a relatively recent development that programming languages have been designed with attention to this. Now, it’s harder to make these mistakes by accident than it used to be – but still very doable when you’re working across multiple operating systems and multiple programming languages. ↩︎
Fun fact - this domino also exists in three other rotations (🁋 🁗 🁽 🂉) and the names for all of them use number formats like ‘03’ and ‘05’, which implies that someone maybe expected one day to add dominos with more than 9 dots on them? To me, the fact that there’s this level of granularity for freakin’ dominos, but not e.g. the aforementioned proper support for Bengali, indicates that we need a broader range of voices working on technical ownership and guidance committees. ↩︎