About two and a half years ago, I built Emoji Haiku, a nifty little webpage that generates haikus out of the official unicode descriptions of emojis.

⏰ 🔨
👵🏽 🏐 🐋
🐫 👻

alarm clock hammer
old woman volleyball whale
two-hump camel ghost

A lot has changed about emoji since the initial version of Emoji Haiku! Here’s a few of the more notable things:

Emoji are political, because technology is political.

We’re finally starting to understand and observe the impact that technology has upon the power relationships in society, and are realising that the act of making technology is therefore fundamentally political. 2016 was probably the year that this concept entered the mainstream, with conversations about Fake News, the influence of Russia upon the 2016 election, and the impact of tech upon gentrification and marginalisation all significantly present in the political discourse.

Changes to emoji, which had historically been considered by many to be “politically neutral”, turned out to catalyse this discussion. 2016 was the year when Apple changed the pistol emoji to a water gun, which somehow turned out to be super controversial:

In the history of running Emojipedia, I have never seen an emoji change so poorly received… The combination of “Apple” and “Gun” in the same headline seems to have proven irresistible.

Since that change was first made, most other vendors have followed suit.

The response to skin-color modifiers in the same year was mostly positive, but still generated a lot of controversy. There’s plenty of great commentary by more astute individuals than me, so here’s a super brief literature review:

Paige Tutt in the Washington Post notes that it adds race to conversations that didn’t need it and shouldn’t have it, and that the changes are superficial. It’s worth noting that:

Andrew McGill in the Atlantic notes that proportionally, white people tend to use color-neutral representations (👩‍🔬) instead of white representations (👩🏻‍🔬), partially because the neutral emoji still look like white people, just with yellow skin, and partially because:

The folks I talked to before writing this story said it felt awkward to use an affirmatively white emoji; at a time when skin-tone modifiers are used to assert racial identity, proclaiming whiteness felt uncomfortably close to displaying “white pride,” with all the baggage of intolerance that carries. At the same time, they said, it feels like co-opting something that doesn’t exactly belong to white people – weren’t skin-tone modifiers designed so people of color would be represented online? 2

Aditya Mukerjee wrote, in Model View Culture, about the fact that he’s unable to write his name in Bengali, his native script, because unicode never got around to including it. He writes:

Perhaps I wouldn’t mind that the emoji world now literally has “colored” people, if it weren’t for the timing. Instead, what could have been a meaningless, empty gesture becomes an outright insult. You can’t write your name in your native language, but at least you can tweet your frustration with an emoji face that’s the same shade of brown as yours!

Fast forward to 2018, where researchers at the University of Edinburgh found:

…the vast majority of skin tone usage matches the color of a user’s profile photo - i.e., tones represent the self, rather than the other. In the few cases where users do use opposite-toned emoji, we find no evidence of negative racial sentiment. Thus, the introduction of skin tones seems to have met the goal of better representing human diversity.

Good news.

Let’s talk about unicode, again

v1 of Emoji Haiku was pretty straightforward to build - I just grabbed an HTML file of emoji names and descriptions3, and worked with that. v2 is significantly more complicated, and parses the plain text unicode spec files.
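To give a feel for what parsing a plain text spec file involves, here is a minimal sketch. It assumes lines in the style of the Unicode `emoji-test.txt` data file (codepoints, a status, then a `#` comment holding the emoji and its name); the post doesn't say which files v2 actually reads, and `parse_emoji_lines` and `SAMPLE` are made-up names for illustration, not the real Emoji Haiku code.

```python
import re

# Sample lines in the style of emoji-test.txt (an assumption: the post
# does not name the spec files that v2 parses).
SAMPLE = """\
# group: People & Body
1F468 1F3FD 200D 2708 FE0F ; fully-qualified # 👨🏽‍✈️ E4.0 man pilot: medium skin tone
1F47B ; fully-qualified # 👻 E0.6 ghost
"""

# codepoints ; status # <emoji> [E-version] name
LINE = re.compile(
    r"^(?P<codepoints>[0-9A-F ]+?)\s*;\s*(?P<status>[\w-]+)\s*#\s*\S+\s+"
    r"(?:E\d+\.\d+\s+)?(?P<name>.+)$"
)


def parse_emoji_lines(text):
    """Return (emoji, name) pairs, skipping comments and blank lines."""
    results = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # comment, group header, or blank line
        # Rebuild the emoji from its hex codepoints rather than trusting
        # the copy in the trailing comment.
        emoji = "".join(chr(int(cp, 16)) for cp in match["codepoints"].split())
        results.append((emoji, match["name"]))
    return results


for emoji, name in parse_emoji_lines(SAMPLE):
    print(emoji, name)
```

Building the emoji from the codepoint column means the output doesn't depend on how the file's comment column survived any copy-pasting along the way.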

The unicode spec is super super complicated. Here’s a few things that I learned along the way:

Unicode and encoding schemes

To make things even more complicated, programmers usually can’t just think about characters - there’s also a bunch of different ways that you can convert those characters into bits and bytes that the computer can read. The method used to convert these characters into bytes is called an encoding scheme, and lots of stuff these days uses one called UTF-8. Encoding schemes are the sort of thing that you don’t notice until something is wrong, but stuff goes wrong frequently enough that you’ll probably know what I’m talking about.

Have you ever mentioned your ‘resumé’ on a website and discovered that it’s been repeated back to you as ‘resumÃ©’? Or copy-pasted some text from Microsoft Word and found that all the apostrophes have been turned to ‘â€™’? These are encoding issues - the programmer either forgot to think about it, or took a guess at what the encoding of your input was, and guessed wrong7.
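This kind of garbling is easy to reproduce. The sketch below assumes the text was written out as UTF-8 and then read back in as Windows-1252 (`cp1252`), which is one common way of guessing wrong; real-world cases may involve other encodings.

```python
original = "resumé"

# Encode correctly as UTF-8: 'é' becomes the two bytes C3 A9.
utf8_bytes = original.encode("utf-8")

# Guess the wrong encoding on the way back in: each byte turns into
# its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # resumÃ©

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # resumé

# Word's curly apostrophe (U+2019) is three bytes in UTF-8, so it
# turns into three junk characters.
print("\u2019".encode("utf-8").decode("cp1252"))  # â€™
```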

Did you ever notice (back when people paid per SMS) that if you typed certain characters, suddenly your Nokia 3310 would tell you you only had 67 characters to work with instead of the standard 160? Turns out you can check this on an iPhone too! Turn on Settings → Messages → Character Count. Start typing an SMS (you need the green bubble instead of the blue one, which is iMessage), add an emoji, and watch the character count change suddenly!

Type 🦄, watch the number of remaining characters drop!

That’s because some characters aren’t representable by GSM-7, the standard SMS encoding, and so your phone switches to UCS-2, a different encoding which can represent more characters, but also uses 2 bytes per character. Under UCS-2, your phone can’t fit as much text into the same amount of data, because each character takes up more space – and so you can’t send as much text per SMS.

Doing the numbers:

To the computer, they’re exactly the same size!
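The arithmetic can be sketched in a few lines. The 140-byte payload and the 6-byte header for multi-part messages are figures from the SMS standard rather than from this post, so treat them as assumptions here.

```python
SMS_PAYLOAD_BYTES = 140  # assumed user-data size of a single SMS

# GSM-7 packs each character into 7 bits; UCS-2 spends 2 bytes on each.
gsm7_chars = SMS_PAYLOAD_BYTES * 8 // 7
ucs2_chars = SMS_PAYLOAD_BYTES // 2
print(gsm7_chars, ucs2_chars)  # 160 70

# When a long text is split across several SMSes, each part gives up
# (by assumption) 6 bytes to a header that lets the phone reassemble them.
CONCAT_HEADER_BYTES = 6
usable_bytes = SMS_PAYLOAD_BYTES - CONCAT_HEADER_BYTES
gsm7_chars_per_part = usable_bytes * 8 // 7
ucs2_chars_per_part = usable_bytes // 2
print(gsm7_chars_per_part, ucs2_chars_per_part)  # 153 67
```

Under these assumptions a single UCS-2 message holds 70 characters, and the 67 figure is what each part of a multi-part UCS-2 message holds.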

Remember how I said earlier that lots of stuff uses UTF-8 encoding? Well, UTF-8 is even quirkier than the encoding schemes used for SMSes, and different characters can take up different numbers of bytes. For example:

This is called a variable-length encoding. Emoji are pretty late to the unicode specification (and presumably used less frequently than, y’know, letters), so they’re all towards the back of the list and require more bytes to represent. For example, our man pilot: medium skin tone 👨🏽‍✈️ from before is made up of the unicode codepoints 1F468 1F3FD 200D 2708 FE0F, which is (4 + 4 + 3 + 3 + 3) bytes = 17 bytes! There are even unicode sequences where fewer bytes are required to write out the official description than the actual character – ‘👩‍⚖️’ is 13 bytes, but the textual description ‘woman judge’ clocks in at only 11. What a world we live in!
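You can check these byte counts yourself. The snippet below is a small sketch; `utf8_len` is just a helper name made up for this example, and the emoji are written as escape sequences so the code survives copy-pasting.

```python
def utf8_len(text):
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


# One character each, needing 1, 2, 3 and 4 bytes respectively:
# a plain letter, an accented letter, the euro sign, and the ghost emoji.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F47B"]:
    print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")

# man pilot: medium skin tone = man + skin tone + ZWJ + airplane + variation selector
pilot = "\U0001F468\U0001F3FD\u200D\u2708\uFE0F"
print([utf8_len(ch) for ch in pilot], utf8_len(pilot))  # [4, 4, 3, 3, 3] 17

# woman judge = woman + ZWJ + scales + variation selector
judge = "\U0001F469\u200D\u2696\uFE0F"
print(utf8_len(judge))  # 13
```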

Lots of software is still really bad at handling unicode.

Given how complex all of this emoji stuff is, it’s super hard to get these details right. In re-building Emoji Haikus, I’ve discovered and read about bugs in lots of places:

Ok wait, so… why a rewrite?

I decided at some point that it’s probably time to incorporate all the new shiny emoji stuff into Emoji Haikus and, after recently interviewing a candidate who was using Serverless to deploy code to AWS Lambda, thought that it could be a good thing to learn about. The initial version was running on AWS Lambda too, but it used a patchwork of hand-rolled stuff and wasn’t really maintainable.

Did this warrant a from-the-ground-up rewrite? … Probably not, but that’s what I did 🤷

So what’s new?

Emoji Haiku 2.0 now allocates:

This means that, when the text says “ASTRONAUT”, you could get any of 👩🏻‍🚀👩🏼‍🚀👩🏽‍🚀👩🏾‍🚀👩🏿‍🚀👨🏻‍🚀👨🏼‍🚀👨🏽‍🚀👨🏾‍🚀👨🏿‍🚀. Super cool!
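Those ten astronauts all follow one recipe: a person, a skin tone modifier, a zero-width joiner, and a rocket. Here is a rough sketch of how the variants can be generated and one picked at random; the names (`astronauts`, `BASES`, `SKIN_TONES`) are invented for this example and this is not the actual Emoji Haiku code.

```python
import random
from itertools import product

ZWJ = "\u200D"           # zero-width joiner, glues the pieces together
ROCKET = "\U0001F680"

BASES = {"woman": "\U0001F469", "man": "\U0001F468"}
SKIN_TONES = {
    "light": "\U0001F3FB",
    "medium-light": "\U0001F3FC",
    "medium": "\U0001F3FD",
    "medium-dark": "\U0001F3FE",
    "dark": "\U0001F3FF",
}


def astronauts():
    """Every gender and skin tone combination of the astronaut emoji."""
    return [
        base + tone + ZWJ + ROCKET
        for base, tone in product(BASES.values(), SKIN_TONES.values())
    ]


variants = astronauts()
print(len(variants))            # 10
print(random.choice(variants))  # a different astronaut each time
```

Fixing a gender or skin tone for everyone then amounts to shrinking `BASES` or `SKIN_TONES` to a single entry before picking.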

By default it does this randomly, but you can also choose specific options on the page and have them apply to everyone. It’s tucked away under ‘Display Options’.

So, go and generate yourself an Emoji Haiku here (they make for super cheap Christmas presents! 🎄🎁🎁) or follow @emojihaikus for one every 6 hours. If you’re a code person, you can examine some of my questionable choices on GitHub.

  1. I think skin color was actually around when I wrote the first version of Emoji Haiku, but it wasn’t as widely supported at the time. ↩︎

  2. I personally find this extremely relatable - we had a similar conversation about the protocol on this at my place of work, and it’s literally only white people still using the yellow emoji - myself included. ↩︎

  3. It was an earlier version of this 42.5 MB weapon of a webpage. ↩︎

  4. The U+FE0F format is called a ‘unicode codepoint’ - basically, in unicode, every letter, every symbol, and a bunch of weird non-printable characters are each represented by a specific number, defined by the specification. ‘Codepoint’ here is a fancy word for ‘specification-allocated number’. I tend to just use the word ‘character’ in this article – ‘codepoint’ is the more-precise, but less commonly used, name. ↩︎

  5. Other great examples include 👨 man + 🍳 cooking = man cook 👨‍🍳 (surprise!), and 👩 woman + 🎤 microphone = 👩‍🎤 female David Bowie ↩︎

  6. In particular, the code I wrote was naïve about this, and so I had to spend a bunch of time later untangling that. In the end, I hacked it in, rather than redesigning what I already had, because the hack doesn’t make things too much worse. My old manager at Google (Hi Michael!) had a ‘three hacks’ rule - don’t bother re-engineering or redesigning anything until you’ve already made three quick-and-dirty fixes to the existing code. Waiting until you get to your ‘third hack’ means that you have a better idea of what the code actually needs to do - if you build it after one hack, you’re going to swing a lot in one direction, and then have to swing back a lot next time you change it as well. On the other hand, after you’ve already hacked a few things in, you’re probably not learning much new every time after, and the cost of working in increasingly hacky code is getting higher and higher. ↩︎

  7. This is almost always unintentional - it’s only a relatively recent development that programming languages have been designed with attention to this. Now, it’s harder to make these mistakes by accident than it used to be – but still very doable when you’re working across multiple operating systems and multiple programming languages. ↩︎

  8. Fun fact - this domino also exists in three other rotations (🁋 🁗 🁽 🂉) and the names for all of them use number formats like ‘03’ and ‘05’, which implies that someone maybe expected one day to add dominos with more than 9 dots on them? To me, the fact that there’s this level of granularity for freakin’ dominos, but not e.g. the aforementioned proper support for Bengali, indicates that we need a broader range of voices working on technical ownership and guidance committees. ↩︎