mahiwaga

I'm not really all that mysterious

Typographical Oddities

While wading through all these old blog posts, I keep running into mangled Unicode characters. I’ve had this problem ever since I started blogging in 2000, but I never knew the source of the error.

Apparently it’s because whatever software I was using to handle my blog posts was misreading UTF-8 as Windows-1252 or ISO-8859-1, so each multi-byte character came out as two or three unrelated single-byte characters.
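To see how the mangling happens, here is a small sketch that reproduces it on purpose. The helper names `mangle` and `unmangle` are made up for illustration, and I'm assuming Windows-1252 was the encoding involved rather than ISO-8859-1.

```ruby
# Simulate the misreading: treat the UTF-8 bytes of a string as if they
# were Windows-1252, then transcode the result back to UTF-8.
def mangle(text)
  text.dup.force_encoding('Windows-1252').encode('UTF-8')
end

# Undo it: turn the characters back into Windows-1252 bytes, then
# reinterpret those bytes as the UTF-8 they originally were.
def unmangle(text)
  text.encode('Windows-1252').force_encoding('UTF-8')
end

mangled = mangle('café • done')
puts mangled            # cafÃ© â€¢ done
puts unmangle(mangled)  # café • done
```

The round trip only works while every byte survives. A few byte values (0x9D, for one) have no character assigned in Windows-1252, so a character whose UTF-8 encoding contains one of them may not come back intact, which is one reason a lookup table of known-bad sequences can be the more practical fix.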

This led me to write a Ruby script that replaces the most common mangled Unicode characters I’ve come across so far. I add more as I work through my old blog posts.

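#!/usr/bin/env ruby
# Replace common mojibake sequences (UTF-8 misread as Windows-1252) in place.
input = File.read(ARGV[0])

# Mangled sequence => intended character.
# The bare 'â€' key must stay after the longer keys that begin with it,
# because the alternatives in the regex are tried in order.
map = { 'â€¢' => '•',
        'â€”' => '—',
        'â€™' => '\'',
        'â€¦' => '…',
        'â‰¥' => '≥',
        'Ã¶' => 'ö',
        'â€œ' => '“',
        'â€' => '”',
        'Ã©' => 'é',
        'Â®' => '®' }
re = Regexp.union(map.keys)
output = input.gsub(re, map)

File.write(ARGV[0], output)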

(This technique was lifted from a Stack Overflow answer on how to replace multiple substrings in a single call to the gsub method.)
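Here is a minimal sketch of that technique on its own: `gsub` takes a regex built by `Regexp.union` and looks up each match in a hash. The helper name `fix_mojibake` and the sample string are made up for illustration. Sorting the keys longest-first is my own addition, not part of the Stack Overflow answer, so that a short key can't shadow a longer one that starts the same way.

```ruby
# Replace every key of `fixes` found in `text` with its value, in one pass.
def fix_mojibake(text, fixes)
  # Alternatives are tried left to right, so put the longest keys first.
  pattern = Regexp.union(fixes.keys.sort_by { |key| -key.length })
  # With a hash as the second argument, gsub looks up each match in it.
  text.gsub(pattern, fixes)
end

fixes = {
  'â€'  => '”',  # deliberately listed first to show the ordering problem
  'â€œ' => '“'
}

puts fix_mojibake('â€œquotedâ€', fixes)  # “quoted”
```

Without the sort, the two-character key would match first and the opening quote would come out as a closing quote followed by a stray œ.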

It so happens that a lot of the mangled Unicode characters I run into used to be smart quotes, and I stumbled upon this phenomenon:
