Typographical Oddities
While wading through all these old blog posts, I keep running into mangled Unicode characters. I’ve had this problem ever since I started blogging in 2000, but I never knew the source of the error.
Apparently it’s because whatever software I was using to handle my blog posts was misreading UTF-8 as Windows-1252 or ISO-8859-1.
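To see how the mangling happens, here is a minimal sketch that simulates the misreading in Ruby. The `mangle` helper is a hypothetical name of my own, not part of any library, and it assumes the misreading software treated the bytes as Windows-1252 rather than ISO-8859-1.

```ruby
# Hypothetical helper (an assumption for illustration): simulate software
# that reads UTF-8 bytes as if they were Windows-1252.
def mangle(text)
  text.encode("UTF-8")                 # start from UTF-8 bytes (returns a copy)
      .force_encoding("Windows-1252")  # relabel the same bytes as Windows-1252
      .encode("UTF-8")                 # transcode what that reader would "see"
end

apostrophe = "\u2019"  # right single quotation mark (a smart apostrophe)
e_acute    = "\u00E9"  # e with acute accent

# Show the raw UTF-8 bytes, then the mangled form of each character.
puts apostrophe.bytes.map { |b| format("%02X", b) }.join(" ")
puts mangle(apostrophe)
puts mangle(e_acute)
```

Each multi-byte UTF-8 character turns into two or three separate Windows-1252 characters, which is why the mangled text is always longer than the original. Bytes that Windows-1252 leaves unassigned (0x9D, the last byte of a closing smart quote, seems to be one) may raise a conversion error in this sketch instead of producing a character.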
This led me to write a Ruby script that replaces the most common mangled Unicode characters I’ve come across so far. I add more as I get through my old blog posts.
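#!/usr/bin/env ruby
# Replace common mangled sequences (UTF-8 misread as Windows-1252) in a file, in place.
input = File.read(ARGV[0])
# Mangled sequence => intended character.
# 'â€' has to come after the longer keys that start with it,
# because the regex tries the alternatives in order.
map = { 'â€¢' => '•',
        'â€”' => '—',
        'â€™' => '\'',
        'â€¦' => '…',
        'â‰¥' => '≥',
        'Ã¶' => 'ö',
        'â€œ' => '“',
        'â€' => '”',
        'Ã©' => 'é',
        'Â®' => '®' }
# Build one regex that matches any of the keys.
re = Regexp.union(map.keys)
# gsub looks up each match in the hash and substitutes its value.
output = input.gsub(re, map)
File.write(ARGV[0], output)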
(This technique was lifted from a Stack Overflow answer on how to replace multiple substrings in a single call to the gsub method.)
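The gsub trick can be tried on its own, without touching any files. This is only a sketch: the two-entry `fixes` hash and the sample string are made up for illustration, and the mangled sequences are written as `\u` escapes so the snippet does not depend on how the file itself is saved.

```ruby
# Illustrative map only (not the full list from my script).
fixes = {
  "\u00E2\u20AC\u2122" => "'",       # mangled right single quote
  "\u00C3\u00A9"       => "\u00E9"   # mangled e with acute accent
}

# Regexp.union escapes each key and joins them into one alternation.
pattern = Regexp.union(fixes.keys)

# A made-up sample containing both mangled sequences.
sample = "It\u00E2\u20AC\u2122s a caf\u00C3\u00A9"

# With a hash as the second argument, gsub replaces each match
# with the value stored under the matched text.
fixed = sample.gsub(pattern, fixes)
puts fixed
```

Because the whole string is scanned once, a replacement can never be replaced again by a later rule, which is the risk with chaining one gsub call per character.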
It so happens that a lot of the mangled Unicode characters I run into used to be smart quotes, and I stumbled upon this phenomenon:
- Smart Quotes are Killing the Apostrophe • 2013 May 6 • New Republic