• 2016 Aug 18 • mahiwaga

trying to convert atom to something MT can read

I contemplated simply writing an XSL stylesheet to convert Atom to WXR, since WXR is one of the formats that MT can import. Unfortunately, there is no codified spec for WXR, so I have no idea which elements I can safely ignore. And I don’t want to comb through the WXR-to-MT plugin to figure out what MT is actually reading (although I may end up doing this anyway).

WXR looks kind of messy, too. I’m not sure how I can get LibXSLT to write out the CDATA sections. I wonder what will happen if I just write out escaped HTML? Will the importer choke? (I know, I know, read the source.) And I really don’t like how categories are handled. I’m not sure a category ID number should be codified into an XML document that’s supposed to be portable. I suppose it’s good enough for transferring between WordPress setups, and even for MT, but it strikes me as being too implementation-dependent. What if you want to do some neat transforms, like converting categories to tags? There’s no easy way to do it without knowing something about the target, it seems. You have to know how the recipient of the XML document maps their categories or tags to ID numbers.
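On the escaped-HTML question: to a conforming XML parser, a CDATA section and the same text with its markup characters escaped are the same character data, so an importer that really parses the XML should not care which one it gets. (An importer that scrapes the file with regexes might.) Here is a minimal sketch in Python, using a made-up `content` element rather than a real WXR element name:

```python
import xml.etree.ElementTree as ET

# Two hypothetical fragments carrying the same HTML payload:
# one wrapped in a CDATA section, one with the markup escaped.
as_cdata = "<content><![CDATA[<p>Hello &amp; welcome</p>]]></content>"
as_escaped = "<content>&lt;p&gt;Hello &amp;amp; welcome&lt;/p&gt;</content>"

# After parsing, both elements hold the same text.
cdata_text = ET.fromstring(as_cdata).text
escaped_text = ET.fromstring(as_escaped).text
print(cdata_text == escaped_text)
```

This says nothing about what the WXR-to-MT plugin actually does; it only shows that the two spellings are interchangeable for anything that reads the file as XML.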

I wonder if the WXR-to-MT plugin actually reads the WXR as XML, or if it just uses regexes? I guess I’m just going to have to look at the source code.

Another solution would be just to convert Atom to MT’s native blog export format. The advantage of this is that MT’s format is actually spec’ed. Like MT itself, it’s a venerable format that several blog engines support. Apparently it came about before XML took over the world. It’s just a plaintext file, with element names in all caps, followed by a colon, followed by the content, with individual entries rather fragilely demarcated by a row of dashes. Even so, it actually looks a lot less messy than WXR (and I’m not just talking about the angle brackets), and I’m surprised there isn’t an XMLized version of this spec. Although I suppose that’s what Atom is ultimately for. (So when is everyone gonna get with the program?)
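To make the shape of that plaintext format concrete, here is a rough sketch in Python that writes a single entry. The helper name `format_entry` and the sample values are made up for illustration, and only a few metadata fields are shown; the separators are five hyphens after each section and eight hyphens after each entry.

```python
SECTION_BREAK = "-----"    # five hyphens close a section
ENTRY_BREAK = "--------"   # eight hyphens close an entry

def format_entry(metadata, body):
    """Return one entry as text: KEY: value lines, then the BODY section."""
    lines = [f"{key}: {value}" for key, value in metadata.items()]
    lines += [SECTION_BREAK, "BODY:", body, SECTION_BREAK, ENTRY_BREAK]
    return "\n".join(lines) + "\n"

# Hypothetical sample entry.
entry = format_entry(
    {"AUTHOR": "YOUR_NAME", "TITLE": "Hello", "DATE": "08/18/2016 12:00:00"},
    "<p>First post.</p>",
)
print(entry, end="")
```

The fragility is easy to see: a body that happens to contain a line of exactly five hyphens would be read as the end of the section.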

This speaks to the messiness of dealing with RDBMS-based blog engines. (Not that file-system-based blog engines don’t have their own problems.) The most reliable way to migrate your data from one blog engine to another is to reverse-engineer the database schema. For someone who doesn’t know a lot of SQL and doesn’t really want to learn it, this seems extraordinarily painful.

What I wish existed was an XML DB-based blog engine that can be queried by XPath. (Syncato is exactly what I’m looking for, but it looks like it hasn’t been updated in four years, and while it runs, it’s not exactly feature complete.) Unfortunately, there aren’t really any free XML databases out there, much less any webhosts that offer them. One can argue about the relative merits of RDBMS vs. XML DB all day long, but since I understand XPath and XSLT, and don’t know a lick of SQL, you can see why I feel the way I feel. Everyone has their favorite ~~weapons~~ tool.
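For a taste of what I mean by querying posts with XPath, here is a small sketch in Python. The feed is a made-up three-entry Atom document, and Python’s ElementTree only supports a limited subset of XPath, so treat this as an illustration of the idea rather than of any real XML database.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map for the Atom namespace.
ATOM = {"a": "http://www.w3.org/2005/Atom"}

# Hypothetical feed with three entries.
feed = ET.fromstring("""
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>one</title><category term="perl"/></entry>
  <entry><title>two</title><category term="travel"/></entry>
  <entry><title>three</title><category term="perl"/></entry>
</feed>
""")

# Select the categories whose term is "perl", then step up to the parent entry.
perl_entries = feed.findall("a:entry/a:category[@term='perl']/..", ATOM)
titles = [entry.findtext("a:title", namespaces=ATOM) for entry in perl_entries]
print(titles)
```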

Still, considering that the Web pretty much depends on XML these days (specifically XHTML, RSS, Atom, XML-RPC, SOAP, etc., etc.) you would think that it would be a natural fit.

But I guess everyone needs their Holy Grail to quest for.

The last solution for migrating the 30-odd posts I wrote while using Typo and Mephisto is to parse the Atom feed and send it to MT’s XML-RPC server entry by entry. Sounds easy enough, really. Normally, this would probably not be the most attractive way to do it. After glancing at the API, it still looks like you would need to know how the target maps its categories and/or tags (actually, I don’t think categories and tags are even part of the core MetaWeblog API). And I’m not even sure how migrating comments would work (which is luckily not an issue in my particular case).
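For reference, one entry sent this way would be a `metaWeblog.newPost` call. The sketch below, in Python, only builds the XML-RPC request body and decodes it again; it does not contact any server. The blog ID, username, password, and post contents are placeholders, and I have not checked which struct fields MT actually honors beyond `title` and `description`.

```python
import xmlrpc.client

# Placeholder post; the MetaWeblog API carries the body in "description".
post = {"title": "Hello", "description": "<p>First post.</p>"}

# Parameter order: blog ID, username, password, post struct, publish flag.
params = ("1", "USERNAME", "PASSWORD", post, True)

# Serialize the call without sending it anywhere.
request_xml = xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")

# Round-trip the payload to confirm what a server would receive.
decoded_params, method = xmlrpc.client.loads(request_xml)
print(method)
```

To actually send it, you would call the same method on an `xmlrpc.client.ServerProxy` pointed at your own installation’s XML-RPC endpoint, once per entry.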

Now that I think about it, it looks like the best thing to do is just convert from Atom to MT’s import format. It should be trivial to parse the Atom feed and even more trivial to write out the plaintext file.

on the road again

The weather has cooled down wonderfully, but it’s still like an oven inside my apartment. I give up. I’m going to go to my parents’ house in L.A. and bask in air-conditioned glory. Sure, I have to go to work on Sunday, making this a short trip, but whatever.

Time to put more miles on my car. Whee.

perl script for converting atom to MT import format

Ideally, this should probably be a plugin that uses the MT API, but this little bit of kludgery seems to do the trick. Be forewarned: I used several Perl modules that are not part of the core distribution, so you may need to install them first.

#!/usr/bin/perl

use strict;
use warnings;

use File::Basename;
use XML::XPath;
use XML::XPath::XMLParser;
use DateTime;
use DateTime::Format::ISO8601;

my $atomfeed_location = "atom.xml"; # CHANGE THIS TO THE PATH OF THE SOURCE FILE
my $author = "YOUR_NAME";

my $atom_xml = XML::XPath->new(filename=>$atomfeed_location);
my $atomnodeset = $atom_xml->find('/feed/entry');

foreach my $context ($atomnodeset->get_nodelist) {
  print "AUTHOR: $author\n";
  print "TITLE: ", $context->find('./title')->string_value, "\n";
  my $url = $context->find('./link/@href')->string_value;
  print "BASENAME: ", basename($url), "\n";
  print "STATUS: Publish\n";
  print "ALLOW COMMENTS: 1\n";
  print "CONVERT BREAKS: 0\n";
  print "ALLOW PINGS: 1\n";
  my $pub_atom = $context->find('./published')->string_value;
  my $pub_iso8601 = DateTime::Format::ISO8601->parse_datetime($pub_atom);
  my $pub_mtif = $pub_iso8601->mdy('/')  . ' ' . $pub_iso8601->hms;
  print "DATE: ", $pub_mtif, "\n";
  my $cat_list = $context->find('./category');
  my $taglist = '';
  foreach my $cat ($cat_list->get_nodelist) {
      my $tag = $cat->find('./@term')->string_value();
      $taglist = ($taglist eq '' ? $tag : $taglist . ',' . $tag);
  }
  print "TAGS: ", $taglist, "\n";
  print "-----\n"; # five hyphens end the metadata section
  print "BODY:\n", $context->find('./content')->string_value, "\n", "-----\n";
  print "--------\n"; # eight hyphens end the entry
}

Redirect the output of this script to a file, import the file into MT, and you should be all set. As you can see, the script makes no provision for comments or trackbacks. Note also that categories are imported as tags!