Porting to WordPress Part 3: Code

In the last post on this topic, I discussed some of the differences between Manila and WordPress, and how understanding those differences teased out some of the requirements for this project.

In this post I’m going to talk about the design and implementation of a ManilaToWXR Tool, some more requirements that were revealed through the process of building it, and a few of the tricky edge cases I had to deal with.

A little history first…

Among the more interesting things I did while I was a developer at UserLand was building a framework we called the Tools Framework, which brought together many different points of extensibility and made it easy for developers to customize the environment.

In Frontier, Radio UserLand, and the OPML Editor, a Tool is a collection of code and data in a database, which extends or overrides some platform- or application-level functionality. It’s sort of analogous to a Plugin in the WordPress universe, but Tools can also do things like run code periodically (or continuously) in the background, or implement entirely new web applications, or even customize Frontier’s native UI.

For example, you could implement a Tool that hooks into the windowTypes framework and File menu callbacks to implement a new document type corresponding to a WordPress post. Commands in the File menu call the WordPress API, and present a native interface for editing your blog—probably in an outline. Radio UserLand did exactly this for Manila sites, and it was fantastic. (More on that later.)

Another example of a Tool is one that implements some new XML-RPC endpoints (RPC handlers in Frontier) to provide a programmatic API for accessing some content in a database on your server.

For my purposes, I’m not doing anything nearly so complicated. The main thing I wanted comes from the Tools > New Tool… menu command. This creates a new database and pre-populates it with a bunch of placeholders for things like its menu, a table for data and preferences, and of course a table where my code will live.

It gives me an easy, standard way to create a database with the right structure, and the hooks into the menu bar that I wanted to make my exporter easy to use.

Code Components

Now some of this may sound pedantic to the developer-types who are reading this, but please bear with me on behalf of our non-nerd cohorts.

Any time you need to write a lot of code, it makes sense to break the work down into small, bite-sized problems. By solving each of those problems one at a time, sometimes in layers, you eventually work your way towards a complete solution.

Each little piece should be simple enough that you can compartmentalize it and separate it from the other pieces. This is called factoring, and it’s good for lots of reasons, including readability, maintainability, debuggability, and reuse. And if you miss something, make a mistake in your design, or discover that some part of your system doesn’t perform well, it’s far easier to rewrite just one or a couple of parts than it is to de-spaghettify a big, monolithic mess.

Components and sub-components should have simple and consistent interfaces so that other code that talks to them can in turn be made simple and consistent. Components should also have minimal or no side-effects, meaning that they don’t change data that some other code depends on. And components should usually perform one or a very small number of tasks in a predictable way, to keep them small, and make them easy to test and debug. If you find yourself writing hundreds of lines of code in one place, you probably need to break the problem down into smaller components.

So with these concepts in mind, I set about coming up with a component-level design for my Tool. I initially came up with four types of components that I would need, and each type of component may have a specific version depending on the type of object it knows about.

Iterators

First, I’m going to need an easy way to iterate across posts, stories, pictures, and other objects. As my code iterates over the objects in my site, the tool will create a fragment of XML for each one that will go into a WXR file on disk.

By separating the iteration from everything else, I can easily change the order in which objects are exported, apply filters for specific object types, or only export objects in a given date or ID range. (It turned out that ranges and filters were useful for debugging later on.)

Manila stores most content in its #discussionGroup in a sub-table named messages. User information is in #membershipGroup, and there’s some other data scattered around too. But the most important content—posts, pages, pictures, and comments—is all in the #discussionGroup.

Initially I’d planned to make multiple passes over the data, with one pass for each type of data I wanted to export. So first export all the posts, next the pages, next pictures, etc. As it turned out however, in both Manila and WordPress, a post, a page, and a picture have more in common than not in terms of how they’re stored and the data that comes along with them. Therefore it actually made more sense to do just one pass, and export all the data at one time.

There was one exception, however: In WordPress, unlike Manila, comments are stored in a separate table from other first-class site content, and they appear in a WXR file as children of an <item> rather than as their own <item> under the <channel> element:

<item>
  <content:encoded><![CDATA[ ... Post contents here ... ]]></content:encoded>
...
  <wp:comment>
    <wp:comment_author><![CDATA[commenter]]></wp:comment_author>
    <wp:comment_author_email>someone@example.com</wp:comment_author_email>
    <wp:comment_author_IP>IP_address</wp:comment_author_IP>
    <wp:comment_author_url>http://blog.example.com/</wp:comment_author_url>
    <wp:comment_content><![CDATA[Hi, I found your blog via a google search. I was interested in your comments about setting this up. Can you help? Thanks!]]></wp:comment_content>
    <wp:comment_date>2004-08-01 14:17:03</wp:comment_date>
    <wp:comment_date_gmt>2004-08-01 21:17:03</wp:comment_date_gmt>
    <wp:comment_id>15</wp:comment_id>
    <wp:comment_parent>0</wp:comment_parent>
    <wp:comment_type></wp:comment_type>
    <wp:comment_user_id>3</wp:comment_user_id>
    <wp:comment_approved>1</wp:comment_approved>
  </wp:comment>
...
</item>

In the end I decided to write two iterators. Each of them would take the address of the site (so they can find other required metadata about a person for instance), and the address of a function to call for each object as it goes along:

wxr.visit.messages – iterates over all of the messages in my site’s #discussionGroup, skipping over deleted items and comments, since they won’t be exported as an <item> in my WXR file.

// UserTalk Source for wxr.visit.messages
on messages (adrsite, visitproc) {
  local (adrmsgs = wxr.site.messages (adrsite), adr);
  for adr in adrmsgs {
    local (id = wxr.post.id (adr));
    if not visitproc^ (id) { // Stop here?
      return (false)}};
  return (true)}

wxr.visit.comments – recurses over responses to a message to generate threaded comment information.

// UserTalk Source for wxr.visit.comments
on comments (adrsite, adr, visitproc) {
  local (commentId);
  for commentId in adr^.responses {
    local (adrComment = wxr.comment.address (adrsite, commentId));
    if adrComment != adr { // skip a response that points back at the message itself
      if not visitproc^ (adrComment) {
        return (false)}}}; //unwind recursion
  return (true)}

It turned out later on that I needed two more iterators—one for categories, and one for “Gems” (non-picture files), but the two above were a great starting point that would give my code easy access to the bulk of the content.
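For readers who don’t speak UserTalk, here is a rough Python sketch of the same visitor-style iteration. It is not the Tool’s actual code: the Message class, the function names, and the min_id/max_id range filter are all hypothetical stand-ins, and the in-memory dict only approximates a site’s messages table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Message:
    """Hypothetical stand-in for one entry in a site's messages table."""
    id: int
    kind: str = "post"            # "post", "page", "picture", or "comment"
    deleted: bool = False
    responses: List[int] = field(default_factory=list)  # IDs of direct replies


def visit_messages(messages: Dict[int, Message],
                   visit: Callable[[int], bool],
                   min_id: Optional[int] = None,
                   max_id: Optional[int] = None) -> bool:
    """Call visit(id) for each exportable message; stop early if it returns False."""
    for msg_id in sorted(messages):
        msg = messages[msg_id]
        if msg.deleted or msg.kind == "comment":
            continue  # comments are exported under their parent, not as items
        if min_id is not None and msg_id < min_id:
            continue
        if max_id is not None and msg_id > max_id:
            continue
        if not visit(msg_id):
            return False
    return True


def visit_comments(messages, parent_id, visit, depth=0) -> bool:
    """Walk the replies to parent_id depth-first, so threading is preserved."""
    for comment_id in messages[parent_id].responses:
        if comment_id == parent_id or comment_id not in messages:
            continue
        if not visit(comment_id, depth):
            return False
        if not visit_comments(messages, comment_id, visit, depth + 1):
            return False  # unwind the recursion
    return True


site = {
    1: Message(1, responses=[3]),
    2: Message(2, deleted=True),
    3: Message(3, kind="comment", responses=[4]),
    4: Message(4, kind="comment"),
    5: Message(5, kind="page"),
}

exported = []
visit_messages(site, lambda i: exported.append(i) or True)

thread = []
visit_comments(site, 1, lambda i, depth: thread.append((i, depth)) or True)
```

Because the iteration knows nothing about XML, narrowing an export to a range of IDs for debugging only means passing min_id or max_id. The callback never has to change.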

Data Extractors

Next I needed some data extractors. These are type-specific components that pull the data for a post, picture, comment, etc. out of the database, and normalize it to a native data structure that can then easily be output to XML for my WXR file.

The most important data extractor is wxr.post.data, which takes the address of a message in my site’s #discussionGroup that contains a blog post, and returns a table (struct) with all of the data elements that will go into an <item> in the exported WXR file.

Because the WordPress importer expects the comments as <wp:comment> sub-elements of <item>, the post data extractor also calls into another data extractor that generates normalized data representing a comment.

For other types of objects I’ll need code that extracts data for that type as well. So I’ll need code to extract data for a picture, code to extract data for a page (story), and code to extract data for a gem (file).

Here’s part of the code that grabs the data for a comment:

// UserTalk Source for wxr.comment.data
on data (adrsite, id) { //return a table of data for a comment
  local (t); new (tableType, @t); //<wp:comment>
  local (adr = wxr.comment.address (adrsite, id));
  
  on add (n, s) {
    t.["wp:" + n] = s}; //all comment data is in the wp: namespace

  add ("comment_id", id);
  add ("comment_author", wxr.string.cdata (wxr.member.name (adrsite, adr^.member)));
  add ("comment_author_email", adr^.member);
  add ("comment_content", wxr.string.cdata (wxr.string.processMacros (adrsite, adr^.body)));
...
  bundle { //<wp:comment_approved>
    local (flApproved = 1);
    if defined (adr^.flDeleted) and adr^.flDeleted { //deleted comments are exported as unapproved
      flApproved = 0};
    add ("comment_approved", flApproved)};
  add ("comment_parent", wxr.comment.parent (adrsite, id));
...
  
  return (t)} //</wp:comment>

There are a few interesting things to point out here:

I chose to capture comment content even if it’s not approved. Better to keep the content than lose it, just in case I decide to approve it later.
The call to wxr.comment.parent gets the ID of the comment’s parent. This preserves the threaded nature of the conversation, even if I decide not to have threaded comments in my WordPress site later on. It turns out that supporting both threaded and unthreaded comments was the source of some pain that I’ll explain in a future post.
The call to wxr.string.processMacros is especially important. This call emulates what Manila, mainResponder, and the Frontier website framework do when a page is rendered to HTML. Without this capability, Frontier macro source code would leak through into my WordPress site, and possibly many internal links from #glossary items would be broken. Getting this working was another source of pain that took a while to work through. Again, more in a future post.
All sub-items in the table that gets returned have names that start with “wp:”, which I’ll explain below…
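The shape of that extractor, a function that returns one flat table whose keys all carry the wp: prefix, can be sketched in Python roughly as follows. The comment_data function and its parameters are hypothetical. The record keys "member", "body", and "flDeleted" simply mirror the UserTalk excerpt, and macro processing and CDATA wrapping are left out to keep the sketch short.

```python
def comment_data(comment_id, record, parent_id=0, member_names=None):
    """Return a flat dict of WXR comment fields, all in the wp: namespace.

    `record` is a hypothetical dict standing in for a Manila message; the
    keys "member", "body", and "flDeleted" mirror the UserTalk excerpt.
    """
    member_names = member_names or {}
    data = {}

    def add(name, value):
        data["wp:" + name] = value  # every comment field gets the wp: prefix

    email = record["member"]
    add("comment_id", comment_id)
    add("comment_author", member_names.get(email, email))
    add("comment_author_email", email)
    add("comment_content", record["body"])
    # Deleted comments are still exported, just marked as unapproved.
    add("comment_approved", 0 if record.get("flDeleted") else 1)
    add("comment_parent", parent_id)
    return data


example = comment_data(
    15,
    {"member": "someone@example.com", "body": "Can you help?", "flDeleted": True},
    parent_id=0,
    member_names={"someone@example.com": "commenter"},
)
```

Here the deleted comment keeps its content and comes out with wp:comment_approved set to 0, which is the "keep it, just in case" choice described above.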

Encoders

Once I had some structured data, I was going to need to use it to encode some XML. It turns out that this component could be done in a very generic way that would work with any of my data extractors.

Frontier actually does have somewhat comprehensive XML capabilities. But the way it’s implemented requires very verbose code that I really didn’t want to write. I had done quite enough of that in a past life. 😉

So I decided to write a much simpler one-way XML-izer that I could easily integrate with my data extractors.

The solution I came up with was to recurse over the data structure that an extractor passed to it, and generate an XML tree whose element names match the sub-items’ names, and whose element content is the content of each sub-item.

There were three features I needed to add in order to make this work well:

Namespaces: Many elements in a WXR file are in a non-default namespace, either wp: for the WordPress-specific data or dc: for the Dublin Core extension. This was easy to deal with by naming sub-items with the namespace prefix. For example, an element named parent in the wp: namespace would simply be called wp:parent when returned by the data extractor.

Multiple elements: Often I needed to create multiple elements at a given level in the XML file that all have the same name. <wp:comment> is a good example. The solution I came up with here is similar to the one Frontier implements in its native XML verbs.

A compiled XML table in Frontier has sub-items representing elements, whose names consist of a number, a tab character, and the element’s name. The Frontier GUI hides the number and the tab character when you view the table, so you can see multiple same-named elements in the table editor. When you click an item’s name, the number and tab character are revealed, and you can edit them if you want. That said, you’re supposed to use the XML verbs xml.addTable or xml.addValue to add elements.

Most of this is not particularly well documented, and personally I don’t think it was the most elegant solution, but it was effective at working around Frontier’s limitation that items in tables had to have unique names, whereas in XML they don’t.

I wanted something simpler, so I decided instead to simply strip anything after a comma character from the sub-item’s name. This way whenever my data extractor is adding an item, it can just use table.uniqueName with a prefix ending in a comma character, and then add the item at that address. Two lines of code, or one if we get just a little bit fancy:

table.uniqueName (element + ",", @t)^ = value;

XML attributes: The last problem to solve was generating attributes on XML elements, for example <guid isPermaLink="false">...</guid>. It turns out that if there were an xml.addAttributeValue in Frontier, it could have handled this pretty easily, but that was never implemented. Instead I’d have to add an /atts sub-table and add the attribute manually, which takes multiple lines of code just to set a single attribute. Of course I could implement xml.addAttributeValue, but I don’t have a way to distribute it, so nobody else could use it! 🙁

In addition, I really didn’t want big, deeply-nested data structures flying around my call-stack, since I’m going to be creating thousands of tables at run-time, and I was concerned about memory and performance.

In the end I decided to do a hack: By using the | character to delimit attribute/value pairs in the name of table sub-elements, I could include the attributes and their values in the element name itself. So the <guid isPermaLink="false"> element would come from a sub-item named guid|isPermaLink=false.

Normally I would avoid doing something like this since hacks have a tendency to be fragile, but in this case I know in advance what all of the output needs to look like, so I don’t need a robust widely-applicable solution, and the time I save with the hacky version is worth it.
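To make the three naming conventions concrete, here is a minimal Python sketch of a one-way encoder that follows them. It is an illustration, not the Tool’s UserTalk code. The parse_key and encode functions are hypothetical names, a nested dict stands in for a Frontier table, and every text value is entity-escaped (the real Tool passes CDATA-wrapped strings through, which this sketch doesn’t attempt).

```python
from xml.sax.saxutils import escape, quoteattr


def parse_key(key):
    """Split a sub-item name into (element name, attribute dict)."""
    key = key.split(",", 1)[0]           # drop the uniqueness suffix, if any
    name, *pairs = key.split("|")        # "guid|isPermaLink=false"
    attrs = dict(pair.split("=", 1) for pair in pairs)
    return name, attrs


def encode(table, indent=0):
    """Recursively turn a nested dict into indented XML text."""
    pad = "  " * indent
    lines = []
    for key, value in table.items():
        name, attrs = parse_key(key)
        att_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        if isinstance(value, dict):
            lines.append(f"{pad}<{name}{att_text}>")
            lines.append(encode(value, indent + 1))
            lines.append(f"{pad}</{name}>")
        else:
            lines.append(f"{pad}<{name}{att_text}>{escape(str(value))}</{name}>")
    return "\n".join(lines)


item = {
    "title": "Hello & welcome",
    "guid|isPermaLink=false": "http://blog.example.com/?p=1",
    "wp:comment,1": {"wp:comment_id": 15, "wp:comment_parent": 0},
    "wp:comment,2": {"wp:comment_id": 16, "wp:comment_parent": 15},
}
xml_text = encode({"item": item})
```

The trade-off is the one described above: the key syntax breaks down if an attribute value ever needs to contain a comma or a | character, which is acceptable only because the set of output elements is known in advance.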

Utility Functions

Then there’s a bunch of miscellany:

A way to easily wrap the body of a post with <![CDATA[…]]> tokens, and properly handle the edge case where the body actually contains those tokens.
A non-buggy way to encode entities in text destined for XML. (xml.entityEncode has had some bugs forever, which weren’t fixed because of Rule 1.)
Code to deal with encoding various date formats, and converting to GMT.
Code to convert non-printable characters into the appropriate HTML entities (which in turn get encoded in XML).
Other utility functions dealing with URLs, calculating permalinks, getting people’s names from their usernames, etc.
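Two of these utilities are easy to sketch in Python. The CDATA helper uses the standard trick of splitting an embedded "]]>" terminator across two CDATA sections. The date helper assumes a fixed UTC offset, which is a simplification (a real conversion needs daylight-saving-aware time zone data). Both function names are hypothetical, and the -7 offset is only inferred from the sample comment timestamps earlier in this post.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET


def cdata(text):
    """Wrap text in a CDATA section, splitting any embedded "]]>" terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def wxr_dates(local_text, utc_offset_hours):
    """Return (local, GMT) timestamps in the "YYYY-MM-DD HH:MM:SS" form WXR uses."""
    fmt = "%Y-%m-%d %H:%M:%S"
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(local_text, fmt).replace(tzinfo=tz)
    return local.strftime(fmt), local.astimezone(timezone.utc).strftime(fmt)


# A post body that itself talks about CDATA is the awkward edge case.
body = "Use <![CDATA[ ... ]]> to protect markup"
wrapped = cdata(body)

# Parsing the wrapped text back should give the original body unchanged.
round_trip = ET.fromstring("<x>" + wrapped + "</x>").text

local, gmt = wxr_dates("2004-08-01 14:17:03", -7)
```

Parsing the wrapped body back with a real XML parser is a cheap way to confirm that nothing was lost or mangled at the split point.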

The Elephants in the Room

At this point there were a few more things I knew I would need to address. I’ll talk about these along with handling media objects in my next post. In the meantime, here’s a teaser:

Lots of stuff in Manila just doesn’t work at all unless you actually install the site, with Manila’s source code available.
The macro and glossary processors aren’t easy to get working unless the code is running in the context of a real web request.
What should I do about all the incoming links to my site? Are they all going to simply break?

I’ll talk about how I dealt with these and other issues in the next post.

More soon…
