mahiwaga

I'm not really all that mysterious

Unwrapping Nodes with Nokogiri

I learned a lot about the Nokogiri gem (used to parse and manipulate XML and HTML) when I wrote a script to download all my FriendFeed posts.

One thing I couldn’t figure out at first was how to unwrap a node, that is, remove a tag while keeping its contents in place:

quick-brown-fox.html before transformation

<html>
<head>
<title>example</title>
</head>
<body>
<p>The <font face="Verdana" size="16">quick brown <a href="https://en.wikipedia.org/wiki/Fox" title="Fox • Wikipedia">fox</a></font> jumped over the <font face="Verdana" size="16">lazy <a href="https://en.wikipedia.org/wiki/Dog" title="Dog • Wikipedia">dog</a></font>.
</p>
</body>
</html>

quick-brown-fox.html after transformation

<html>
<head>
<title>example</title>
</head>
<body>
<p>The quick brown <a href="https://en.wikipedia.org/wiki/Fox" title="Fox • Wikipedia">fox</a> jumped over the lazy <a href="https://en.wikipedia.org/wiki/Dog" title="Dog • Wikipedia">dog</a>.
</p>
</body>
</html>

Now, one could simply use XSLT:

unwrap-font.xsl • XSL stylesheet to unwrap <font> tag

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="font">
    <xsl:apply-templates />
  </xsl:template>
  
  <xsl:template match="/ | @* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>
  
</xsl:stylesheet>

Ruby code to perform XSL transform

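require 'nokogiri'

# Parse the HTML input and the stylesheet from disk
input = Nokogiri::HTML(File.read('quick-brown-fox.html'))
template = Nokogiri::XSLT(File.read('unwrap-font.xsl'))

# transform returns a new document; the input is left unchanged
output = template.transform(input)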

…but this seems like overkill.

Instead, you could do this:

Ruby code to remove <font> tags

require 'nokogiri'

document = Nokogiri::HTML(File.read('quick-brown-fox.html'))

# Replace each <font> node with its own children, in place
document.xpath('//font').each do |node|
  node.replace(node.children)
end

(I originally found the following code for unwrapping nodes, but it doesn’t really do what I want, since it appends the contents of each <font> tag to the end of the <p> instead of leaving them where the <font> tag was.)
