Unwrapping Nodes with Nokogiri
I learned a lot about the Nokogiri gem (used to parse and manipulate XML and HTML) when I wrote a script to download all my FriendFeed posts.
One thing I couldn’t figure out was how to unwrap a node, that is, remove an element while leaving its children in place:
quick-brown-fox.html
before transformation
<html>
<head>
<title>example</title>
</head>
<body>
<p>The <font face="Verdana" size="16">quick brown <a href="https://en.wikipedia.org/wiki/Fox" title="Fox • Wikipedia">fox</a></font> jumped over the <font face="Verdana" size="16">lazy <a href="https://en.wikipedia.org/wiki/Dog" title="Dog • Wikipedia">dog</a></font>.
</p>
</body>
</html>
quick-brown-fox.html
after transformation
<html>
<head>
<title>example</title>
</head>
<body>
<p>The quick brown <a href="https://en.wikipedia.org/wiki/Fox" title="Fox • Wikipedia">fox</a> jumped over the lazy <a href="https://en.wikipedia.org/wiki/Dog" title="Dog • Wikipedia">dog</a>.
</p>
</body>
</html>
Now, one could simply use XSLT:
unwrap-font.xsl
XSL stylesheet to unwrap <font> tags
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="font">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="/ | @* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Ruby code to perform XSL transform
require 'nokogiri'

# Read both files as strings so no file handles are left open.
input = Nokogiri::HTML(File.read('quick-brown-fox.html'))
template = Nokogiri::XSLT(File.read('unwrap-font.xsl'))
output = template.transform(input)
…but this seems like overkill.
Instead, you could do this:
Ruby code to remove <font> tags
require 'nokogiri'

document = Nokogiri::HTML(File.read('quick-brown-fox.html'))
# Replace each <font> element with its own children, in place.
document.xpath('//font').each do |node|
  node.replace(node.children)
end
(I originally found a different snippet for unwrapping nodes, but it doesn’t do what I want: it appends the contents of each <font> tag to the end of the <p> instead of leaving them where the <font> tag was.)