
I'm trying to split a 13 GB XML file into small ~50 MB XML files with the XSLT stylesheet below.

But xsltproc gets killed partway through this process, after I see it taking up over 1.7 GB of memory (that's the total on the system).

Is there any way to deal with huge XML files with xsltproc? Can I change my style sheet? Or should I use a different processor? Or am I just S.O.L.?

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
                xmlns:exsl="http://exslt.org/common"
                extension-element-prefixes="exsl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:param name="block-size" select="75000"/>

  <xsl:template match="/">
    <xsl:copy>
      <xsl:apply-templates select="mysqldump/database/table_data/row[position() mod $block-size = 1]"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="row">
    <exsl:document href="chunk-{position()}.xml">
      <add>
        <xsl:for-each select=". | following-sibling::row[position() &lt; $block-size]">
          <doc>
            <xsl:for-each select="field">
              <field>
                <xsl:attribute name="name"><xsl:value-of select="@name"/></xsl:attribute>
                <xsl:value-of select="."/>
              </field>
              <xsl:text>&#xa;</xsl:text>
            </xsl:for-each>
          </doc>
        </xsl:for-each>
      </add>
    </exsl:document>
  </xsl:template>
</xsl:stylesheet>
David Parks

2 Answers


Well, it seems there are streaming XML transformation options that work a bit differently from XSLT (XSLT requires the whole document to fit into memory; the streaming XML transformation languages don't).

Instead of re-writing the XSLT, which I had just spent a day preparing, I just spun up an AWS spot instance with 64 GB of RAM, gave it some swap, and my 13 GB XML consumed just about 65 GB of RAM under xsltproc.

On this system it ran, but I can now tell you that you won't get much more than a 13 GB file through even Amazon's largest high-memory instance. You'll need to go with another solution, such as streaming XML transformations; a minimal sketch of that approach follows at the end of this answer.

As a benchmark for anyone thinking of pushing the envelope: this transformation failed on an instance with 32 GB of RAM and a 120 GB swap partition. It seems you can swap a few gigabytes, but not much more, before something crashes the process.
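For anyone who does go the streaming route, here is a minimal sketch of the idea in Python with lxml's iterparse, doing the same row-splitting job as the stylesheet in the question. The input file name dump.xml, the chunk naming, and the block size are assumptions taken from the question; this sketch hasn't been run against the original data:

#!/usr/bin/env python3
# Streaming splitter sketch: walk the dump with iterparse, reshape each
# <row> into the <doc>/<field> form from the question, and start a new
# chunk file every BLOCK_SIZE rows. Memory stays bounded because each
# processed subtree is discarded immediately.

from lxml import etree

BLOCK_SIZE = 75000     # rows per chunk, matching the stylesheet's $block-size
INPUT = 'dump.xml'     # assumed input file name

out = None
count = 0

for event, row in etree.iterparse(INPUT, tag='row'):
    # open a new chunk file every BLOCK_SIZE rows
    if count % BLOCK_SIZE == 0:
        if out is not None:
            out.write(b'</add>\n')
            out.close()
        out = open('chunk-%d.xml' % (count // BLOCK_SIZE + 1), 'wb')
        out.write(b'<add>\n')

    # rebuild the row as <doc><field name="...">value</field>...</doc>
    doc = etree.Element('doc')
    for f in row.findall('field'):
        field = etree.SubElement(doc, 'field', name=f.get('name', ''))
        field.text = f.text
    out.write(etree.tostring(doc))
    out.write(b'\n')
    count += 1

    # throw away the subtree just written (and earlier siblings) so the
    # tree behind iterparse doesn't grow with the file
    row.clear()
    while row.getprevious() is not None:
        del row.getparent()[0]

if out is not None:
    out.write(b'</add>\n')
    out.close()

The point is that each <row> subtree is discarded as soon as it has been written out, so peak memory stays around the size of one row rather than the whole 13 GB document.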

David Parks

This simple utility requires Python and the python-lxml module (with libxml2 installed on the system). It lets you stream-parse elements, transform each element through XSLT, and write it into the result file right away, with no buffering:

#!/usr/bin/env python3

from lxml import etree

# FILL_* placeholders: substitute your own file names and element name
_xslt = etree.parse('FILL_XSLT_DOC')
_dom = etree.iterparse('FILL_SOURCE_XML')
transform = etree.XSLT(_xslt)
results = open('FILL_RESULT_XML', 'w+b')

# iterparse yields (event, element) pairs as each element finishes
# parsing, so the whole document is never loaded at once
for event, elem in _dom:
    # endswith() also matches namespace-qualified tags like {ns}name
    if elem.tag.endswith('FILL_SEARCHED_ELEMENT_NAME'):
        newElem = transform(elem)
        results.write(etree.tostring(newElem, xml_declaration=False, encoding='utf8'))
        results.write(b'\n')
        # discard the subtree just processed (and any already-handled
        # preceding siblings) so memory use stays flat on huge inputs
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

results.close()

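For the question's dump, FILL_SEARCHED_ELEMENT_NAME would be row and FILL_XSLT_DOC would be a per-row version of the stylesheet above (one that matches a single row and emits a doc); the add wrapper and the splitting into separate chunk files would then be done in the Python loop rather than in the XSLT. Those fill-ins are guesses based on the question, not something verified against the original data.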
OK, please be aware: if your XSLT contains <xsl:strip-space elements="*"/>, you can suffer from this 2010 bug: https://bugs.launchpad.net/lxml/+bug/583249

Marek Sebera