Best way to process large XML in PHP

Accepted answer
Score: 23

For a large file, you'll want to use a SAX parser rather than a DOM parser.

A DOM parser reads the whole file and loads it into an object tree in memory. A SAX parser reads the file sequentially and calls your user-defined callback functions to handle the data (start tags, end tags, CDATA, etc.).

With a SAX parser you'll need to maintain state yourself (e.g. which tag you are currently in), which makes it a bit more complicated, but for a large file it will be much more memory-efficient.
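To make the state-tracking part concrete, here is a minimal sketch using PHP's built-in XML Parser extension (`xml_parser_create`, `xml_set_element_handler`, `xml_parse`). The function name `parseTitles` and the `<title>` element it collects are made up for illustration; adapt them to your own document.

```php
<?php
// Hypothetical helper: collects the text of every <title> element
// while streaming the file in 8 KB chunks.
function parseTitles(string $path): array
{
    $titles = [];
    $stack  = [];   // the state we maintain ourselves: currently open tags
    $buffer = '';   // text collected for the <title> being read

    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Start-tag callback
        function ($parser, string $name, array $attrs) use (&$stack, &$buffer) {
            $stack[] = $name;
            if ($name === 'title') {
                $buffer = '';
            }
        },
        // End-tag callback
        function ($parser, string $name) use (&$stack, &$buffer, &$titles) {
            if ($name === 'title') {
                $titles[] = trim($buffer);
            }
            array_pop($stack);
        }
    );

    // Text may arrive in several pieces, so append rather than assign.
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$stack, &$buffer) {
            if (end($stack) === 'title') {
                $buffer .= $data;
            }
        }
    );

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (!feof($fh)) {
        $chunk = fread($fh, 8192);
        if ($chunk === false) {
            fclose($fh);
            throw new RuntimeException("Read error on $path");
        }
        // The third argument tells the parser whether this is the last chunk.
        if (!xml_parse($parser, $chunk, feof($fh))) {
            $error = xml_error_string(xml_get_error_code($parser));
            $line  = xml_get_current_line_number($parser);
            fclose($fh);
            throw new RuntimeException("XML error on line $line: $error");
        }
    }
    fclose($fh);

    return $titles;
}
```

Only one chunk and the `$stack`/`$buffer` variables are held at any time, so memory use should stay roughly flat regardless of file size (apart from whatever you choose to accumulate in `$titles`).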

Score: 11

My take on it:

https://github.com/prewk/XmlStreamer

A simple class that will extract all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.

    class SimpleXmlStreamer extends XmlStreamer {
        public function processNode($xmlString, $elementName, $nodeIndex) {
            $xml = simplexml_load_string($xmlString);

            // Do something with your SimpleXML object

            return true;
        }
    }

    $streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
    $streamer->parse();

Score: 8

When using a DOMDocument with large XML files, don't forget to pass the LIBXML_PARSEHUGE flag in the options of the load() method. (The same applies to the other load methods of the DOMDocument object.)

    $checkDom = new \DOMDocument('1.0', 'UTF-8');
    $checkDom->load($filePath, LIBXML_PARSEHUGE);

(Works with a 120 MB XML file.)
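As an illustration of the same flag on another load method, here is a small sketch around `loadXML()` that also collects libxml errors instead of letting them surface as warnings. The wrapper name `loadHugeXml` is hypothetical, not part of DOMDocument.

```php
<?php
// Hypothetical wrapper: parse an XML string with LIBXML_PARSEHUGE
// and turn parse failures into an exception.
function loadHugeXml(string $xml): DOMDocument
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument('1.0', 'UTF-8');
    $ok  = $dom->loadXML($xml, LIBXML_PARSEHUGE);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        $message = $errors ? trim($errors[0]->message) : 'unknown error';
        throw new RuntimeException("XML could not be parsed: $message");
    }

    return $dom;
}
```

Keep in mind that the flag only lifts the parser's built-in size limits; the whole tree is still built in memory.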

Score: 6

A SAX parser, as Eric Petroelje recommends, would be better for large XML files. A DOM parser loads the entire XML file and allows you to run XPath queries; a SAX (Simple API for XML) parser simply reads the file sequentially and gives you hook points for processing.
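For comparison, this is roughly what the DOM side of that trade-off looks like: the whole document is loaded first, then queried with `DOMXPath`. The function name `queryTitles` and the catalog markup in the usage comment are invented for the example.

```php
<?php
// Hypothetical helper: load the whole document, then run an XPath query on it.
function queryTitles(string $xml, string $expression): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);          // the entire tree now lives in memory

    $xpath  = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->query($expression) as $node) {
        $result[] = $node->textContent;
    }

    return $result;
}

// Usage (sample data is made up):
// $xml = '<catalog><book id="1"><title>First</title></book>'
//      . '<book id="2"><title>Second</title></book></catalog>';
// $titles = queryTitles($xml, '/catalog/book[@id="2"]/title');
```

This is convenient for small and medium files, but it is exactly the part that stops scaling once the document no longer fits comfortably in memory.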

Score: 3

It really depends on what you want to do with the data. Do you need it all in memory to work with it effectively?

6.5 MB is not that big by the standards of today's computers. You could, for example, raise the limit with ini_set('memory_limit', '128M');

However, if your data can be streamed, you may want to look at using a SAX parser. It really depends on your usage needs.

Score: 2

A SAX parser is the way to go. I've found that SAX parsing can get messy if you don't stay organised.

I use an approach based on STX (Streaming Transformations for XML) to parse large XML files. I use the SAX methods to build a SimpleXML object that keeps track of the data in the current context (i.e. just the nodes between the root and the current node). Other functions are then used to process the SimpleXML document.

Score: 1

I needed to parse a large XML file that happened to have one element on each line (the Stack Overflow data dump). In this specific case it was sufficient to read the file one line at a time and parse each line using SimpleXML. For me this had the advantage of not having to learn anything new.
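A rough sketch of that line-by-line approach, assuming each record is a self-contained `<row ... />` element on its own line (the element name `row` and the function name `readRows` are assumptions for this example, so check them against your file):

```php
<?php
// Hypothetical helper: yields one SimpleXMLElement per <row .../> line.
// Only works when every record is a complete element on a single line.
function readRows(string $path): Generator
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        while (($line = fgets($fh)) !== false) {
            $line = trim($line);

            // Skip the XML declaration and the root element's open/close tags.
            if (strncmp($line, '<row', 4) !== 0) {
                continue;
            }

            $row = simplexml_load_string($line);
            if ($row !== false) {
                yield $row;
            }
        }
    } finally {
        fclose($fh);
    }
}

// Usage:
// foreach (readRows('Posts.xml') as $row) {
//     echo (string) $row['Id'], "\n";
// }
```

Because it is a generator, only the current line is held in memory. It will break as soon as an element spans more than one line, so it is a shortcut for this kind of file rather than a general XML parser.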
