Parsing simple data structures from XML files with Perl
This page shows some ways to get XML data into Perl using XML::Simple and XML::Twig, for use in arrays and hashes.
Perl offers many modules for parsing XML data. A glimpse at cpan.perl.org in the category "string processing" lists modules for just about everything you can imagine. With such a vast choice, deciding which one is right for you can be overwhelming. I use XML for accessing data that comes from other applications. Although it takes some care in coding, XML lets you mark up your data with tags and feed it into data objects. Accessing the data afterwards is far easier and more elegant to code.
Simple import whitespace separated
To import external data from whitespace-separated, tab-separated or CSV files, step through a list or file, split each line and store the fields in an array.
B01 192.168.0.101 500MHZ 256MB
B02 192.168.0.102 800MHZ 128MB
foreach (@lines){
    # split on runs of whitespace (\w would split on word characters instead)
    my @array = split /\s+/, $_;
    print "computer name ", $array[0], "\n";
    print "computer ip ", $array[1], "\n";
    print "computer cpu ", $array[2], "\n";
    print "computer ram ", $array[3], "\n";
}
Simple Markup Method - but not official XML
A more robust method is to use a simple form of markup and parse it with a regular expression. The regex method is very fast and will usually beat a full XML parser when it comes to speed.
Horrible regexps are a chore, though, and limiting once you have more than a handful of tags. It's even harder when the data is spread over more than one line. More problems arise when the tag names or their order change, because the whole regexp has to be rewritten.
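One way to soften the ordering problem is a generic pattern that matches any opening tag together with its own closing tag and collects the pairs into a hash. The following is only a sketch under assumptions: the sample lines and the helper name parse_line are made up for illustration, and it still only copes with flat, single-line, non-nested markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; the tags appear in a different order on each line.
my @lines = (
    '<CNAME>B01</CNAME><IP>192.168.0.101</IP><CPU>500MHZ</CPU><RAM>256MB</RAM>',
    '<RAM>128MB</RAM><CPU>800MHZ</CPU><CNAME>B02</CNAME><IP>192.168.0.102</IP>',
);

# parse_line is a hypothetical helper: it returns a hash ref of tag => value.
sub parse_line {
    my ($line) = @_;
    my %record;
    # \1 forces the closing tag to repeat the opening tag name; .*? is non-greedy.
    while ($line =~ /<(\w+)>(.*?)<\/\1>/g) {
        $record{$1} = $2;
    }
    return \%record;
}

my @computers = map { parse_line($_) } @lines;

# Print the fields in a fixed order, whatever order they had in the input.
foreach my $pc (@computers) {
    print join(' ', map { $pc->{$_} // '' } qw(CNAME IP CPU RAM)), "\n";
}
```

Because the values end up in a hash keyed by tag name, a new tag or a changed order does not require a new regexp, only a change to the list of keys that is printed.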
<CNAME>B01</CNAME><IP>192.168.0.101</IP><CPU>500MHZ</CPU><RAM>256MB</RAM>
<CNAME>B02</CNAME><IP>192.168.0.102</IP><CPU>800MHZ</CPU><RAM>128MB</RAM>
foreach (@lines){
    # only use the captures when the whole pattern matched this line
    if (/<CNAME>(.*?)<\/CNAME><IP>(.*?)<\/IP><CPU>(.*?)<\/CPU><RAM>(.*?)<\/RAM>/){
        my ($name, $ip, $cpu, $ram) = ($1, $2, $3, $4);
        print "NAME ", $name, "\n";
        print "IP ", $ip, "\n";
        print "CPU ", $cpu, "\n";
        print "RAM ", $ram, "\n";
    }
}
Real XML Parsing
XML is very much like the above example in that you define your own tags to enclose each item of data. However, an XML file normally starts with a declaration stating that the content is XML, and all XML data must belong to a tree-like hierarchy. This means it must have a single top element, or root. As XML, the data would look like this:
<?xml version="1.0"?>
<root>
  <COMPUTER>
    <CNAME>B01</CNAME>
    <IP>192.168.0.101</IP>
    <CPU>500MHZ</CPU>
    <RAM>256MB</RAM>
  </COMPUTER>
  <COMPUTER>
    <CNAME>B02</CNAME>
    <IP>192.168.0.102</IP>
    <CPU>800MHZ</CPU>
    <RAM>128MB</RAM>
  </COMPUTER>
</root>
Very Useful Perl module - Data::Dumper
I assume you simply want to get simple data structures into your Perl scripts from an external file called "datafile.xml": parse it, feed the data into a Perl data structure and access it via a simple loop. I recommend the Perl module Data::Dumper. It does not do anything with the data, but it shows you the structure the data is parsed into. This is useful when you later reference deeply nested elements within hashes of hashes of hashes, etc.
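To make this concrete without any XML involved, here is a small sketch that dumps a hand-built structure and then reaches into it. The structure is made up for illustration; the two $Data::Dumper settings are only there to keep the output stable and compact.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in sorted order
$Data::Dumper::Indent   = 1;   # compact indentation

# A hand-built hash holding an array of hashes (illustrative data only).
my $data = {
    COMPUTER => [
        { CNAME => 'B01', IP => '192.168.0.101' },
        { CNAME => 'B02', IP => '192.168.0.102' },
    ],
};

# Dumper returns the structure as a string of Perl code.
my $dump = Dumper($data);
print $dump;

# The dump shows the path to follow: hash key, array index, hash key.
print $data->{COMPUTER}[1]{IP}, "\n";
```

Reading the dump from the outside in gives the chain of braces and brackets needed to address any single value.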
Parsing XML files with XML::Simple
XML::Simple requires XML::Parser and was designed for use with configuration files. "Configuration" is the key word here: it hints at plain ASCII data. The module actually parses internally with UTF-8 encoding, so you may experience real problems if you use it to parse XML in ISO-8859-1 that contains special characters such as umlauts.
Watch out on Windows
Be aware that on Windows, just because you have umlauts in your text does not mean that your data is ISO-8859-1 (Latin-1). If your text file contains a euro symbol, it is probably not ISO-8859-1 at all, because that character set has no euro symbol; the file may in fact be in the MS Windows extended Latin code page 1252. So watch out when you are moving from Windows to Unix.
Parsing XML with XML::Simple and using Data::Dumper to check the data structure that is parsed
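The difference can be seen with the Encode module that ships with Perl. This is a minimal sketch, assuming the input bytes came from a Windows text file; the sample bytes are made up. The same byte 0x80 decodes to the euro sign under code page 1252 but to a control character under ISO-8859-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: byte 0x80 followed by " 100", as a Windows editor might save it.
my $bytes = "\x80 100";

# Decode the same bytes under two different assumptions about the encoding.
my $as_cp1252 = decode('cp1252',     $bytes);
my $as_latin1 = decode('iso-8859-1', $bytes);

printf "cp1252:     U+%04X\n", ord($as_cp1252);   # first character as a code point
printf "iso-8859-1: U+%04X\n", ord($as_latin1);

# Once decoded correctly, the text can be re-encoded as UTF-8 for an XML parser.
my $utf8 = encode('UTF-8', $as_cp1252);
printf "UTF-8 bytes: %s\n",
    join(' ', map { sprintf '%02X', ord } split //, substr($utf8, 0, 3));
```

Decoding with the wrong assumption does not raise an error here; it silently produces a different character, which is why the problem often only shows up after the file has been moved to another system.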
#!/usr/bin/perl -w
use strict;
use XML::Simple;
use Data::Dumper;

my $xmlfile = "./datafile.xml";
# eval traps the die that XMLin raises on a missing or malformed file
my $ref = eval { XMLin($xmlfile) };

if ($@){
    print "XML Read ERROR";
} else {
    print Dumper($ref);
}
Output of XML parse with Data::Dumper
$VAR1 = {
  'COMPUTER' => [
    {
      'RAM' => '256MB',
      'CNAME' => 'B01',
      'CPU' => '500MHZ',
      'IP' => '192.168.0.101'
    },
    {
      'RAM' => '128MB',
      'CNAME' => 'B02',
      'CPU' => '800MHZ',
      'IP' => '192.168.0.102'
    }
  ]
};
Outputting data contained in the hash
#!/usr/bin/perl -w
use strict;
use XML::Simple;

my $xmlfile = "./datafile.xml";
my $ref = eval { XMLin($xmlfile) };

if ($@){
    print "XML Read ERROR";
} else {
    # $ref->{COMPUTER} is a reference to an array of hashes, one per computer
    foreach my $item (@{$ref->{COMPUTER}}){
        print $item->{CNAME}, "\n";
        print $item->{IP}, "\n";
        print $item->{CPU}, "\n";
        print $item->{RAM}, "\n";
    }
}
Output of XML parse with XML::Simple
B01
192.168.0.101
500MHZ
256MB
B02
192.168.0.102
800MHZ
128MB
XML::Simple for effective processing
Here is an example of what XML::Simple can process.
<?xml version="1.0"?>
<root>
  <COMPUTER>
    <RAM>256MB</RAM>
    <CNAME>B01</CNAME>
    <IP>192.168.0.101</IP>
    <CPU>500MHZ</CPU>
  </COMPUTER>
  <COMPUTER>
    <CNAME>B02</CNAME>
    <IP>192.168.0.102</IP>
    <CPU>800MHZ</CPU>
    <RAM>128MB</RAM>
    <RAMTYPE>RAMBUS</RAMTYPE>
  </COMPUTER>
  <COMPUTER>
    <CNAME>B03</CNAME>
    <IP>192.168.0.103</IP>
    <CPU>800MHZ</CPU>
    <RAM>128MB</RAM>
  </COMPUTER>
  <COMPUTER id="A10922" CNAME="C01" IP="92.168.0.106" CPU="300MHZ" RAM="64MB" />
  <COMPUTER id="A10223">
    Johns works on a computer called <CNAME>B04</CNAME>.
    It should have an ip address of <IP>192.168.0.104</IP>.
    It runs with <CPU>800MHZ</CPU> and <RAM>64MB</RAM>
    Sometimes it crashes because it uses <OS>Windows</OS>
  </COMPUTER>
</root>
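As a hedged sketch of how records like these might be read, the script below parses a shortened version of the data from a string. The choice of the ForceArray and KeyAttr options is an assumption about what is wanted here, not something the data requires: ForceArray keeps COMPUTER as an array even if only one record is present, and an empty KeyAttr stops XML::Simple from folding records into a hash keyed on the id attribute. The mixed-content record is left out to keep the example short.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Shortened sample: one record with child elements, one with attributes only.
my $xml = <<'XML';
<?xml version="1.0"?>
<root>
  <COMPUTER>
    <CNAME>B02</CNAME>
    <IP>192.168.0.102</IP>
    <RAMTYPE>RAMBUS</RAMTYPE>
  </COMPUTER>
  <COMPUTER id="A10922" CNAME="C01" IP="92.168.0.106" />
</root>
XML

# XMLin accepts a string of XML as well as a file name.
my $ref = XMLin($xml, ForceArray => ['COMPUTER'], KeyAttr => []);

foreach my $item (@{ $ref->{COMPUTER} }) {
    # Attributes and child elements both end up as plain hash keys.
    my $name = $item->{CNAME}   // 'unknown';
    my $type = $item->{RAMTYPE} // 'n/a';   # optional field, may be missing
    print "$name $type\n";
}
```

Testing each optional key before use matters with data like this, because not every record carries every field.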
Parsing XML files with XML::Twig
XML::Twig is my choice because I use ISO-8859-1 all the time, and XML::Twig allows me to ignore the encoding used and accept the data as it comes. It does not solve all the problems and even creates a few others, because it also requires us to be careful about what data we are parsing. I prefer the way XML::Simple gives access to the data, but as I am only parsing simple structures this does not really affect me much. XML::Twig gets my thumbs up because it can do more and solves my encoding problems.
#!/usr/bin/perl -w
use strict;
use XML::Twig;

# KeepEncoding leaves the text in the encoding of the original file
my $twig = XML::Twig->new(KeepEncoding => 1);
$twig->parsefile("datafile.xml");

my $root = $twig->root;
my @items = $root->children;

foreach my $item (@items){
    print $item->first_child('CNAME')->text, "\n";
    print $item->first_child('IP')->text, "\n";
    print $item->first_child('CPU')->text, "\n";
    print $item->first_child('RAM')->text, "\n\n";
}
Output of XML parse with XML::Twig
B01
192.168.0.101
500MHZ
256MB

B02
192.168.0.102
800MHZ
128MB

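Being careful about the data matters with XML::Twig: calling ->text on the result of first_child fails when a record lacks that child element. A possible variation, sketched here with made-up sample data (record B05 is hypothetical), uses first_child_text, which returns an empty string when the child is missing, and parses from a string so the example stands alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Sample data; the second record (hypothetical) has no CPU or RAM element.
my $xml = <<'XML';
<?xml version="1.0"?>
<root>
  <COMPUTER><CNAME>B01</CNAME><IP>192.168.0.101</IP><CPU>500MHZ</CPU><RAM>256MB</RAM></COMPUTER>
  <COMPUTER><CNAME>B05</CNAME><IP>192.168.0.105</IP></COMPUTER>
</root>
XML

my $twig = XML::Twig->new();
$twig->parse($xml);    # parse() takes a string, parsefile() a file name

my @rows;
foreach my $item ($twig->root->children('COMPUTER')) {
    # first_child_text gives '' for a missing child; show '-' in that case.
    my @fields = map { $item->first_child_text($_) || '-' } qw(CNAME IP CPU RAM);
    push @rows, join(' ', @fields);
}

print "$_\n" for @rows;
```

Passing 'COMPUTER' to children also skips any other element that might sit directly under the root.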
