a highly efficient xml compression scheme for the web przemysław skibiński 1, jakub swacha 2,...

A Highly Efficient XML Compression Scheme for the Web

Przemysław Skibiński1, Jakub Swacha2, Szymon Grabowski3

1 Uniwersytet Wrocławski, Instytut Informatyki, ul. Joliot-Curie 15, 50-383 Wrocław, Poland. E-mail: [email protected]

2 Uniwersytet Szczeciński, Instytut Informatyki w Zarządzaniu, ul. Mickiewicza 64, 71-101 Szczecin, Poland. E-mail: [email protected]

3 Politechnika Łódzka, Katedra Informatyki Stosowanej, al. Politechniki 11, 90-924 Łódź, Poland. E-mail: [email protected]

<conf_data> <conf_name>SOFSEM</conf_name> <conf_location><town>Nový Smokovec</town><country>Slovakia</country></conf_location> <conf_date><month>January</month><year>2008</year></conf_date></conf_data>

2

What’s wrong with XML

XML is textual, which is good for many reasons. But it is also verbose...

(NEED FOR COMPRESSION!)

XML databases can be large:

Protein Sequence Database (annotated) – 683 MB

DBLP Computer Science – 127 MB.

(Lots of information stored, but also a verbose representation.)

More and more XML documents are exchanged through the Web (the advent of the Open XML format in MS Office 2007 can only accelerate this trend).

3

XML compression goals

What shall we do: use general-purpose compression (e.g. zip, bzip2, ppmd)?

Far from optimal (known since 1999, when the first XML compressors appeared, e.g. XMill).

Compression ratio could be improved.

Speed can be improved (maybe not easily with zip though...).

Compression ratio and (de)compression speed are typically contradictory criteria; what should we choose?

WE CARE FOR A TOTAL TRANSFER TIME (OVER A NET).
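The trade-off behind "total transfer time" can be written as transmission time plus decompression time. The sketch below is only an illustration of that arithmetic: the function name, the file sizes and the decompression times are hypothetical and are not measurements from this talk.

```python
def total_transfer_time(compressed_bytes, bandwidth_bps, decompress_seconds):
    """Seconds to download a compressed file and then decompress it."""
    transmit_seconds = compressed_bytes * 8 / bandwidth_bps
    return transmit_seconds + decompress_seconds


# Hypothetical compressors: one compresses better but decodes slowly,
# the other compresses worse but decodes quickly.
STRONG = {"compressed_bytes": 10_000_000, "decompress_seconds": 20.0}
FAST = {"compressed_bytes": 15_000_000, "decompress_seconds": 2.0}

# On a slow link the smaller file wins; on a fast link the quicker decoder wins.
slow_link = 384_000        # bits per second
fast_link = 100_000_000    # bits per second
strong_on_slow = total_transfer_time(bandwidth_bps=slow_link, **STRONG)
fast_on_slow = total_transfer_time(bandwidth_bps=slow_link, **FAST)
strong_on_fast = total_transfer_time(bandwidth_bps=fast_link, **STRONG)
fast_on_fast = total_transfer_time(bandwidth_bps=fast_link, **FAST)
```

With these made-up figures, the stronger compressor gives the shorter total time at 384 Kbps and the faster decoder gives the shorter total time at 100 Mbps, which is why the choice depends on the bandwidth.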

4

Specialized XML compression

XMill (Liefke & Suciu, 1999, 2000) – separate streams: element and attribute names; actual content (text); XML document structure. Significant gains, esp. with gzip as the back-end compressor.

XMLPPM (Cheney, 2001) – switching between different PPM models. Novel idea: injecting a symbol from the previous model into the current context (so both the “traditional” and the element-related contexts matter).

SCMPPM (Adiego et al., 2004) – XMLPPM taken to the extreme: a separate model for each element path. Beats XMLPPM on large files, but also needs lots of memory.

5

Redundancy in XML databases

Every end tag must match the corresponding start tag → each end tag may be replaced with merely a closing flag.

Some words appear with high frequency → build a dictionary. Not only over tag / attribute names, but also over the textual content.

Physical layout is often regular → encode trailing spaces in lines at almost zero cost. A similar thing often works for End-of-Line chars.

Decimal system is verbose → compact integers (use e.g. base 256).
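The first observation (end tags are implied by the start tags) can be sketched as follows. This is a toy illustration, not the actual transform: the flag value `\x01` and the function names are hypothetical, and the code assumes well-formed input without comments, CDATA sections, processing instructions, or `>` inside attribute values.

```python
import re

CLOSE_FLAG = "\x01"  # hypothetical flag character, assumed absent from the input

END_TAG = re.compile(r"</[A-Za-z_][\w.:-]*\s*>")
# Matches a start tag (capturing its name) or the closing flag.
TOKEN = re.compile(r"<([A-Za-z_][\w.:-]*)[^<>]*>|\x01")


def encode_end_tags(xml):
    """Replace every end tag with a single closing flag."""
    return END_TAG.sub(CLOSE_FLAG, xml)


def decode_end_tags(encoded):
    """Restore end tags by tracking the stack of currently open elements."""
    out, stack, last = [], [], 0
    for m in TOKEN.finditer(encoded):
        out.append(encoded[last:m.start()])
        if m.group(0) == CLOSE_FLAG:
            out.append("</" + stack.pop() + ">")
        else:
            out.append(m.group(0))
            if not m.group(0).endswith("/>"):  # self-closing tags open nothing
                stack.append(m.group(1))
        last = m.end()
    out.append(encoded[last:])
    return "".join(out)


sample = "<conf_date><month>January</month><year>2008</year></conf_date>"
encoded = encode_end_tags(sample)
```

Here the three end tags of `sample` shrink to three one-character flags, and `decode_end_tags(encoded)` rebuilds the original string.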

6

Our web-compression-oriented transform, bird's-eye view

Design assumption: dedicated for PPM compressors (e.g. PPMd).

Semi-dynamic dictionary: use byte coding for words that appear at least fmin = 64 times in the document. (The dictionary is front-compressed and stored in the archive.)

The notion of a word also comprises: start XML tags, URL prefixes ( http://domain/ ), emails, &data, =", "> patterns, runs of spaces.

Integers and some other patterns are encoded densely.
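A semi-dynamic dictionary and its front compression might look roughly like the sketch below. It is a simplification under stated assumptions: a "word" here is only a start-tag prefix such as `<year` or a run of letters (the real notion of a word is wider, as listed above), the function names are hypothetical, and front coding is applied to a lexicographically sorted list, which may differ from how the actual archive orders its dictionary.

```python
import re
from collections import Counter


def build_dictionary(text, f_min=64, l_min=2):
    """Return words occurring at least f_min times, most frequent first."""
    tokens = re.findall(r"<[A-Za-z_][\w.:-]*|[A-Za-z]+", text)
    counts = Counter(t for t in tokens if len(t) >= l_min)
    frequent = [(word, n) for word, n in counts.items() if n >= f_min]
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in frequent]


def front_compress(sorted_words):
    """Front coding: store (shared prefix length, remaining suffix) per word."""
    pairs, prev = [], ""
    for word in sorted_words:
        k = 0
        while k < min(len(prev), len(word)) and prev[k] == word[k]:
            k += 1
        pairs.append((k, word[k:]))
        prev = word
    return pairs


def front_decompress(pairs):
    """Invert front_compress."""
    words, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        words.append(prev)
    return words


words = sorted(["compress", "compression", "compressor", "xml"])
packed = front_compress(words)
```

In this example `packed` stores `compression` as `(8, "ion")`, since it shares its first 8 characters with the preceding word.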

7

Dictionary coding

1st pass: gather the words of at least lmin = 2 characters with at least fmin occurrences, and sort them according to frequency.

Variable-length coding is used: from 1 up to 4 bytes.

The codeword alphabet: the 127-255 range, most of the 0-31 range + a few more chars.

Non-intersecting value ranges, of sizes w, x, y, z, for different codeword bytes. Namely: w 1-byters, x • w 2-byters, y • x • w 3-byters, z • y • x • w 4-byters.

The parameters w, x, y, z are selected according to the size of the created dictionary, with the principle of maximizing the number of short codewords.
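The codeword counts above follow if a codeword's last byte comes from the range of size w and the bytes before it come from the ranges of size x, y and z. The sketch below assumes exactly that byte layout, which the slide does not spell out; the function names and the toy byte values are hypothetical. Because the ranges are disjoint, the first byte alone tells the decoder how long the codeword is.

```python
def capacity(w, x, y, z):
    """Number of distinct codewords of length 1 to 4."""
    return w + x * w + y * x * w + z * y * x * w


def encode_rank(rank, W, X, Y, Z):
    """Map a dictionary rank (0 = most frequent word) to a 1-4 byte codeword."""
    w, x, y, z = len(W), len(X), len(Y), len(Z)
    if rank < w:
        return bytes([W[rank]])
    rank -= w
    if rank < x * w:
        return bytes([X[rank // w], W[rank % w]])
    rank -= x * w
    if rank < y * x * w:
        return bytes([Y[rank // (x * w)], X[rank // w % x], W[rank % w]])
    rank -= y * x * w
    if rank < z * y * x * w:
        return bytes([Z[rank // (y * x * w)], Y[rank // (x * w) % y],
                      X[rank // w % x], W[rank % w]])
    raise ValueError("rank exceeds codeword capacity")


def decode_codeword(code, W, X, Y, Z):
    """Invert encode_rank."""
    w, x, y = len(W), len(X), len(Y)
    if len(code) == 1:
        return W.index(code[0])
    if len(code) == 2:
        return w + X.index(code[0]) * w + W.index(code[1])
    if len(code) == 3:
        return (w + x * w
                + (Y.index(code[0]) * x + X.index(code[1])) * w
                + W.index(code[2]))
    return (w + x * w + y * x * w
            + ((Z.index(code[0]) * y + Y.index(code[1])) * x
               + X.index(code[2])) * w
            + W.index(code[3]))


# Toy disjoint ranges (hypothetical byte values): w=3, x=2, y=2, z=2.
W, X, Y, Z = [10, 11, 12], [20, 21], [30, 31], [40, 41]
codes = [encode_rank(r, W, X, Y, Z) for r in range(capacity(3, 2, 2, 2))]
```

Since words are ranked by frequency, the most frequent ones receive the 1-byte codewords, and a larger w increases the number of such short codewords at the cost of the longer ones.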

8

Pattern encoding

Some patterns, such as integers, dates (in a specified format) and IPs, occur frequently and can be encoded densely in binary.

Original idea: XMill.

In XWP: automatic detection (no need for DTD or human assistance).

XWP handles:

• integers from 1900...2155 (years) – 2 bytes (incl. a flag),

• other integers – from 2 to 5 bytes (up to 2^32),

• IP addresses – 5 bytes,

• dates (e.g., 1980-02-31, 01-MAR-1920) – 2 or 3 bytes, differential encoding,

• times (e.g., 11:30pm, 23:20, 23:30:59) – 3 or 4 bytes,

• page ranges – 4 bytes,

• floats x.x (0.0...24.9) and .xx – 2 bytes.
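The byte counts for integers can be reproduced with a flag byte followed by a binary payload. The sketch below is one possible reading: the flag values and function names are hypothetical, and it ignores details a real transform must handle (for example, integers written with leading zeros would not round-trip).

```python
YEAR_FLAG = 0x01                               # hypothetical flag value
INT_FLAGS = {1: 0x02, 2: 0x03, 3: 0x04, 4: 0x05}  # hypothetical: one flag per payload length


def encode_integer(n):
    """Encode a year in 2 bytes, or another integer below 2**32 in 2-5 bytes."""
    if 1900 <= n <= 2155:
        # 256 possible years fit in a single payload byte.
        return bytes([YEAR_FLAG, n - 1900])
    if not 0 <= n < 2 ** 32:
        raise ValueError("integer out of range")
    length = max(1, (n.bit_length() + 7) // 8)  # payload bytes, base 256
    return bytes([INT_FLAGS[length]]) + n.to_bytes(length, "big")


def decode_integer(code):
    """Invert encode_integer."""
    if code[0] == YEAR_FLAG:
        return 1900 + code[1]
    return int.from_bytes(code[1:], "big")
```

For instance, the year 2008 takes 2 bytes instead of 4 characters, and the largest supported value, 4294967295, takes 5 bytes instead of 10 characters.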

9

Encoding of time and range patterns

Page ranges x-y on 4 bytes: flag, number x on 2 bytes, difference y-x on 1 byte.

dblp.xml

Numbers from 1...12 followed by “am” or “pm” are interpreted as times, and encoded on 3 bytes: time pattern flag, the hour (in 24-h convention), the minutes.
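Both layouts are simple enough to sketch directly. The flag values, function names and exact matching rules below are assumptions made for illustration (only the `h:mm` form with an am/pm suffix is handled here); a pattern that does not fit returns `None` and would be left as plain text.

```python
import re

RANGE_FLAG = 0x06  # hypothetical flag value
TIME_FLAG = 0x07   # hypothetical flag value


def encode_page_range(text):
    """'x-y' -> flag, x on 2 bytes, y-x on 1 byte; None if it does not fit."""
    m = re.fullmatch(r"(\d+)-(\d+)", text)
    if not m:
        return None
    first, last = int(m.group(1)), int(m.group(2))
    if first >= 2 ** 16 or not 0 <= last - first < 256:
        return None
    return bytes([RANGE_FLAG]) + first.to_bytes(2, "big") + bytes([last - first])


def encode_ampm_time(text):
    """'h:mmam' / 'h:mmpm' -> flag, hour in 24-h convention, minutes."""
    m = re.fullmatch(r"(1[0-2]|[1-9]):([0-5]\d)(am|pm)", text)
    if not m:
        return None
    hour, minute = int(m.group(1)) % 12, int(m.group(2))
    if m.group(3) == "pm":
        hour += 12
    return bytes([TIME_FLAG, hour, minute])
```

A page range such as `123-145` (7 characters) becomes 4 bytes, and `11:30pm` (7 characters) becomes 3 bytes with the hour stored as 23.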

10

PPMVC (PPM with variable-length contexts)

[Skibiński & Grabowski, 2004]

Main weakness of most PPM algorithms is poor handling of long matching sequences (as opposed to LZ77 algorithms, which excel at it).

Using high orders (16+): memory-hungry, quite slow, and it's hard to overcome the so-called zero-frequency problem.

A possible solution: coupling PPM with LZ matching.

Original idea: PPM* (Cleary et al., 1995).

Another implementation: PPMZ (Bloom, 1998).

11

PPMVC, cont’d

In PPMVC, each max-order context holds a pointer to the reference context (the previous occurrence of the context) and the minimum left match length.

The left match length (LML) = the length of the common part of the active context and the reference context.

LML is always at least as large as the maximum PPM order.

The right match length (RML) = the length of the matching sequence between the symbols to encode and the symbols following the reference context.

If the left match between the current pos and the prev max-order context occurrence is at least minLML, then the RML (0 or more) is sent to the output. If not, plain PPM coding (Shkarin's PPMd) is used.

In practice it is better to quantize RML, e.g. round down to a multiple of 8.
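The two match lengths can be illustrated on a plain byte buffer. This is only a sketch of the definitions above, not PPMVC itself: the function names, the `limit` cap and the way positions are passed in are assumptions, and the sketch requires `ref_pos < cur_pos`.

```python
def left_match_length(data, ref_pos, cur_pos):
    """Length of the common part of the contexts ending at ref_pos and cur_pos."""
    n = 0
    while n < ref_pos and data[ref_pos - 1 - n] == data[cur_pos - 1 - n]:
        n += 1
    return n


def right_match_length(data, ref_pos, cur_pos, limit=255):
    """Number of upcoming symbols that repeat what followed the reference context."""
    n = 0
    while (cur_pos + n < len(data) and n < limit
           and data[ref_pos + n] == data[cur_pos + n]):
        n += 1
    return n


def quantize(rml, step=8):
    """Round RML down to a multiple of step."""
    return rml - rml % step


data = b"<year>2007</year><year>2008</year>"
ref_pos, cur_pos = 6, 23   # just after the first and the second "<year>"
lml = left_match_length(data, ref_pos, cur_pos)
rml = right_match_length(data, ref_pos, cur_pos)
```

In this tiny example the contexts share the 6 bytes of `<year>` and the following text agrees on 3 bytes (`200`), so a quantization step of 8 rounds the RML down to 0; on real data the matches worth coding this way are much longer.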

12

Fast PAQ

A relatively fast compressor from the PAQ (Mahoney, 2002-2007) family.

PAQ features:

• working on bit level,

• mixing predictions from various models run in parallel (PPM-like models, string matching model, word model, tabular data model, etc.),

• mixing predictions with several neural networks,

• adaptive probability maps (APM) mechanism to update the models considering previous experience and the current context,

• extremely high compression, extremely slow.

FastPAQ features:

• models irrelevant for XML removed,

• APM stages simplified,

• much faster than PAQ8 for a reasonable loss in compression.
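Mixing bit predictions with a small neural network can be sketched as a single-layer logistic mixer, in the style used by later PAQ versions. This is a floating-point illustration under assumptions, not FastPAQ's code: the class name, initial weights and learning rate are hypothetical, and real PAQ mixers use integer arithmetic and select weight sets by context.

```python
import math


def stretch(p):
    """Map a probability to the logistic domain."""
    return math.log(p / (1.0 - p))


def squash(x):
    """Inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))


class Mixer:
    """Combine several models' probabilities that the next bit is 1."""

    def __init__(self, n_inputs, learning_rate=0.02):
        self.weights = [0.3] * n_inputs
        self.lr = learning_rate
        self.inputs = [0.0] * n_inputs
        self.p = 0.5

    def predict(self, probs):
        self.inputs = [stretch(p) for p in probs]
        self.p = squash(sum(w * s for w, s in zip(self.weights, self.inputs)))
        return self.p

    def update(self, bit):
        # Gradient step that reduces the coding cost of the observed bit.
        err = bit - self.p
        self.weights = [w + self.lr * err * s
                        for w, s in zip(self.weights, self.inputs)]


mixer = Mixer(2)
first = mixer.predict([0.9, 0.2])   # model 0 is right, model 1 is wrong
for _ in range(200):
    mixer.predict([0.9, 0.2])
    mixer.update(1)                 # the actual bit is always 1 here
final = mixer.predict([0.9, 0.2])
```

After training on this toy input, the mixer trusts the model that keeps being right: its weight grows, the other weight shrinks, and the mixed prediction moves toward 1.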

13

Experiments: databases

14

Enwikinews, excerpt

15

Swissprot, excerpt

16

DBLP, excerpt

17

Experiments: methodology etc.

The test machine: Intel Core 2 Duo E6600 2.40 GHz, 1 GB RAM,

two Seagate 250 GB SATA drives in RAID mode 1, Windows XP (64-bit).

Implementation: C++ (Visual C++ 6.0). XML-WRT v3.1 with sources:

http://www.ii.uni.wroc.pl/~inikep/.

Back-end compressors used: gzip 1.2.4, Pavlov’s LZMA (used in 7-zip),

Shkarin’s PPMd, PPMVC, FastPAQ.

18

Experimental results

19

Decompression and transmission times

XWRT3+PPMVC: best choice for transmission speeds up to 384 Kbps.

For 1 Mbps, it succumbs only to XWRT2 (our prev. scheme) + LZMA.

Still, XWRT3 decompression is streamlined.

20

Conclusions

XWRT3 (XWP transform + PPMVC) seems to be the best choice for transmitting XML documents over slow / moderate-speed networks.

For high-bandwidth networks: XWP+PPMVC may be slower in retrieving a document than XWRT2+LZMA, but both the transform (XWP) and the coder (PPMVC) components are streamlined, i.e. immediate display of the (beginning of the) document is possible.

Best XML compression ratios presented so far: with PPMVC, outperforming SCMPPM by 9% on avg; with FastPAQ (alas, impractical), an extra 9% avg gain.
