Some performance numbers

I did some performance measurements early on but got bored when there didn't seem to be a large difference between compressed and uncompressed access.

I now just access everything compressed except when I need to change the structure of the netcdf file (add variables, attributes or dimensions).


The test parameters

The following timing tests were done on a large netcdf file that has one record dimension, 31 record variables and 25,000 or so records. The file has the following sizes.

file type       size (bytes)
uncompressed    2869468
4k               301813
8k               267189
16k              249029
32k              237229
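To put the sizes above in terms of compression ratio, here is a small Python sketch. The numbers are copied from the table; the dictionary and function names are only illustrative and are not part of any netcdf library.

```python
# File sizes in bytes, copied from the table above.
sizes = {
    "uncompressed": 2869468,
    "4k": 301813,
    "8k": 267189,
    "16k": 249029,
    "32k": 237229,
}


def compression_ratios(sizes, baseline="uncompressed"):
    """Return {file type: baseline size / file size} for the non-baseline files."""
    base = sizes[baseline]
    return {name: base / size for name, size in sizes.items() if name != baseline}


ratios = compression_ratios(sizes)
for name, ratio in ratios.items():
    print(f"{name:>3}: {ratio:.1f} times smaller than uncompressed")
```

The ratio improves with each doubling of the block size, but the gain from each step shrinks.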

Uncompressed is the normal netcdf file format. The remaining four files have different block sizes (4, 8, 16 and 32 kbytes) and show the effects of different buffer sizes. The basic idea is that larger buffers allow better compression, reduce the number of buffers necessary to hold the same amount of data (thus reducing the header size), and improve sequential access times. The downside is that larger buffers increase the overhead of getting non-sequential values and take more working memory.

The worst case scenario is randomly grabbing data from any record in the file. This makes it likely that the desired data will not be already uncompressed in the buffered blocks and each record access will have to decompress a full block of data. The following are user and system times for getting four variables from 1000 records at random from the above files. The runs were only made once and will show some variation depending on system status but the trends are clear.

                user     system
uncompressed    0.380    0.276
4k              1.303    0.351
8k              1.933    0.171
16k             3.079    0.157
32k             5.261    0.227

It takes a lot longer to get the data from compressed files, especially with large buffers.

However, most of our access is sequential: we go in and get a block of data. The following timings are for reading four variables from 1000 sequential records in the above files.

                user     system
uncompressed    0.336    0.037
4k              0.407    0.038
8k              0.382    0.034
16k             0.407    0.035
32k             0.393    0.045

No real difference.
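The difference between the two access patterns can be illustrated with a toy model that counts how many blocks have to be decompressed. This is only a sketch under loud assumptions: it pretends that only the most recently used block stays decompressed, and it guesses a record size of about 115 bytes (2869468 bytes / 25000 records). The real library may buffer several blocks and lay records out differently, so the counts show the trend rather than reproduce the timings.

```python
import random

N_RECORDS = 25000    # "25,000 or so records", from the text
RECORD_BYTES = 115   # assumption: roughly 2869468 bytes / 25000 records
N_READS = 1000       # records read per test, from the text


def count_block_loads(record_indices, records_per_block):
    """Count block decompressions, assuming only the last block stays cached."""
    loads = 0
    cached_block = None
    for index in record_indices:
        block = index // records_per_block
        if block != cached_block:
            loads += 1
            cached_block = block
    return loads


rng = random.Random(0)  # fixed seed so repeated runs give the same counts
random_reads = [rng.randrange(N_RECORDS) for _ in range(N_READS)]
sequential_reads = list(range(N_READS))

results = {}
for block_kb in (4, 8, 16, 32):
    per_block = (block_kb * 1024) // RECORD_BYTES
    results[block_kb] = {
        "random": count_block_loads(random_reads, per_block),
        "sequential": count_block_loads(sequential_reads, per_block),
    }
    r = results[block_kb]
    print(f"{block_kb:>2}k blocks: {r['random']:4d} loads random, "
          f"{r['sequential']:3d} loads sequential")
```

In this model nearly every random read decompresses a whole block, so the work grows with the block size, while a sequential read of 1000 records touches only a handful of blocks whatever their size.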


Some Real Numbers

The above numbers are good for tuning special cases and for getting an idea of the trade-offs involved, but the following give a better feel for the performance in real use.

This dataset is the daily climate information for the coop station network in New York.

Create the empty netcdf files (622 files):
uncompressed
6.090u 1.460s 0:14.77 51.1% 0+0k 12+2584io 0pf+0w
compressed 
16.000u 1.850s 0:20.94 85.2% 0+0k 29+2545io 0pf+0w

Fill them with the data:
uncompressed
4176.130u 42.430s 1:10:59.72 99.0% 0+0k 15516+15653io 0pf+0w
compressed
4534.890u 20.200s 1:16:38.89 99.0% 0+0k 16048+3191io 0pf+0w

Update some of the data (necessitating recompressing and moving blocks of data):
uncompressed
702.130u 14.670s 13:00.43 91.8% 0+0k 5592+6506io 0pf+0w
compressed
850.770u 9.960s 15:07.76 94.8% 0+0k 5261+4091io 0pf+0w

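So there is only a small penalty when creating compressed netcdf files: in elapsed time, creating the 622 empty files took about 6 seconds longer, filling them took about 8% longer, and updating them about 16% longer.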
When accessing the data there is effectively no difference.
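The timing lines above look like the output of the csh/tcsh time command (user seconds, system seconds, elapsed time, CPU percentage, then memory, I/O and paging counters). Assuming that format, the following sketch parses the first three fields and compares elapsed times for the create, fill and update steps. The regular expression and helper names are my own and only cover the fields used here.

```python
import re

# Matches the leading "<user>u <system>s <elapsed> <cpu>%" fields of a time line.
TIME_LINE = re.compile(
    r"(?P<user>[\d.]+)u\s+(?P<system>[\d.]+)s\s+(?P<elapsed>[\d:.]+)\s+(?P<cpu>[\d.]+)%"
)


def parse_time_line(line):
    """Return user, system and elapsed seconds from one time output line."""
    match = TIME_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised time line: {line!r}")
    elapsed = 0.0
    for part in match["elapsed"].split(":"):  # handles s, m:s and h:m:s
        elapsed = elapsed * 60 + float(part)
    return {
        "user": float(match["user"]),
        "system": float(match["system"]),
        "elapsed": elapsed,
    }


# (uncompressed, compressed) lines copied from the runs above.
runs = {
    "create": ("6.090u 1.460s 0:14.77 51.1% 0+0k 12+2584io 0pf+0w",
               "16.000u 1.850s 0:20.94 85.2% 0+0k 29+2545io 0pf+0w"),
    "fill": ("4176.130u 42.430s 1:10:59.72 99.0% 0+0k 15516+15653io 0pf+0w",
             "4534.890u 20.200s 1:16:38.89 99.0% 0+0k 16048+3191io 0pf+0w"),
    "update": ("702.130u 14.670s 13:00.43 91.8% 0+0k 5592+6506io 0pf+0w",
               "850.770u 9.960s 15:07.76 94.8% 0+0k 5261+4091io 0pf+0w"),
}

overhead = {}
for step, (plain, packed) in runs.items():
    plain_t = parse_time_line(plain)["elapsed"]
    packed_t = parse_time_line(packed)["elapsed"]
    overhead[step] = packed_t / plain_t - 1.0
    print(f"{step:>6}: {plain_t:8.2f}s -> {packed_t:8.2f}s "
          f"({overhead[step]:+.1%} elapsed)")
```

Elapsed time is used for the comparison because it is what a user waits for; the same function also returns the user and system times if a CPU-only comparison is preferred.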

Opening 5 station files and getting 50 years of TMIN, TMAX, and PRCP:
uncompressed
16.140u 1.160s 0:17.94 96.4% 0+0k 500+25io 0pf+0w
15.840u 1.170s 0:17.12 99.3% 0+0k 5+21io 0pf+0w
compressed
15.910u 0.080s 0:16.47 97.0% 0+0k 101+27io 0pf+0w
15.830u 0.110s 0:15.99 99.6% 0+0k 0+18io 0pf+0w


And now for the whole reason for this exercise

Dataset size:
Uncompressed 236963
Compressed 31063