I now just access everything compressed except when I need to change the structure of the netcdf file (add variables, attributes, or dimensions).
| file type | size (bytes) |
|---|---|
| uncompressed | 2869468 |
| 4k | 301813 |
| 8k | 267189 |
| 16k | 249029 |
| 32k | 237229 |
Uncompressed is the normal netcdf file format. The remaining four files use different block sizes (4, 8, 16, and 32 kbytes) and show the effects of different buffers; the compression ratio rises from roughly 9.5:1 at 4k to roughly 12.1:1 at 32k. The basic idea is that larger buffers allow better compression, reduce the number of buffers needed to hold the same amount of data (thus reducing the header size), and improve sequential access times. The downside is that larger buffers increase the overhead of getting non-sequential values and take more working memory.
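
The block-size trade-off can be illustrated with a small sketch. This is not the code used to produce the files above: it assumes zlib as the compressor, uses a synthetic, repetitive byte string in place of real netcdf records, and the helper name `compress_in_blocks` is hypothetical.

```python
import zlib


def compress_in_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split data into fixed-size blocks and compress each block independently."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


# Synthetic stand-in for record data (an assumption, not the real dataset).
record = b"station=NY0001 tmax=0071 tmin=0052 precip=0000\n"
data = record * 20000

for kbytes in (4, 8, 16, 32):
    blocks = compress_in_blocks(data, kbytes * 1024)
    total = sum(len(block) for block in blocks)
    # Fewer, larger blocks mean fewer index entries and less per-block overhead.
    print(f"{kbytes:>2}k: {len(blocks):>4} blocks, {total} compressed bytes")
```

Because every block is compressed on its own, each one can later be decompressed without touching the others, which is what makes record access inside a compressed file possible at all.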
The worst-case scenario is randomly grabbing data from any record in the file. This makes it likely that the desired data will not already be uncompressed in the buffered blocks, so each record access has to decompress a full block of data. The following are user and system times for getting four variables from 1000 records at random from the above files. The runs were made only once and will show some variation depending on system status, but the trends are clear.
| file type | user (s) | system (s) |
|---|---|---|
| uncompressed | 0.380 | 0.276 |
| 4k | 1.303 | 0.351 |
| 8k | 1.933 | 0.171 |
| 16k | 3.079 | 0.157 |
| 32k | 5.261 | 0.227 |
It takes a lot longer to get the data from compressed files, especially with large buffers: user time is about 3.4 times the uncompressed figure at 4k and nearly 14 times at 32k.
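
The cost of random access can be sketched as follows. This is an illustration under assumptions, not the library's implementation: it uses zlib, a hypothetical fixed record length `RECORD_LEN`, synthetic records, and hypothetical helpers `make_blocks` and `read_record`. No block is kept between requests, which models the worst case described above.

```python
import random
import zlib

RECORD_LEN = 64  # hypothetical fixed record length, in bytes


def make_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Compress data as independent fixed-size blocks."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_record(blocks: list[bytes], block_size: int, index: int) -> tuple[bytes, int]:
    """Return one record and the number of bytes inflated to reach it."""
    block_no, offset = divmod(index * RECORD_LEN, block_size)
    block = zlib.decompress(blocks[block_no])  # the whole block, for one record
    return block[offset:offset + RECORD_LEN], len(block)


# 5000 synthetic 64-byte records (63 digits plus a newline each).
data = b"".join(f"{i:063d}\n".encode() for i in range(5000))
rng = random.Random(0)
wanted = [rng.randrange(5000) for _ in range(1000)]

for kbytes in (4, 8, 16, 32):
    block_size = kbytes * 1024
    blocks = make_blocks(data, block_size)
    inflated = sum(read_record(blocks, block_size, i)[1] for i in wanted)
    print(f"{kbytes:>2}k: {inflated} bytes inflated for {len(wanted)} random reads")
```

The number of bytes inflated per request grows with the block size, which is consistent with the rise in user time from 4k to 32k.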
However, most of our access is sequential: we go in and get a block of data. The following timings are for reading four variables from 1000 sequential records from the above files.
| file type | user (s) | system (s) |
|---|---|---|
| uncompressed | 0.336 | 0.037 |
| 4k | 0.407 | 0.038 |
| 8k | 0.382 | 0.034 |
| 16k | 0.407 | 0.035 |
| 32k | 0.393 | 0.045 |
No real difference.
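
A sketch of why sequential reads stay cheap: if the most recently decompressed block is kept in a buffer, consecutive records are served from it and each block is inflated only once. The class below is hypothetical and assumes zlib, a fixed record length, and a single-block buffer; the actual buffering scheme may differ.

```python
import zlib

RECORD_LEN = 64  # hypothetical fixed record length, in bytes


class CachedBlockReader:
    """Reads records from block-compressed data, keeping one block buffered."""

    def __init__(self, data: bytes, block_size: int) -> None:
        self.block_size = block_size
        self.blocks = [
            zlib.compress(data[start:start + block_size])
            for start in range(0, len(data), block_size)
        ]
        self.cached_no = None    # index of the block currently buffered
        self.cached = b""        # its decompressed contents
        self.decompressions = 0  # how many blocks have been inflated so far

    def read_record(self, index: int) -> bytes:
        block_no, offset = divmod(index * RECORD_LEN, self.block_size)
        if block_no != self.cached_no:  # buffer miss: inflate a new block
            self.cached = zlib.decompress(self.blocks[block_no])
            self.cached_no = block_no
            self.decompressions += 1
        return self.cached[offset:offset + RECORD_LEN]


# 5000 synthetic 64-byte records (63 digits plus a newline each).
data = b"".join(f"{i:063d}\n".encode() for i in range(5000))

for kbytes in (4, 32):
    reader = CachedBlockReader(data, kbytes * 1024)
    for i in range(1000):  # 1000 sequential records
        reader.read_record(i)
    print(f"{kbytes:>2}k: {reader.decompressions} blocks inflated")
```

In this sketch every byte of the 1000 records is inflated roughly once regardless of block size, so the total decompression work is nearly the same for 4k and 32k blocks.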
This dataset is the daily climate information for the coop station network in New York.
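| file type | user | system | elapsed | CPU | memory | I/O | faults |
|---|---|---|---|---|---|---|---|
| uncompressed | 6.090u | 1.460s | 0:14.77 | 51.1% | 0+0k | 12+2584io | 0pf+0w |
| compressed | 16.000u | 1.850s | 0:20.94 | 85.2% | 0+0k | 29+2545io | 0pf+0w |

| file type | user | system | elapsed | CPU | memory | I/O | faults |
|---|---|---|---|---|---|---|---|
| uncompressed | 4176.130u | 42.430s | 1:10:59.72 | 99.0% | 0+0k | 15516+15653io | 0pf+0w |
| compressed | 4534.890u | 20.200s | 1:16:38.89 | 99.0% | 0+0k | 16048+3191io | 0pf+0w |

| file type | user | system | elapsed | CPU | memory | I/O | faults |
|---|---|---|---|---|---|---|---|
| uncompressed | 702.130u | 14.670s | 13:00.43 | 91.8% | 0+0k | 5592+6506io | 0pf+0w |
| compressed | 850.770u | 9.960s | 15:07.76 | 94.8% | 0+0k | 5261+4091io | 0pf+0w |
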
So there is only a small penalty when creating compressed netcdf files.
When accessing the data there is effectively no difference.
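| file type | user | system | elapsed | CPU | memory | I/O | faults |
|---|---|---|---|---|---|---|---|
| uncompressed | 16.140u | 1.160s | 0:17.94 | 96.4% | 0+0k | 500+25io | 0pf+0w |
| uncompressed | 15.840u | 1.170s | 0:17.12 | 99.3% | 0+0k | 5+21io | 0pf+0w |
| compressed | 15.910u | 0.080s | 0:16.47 | 97.0% | 0+0k | 101+27io | 0pf+0w |
| compressed | 15.830u | 0.110s | 0:15.99 | 99.6% | 0+0k | 0+18io | 0pf+0w |
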
| file type | size |
|---|---|
| Uncompressed | 236963 |
| Compressed | 31063 |