Using NetCDF4 Compression with CDMS

CDMS2 writes out data using the NetCDF library.

NetCDF4 allows for file compression. A good blog post about NetCDF4 and compression can be found here.

From this blog:

"The netCDF-4 libraries inherit the capability for data compression from the HDF5 storage layer underneath the netCDF-4 interface. Linking a program that uses netCDF to a netCDF-4 library allows the program to read compressed data without changing a single line of the program source code."

and

"Also, we're only dealing with lossless compression"

This Notebook shows how to control NetCDF4 compression (shuffling/deflating) capabilities via cdms2.

You can download the Notebook here

Preparing The Notebook

The easiest way to look at the contents of a NetCDF file is to use ncdump. The following class wraps the ncdump call so it can be run from Python and displayed in the Notebook.

We also prepare some random data.

In [1]:
from __future__ import print_function
import subprocess
import shlex
import numpy
import os
import io
import time

# Get file size
def size_it(filename):
    statinfo = os.stat(filename)
    return statinfo.st_size

# Write and return time
def dump(data,filename="example.nc"):
    start = time.time()
    f = cdms2.open(filename,"w")
    f.write(data,id="data")
    f.close()
    return time.time()-start,size_it(filename)

class HTML(object):
    def __init__(self,html):
        self.html = html
    def _repr_html_(self):
        return self.html


# Nice html output for ncdump
class NCINFO(object):
    def __init__(self, filename, variable=None, options=""):
        self.filename = filename
        self.variable = variable
        self.options = options
    def _repr_html_(self):
        out = self.nc_info()
        lines = []
        for l in out.split("\n"):
            for kw in ["chunk","deflate","classic","netcdf4","netcdf-4"]:
                if l.lower().find(kw)>-1:
                    l = "<b>{0}</b>".format(l)
            lines.append(l.replace("\t","&emsp;&emsp;"))
        return "{0}".format("<br>".join(lines))
    def nc_info(self):
        """Calls ncdump on the file.
        Can pass a variable or optional ncdump arguments.
        Default call: `ncdump -hs filename`"""
        ncdumpOptions = "-hs {options}".format(options=self.options)
        if self.variable is not None:
            # Leading space keeps "-v" separate from the previous option
            ncdumpOptions += " -v {variable}".format(variable=self.variable)
        cmd = "ncdump {options} {file}".format(options=ncdumpOptions, file=self.filename)
        p = subprocess.Popen(shlex.split(cmd),stdout=subprocess.PIPE,stderr=subprocess.PIPE)
        o, e = p.communicate()
        # Decode the bytes from ncdump so the pieces join as text on Python 2 and 3
        lines = ["Runnning {0}".format(cmd),
                 "-------",
                 o.decode("utf-8"),
                 "-------",
                 "File Size {0} bytes".format(size_it(self.filename))]
        return "\n".join(lines)
        
import requests
def download(fnm):
    r = requests.get("https://uvcdat.llnl.gov/cdat/sample_data/%s" % fnm,stream=True)
    with open(fnm,"wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # skip keep-alive chunks, which are empty
                f.write(chunk)

download("clt.nc")
data = numpy.random.random((120,180,360))
# Random data do not compress well at all, so switch to 0/1 values
data = numpy.greater(data,.5).astype(float)

Default Settings

By default, cdms2 writes data in NetCDF4 classic format with no shuffling and a deflate level of 1.

To query the NetCDF settings used when writing data, use the following commands:

In [2]:
import cdms2
print("NetCDF4? ",cdms2.getNetcdf4Flag())
print("NetCDF Classic?",cdms2.getNetcdfClassicFlag())
print("NetCDF4 Shuffling",cdms2.getNetcdfShuffleFlag())
print("NetCDF4 Deflate?",cdms2.getNetcdfDeflateFlag())
print("NetCDF4 Deflate Level?",cdms2.getNetcdfDeflateLevelFlag())
NetCDF4?  1
NetCDF Classic? 1
NetCDF4 Shuffling 0
NetCDF4 Deflate? 1
NetCDF4 Deflate Level? 1

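These values are read at the time you open the file for writing.

Note the bold lines in the output: NCINFO highlights every line that mentions chunking, deflate, or the NetCDF4/classic format.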

In [3]:
dump(data)
NCINFO("example.nc")
/Users/doutriaux1/anaconda2/envs/2.12/lib/python2.7/site-packages/cdms2/dataset.py:1967: Warning: Files are written with compression and shuffling
You can query different values of compression using the functions:
cdms2.getNetcdfShuffleFlag() returning 1 if shuffling is enabled, 0 otherwise
cdms2.getNetcdfDeflateFlag() returning 1 if deflate is used, 0 otherwise
cdms2.getNetcdfDeflateLevelFlag() returning the level of compression for the deflate method

If you want to turn that off or set different values of compression use the functions:
value = 0
cdms2.setNetcdfShuffleFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateLevelFlag(value) ## where value is a integer between 0 and 9 included

Turning all values to 0 will produce NetCDF3 Classic files
To Force NetCDF4 output with classic format and no compressing use:
cdms2.setNetcdf4Flag(1)
NetCDF4 file with no shuffling or deflate and noclassic will be open for parallel i/o
  "for parallel i/o", Warning)
Out[3]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "chunked" ;
    axis_0:_ChunkSizes = 120 ;
    axis_0:_DeflateLevel = 1 ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "chunked" ;
    axis_1:_ChunkSizes = 180 ;
    axis_1:_DeflateLevel = 1 ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "chunked" ;
    axis_2:_ChunkSizes = 360 ;
    axis_2:_DeflateLevel = 1 ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "chunked" ;
    data:_ChunkSizes = 40, 60, 120 ;
    data:_DeflateLevel = 1 ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4 classic model" ;
}

-------
File Size 4144432 bytes
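
Because these flags are module-wide settings that are read when a file is opened for writing, it can be convenient to change them for one block of code and restore them afterwards. The following is a minimal sketch of that idea, not part of cdms2: the helper `netcdf_settings` and the stand-in class `FakeCdms2` are hypothetical names introduced here so the example runs without cdms2 installed. Only the getter and setter names are taken from this Notebook, and the sketch assumes the setters do not affect one another.

```python
from contextlib import contextmanager


class FakeCdms2(object):
    """Hypothetical stand-in for the cdms2 module, for illustration only.
    It mimics just the getter/setter names shown in this Notebook."""

    def __init__(self):
        self.shuffle, self.deflate, self.level = 0, 1, 1

    def getNetcdfShuffleFlag(self):
        return self.shuffle

    def getNetcdfDeflateFlag(self):
        return self.deflate

    def getNetcdfDeflateLevelFlag(self):
        return self.level

    def setNetcdfShuffleFlag(self, value):
        self.shuffle = value

    def setNetcdfDeflateFlag(self, value):
        self.deflate = value

    def setNetcdfDeflateLevelFlag(self, value):
        self.level = value


@contextmanager
def netcdf_settings(mod, shuffle, deflate, level):
    """Temporarily set the compression flags on `mod`, then restore them."""
    saved = (mod.getNetcdfShuffleFlag(),
             mod.getNetcdfDeflateFlag(),
             mod.getNetcdfDeflateLevelFlag())
    mod.setNetcdfShuffleFlag(shuffle)
    mod.setNetcdfDeflateFlag(deflate)
    mod.setNetcdfDeflateLevelFlag(level)
    try:
        yield mod
    finally:
        # Restore the previous values even if writing raised an error
        mod.setNetcdfShuffleFlag(saved[0])
        mod.setNetcdfDeflateFlag(saved[1])
        mod.setNetcdfDeflateLevelFlag(saved[2])


fake = FakeCdms2()
with netcdf_settings(fake, shuffle=1, deflate=1, level=5):
    # Files opened for writing here would pick up these values
    inside = (fake.getNetcdfShuffleFlag(),
              fake.getNetcdfDeflateFlag(),
              fake.getNetcdfDeflateLevelFlag())
after = (fake.getNetcdfShuffleFlag(),
         fake.getNetcdfDeflateFlag(),
         fake.getNetcdfDeflateLevelFlag())
print(inside, after)
```

In a real session one would pass the imported `cdms2` module instead of `FakeCdms2()` and open the output file inside the `with` block.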

Turning Off Compression

We can turn compression off by running:

In [4]:
value = 0
cdms2.setNetcdfShuffleFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateLevelFlag(value) ## where value is an integer between 0 and 9 inclusive
dump(data)
NCINFO("example.nc")
Out[4]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "contiguous" ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "contiguous" ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "contiguous" ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "contiguous" ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4 classic model" ;
}

-------
File Size 62222804 bytes
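
As a sanity check on these numbers, the uncompressed size can be predicted from the array shape. The short calculation below uses only figures already shown in this Notebook (the 120x180x360 shape, 8-byte doubles, and the two reported file sizes). The variable names are illustrative.

```python
shape = (120, 180, 360)
bytes_per_double = 8

# Payload of the "data" variable and of the three axis variables
data_bytes = shape[0] * shape[1] * shape[2] * bytes_per_double
axes_bytes = sum(shape) * bytes_per_double

uncompressed_file = 62222804   # file size with compression turned off
deflated_file = 4144432        # file size with the default deflate level of 1

# Whatever is left over is header and metadata written by the library
overhead = uncompressed_file - data_bytes - axes_bytes
ratio = uncompressed_file / float(deflated_file)

print("data payload:", data_bytes, "bytes")
print("metadata overhead:", overhead, "bytes")
print("compression ratio: {0:.1f}".format(ratio))
```

The 0/1 test data is highly repetitive, which is why deflate shrinks it so much. Real fields usually compress far less, as the summary at the end of this Notebook shows.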

Pure NetCDF3

All of these options can be set to 0 to produce NetCDF3 files (as the warning above shows). Alternatively, a single command does the same:

In [5]:
cdms2.useNetcdf3()
# or for versions earlier than 2.12.2017.10.25
value = 0
cdms2.setNetcdfShuffleFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateLevelFlag(value) ## where value is an integer between 0 and 9 inclusive
cdms2.setNetcdf4Flag(0)
dump(data)
NCINFO("example.nc")
Out[5]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
  double axis_1(axis_1) ;
  double axis_2(axis_2) ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_Format = "64-bit offset" ;
}

-------
File Size 62213640 bytes
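
When ncdump is not at hand, the on-disk format can also be guessed from the first bytes of the file. The sketch below relies on the published file signatures: netCDF-3 files begin with `CDF` followed by a version byte, and netCDF-4 files are HDF5 containers that begin with the HDF5 signature. The function name `netcdf_kind` is hypothetical, and the sketch only checks offset 0, so an HDF5 file with a user block in front would be reported as unknown. It also cannot tell "netCDF-4 classic model" from plain "netCDF-4", since both share the HDF5 signature.

```python
import os
import tempfile

# (signature, label) pairs checked against the start of the file
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "netCDF-4 (HDF5 container)"),
    (b"CDF\x01", "netCDF-3 classic"),
    (b"CDF\x02", "netCDF-3 64-bit offset"),
    (b"CDF\x05", "netCDF-3 64-bit data (CDF-5)"),
]


def netcdf_kind(path):
    """Return a label for the file format based on its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


# Demonstration on a dummy file that carries only the signature bytes
tmpdir = tempfile.mkdtemp()
fake_nc3 = os.path.join(tmpdir, "fake_nc3.nc")
with open(fake_nc3, "wb") as f:
    f.write(b"CDF\x02" + b"\x00" * 28)
print(netcdf_kind(fake_nc3))
```

Calling `netcdf_kind("example.nc")` after the cell above should agree with the `_Format = "64-bit offset"` line that ncdump reports.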

NetCDF4 Non-Classic

We can also turn off the classic option for NetCDF4.

In [6]:
cdms2.setNetcdf4Flag(1)
cdms2.setNetcdfClassicFlag(0)
dump(data)
NCINFO("example.nc")
Out[6]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "contiguous" ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "contiguous" ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "contiguous" ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "contiguous" ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4" ;
}

-------
File Size 62222745 bytes

Using Shuffling

We can turn shuffling on or off.

In [7]:
cdms2.setNetcdf4Flag(1)
cdms2.setNetcdfClassicFlag(0)
cdms2.setNetcdfShuffleFlag(1)
dump(data)
NCINFO("example.nc")
Out[7]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "chunked" ;
    axis_0:_ChunkSizes = 120 ;
    axis_0:_Shuffle = "true" ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "chunked" ;
    axis_1:_ChunkSizes = 180 ;
    axis_1:_Shuffle = "true" ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "chunked" ;
    axis_2:_ChunkSizes = 360 ;
    axis_2:_Shuffle = "true" ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "chunked" ;
    data:_ChunkSizes = 40, 60, 120 ;
    data:_Shuffle = "true" ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4" ;
}

-------
File Size 62231714 bytes
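
Note that the file is no smaller than before (62231714 bytes, against 62222745 without shuffling). Shuffling only reorders bytes so that a later deflate pass can find longer runs. On its own it saves nothing, and the chunked layout adds a little overhead. The following sketch illustrates the idea with the Python standard library rather than with the HDF5 filter itself. The function names and the sample values are illustrative assumptions.

```python
import struct
import zlib


def shuffle(raw, itemsize):
    """Gather byte 0 of every item, then byte 1 of every item, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def unshuffle(shuffled, itemsize):
    """Inverse of shuffle(): put every byte back in its original position."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * n:(i + 1) * n]
    return bytes(out)


# Slowly varying doubles, loosely resembling a smooth geophysical field
values = [20.0 + 0.001 * i for i in range(10000)]
raw = struct.pack("<%dd" % len(values), *values)

reordered = shuffle(raw, 8)
plain = zlib.compress(raw, 1)
shuffled = zlib.compress(reordered, 1)

# Shuffling is a pure reordering: same length, and fully reversible
restored = unshuffle(zlib.decompress(shuffled), 8)

print("raw bytes:          ", len(raw))
print("shuffled, no deflate:", len(reordered))
print("deflate only:       ", len(plain))
print("shuffle + deflate:  ", len(shuffled))
```

Whether shuffling helps depends on the data, so it is worth comparing both settings on a representative file.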

Controlling Deflate Level

We can choose our deflate level; higher levels cost more time.

In [8]:
cdms2.setNetcdfShuffleFlag(0)
cdms2.setNetcdfDeflateFlag(1)
cdms2.setNetcdfDeflateLevelFlag(5)
dump(data)
NCINFO("example.nc")
Out[8]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "chunked" ;
    axis_0:_ChunkSizes = 120 ;
    axis_0:_DeflateLevel = 5 ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "chunked" ;
    axis_1:_ChunkSizes = 180 ;
    axis_1:_DeflateLevel = 5 ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "chunked" ;
    axis_2:_ChunkSizes = 360 ;
    axis_2:_DeflateLevel = 5 ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "chunked" ;
    data:_ChunkSizes = 40, 60, 120 ;
    data:_DeflateLevel = 5 ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4" ;
}

-------
File Size 2772118 bytes
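
The time/size trade-off can be explored without writing any NetCDF file, since deflate is the same algorithm that the standard `zlib` module implements. The sketch below compresses 0/1 doubles similar to the test data of this Notebook at a few levels. The sample size and the chosen levels are arbitrary, and the timings will differ from those of the NetCDF library, which compresses chunk by chunk.

```python
import random
import struct
import time
import zlib

random.seed(0)
# 0/1 values stored as 8-byte doubles, like the Notebook's test data
values = [float(random.random() > 0.5) for _ in range(200000)]
raw = struct.pack("<%dd" % len(values), *values)

results = {}
for level in (1, 5, 9):
    start = time.time()
    packed = zlib.compress(raw, level)
    elapsed = time.time() - start
    # Deflate is lossless: decompressing gives back the exact bytes
    lossless = zlib.decompress(packed) == raw
    results[level] = (elapsed, len(packed), lossless)
    print("level {0}: {1:.3f} s, {2} bytes".format(level, elapsed, len(packed)))
```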

Summarizing All Options

Let's try with a real-life example.

In [9]:
f=cdms2.open("clt.nc")
clt = f("clt")

html = "<table border='2'><tr><th>Deflate Level</th><th>NC3</th><th>NC4 Classic no shuffle</th><th>NC4 Classic shuffled</th><th>NC4 no shuffle</th><th>NC4 shuffled</th></tr>"

def addCell():
    t,s = dump(clt)
    return "<td align='center'>{:.2f}/{:d}</td>".format(t,s)

def nc4s():
    out = ""
    for classic in [1,0]:
        cdms2.setNetcdfClassicFlag(classic)
        for shuffle in [0,1]:
            cdms2.setNetcdfShuffleFlag(shuffle)
            out+=addCell()
    out+="</tr>"
    return out

# NetCDF3
html+="<tr><td align='center'>0</td>"
cdms2.useNetcdf3()
cdms2.setNetcdf4Flag(0)
html+=addCell()
cdms2.setNetcdf4Flag(1)
html+=nc4s()
cdms2.setNetcdfDeflateFlag(1)
for i in range(1,10):
    cdms2.setNetcdfDeflateLevelFlag(i)
    html += "<tr><td align='center'>{0}</td><td align='center'>N/A</td>".format(i)
    html += nc4s()
html+="<caption>Time To Write NetCDF File and size for various NC4 settings</caption></table>"
HTML(html)
Out[9]:
Deflate Level   NC3            NC4 Classic no shuffle   NC4 Classic shuffled   NC4 no shuffle   NC4 shuffled
0               0.01/1625421   0.01/1625480             0.02/1633154           0.01/1625421     0.02/1633095
1               N/A            0.09/1201207             0.07/1227851           0.09/1201148     0.07/1227792
2               N/A            0.09/1200573             0.07/1224007           0.09/1200514     0.07/1223948
3               N/A            0.09/1200473             0.07/1220387           0.09/1200414     0.07/1220328
4               N/A            0.10/1206454             0.07/1218271           0.09/1206395     0.08/1218212
5               N/A            0.10/1206194             0.08/1215442           0.10/1206135     0.08/1215383
6               N/A            0.10/1206063             0.10/1213465           0.10/1206004     0.09/1213406
7               N/A            0.10/1206007             0.09/1212825           0.10/1205948     0.09/1212766
8               N/A            0.10/1205990             0.15/1211920           0.10/1205931     0.14/1211861
9               N/A            0.11/1205990             0.19/1211561           0.11/1205931     0.18/1211502

Time To Write NetCDF File and size for various NC4 settings (each cell: time in seconds / size in bytes)
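
To read the table at a glance, one can compare a few of its entries against the NetCDF3 size. The snippet below copies the sizes for deflate levels 1, 3 and 9 from the two non-classic NC4 columns. The selection of rows is arbitrary and the variable names are illustrative.

```python
nc3_size = 1625421  # bytes, NC3 column

# (deflate level, shuffle setting) -> file size in bytes, non-classic NC4
nc4_sizes = {
    (1, "no shuffle"): 1201148, (1, "shuffled"): 1227792,
    (3, "no shuffle"): 1200414, (3, "shuffled"): 1220328,
    (9, "no shuffle"): 1205931, (9, "shuffled"): 1211502,
}

# Pick the setting that produced the smallest file
best = min(nc4_sizes, key=nc4_sizes.get)
saving = 100.0 * (1 - nc4_sizes[best] / float(nc3_size))

print("Smallest file: deflate level {0}, {1}".format(*best))
print("Saved {0:.1f}% relative to NC3".format(saving))
```

For this dataset, shuffling made the files slightly larger and deflate levels above 3 brought no further gain, while the write time roughly doubled between level 1 and level 9 with shuffling on.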