The la package provides two ways to archive larrys: using archive functions such as save and load and using the dictionary-like interface of the IO class. Both I/O methods store larrys in HDF5 1.8 format and require h5py.
Contents
Not all data types can be saved to a HDF5 archive; see Data types for details.
One method to archive larrys is to use the archive functions (see IO class for a second, more powerful method):
To demonstrate, let’s start by creating a larry:
>>> import la
>>> y = la.larry([1, 2, 3])
Next let’s save the larry, y, in an archive using the save function:
>>> la.save('/tmp/data.hdf5', y, 'y')
The contents of the archive:
>>> la.archive_directory('/tmp/data.hdf5')
['y']
To load the larry we use the load function:
>>> z = la.load('/tmp/data.hdf5', 'y')
The entire larry is loaded from the archive. The load function does not have an option to load parts of a larry, such as a slice. (To load parts of a larrys from the archive, see IO class.)
The name of the larry in save and load statements (and in all the other archive functions) must be a string. But the string may contain one or more forward slashes (‘/’), which is to say that larrys can be archived in a hierarchical structure:
>>> la.save('/tmp/data/hdf5', y, '/experiment/2/y')
>>> z = la.load('/tmp/data/hdf5', '/experiment/2/y')
Instead of passing a filename to the archive functions you can optionally pass a h5py File object:
>>> import h5py
>>> f = h5py.File('/tmp/data.hdf5')
>>> z = la.load(f, 'y')
To check if a larry is in the archive:
>>> la.is_archived_larry(f, 'y')
True
To delete a larry from the archive:
>>> la.delete(f, 'y')
>>> la.is_archived_larry(f, 'y')
False
HDF5 does not keep track of the freespace in an archive across opening and closing of the archive. After repeatedly opening, closing and deleting larrys from the archive, the unused space in the archive may grow. The only way to reclaim the freespace is to repack the archive:
>>> la.repack(f)
To see how much space the archive takes on disk and to see how much freespace is in the archive see IO class.
For further information on the archive functions see Archive function reference.
The IO class provides a dictionary-like interface to the archive.
To demonstrate, let’s start by creating two larrys, a and b:
>>> import la
>>> a = la.larry([1.0, 2.0, 3.0, 4.0])
>>> b = la.larry([[1, 2],[3, 4]])
To work with an archive you need to create an IO object:
>>> io = la.IO('/tmp/data.hdf5')
where /tmp/data.hdf5 is the path to the archive used in this example.
Let’s add (save) two larrys, a and b, to the archive and then list the contents of the archive:
>>> io['a'] = a
>>> io['b'] = b
>>> io
larry dtype shape
----------------------
a float64 (4,)
b int64 (2, 2)
We can get a list of the keys (larrys) in the archive:
>>> io.keys()
['a', 'b']
>>> for key in io: print key
...
a
b
>>> len(io)
2
Are the larrys a (yes) and c (no) in the archive?
>>> 'a' in io
True
>>> 'c' in io
False
>>> list(set(io) & set(['a', 'c']))
['a']
When we load data from the archive using an IO object, we get a lara not a larry:
>>> z = io['a']
>>> type(z)
<class 'la.io.lara'>
Whereas larry stores his data in a numpy array and a list (labels), lara stores her data in a h5py Dataset object and a list (labels). The reason that an IO object returns a lara instead of a larry is that you may want to extract only part of a larry, such as a slice, from the archive.
To convert a lara object into a larry, just index into the lara (the indexing below is the slice [:2]):
>>> z = io['a'][:2]
>>> type(z)
<class 'la.deflarry.larry'>
>>> z
label_0
0
1
x
array([ 1., 2.])
In the example above, only the first two items in the array were loaded from the archive—a feature that comes in handy when you only need a small part of a large larry.
Although the data from a larry is not loaded until you index into the lara, the entire label is always loaded. That allows you to use the labels right away:
>>> z = io['a']
>>> type(z)
<class 'la.io.lara'>
>>> idx = z.labelindex(1, axis=0)
>>> type(z[:idx])
<class 'la.deflarry.larry'>
To delete the larry b from the archive:
>>> del io['b']
HDF5 does not keep track of the freespace in an archive across opening and closing of the archive. After repeatedly opening, closing and deleting larrys from the archive, the unused space in the archive may grow. The only way to reclaim the freespace is to repack the archive:
>>> io.repack()
Repack means to transfer all the larrys to a new archive (with the same name) and delete the old archive.
Before looking at the size of the archive, let’s add some bigger larrys:
>>> import numpy as np
>>> io['rand'] = la.rand(1000, 1000)
>>> io['randn'] = la.randn(1000, 1000)
>>> io
larry dtype shape
----------------------------
a float64 (4,)
rand float64 (1000, 1000)
randn float64 (1000, 1000)
How many MB does the archive occupy on disk?
>>> io.space / 1e6
16.038903999999999 # MB
How much freespace is there?
>>> io.freespace / 1e6
0.0068399999999999997 # MB
Let’s delete randn from the archive and look at the space and freespace:
>>> del io['randn']
>>> io.space / 1e6
16.038903999999999 # MB
>>> io.freespace / 1e6
8.0228400000000004 # MB
So deleting a larry from the the archive does not reduce the size of the archive unless you repack:
>>> io.repack()
>>> io.space / 1e6
8.02224 # MB
>>> io.freespace / 1e6
0.0061760000000000001 # MB
(Sometimes freespace will get reused when saving new larrys to the archive. If any HDF5 users are reading this, could you tell me when freespace is reused and when it is not?)
The IO class takes an optional argument that can be used to automatically repack the archive when the freespace after deleting a larry exceeds a specified amount. The following IO object will repack the archive everytime a delete causes the freespace in the archive to exceed 100 MB:
>>> io = la.IO('/tmp/data.hdf5', max_freespace=100e6)
You can iterate through the keys or the values or the (key, value) pairs of an IO object:
>>> for key, value in io.iteritems():
... print key, value.shape
...
a (4,)
rand (1000, 1000)
The keys (larrys) in an IO object (archive) must be strings. But the string may contain one or more forward slashes (‘/’), which is to say that larrys can be archived in a hierarchical structure:
>>> io['/experiment/2/y'] = la.larry([1, 2, 3])
>>> z = io['/experiment/2/y']
What filename is associated with the archive?
>>> io.filename
'/tmp/data.hdf5'
For further information on the IO class see IO class reference.
There are several limitations of the archiving method used by the la package. In this section we will discuss two limitations:
HDF5 does not keep track of the freespace in an archive across opening and closing of the archive. Therefore, after opening, closing and deleting larrys from the archive, the unused space in the archive may grow. The only way to reclaim the freespace is to repack the archive.
You can use the utility provided by HDF5 to repack the archive or you can use the repack method (see IO class) or function (see Archive functions) in the la package.
A larry can have labels of mixed type, for example strings and numbers. However, when archiving larrys in HDF5 format the labels are converted to Numpy arrays and the elements of a Numpy array must be of the same type. Therefore, to archive a larry the labels along any one dimension must be of the same type and that type must be one that is recognized by h5py and HDF5: strings and scalars. An exception is made for labels with dates of type datetime.date: la automatically converts dates to integers when saving and back to dates when loading.
An archive is contructed from two types of HDF5 objects: Groups and Datasets. Groups can contain Datasets and more Groups. Datasets can contain arrays.
larrys are stored in a HDF5 Group. The name of the group, often referred to in this manual as the key, is the name of the larry. The group is given an attribute called ‘larry’ and assigned the value True. Inside the group are several HDF5 Datasets. For a 2d larry, for example, there are three datasets: one to hold the data (named ‘x’) and two to hold the labels (named ‘0’ and ‘1’). In general, for a nd larry there are n+1 datasets. Each label Dataset is given an attribute called ‘isdate’ which is set to True if all labels along the given axis are dates of type datetime.date; False otherwise. If ‘isdate’ is True then the labels are converted to integers before saving, and converted back to datetime.date object when loading.
This section contains the reference guide to the archive functions, Archive function reference, and the IO class methods, IO class reference.
Save a larry in HDF5 format.
Each larry is stored in a HDF5 group. The group is assigned an attribute named ‘larry’ which is set to True. Inside the group is a HDF5 dataset containing the data (named ‘x’) and one dataset for each dimension of the label (named str(dimension)). For example, a 2d larry named ‘price’ is stored in a group called ‘price’ that contains a dataset called ‘x’ (the price) and two datasets called ‘0’ and ‘1’ (the labels).
Before saving, the labels are converted to Numpy arrays, one array for each dimension. Therefore, to save a larry in HDF5 format, the elements of a label along any one dimension must be of the same type and that type must be supported by HDF5.
If all labels along an axis are dates of type datetime.date, then the dates are converted to integers before saving and the HDF5 Dataset used to store that label is assigned an attribute name ‘isdate’ which is set to True. When loading the larry, the dates will automatically be converted back to datetime.date dates.
Parameters: | file : str or h5py.File
lar : larry
key : str
|
---|
See also
Examples
Create a larry:
>>> x = la.larry([1, 2, 3])
Save the larry:
>>> la.save('/tmp/x.hdf5', x, 'x')
Load a larry from a HDF5 archive.
Each larry is stored in a HDF5 group. The group is assigned an attribute named ‘larry’ which is set to True. Inside the group is a HDF5 dataset containing the data (named ‘x’) and one dataset for each dimension of the label (named str(dimension)). For example, a 2d larry named ‘price’ is stored in a group called ‘price’ that contains a dataset called ‘x’ (the price) and two datasets called ‘0’ and ‘1’ (the labels).
Parameters: | file : str or h5py.File
key : str
|
---|---|
Returns: | out : larry
|
See also
Examples
Create a larry:
>>> x = la.larry([1, 2, 3])
Save the larry:
>>> la.save('/tmp/x.hdf5', x, 'x')
Now load it:
>>> y = la.load('/tmp/x.hdf5', 'x')
Delete a larry from a HDF5 archive.
Parameters: | file : str or h5py.File
key : str
|
---|---|
Returns: | out : None
|
See also
Examples
Create a larry:
>>> x = la.larry([1, 2, 3])
Save the larry:
>>> la.save('/tmp/x.hdf5', x, 'x')
Now delete it:
>>> la.delete('/tmp/x.hdf5', 'x')
Repack archive to remove freespace.
Parameters: | file : h5py File or str
|
---|---|
Returns: | file : h5py File or None
|
Save and load larrys in HDF5 format using a dictionary-like interface.
Methods
Save and load larrys in HDF5 format using a dictionary-like interface.
Dictionaries are made up of (key, value) pairs. In an IO object, a key is the name of a larry. The value part of the dictionary is a larry when saving data and is a lara, a larry-like archive object, when loading data.
(h5py has the same duality. When saving, the values are Numpy arrays; when loading the values are h5py Dataset objects.)
To convert a lara into a larry just index into the lara.
The reason why loading does not return a larry is that you may not want to load the entire larry which could, for example, be very large.
A lara loads the labels but does not load the array data until you index into it.
Each larry is stored in a HDF5 group. The group is assigned an attribute named ‘larry’ which is set to True. Inside the group is a HDF5 dataset containing the data (named ‘x’) and one dataset for each dimension of the label (named str(dimension)). For example, a 2d larry named ‘price’ is stored in a group called ‘price’ that contains a dataset called ‘x’ (the price) and two datasets called ‘0’ and ‘1’ (the labels).
Before saving, the labels are converted to Numpy arrays, one array for each dimension. Therefore, to save a larry in HDF5 format, the elements of a label along any one dimension must be of the same type and that type must be supported by HDF5.
Parameters: | filename : str
max_freespace : scalar
|
---|---|
Returns: | A dictionary-like IO object. : |
See also
Notes
Examples
Save a larry in the archive:
>>> import la
>>> io = la.IO('/tmp/dataset.hdf5')
>>> io['x'] = la.larry([1,2,3]) # <-- Save
Examine the contents of the archive:
>>> io
larry dtype shape
------------------
x int64 (3,)
Overwrite the contents of x in the archive:
>>> io['x'] = la.larry([4.0]) # <-- Overwrite
Load from the archive:
>>> y = io['x'] # <-- Load
>>> type(y)
<class 'la.io.io.lara'>
>>> type(y[:])
<class 'la.deflarry.larry'>
>>> type(y[2:])
<class 'la.deflarry.larry'>
Test if x is in the archive:
>>> 'x' in io
True
>>> del io['x'] # <-- Delete (unlink)
>>> 'x' in io
False
Merge, or optionally update, a larry with a second larry.
See larry.merge for details.
Note: the entire larry is loaded from the archive, merged with lar and then the merged larry is saved back to the archive. The resize function of h5py is not used. In other words, this function might not be practical for very large larrys.