etoolbox.datazip.core

Code for DataZip.

Classes

DataZip

A ZipFile with methods for easier use with Python objects.

Module Contents

class etoolbox.datazip.core.DataZip(file, mode='r', ignore_pd_dtypes=False, *args, ids_for_dedup=True, **kwargs)[source]

Bases: zipfile.ZipFile

A ZipFile with methods for easier use with Python objects.

Create a DataZip.

Parameters:
  • file (str | os.PathLike | io.BytesIO) – Either the path to the file, or a file-like object. If it is a path, the file will be opened and closed by DataZip.

  • mode (Literal['r', 'w']) – The mode can be either read ‘r’, or write ‘w’.

  • recipes – Deprecated.

  • compression – ZIP_STORED (no compression), ZIP_DEFLATED (requires zlib), ZIP_BZIP2 (requires bz2) or ZIP_LZMA (requires lzma).

  • ignore_pd_dtypes – if True, any dtypes stored in a DataZip for pandas.DataFrame columns or pandas.Series will be ignored. This may be useful when using global settings for mode.dtype_backend or mode.use_nullable_dtypes to force the use of pyarrow types.

  • args – additional positional arguments will be passed to zipfile.ZipFile.__init__().

  • ids_for_dedup – If True, multiple references to the same object will not cause the object to be stored multiple times. If False, the object will be stored as many times as it has references. True can save space, but because ids are not unique for objects with non-overlapping lifetimes, setting it to True can result in subsequent new objects NOT being stored because they share an id with an earlier object; see reset_ids().

  • kwargs – keyword arguments will be passed to zipfile.ZipFile.__init__().

Examples

First we can create a DataZip. In this case we are using a buffer (io.BytesIO) for convenience. In most cases though, file would be a pathlib.Path or str that represents a file. In these cases a .zip extension will be added if it is not there already.

>>> from io import BytesIO
>>> import pandas as pd
>>> from etoolbox.datazip.core import DataZip
>>> buffer = BytesIO()  # can also be a file-like object
>>> with DataZip(file=buffer, mode="w") as z0:
...     z0["df"] = pd.DataFrame({(0, "a"): [2.4, 8.9], (0, "b"): [3.5, 6.2]})
...     z0["foo"] = {
...         "a": (1, (2, {3})),
...         "b": frozenset({1.5, 3}),
...         "c": 0.9 + 0.2j,
...     }

Getting items from DataZip, like setting them, uses standard Python subscripting.

pandas.DataFrame objects are stored as parquet, and pandas.MultiIndex columns are preserved, even when they cannot normally be stored in a parquet file.

>>> with DataZip(buffer, "r") as z1:
...     z1["df"]
     0
     a    b
0  2.4  3.5
1  8.9  6.2

While it is always preferable to use a context manager as above, here it is more convenient to keep the object open. Even more unusual types that cannot normally be stored in JSON should work.

>>> z1 = DataZip(buffer, "r")
>>> z1["foo"]
{'a': (1, (2, {3})), 'b': frozenset({1.5, 3}), 'c': (0.9+0.2j)}

Checking to see if an item is in a DataZip uses standard Python syntax.

>>> "df" in z1
True

You can also check by filename, and check the number of items.

>>> "df.parquet" in z1
True
>>> len(z1)
2

When not used with a context manager, DataZip should close itself automatically, but it is not a bad idea to make sure.

>>> z1.close()

A DataZip is a write-once, read-many affair because of the way zip files work. Appending to a DataZip can be done with the DataZip.replace() method.

>>> buffer1 = BytesIO()
>>> with DataZip.replace(buffer1, buffer, foo=5, bar=6) as z:
...     z["new"] = "foo"
...     z["foo"]
5
static dump(obj, file, **kwargs)[source]

Write the DataZip representation of obj to file.

Parameters:
  • obj (Any) – A Python object; it must implement __getstate__ and __setstate__. There are other restrictions; in particular, if it contains instances of other custom Python objects, it may be enough for all of them to implement __getstate__ and __setstate__.

  • file (pathlib.Path | str | io.BytesIO) – the path to the file, or a file-like object or buffer, where the DataZip will be saved.

  • kwargs – keyword arguments will be passed to DataZip.

Returns:

None

Return type:

None

Examples

Create an object that you would like to save as a DataZip.

>>> from etoolbox.datazip._test_classes import _TestKlass
>>> obj = _TestKlass(a=5, b={"c": [2, 3.5]})
>>> obj
_TestKlass(a=5, b={'c': [2, 3.5]})

Save the object as a DataZip.

>>> from io import BytesIO
>>> buffer = BytesIO()
>>> DataZip.dump(obj, buffer)
>>> del obj

Get it back.

>>> obj = DataZip.load(buffer)
>>> obj
_TestKlass(a=5, b={'c': [2, 3.5]})
static load(file, klass=None)[source]

Return the reconstituted object specified in the file.

Parameters:
  • file (pathlib.Path | str | io.BytesIO) – the path to the file, or a file-like object or buffer, from which the DataZip will be read.

  • klass (type | None) – (Optional) allows passing the class when it is known; this is handy when it is not possible to import the module that defines the class that file represents.

Returns:

Object from DataZip.

Return type:

Any

Examples

See DataZip.dump() for examples.

classmethod replace(file_or_new_buffer, old_buffer=None, save_old=False, iterwrap=None, **kwargs)[source]

Replace an old DataZip with an editable new one.

Note: Data and keys that are copied over by this function cannot be reliably mutated; kwargs must be used to replace the data associated with keys that exist in the old DataZip.

Parameters:
  • file_or_new_buffer – Either the path to the file to be replaced or the new buffer.

  • old_buffer – only required if file_or_new_buffer is a buffer.

  • save_old – if True, the old DataZip will be saved with “_old” appended; if False, it will be deleted when the new DataZip is closed.

  • iterwrap – this will be used to wrap the iterator that handles copying data to the new DataZip to enable a progress bar, e.g. tqdm.

  • kwargs – data that should be written into the new DataZip; for any keys that were also in the old DataZip, the new value provided here will be used instead.

Returns:

New editable DataZip with old data copied into it.

Examples

Create a new test file and put a DataZip in it.

>>> from pathlib import Path
>>> import pandas as pd
>>> from etoolbox.datazip.core import DataZip
>>> file = Path.home() / "test.zip"
>>> with DataZip(file=file, mode="w") as z0:
...     z0["series"] = pd.Series([1, 2, 4], name="series")

Create a replacement DataZip.

>>> z1 = DataZip.replace(file, save_old=False)

The replacement has the old content.

>>> z1["series"]
0    1
1    2
2    4
Name: series, dtype: int64

We can also now add to it.

>>> z1["foo"] = "bar"

While the replacement is open, the old version still exists.

>>> (Path.home() / "test_old.zip").exists()
True

Now we close the replacement which deletes the old file.

>>> z1.close()
>>> (Path.home() / "test_old.zip").exists()
False

Reopening the replacement, we see it contains all the objects.

>>> z2 = DataZip(file, "r")
>>> z2["series"]
0    1
1    2
2    4
Name: series, dtype: int64
>>> z2["foo"]
'bar'

And now some final test cleanup.

>>> z2.close()
>>> file.unlink()
close()[source]

Close the file, and for mode ‘w’ write attributes and metadata.

Return type:

None

get(key, default=None)[source]

Retrieve an item if it is there otherwise return default.

Parameters:

key (str)

Return type:

etoolbox.datazip._types.DZable

reset_ids()[source]

Reset the internal record of stored ids.

Because ‘two objects with non-overlapping lifetimes may have the same id() value’, it can be useful to reset the set of seen ids when you are adding objects with non-overlapping lifetimes.

See id().

Return type:

None

items()[source]

Lazily read name/value pairs from a DataZip.

Return type:

collections.abc.Generator[tuple[str, etoolbox.datazip._types.DZable]]

keys()[source]

Set of names in the DataZip, as if it were a MutableMapping.

Return type:

collections.abc.KeysView

read_dfs()[source]

Read all dfs lazily.

DeprecationWarning

read_dfs will be removed in a future version, use DataZip.items().

Return type:

collections.abc.Generator[tuple[str, pandas.DataFrame | pandas.Series]]

writed(name, data)[source]

Write a dict, DataFrame, str, or certain other objects to name.

DeprecationWarning

writed will be removed in a future version, use self[key] = data.

Parameters:
  • name (str)

  • data (Any)