etoolbox.datazip.core¶
Code for DataZip.
Classes¶
DataZip | A ZipFile with methods for easier use with Python objects.
Module Contents¶
- class etoolbox.datazip.core.DataZip(file, mode='r', ignore_pd_dtypes=False, *args, ids_for_dedup=True, **kwargs)[source]¶
Bases: zipfile.ZipFile

A ZipFile with methods for easier use with Python objects.

Create a DataZip.
- Parameters:
  file (str | os.PathLike | io.BytesIO) – Either the path to the file, or a file-like object. If it is a path, the file will be opened and closed by DataZip.
  mode (Literal['r', 'w']) – The mode can be either read 'r', or write 'w'.
  recipes – Deprecated.
  compression – ZIP_STORED (no compression), ZIP_DEFLATED (requires zlib), ZIP_BZIP2 (requires bz2) or ZIP_LZMA (requires lzma).
  ignore_pd_dtypes – If True, any dtypes stored in a DataZip for pandas.DataFrame columns or pandas.Series will be ignored. This may be useful when using global settings for mode.dtype_backend or mode.use_nullable_dtypes to force the use of pyarrow types.
  args – Additional positional arguments will be passed to zipfile.ZipFile.__init__().
  ids_for_dedup – If True, multiple references to the same object will not cause the object to be stored multiple times. If False, the object will be stored as many times as it has references. True can save space, but because ids are not unique for objects with non-overlapping lifetimes, setting it to True can result in subsequent new objects NOT being stored because they share an id with an earlier object.
  kwargs – Keyword arguments will be passed to zipfile.ZipFile.__init__().
Examples
First we can create a DataZip. In this case we are using a buffer (io.BytesIO) for convenience. In most cases though, file would be a pathlib.Path or str that represents a file. In these cases a .zip extension will be added if it is not there already.

>>> buffer = BytesIO()  # can also be a file-like object
>>> with DataZip(file=buffer, mode="w") as z0:
...     z0["df"] = pd.DataFrame({(0, "a"): [2.4, 8.9], (0, "b"): [3.5, 6.2]})
...     z0["foo"] = {
...         "a": (1, (2, {3})),
...         "b": frozenset({1.5, 3}),
...         "c": 0.9 + 0.2j,
...     }
Getting items from a DataZip, like setting them, uses standard Python subscripting.

For pandas.DataFrame, it stores them as parquet and preserves pandas.MultiIndex columns, even when they cannot normally be stored in a parquet file.

>>> with DataZip(buffer, "r") as z1:
...     z1["df"]
     0
     a    b
0  2.4  3.5
1  8.9  6.2
While it is always preferable to use a context manager as above, here it is more convenient to keep the object open. Even more unusual types that can't normally be stored in json should work.

>>> z1 = DataZip(buffer, "r")
>>> z1["foo"]
{'a': (1, (2, {3})), 'b': frozenset({1.5, 3}), 'c': (0.9+0.2j)}
Checking to see if an item is in a DataZip uses standard Python syntax.

>>> "df" in z1
True
You can also check by filename. And check the number of items.
>>> "df.parquet" in z1
True
>>> len(z1)
2
When not used with a context manager, a DataZip should close itself automatically, but it's not a bad idea to make sure.

>>> z1.close()
A DataZip is a write-once, read-many affair because of the way zip files work. Appending to a DataZip can be done with the DataZip.replace() method.

>>> buffer1 = BytesIO()
>>> with DataZip.replace(buffer1, buffer, foo=5, bar=6) as z:
...     z["new"] = "foo"
...     z["foo"]
5
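The write-once limitation comes from the zip format itself, not from DataZip: the stdlib zipfile module cannot overwrite an existing member in place, so writing the same name again in append mode just adds a duplicate entry. A minimal stdlib sketch of that behavior (this illustrates why replace() copies everything into a fresh archive; it is not DataZip code):

```python
import io
import warnings
import zipfile

buf = io.BytesIO()

# Write an initial member.
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("key.txt", "old")

# "Replacing" it in append mode only adds a second entry with the same name.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # zipfile warns about the duplicate name
    with zipfile.ZipFile(buf, "a") as z:
        z.writestr("key.txt", "new")

with zipfile.ZipFile(buf, "r") as z:
    names = z.namelist()
    latest = z.read("key.txt").decode()

print(names)   # both copies are still in the archive: ['key.txt', 'key.txt']
print(latest)  # reads resolve to the last entry ('new'), but 'old' still wastes space
```

Because stale copies linger like this, the only clean way to "edit" a zip is to rewrite it, which is exactly what DataZip.replace() does.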
- static dump(obj, file, **kwargs)[source]¶
Write the DataZip representation of obj to file.

- Parameters:
  obj (Any) – A Python object; it must implement __getstate__ and __setstate__. There are other restrictions, especially if it contains instances of other custom Python objects; it may be enough for all of them to implement __getstate__ and __setstate__.
  file (pathlib.Path | str | io.BytesIO) – A file-like object, or a buffer where the DataZip will be saved.
  kwargs – Keyword arguments will be passed to DataZip.
- Returns:
None
- Return type:
None
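The state-protocol requirement above can be sketched with a hypothetical class (Point is illustrative, not part of etoolbox): dump serializes what __getstate__ returns, and load rebuilds the object by handing that state back to __setstate__, much as pickle does.

```python
class Point:
    """Hypothetical example of an object that satisfies the state protocol."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getstate__(self):
        # Return a serializable representation of the object's state.
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        # Restore the object's attributes from that representation.
        self.__dict__.update(state)


# A roundtrip through the protocol itself, the way a loader would do it:
original = Point(1, 2)
state = original.__getstate__()

restored = Point.__new__(Point)  # allocate without calling __init__
restored.__setstate__(state)

print(restored.x, restored.y)  # 1 2
```

Any nested custom objects need to be reconstructible the same way, which is why the restriction extends to the instances obj contains.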
Examples
Create an object that you would like to save as a DataZip.

>>> from etoolbox.datazip._test_classes import _TestKlass
>>> obj = _TestKlass(a=5, b={"c": [2, 3.5]})
>>> obj
_TestKlass(a=5, b={'c': [2, 3.5]})
Save the object as a DataZip.

>>> buffer = BytesIO()
>>> DataZip.dump(obj, buffer)
>>> del obj
Get it back.
>>> obj = DataZip.load(buffer)
>>> obj
_TestKlass(a=5, b={'c': [2, 3.5]})
- static load(file, klass=None)[source]¶
Return the reconstituted object specified in the file.
- Parameters:
  file (pathlib.Path | str | io.BytesIO) – A file-like object, or a buffer from which the DataZip will be read.
  klass (type | None) – (Optional) allows passing the class when it is known; this is handy when it is not possible to import the module that defines the class that file represents.
- Returns:
  Object from DataZip.
- Return type:
  Any
Examples
See DataZip.dump() for examples.
- classmethod replace(file_or_new_buffer, old_buffer=None, save_old=False, iterwrap=None, **kwargs)[source]¶
Replace an old DataZip with an editable new one.

Note: Data and keys that are copied over by this function cannot be reliably mutated. kwargs must be used to replace the data associated with keys that exist in the old DataZip.
file_or_new_buffer – Either the path to the file to be replaced or the new buffer.
  old_buffer – Only required if file_or_new_buffer is a buffer.
  save_old – If True, the old DataZip will be saved with "_old" appended; if False, it will be deleted when the new DataZip is closed.
  iterwrap – This will be used to wrap the iterator that handles copying data to the new DataZip to enable a progress bar, e.g. tqdm.
  kwargs – Data that should be written into the new DataZip; for any keys that were in the old DataZip, the new value provided here will be used instead.
- Returns:
New editable DataZip with old data copied into it.
Examples
Create a new test file object and put a datazip in it.
>>> file = Path.home() / "test.zip"
>>> with DataZip(file=file, mode="w") as z0:
...     z0["series"] = pd.Series([1, 2, 4], name="series")
Create a replacement DataZip.
>>> z1 = DataZip.replace(file, save_old=False)
The replacement has the old content.
>>> z1["series"]
0    1
1    2
2    4
Name: series, dtype: int64
We can also now add to it.
>>> z1["foo"] = "bar"
While the replacement is open, the old version still exists.
>>> (Path.home() / "test_old.zip").exists()
True
Now we close the replacement which deletes the old file.
>>> z1.close()
>>> (Path.home() / "test_old.zip").exists()
False
Reopening the replacement, we see it contains all the objects.
>>> z2 = DataZip(file, "r")
>>> z2["series"]
0    1
1    2
2    4
Name: series, dtype: int64
>>> z2["foo"]
'bar'
And now some final test cleanup.
>>> z2.close()
>>> file.unlink()
- get(key, default=None)[source]¶
Retrieve an item if it is there, otherwise return default.
- Parameters:
key (str)
- Return type:
etoolbox.datazip._types.DZable
- reset_ids()[source]¶
Reset the internal record of stored ids.
Because 'two objects with non-overlapping lifetimes may have the same id() value', it can be useful to reset the set of seen ids when you are adding objects with non-overlapping lifetimes. See id().
- Return type:
  None
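The caveat reset_ids() addresses is a Python memory-management detail: id() is only guaranteed unique among objects that are alive at the same time. A small sketch of the distinction (whether a freed id is actually reused is implementation-dependent, so the example only asserts the overlapping-lifetime guarantee):

```python
# ids are guaranteed distinct while both objects are alive...
a = [1, 2, 3]
b = [1, 2, 3]
assert id(a) != id(b)

# ...but once an object is garbage collected, its id may be handed
# to a brand-new object. CPython often (but not always) reuses it:
old_id = id(a)
del a
c = [4, 5, 6]
print(id(c) == old_id)  # may be True on CPython; not guaranteed by the language
```

If DataZip is tracking ids for deduplication across such a boundary, the new object can collide with a dead one's id, which is exactly when resetting the seen-id set helps.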
- items()[source]¶
Lazily read name/value pairs from a DataZip.
- Return type:
  collections.abc.Generator[str, etoolbox.datazip._types.DZable]
- read_dfs()[source]¶
Read all dfs lazily.
DeprecationWarning: read_dfs will be removed in a future version; use DataZip.items().
- Return type:
  collections.abc.Generator[tuple[str, pandas.DataFrame | pandas.Series]]