etoolbox.datazip.core
Code for DataZip.
Classes
DataZip – A ZipFile with methods for easier use with Python objects.
Module Contents
- class etoolbox.datazip.core.DataZip(file, mode='r', ignore_pd_dtypes=False, *args, ids_for_dedup=True, **kwargs)[source]
Bases: zipfile.ZipFile
A ZipFile with methods for easier use with Python objects.
Create a DataZip.
- Parameters:
file (str | os.PathLike | io.BytesIO) – Either the path to the file, or a file-like object. If it is a path, the file will be opened and closed by DataZip.
mode (Literal['r', 'w']) – The mode can be either read 'r', or write 'w'.
recipes – Deprecated.
compression – ZIP_STORED (no compression), ZIP_DEFLATED (requires zlib), ZIP_BZIP2 (requires bz2) or ZIP_LZMA (requires lzma).
ignore_pd_dtypes – If True, any dtypes stored in a DataZip for pandas.DataFrame columns or pandas.Series will be ignored. This may be useful when using global settings for mode.dtype_backend or mode.use_nullable_dtypes to force the use of pyarrow types.
args – Additional positional arguments will be passed to zipfile.ZipFile.__init__().
ids_for_dedup – If True, multiple references to the same object will not cause the object to be stored multiple times. If False, the object will be stored as many times as it is referenced. True can save space, but because ids are not unique for objects with non-overlapping lifetimes, setting it to True can result in subsequent new objects NOT being stored because they share an id with an earlier object.
kwargs – Keyword arguments will be passed to zipfile.ZipFile.__init__().
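The id-based deduplication behind ids_for_dedup can be pictured with plain Python: two names bound to the same object share an id(), while an equal but distinct object does not. This is a sketch of the general mechanism only, not DataZip's internals.

```python
x = [1, 2, 3]
alias = x          # a second reference to the same object
copy = [1, 2, 3]   # equal to x, but a distinct object

# Two references to one object share an id, so id-based dedup
# would store the underlying object only once.
print(id(x) == id(alias))  # True
# An equal but distinct object has its own id and would be stored again.
print(id(x) == id(copy))   # False
```

The caveat in the parameter description arises because once `copy` is garbage-collected, a later object may be allocated at the same address and receive the same id.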
Examples
First we can create a DataZip. In this case we are using a buffer (io.BytesIO) for convenience. In most cases though, file would be a pathlib.Path or str that represents a file. In these cases a .zip extension will be added if it is not there already.

>>> buffer = BytesIO()  # can also be a file-like object
>>> with DataZip(file=buffer, mode="w") as z0:
...     z0["df"] = pd.DataFrame({(0, "a"): [2.4, 8.9], (0, "b"): [3.5, 6.2]})
...     z0["foo"] = {
...         "a": (1, (2, {3})),
...         "b": frozenset({1.5, 3}),
...         "c": 0.9 + 0.2j,
...     }

Getting items from a DataZip, like setting them, uses standard Python subscripting.

For pandas.DataFrame, it stores them as parquet and preserves pandas.MultiIndex columns, even when they cannot normally be stored in a parquet file.

>>> with DataZip(buffer, "r") as z1:
...     z1["df"]
     0
     a    b
0  2.4  3.5
1  8.9  6.2
While it is always preferable to use a context manager as above, here it's more convenient to keep the object open. Even more unusual types that can't normally be stored in json should work.

>>> z1 = DataZip(buffer, "r")
>>> z1["foo"]
{'a': (1, (2, {3})), 'b': frozenset({1.5, 3}), 'c': (0.9+0.2j)}
Checking to see if an item is in a DataZip uses standard Python syntax.

>>> "df" in z1
True

You can also check by filename, and check the number of items.

>>> "df.parquet" in z1
True
>>> len(z1)
2

When not used with a context manager, a DataZip should close itself automatically, but it's not a bad idea to make sure.

>>> z1.close()
A DataZip is a write-once, read-many affair because of the way zipfiles work. Appending to a DataZip can be done with the DataZip.replace() method.

>>> buffer1 = BytesIO()
>>> with DataZip.replace(buffer1, buffer, foo=5, bar=6) as z:
...     z["new"] = "foo"
...     z["foo"]
5
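The write-once behavior comes from the zip format itself: writing a member under an existing name appends a second entry rather than overwriting the first, which is why a copy-based replace() exists. A stdlib sketch (assuming nothing about DataZip's internals):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("data.txt", "first")
    z.writestr("data.txt", "second")  # appends a second entry (with a UserWarning)

with zipfile.ZipFile(buf, "r") as z:
    # Both entries remain in the archive; nothing was overwritten in place.
    print(z.namelist())        # ['data.txt', 'data.txt']
    # Reads resolve to the last entry written under that name.
    print(z.read("data.txt"))  # b'second'
```

Because stale entries keep taking up space and cannot be removed in place, building a fresh archive is the only clean way to "edit" one.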
- static dump(obj, file, **kwargs)[source]
Write the DataZip representation of obj to file.
- Parameters:
obj (Any) – A Python object; it must implement __getstate__ and __setstate__. There are other restrictions, especially if it contains instances of other custom Python objects; it may be enough for all of them to implement __getstate__ and __setstate__.
file (pathlib.Path | str | io.BytesIO) – A file-like object, or a buffer where the DataZip will be saved.
kwargs – Keyword arguments will be passed to DataZip.
- Returns:
None
- Return type:
None
Examples
Create an object that you would like to save as a DataZip.

>>> from etoolbox.datazip._test_classes import _TestKlass
>>> obj = _TestKlass(a=5, b={"c": [2, 3.5]})
>>> obj
_TestKlass(a=5, b={'c': [2, 3.5]})

Save the object as a DataZip.

>>> buffer = BytesIO()
>>> DataZip.dump(obj, buffer)
>>> del obj

Get it back.

>>> obj = DataZip.load(buffer)
>>> obj
_TestKlass(a=5, b={'c': [2, 3.5]})
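The __getstate__/__setstate__ protocol that dump() relies on is the same one pickle uses, so a minimal conforming class can be sketched with the stdlib alone (a generic illustration, not tied to DataZip):

```python
import pickle

class Point:
    """Minimal class implementing the state protocol."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getstate__(self):
        # Return a serializable representation of the instance.
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        # Rebuild the instance from that representation.
        self.__dict__.update(state)

p = pickle.loads(pickle.dumps(Point(1, 2)))
print(p.x, p.y)  # 1 2
```

Any nested custom objects need the same treatment, which is why the parameter description above says it may be enough for all contained objects to implement the pair.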
- static load(file, klass=None)[source]
Return the reconstituted object specified in the file.
- Parameters:
file (pathlib.Path | str | io.BytesIO) – A file-like object, or a buffer from which the DataZip will be read.
klass (type | None) – (Optional) allows passing the class when it is known; this is handy when it is not possible to import the module that defines the class that file represents.
- Returns:
Object from DataZip.
- Return type:
Any
Examples
See DataZip.dump() for examples.
- classmethod replace(file_or_new_buffer, old_buffer=None, save_old=False, iterwrap=None, **kwargs)[source]
Replace an old DataZip with an editable new one.
Note: Data and keys that are copied over by this function cannot be reliably mutated. kwargs must be used to replace the data associated with keys that exist in the old DataZip.
- Parameters:
file_or_new_buffer – Either the path to the file to be replaced or the new buffer.
old_buffer – Only required if file_or_new_buffer is a buffer.
save_old – If True, the old DataZip will be saved with "_old" appended; if False, it will be deleted when the new DataZip is closed.
iterwrap – This will be used to wrap the iterator that handles copying data to the new DataZip to enable a progress bar, i.e. tqdm.
kwargs – Data that should be written into the new DataZip; for any keys that were in the old DataZip, the new value provided here will be used instead.
- Returns:
New editable DataZip with old data copied into it.
Examples
Create a new test file object and put a DataZip in it.

>>> file = Path.home() / "test.zip"
>>> with DataZip(file=file, mode="w") as z0:
...     z0["series"] = pd.Series([1, 2, 4], name="series")

Create a replacement DataZip.

>>> z1 = DataZip.replace(file, save_old=False)

The replacement has the old content.

>>> z1["series"]
0    1
1    2
2    4
Name: series, dtype: int64

We can also now add to it.

>>> z1["foo"] = "bar"

While the replacement is open, the old version still exists.

>>> (Path.home() / "test_old.zip").exists()
True

Now we close the replacement, which deletes the old file.

>>> z1.close()
>>> (Path.home() / "test_old.zip").exists()
False

Reopening the replacement, we see it contains all the objects.

>>> z2 = DataZip(file, "r")
>>> z2["series"]
0    1
1    2
2    4
Name: series, dtype: int64
>>> z2["foo"]
'bar'

And now some final test cleanup.

>>> z2.close()
>>> file.unlink()
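The copy-then-swap pattern behind replace() can be sketched with the stdlib alone: read every member of the old archive into a fresh one, substituting any overridden keys. The helper name and signature here are hypothetical, not DataZip's implementation.

```python
import io
import zipfile

def replace_members(old: io.BytesIO, **overrides: str) -> io.BytesIO:
    """Copy an archive into a fresh buffer, overriding selected members."""
    new = io.BytesIO()
    with zipfile.ZipFile(old, "r") as src, zipfile.ZipFile(new, "w") as dst:
        for name in src.namelist():
            if name in overrides:
                dst.writestr(name, overrides.pop(name))  # replaced member
            else:
                dst.writestr(name, src.read(name))       # copied as-is
        for name, data in overrides.items():             # brand-new members
            dst.writestr(name, data)
    return new

old = io.BytesIO()
with zipfile.ZipFile(old, "w") as z:
    z.writestr("a.txt", "keep me")
    z.writestr("b.txt", "replace me")

new = replace_members(old, **{"b.txt": "replaced"})
with zipfile.ZipFile(new, "r") as z:
    print(z.read("a.txt"), z.read("b.txt"))  # b'keep me' b'replaced'
```

Copying member-by-member like this also explains why copied data cannot be reliably mutated afterward: it is written into the new archive before the caller ever sees it.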
- get(key, default=None)[source]
Retrieve an item if it is there, otherwise return default.
- Parameters:
key (str)
- Return type:
etoolbox.datazip._types.DZable
- reset_ids()[source]
Reset the internal record of stored ids.
Because "two objects with non-overlapping lifetimes may have the same id() value", it can be useful to reset the set of seen ids when you are adding objects with non-overlapping lifetimes.
See id().
- Return type:
None
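The bookkeeping that reset_ids() clears can be pictured as a set of previously seen ids. This toy registry (not DataZip's actual internals) shows why a stale set makes an object look like a duplicate, and what resetting does:

```python
class IdRegistry:
    """Toy id-based dedup: store an object only the first time its id appears."""

    def __init__(self):
        self._seen: set[int] = set()

    def store(self, obj) -> bool:
        """Return True if stored, False if treated as a duplicate."""
        if id(obj) in self._seen:
            return False
        self._seen.add(id(obj))
        return True

    def reset(self):
        # Analogous in spirit to DataZip.reset_ids().
        self._seen.clear()

reg = IdRegistry()
x = [1, 2]
print(reg.store(x))  # True: first time this id is seen
print(reg.store(x))  # False: same object, deduplicated
reg.reset()
print(reg.store(x))  # True again: the registry forgot the id
```

If `x` were deleted and a new object happened to be allocated at the same address, the unreset registry would wrongly skip it, which is the hazard the docstring quotes from id().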
- items()[source]
Lazily read name/value pairs from a DataZip.
- Return type:
collections.abc.Generator[str, etoolbox.datazip._types.DZable]
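A lazy name/value iterator over an archive can be sketched with the stdlib: a generator that reads each member only when the consumer asks for it. The helper below is hypothetical, not DataZip's implementation, and yields raw bytes rather than reconstituted objects.

```python
import io
import zipfile
from collections.abc import Generator

def lazy_items(buf: io.BytesIO) -> Generator[tuple[str, bytes], None, None]:
    """Yield (name, payload) pairs, reading one member per iteration step."""
    with zipfile.ZipFile(buf, "r") as z:
        for name in z.namelist():
            yield name, z.read(name)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a", "1")
    z.writestr("b", "2")

print(dict(lazy_items(buf)))  # {'a': b'1', 'b': b'2'}
```

Laziness matters when an archive holds large DataFrames: only the members you actually iterate over are decompressed and deserialized.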
- read_dfs()[source]
Read all dfs lazily.
DeprecationWarning: read_dfs will be removed in a future version; use DataZip.items().
- Return type:
collections.abc.Generator[tuple[str, pandas.DataFrame | pandas.Series]]