eToolBox Release Notes
0.3.1 (2024-XX-XX)
What’s New?
Use pyarrow directly in pd_read_pudl() to avoid having dates cast to objects.
Compatibility with Python 3.13 tested and included in CI.
Declaring optional cloud dependencies of pandas and polars explicitly.
Tools for working with data stored on Azure.
DataZip now recognizes alternative methods for getting and setting object state so that an object can specify a serialization for DataZip that is different than that for pickle. These new methods are _dzgetstate_ and _dzsetstate_.
storage_options() to simplify reading from/writing to Azure using pandas or polars.
generator_ownership() compiles ownership information for all generators using data from pudl.
New CLI built off a single command rmi with cloud and pudl subcommands for cleaning caches and configs, showing the contents of caches, and in the cloud case, getting, putting, and listing files.
DataZip will not append a .zip suffix to file paths passed to its init as strings.
Added simplify_strings() to pudl_helpers.
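The split between pickle state and DataZip state can be pictured with a small sketch. Only the method names _dzgetstate_ and _dzsetstate_ come from the note above; the Model class and the get_state() and set_state() helpers are hypothetical stand-ins that show the idea of preferring the alternative hooks, not eToolBox's actual implementation.

```python
import pickle


class Model:
    """Toy object that keeps a cache when pickled but drops it when archived."""

    def __init__(self, name, cache=None):
        self.name = name
        self.cache = cache or {}

    # Used by pickle: keep everything, including the cache.
    def __getstate__(self):
        return {"name": self.name, "cache": self.cache}

    def __setstate__(self, state):
        self.__dict__.update(state)

    # Alternative hooks named in the release note: leave the cache out.
    def _dzgetstate_(self):
        return {"name": self.name}

    def _dzsetstate_(self, state):
        self.name = state["name"]
        self.cache = {}


def get_state(obj):
    """Hypothetical helper: prefer the alternative hook, else use pickle's."""
    getter = getattr(type(obj), "_dzgetstate_", None)
    if getter is not None:
        return getter(obj)
    # Assumes the class defines __getstate__ (always true on Python >= 3.11).
    return obj.__getstate__()


def set_state(cls, state):
    """Hypothetical helper: rebuild an object without calling __init__."""
    obj = cls.__new__(cls)
    setter = getattr(cls, "_dzsetstate_", None)
    if setter is not None:
        setter(obj, state)
    else:
        obj.__setstate__(state)
    return obj


original = Model("plant", cache={"big": list(range(3))})
archived = set_state(Model, get_state(original))  # cache is dropped
pickled = pickle.loads(pickle.dumps(original))  # cache is kept
```

Looking the hooks up on the class rather than the instance keeps the lookup from triggering any instance-level __getattr__ logic.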
Bug Fixes
Fixed a bug in the implementation of the alternative serialization methods that caused recursion or other errors when serializing an object whose class implemented __getattr__.
Attempt to fix doctest bug caused by pytest logging, see pytest#5908.
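For context on the __getattr__ bug, the sketch below shows the general failure mode and two common guards. It is an illustration under assumptions: Proxy and find_state_hook() are hypothetical, and the note does not say how eToolBox itself resolved the problem.

```python
class Proxy:
    """Forwards unknown attribute lookups to a wrapped dict."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called when normal lookup fails. Refusing underscore-prefixed
        # names means that probing a bare instance (one whose __init__ has not
        # run, as happens while deserializing) cannot recurse through
        # self._data -> __getattr__("_data") -> self._data -> ...
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


def find_state_hook(obj, names=("_dzgetstate_",)):
    """Hypothetical lookup that inspects the class, never the instance."""
    for name in names:
        hook = getattr(type(obj), name, None)
        if hook is not None:
            return hook
    return None


bare = Proxy.__new__(Proxy)  # no __init__ call, so _data does not exist yet
has_hook = hasattr(bare, "_dzgetstate_")  # safe because of the guard above
```

Without the underscore guard, the hasattr() call on the bare instance would raise RecursionError, because each failed lookup of _data re-enters __getattr__.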
0.3.0 (2024-10-07)
What’s New?
New functions to read pudl tables from parquets in an open-access AWS bucket using pd_read_pudl(), pl_read_pudl(), and pl_scan_pudl(), which handle caching. The polars AWS client does not currently work, so use_polars must be set to False.
New pudl_list() to show a list of releases or tables within a release.
Restricting platformdirs version to >= 3.0, when the config location changed.
Removed: read_pudl_table(), get_pudl_tables_as_dz(), make_pudl_tabl(), lazy_import().
Created etoolbox.utils.logging_utils with helpers to set up and format loggers in a more performant and structured way based on mCoding suggestion. Also replaced module-level loggers with a library-wide logger and removed logger configuration from etoolbox because it is a library. This requires Python>=3.12.
Minor performance improvements to DataZip.keys() and DataZip.__len__().
Fixed links to docs for polars, plotly, platformdirs, fsspec, and pudl. At least in theory.
Optimization in DataZip.__getitem__() for reading a single value from a nested structure without decoding all enclosing objects. We use isinstance() and dict.get() rather than try/except to handle non-dict objects and missing keys.
New CLI utility pudl-table-rename that renames PUDL tables in a set of files to the new names used by PUDL.
Allow older versions of polars. This is a convenience for some other projects that have not adapted to >=1.0 changes, but we do not test against older versions.
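The isinstance() and dict.get() approach mentioned for DataZip.__getitem__() can be sketched on plain nested dicts. The nested_get() function and _MISSING sentinel are hypothetical names; DataZip works on its own encoded structures, so this shows only the lookup pattern, not the library's code.

```python
_MISSING = object()  # sentinel so that stored None values are not mistaken for misses


def nested_get(data, keys):
    """Walk `keys` through nested dicts, raising KeyError on any miss."""
    current = data
    for key in keys:
        # isinstance() handles non-dict values and dict.get() handles missing
        # keys, so the common path needs no try/except.
        if not isinstance(current, dict):
            raise KeyError(key)
        current = current.get(key, _MISSING)
        if current is _MISSING:
            raise KeyError(key)
    return current


tree = {"plant": {"units": {"u1": 42, "u2": None}}}
value = nested_get(tree, ("plant", "units", "u1"))
```

The sentinel matters because a plain dict.get(key) cannot distinguish a missing key from a key whose value is None.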
Bug Fixes
Fixed a bug where etoolbox could not be used if tqdm was not installed. As it is an optional dependency, _optional should be able to fully address that issue.
Fixed a bug where import of typing.override() in etoolbox.utils.logging_utils broke compatibility with Python 3.11 since the function was added in 3.12.
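One common way to keep an override decorator importable on Python 3.11 is a version-gated import with a no-op fallback. This is a sketch of that general pattern, not necessarily the fix eToolBox applied; the Base and Child classes are hypothetical.

```python
import sys

if sys.version_info >= (3, 12):
    from typing import override
else:

    def override(method):
        """No-op stand-in for typing.override on Python < 3.12."""
        return method


class Base:
    def emit(self):
        return "base"


class Child(Base):
    @override  # a hint for type checkers; it does not change runtime behavior
    def emit(self):
        return "child"


result = Child().emit()
```

Projects that already depend on typing_extensions can import override from there instead of defining the fallback by hand.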
0.2.0 (2024-02-28)
Complete redesign of system internals and standardization of the data format. This resulted in a couple of key improvements:
Performance: Decoding is now lazy, so structures and objects are only rebuilt when they are retrieved, rather than when the file is opened. Encoding is only done once, rather than once to make sure it will work, and then again when the data is written on close. Further, the correct encoder/decoder is selected using dict lookups rather than chains of isinstance().
Data Format: Rather than a convoluted system to flatten the object hierarchy, we preserve the hierarchy in the __attributes__.json file. We also provide encoders and decoders that allow all Python builtins as well as other types to be stored in json. Any data that cannot be encoded to json is saved elsewhere and the entry in __attributes__.json contains a pointer to where the data is actually stored. Further, rather than storing some metadata in __attributes__.json and some elsewhere, now all metadata is stored alongside the data or pointer in __attributes__.json.
Custom Classes: We no longer save custom objects as their own DataZip. Their location in the object hierarchy is preserved with a pointer and associated metadata. The object’s state is stored separately in a hidden key, __state__, in __attributes__.json.
References: The old format stored every object as many times as it was referenced. This meant that objects could be stored multiple times and when the hierarchy was recreated, these objects would be copies. The new process for storing custom classes, pandas.DataFrame, pandas.Series, and numpy.array uses id() to make sure we only store data once and that these relationships are recreated when loading data from a DataZip.
API: DataZip behaves a little like a dict. It has DataZip.get(), DataZip.items(), and DataZip.keys() which do what you would expect. It also implements dunder methods to allow membership checking using in, len(), and subscripts to get and set items (i.e. obj[key] = value). These all also behave as you would expect, except that setting an item raises a KeyError if the key is already in use. One additional feature with lookups is that you can provide multiple keys which are looked up recursively, allowing efficient access to data in nested structures. DataZip.dump() and DataZip.load() are static methods that allow you to directly save and load an object into a DataZip, similar to pickle.dump() and pickle.load() except they handle opening and closing the file as well. Finally, DataZip.replace() is a little like typing.NamedTuple._replace(); it copies the contents of one DataZip into a new one, with select keys replaced.
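The dict-like behavior described under API can be illustrated with a toy in-memory class. WriteOnceStore is a hypothetical stand-in, not DataZip, and the sketch assumes that multiple keys are passed as a tuple subscript such as store["a", "b"].

```python
class WriteOnceStore:
    """Toy stand-in showing write-once items and recursive multi-key lookup."""

    def __init__(self):
        self._items = {}

    def __setitem__(self, key, value):
        # Mirrors the documented behavior: a key can only be set once.
        if key in self._items:
            raise KeyError(f"{key!r} is already in use")
        self._items[key] = value

    def __getitem__(self, key):
        # store["a", "b"] arrives as the tuple ("a", "b"); walk it level by level.
        keys = key if isinstance(key, tuple) else (key,)
        value = self._items
        for k in keys:
            value = value[k]
        return value

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)


store = WriteOnceStore()
store["plants"] = {"p1": {"capacity_mw": 100}}
capacity = store["plants", "p1", "capacity_mw"]
```

A second assignment to store["plants"] raises KeyError, which is the main way this interface differs from a regular dict.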
Added dtype metadata for pandas objects as well as ability to ignore that metadata to allow use of pyarrow dtypes.
Switching to use ujson rather than the standard library version for performance.
Added optional support for polars.DataFrame, polars.LazyFrame, and polars.Series in DataZip.
Added PretendPudlTabl. When passed as the klass argument to DataZip.load(), it allows accessing the dfs in a zipped pudl.PudlTabl as you would normally but avoiding the pudl dependency.
Code cleanup along with adoption of ruff and removal of bandit, flake8, isort, etc.
Added lazy_import() to lazily import or proxy a module, inspired by polars.dependencies.lazy_import.
Created tools for proxying pudl.PudlTabl to provide access to cached PUDL data without requiring that pudl is installed, or at least imported. The process of either loading a PretendPudlTabl from cache, or creating and then caching a pudl.PudlTabl is handled by make_pudl_tabl().
Copied a number of helper functions that we often use from pudl.helpers to pudl_helpers so they can be used without installing or importing pudl.
Added a very light adaptation of the python-remotezip package to access files within a zip archive without downloading the full archive.
Updates to DataZip encoding and decoding of pandas.DataFrame so they work with pandas version 2.0.0.
Updates to make_pudl_tabl() and associated functions and classes so that it works with new and changing aspects of pudl.PudlTabl, specifically those raised in catalyst#2503. Added testing for full make_pudl_tabl() functionality.
Added get_pudl_table(), which reads a table from a pudl.sqlite that is stored where it is expected.
Added support for polars.DataFrame, polars.LazyFrame, and polars.Series to etoolbox.utils.testing.assert_equal().
plotly.Figure objects are now stored as pickles so they can be recreated.
Updates to get_pudl_sql_url() so that it doesn’t require PUDL environment variables or config files if the sqlite is at pudl-work/output/pudl.sqlite, and tells the user to put the sqlite there if it cannot be found another way.
New conform_pudl_dtypes() function that casts PUDL columns to the dtypes used in PudlTabl, useful when loading tables from a sqlite that doesn’t preserve all dtype info.
Added ungzip() to help with un-gzipping pudl.sqlite.gz and now using the gzipped version in tests.
Switching two cases of with suppress... to try - except - pass in DataZip to take advantage of zero-cost exceptions.
Deprecations: these will be removed in the next release along with supporting infrastructure:
lazy_import() and the rest of the lazy_import module.
PUDL_DTYPES, use conform_pudl_dtypes() instead.
make_pudl_tabl(), PretendPudlTablCore; read tables directly from the sqlite:

    # With pandas; table_name is a placeholder for the PUDL table to read.
    import pandas as pd
    import sqlalchemy as sa

    from etoolbox.utils.pudl import get_pudl_sql_url, conform_pudl_dtypes

    pd.read_sql_table(table_name, sa.create_engine(get_pudl_sql_url())).pipe(
        conform_pudl_dtypes
    )

    # With polars.
    import polars as pl

    from etoolbox.utils.pudl import get_pudl_sql_url

    pl.read_database("SELECT * FROM table_name", get_pudl_sql_url())
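The switch from suppress to try/except/pass noted above relies on the two forms being behaviorally equivalent. The sketch below shows that equivalence on a plain dict; the data variable is hypothetical and the snippet is not taken from DataZip.

```python
from contextlib import suppress

data = {"a": 1}

# Context-manager form: entering and exiting suppress() does a small amount
# of work on every call, whether or not an exception is raised.
with suppress(KeyError):
    del data["missing"]

# Equivalent try/except/pass form: with the zero-cost exceptions introduced
# in Python 3.11, the try block adds essentially no overhead when nothing
# is raised.
try:
    del data["missing"]
except KeyError:
    pass
```

Both forms leave data unchanged here; the difference is only in the overhead paid on the path where no exception occurs.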
Bug Fixes
Allow typing.NamedTuple to be used as keys in a dict, and a collections.defaultdict.
Fixed a bug in make_pudl_tabl() where creating and caching a new pudl.PudlTabl would fail to load the PUDL package.
Fixed a bug where attempting to retrieve an empty pandas.DataFrame raised an IndexError when ignore_pd_dtypes is False.
Updated the link for the PUDL database.
Known Issues
Some legacy DataZip files cannot be fully read, especially those with nested structures and custom classes.
DataZip ignores functools.partial() objects, at least in most dicts.
0.1.0 (2023-02-27)
What’s New?
Migrating DataZip from rmi.dispatch where it didn’t really belong. Also added additional functionality including recursive writing and reading of list, dict, and tuple objects.
Created IOMixin and IOWrapper to make it easier to add DataZip to other classes.
Migrating compare_dfs() from the Hub.
Updates to DataZip, IOMixin, and IOWrapper to better manage attributes missing from the original object or the file representation of the object, including the ability to use differently organized versions of DataZip.
Clean up of DataZip internals, both within the object and in laying out files, particularly how metadata and attributes are stored. Added DataZip.readm() and DataZip.writem() to read and write additional metadata not core to DataZip.
Added support for storing numpy.array objects in DataZip using numpy.load() and numpy.save().
DataZip now handles writing attributes and metadata using DataZip.close() so DataZip can now be used with or without a context manager.
Added isclose(), similar to numpy.isclose() but allowing comparison of arrays containing strings, especially useful with pandas.Series.
Added a module etoolbox.utils.match containing the helpers Raymond Hettinger demonstrated in his talk at PyCon Italia for using Python’s case/match syntax.
Added support for Python 3.11.
Added support for storing plotly figures as pdf in DataZip.
Added support for checking whether a file or attribute is stored in DataZip using DataZip.__contains__(), i.e. using Python’s in.
Added support for subscript-based getting and setting of data in DataZip.
Custom Python objects can be serialized with DataZip if they implement __getstate__ and __setstate__, or can be serialized using the default logic described in object.__getstate__(). That default logic is now implemented in DataZip.default_getstate() and DataZip.default_setstate(). This replaces the use of to_file and from_file by DataZip. IOMixin has been updated accordingly.
Added static methods DataZip.dump() and DataZip.load() for serializing a single Python object. These are designed to be similar to how pickle.dump() and pickle.load() work.
Removing IOWrapper.
Added a DataZip.replace() that copies the contents of an old DataZip into a new copy of it after which you can add to it.
Extended JSON encoding / decoding to process an expanded set of builtins, standard library, and other common objects including tuple, set, frozenset, complex, typing.NamedTuple, datetime.datetime, pathlib.Path, and pandas.Timestamp.
Adding centralized testing helpers.
Added a subclass of PudlTabl that adds back __getstate__ and __setstate__ to enable caching. This caching will not work for tables that are not stored in the object, which will be an increasing portion of tables as discussed here.
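Extending JSON to cover types such as set, complex, and pathlib.Path is usually done by tagging values on the way out and rebuilding them on the way in. The following is a minimal sketch of that general technique with the standard library; the encode_extra() and decode_extra() functions and the "__type__" tag are hypothetical and do not reflect DataZip's actual on-disk format.

```python
import json
from pathlib import Path


def encode_extra(obj):
    """Fallback for json.dumps: tag types that JSON cannot represent."""
    if isinstance(obj, (set, frozenset)):
        # sorted() assumes the items are mutually orderable.
        return {"__type__": type(obj).__name__, "items": sorted(obj)}
    if isinstance(obj, complex):
        return {"__type__": "complex", "items": [obj.real, obj.imag]}
    if isinstance(obj, Path):
        return {"__type__": "Path", "items": str(obj)}
    raise TypeError(f"cannot encode {type(obj).__name__}")


def decode_extra(dct):
    """object_hook for json.loads: rebuild tagged values."""
    builders = {
        "set": set,
        "frozenset": frozenset,
        "complex": lambda parts: complex(*parts),
        "Path": Path,
    }
    tag = dct.get("__type__")
    if tag in builders:
        return builders[tag](dct["items"])
    return dct


payload = {"ids": {3, 1, 2}, "z": 1 + 2j, "where": Path("data/out.csv")}
text = json.dumps(payload, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)
```

A tuple needs different handling from the types shown, because json.dumps writes tuples as lists before the default fallback is ever consulted.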
Bug Fixes
Fixed an issue where a single column pandas.DataFrame was recreated as a pandas.Series. Now this should be backwards compatible by applying pandas.DataFrame.squeeze if object metadata is not available.
Fixed a bug that prevented certain kinds of objects from working properly under 3.11.
Fixed an issue where the name for a pandas.Series might get mangled or changed.