eToolBox Release Notes

0.4.1 (2025-XX-XX)

What's New?

- Added a colocation results example to the eToolBox and R section and simplified its structure.
- `etb cloud init` walks you through setup if no arguments are provided.
- The Azure account name is now set and stored rather than hard-coded.
- Updated and cleaned up the README.
- `read_patio_file()` is now `read_cloud_file()` and takes only a filename, which can represent any file in any of the account's containers. It also supports reading all filetypes that `write_cloud_file()` can write.
- `write_patio_econ_results()` is now `write_cloud_file()` and takes only a filename, which can represent any file in any of the account's containers.
- Removed `remote_zip`, as we never used or actively maintained it. If that functionality is needed, use the original python-remotezip package.
Bug Fixes

- Fixed a bug in `read_patio_file()` where the fallback process for missing specified file extensions was incorrect for CSV and Parquet files.
0.4.0 (2025-05-08)

What's New?

- Use `pyarrow` directly in `pd_read_pudl()` to avoid having dates cast to objects.
- Compatibility with Python 3.13 tested and included in CI.
- Declared the optional cloud dependencies of `pandas` and `polars` explicitly.
- Tools for working with data stored on Azure.
- `DataZip` now recognizes alternative methods for getting and setting object state, so that an object can specify a serialization for `DataZip` that is different from that for `pickle`. The new methods are `_dzgetstate_` and `_dzsetstate_`.
- `storage_options()` to simplify reading from/writing to Azure using `pandas` or `polars`.
- `generator_ownership()` compiles ownership information for all generators using data from `pudl`.
- New CLI built off a single command, `rmi` or `etb`, with `cloud` and `pudl` subcommands for cleaning caches and configs, showing the contents of caches, and, in the cloud case, getting, putting, and listing files.
- `DataZip` will not append a `.zip` suffix to file paths passed to its init as strings.
- Added `simplify_strings()` to `pudl_helpers`.
- `SafeFormatter`, a subclass of `logging.Formatter` that can fill extra values with defaults when they are not provided in the logging call. See here for more info on the `extra` kwarg in logging calls.
- Option to disable `DataZip`'s use of ids to keep track of multiple references to the same object, via the `ids_for_dedup` kwarg.
- Instructions and additional helper functions to support using eToolBox from R, specifically `read_patio_resource_results()`, `read_patio_file()`, and `write_patio_econ_results()`; see eToolBox and R for details.
- Use azcopy under the hood in `get()` and `put()`, which is faster and more easily allows keeping directories in sync by transferring only the differences.
- `pl_scan_pudl()` now works with `use_polars=True`, which avoids using `fsspec` in favor of `polars`' faster implementation that can avoid downloading whole parquets when using predicate pushdown. Unfortunately, this means there is no local caching.
- `write_patio_econ_results()` now works with `str` and `bytes` for writing `.json`, `.csv`, `.txt`, etc.
- Added an `etb pudl list` command to the CLI for seeing PUDL releases and the data in each release, as well as `etb pudl get` to download a table and save it as a CSV.
- Improved CLI using `click` and new CLI documentation.
- Removed `get_pudl_sql_url()` and `PretendPudlTabl`.
- Migrated `tox` and GitHub Actions tooling to `uv`.
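The alternative state methods let a class give `DataZip` a leaner serialization than the one `pickle` sees. A toy sketch of the protocol, assuming only what the notes state (that `_dzgetstate_`/`_dzsetstate_` are preferred when a class defines them); `MiniDZ` and `Profile` are illustrative, not part of eToolBox:

```python
class MiniDZ:
    """Sketch of the alternative-state protocol: prefer
    ``_dzgetstate_``/``_dzsetstate_`` when a class defines them,
    otherwise fall back to the object's ``__dict__`` as pickle
    effectively does. Not the real implementation."""

    @staticmethod
    def dump(obj) -> dict:
        # look on the class, not the instance, to avoid __getattr__ traps
        if hasattr(type(obj), "_dzgetstate_"):
            state = obj._dzgetstate_()
        else:
            state = obj.__dict__.copy()
        return {"class": type(obj).__name__, "state": state}

    @staticmethod
    def load(klass, payload: dict):
        obj = object.__new__(klass)
        if hasattr(klass, "_dzsetstate_"):
            obj._dzsetstate_(payload["state"])
        else:
            obj.__dict__.update(payload["state"])
        return obj


class Profile:
    def __init__(self, name, cache):
        self.name = name
        self.cache = cache  # expensive but recomputable; skip for DataZip

    def _dzgetstate_(self):
        return {"name": self.name}  # leaner than the full __dict__

    def _dzsetstate_(self, state):
        self.name = state["name"]
        self.cache = None  # rebuilt lazily elsewhere


q = MiniDZ.load(Profile, MiniDZ.dump(Profile("solar", cache=list(range(1000)))))
print(q.name, q.cache)  # → solar None
```

`pickle`, by contrast, would still round-trip the full `__dict__` of the same object, which is exactly the divergence the two-protocol design allows.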
Bug Fixes

- Fixed a bug in the implementation of the alternative serialization methods that caused recursion or other errors when serializing an object whose class implemented `__getattr__`.
- Attempted to fix a doctest bug caused by pytest logging, see pytest#5908.
- Fixed a bug that meant only zips created with `DataZip.dump()` could be opened with `DataZip.load()`.
- Fixed a bug where certain `pandas.DataFrame` columns of dtype `object`, specifically columns containing `bool` and `None`, became lists rather than DataFrame columns when `read_patio_resource_results()` is called from R.
0.3.0 (2024-10-07)

What's New?

- New functions to read `pudl` tables from parquets in an open-access AWS bucket using `pd_read_pudl()`, `pl_read_pudl()`, and `pl_scan_pudl()`, which handle caching. The `polars` AWS client does not currently work, so `use_polars` must be set to `False`.
- New `pudl_list()` to show a list of releases or tables within a release.
- Restricted the `platformdirs` version to >= 3.0, when the config location changed.
- Removed: `read_pudl_table()`, `get_pudl_tables_as_dz()`, `make_pudl_tabl()`, `lazy_import()`.
- Created `etoolbox.utils.logging_utils` with helpers to set up and format loggers in a more performant and structured way based on an mCoding suggestion. Also replaced module-level loggers with a library-wide logger and removed logger configuration from `etoolbox` because it is a library. This requires Python >= 3.12.
- Minor performance improvements to `DataZip.keys()` and `DataZip.__len__()`.
- Fixed links to docs for `polars`, `plotly`, `platformdirs`, `fsspec`, and `pudl`. At least in theory.
- Optimization in `DataZip.__getitem__()` for reading a single value from a nested structure without decoding all enclosing objects; we use `isinstance()` and `dict.get()` rather than try/except to handle non-dict objects and missing keys.
- New CLI utility `pudl-table-rename` that renames PUDL tables in a set of files to the new names used by PUDL.
- Allow older versions of `polars`; this is a convenience for some other projects that have not adapted to the >= 1.0 changes, but we do not test against older versions.
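The `DataZip.__getitem__()` optimization described above, using `isinstance()` and `dict.get()` instead of try/except, can be illustrated with a stdlib-only sketch; `nested_get` is a hypothetical stand-in, not the real implementation:

```python
_MISSING = object()  # sentinel so dict.get can distinguish "absent" from None


def nested_get(encoded: dict, *keys):
    """Walk ``keys`` into a nested mapping without try/except.

    Mirrors the idea in the note above: check the node type with
    isinstance() and probe with dict.get() so non-dict objects and
    missing keys are handled explicitly rather than via exceptions.
    """
    node = encoded
    for key in keys:
        if not isinstance(node, dict):
            raise KeyError(f"{key!r}: enclosing object is not a mapping")
        node = node.get(key, _MISSING)
        if node is _MISSING:
            raise KeyError(key)
    return node


store = {"results": {"2024": {"capacity_mw": 120.5}}}
print(nested_get(store, "results", "2024", "capacity_mw"))  # → 120.5
```

Only the requested leaf is touched; sibling branches of the structure are never visited, which is what makes single-value reads from a nested archive cheap.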
Bug Fixes

- Fixed a bug where `etoolbox` could not be used if `tqdm` was not installed. As it is an optional dependency, `_optional` should be able to fully address that issue.
- Fixed a bug where the import of `typing.override()` in `etoolbox.utils.logging_utils` broke compatibility with Python 3.11, since the function was added in 3.12.
0.2.0 (2024-02-28)

Complete redesign of system internals and standardization of the data format. This resulted in a few key improvements:

- **Performance:** Decoding is now lazy, so structures and objects are only rebuilt when they are retrieved, rather than when the file is opened. Encoding is done only once, rather than once to make sure it will work and then again when the data is written on close. Further, the correct encoder/decoder is selected using `dict` lookups rather than chains of `isinstance()`.
- **Data Format:** Rather than a convoluted system to flatten the object hierarchy, we preserve the hierarchy in the `__attributes__.json` file. We also provide encoders and decoders that allow all Python builtins, as well as other types, to be stored in `json`. Any data that cannot be encoded to `json` is saved elsewhere, and the entry in `__attributes__.json` contains a pointer to where the data is actually stored. Further, rather than storing some metadata in `__attributes__.json` and some elsewhere, all metadata is now stored alongside the data or pointer in `__attributes__.json`.
- **Custom Classes:** We no longer save custom objects as their own `DataZip`. Their location in the object hierarchy is preserved with a pointer and associated metadata. The object's state is stored separately in a hidden key, `__state__`, in `__attributes__.json`.
- **References:** The old format stored every object as many times as it was referenced. This meant that objects could be stored multiple times, and when the hierarchy was recreated, these objects would be copies. The new process for storing custom classes, `pandas.DataFrame`, `pandas.Series`, and `numpy.array` uses `id()` to make sure we store data only once and that these relationships are recreated when loading data from a `DataZip`.
- **API:** `DataZip` behaves a little like a `dict`. It has `DataZip.get()`, `DataZip.items()`, and `DataZip.keys()`, which do what you would expect. It also implements dunder methods to allow membership checking using `in`, `len()`, and subscripts to get and set items (i.e., `obj[key] = value`); these all behave as you would expect, except that setting an item raises a `KeyError` if the key is already in use. One additional feature of lookups is that you can provide multiple keys, which are looked up recursively, allowing efficient access to data in nested structures. `DataZip.dump()` and `DataZip.load()` are static methods that allow you to directly save and load an object into and from a `DataZip`, similar to `pickle.dump()` and `pickle.load()` except that they also handle opening and closing the file. Finally, `DataZip.replace()` is a little like `typing.NamedTuple._replace()`; it copies the contents of one `DataZip` into a new one, with select keys replaced.
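A toy, in-memory sketch of the dict-like semantics described above: write-once item setting, multi-key subscripts, and a `replace()` that copies with substitutions. `WriteOnceStore` is illustrative only; the real `DataZip` wraps a zip archive:

```python
class WriteOnceStore(dict):
    """Illustrative model of DataZip's mapping behavior."""

    def __setitem__(self, key, value):
        # setting an existing key raises, matching the API note above
        if key in self:
            raise KeyError(f"{key!r} already in use")
        super().__setitem__(key, value)

    def __getitem__(self, key):
        # obj["a", "b", "c"] walks the nested structure recursively
        if isinstance(key, tuple):
            node = self
            for k in key:
                node = node[k]
            return node
        return super().__getitem__(key)

    def replace(self, **changes):
        """Copy contents into a new store, with select keys replaced."""
        new = WriteOnceStore()
        for k, v in self.items():
            dict.__setitem__(new, k, changes.pop(k, v))
        for k, v in changes.items():
            new[k] = v
        return new


dz = WriteOnceStore()
dz["meta"] = {"year": 2024, "tables": {"gens": 3}}
print(dz["meta", "tables", "gens"])  # → 3
print("meta" in dz, len(dz))  # → True 1
```

`dz["meta"] = {...}` a second time would raise `KeyError`, and `dz.replace(meta=...)` is the escape hatch for producing an updated copy, much as the notes describe for `DataZip.replace()`.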
- Added dtype metadata for `pandas` objects, as well as the ability to ignore that metadata to allow use of `pyarrow` dtypes.
- Switched to `ujson` rather than the standard library version for performance.
- Added optional support for `polars.DataFrame`, `polars.LazyFrame`, and `polars.Series` in `DataZip`.
- Added `PretendPudlTabl`; when passed as the `klass` argument to `DataZip.load()`, it allows accessing the dfs in a zipped `pudl.PudlTabl` as you would normally, but avoids the `pudl` dependency.
- Code cleanup, along with the adoption of ruff and the removal of bandit, flake8, isort, etc.
- Added `lazy_import()` to lazily import or proxy a module, inspired by `polars.dependencies.lazy_import`.
- Created tools for proxying `pudl.PudlTabl` to provide access to cached PUDL data without requiring that `pudl` is installed, or at least imported. The process of either loading a `PretendPudlTabl` from cache, or creating and then caching a `pudl.PudlTabl`, is handled by `make_pudl_tabl()`.
- Copied a number of helper functions that we often use from `pudl.helpers` to `pudl_helpers` so they can be used without installing or importing `pudl`.
- Added a very light adaptation of the python-remotezip package to access files within a zip archive without downloading the full archive.
- Updates to `DataZip` encoding and decoding of `pandas.DataFrame` so they work with `pandas` version 2.0.0.
- Updates to `make_pudl_tabl()` and associated functions and classes so that it works with new and changing aspects of `pudl.PudlTabl`, specifically those raised in catalyst#2503. Added testing for full `make_pudl_tabl()` functionality.
- Added `get_pudl_table()`, which reads a table from a `pudl.sqlite` that is stored where it is expected.
- Added support for `polars.DataFrame`, `polars.LazyFrame`, and `polars.Series` to `etoolbox.utils.testing.assert_equal()`.
- `plotly.Figure` objects are now stored as pickles so they can be recreated.
- Updates to `get_pudl_sql_url()` so that it doesn't require PUDL environment variables or config files if the sqlite is at `pudl-work/output/pudl.sqlite`, and tells the user to put the sqlite there if it cannot be found another way.
- New `conform_pudl_dtypes()` function that casts PUDL columns to the dtypes used in `PudlTabl`, useful when loading tables from a sqlite that doesn't preserve all dtype info.
- Added `ungzip()` to help with un-gzipping `pudl.sqlite.gz`; the gzipped version is now used in tests.
- Switched two cases of `with suppress...` to `try`/`except`/`pass` in `DataZip` to take advantage of zero-cost exceptions.
- Deprecations, to be removed in the next release along with supporting infrastructure:
  - `lazy_import()` and the rest of the `lazy_import` module.
  - `PUDL_DTYPES`; use `conform_pudl_dtypes()` instead.
  - `make_pudl_tabl()`, `PretendPudlTabl`, and `PretendPudlTablCore`; read tables directly from the sqlite instead:

```python
import pandas as pd
import sqlalchemy as sa

from etoolbox.utils.pudl import conform_pudl_dtypes, get_pudl_sql_url

pd.read_sql_table(table_name, sa.create_engine(get_pudl_sql_url())).pipe(
    conform_pudl_dtypes
)
```

Or, using `polars`:

```python
import polars as pl

from etoolbox.utils.pudl import get_pudl_sql_url

pl.read_database("SELECT * FROM table_name", get_pudl_sql_url())
```
Bug Fixes

- Allow `typing.NamedTuple` to be used as keys in a `dict` and a `collections.defaultdict`.
- Fixed a bug in `make_pudl_tabl()` where creating and caching a new `pudl.PudlTabl` would fail to load the PUDL package.
- Fixed a bug where attempting to retrieve an empty `pandas.DataFrame` raised an `IndexError` when `ignore_pd_dtypes` is `False`.
- Updated the link for the PUDL database.
Known Issues

- Some legacy `DataZip` files cannot be fully read, especially those with nested structures and custom classes.
- `DataZip` ignores `functools.partial()` objects, at least in most dicts.
0.1.0 (2023-02-27)

What's New?

- Migrated `DataZip` from `rmi.dispatch`, where it didn't really belong. Also added additional functionality, including recursive writing and reading of `list`, `dict`, and `tuple` objects.
- Created `IOMixin` and `IOWrapper` to make it easier to add `DataZip` to other classes.
- Migrated `compare_dfs()` from the Hub.
- Updates to `DataZip`, `IOMixin`, and `IOWrapper` to better manage attributes missing from the original object or the file representation of the object, including the ability to use differently organized versions of `DataZip`.
- Cleanup of `DataZip` internals, both within the object and in laying out files, particularly how metadata and attributes are stored. Added `DataZip.readm()` and `DataZip.writem()` to read and write additional metadata not core to `DataZip`.
- Added support for storing `numpy.array` objects in `DataZip` using `numpy.load()` and `numpy.save()`.
- `DataZip` now handles writing attributes and metadata using `DataZip.close()`, so `DataZip` can now be used with or without a context manager.
- Added `isclose()`, similar to `numpy.isclose()` but allowing comparison of arrays containing strings, especially useful with `pandas.Series`.
- Added a module, `etoolbox.utils.match`, containing the helpers Raymond Hettinger demonstrated in his talk at PyCon Italia for using Python's `case`/`match` syntax.
- Added support for Python 3.11.
- Added support for storing `plotly` figures as `pdf` in `DataZip`.
- Added support for checking whether a file or attribute is stored in `DataZip` using `DataZip.__contains__()`, i.e. using Python's `in`.
- Added support for subscript-based getting and setting of data in `DataZip`.
- Custom Python objects can be serialized with `DataZip` if they implement `__getstate__` and `__setstate__`, or can be serialized using the default logic described in `object.__getstate__()`. That default logic is now implemented in `DataZip.default_getstate()` and `DataZip.default_setstate()`. This replaces the use of `to_file` and `from_file` by `DataZip`. `IOMixin` has been updated accordingly.
- Added static methods `DataZip.dump()` and `DataZip.load()` for serializing a single Python object; these are designed to be similar to how `pickle.dump()` and `pickle.load()` work.
- Removed `IOWrapper`.
- Added `DataZip.replace()`, which copies the contents of an old `DataZip` into a new copy of it, after which you can add to it.
- Extended JSON encoding/decoding to process an expanded set of builtins, standard library, and other common objects, including `tuple`, `set`, `frozenset`, `complex`, `typing.NamedTuple`, `datetime.datetime`, `pathlib.Path`, and `pandas.Timestamp`.
- Added centralized testing helpers.
- Added a subclass of `PudlTabl` that adds back `__getstate__` and `__setstate__` to enable caching; this caching will not work for tables that are not stored in the object, which will be an increasing portion of tables as discussed here.
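The string-tolerant comparison that `isclose()` provides can be sketched in pure Python; `isclose_mixed` below is a hypothetical stand-in for illustration, not eToolBox's actual implementation:

```python
import math


def isclose_mixed(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Elementwise closeness for sequences that may mix strings and
    numbers, in the spirit of eToolBox's ``isclose()``: strings
    compare by equality, numbers by math.isclose, element by element."""
    out = []
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            out.append(x == y)  # strings (or str vs number) must match exactly
        else:
            out.append(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol))
    return out


print(isclose_mixed(["coal", 1.0, 2.0], ["coal", 1.0 + 1e-12, 2.5]))
# → [True, True, False]
```

This is the behavior that makes such a helper handy for `pandas.Series` of mixed `object` dtype, where `numpy.isclose()` would raise on the string entries.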
Bug Fixes

- Fixed an issue where a single-column `pandas.DataFrame` was recreated as a `pandas.Series`. This should now be backwards compatible by applying `pandas.DataFrame.squeeze` if object metadata is not available.
- Fixed a bug that prevented certain kinds of objects from working properly under 3.11.
- Fixed an issue where the name of a `pandas.Series` might get mangled or changed.