eToolBox Release Notes

0.3.1 (2024-XX-XX)

What’s New?

  • Use pyarrow directly in pd_read_pudl() to avoid having dates cast to objects.

  • Compatibility with Python 3.13 tested and included in CI.

  • Declaring optional cloud dependencies of pandas and polars explicitly.

  • Tools for working with data stored on Azure.

  • DataZip now recognizes alternative methods for getting and setting object state, so an object can specify a serialization for DataZip that differs from the one used for pickle. The new methods are _dzgetstate_ and _dzsetstate_.
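
    For example, a class might give DataZip a lighter-weight state than it gives pickle (a hypothetical sketch; only the method names _dzgetstate_ and _dzsetstate_ come from DataZip):

      import pandas as pd

      class Model:
          def __init__(self, data: pd.DataFrame):
              self.data = data
              self._cache = {}  # rebuildable, so excluded from the DataZip state

          def _dzgetstate_(self) -> dict:
              # used by DataZip; pickle still uses the normal __getstate__ logic
              return {"data": self.data}

          def _dzsetstate_(self, state: dict) -> None:
              self.data = state["data"]
              self._cache = {}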

  • storage_options() to simplify reading from/writing to Azure using pandas or polars.
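
    For example (a sketch; the import path and the az:// URL are assumptions, and storage_options() is assumed to return a mapping suitable for pandas’ storage_options argument):

      import pandas as pd

      from etoolbox.utils.cloud import storage_options  # import path assumed

      df = pd.read_parquet(
          "az://container/path/data.parquet",  # illustrative URL
          storage_options=storage_options(),
      )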

  • generator_ownership() compiles ownership information for all generators using data from pudl.

  • New CLI built around a single rmi command with cloud and pudl subcommands for cleaning caches and configs, showing the contents of caches, and, in the cloud case, getting, putting, and listing files.

  • DataZip no longer appends a .zip suffix to file paths passed to its init as strings.

  • Added simplify_strings() to pudl_helpers.

Bug Fixes

  • Fixed a bug in the implementation of the alternative serialization methods that caused recursion or other errors when serializing an object whose class implemented __getattr__.

  • Attempted to fix a doctest bug caused by pytest logging; see pytest#5908.

0.3.0 (2024-10-07)

What’s New?

  • New functions to read pudl tables from parquets in an open-access AWS bucket, pd_read_pudl(), pl_read_pudl(), and pl_scan_pudl(), which handle caching. The polars AWS client does not currently work, so use_polars must be set to False.
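
    For example (a sketch; the import path is assumed and the table name is illustrative):

      from etoolbox.utils.pudl import pd_read_pudl  # import path assumed

      # fetches the parquet from the open-access bucket on first use,
      # then reads from the local cache
      gens = pd_read_pudl("core_eia860__scd_generators")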

  • New pudl_list() to show a list of releases or tables within a release.

  • Restricting platformdirs to >= 3.0, the version in which the config location changed.

  • Removed:

    • read_pudl_table()

    • get_pudl_tables_as_dz()

    • make_pudl_tabl()

    • lazy_import()

  • Created etoolbox.utils.logging_utils with helpers to set up and format loggers in a more performant and structured way, based on an mCoding suggestion. Also replaced module-level loggers with a library-wide logger and removed logger configuration from etoolbox because it is a library. This requires Python >= 3.12.
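
    An application that wants to see etoolbox’s log output now attaches handlers itself, e.g. (a minimal sketch; the logger name is assumed to match the package name):

      import logging

      logging.basicConfig(level=logging.INFO)  # application-level configuration
      logging.getLogger("etoolbox").setLevel(logging.DEBUG)  # opt in to detail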

  • Minor performance improvements to DataZip.keys() and DataZip.__len__().

  • Fixed links to docs for polars, plotly, platformdirs, fsspec, and pudl. At least in theory.

  • Work toward benchmarks for DataZip vs pickle.

  • Optimization in DataZip.__getitem__() for reading a single value from a nested structure without decoding all enclosing objects; we use isinstance() and dict.get() rather than try/except to handle non-dict objects and missing keys.
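
    The lookup pattern is roughly as follows (an illustrative sketch of the technique, not the actual implementation):

      def nested_get(obj, *keys):
          """Walk nested dicts without try/except."""
          sentinel = object()
          for key in keys:
              if not isinstance(obj, dict):  # cannot descend into a non-dict
                  raise KeyError(key)
              obj = obj.get(key, sentinel)
              if obj is sentinel:  # missing key detected without try/except
                  raise KeyError(key)
          return obj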

  • New CLI utility pudl-table-rename that renames PUDL tables in a set of files to the new names used by PUDL.

  • Allow older versions of polars; this is a convenience for some other projects that have not adapted to the >= 1.0 changes, but we do not test against older versions.

Bug Fixes

  • Fixed a bug where etoolbox could not be used if tqdm was not installed. As tqdm is an optional dependency, _optional should be able to fully address that issue.

  • Fixed a bug where the import of typing.override() in etoolbox.utils.logging_utils broke compatibility with Python 3.11, since that function was added in 3.12.
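
    One way to restore 3.11 compatibility is a guarded import (a sketch of the pattern, not necessarily the change that was made):

      import sys

      if sys.version_info >= (3, 12):
          from typing import override
      else:  # typing.override() was added in 3.12
          def override(func):
              return func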

0.2.0 (2024-02-28)

What’s New?

  • Complete redesign of system internals and standardization of the data format. This resulted in several key improvements:

    • Performance: Decoding is now lazy, so structures and objects are only rebuilt when they are retrieved rather than when the file is opened. Encoding is done only once, rather than once to make sure it will work and then again when the data is written on close. Further, the correct encoder/decoder is selected using dict lookups rather than chains of isinstance().

    • Data Format: Rather than a convoluted system to flatten the object hierarchy, we preserve the hierarchy in the __attributes__.json file. We also provide encoders and decoders that allow all Python builtins, as well as other types, to be stored in json. Any data that cannot be encoded to json is saved elsewhere, and the entry in __attributes__.json contains a pointer to where the data is actually stored. Further, rather than storing some metadata in __attributes__.json and some elsewhere, all metadata is now stored alongside the data or pointer in __attributes__.json.

    • Custom Classes: We no longer save custom objects as their own DataZip. Their location in the object hierarchy is preserved with a pointer and associated metadata. The object’s state is stored separately under a hidden key, __state__, in __attributes__.json.

    • References: The old format stored every object as many times as it was referenced, meaning objects could be stored multiple times and, when the hierarchy was recreated, these objects would be copies. The new process for storing custom classes, pandas.DataFrame, pandas.Series, and numpy.array uses id() to make sure we store data only once and that these relationships are recreated when loading data from a DataZip.

    • API: DataZip behaves a little like a dict. It has DataZip.get(), DataZip.items(), and DataZip.keys(), which do what you would expect. It also implements dunder methods to allow membership checking using in, len(), and subscripts to get and set items (i.e. obj[key] = value); these all behave as you would expect, except that setting an item raises a KeyError if the key is already in use. One additional feature of lookups is that you can provide multiple keys, which are looked up recursively, allowing efficient access to data in nested structures. DataZip.dump() and DataZip.load() are static methods that allow you to directly save and load an object into and from a DataZip, similar to pickle.dump() and pickle.load() except that they also handle opening and closing the file. Finally, DataZip.replace() is a little like typing.NamedTuple._replace(); it copies the contents of one DataZip into a new one, with select keys replaced. A sketch of this API follows.
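
      Roughly (a sketch; the import path and open modes are assumptions):

        from etoolbox.datazip import DataZip  # import path assumed

        with DataZip("example.zip", "w") as z:
            z["config"] = {"alpha": 1, "nested": {"beta": 2}}

        with DataZip("example.zip", "r") as z:
            assert "config" in z  # membership checking
            assert z["config", "nested", "beta"] == 2  # recursive multi-key lookup

        # dump/load mirror pickle.dump/pickle.load but also manage the file
        DataZip.dump({"gamma": 3}, "other.zip")
        obj = DataZip.load("other.zip")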

  • Added dtype metadata for pandas objects as well as ability to ignore that metadata to allow use of pyarrow dtypes.

  • Switching to ujson rather than the standard library json module for performance.

  • Added optional support for polars.DataFrame, polars.LazyFrame, and polars.Series in DataZip.

  • Added PretendPudlTabl; when passed as the klass argument to DataZip.load(), it allows accessing the dfs in a zipped pudl.PudlTabl as you normally would while avoiding the pudl dependency.

  • Code cleanup along with adoption of ruff and removal of bandit, flake8, isort, etc.

  • Added lazy_import() to lazily import or proxy a module, inspired by polars.dependencies.lazy_import.

  • Created tools for proxying pudl.PudlTabl to provide access to cached PUDL data without requiring that pudl is installed, or at least imported. The process of either loading a PretendPudlTabl from cache, or creating and then caching a pudl.PudlTabl is handled by make_pudl_tabl().

  • Copied a number of helper functions that we often use from pudl.helpers to pudl_helpers so they can be used without installing or importing pudl.

  • Added a very light adaptation of the python-remotezip package to access files within a zip archive without downloading the full archive.

  • Updates to DataZip encoding and decoding of pandas.DataFrame so they work with pandas version 2.0.0.

  • Updates to make_pudl_tabl() and associated functions and classes so that it works with new and changing aspects of pudl.PudlTabl, specifically those raised in catalyst#2503. Added testing for full make_pudl_tabl() functionality.

  • Added get_pudl_table(), which reads a table from a pudl.sqlite that is stored where it is expected.

  • Added support for polars.DataFrame, polars.LazyFrame, and polars.Series to etoolbox.utils.testing.assert_equal().

  • plotly.Figure objects are now stored as pickles so they can be recreated.

  • Updates to get_pudl_sql_url() so that it doesn’t require PUDL environment variables or config files if the sqlite is at pudl-work/output/pudl.sqlite, and tells the user to put the sqlite there if it cannot be found another way.

  • New conform_pudl_dtypes() function that casts PUDL columns to the dtypes used in PudlTabl, useful when loading tables from a sqlite that doesn’t preserve all dtype info.

  • Added ungzip() to help with un-gzipping pudl.sqlite.gz and now using the gzipped version in tests.

  • Switching two cases of with suppress(...) to try/except/pass in DataZip to take advantage of zero-cost exceptions.
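
    With zero-cost exceptions, a try block that does not raise adds essentially no overhead, while contextlib.suppress pays context-manager setup on every call:

      from contextlib import suppress

      mapping = {"a": 1}

      # before: context-manager overhead on every call
      with suppress(KeyError):
          value = mapping["b"]

      # after: effectively free when no exception is raised
      try:
          value = mapping["b"]
      except KeyError:
          pass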

  • Deprecations: the following will be removed in the next release along with supporting infrastructure:

    • lazy_import() and the rest of the lazy_import module.

    • PUDL_DTYPES, use conform_pudl_dtypes() instead.

    • make_pudl_tabl(), PretendPudlTabl, and PretendPudlTablCore; read tables directly from the sqlite instead:

      # with pandas; table_name is a placeholder for the table you want
      import pandas as pd
      import sqlalchemy as sa

      from etoolbox.utils.pudl import get_pudl_sql_url, conform_pudl_dtypes

      pd.read_sql_table(table_name, sa.create_engine(get_pudl_sql_url())).pipe(
          conform_pudl_dtypes
      )

      # or with polars (in newer polars versions this is pl.read_database_uri())
      import polars as pl

      from etoolbox.utils.pudl import get_pudl_sql_url

      pl.read_database("SELECT * FROM table_name", get_pudl_sql_url())

Bug Fixes

  • Allow typing.NamedTuple instances to be used as keys in a dict and in a collections.defaultdict.

  • Fixed a bug in make_pudl_tabl() where creating and caching a new pudl.PudlTabl would fail to load the PUDL package.

  • Fixed a bug where attempting to retrieve an empty pandas.DataFrame raised an IndexError when ignore_pd_dtypes is False.

  • Updated the link for the PUDL database.

Known Issues

  • Some legacy DataZip files cannot be fully read, especially those with nested structures and custom classes.

  • DataZip ignores functools.partial() objects, at least in most dicts.

0.1.0 (2023-02-27)

What’s New?

  • Migrating DataZip from rmi.dispatch where it didn’t really belong. Also added additional functionality including recursive writing and reading of list, dict, and tuple objects.

  • Created IOMixin and IOWrapper to make it easier to add DataZip to other classes.

  • Migrating compare_dfs() from the Hub.

  • Updates to DataZip, IOMixin, and IOWrapper to better manage attributes missing from the original object or the file representation of the object, including the ability to use differently organized versions of DataZip.

  • Clean up of DataZip internals, both within the object and in laying out files, particularly how metadata and attributes are stored. Added DataZip.readm() and DataZip.writem() to read and write additional metadata not core to DataZip.

  • Added support for storing numpy.array objects in DataZip using numpy.load() and numpy.save().

  • DataZip now handles writing attributes and metadata using DataZip.close() so DataZip can now be used with or without a context manager.

  • Added isclose(), similar to numpy.isclose() but allowing comparison of arrays containing strings, especially useful with pandas.Series.

  • Added a module etoolbox.utils.match containing the helpers Raymond Hettinger demonstrated in his talk at PyCon Italia for using Python’s match/case syntax.

  • Added support for Python 3.11.

  • Added support for storing plotly figures as pdf in DataZip.

  • Added support for checking whether a file or attribute is stored in DataZip using DataZip.__contains__(), i.e. using Python’s in.

  • Added support for subscript-based getting and setting of data in DataZip.

  • Custom Python objects can be serialized with DataZip if they implement __getstate__ and __setstate__, or they can be serialized using the default logic described in object.__getstate__(). That default logic is now implemented in DataZip.default_getstate() and DataZip.default_setstate(). This replaces DataZip’s use of to_file and from_file. IOMixin has been updated accordingly.
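
    For example (a hypothetical class; DataZip uses these methods when they are defined):

      class Point:
          def __init__(self, x, y):
              self.x, self.y = x, y

          def __getstate__(self):
              return {"x": self.x, "y": self.y}

          def __setstate__(self, state):
              self.__dict__.update(state)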

  • Added static methods DataZip.dump() and DataZip.load() for serializing a single Python object; these are designed to be similar to how pickle.dump() and pickle.load() work.

  • Removing IOWrapper.

  • Added DataZip.replace(), which copies the contents of an old DataZip into a new one that you can then add to.

  • Extended JSON encoding / decoding to process an expanded set of builtins, standard library, and other common objects, including tuple, set, frozenset, complex, typing.NamedTuple, datetime.datetime, pathlib.Path, and pandas.Timestamp.
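
    For example, values like these round-trip through a DataZip (a sketch using the DataZip.dump() / DataZip.load() API added in this release; the import path is assumed):

      import datetime
      from pathlib import Path

      from etoolbox.datazip import DataZip  # import path assumed

      data = {
          "when": datetime.datetime(2023, 2, 27),
          "where": Path("some/dir"),
          "z": complex(1, 2),
          "tags": frozenset({"a", "b"}),
      }
      DataZip.dump(data, "builtins.zip")
      assert DataZip.load("builtins.zip")["when"] == data["when"]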

  • Adding centralized testing helpers.

  • Added a subclass of PudlTabl that adds back __getstate__ and __setstate__ to enable caching; this caching will not work for tables that are not stored in the object, which will be an increasing portion of tables as discussed here.

Bug Fixes

  • Fixed an issue where a single-column pandas.DataFrame was recreated as a pandas.Series. This should be backwards compatible: pandas.DataFrame.squeeze is applied if object metadata is not available.

  • Fixed a bug that prevented certain kinds of objects from working properly under 3.11.

  • Fixed an issue where the name for a pandas.Series might get mangled or changed.

Known Issues

  • The recipe system is fragile and bespoke; there really should be a better way…

  • A tuple nested inside other objects may be returned as a list.