Refactor the `data.__init__.py` module by jrycw · Pull Request #525 · posit-dev/great-tables

jrycw · 2024-11-28T14:00:18Z

Hello team,

This PR aims to address the Pandas dependency in reading datasets by introducing a unified Dataset API. The proposed approach allows users to retrieve datasets in a user-specified dataframe format. For example:

To get a Pandas dataframe for the sza dataset, use Dataset.sza.to_pandas().
To get a Polars dataframe for the same dataset, use Dataset.sza.to_polars().
To get a PyArrow dataframe for the same dataset, use Dataset.sza.to_pyarrow() (implementation of _convert_to_pyarrow() and to_pyarrow() is needed to support this).

This way, users can use autocomplete to select both the dataset and the desired dataframe type.

To facilitate the transition:

Each dataset name begins with an uppercase letter (e.g., Sza).
A lowercase variable (e.g., sza) is provided as a Pandas dataframe, created using to_pandas().

If we decide to completely remove Pandas as a dependency in the future, the following tasks will be required:

Remove the warning message.
Remove the lowercase variables representing Pandas dataframes, and rename the dataset classes to lowercase.
Update __all__.
Update documentation and tests to reflect the new API.
Address code sections marked with # remove pandas for further cleanup.

I’m confident there are other excellent approaches to tackle this issue, so please feel free to modify or reject this PR as needed.

codecov · 2024-11-28T14:02:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.77%. Comparing base (925fa41) to head (4486114).
⚠️ Report is 349 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #525      +/-   ##
==========================================
- Coverage   89.79%   89.77%   -0.03%     
==========================================
  Files          45       45              
  Lines        5321     5406      +85     
==========================================
+ Hits         4778     4853      +75     
- Misses        543      553      +10

jrycw · 2024-11-28T14:19:47Z

The Dataset API can be used as follows:

from great_tables import GT
from great_tables.data import Dataset

exibble = Dataset.exibble

# Pandas users
df_pd = exibble.to_pandas()
GT(df_pd).show()


# Polars users
df_pl = exibble.to_polars()
GT(df_pl).show()


# PyArrow users
...

machow

Hey, thanks for working on this! This is a huge step towards not depending on a specific DataFrame library for our examples!

One thing I wonder is whether it might help cut out boilerplate to use a style with a bit more composition, rather than class inheritance.

machow · 2024-12-04T22:08:34Z

+    def to_pyarrow(cls): ...
+
+
+class Countrypops(_DataFrameConverter):


Currently, this approach works by having the class _DataFrameConverter...

specifying a filepath class attribute, etc.

subclassing it to define those attributes

One risk of this approach is that it leans a bit hard on inheritance (see composition over inheritance). WDYT of an approach where things like Countrypops are instances of _DataFrameConverter, or something similar?

In this case you could add init arguments to _DataFrameConverter, like...

class _DataFrameConverter:
    def __init__(self, filepath, dtype):
        self.filepath = filepath
        self.dtype = dtype


Countrypops = _DataFrameConverter(
    DATA_MOD / "01-countrypops.csv",
    dtype={
        "country_name": str,
        "country_code_2": str,
        "country_code_3": str,
        "year": int,
        "population": int,
    },
)
"""A docstring"""

@machow, thank you for the excellent suggestion! Since we're moving towards favoring composition, renaming _DataFrameConverter to _Dataset seems more appropriate, and passing the docstrings as arguments aligns well with this approach. In addition, I've taken the liberty of implementing the PyArrow conversion and added a __repr__() method, though I'm uncertain about its utility; feedback on this would be appreciated.

For the internal datasets, I recommend retaining the current parsing approach with Pandas. It appears to be more robust, and I've noticed that handling certain cases with Polars, such as x-airquality.csv (missing values) and x_locales.csv (complex types, which Pandas handles well with the object dtype) is a bit tricky. Since these datasets are used internally, we should have room to explore better solutions for these edge cases.
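To make the composition-based design concrete, here is a minimal sketch of what a lazily loaded `_Dataset` could look like. It is not the code from this PR: the method bodies, the `doc` argument name, and the `DATA_MOD` placeholder path are assumptions for illustration, and the dtype mapping is only applied in the Pandas path.

```python
from pathlib import Path


class _Dataset:
    """Hypothetical dataset handle: stores where the CSV lives, not its contents."""

    def __init__(self, filepath, dtype=None, doc=""):
        self.filepath = Path(filepath)
        self.dtype = dtype
        self.__doc__ = doc  # docstring passed in as an argument

    def __repr__(self):
        return f"_Dataset({self.filepath.name!r})"

    def to_pandas(self):
        import pandas as pd  # imported only when a Pandas frame is requested

        return pd.read_csv(self.filepath, dtype=self.dtype)

    def to_polars(self):
        import polars as pl  # imported only when a Polars frame is requested

        # dtype translation is left out of this sketch; Polars infers the schema.
        return pl.read_csv(self.filepath)

    def to_pyarrow(self):
        from pyarrow import csv as pa_csv  # imported only when needed

        return pa_csv.read_csv(str(self.filepath))


# DATA_MOD is an assumed placeholder for the package's data directory.
DATA_MOD = Path("data")

Countrypops = _Dataset(
    DATA_MOD / "01-countrypops.csv",
    dtype={"country_name": str, "year": int, "population": int},
    doc="A docstring",
)
```

Creating `Countrypops` touches neither the file system nor any dataframe library; the CSV is read only when one of the `to_*()` methods is called.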

machow · 2024-12-04T22:09:39Z

-    "mass_excess_uncert": "float64",
-}
+
+class _DataFrameConverter:


In case it's useful, I ended up creating a tiny DataFrame like implementation in reactable-py, just so I could feed example data for demos

https://github.com/machow/reactable-py/blob/main/reactable/simpleframe.py

I think a nice advantage of your approach here though is that it doesn't read the csv until you use one of the to_*() methods!

The code looks great! It seems you'll need to keep the data as an internal variable within SimpleFrame to enable many get and set actions. However, for our dataset use case, we should be able to store just the filepath and dtype rather than holding the entire dataset in memory.

jrycw · 2024-12-08T05:47:54Z

The CI failure seems unrelated to our codebase.

Run browser-actions/setup-edge@v1
Setup Edge stable
Attempting to download Edge stable...
Error: Artifact not found of Edge stable for platform linux amd64
(https://github.com/posit-dev/great-tables/actions/runs/12219134318/job/34085425230#step:6:5)

machow · 2024-12-10T16:01:35Z

@rich-iannone do you mind taking a look at this? In particular, would love to get your take on...

What variable names should we have in great_tables.data? E.g., currently there is...
- data.Islands for the _Dataset instance, which has to_polars(), to_pandas() methods
- data.islands (note the lowercase i) for the `pd.DataFrame`
What should the default be now, and in the long term? (see below)

Here are some possible approaches we could take to data.

Backwards compatible: keep pandas the default, provide new variables for `_Dataset` objects

e.g. data.Islands and data.islands both importable
I'd prefer we eventually just have 1 thing. We could always make a _Dataset instance run some to_*() method by default, when passed to GT(), if e.g. polars is installed, etc...

Agnostic: shift to `_Dataset` objects (e.g. `Islands`)

deprecate pandas-by-import:
- when users import variables like data.islands
- give a warning that it's deprecated
- suggest user imports data.Islands and runs the .to_pandas() method

Agnostic: shift to `_Dataset` and eventually rename (e.g. `Islands` -> `islands`)

deprecate pandas-by-import, with alternative import path:
- when users import variables like data.islands
- give a warning that it's deprecated
- suggest user imports data.v0.islands
- provide to_pandas(), to_polars(), to_...(), on data.islands -- as a subclass of pandas or something
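One way the "warn when users import data.islands" step could be wired up is a module-level `__getattr__` (PEP 562). The sketch below is only an illustration under assumptions: it builds a stand-in `data` module with `types.ModuleType` rather than touching `great_tables.data`, and `_LEGACY`, `_module_getattr`, and the warning text are hypothetical.

```python
import types
import warnings

# Stand-in for the real `great_tables.data` module (assumption for this sketch).
data = types.ModuleType("data")

# Placeholder for the legacy Pandas dataframes; a string keeps the sketch
# runnable without Pandas installed.
_LEGACY = {"islands": "<legacy pandas DataFrame>"}


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails (PEP 562).
    if name in _LEGACY:
        warnings.warn(
            f"`data.{name}` is deprecated; "
            f"use `data.{name.capitalize()}.to_pandas()` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return _LEGACY[name]
    raise AttributeError(f"module 'data' has no attribute {name!r}")


data.__getattr__ = _module_getattr

# Accessing the legacy name still works, but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_islands = data.islands
```

In a real package the function would simply be named `__getattr__` at the top level of `data/__init__.py`; the replacement suggested in the message would depend on which of the options above is chosen.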

rich-iannone · 2024-12-10T16:19:00Z

@machow out of the three options you provided, I prefer the third one: "shift to _Dataset and eventually rename (e.g. Islands -> islands)".

machow · 2024-12-10T18:22:21Z

Noice -- another way to do the third thing could be something like this?:

data.islands: current behavior, deprecate -- tell them to do v0 import.
data.v0.islands: old behavior
data.v1.islands: new behavior
X months from now, data.islands: new behavior (and we are fully dataframe agnostic!)

edit: eh, lemme think about this compared to the original third option 😭

jrycw · 2024-12-14T15:25:35Z

Include the previously discussed issue #91 as a reference.

rich-iannone · 2025-01-14T16:08:13Z

From pairing with @machow, we landed on using:

a SimpleFrame for data.dataset
providing the to_polars() and to_pandas() methods on data.dataset (e.g., data.dataset.to_polars())

The release process would then be

next release: ensure that datasets have to_polars() and to_pandas() methods
breaking release: implement SimpleFrame for datasets (along with the above methods)
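As a rough picture of the agreed direction, the sketch below shows a tiny column store that exposes `to_pandas()` and `to_polars()`. It is a hypothetical illustration, not the `SimpleFrame` from reactable-py: the constructor, the `from_csv()` helper, and the sample CSV text are all assumptions, and values are kept as strings because no dtype handling is included.

```python
import csv
import io


class SimpleFrame:
    """Hypothetical minimal column store; not the reactable-py implementation."""

    def __init__(self, columns):
        self.columns = columns  # dict mapping column name -> list of values

    @classmethod
    def from_csv(cls, handle):
        # Read every row with the stdlib csv module; values stay as strings.
        reader = csv.DictReader(handle)
        columns = {name: [] for name in reader.fieldnames or []}
        for row in reader:
            for name, value in row.items():
                columns[name].append(value)
        return cls(columns)

    def to_pandas(self):
        import pandas as pd  # imported only when requested

        return pd.DataFrame(self.columns)

    def to_polars(self):
        import polars as pl  # imported only when requested

        return pl.DataFrame(self.columns)


# Tiny made-up CSV so the sketch runs without any package data files.
frame = SimpleFrame.from_csv(io.StringIO("name,value\na,1\nb,2\n"))
```

With this shape, neither Pandas nor Polars is needed to hold the example data; each library is imported only by the conversion method that needs it.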

machow · 2025-01-14T16:22:27Z

@jrycw if it's okay with you, I can take up this issue--since it's a good chance to generalize simple frame from reactable to here. Thanks for doing so much exploration on this -- the .to_*() methods are super helpful to see, in terms of what the different libraries need to work!

If it's okay, I'll close this and open something with simple frame vendored into Great Tables

jrycw · 2025-01-14T16:44:35Z

@machow Sure! SimpleFrame seems like a great choice. I wonder if I would have gone for it had I known about this option earlier.

machow · 2025-01-14T17:50:05Z

I think this PR as is made a lot of sense -- it pushed us to discuss where we wanted example data to be a year from now. So I definitely think it was the right choice in terms of getting things moving!

jrycw added 2 commits November 28, 2024 21:51

Refactor the data.__init__.py module

2400d52

Update basic-styling.qmd

f8ac978

machow requested changes Dec 4, 2024

jrycw added 2 commits December 8, 2024 13:16

Refactor data.__init__.py to favor composition over inheritance

762ec6c

Update test_datasets.py to test the conversion of different dataframes

4486114

machow mentioned this pull request Jan 14, 2025

Enable polars users to easily access to package datasets #91

Open
