Skip to content

Refactor the data.__init__.py module#525

Open
jrycw wants to merge 4 commits into
posit-dev:mainfrom
jrycw:data-access
Open

Refactor the data.__init__.py module#525
jrycw wants to merge 4 commits into
posit-dev:mainfrom
jrycw:data-access

Conversation

@jrycw

@jrycw jrycw commented Nov 28, 2024

Copy link
Copy Markdown
Collaborator

Hello team,

This PR aims to address the Pandas dependency in reading datasets by introducing a unified Dataset API. The proposed approach allows users to retrieve datasets in a user-specified dataframe format. For example:

  • To get a Pandas dataframe for the sza dataset, use Dataset.sza.to_pandas().
  • To get a Polars dataframe for the same dataset, use Dataset.sza.to_polars().
  • To get a PyArrow dataframe for the same dataset, use Dataset.sza.to_pyarrow() (implementation of _convert_to_pyarrow() and to_pyarrow() is needed to support this).

This way, users can use autocomplete to select both the dataset and the desired dataframe type.

To facilitate the transition:

  • Each dataset name begins with an uppercase letter (e.g., Sza).
  • A lowercase variable (e.g., sza) is provided as a Pandas dataframe, created using to_pandas().

If we decide to completely remove Pandas as a dependency in the future, the following tasks will be required:

  • Remove the warning message.
  • Remove the lowercase variables representing Pandas dataframes, and rename the dataset classes to lowercase.
  • Update __all__.
  • Update documentation and tests to reflect the new API.
  • Address code sections marked with # remove pandas for further cleanup.

I’m confident there are other excellent approaches to tackle this issue, so please feel free to modify or reject this PR as needed.

@codecov

codecov Bot commented Nov 28, 2024

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.77%. Comparing base (925fa41) to head (4486114).
⚠️ Report is 349 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #525      +/-   ##
==========================================
- Coverage   89.79%   89.77%   -0.03%     
==========================================
  Files          45       45              
  Lines        5321     5406      +85     
==========================================
+ Hits         4778     4853      +75     
- Misses        543      553      +10     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jrycw

jrycw commented Nov 28, 2024

Copy link
Copy Markdown
Collaborator Author

The Dataset API can be used as follows:

from great_tables import GT
from great_tables.data import Dataset

exibble = Dataset.exibble

# Pandas users
df_pd = exibble.to_pandas()
GT(df_pd).show()


# Polars users
df_pl = exibble.to_polars()
GT(df_pl).show()


# PyArrow users
...

@machow machow left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for working on this! This is a huge step towards not depending on a specific DataFrame library for our examples!

One thing I wonder is whether it might help cut out boilerplate to use a style with a bit more composition, rather than class inheritance.

Comment thread great_tables/data/__init__.py Outdated
def to_pyarrow(cls): ...


class Countrypops(_DataFrameConverter):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this approach works by having the class _DataFrameConverter...

  • specifying a filepath class attribute, etc..
  • subclassing it to define those attributes

One risk of this approach is that a leans a bit hard on inheritance (see composition over inheritance). WDYT of an approach where things like Countrypops a instances of _DataFrameConverter, or something similar?

In this case you could add init arguments to _DataFrameConverter, like...

class _DataFrameConverter:
    def __init__(self, filepath, dtype):
        self.filepath = filepath
        self.dtype = dtype

Countrypops = _DataFrameConverter(
    DATA_MOD / "01-countrypops.csv",
    dtype = {
        "country_name": str,
        "country_code_2": str,
        "country_code_3": str,
        "year": int,
        "population": int,
    }    
)
"""A docstring"""

@jrycw jrycw Dec 8, 2024

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machow , thank you for the excellent suggestion! Since we're moving towards favoring composition, renaming _DataFrameConverter to _Dataset seems more appropriate, and passing the docstrings as arguments aligns well with this approach. In addition, I've taken the liberty of implementing the Pyarrow conversion and added a __repr__() method, though I'm uncertain about its utility—feedback on this would be appreciated.

For the internal datasets, I recommend retaining the current parsing approach with Pandas. It appears to be more robust, and I've noticed that handling certain cases with Polars, such as x-airquality.csv (missing values) and x_locales.csv (complex types, which Pandas handles well with the object dtype) is a bit tricky. Since these datasets are used internally, we should have room to explore better solutions for these edge cases.

Comment thread great_tables/data/__init__.py Outdated
"mass_excess_uncert": "float64",
}

class _DataFrameConverter:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case it's useful, I ended up creating a tiny DataFrame like implementation in reactable-py, just so I could feed example data for demos

https://github.com/machow/reactable-py/blob/main/reactable/simpleframe.py

I think a nice advantage of your approach here though is that it doesn't read the csv until you use one of the to_*() methods!

@jrycw jrycw Dec 8, 2024

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great! It seems you'll need to keep the data as an internal variable within SimpleFrame to enable many get and set actions. However, for our dataset use case, we should be able to store just the filepath and dtype rather than holding the entire dataset in memory.

@jrycw

jrycw commented Dec 8, 2024

Copy link
Copy Markdown
Collaborator Author

The CI failure seems unrelated to our codebase.

Run browser-actions/setup-edge@v1
Setup Edge stable
Attempting to download Edge stable...
Error: Artifact not found of Edge stable for platform linux amd6[4](https://github.com/posit-dev/great-tables/actions/runs/12219134318/job/34085425230#step:6:5)

@machow

machow commented Dec 10, 2024

Copy link
Copy Markdown
Collaborator

@rich-iannone do you mind taking a look at this? In particular, would love to get your take on...

  • What variable names should we have in great_tables.data. E.g. currently there is...
    • data.Islands for the _Dataset instance, which has to_polars(), to_pandas() methods
    • data.islands (note the lowercase i) for the `pd.DataFrame
  • What should the default be now, and in the long term? (see below)

Here are some possible approaches we could take to data.

Backwards compatible: keep pandas the default, provide new variables for _Dataset objects

  • e.g. data.Islands and data.islands both importable
  • I'd prefer we eventually just have 1 thing. We could always make a _Dataset instance run some to_*() method by default, when passed to GT(), if e.g. polars is installed, etc...

Agnostic: shift to _Dataset objects (e.g. Islands)

  • deprecate pandas-by-import:
    • when users import variables like data.islands
    • give a warning that its deprecated
    • suggest user imports data.Islands and runs the .to_pandas() method

Agnostic: shift to _Dataset and eventually rename (e.g. Islands -> islands)

  • deprecate pandas-by-import, with alternative import path: when users import variables like data.islands
    • when users import variables like data.islands
    • give a warning that its deprecated
    • suggest user imports data.v0.islands
    • provide to_pandas(), to_polars(), to_...(), on data.islands -- as a subclass of pandas or something

@rich-iannone

Copy link
Copy Markdown
Member

@machow out of the three options you provided, I prefer the third one: "shift to _Dataset and eventually rename (e.g. Islands -> islands)".

@machow

machow commented Dec 10, 2024

Copy link
Copy Markdown
Collaborator

Noice -- another way to do the third thing could be something like this?:

  • data.islands: current behavior, deprecate -- tell them to do v0 import.
  • data.v0.islands: old behavior
  • data.v1.islands: new behavior
  • X months from now, data.islands: new behavior (and we are fully dataframe agnostic!)

edit: eh, lemme think about this compared to the original third option 😭

@jrycw

jrycw commented Dec 14, 2024

Copy link
Copy Markdown
Collaborator Author

Include the previously discussed issue #91 as a reference.

@rich-iannone

Copy link
Copy Markdown
Member

From pairing with @machow, we landed on using:

  • a SimpleFrame for data.dataset
  • providing the to_polars() and to_pandas() methods on data.dataset (e.g., data.dataset.to_polars())

The release process would then be

  • next release: ensure that datasets have to_polars() and to_pandas() methods
  • breaking release: implement SimpleFrame for datasets (along with the above methods)

@machow

machow commented Jan 14, 2025

Copy link
Copy Markdown
Collaborator

@jrycw if it's okay with you, I can take up this issue--since it's a good chance to generalize simple frame from reactable to here. Thanks for doing so much exploration on this -- the .to_*() methods are super helpful to see, in terms of what the different libraries need to work!

If it's okay, I'll close this and open something with simple frame vendored into Great Tables

@jrycw

jrycw commented Jan 14, 2025

Copy link
Copy Markdown
Collaborator Author

@machow Sure! SimpleFrame seems like a great choice. I wonder if I would have gone for it had I known about this option earlier.

@machow

machow commented Jan 14, 2025

Copy link
Copy Markdown
Collaborator

I think this PR as is made a lot of sense -- it pushed us to discuss where we wanted example data to be a year from now. So I definitely think it was the right choice in terms of getting things moving!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants