feat: add __arrow_c_stream__ function #11338

Open
jules-ch wants to merge 19 commits into pydata:main from jules-ch:arrow-pycapsule-datarray

@jules-ch jules-ch commented May 13, 2026

Description

Add the Arrow PyCapsule method (__arrow_c_stream__) to quickly convert a DataArray to polars.

The function is mostly zero-copy; only the coordinate grid needs to be computed.

I wanted to implement the __arrow_c_array__ function to return a fixed_shape_tensor, but polars somehow prioritizes it over the __arrow_c_stream__ method.

So for convenience I leave this here for now.

Feel free to close this PR and discuss this further in a dedicated issue if you want.

We can go one step further and save memory with pa.DictionaryArray, using the index encoding that pyarrow supports out of the box; we just need to create the indices with numpy beforehand.

This enables:

>>> import polars as pl
>>> import pyarrow as pa
>>> import xarray as xr
>>> ds = xr.tutorial.load_dataset("air_temperature")
>>> df = pl.from_arrow(ds.air)
>>> df
shape: (3_869_000, 4)
┌─────────────────────┬──────┬───────┬────────┐
│ time                ┆ lat  ┆ lon   ┆ air    │
│ ---                 ┆ ---  ┆ ---   ┆ ---    │
│ datetime[ns]        ┆ f32  ┆ f32   ┆ f64    │
╞═════════════════════╪══════╪═══════╪════════╡
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 200.0 ┆ 241.2  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 202.5 ┆ 242.5  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 205.0 ┆ 243.5  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 207.5 ┆ 244.0  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 210.0 ┆ 244.1  │
│ …                   ┆ …    ┆ …     ┆ …      │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 320.0 ┆ 297.39 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 322.5 ┆ 297.19 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 325.0 ┆ 296.49 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 327.5 ┆ 296.19 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 330.0 ┆ 295.69 │
└─────────────────────┴──────┴───────┴────────┘

Checklist

AI Disclosure

  • This PR contains AI-generated content.
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR. Tools: {e.g., Claude, Codex, GitHub Copilot, ChatGPT, etc.}

@jules-ch jules-ch force-pushed the arrow-pycapsule-datarray branch 3 times, most recently from bc5c796 to e5e19d6 Compare May 13, 2026 18:32
@jules-ch jules-ch marked this pull request as draft May 14, 2026 13:08
Comment thread xarray/core/dataarray.py
Comment on lines +491 to +492
if not values.flags.c_contiguous:
    values = np.ascontiguousarray(values)

Author

I think we can just use values.ravel() further down to ensure a contiguous array.
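A small NumPy check of that claim, using toy arrays rather than the PR's data: `ravel()` returns a C-contiguous 1-D array, copying only when the input is not already contiguous.

```python
import numpy as np

# A transposed view is not C-contiguous.
values = np.arange(6.0).reshape(2, 3).T
assert not values.flags.c_contiguous

# ravel() has to copy here, and the result is C-contiguous.
flat = values.ravel()
assert flat.flags.c_contiguous

# For an already contiguous array, ravel() returns a view (no copy).
contiguous = np.arange(6.0).reshape(2, 3)
view = contiguous.ravel()
assert np.shares_memory(view, contiguous)
```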

jules-ch commented May 14, 2026

Author

Thought

For xr.Dataset, I think we can loop over the DataArrays, convert them to pyarrow tables, and join them into one big pyarrow Table.

It will be sparse if the DataArrays do not share the same coords, but that's another PR altogether.

Edit:

Well that worked better than I thought it would:

>>> pl.DataFrame(ds)
shape: (694_080, 7)
┌───────┬───────┬──────────┬───────────┬──────────┬──────────┬──────────┐
│ month ┆ level ┆ latitude ┆ longitude ┆ z        ┆ u        ┆ v        │
│ ---   ┆ ---   ┆ ---      ┆ ---       ┆ ---      ┆ ---      ┆ ---      │
│ i32   ┆ i32   ┆ f32      ┆ f32       ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪═══════╪══════════╪═══════════╪══════════╪══════════╪══════════╡
│ 1     ┆ 850   ┆ -60.0    ┆ 60.0      ┆ 11536.64 ┆ 4.968181 ┆ 0.351744 │
│       ┆       ┆          ┆           ┆ 4639     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 60.75     ┆ 11534.91 ┆ 5.015362 ┆ 0.359389 │
│       ┆       ┆          ┆           ┆ 9612     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 61.5      ┆ 11531.46 ┆ 5.015362 ┆ 0.367034 │
│       ┆       ┆          ┆           ┆ 9557     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 62.25     ┆ 11529.74 ┆ 5.046816 ┆ 0.359389 │
│       ┆       ┆          ┆           ┆ 4529     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 63.0      ┆ 11528.01 ┆ 5.046816 ┆ 0.328331 │
│       ┆       ┆          ┆           ┆ 9502     ┆          ┆          │
│ …     ┆ …     ┆ …        ┆ …         ┆ …        ┆ …        ┆ …        │
│ 7     ┆ 500   ┆ -57.0    ┆ 8.25      ┆ 49749.45 ┆ 17.24943 ┆ 0.429629 │
│       ┆       ┆          ┆           ┆ 3099     ┆ 3        ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 9.0       ┆ 49747.72 ┆ 17.24943 ┆ 0.375157 │
│       ┆       ┆          ┆           ┆ 8072     ┆ 3        ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 9.75      ┆ 49746.00 ┆ 17.37525 ┆ 0.328331 │
│       ┆       ┆          ┆           ┆ 3044     ┆          ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 10.5      ┆ 49744.27 ┆ 17.37525 ┆ 0.281027 │
│       ┆       ┆          ┆           ┆ 8017     ┆          ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 11.25     ┆ 49742.55 ┆ 17.56240 ┆ 0.2342   │
│       ┆       ┆          ┆           ┆ 2989     ┆ 2        ┆          │
└───────┴───────┴──────────┴───────────┴──────────┴──────────┴──────────┘

I will make another PR if this one gets merged.

@jules-ch

Author

I am going to need guidance on adding the polars dependency to pixi. There are quite a lot of groups, and I am not very familiar with which ones are used for CI testing. Either we add polars so that the xarray -> polars conversion gets some testing, or we only support the pyarrow conversion and do not test the xarray -> polars one.

Comment thread xarray/core/dataarray.py Outdated
@jules-ch jules-ch marked this pull request as ready for review May 22, 2026 16:26
@jules-ch jules-ch force-pushed the arrow-pycapsule-datarray branch from 6efce04 to 276c394 Compare June 8, 2026 17:33
jules-ch commented Jun 8, 2026

Author

I think the implementation is nearly ready. The naive implementation did not work for coordinates with multiple dimensions; I fixed that with manual broadcasting, which is mostly views at the memory level until a copy is needed for the coordinate values.

On a small benchmark, the conversion to polars is around 10x faster than a bare:

df = pl.DataFrame(da.to_dataframe().reset_index())

Comment thread xarray/core/dataarray.py Outdated
Comment thread xarray/core/dataarray.py
Comment on lines +560 to +576
# Order axes based on Variable dims
dim_order = [coord.dims.index(dim) for dim in dims if dim in coord.dims]

# Reorder coords values to variable dim order
ordered_coords = coord.values.transpose(dim_order)

# Expand coord dims
# coord dims (x, y) variable dims (x,y,z) -> (x, y, 1)
# NOTE: Insert a length-1 axis for each data dim missing for coordinates
# (slice(None) keeps an existing axis, np.newaxis adds one)
indexer = tuple(
    slice(None) if dim in coord.dims else np.newaxis for dim in dims
)
expanded_coords = ordered_coords[indexer]

# Broadcast to full flattened shape (x, y, 1) -> (x, y, z)
broadcasted = np.broadcast_to(expanded_coords, shape)
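The steps in this hunk can be tried in isolation with plain NumPy. In the sketch below, `broadcast_coord` is a hypothetical helper that takes the coordinate values and dimension names as separate arguments instead of an xarray coordinate object; it is not the function in the PR.

```python
import numpy as np


def broadcast_coord(coord_values, coord_dims, dims, shape):
    """Hypothetical helper mirroring the hunk above with plain NumPy inputs."""
    # Axes of the coordinate, listed in the order the data dims use.
    dim_order = [coord_dims.index(dim) for dim in dims if dim in coord_dims]
    ordered = coord_values.transpose(dim_order)
    # Insert a length-1 axis for every data dim the coordinate lacks.
    indexer = tuple(
        slice(None) if dim in coord_dims else np.newaxis for dim in dims
    )
    # broadcast_to returns a read-only view; nothing is copied at this point.
    return np.broadcast_to(ordered[indexer], shape)


dims = ("x", "y", "z")
shape = (2, 3, 4)
# A 2-D coordinate stored as (y, x), i.e. transposed relative to the data.
coord = np.arange(6).reshape(3, 2)
out = broadcast_coord(coord, ("y", "x"), dims, shape)

print(out.shape)
print(np.shares_memory(out, coord))
```

The broadcast result has a zero stride along `z`, so the copy only happens later, when the array is flattened for export.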

Contributor

can we share this with the to_dataframe path? alternatively, can we rewrite the to_dataframe path to use this function? (that would be a different PR)

Author

I am a bit wary of using this for the to_dataframe function for now. The paths seem similar, but to_dataframe first converts to a Dataset and then to a DataFrame, turning the coords into a pandas MultiIndex.

I think another PR is the way to go, and I'm not sure we can factor everything here.

Comment thread xarray/tests/test_dataarray.py

@requires_dask
@requires_pyarrow
def test_dask_dataarray(self):

Contributor

what is the intent here? Does it load to memory?

Author

I just wanted to have a test for dask too, to ensure everything works as intended.

Also, in a follow-up PR we can implement chunking, streaming each dask chunk as a RecordBatch for native Arrow streaming.

I can remove it, if you find it out of scope.

Author

Yes, for now it loads into memory: the data is cast to numpy values and then converted to a pyarrow array before the table is constructed from it.

Contributor

for now it loads into memory

We should error and ask the user to compute instead. Xarray does not compute by default.

Author

You want an explicit compute, is that right? to_dataframe also loads data when given a dask-backed DataArray:

import dask.array as da
import numpy as np
import xarray as xr

dask_da = xr.DataArray(
    da.from_array(np.arange(6, dtype=float).reshape(2, 3)),
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [10, 20, 30]},
    name="data",
)
print(dask_da.to_dataframe())

I can raise an error if you want.

Contributor

yep I prefer the error. That behaviour seems like a bug and we should deprecate it
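One possible shape for such a guard, as a sketch only: `require_in_memory`, the error type, and the message are all hypothetical, and `FakeLazyArray` exists just so the example runs without dask installed. It relies on dask collections exposing a `__dask_graph__` method.

```python
def require_in_memory(data, method="__arrow_c_stream__"):
    """Hypothetical guard: refuse to export lazily evaluated data."""
    # Dask collections expose __dask_graph__; checking by attribute name
    # avoids importing dask just for this test.
    if hasattr(data, "__dask_graph__"):
        raise ValueError(
            f"{method} does not compute lazy arrays implicitly; "
            "call .compute() or .load() first."
        )
    return data


class FakeLazyArray:
    """Stand-in for a dask array so the sketch runs without dask."""

    def __dask_graph__(self):
        return {}


# In-memory data passes through unchanged.
print(require_in_memory([1.0, 2.0]))

# Lazy data is rejected with an explicit message.
try:
    require_in_memory(FakeLazyArray())
except ValueError as err:
    print(err)
```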

Comment thread xarray/core/dataarray.py
Comment thread xarray/core/dataarray.py
jules-ch and others added 2 commits June 10, 2026 07:42
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas)

2 participants