feat: add __arrow_c_stream__ function by jules-ch · Pull Request #11338 · pydata/xarray

jules-ch · 2026-05-13T18:25:50Z

Description

Add an Arrow PyCapsule method to quickly convert a DataArray to polars.

The function is mostly zero-copy; only the coordinate grid needs to be computed.

I wanted to implement the __arrow_c_array__ function to return a fixed_shape_tensor, but polars prioritizes it over the __arrow_c_stream__ method.

So for convenience I leave this here for now.

Feel free to close this PR and discuss this further in a dedicated issue if you want.

We can go one step further to save memory by using pa.DictionaryArray, since pyarrow supports index encoding out of the box; we would just need to create the indices with numpy beforehand.
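The dictionary-encoding idea could be sketched as follows, assuming a regular grid where each coordinate is 1-D. The helper name `dictionary_indices` is hypothetical and not part of the PR; the index array it returns is the kind of input `pa.DictionaryArray.from_arrays(indices, dictionary)` expects as its first argument.

```python
import numpy as np


def dictionary_indices(shape, axis, dtype=np.int32):
    """Return flat integer codes pointing into the 1-D coordinate along `axis`.

    Hypothetical helper for illustration only, not part of the PR.
    """
    # Positions 0..n-1 along the chosen axis, with length-1 axes elsewhere.
    index_shape = [1] * len(shape)
    index_shape[axis] = shape[axis]
    positions = np.arange(shape[axis], dtype=dtype).reshape(index_shape)
    # broadcast_to returns a view; ravel() materialises the flat index column.
    return np.broadcast_to(positions, shape).ravel()


shape = (2, 3)  # e.g. (time, lat)
lat = np.array([75.0, 72.5, 70.0], dtype=np.float32)
lat_codes = dictionary_indices(shape, axis=1)

# Indexing the dictionary with the codes reproduces the expanded column.
expanded = lat[lat_codes]
```

Whether this saves memory depends on the index width relative to the value width: it likely helps most for wide values such as timestamps or strings.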

This enables:

>>> import polars as pl
>>> import pyarrow as pa
>>> import xarray as xr
>>> ds = xr.tutorial.load_dataset("air_temperature")
>>> df = pl.from_arrow(ds.air)
>>> df
shape: (3_869_000, 4)
┌─────────────────────┬──────┬───────┬────────┐
│ time                ┆ lat  ┆ lon   ┆ air    │
│ ---                 ┆ ---  ┆ ---   ┆ ---    │
│ datetime[ns]        ┆ f32  ┆ f32   ┆ f64    │
╞═════════════════════╪══════╪═══════╪════════╡
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 200.0 ┆ 241.2  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 202.5 ┆ 242.5  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 205.0 ┆ 243.5  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 207.5 ┆ 244.0  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 210.0 ┆ 244.1  │
│ …                   ┆ …    ┆ …     ┆ …      │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 320.0 ┆ 297.39 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 322.5 ┆ 297.19 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 325.0 ┆ 296.49 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 327.5 ┆ 296.19 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 330.0 ┆ 295.69 │
└─────────────────────┴──────┴───────┴────────┘

Checklist

Closes Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas) #10135
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

AI Disclosure

This PR contains AI-generated content.
- I have tested any AI-generated content in my PR.
- I take responsibility for any AI-generated content in my PR. Tools: {e.g., Claude, Codex, GitHub Copilot, ChatGPT, etc.}

jules-ch · 2026-05-14T16:07:38Z

+        if not values.flags.c_contiguous:
+            values = np.ascontiguousarray(values)


I think we can just use values.ravel() further down to ensure a contiguous array.
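A small numpy check of that claim: `ravel()` returns a C-contiguous 1-D array in both cases, as a view when the input is already contiguous and as a copy otherwise. The array names here are illustrative only.

```python
import numpy as np

values = np.arange(6, dtype=float).reshape(2, 3)
transposed = values.T  # a non-contiguous view of the same buffer

flat_view = values.ravel()      # input already C-contiguous: no copy
flat_copy = transposed.ravel()  # input not C-contiguous: ravel copies

contiguous_flags = (flat_view.flags.c_contiguous, flat_copy.flags.c_contiguous)
shares_memory = (
    np.shares_memory(values, flat_view),
    np.shares_memory(transposed, flat_copy),
)
```

So the explicit `np.ascontiguousarray` call is redundant as long as the flattened result is what gets passed on.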

jules-ch · 2026-05-14T16:31:58Z

Thought

For xr.Dataset, I think we can loop over the DataArrays, convert each to a pyarrow Table, and join them into one big pyarrow Table.

The result will be sparse if the DataArrays do not share the same coords, but that's another PR altogether.

Edit:

Well that worked better than I thought it would:

>>> pl.DataFrame(ds)
shape: (694_080, 7)
┌───────┬───────┬──────────┬───────────┬──────────┬──────────┬──────────┐
│ month ┆ level ┆ latitude ┆ longitude ┆ z        ┆ u        ┆ v        │
│ ---   ┆ ---   ┆ ---      ┆ ---       ┆ ---      ┆ ---      ┆ ---      │
│ i32   ┆ i32   ┆ f32      ┆ f32       ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪═══════╪══════════╪═══════════╪══════════╪══════════╪══════════╡
│ 1     ┆ 850   ┆ -60.0    ┆ 60.0      ┆ 11536.64 ┆ 4.968181 ┆ 0.351744 │
│       ┆       ┆          ┆           ┆ 4639     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 60.75     ┆ 11534.91 ┆ 5.015362 ┆ 0.359389 │
│       ┆       ┆          ┆           ┆ 9612     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 61.5      ┆ 11531.46 ┆ 5.015362 ┆ 0.367034 │
│       ┆       ┆          ┆           ┆ 9557     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 62.25     ┆ 11529.74 ┆ 5.046816 ┆ 0.359389 │
│       ┆       ┆          ┆           ┆ 4529     ┆          ┆          │
│ 1     ┆ 850   ┆ -60.0    ┆ 63.0      ┆ 11528.01 ┆ 5.046816 ┆ 0.328331 │
│       ┆       ┆          ┆           ┆ 9502     ┆          ┆          │
│ …     ┆ …     ┆ …        ┆ …         ┆ …        ┆ …        ┆ …        │
│ 7     ┆ 500   ┆ -57.0    ┆ 8.25      ┆ 49749.45 ┆ 17.24943 ┆ 0.429629 │
│       ┆       ┆          ┆           ┆ 3099     ┆ 3        ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 9.0       ┆ 49747.72 ┆ 17.24943 ┆ 0.375157 │
│       ┆       ┆          ┆           ┆ 8072     ┆ 3        ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 9.75      ┆ 49746.00 ┆ 17.37525 ┆ 0.328331 │
│       ┆       ┆          ┆           ┆ 3044     ┆          ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 10.5      ┆ 49744.27 ┆ 17.37525 ┆ 0.281027 │
│       ┆       ┆          ┆           ┆ 8017     ┆          ┆          │
│ 7     ┆ 500   ┆ -57.0    ┆ 11.25     ┆ 49742.55 ┆ 17.56240 ┆ 0.2342   │
│       ┆       ┆          ┆           ┆ 2989     ┆ 2        ┆          │
└───────┴───────┴──────────┴───────────┴──────────┴──────────┴──────────┘

I will open another PR if this one gets merged.

jules-ch · 2026-05-15T08:38:28Z

I am going to need guidance on adding the polars dependency to pixi. There are quite a lot of groups, and I am not very familiar with what is used for CI testing. Either we add polars so that the xarray -> polars conversion is tested, or we only support the pyarrow conversion and do not test the xarray -> polars one.

jules-ch · 2026-06-08T20:39:27Z

I think the implementation is nearly ready. The naive implementation did not work with coordinates that have multiple dimensions; I fixed that with manual broadcasting, which is mostly views at the memory level until a copy is needed for the coordinate values.

On a small benchmark, the conversion to polars is around 10x faster than a bare:

df = pl.DataFrame(da.to_dataframe().reset_index())

dcherian · 2026-06-10T02:22:04Z

+            # Order axes based on Variable dims
+            dim_order = [coord.dims.index(dim) for dim in dims if dim in coord.dims]
+
+            # Reorder coords values to variable dim order
+            ordered_coords = coord.values.transpose(dim_order)
+
+            # Expand coord dims
+            # coord dims (x, y) variable dims (x,y,z) -> (x, y, 1)
+            # NOTE: Insert a length-1 axis for each data dim missing for coordinates
+            # (slice(None) keeps an existing axis, np.newaxis adds one)
+            indexer = tuple(
+                slice(None) if dim in coord.dims else np.newaxis for dim in dims
+            )
+            expanded_coords = ordered_coords[indexer]
+
+            # Broadcast to full flattened shape (x, y, 1) -> (x, y, z)
+            broadcasted = np.broadcast_to(expanded_coords, shape)


can we share this with the to_dataframe path? alternatively, can we rewrite the to_dataframe path to use this function? (that would be a different PR)

I am a bit wary of using this for the to_dataframe function for now. The paths seem similar, but to_dataframe goes through a Dataset conversion and then converts to a DataFrame, turning the coords into a pandas MultiIndex.

I think another PR is the way to go, and I'm not sure we can factor everything here.
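To make the broadcasting steps in the diff above concrete, here is a standalone numpy version run on a 2-D coordinate stored in a different axis order than the data variable. The dims, shapes, and values are made up for illustration.

```python
import numpy as np

dims = ("x", "y", "z")  # dims of the data variable
shape = (2, 3, 4)
coord_dims = ("y", "x")  # 2-D coordinate, stored out of order
coord_values = np.arange(6).reshape(3, 2)

# Reorder the coordinate's axes to follow the variable's dim order.
dim_order = [coord_dims.index(dim) for dim in dims if dim in coord_dims]
ordered = coord_values.transpose(dim_order)  # shape (2, 3)

# Insert a length-1 axis for every data dim the coordinate lacks.
indexer = tuple(slice(None) if dim in coord_dims else np.newaxis for dim in dims)
expanded = ordered[indexer]  # shape (2, 3, 1)

# Broadcasting only creates a view; the copy happens at ravel().
broadcasted = np.broadcast_to(expanded, shape)  # shape (2, 3, 4)
column = broadcasted.ravel()
```

Everything up to `ravel()` shares memory with the original coordinate values, which matches the "mostly views" description.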

dcherian · 2026-06-10T02:24:48Z

+
+    @requires_dask
+    @requires_pyarrow
+    def test_dask_dataarray(self):


what is the intent here? Does it load to memory?

I just wanted to have a test for dask too, to ensure everything works as intended.

Also, in a follow-up PR we can implement chunking, streaming dask chunks as RecordBatch objects for native Arrow streaming.

I can remove it, if you find it out of scope.

Yes, for now it loads into memory: the values are cast to numpy, then converted to a pyarrow array before the table is constructed from it.

for now it loads into memory

We should error and ask the user to compute instead. Xarray does not compute by default.

You want an explicit compute, is that right? to_dataframe also loads data when used on a dask-backed DataArray:

import numpy as np
import xarray as xr
import dask.array as da
import pyarrow as pa

dask_da = xr.DataArray(
    da.from_array(np.arange(6, dtype=float).reshape(2, 3)),
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [10, 20, 30]},
    name="data",
)
print(dask_da.to_dataframe())

I can raise an error if you want.

yep I prefer the error. That behaviour seems like a bug and we should deprecate it
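One possible shape for that error, as a sketch only: the helper name `ensure_in_memory` and the message are hypothetical, and the check relies on the dask collection protocol, where collections expose a `__dask_graph__` method. A stub class stands in for a dask array so the example runs without dask installed.

```python
def ensure_in_memory(data):
    """Raise instead of silently computing a lazy (dask-like) array.

    Hypothetical helper for illustration; not the check used in the PR.
    """
    if hasattr(data, "__dask_graph__"):
        raise ValueError(
            "Converting a dask-backed array to Arrow would load it into memory. "
            "Call .compute() or .load() first."
        )
    return data


class FakeLazyArray:
    """Stand-in for a dask array so the sketch runs without dask."""

    def __dask_graph__(self):
        return {}


try:
    ensure_in_memory(FakeLazyArray())
    raised = False
except ValueError:
    raised = True
```

In-memory inputs pass through unchanged, so the check can sit at the top of the conversion path.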

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

jules-ch force-pushed the arrow-pycapsule-datarray branch 3 times, most recently from bc5c796 to e5e19d6 Compare May 13, 2026 18:32

jules-ch marked this pull request as draft May 14, 2026 13:08

jules-ch mentioned this pull request May 14, 2026

Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas) #10135

Open

jules-ch commented May 14, 2026

jules-ch commented May 18, 2026

Comment thread xarray/core/dataarray.py Outdated

jules-ch marked this pull request as ready for review May 22, 2026 16:26

jules-ch added 15 commits June 8, 2026 19:33

feat: add __arrow_c_stream__ function

f70df33

chore: typing

11fd74b

chore: rename class Test

8b0c828

chore: typo

ac02132

chore: remove call to action

f17c813

chore: fix mypy

74a7251

fix: use values to ensure numpy array

e69b4f4

chore: add test for dask

57b836d

add polars to ignore missing imports

e7cc3f1

chore: add pyarrow schema method + json encoding of attrs

ba824b4

chore: use directly _coords without going through self.coords + dim m…

ecadc76

…apping

chore: add tests for pyarrow __arrow_c_schema__

52fed6b

chore: add what's new entry

5872890

feat: update to support curvilinear coordinates

5c9c8c0

chore: add docstring

276c394

jules-ch force-pushed the arrow-pycapsule-datarray branch from 6efce04 to 276c394 Compare June 8, 2026 17:33

jules-ch added 2 commits June 8, 2026 19:43

chore: add tests for curvilinear coords and out of order dims

64b695d

feat: add shape to xarray schema metadata

d998785

jules-ch mentioned this pull request Jun 9, 2026

Arrow Pycapsule __arrow_array__ conversion priority over __arrow_c_stream__ pola-rs/polars#27921

Open

2 tasks

dcherian reviewed Jun 10, 2026

Comment thread xarray/core/dataarray.py Outdated

dcherian reviewed Jun 10, 2026

Comment thread xarray/tests/test_dataarray.py

dcherian reviewed Jun 10, 2026

Comment thread xarray/core/dataarray.py

dcherian reviewed Jun 10, 2026

Comment thread xarray/core/dataarray.py

jules-ch and others added 2 commits June 10, 2026 07:42

Update xarray/core/dataarray.py

1eda714

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

Update arrow schema version key

ad8c76d
