SparseSerialization should preserve whether the column is numpy #499

Description

@victor-zou

Describe the bug
It is expected that the query result for numeric arrays, when use_numpy/columnar=True, should be a numpy.ndarray instead of a Python list/tuple. However, when a column is sparse (i.e., the majority of the data has the same value, such as 0/-1/nan), the behavior after the introduction of sparse serialization becomes the latter: a plain list is returned. The implementation does not take into account whether the column is a numpy column or not.

See commit 4de5b2b

https://github.com/mymarilyn/clickhouse-driver/blame/49afa09cede2e904090d46b44c1a059bec14c598/clickhouse_driver/columns/base.py#L49

    def apply_sparse(self, items):
        default = self.column.null_value
        if self.column.after_read_items:
            default = self.column.after_read_items([default])[0]

        rv = [default] * (self.items_total - 1)
        for item_number, i in enumerate(self.sparse_indexes):
            rv[i - 1] = items[item_number]

        return rv

To Reproduce
Read any sparse column with use_numpy/columnar=True.

Expected behavior
Returns a numpy array, as it does for non-sparse columns.

Versions
After commit 4de5b2b

Suggested implementation
Add a separate NumpyColumnSparseSerialization that

  1. saves the sparse indexes in a numpy int array;
  2. has apply_sparse simply create a buffer with np.full and assign buf[index] = items;
  3. ideally implements read_sparse in compiled code.
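Steps 1 and 2 can be sketched in isolation as follows. This is only an illustration, not code from clickhouse-driver: the standalone function `apply_sparse_numpy` and its parameters are hypothetical names, and it assumes the same index convention as the quoted apply_sparse (the result has items_total - 1 entries, and stored value k lands at position sparse_indexes[k] - 1).

```python
import numpy as np


def apply_sparse_numpy(items, sparse_indexes, items_total, default):
    """Hypothetical helper: expand sparse values into a dense numpy array.

    Mirrors the index arithmetic of the quoted apply_sparse, but returns
    an ndarray instead of a Python list.
    """
    items = np.asarray(items)
    # Step 1: keep the sparse indexes in a numpy integer array.
    idx = np.asarray(sparse_indexes, dtype=np.int64) - 1
    # Step 2: allocate a buffer filled with the default, then scatter
    # the stored values into their positions in one vectorized assignment.
    buf = np.full(items_total - 1, default, dtype=items.dtype)
    buf[idx] = items
    return buf


dense = apply_sparse_numpy(
    items=np.array([7, 9], dtype=np.int32),
    sparse_indexes=[2, 5],
    items_total=7,
    default=0,
)
print(type(dense).__name__, dense.dtype, dense.tolist())
```

The result keeps the dtype of the stored items, so a caller that asked for numpy output would not get a list back. Nullable columns are not handled in this sketch.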

Or, as an ad hoc alternative, introduce a simple fix like this:

    def apply_sparse(self, items):
        default = self.column.null_value
        # Numpy path: assumes numpy columns expose a `dtype` attribute.
        if hasattr(self.column, "dtype") and not self.column.nullable:
            import numpy as np
            # Pre-fill with the default, then scatter the stored values.
            rv = np.full((self.items_total - 1,), dtype=items.dtype, fill_value=default)
            rv[np.array(self.sparse_indexes) - 1] = items
            return rv

        # Fallback: the existing list-based path, unchanged.
        if self.column.after_read_items:
            default = self.column.after_read_items([default])[0]

        rv = [default] * (self.items_total - 1)
        for item_number, i in enumerate(self.sparse_indexes):
            rv[i - 1] = items[item_number]

        return rv
