feat: Expose Sketch Memory Footprint#136

Merged
tisonkun merged 5 commits into apache:main from tabac:sketch-mem-size
Jun 9, 2026

Conversation

@tabac tabac commented May 28, 2026

Contributor

Introduce an estimated_size() method that returns the memory footprint of a sketch. Add implementations for HLL and CPC.

If this approach is what we're looking for, I can add implementations for all sketches, either here or in a follow-up PR.

Addresses #135.

@tabac tabac changed the title Expose Sketch Memory Footprint feat: Expose Sketch Memory Footprint May 28, 2026

tabac commented May 28, 2026

Contributor Author

@tisonkun does this approach look OK to you? If so I can update the remaining sketches.

@tisonkun tisonkun left a comment

Member

Generally LGTM.

I have two comments:

  1. Naming is important. We can spend a bit more time considering names. So far, I prefer estimated_bytes because (1) it clearly states that the unit is bytes, and (2) we may not always have the accurate size, so presenting it as a hint is better; that is how this information is typically used, IMO. (Disclosure: the name estimated_bytes is used by ScopeDB for similar cases.)

  2. I'd ping @leerho @AlexanderSaydakov @proost @freakyzoidberg for awareness, because this API could be helpful for the datasketches implementations in other languages.

Copilot AI left a comment

Contributor

Pull request overview

Adds a public size(&self) -> usize method that reports the in-memory footprint (stack + heap) of HllSketch and CpcSketch, addressing issue #135. Internal heap_size() helpers are added on the backing storage types (HLL Container, AuxMap, Array4/6/8, and CPC PairTable) and combined with std::mem::size_of::<Self>() at the sketch level.

Changes:

  • Add HllSketch::size() that dispatches on Mode and sums backing-storage heap with size_of::<Self>().
  • Add CpcSketch::size() that sums sliding_window + optional PairTable heap with size_of::<Self>().
  • Add heap_size() helpers on Container, AuxMap, Array4, Array6, Array8, and PairTable.
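
The pattern summarized above (inline struct size plus heap held by backing storage, with an optional table contributing zero when absent) can be sketched as follows. This is a minimal illustration under stated assumptions: `ToySketch`, `Table`, and `heap_bytes` are hypothetical names invented for the example and are not the crate's types or methods.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for an optional side table; not the crate's `PairTable`.
struct Table {
    slots: Vec<u32>,
}

impl Table {
    /// Heap bytes held by the slot storage (length-based, so spare capacity is ignored).
    fn heap_bytes(&self) -> usize {
        self.slots.len() * size_of::<u32>()
    }
}

/// Hypothetical sketch type with one always-present buffer and one optional table.
struct ToySketch {
    window: Vec<u8>,
    table: Option<Table>,
}

impl ToySketch {
    /// Estimated footprint in bytes: the inline struct plus its heap allocations.
    fn estimated_size(&self) -> usize {
        // An absent table contributes nothing to the heap total.
        let heap = self.window.len() + self.table.as_ref().map_or(0, Table::heap_bytes);
        size_of::<Self>() + heap
    }
}
```

With a 16-byte window and a table of four `u32` slots, the estimate is `size_of::<ToySketch>() + 32`; with an empty window and no table it collapses to `size_of::<ToySketch>()`.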

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| datasketches/src/hll/sketch.rs | Adds public HllSketch::size() that aggregates per-mode heap size with size_of::<Self>(). |
| datasketches/src/hll/container.rs | Adds Container::heap_size() counting backing coupon slice. |
| datasketches/src/hll/aux_map.rs | Adds AuxMap::heap_size() counting backing entries slice. |
| datasketches/src/hll/array4.rs | Adds Array4::heap_size() summing main bytes and optional aux map. |
| datasketches/src/hll/array6.rs | Adds Array6::heap_size() returning packed-bytes length. |
| datasketches/src/hll/array8.rs | Adds Array8::heap_size() returning per-slot bytes length. |
| datasketches/src/cpc/sketch.rs | Adds public CpcSketch::size() summing sliding window and surprising-value-table heap. |
| datasketches/src/cpc/pair_table.rs | Adds PairTable::heap_size() counting slots storage. |

Comment thread datasketches/src/cpc/sketch.rs Outdated
Comment on lines +456 to +461
let heap_size = self.sliding_window.len()
    + self
        .surprising_value_table
        .as_ref()
        .map(|t| t.heap_size())
        .unwrap_or(0);
Comment thread datasketches/src/cpc/pair_table.rs Outdated
Comment on lines +250 to +253
/// Returns the size of the heap allocations in bytes
pub fn heap_size(&self) -> usize {
    self.slots.len() * std::mem::size_of::<u32>()
}
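
The length-based count in the helper above is one reason such a number is best read as an estimate: a `Vec` may reserve more memory than its length reports. A minimal sketch of the difference, using hypothetical free functions (`used_bytes`, `reserved_bytes`) that are not part of the crate:

```rust
use std::mem::size_of;

/// Bytes occupied by the initialized elements only.
fn used_bytes(slots: &[u32]) -> usize {
    slots.len() * size_of::<u32>()
}

/// Bytes reserved by the allocation, including any spare capacity.
fn reserved_bytes(slots: &Vec<u32>) -> usize {
    slots.capacity() * size_of::<u32>()
}
```

For a vector created with `Vec::with_capacity(8)` that holds a single element, `used_bytes` reports 4 while `reserved_bytes` reports at least 32. Which of the two a sketch should report is a design choice; the snippets in this thread use the length.
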
@tisonkun

Member

Calling it footprint is also possible, but it may require too much insider knowledge to get the idiom.

heap_size (1) fails to specify the unit and (2) exposes too much detail (heap).

tabac commented May 29, 2026

Contributor Author

> Generally LGTM.
>
> I have two comments:
>
> 1. Naming is important. We can spend a bit more time considering names. So far, I prefer estimated_bytes because (1) it clearly states that the unit is bytes, and (2) we may not always have the accurate size, so presenting it as a hint is better; that is how this information is typically used, IMO. (Disclosure: the name estimated_bytes is used by ScopeDB for similar cases.)

Makes sense to me. A couple of naming suggestions: estimated_size_bytes and mem_footprint; I prefer the first.

tisonkun commented May 29, 2026

Member

> estimated_size_bytes

A bit repetitive? estimated_bytes already implies "size"; alternatively, estimated_size with a comment saying it's in bytes should work. heap_size just sounds too technical to me :P

tabac commented May 29, 2026

Contributor Author

> > estimated_size_bytes
>
> A bit repetitive? estimated_bytes already implies "size"; alternatively, estimated_size with a comment saying it's in bytes should work. heap_size just sounds too technical to me :P

OK, I can push a commit with the rename. Just to make sure: we want both size() and heap_size() to be renamed to estimated_size(), right?

@tisonkun

Member

> > > estimated_size_bytes
> >
> > A bit repetitive? estimated_bytes already implies "size"; alternatively, estimated_size with a comment saying it's in bytes should work. heap_size just sounds too technical to me :P
>
> OK, I can push a commit with the rename. Just to make sure: we want both size() and heap_size() to be renamed to estimated_size(), right?

That's correct.

tabac commented Jun 1, 2026

Contributor Author

@tisonkun please take another look when you have the time. I can introduce this method in the remaining sketches in another PR, or here if it's preferable.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Comment thread datasketches/src/hll/sketch.rs
Comment thread datasketches/src/cpc/sketch.rs
Comment on lines +429 to +437
/// Returns the estimated size of the heap allocations in bytes
pub fn estimated_size(&self) -> usize {
    self.bytes.len()
        + self
            .aux_map
            .as_ref()
            .map(|a| a.estimated_size())
            .unwrap_or(0)
}

Member

nit: can add, but not necessary

@tisonkun tisonkun left a comment

Member

LGTM.

I'd wait a bit to give @notfilippo an opportunity to review. Otherwise, I'll merge this patch early tomorrow. Review after commit is always possible.

tisonkun commented Jun 1, 2026

Member

@tabac If you have time right now, feel free to add impls for other sketches and I can do another review to merge them at once.

tabac commented Jun 1, 2026

Contributor Author

@tisonkun I pushed another commit for the Bloom, Tdigest and Theta sketches, which are straightforward compared to CountMin and Frequencies, which store a generic type T. I will follow up for the remaining two in another PR.

My only concern is that all the functions use the same name, but some report only heap memory while others report stack + heap. This could be confusing, although we mention it in the comment.
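
To illustrate why a generic item type T is harder to size than a fixed-width one, here is a minimal sketch. It assumes a hypothetical `HeapBytes` trait and `Items<T>` container; neither exists in the crate, and this is not how CountMin or Frequencies are implemented. A shallow count of `len * size_of::<T>()` misses whatever each item owns on the heap.

```rust
use std::mem::size_of;

/// Hypothetical trait: heap bytes a value owns beyond its inline size.
trait HeapBytes {
    fn heap_bytes(&self) -> usize;
}

impl HeapBytes for u64 {
    // Plain integers own no heap memory.
    fn heap_bytes(&self) -> usize {
        0
    }
}

impl HeapBytes for String {
    // Counts the string's length in bytes; spare capacity is ignored.
    fn heap_bytes(&self) -> usize {
        self.len()
    }
}

/// Hypothetical container of generic items.
struct Items<T> {
    items: Vec<T>,
}

impl<T> Items<T> {
    /// Shallow estimate: inline size of each stored item only.
    fn shallow_bytes(&self) -> usize {
        self.items.len() * size_of::<T>()
    }
}

impl<T: HeapBytes> Items<T> {
    /// Deeper estimate: shallow size plus heap memory owned by each item.
    fn deep_bytes(&self) -> usize {
        let owned: usize = self.items.iter().map(|item| item.heap_bytes()).sum();
        self.shallow_bytes() + owned
    }
}
```

For `u64` items the two estimates agree; for `String` items the shallow figure omits the string contents entirely, which is the kind of gap the "estimated" wording is meant to cover.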

tisonkun commented Jun 3, 2026

Member

> My only concern is that all the functions use the same name, but some report only heap memory while others report stack + heap. This could be confusing, although we mention it in the comment.

This is why we use estimated_bytes as the name. I'll review all the impl and consider whether it is a reasonable estimation.

@notfilippo notfilippo left a comment

Member

I think the estimated_ contract is fine. I think in a follow-up we could also introduce tracking of correct size for sketches that can report correct sizing information.

Comment thread datasketches/Cargo.toml Outdated

@tisonkun tisonkun left a comment

Member

Thanks for your contribution!

@tisonkun tisonkun merged commit 329913a into apache:main Jun 9, 2026
10 checks passed

tisonkun commented Jun 9, 2026

Member

> I think the estimated_ contract is fine. I think in a follow-up we could also introduce tracking of correct size for sketches that can report correct sizing information.

See #137

@tisonkun

Member

@tabac BTW I can see that you maintain the hyperloglogplus crate and have a few repos around sketches.

Feel free to compare the datasketches-rust's impls with your crate and bring the goods to one another :D

Feedback is always welcome!

4 participants