
[Refactor] Consolidate HdfsScannerParams into HdfsScannerContext, pass by pointer, and eliminate HdfsScannerState #74643

Merged: dirtysalt merged 12 commits into StarRocks:main from dirtysalt:refactor-hdfs-scanner-ctx on Jun 11, 2026

Conversation

@dirtysalt commented Jun 10, 2026

Why I'm doing:

The scanner initialization path had structural inefficiencies:

  1. Two copying steps per scan range: HiveDataSource copied _scanner_ctx into a local, then HdfsScanner::init() copied it again into the scanner member — two full deep copies of all vectors and maps.

  2. Pool-allocated state with pointer indirection: HdfsScannerState was heap-allocated via obj_pool, requiring a raw pointer (ctx.state) to access predicate and split state. The indirection existed solely to keep HdfsScannerContext copyable — but since HiveDataSource and HdfsScanner are strictly 1:1, the context never needs to be copied.

  3. Duplicate storage in HiveDataSource: Many fields (_hive_table, _hive_column_names, _column_access_paths, _extended_column_expr_ctxs, _min_max_tuple_desc, _scan_range_id, _partition_filter.values) were stored separately in HiveDataSource and then assigned as pointer references into _scanner_ctx — unnecessary indirection.
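To make point 3 concrete, here is a minimal sketch of the two layouts. All type names (`OldContext`, `OldDataSource`, `NewContext`, `NewDataSource`) are hypothetical stand-ins, and only one field is shown. It also assumes that after the refactor the context owns the field directly, which is an interpretation of "unnecessary indirection" rather than something taken from the actual diff.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// "Before" layout (assumed): the data source stores the field itself and the
// context only holds a pointer back into the data source.
struct OldContext {
    const std::vector<std::string>* hive_column_names = nullptr;
};

struct OldDataSource {
    std::vector<std::string> _hive_column_names; // stored here...
    OldContext _scanner_ctx;

    void init(std::vector<std::string> names) {
        _hive_column_names = std::move(names);
        _scanner_ctx.hive_column_names = &_hive_column_names; // ...and referenced from here
    }
};

// "After" layout (assumed): the context is the single owner of the field.
struct NewContext {
    std::vector<std::string> hive_column_names;
};

struct NewDataSource {
    NewContext _scanner_ctx;

    void init(std::vector<std::string> names) {
        _scanner_ctx.hive_column_names = std::move(names);
    }
};
```

In the second layout there is one fewer member to keep in sync, and readers of the context no longer dereference a pointer to reach the column names.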

What I'm doing:

1. Pass context by pointer, zero-copy

Before:
  HiveDataSource::_scanner_ctx  (template)
      → copy into local scanner_ctx          ← copy #1
      → scanner->init(state, scanner_ctx)
          → _scanner_ctx = scanner_ctx       ← copy #2

After:
  HiveDataSource::_scanner_ctx  (canonical, per-range fields set in-place)
      → scanner->init(state, &_scanner_ctx)  ← pointer, zero copies
          → _scanner_ctx = &_scanner_ctx     ← pointer assignment
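The before/after flow above can be sketched in a few lines of C++. This is an illustration under simplifying assumptions, not the StarRocks code: `ScannerContext`, `CopyingScanner`, `PointerScanner`, and the copy counter `g_copies` are hypothetical names, and the real `init()` also takes a runtime state argument that is omitted here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

int g_copies = 0; // counts deep copies, for illustration only

// Hypothetical, heavily reduced stand-in for HdfsScannerContext.
struct ScannerContext {
    std::vector<std::string> column_names;
    std::map<std::string, int> slot_ids;
    int scan_range_id = 0;

    ScannerContext() = default;
    ScannerContext(const ScannerContext& o)
            : column_names(o.column_names), slot_ids(o.slot_ids), scan_range_id(o.scan_range_id) {
        ++g_copies;
    }
    ScannerContext& operator=(const ScannerContext& o) {
        column_names = o.column_names;
        slot_ids = o.slot_ids;
        scan_range_id = o.scan_range_id;
        ++g_copies;
        return *this;
    }
};

// Old shape: the scanner keeps its own copy of the context.
struct CopyingScanner {
    ScannerContext ctx;
    void init(const ScannerContext& c) { ctx = c; } // copy #2
};

// New shape: the scanner only remembers where the canonical context lives.
struct PointerScanner {
    ScannerContext* ctx = nullptr;
    void init(ScannerContext* c) { ctx = c; } // pointer assignment, no copy
};

// Returns the number of deep copies made while opening one scan range.
int open_with_copies(const ScannerContext& tmpl, int range_id) {
    g_copies = 0;
    ScannerContext local = tmpl; // copy #1
    local.scan_range_id = range_id;
    CopyingScanner scanner;
    scanner.init(local);
    return g_copies;
}

int open_with_pointer(ScannerContext& canonical, int range_id) {
    g_copies = 0;
    canonical.scan_range_id = range_id; // per-range field set in place
    PointerScanner scanner;
    scanner.init(&canonical);
    assert(scanner.ctx == &canonical);
    return g_copies;
}
```

With these definitions, `open_with_copies` performs two deep copies per range and `open_with_pointer` performs none. The pointer version is only safe while the owner of the context outlives the scanner, which matches the strictly 1:1 relationship between HiveDataSource and HdfsScanner described above.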

2. Eliminate HdfsScannerState — embedded value structs

Non-copyable predicate and split state moves directly into HdfsScannerContext as two embedded value structs (no heap allocation, no raw pointer):

HdfsScannerContext {
    SplitState split {       ← value struct
        split_tasks, has_split_tasks, estimated_mem_usage_per_split_task
    };
    PredicateState predicates {  ← value struct
        conjuncts_manager, predicate_free_pool, predicate_parser,
        predicate_tree, runtime_filter_scan_range_pruner
    };
};
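As a rough C++ rendering of the layout above, the sketch below embeds two non-copyable value structs in a context. The member types are placeholders (`std::vector<int>` for split tasks, `std::unique_ptr<std::string>` for the predicate parser), only a subset of the listed fields is shown, and `add_split_task` is a hypothetical helper; none of this is taken from the actual header.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <vector>

// Split state held by value; copying is disabled explicitly.
struct SplitState {
    SplitState() = default;
    SplitState(const SplitState&) = delete;
    SplitState& operator=(const SplitState&) = delete;

    std::vector<int> split_tasks; // placeholder element type
    bool has_split_tasks = false;
    std::size_t estimated_mem_usage_per_split_task = 0;
};

// Predicate state held by value; also non-copyable.
struct PredicateState {
    PredicateState() = default;
    PredicateState(const PredicateState&) = delete;
    PredicateState& operator=(const PredicateState&) = delete;

    std::unique_ptr<std::string> predicate_parser; // placeholder type
};

// The context embeds both structs directly: no heap allocation and no raw
// pointer to a separate state object.
struct ScannerContext {
    int scan_range_id = 0;
    SplitState split;
    PredicateState predicates;
};

// Embedding non-copyable members makes the whole context non-copyable, so an
// accidental deep copy fails at compile time.
static_assert(!std::is_copy_constructible<ScannerContext>::value,
              "context must be passed by pointer or reference");

// Hypothetical helper: callers reach the state through the context pointer.
void add_split_task(ScannerContext* ctx, int task_id, std::size_t mem_per_task) {
    ctx->split.split_tasks.push_back(task_id);
    ctx->split.has_split_tasks = true;
    ctx->split.estimated_mem_usage_per_split_task = mem_per_task;
}
```

The `static_assert` is the useful part of the design: once the context cannot be copied, passing it by pointer is enforced by the compiler rather than by convention.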

@CelerData-Reviewer

@codex review

@github-actions github-actions Bot requested review from GavinMar and trueeyu June 10, 2026 11:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c86fe94d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread be/src/connector/hive_connector.cpp
@dirtysalt dirtysalt force-pushed the refactor-hdfs-scanner-ctx branch from 0c86fe9 to fcf8d00 Compare June 10, 2026 21:47
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
@dirtysalt dirtysalt force-pushed the refactor-hdfs-scanner-ctx branch from fcf8d00 to 62ddc90 Compare June 10, 2026 21:48
@CelerData-Reviewer

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f699de4eb

Comment thread be/test/formats/parquet/file_reader_test.cpp Outdated
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
@CelerData-Reviewer

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 064960ae61

Comment thread be/src/exec/iceberg/iceberg_delete_builder.cpp
@dirtysalt changed the title from "[Refactor] Merge HdfsScannerParams into HdfsScannerContext and consolidate predicate ownership" to "[Refactor] Merge HdfsScannerParams into HdfsScannerContext, eliminate double indirection" on Jun 11, 2026
dirtysalt and others added 4 commits June 11, 2026 09:55
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
@CelerData-Reviewer

@codex review

@dirtysalt changed the title from "[Refactor] Merge HdfsScannerParams into HdfsScannerContext, eliminate double indirection" to "[Refactor] Consolidate HdfsScannerParams into HdfsScannerContext, pass by pointer, and eliminate HdfsScannerState" on Jun 11, 2026
@dirtysalt dirtysalt requested a review from Copilot June 11, 2026 02:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 480ecec3d7

Comment thread be/src/connector/hive_connector.h Outdated

Copilot AI left a comment

Pull request overview

This PR refactors the BE HDFS scanner initialization and per-scan state handling by consolidating the former HdfsScannerParams and HdfsScannerState into a single pointer-passed HdfsScannerContext, and renames HdfsScanStats to HdfsScannerStats. The goal is to eliminate deep-copy overhead and reduce pointer indirection across the Hive connector → scanner → file readers pipeline.

Changes:

  • Replace pass-by-value/copy (HdfsScannerParams / HdfsScannerState) with a pointer-passed HdfsScannerContext that embeds split/predicate state.
  • Rename and propagate scan stats type (HdfsScanStats → HdfsScannerStats) across Parquet/ORC scanners and tests.
  • Update Hive connector and related readers to read/write context fields directly (e.g. options, scan_range, column_access_paths, global_dictmaps).

Reviewed changes

Copilot reviewed 51 out of 51 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| be/test/formats/parquet/parquet_ut_base.h | Update helper signature to accept HdfsScannerContext*. |
| be/test/formats/parquet/parquet_ut_base.cpp | Update predicate manager setup to use ctx->predicates.*. |
| be/test/formats/parquet/parquet_footer_test.cpp | Remove HdfsScannerParams usage; initialize HdfsScannerContext directly. |
| be/test/formats/parquet/parquet_file_writer_test.cpp | Switch tests from params to context fields (scan_range, stats, counters). |
| be/test/formats/parquet/parquet_cli_reader.h | Remove params indirection; populate HdfsScannerContext directly. |
| be/test/formats/parquet/page_reader_test.cpp | Rename stats type to HdfsScannerStats. |
| be/test/formats/parquet/page_index_test.cpp | Migrate test scaffolding from params to context fields/options. |
| be/test/formats/parquet/iceberg_schema_evolution_file_reader_test.cpp | Use ctx->table_specific.* and context fields (no params). |
| be/test/formats/parquet/group_reader_test.cpp | Remove params allocation and store access paths directly in context. |
| be/test/formats/parquet/file_writer_test.cpp | Replace params with context for scan_range/stats/counters. |
| be/test/formats/parquet/file_reader_test.cpp | Migrate extensive test setup to new context layout (options/predicates/split). |
| be/test/formats/parquet/column_converter_test.cpp | Replace params with context for scan_range/stats/counters. |
| be/test/exec/hdfs_scanner/jni_scanner_test.cpp | Update JNI scanner tests to use _scanner_ctx pointer on scanner. |
| be/test/exec/hdfs_scanner/hdfs_scanner_json_test.cpp | Rename stats type to HdfsScannerStats. |
| be/test/exec/hdfs_scanner/cache_select_scanner_test.cpp | Update cache-select scanner init to accept HdfsScannerContext*. |
| be/test/connector/deletion_vector/deletion_vector_test.cpp | Construct DeletionVector from HdfsScannerContext. |
| be/src/formats/parquet/variant_projection.h | Rename stats type in API to HdfsScannerStats. |
| be/src/formats/parquet/variant_projection.cpp | Remove params indirection for access paths; rename stats type. |
| be/src/formats/parquet/metadata.cpp | Read split context and options directly from HdfsScannerContext. |
| be/src/formats/parquet/group_reader.h | Rename stats pointer type to HdfsScannerStats*. |
| be/src/formats/parquet/group_reader.cpp | Replace params->options/... and params->global_dictmaps with context fields. |
| be/src/formats/parquet/file_reader.cpp | Replace params indirection; use ctx->split and ctx->predicates state. |
| be/src/formats/parquet/column_reader.h | Rename stats type in reader options to HdfsScannerStats*. |
| be/src/formats/orc/orc_input_stream.h | Rename app stats type to HdfsScannerStats*. |
| be/src/exec/iceberg/iceberg_delete_builder.h | Accept HdfsScannerContext reference; rename stats types. |
| be/src/exec/iceberg/iceberg_delete_builder.cpp | Build delete readers using new context fields/options and renamed stats. |
| be/src/exec/hdfs_scanner/jni_scanner.h | Update scanner init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/jni_scanner.cpp | Convert scanner to store HdfsScannerContext* and use pointer accesses. |
| be/src/exec/hdfs_scanner/hdfs_scanner.h | Consolidate params/state into HdfsScannerContext; embed split/predicate state; rename stats type. |
| be/src/exec/hdfs_scanner/hdfs_scanner.cpp | Switch scanner to pointer-held context; rebuild predicate/split state in context; rename stats. |
| be/src/exec/hdfs_scanner/hdfs_scanner_text.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_text.cpp | Replace params access with _scanner_ctx->... context fields. |
| be/src/exec/hdfs_scanner/hdfs_scanner_partition.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_partition.cpp | Replace params access with context pointer usage. |
| be/src/exec/hdfs_scanner/hdfs_scanner_parquet.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_parquet.cpp | Build delete/DV logic and reader init using context pointer fields. |
| be/src/exec/hdfs_scanner/hdfs_scanner_orc.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_orc.cpp | Replace params indirection; wire split/predicate state through context. |
| be/src/exec/hdfs_scanner/hdfs_scanner_json.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_json.cpp | Replace params access with context pointer usage. |
| be/src/exec/hdfs_scanner/hdfs_scanner_avro.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_avro.cpp | Replace params access with context pointer usage for scan-range/path/schema. |
| be/src/exec/hdfs_scanner/cache_select_scanner.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/cache_select_scanner.cpp | Replace params access with context pointer usage for formats/options/datacache. |
| be/src/connector/hive_connector.h | Replace _scanner_params with _scanner_ctx; adjust sink provider return typedef. |
| be/src/connector/hive_connector.cpp | Populate per-range fields in _scanner_ctx in-place; pass context pointer to scanner. |
| be/src/connector/deletion_vector/deletion_vector.h | Make DeletionVector depend on HdfsScannerContext; rename stats types. |
| be/src/connector/deletion_vector/deletion_vector.cpp | Use context fields (fs, table_location, profile, datacache opts); rename stats types. |
| be/src/connector/connector.h | Include chunk sink header for the new provider pointer typedef. |
| be/src/connector/connector_chunk_sink.h | Add ConnectorChunkSinkProviderPtr alias. |

Comment thread be/src/exec/hdfs_scanner/hdfs_scanner.h
Comment thread be/src/exec/hdfs_scanner/hdfs_scanner.cpp
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
@CelerData-Reviewer

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[BE Incremental Coverage Report]

pass : 482 / 527 (91.46%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 be/src/exec/hdfs_scanner/cache_select_scanner.cpp | 7 | 28 | 25.00% | [71, 98, 99, 100, 101, 102, 108, 109, 121, 127, 152, 183, 186, 203, 204, 208, 209, 224, 228, 234, 237] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner_parquet.cpp | 17 | 19 | 89.47% | [54, 55] |
| 🔵 be/src/formats/parquet/file_reader.cpp | 26 | 28 | 92.86% | [159, 163] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner_orc.cpp | 51 | 55 | 92.73% | [350, 351, 468, 594] |
| 🔵 be/src/connector/hive_connector.cpp | 166 | 177 | 93.79% | [99, 100, 154, 159, 206, 283, 460, 784, 825, 826, 830] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner.cpp | 99 | 104 | 95.19% | [330, 331, 332, 346, 347] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner_avro.cpp | 14 | 14 | 100.00% | [] |
| 🔵 be/src/exec/hdfs_scanner/jni_scanner.cpp | 12 | 12 | 100.00% | [] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner_text.cpp | 23 | 23 | 100.00% | [] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner_json.cpp | 6 | 6 | 100.00% | [] |
| 🔵 be/src/connector/deletion_vector/deletion_vector.h | 2 | 2 | 100.00% | [] |
| 🔵 be/src/formats/orc/orc_input_stream.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/formats/parquet/metadata.cpp | 3 | 3 | 100.00% | [] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner_partition.cpp | 3 | 3 | 100.00% | [] |
| 🔵 be/src/connector/deletion_vector/deletion_vector.cpp | 9 | 9 | 100.00% | [] |
| 🔵 be/src/formats/parquet/group_reader.cpp | 5 | 5 | 100.00% | [] |
| 🔵 be/src/exec/iceberg/iceberg_delete_builder.cpp | 21 | 21 | 100.00% | [] |
| 🔵 be/src/exec/hdfs_scanner/hdfs_scanner.h | 15 | 15 | 100.00% | [] |
| 🔵 be/src/exec/iceberg/iceberg_delete_builder.h | 2 | 2 | 100.00% | [] |

@dirtysalt dirtysalt enabled auto-merge (squash) June 11, 2026 06:23
@dirtysalt dirtysalt merged commit ca7d8bc into StarRocks:main Jun 11, 2026
60 of 63 checks passed
@dirtysalt

@mergify backport branch-4.1

@mergify

mergify Bot commented Jun 11, 2026

backport branch-4.1

✅ Backports have been created

Details

Cherry-pick of ca7d8bc has failed:

On branch mergify/bp/branch-4.1/pr-74643
Your branch is up to date with 'origin/branch-4.1'.

You are currently cherry-picking commit ca7d8bc71b.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   be/src/connector/connector.h
	modified:   be/src/connector/connector_chunk_sink.h
	modified:   be/src/exec/hdfs_scanner/cache_select_scanner.cpp
	modified:   be/src/exec/hdfs_scanner/cache_select_scanner.h
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_avro.cpp
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_avro.h
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_json.cpp
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_json.h
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_orc.h
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_parquet.h
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_partition.cpp
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_partition.h
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_text.cpp
	modified:   be/src/exec/hdfs_scanner/hdfs_scanner_text.h
	modified:   be/src/exec/hdfs_scanner/jni_scanner.h
	modified:   be/src/formats/parquet/column_reader.h
	modified:   be/src/formats/parquet/group_reader.h
	modified:   be/src/formats/parquet/metadata.cpp
	modified:   be/test/connector/deletion_vector/deletion_vector_test.cpp
	modified:   be/test/exec/hdfs_scanner/cache_select_scanner_test.cpp
	modified:   be/test/exec/hdfs_scanner/hdfs_scanner_json_test.cpp
	modified:   be/test/exec/hdfs_scanner/jni_scanner_test.cpp
	modified:   be/test/formats/parquet/column_converter_test.cpp
	modified:   be/test/formats/parquet/file_writer_test.cpp
	modified:   be/test/formats/parquet/iceberg_schema_evolution_file_reader_test.cpp
	modified:   be/test/formats/parquet/page_index_test.cpp
	modified:   be/test/formats/parquet/page_reader_test.cpp
	modified:   be/test/formats/parquet/parquet_cli_reader.h
	modified:   be/test/formats/parquet/parquet_file_writer_test.cpp
	modified:   be/test/formats/parquet/parquet_footer_test.cpp
	modified:   be/test/formats/parquet/parquet_ut_base.cpp
	modified:   be/test/formats/parquet/parquet_ut_base.h

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   be/src/connector/deletion_vector/deletion_vector.cpp
	both modified:   be/src/connector/deletion_vector/deletion_vector.h
	both modified:   be/src/connector/hive_connector.cpp
	both modified:   be/src/connector/hive_connector.h
	both modified:   be/src/exec/hdfs_scanner/hdfs_scanner.cpp
	both modified:   be/src/exec/hdfs_scanner/hdfs_scanner.h
	both modified:   be/src/exec/hdfs_scanner/hdfs_scanner_orc.cpp
	both modified:   be/src/exec/hdfs_scanner/hdfs_scanner_parquet.cpp
	both modified:   be/src/exec/hdfs_scanner/jni_scanner.cpp
	both modified:   be/src/exec/iceberg/iceberg_delete_builder.cpp
	both modified:   be/src/exec/iceberg/iceberg_delete_builder.h
	both modified:   be/src/formats/orc/orc_input_stream.h
	both modified:   be/src/formats/parquet/file_reader.cpp
	both modified:   be/src/formats/parquet/group_reader.cpp
	deleted by us:   be/src/formats/parquet/variant_projection.cpp
	deleted by us:   be/src/formats/parquet/variant_projection.h
	both modified:   be/test/exec/hdfs_scanner/hdfs_scanner_test.cpp
	both modified:   be/test/formats/parquet/file_reader_test.cpp
	both modified:   be/test/formats/parquet/group_reader_test.cpp

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@dirtysalt dirtysalt deleted the refactor-hdfs-scanner-ctx branch June 11, 2026 08:46
wanpengfei-git added a commit that referenced this pull request Jun 11, 2026
…s by pointer, and eliminate HdfsScannerState (backport #74643) (#74689)

Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Co-authored-by: tqqq <dirtysalt1987@gmail.com>
Co-authored-by: wanpengfei-git <wanpengfei91@163.com>

7 participants