[Refactor] Consolidate HdfsScannerParams into HdfsScannerContext, pass by pointer, and eliminate HdfsScannerState#74643
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c86fe94d4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
0c86fe9 to
fcf8d00
Compare
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
fcf8d00 to
62ddc90
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f699de4eb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 064960ae61
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 480ecec3d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR refactors the BE HDFS scanner initialization and per-scan state handling by consolidating the former HdfsScannerParams and HdfsScannerState into a single pointer-passed HdfsScannerContext, and renames HdfsScanStats to HdfsScannerStats. The goal is to eliminate deep-copy overhead and reduce pointer indirection across the Hive connector → scanner → file readers pipeline.
Changes:
- Replace pass-by-value/copy (
HdfsScannerParams/HdfsScannerState) with a pointer-passedHdfsScannerContextthat embeds split/predicate state. - Rename and propagate scan stats type (
HdfsScanStats→HdfsScannerStats) across Parquet/ORC scanners and tests. - Update Hive connector and related readers to read/write context fields directly (e.g.
options,scan_range,column_access_paths,global_dictmaps).
Reviewed changes
Copilot reviewed 51 out of 51 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| be/test/formats/parquet/parquet_ut_base.h | Update helper signature to accept HdfsScannerContext*. |
| be/test/formats/parquet/parquet_ut_base.cpp | Update predicate manager setup to use ctx->predicates.*. |
| be/test/formats/parquet/parquet_footer_test.cpp | Remove HdfsScannerParams usage; initialize HdfsScannerContext directly. |
| be/test/formats/parquet/parquet_file_writer_test.cpp | Switch tests from params to context fields (scan_range, stats, counters). |
| be/test/formats/parquet/parquet_cli_reader.h | Remove params indirection; populate HdfsScannerContext directly. |
| be/test/formats/parquet/page_reader_test.cpp | Rename stats type to HdfsScannerStats. |
| be/test/formats/parquet/page_index_test.cpp | Migrate test scaffolding from params to context fields/options. |
| be/test/formats/parquet/iceberg_schema_evolution_file_reader_test.cpp | Use ctx->table_specific.* and context fields (no params). |
| be/test/formats/parquet/group_reader_test.cpp | Remove params allocation and store access paths directly in context. |
| be/test/formats/parquet/file_writer_test.cpp | Replace params with context for scan_range/stats/counters. |
| be/test/formats/parquet/file_reader_test.cpp | Migrate extensive test setup to new context layout (options/predicates/split). |
| be/test/formats/parquet/column_converter_test.cpp | Replace params with context for scan_range/stats/counters. |
| be/test/exec/hdfs_scanner/jni_scanner_test.cpp | Update JNI scanner tests to use _scanner_ctx pointer on scanner. |
| be/test/exec/hdfs_scanner/hdfs_scanner_json_test.cpp | Rename stats type to HdfsScannerStats. |
| be/test/exec/hdfs_scanner/cache_select_scanner_test.cpp | Update cache-select scanner init to accept HdfsScannerContext*. |
| be/test/connector/deletion_vector/deletion_vector_test.cpp | Construct DeletionVector from HdfsScannerContext. |
| be/src/formats/parquet/variant_projection.h | Rename stats type in API to HdfsScannerStats. |
| be/src/formats/parquet/variant_projection.cpp | Remove params indirection for access paths; rename stats type. |
| be/src/formats/parquet/metadata.cpp | Read split context and options directly from HdfsScannerContext. |
| be/src/formats/parquet/group_reader.h | Rename stats pointer type to HdfsScannerStats*. |
| be/src/formats/parquet/group_reader.cpp | Replace params->options/... and params->global_dictmaps with context fields. |
| be/src/formats/parquet/file_reader.cpp | Replace params indirection; use ctx->split and ctx->predicates state. |
| be/src/formats/parquet/column_reader.h | Rename stats type in reader options to HdfsScannerStats*. |
| be/src/formats/orc/orc_input_stream.h | Rename app stats type to HdfsScannerStats*. |
| be/src/exec/iceberg/iceberg_delete_builder.h | Accept HdfsScannerContext reference; rename stats types. |
| be/src/exec/iceberg/iceberg_delete_builder.cpp | Build delete readers using new context fields/options and renamed stats. |
| be/src/exec/hdfs_scanner/jni_scanner.h | Update scanner init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/jni_scanner.cpp | Convert scanner to store HdfsScannerContext* and use pointer accesses. |
| be/src/exec/hdfs_scanner/hdfs_scanner.h | Consolidate params/state into HdfsScannerContext; embed split/predicate state; rename stats type. |
| be/src/exec/hdfs_scanner/hdfs_scanner.cpp | Switch scanner to pointer-held context; rebuild predicate/split state in context; rename stats. |
| be/src/exec/hdfs_scanner/hdfs_scanner_text.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_text.cpp | Replace params access with _scanner_ctx->... context fields. |
| be/src/exec/hdfs_scanner/hdfs_scanner_partition.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_partition.cpp | Replace params access with context pointer usage. |
| be/src/exec/hdfs_scanner/hdfs_scanner_parquet.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_parquet.cpp | Build delete/DV logic and reader init using context pointer fields. |
| be/src/exec/hdfs_scanner/hdfs_scanner_orc.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_orc.cpp | Replace params indirection; wire split/predicate state through context. |
| be/src/exec/hdfs_scanner/hdfs_scanner_json.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_json.cpp | Replace params access with context pointer usage. |
| be/src/exec/hdfs_scanner/hdfs_scanner_avro.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/hdfs_scanner_avro.cpp | Replace params access with context pointer usage for scan-range/path/schema. |
| be/src/exec/hdfs_scanner/cache_select_scanner.h | Update init signature to accept HdfsScannerContext. |
| be/src/exec/hdfs_scanner/cache_select_scanner.cpp | Replace params access with context pointer usage for formats/options/datacache. |
| be/src/connector/hive_connector.h | Replace _scanner_params with _scanner_ctx; adjust sink provider return typedef. |
| be/src/connector/hive_connector.cpp | Populate per-range fields in _scanner_ctx in-place; pass context pointer to scanner. |
| be/src/connector/deletion_vector/deletion_vector.h | Make DeletionVector depend on HdfsScannerContext; rename stats types. |
| be/src/connector/deletion_vector/deletion_vector.cpp | Use context fields (fs, table_location, profile, datacache opts); rename stats types. |
| be/src/connector/connector.h | Include chunk sink header for the new provider pointer typedef. |
| be/src/connector/connector_chunk_sink.h | Add ConnectorChunkSinkProviderPtr alias. |
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
Signed-off-by: tqqq <dirtysalt1987@gmail.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]✅ pass : 482 / 527 (91.46%) file detail
|
|
@mergify backport branch-4.1 |
✅ Backports have been createdDetails
Cherry-pick of ca7d8bc has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
Why I'm doing:
The scanner initialization path had structural inefficiencies:
Two copying steps per scan range:
HiveDataSourcecopied_scanner_ctxinto a local, thenHdfsScanner::init()copied it again into the scanner member — two full deep copies of all vectors and maps.Pool-allocated state with pointer indirection:
HdfsScannerStatewas heap-allocated viaobj_pool, requiring a raw pointer (ctx.state) to access predicate and split state. The indirection existed solely to keepHdfsScannerContextcopyable — but sinceHiveDataSourceandHdfsScannerare strictly 1:1, the context never needs to be copied.Duplicate storage in HiveDataSource: Many fields (
_hive_table,_hive_column_names,_column_access_paths,_extended_column_expr_ctxs,_min_max_tuple_desc,_scan_range_id,_partition_filter.values) were stored separately inHiveDataSourceand then assigned as pointer references into_scanner_ctx— unnecessary indirection.What I'm doing:
1. Pass context by pointer, zero-copy
2. Eliminate HdfsScannerState — embedded value structs
Non-copyable predicate and split state moves directly into
HdfsScannerContextas two embedded value structs (no heap allocation, no raw pointer):What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: