Skip to content

Add TimeSeriesDecisionTreeRegressionOperator#576

Merged
zkaoudi merged 7 commits into
apache:mainfrom
xristlamp:main
May 27, 2025
Merged

Add TimeSeriesDecisionTreeRegressionOperator#576
zkaoudi merged 7 commits into
apache:mainfrom
xristlamp:main

Conversation

@xristlamp

Copy link
Copy Markdown
Contributor

Add TimeSeriesDecisionTreeRegressionOperator with full API and integration support

  • Implemented logical operator: TimeSeriesDecisionTreeRegressionOperator
  • Added Spark execution operator with internal lag feature generation
  • Created mapping: TimeSeriesDecisionTreeRegressionMapping
  • Registered mapping in SparkMLPlugin
  • Extended DataQuanta and DataQuantaBuilder with trainTimeSeriesDecisionTree method
  • Added integration test in SparkIntegrationIT.java

@zkaoudi zkaoudi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to have a generic DecisionTreeRegression operator and not tight to timeseries. Like this people who are not interested in decision trees can still use it.

We could have the emulation of creating timeseries and using decision trees in a test or application.

@xristlamp

Copy link
Copy Markdown
Contributor Author

I've applied the changes as you suggested:

  • The operator is now renamed to DecisionTreeRegressionOperator and is designed to be generic, no longer tied to time series.
  • I removed the internal lag feature generation logic from the operator and its Spark implementation.
  • Any time series–specific preprocessing (e.g., lag features) is now handled externally in the test, not in the operator.
  • The operator now expects standard (features: double[], label: Double) inputs, making it reusable in a wide range of regression use cases.

Please let me know if there’s anything else you'd like improved or adjusted!

@mspruc mspruc requested a review from zkaoudi May 26, 2025 08:34
@mspruc

mspruc commented May 26, 2025

Copy link
Copy Markdown
Contributor

Hiya thanks for your contribution!

I am a little confused with your usage of BinaryToUnaryOperator for the DecisionTreeRegressionOperator, you have inputs Double[] and Double but also void can you check if the return type of the operator should be void?

@xristlamp

Copy link
Copy Markdown
Contributor Author

Hiya thanks for your contribution!

I am a little confused with your usage of BinaryToUnaryOperator for the DecisionTreeRegressionOperator, you have inputs Double[] and Double but also void can you check if the return type of the operator should be void?

Thank you very much for the observation!
Just to clarify: the Void is the third type parameter and refers to the "model" that the operator could return.
In our case, since the operator does not return a model but only the predictions (Double), we use Void as the third parameter.
The actual output of the operator (predictions) is correctly defined as Double , that's the second type parameter.

  • double[] is the feature vector input
  • Double is the prediction output
  • Void indicates that this operator does not return a model object

@zkaoudi

zkaoudi commented May 26, 2025

Copy link
Copy Markdown
Contributor

I agree with @mspruc. The operator should return the model and not Void.

Also this operator should have two inputs as it's a BinaryToUnaryOperator: double[] and Double. The first one is the X (data) and the second one Y should be the labels, if I'm not mistaken.

See for instance here: https://github.com/apache/incubator-wayang/blob/main/wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/LinearRegressionOperator.java

Shouldn't it be a similar input/output?

@xristlamp

Copy link
Copy Markdown
Contributor Author

Hi @mspruc , @zkaoudi , thank you both for your helpful feedback!
You're absolutely right, the operator should not return Void. I have updated the implementation to align with the design pattern used in LinearRegressionOperator. Here's what's been changed:

  • The third type parameter of BinaryToUnaryOperator is now DecisionTreeRegressionModel instead of Void.
  • The operator now returns the trained model, not just the predictions.
  • The SparkDecisionTreeRegressionOperator wraps the Spark MLlib model and implements the DecisionTreeRegressionModel interface, just like the linear regression case.
  • The DSL API and tests were also updated to reflect the correct model return type.

This ensures the operator follows the same interface contract as the other regression operators in Wayang, and allows for post-training model usage.

Appreciate the guidance!

@zkaoudi zkaoudi merged commit 475e2d1 into apache:main May 27, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants