Proposal: contribute PySparkOnK8sOperator + @task.pyspark_on_k8s to apache.spark provider
#67165
Unanswered
sdaberdaku asked this question in General
Replies: 0 comments
Summary
I maintain a third-party provider, `apache-airflow-providers-pysparkonk8s`, that adds two ways to run PySpark code as Airflow tasks on a Kubernetes-hosted Airflow deployment:
- `PySparkOnK8sOperator`: a `PythonOperator` subclass that initializes a `SparkSession` and injects it into the user's Python callable as a `spark` kwarg.
- `@task.pyspark_on_k8s` decorator: the TaskFlow-API equivalent.

I would like to contribute this code upstream and merge it into the existing `providers/apache/spark/` provider. Before I open a draft PR, I want to (a) confirm `apache.spark` is the right home, and (b) find a committer interested in reviewing.

What it does
The provider initializes the Spark cluster in one of three modes, configured via dataclasses:
- `client` (default): the Spark driver runs inside the Airflow worker pod executing the task. Executor pods are provisioned through the Kubernetes API. The worker pod's resource requests/limits are dynamically mutated via Airflow's `executor_config["pod_override"]` mechanism to match the configured Spark driver resources.
- `local`: single-JVM driver+executor inside the worker pod. For dev/testing.
- `connect`: connects to an existing Spark Connect cluster.

Because the driver coincides with the worker pod, user code has native access to Airflow Variables, Connections, and XComs from inside the Spark callable, without needing `spark-submit` or a sidecar.

The operator also auto-generates pod affinity rules per task run (using a fresh UUID label) so that:
This means concurrent PySpark tasks don't interfere with each other's pod placement.
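As a rough illustration of the resource mutation and per-run labelling described above, the sketch below converts a Spark-style memory string into a Kubernetes quantity and assembles a pod-override-shaped dictionary. It is not the provider's code: the function names, the `spark-run-id` label key, and the plain-dict layout are assumptions made for illustration. Airflow's real `pod_override` takes a `kubernetes.client.models.V1Pod` object, and a real driver pod would typically also budget for memory overhead, which is ignored here.

```python
import re
import uuid
from typing import Optional

# Spark/JVM size suffixes are binary (1g = 1024 MiB), so they map directly
# onto Kubernetes' binary suffixes.
_JVM_TO_K8S_SUFFIX = {"k": "Ki", "m": "Mi", "g": "Gi", "t": "Ti"}


def jvm_memory_to_k8s(value: str) -> str:
    """Convert a Spark memory string such as '4g' into a K8s quantity such as '4Gi'."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory string: {value!r}")
    amount, unit = match.groups()
    return f"{amount}{_JVM_TO_K8S_SUFFIX[unit]}"


def build_driver_pod_override(cores: int, memory: str, run_id: Optional[str] = None) -> dict:
    """Build a pod-manifest-shaped dict sized for the Spark driver (hypothetical helper)."""
    resources = {"cpu": str(cores), "memory": jvm_memory_to_k8s(memory)}
    return {
        "metadata": {
            # Hypothetical label key; a fresh UUID per task run gives affinity
            # rules something unique to select on.
            "labels": {"spark-run-id": run_id or str(uuid.uuid4())},
        },
        "spec": {
            "containers": [
                {
                    # "base" is the worker container name used by the KubernetesExecutor.
                    "name": "base",
                    "resources": {"requests": dict(resources), "limits": dict(resources)},
                }
            ]
        },
    }


override = build_driver_pod_override(cores=2, memory="4g")
print(override["spec"]["containers"][0]["resources"]["requests"])
```

Setting requests equal to limits is only one possible policy; it is used here to keep the sketch short.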
Why not just `SparkSubmitOperator`?
The existing `apache.spark` provider's `SparkSubmitOperator` shells out to `spark-submit` and runs the driver in a separate process or pod. That works, but:
- `@task` integration: you cannot return a value from PySpark code and have it flow as XCom.

The proposed operator addresses these specifically for the K8s-executor deployment topology, which is increasingly common.
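To make the XCom point concrete, here is a minimal pure-Python stand-in for the injection pattern: a decorator creates a session object, passes it to the callable as a `spark` keyword argument, and hands the return value back to the caller. In a TaskFlow task, that return value is what Airflow would push as an XCom. `FakeSparkSession` and `inject_spark` are hypothetical names invented for this sketch so that it runs without Spark or Airflow; they are not the provider's API.

```python
import functools
from typing import Any, Callable


class FakeSparkSession:
    """Stand-in for pyspark.sql.SparkSession so the sketch runs without Spark."""

    def __init__(self) -> None:
        self.stopped = False

    def range_sum(self, n: int) -> int:
        # Placeholder for real work done through a SparkSession.
        return sum(range(n))

    def stop(self) -> None:
        self.stopped = True


def inject_spark(func: Callable[..., Any]) -> Callable[..., Any]:
    """Call func with a session passed as the `spark` kwarg and return its result."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        spark = FakeSparkSession()
        try:
            # The callable's return value is passed straight back to the caller.
            return func(*args, spark=spark, **kwargs)
        finally:
            # Always release the session, even if the callable raises.
            spark.stop()

    return wrapper


@inject_spark
def sum_below(n: int, spark: FakeSparkSession) -> int:
    return spark.range_sum(n)


result = sum_below(5)
print(result)
```

Because the callable runs in the same process as the caller, its result is an ordinary Python value, with no need to parse `spark-submit` output.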
Minimal usage example
Scope of the contribution
What I'd contribute, in `providers/apache/spark/`:
- `operators/pyspark_on_k8s.py`: operator
- `decorators/pyspark_on_k8s.py`: decorator
- `config/`: `SparkBaseConf`, `SparkDriverConf`, `SparkExecutorConf`, `SparkDeployMode`, `Sentinel`
- `resources/`: `CPU` and `Memory` value types with K8s↔JVM conversion
- `tests/`)

Additionally, in the official Airflow Helm chart (`chart/`):
- `Role` + `RoleBinding` granting the worker `ServiceAccount` the permissions required to create/list/watch/delete executor `Pod`s, `ConfigMap`s, and `PersistentVolumeClaim`s in the worker's namespace. This is the RBAC currently shipped by my third-party `pysparkonk8s-addon` chart; folding it into the official chart removes a manual install step for users.

Cross-provider dependency: the code imports from `apache.cncf.kubernetes` (for `PodGenerator` and `create_pod_id`). I'd declare this in `provider.yaml` rather than vendoring helpers.

Open questions for maintainers
- Is `providers/apache/spark/` the right home? Alternatives I considered: `providers/cncf/kubernetes/` (since it's K8s-deployment-specific), or a new `providers/apache/spark-on-k8s/` sub-provider. I lean toward folding it into `apache.spark`, but defer to maintainers.
- `PySparkOnK8sOperator`. Open to suggestions if `apache.spark` has naming conventions I should follow.
- Per `ACCEPTING_PROVIDERS.rst`, new community providers require committer sponsorship. Since I'm proposing to merge into an existing provider rather than create a new one, my reading is that standard PR review applies and no separate sponsorship is required. Please correct me if I have this wrong.
- Dependency on `apache.cncf.kubernetes`: is this fine if declared in `provider.yaml`, or is there a preferred pattern?
- `Role` + `RoleBinding` for executor-pod management, gated behind a values flag and disabled by default? I'd open this as a separate PR scoped to `chart/` after the provider PR lands, but want to flag it now since it's part of the overall donation.
- Is a committer on the `apache.spark`, `cncf.kubernetes`, or Helm chart side interested in reviewing? I'm happy to do all the restructuring work; I just want to make sure a reviewer is willing to engage before I invest the time.

About me
I'm the sole maintainer of the existing third-party provider. I've signed the ASF ICLA (or will before opening the PR). The project has been in production use for a few years now. I'm comfortable losing release control by moving to Airflow's release cadence, and I'm happy to maintain the contributed code long-term as a regular contributor.
Happy to demo a working DAG, share resource-mutation traces, or walk through the affinity logic if useful. Thanks for considering this.