This is the Scala repository of sparkmobility. The Python user interface for sparkmobility can be found in the sparkmobility Python repository (humnetlab/sparkmobility).
Whether modifying the Scala core of the project or running it, you must compile the Scala code and create a JAR file. Follow these steps:
-
Download and Install SBT:
Install sbt on your machine. You can follow the instructions on the official sbt documentation.
-
Package the Scala Code:
Resolve dependencies, compile, and package the Scala code using sbt:
sbt update
sbt compile
sbt assembly
This process creates target/scala-2.13/sparkmobility-<version>.jar. The Python package downloads a matching artifact from the repo's GitHub Releases into sparkmobility/lib/, so you don't need to copy the JAR anywhere for the normal Python workflow. Note the path only if you intend to spark-submit the JAR directly.
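If you do need the JAR path for spark-submit, a small helper can look it up for you. The following is a minimal sketch, not part of this repository: find_assembly_jar is a hypothetical function name, and the default directory simply assumes the target/scala-2.13/ layout described above.

```shell
# Hypothetical helper: print the most recently modified sparkmobility JAR
# found in a build directory (defaults to the Scala 2.13 output directory).
find_assembly_jar() {
  dir="${1:-target/scala-2.13}"
  newest=""
  for jar in "$dir"/sparkmobility-*.jar; do
    [ -e "$jar" ] || continue            # the glob matched nothing
    if [ -z "$newest" ] || [ "$jar" -nt "$newest" ]; then
      newest="$jar"                      # keep the newer of the two files
    fi
  done
  if [ -z "$newest" ]; then
    echo "no sparkmobility JAR found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

After sbt assembly, running find_assembly_jar from the repository root should print the path to pass to spark-submit.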
Spark installations sometimes default to Scala 2.12. Because the project is compatible only with Scala 2.13 and above, install a Spark build for Scala 2.13 before setting up the project. To do this, follow the instructions below:
-
Download Spark:
- Go to the Apache Spark download page.
- Select the following options:
- Spark release: 3.3.x or later
- Package type: Pre-built for Apache Hadoop
- Scala version: 2.13
- Download the .tgz file.
-
Extract the Spark Distribution: Extract the downloaded file to a directory (e.g., /opt/spark):
tar -xzf spark-3.5.4-bin-hadoop3-scala2.13.tgz -C /opt/spark
-
Set Environment Variables: Add the following environment variables to your shell configuration file (e.g., .bashrc or .zshrc):
export SPARK_HOME=/opt/spark/spark-3.5.4-bin-hadoop3-scala2.13
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
-
Reload the Shell Configuration:
source ~/.bashrc # or source ~/.zshrc
-
Verify the Scala Version: Run the following command to verify the Scala version:
spark-shell --version
The output includes the Scala version. Ensure it reports Scala version 2.13.x.
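The check above can also be scripted. This is a minimal sketch under stated assumptions: is_scala_213 is a hypothetical helper, and the pattern assumes the banner contains a phrase such as "Scala version 2.13.8" (with or without a colon after "version"). Adjust it if your Spark build prints something different.

```shell
# Hypothetical helper: read `spark-shell --version` output on stdin and
# succeed only if it reports a Scala 2.13.x build.
is_scala_213() {
  grep -Eq 'Scala version:? 2\.13\.[0-9]+'
}

# Usage. The banner is often written to stderr, so it is merged into stdout:
#   spark-shell --version 2>&1 | is_scala_213 && echo "Scala 2.13 detected"
```

A non-zero exit status means no Scala 2.13 banner was found, which usually indicates that a Scala 2.12 build of Spark is first on your PATH.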
This is a standard sbt project. You can use the following commands:
- Update dependencies: sbt update
- Compile the code: sbt compile
- Run the project: sbt run
- Start a Scala REPL: sbt console
The Scala code is located in the src/main/scala directory, and you can modify it there.
After editing, follow the steps below to compile and package the code.
sbt compile
sbt assembly
This process creates a .jar file (usually under the target/scala-2.13/ directory) that should be submitted to the Spark cluster.
The standard dev loop is ./update.sh, which runs the scalafmt formatters, compiles, and builds the assembly JAR. To cut a new release:
- Bump ThisBuild / version in build.sbt to match the sparkmobility Python package's pyproject.toml. The two versions MUST stay in lockstep: ensure_jar() in the Python package downloads sparkmobility-<version>.jar by string match.
- Commit and push.
- Tag and push the tag: git tag v<version> && git push origin v<version>.
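Because the lockstep requirement is easy to miss, you may want to check it locally before tagging. The sketch below is an illustration, not part of this repository or its release workflow: read_version and check_versions_match are hypothetical helpers. They assume build.sbt contains a line shaped like ThisBuild / version := "x.y.z" and pyproject.toml a line shaped like version = "x.y.z"; only the first such line in each file is read.

```shell
# Hypothetical helper: print the quoted value from the first `version` line.
# Assumed shapes:  ThisBuild / version := "1.2.3"   (build.sbt)
#                  version = "1.2.3"                (pyproject.toml)
read_version() {
  sed -n -E '/^[[:space:]]*(ThisBuild[[:space:]]*\/[[:space:]]*)?version[[:space:]]*:?=/{s/[^"]*"([^"]*)".*/\1/p;q;}' "$1"
}

# Hypothetical helper: compare the two versions.
# Prints the shared version on success; returns 1 on mismatch, 2 if unreadable.
check_versions_match() {
  sbt_version=$(read_version "$1")
  py_version=$(read_version "$2")
  if [ -z "$sbt_version" ] || [ -z "$py_version" ]; then
    echo "could not read a version from $1 or $2" >&2
    return 2
  fi
  if [ "$sbt_version" != "$py_version" ]; then
    echo "version mismatch: $1=$sbt_version $2=$py_version" >&2
    return 1
  fi
  printf '%s\n' "$sbt_version"
}
```

For example, check_versions_match build.sbt ../sparkmobility/pyproject.toml could be run before creating the tag; the second path is an assumption about where your checkout of the Python package lives.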
The .github/workflows/release.yml workflow verifies that the tag matches build.sbt, builds the assembly, and attaches the JAR to a GitHub Release. Python's ensure_jar() pulls from that release on first import.
The Python package (sparkmobility on PyPI / humnetlab/sparkmobility) calls into the JAR via py4j, using pipelines.PyEntryPoint as the only supported entry surface. Methods there take primitive args plus JSON strings — no Scala collections cross the bridge. Changing a PyEntryPoint signature is a breaking change for Python callers and requires a coordinated PR on both repos.
To run the project, use the following command, replacing <your_master_url> with your Spark master (e.g., local[*] or yarn):
spark-submit \
--class pipelines.Main \
--master <your_master_url> \
--driver-memory <MEMORY>g \
--executor-memory <MEMORY>g \
--conf "spark.driver.extraJavaOptions=-Dlog4j.rootLogger=WARN,console" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.rootLogger=WARN,console" \
/path/to/your/application.jar \
<input-parquet> <output-dir> <h3-resolution>
Main is the OD-matrix entry point used for standalone spark-submit; the primary consumer today is the Python package via PyEntryPoint (see Python integration).
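A wrapper script can reject an invalid <h3-resolution> before a Spark job is launched. The following is a minimal sketch: is_valid_h3_resolution is a hypothetical helper, and it only encodes the general H3 range of 0 to 15. Whether pipelines.Main accepts every value in that range is an assumption.

```shell
# Hypothetical helper: succeed only if the argument is an integer in the
# range of H3 resolutions (0 through 15).
is_valid_h3_resolution() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, negative, or contains a non-digit
  esac
  [ "$1" -le 15 ]
}

# Usage inside a wrapper script, before calling spark-submit:
#   is_valid_h3_resolution "$3" || { echo "bad H3 resolution: $3" >&2; exit 1; }
```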
For more information on sbt, visit the official sbt documentation.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Albert Cao (@caoalbert)
- Christopher Chávez (@Vanchristoph3r)
- Mingyi He (@Hemy17)
- Wolin Jiang (@LowenJiang)
- Giuseppe Perona (@g-perona)
- Jiaman Wu (@charmainewu)