The 'tidyclust' package provides a unified interface for clustering in R, but it lacked support for distributed Big Data environments. This project extended 'tidyclust' by engineering a bridge to Apache Spark MLlib via the sparklyr engine.
- Engine Integration Layer: Defined and implemented new methods for `k_means()`, `bisect_kmeans()`, and `gaussian_mixture()` to support the `sparklyr` backend.
- Prediction Logic: Developed the `.k_means_predict_sparklyr` internal function to handle communication between R and Spark cluster nodes during the inference phase.
- Unit Testing & QA: Used the `testthat` framework to validate engine performance against the Ames housing dataset, ensuring parity between local and distributed results.
- Error Remediation: Debugged and resolved critical specification errors within the `fit()` function, specifically regarding cluster number (`num_clust`) parameter passing.
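The parity testing described above could be sketched roughly as follows. This is an illustrative example, not the project's actual test suite: the local Spark master, the use of `modeldata::ames`, and the comparison on centroid counts are all assumptions.

```r
library(testthat)
library(tidyclust)
library(sparklyr)
library(modeldata)  # provides the Ames housing dataset
library(dplyr)

test_that("sparklyr k-means matches the local engine on Ames data", {
  # Assumption: a local Spark instance is available for testing
  sc <- spark_connect(master = "local")

  # Spark MLlib k-means works on numeric features
  ames_numeric <- modeldata::ames |> select(where(is.numeric))
  ames_tbl <- copy_to(sc, ames_numeric, overwrite = TRUE)

  spec <- k_means(num_clusters = 3)

  # Fit the same specification with both engines
  local_fit <- spec |> set_engine("stats") |> fit(~ ., data = ames_numeric)
  spark_fit <- spec |> set_engine("sparklyr") |> fit(~ ., data = ames_tbl)

  # A weak parity check: both engines should produce the same
  # number of cluster centroids (cluster labels may be permuted)
  expect_equal(nrow(extract_centroids(local_fit)),
               nrow(extract_centroids(spark_fit)))

  spark_disconnect(sc)
})
```

A stricter check would match centroids up to label permutation before comparing coordinates, since k-means cluster IDs are arbitrary across engines.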
The extension lets users keep the same 'tidymodels' syntax while offloading the heavy computation to a Spark cluster:
- Input: Standard `tidyclust` specification.
- Engine: `set_engine("sparklyr")`.
- Execution: Distributed processing via Spark MLlib.
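The three steps above can be sketched end to end. This is a minimal illustration under assumptions: a local Spark master stands in for a real cluster, and `mtcars` stands in for the user's data.

```r
library(tidyclust)
library(sparklyr)

# Assumption: connect to a local Spark instance (a real deployment
# would point at a cluster master instead)
sc <- spark_connect(master = "local")
data_tbl <- copy_to(sc, mtcars, name = "cars", overwrite = TRUE)

# 1. Input: standard tidyclust specification
km_spec <- k_means(num_clusters = 4) |>
  # 2. Engine: route computation to Spark MLlib
  set_engine("sparklyr")

# 3. Execution: fitting and prediction run on the Spark side,
# against the Spark table rather than a local data frame
km_fit <- fit(km_spec, ~ ., data = data_tbl)
predict(km_fit, new_data = data_tbl)

spark_disconnect(sc)
```

The key point is that only `set_engine()` and the data source change; the model specification and `fit()`/`predict()` calls are the same as in a purely local 'tidymodels' workflow.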
- Languages: R
- Distributed Systems: Apache Spark, sparklyr
- Software Engineering: tidyclust, tidymodels, testthat