Hamju1999/Big-Data-Clustering-Engine-Extension

Distributed Clustering Engine: 'tidyclust' + 'sparklyr'

Project Goal

The 'tidyclust' package provides a unified interface for clustering in R but lacks support for distributed big data environments. This project extends 'tidyclust' by bridging it to Apache Spark MLlib through the sparklyr engine.

My Technical Contributions

  • Engine Integration Layer: Defined and implemented new methods for k_means(), bisect_kmeans(), and gaussian_mixture() to support the sparklyr backend.
  • Prediction Logic: Developed the .k_means_predict_sparklyr internal function to handle communication between R and Spark cluster nodes during the inference phase.
  • Unit Testing & QA: Used the testthat framework to validate the engine against the Ames housing dataset, checking that local and distributed results agree.
  • Error Remediation: Debugged and fixed specification errors in fit(), specifically in how the number of clusters (num_clust) is passed to the engine.
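
One practical wrinkle when checking parity between a local and a distributed fit is that cluster labels are arbitrary: the same grouping can come back numbered differently. The sketch below shows one way such a check could be written with testthat. The helper name same_partition and the label vectors are hypothetical illustrations, not code taken from this repository.

```r
library(testthat)

# Hypothetical helper: TRUE when two label vectors describe the same grouping,
# regardless of how the clusters happen to be numbered.
same_partition <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)
  # Identical partitions leave exactly one non-zero cell per row and per column.
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

test_that("relabelled clusterings are treated as equal", {
  local_labels <- c(1, 1, 2, 2, 3)
  spark_labels <- c(3, 3, 1, 1, 2)  # same grouping, different numbering
  expect_true(same_partition(local_labels, spark_labels))
  expect_false(same_partition(local_labels, c(1, 2, 2, 2, 3)))
})
```

In a real parity test the two vectors would come from the local and Spark fits on the same data. Exact agreement is not guaranteed, because k-means depends on its random initialization.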

System Architecture

The extension lets users keep the same 'tidymodels' syntax while offloading the heavy computation to a Spark cluster:

  1. Input: Standard tidyclust specification.
  2. Engine: set_engine("sparklyr").
  3. Execution: Distributed processing via Spark MLlib.
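
The three steps above might look roughly like the following in practice. This is a minimal sketch, not the repository's documented usage: it assumes sparklyr, a working Spark installation, and this extension's "sparklyr" engine are installed, and that fit() accepts a Spark table through the formula interface. The function name fit_spark_kmeans, the mtcars stand-in data, and the chosen columns are hypothetical. num_clusters is the argument name used by the released tidyclust k_means().

```r
# Sketch only: requires sparklyr, a Spark installation, and the "sparklyr"
# engine from this extension. `sc` is an open connection from spark_connect().
fit_spark_kmeans <- function(sc, k = 3) {
  # mtcars is a stand-in for a dataset large enough to need Spark.
  cars_tbl <- dplyr::copy_to(sc, mtcars, "cars", overwrite = TRUE)

  # 1. Input: a standard tidyclust specification.
  spec <- tidyclust::k_means(num_clusters = k)

  # 2. Engine: select the Spark backend instead of the default "stats".
  spec <- parsnip::set_engine(spec, "sparklyr")

  # 3. Execution: the fit is assumed to be delegated to Spark MLlib.
  generics::fit(spec, ~ mpg + wt + hp, data = cars_tbl)
}

# Usage (not run here because it needs a Spark installation):
# sc <- sparklyr::spark_connect(master = "local")
# model <- fit_spark_kmeans(sc)
# sparklyr::spark_disconnect(sc)
```

Passing the connection in, rather than opening and closing it inside the function, keeps the returned model usable for as long as the caller's Spark session stays open.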

Technical Stack

  • Languages: R
  • Distributed Systems: Apache Spark, sparklyr
  • Software Engineering: tidyclust, tidymodels, testthat

About

Spark ML engine extension for the tidyclust R package, enabling distributed K-Means clustering.
