Regularized K-Means Clustering via Fully Corrective Frank-Wolfe Optimization
Abstract
Clustering high-dimensional data remains challenging because traditional k-means is sensitive to noise, outliers, and high dimensionality, which often leads to unstable partitions. This research presents a robust clustering framework that applies the Fully Corrective Frank-Wolfe (FCFW) algorithm to a k-means objective with Frobenius-norm regularization. The regularization term yields more stable clusters, prevents overfitting, and promotes cluster compactness. The method uses probabilistic cluster assignments, so each data point can belong to several clusters with different membership levels, which accommodates overlapping cluster boundaries. The Kruskal-Wallis test serves as a feature-selection step that identifies the most informative genes and steers the clustering toward relevant features in high-dimensional datasets. In all experiments on synthetic and real gene expression datasets, the FCFW-regularized k-means outperforms traditional k-means. On a breast cancer gene expression dataset (GSE10797), it achieved an accuracy of 89.39%, compared to 58% for traditional k-means. It also surpassed a recent deep subspace clustering method (scPEDSSC) in Adjusted Rand Index (ARI) by 8.3% on the Goolam single-cell dataset (0.968 vs. 0.885) and by 7.2% on the Deng dataset (0.801 vs. 0.729). Overall, the proposed approach attained the highest ARI and Normalized Mutual Information (NMI) scores across five benchmark datasets, confirming that FCFW-regularized k-means yields more accurate and stable clustering on high-dimensional data.
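As a rough illustration of the two components described in the abstract, the sketch below combines a Kruskal-Wallis gene screen with a simplex-constrained, Frobenius-regularized soft k-means solved by plain Frank-Wolfe steps. It is a minimal sketch, not the authors' fully corrective (FCFW) implementation: the function names, the regularization weight `lam`, the step-size rule, and the use of preliminary group labels for the feature screen are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import kruskal

def kruskal_wallis_select(X, labels, top_m=200):
    """Rank genes (columns of X) by the Kruskal-Wallis H statistic across
    the given preliminary groups and keep the top_m genes (illustrative)."""
    groups = np.unique(labels)
    stats = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        stats[j], _ = kruskal(*[X[labels == g, j] for g in groups])
    keep = np.argsort(stats)[::-1][:top_m]
    return X[:, keep], keep

def regularized_soft_kmeans_fw(X, k, lam=0.1, n_iter=100, seed=0):
    """Alternate Frank-Wolfe steps on the soft-assignment matrix W (each row
    constrained to the probability simplex) with a closed-form ridge update
    of the centroid matrix C, for the objective
        f(W, C) = ||X - W C||_F^2 + lam * ||C||_F^2.
    A plain FW step is used here, not the fully corrective variant."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    W = rng.dirichlet(np.ones(k), size=n)          # initial soft memberships
    for t in range(n_iter):
        # Closed-form centroid update (ridge regression with W fixed).
        C = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ X)
        # Frank-Wolfe step on W: linear minimization over each row's simplex
        # puts all mass on the cluster with the smallest partial derivative.
        G = 2.0 * (W @ C - X) @ C.T                # gradient of f w.r.t. W
        S = np.zeros_like(W)
        S[np.arange(n), G.argmin(axis=1)] = 1.0
        gamma = 2.0 / (t + 2.0)                    # standard FW step size
        W = W + gamma * (S - W)                    # stays on the simplex
    return W, C

if __name__ == "__main__":
    # Toy usage on synthetic data with two well-separated clusters.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(5, 1, (50, 20))])
    prelim = np.repeat([0, 1], 50)                 # hypothetical preliminary labels
    X_sel, _ = kruskal_wallis_select(X, prelim, top_m=10)
    W, C = regularized_soft_kmeans_fw(X_sel, k=2, lam=0.1)
    print(W.argmax(axis=1))                        # hard labels from soft memberships
```

A fully corrective variant would, at each iteration, re-optimize the membership weights over all vertices (hard-assignment matrices) visited so far instead of taking a single convex-combination step; the sketch above only shows the simpler update.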
References
Adams, R.P., 2018. K-means clustering and related algorithms. Princeton University.
Ahmed, V.M. and Al-Haleem, A.A., 2024. Permeability prediction for Ajeel Oilfield/Tertiary Reservoir by integrating rock typing approach with FZI method. Journal of Engineering, 30(12), pp. 96–111. https://doi.org/10.31026/j.eng.2024.12.07.
AL-Kordy, S.U. and Khudair, B.H., 2021. Effluent quality assessment of sewage treatment plant using principal component analysis and cluster analysis. Journal of Engineering, 27(4), pp. 79–95. https://doi.org/10.31026/j.eng.2021.04.07.
Beznosikov, A., Dobre, D. and Gidel, G., 2023. Sarah Frank-Wolfe: Methods for constrained optimization with best rates and practical features. arXiv preprint arXiv:2304.11737.
Blanza, J., 2021. Wireless propagation multipaths using spectral clustering and three-constraint affinity matrix spectral clustering. Baghdad Science Journal, 18(2), P. 1001. https://doi.org/10.21123/bsj.2021.18.2(Suppl.).1001.
Canon, M.D. and Cullum, C.D., 1968. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control, 6(4), pp. 509–516. https://doi.org/10.1137/0306032.
Cherfaoui, F., Emiya, V., Ralaivola, L. and Anthoine, S., 2018. Frank-Wolfe algorithm for the exact sparse problem. arXiv preprint arXiv:1812.07201.
Daoudi, S., Anouar Zouaoui, C.M., El-Mezouar, M.C. and Taleb, N., 2021. Parallelization of the K-means++ clustering algorithm. Ingénierie des Systèmes d’Information, 26(1).
Deng, Q., Ramsköld, D., Reinius, B. and Sandberg, R., 2014. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science, 343(6167), pp. 193–196. https://doi.org/10.1126/science.1245316.
Feltes, B.C., Chandelier, E.B., Grisci, B.I. and Dorn, M., 2019. CuMiDa: An extensively curated microarray database for benchmarking and testing of machine learning approaches in cancer research. Journal of Computational Biology, 26(4), pp. 376–386. https://doi.org/10.1089/cmb.2018.0238.
Frank, M. and Wolfe, P., 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2), pp. 95–110. https://doi.org/10.1002/nav.3800030112.
Gao, C.X., Dwyer, D., Zhu, Y., Smith, C.L., Du, L., Filia, K.M., Bayer, J., Menssink, J.M., Wang, T. and Bergmeir, C., 2023. An overview of clustering methods with guidelines for application in mental health research. Psychiatry Research, 327, P. 115265. https://doi.org/10.1016/j.psychres.2023.115265.
Ghathwan, K.I. and Mohammed, A.J., 2022. Intelligent bat algorithm for finding eps parameter of DbScan clustering algorithm. Iraqi Journal of Science, pp. 5572–5580. https://doi.org/10.24996/ijs.2022.63.12.41.
Gondeau, A., Aouabed, Z., Hijri, M., Peres-Neto, P.R. and Makarenkov, V., 2019. Object weighting: A new clustering approach to deal with outliers and cluster overlap in computational biology. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(2), pp. 633–643. https://doi.org/10.1109/TCBB.2019.2921577.
Goolam, M., Scialdone, A., Graham, S.J.L., Macaulay, I.C., Jedrusik, A., Hupalowska, A., Voet, T., Marioni, J.C. and Zernicka-Goetz, M., 2016. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell, 165(1), pp. 61–74. https://doi.org/10.1097/01.ogx.0000488738.30718.bf.
Grisci, B.I., Feltes, B.C. and Dorn, M., 2019. Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of Biomedical Informatics, 89, pp. 122–133. https://doi.org/10.1016/j.jbi.2018.11.013.
He, Y. and Zheng, Y., 2018. Short-term power load probability density forecasting based on Yeo-Johnson transformation quantile regression and Gaussian kernel function. Energy, 154, pp. 143–156. https://doi.org/10.1016/j.energy.2018.04.072.
Holloway, C.A., 1974. An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6, pp. 14–27. https://doi.org/10.1007/BF01580219.
Jamail, I. and Moussa, A., 2020. Current state-of-the-art of clustering methods for gene expression data with RNA-Seq. In: Applications of Pattern Recognition. IntechOpen. https://doi.org/10.5772/intechopen.94069.
Jiang, P., Cao, J., Yu, W. and Nie, F., 2025. A robust entropy regularized K-means clustering algorithm for processing noise in datasets. Neural Computing and Applications, pp. 1–16. https://doi.org/10.1007/s00521-024-10899-4.
Lacoste-Julien, S. and Jaggi, M., 2015. On the global linear convergence of Frank-Wolfe optimization variants. Advances in Neural Information Processing Systems, 28.
Lacoste-Julien, S., Jaggi, M., Schmidt, M. and Pletscher, P., 2013. Block-coordinate Frank-Wolfe optimization for structural SVMs. In: International Conference on Machine Learning. PMLR. pp. 53–61.
Lei, T., Jia, X., Zhang, Y., He, L., Meng, H. and Nandi, A.K., 2018. Significantly fast and robust fuzzy c-means clustering algorithm based on morphological reconstruction and membership filtering. IEEE Transactions on Fuzzy Systems, 26(5), pp. 3027–3041. https://doi.org/10.1109/TFUZZ.2018.2796074.
Li, X., Cai, L., Li, J., Yu, C.K.W. and Hu, Y., 2021. A survey of clustering methods via optimization methodology. Journal of Applied & Numerical Optimization, 3(1). https://doi.org/10.23952/jano.3.2021.1.09.
Liu, H., Chen, J., Dy, J. and Fu, Y., 2023. Transforming complex problems into K-means solutions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), pp. 9149–9168.
Mahdi, S.S. and Mahmood, R.S., 2014. MR brain image segmentation using spatial fuzzy C-means clustering algorithm. Journal of Engineering, 20(09), pp. 78–89. https://doi.org/10.31026/j.eng.2014.09.06.
Meilă, M., 2007. Comparing clusterings—An information based distance. Journal of Multivariate Analysis, 98(5), pp. 873–895. https://doi.org/10.1016/j.jmva.2006.11.013.
Meléndez Surmay, R., Giraldo Henao, R. and Rodríguez Cortes, F., 2024. Kruskal-Wallis test for functional data based on random projections generated from a simulation of a Brownian motion. TecnoLógicas, 27(59).
Oyelade, J., Isewon, I., Oladipupo, F., Aromolaran, O., Uwoghiren, E., Ameh, F., Achas, M. and Adebiyi, E., 2016. Clustering algorithms: Their application to gene expression data. Bioinformatics and Biology Insights, 10, P. BBI-S38316. https://doi.org/10.4137/BBI.S38316.
Raeisi, M. and Sesay, A.B., 2022. A distance metric for uneven clusters of unsupervised k-means clustering algorithm. IEEE Access, 10, pp. 86286–86297. https://doi.org/10.1109/ACCESS.2022.3198992.
Raymaekers, J. and Zamar, R.H., 2022. Regularized k-means through hard-thresholding. Journal of Machine Learning Research, 23(93), pp. 1–48.
Saha, J., Tanvir, R.H., Hassan Samee, M.A. and Rahman, A., 2023. Probabilistic clustering of cells using single-cell RNA-seq data. bioRxiv preprint.
Salman, A. and Hussain, B.A., 2023. Gene expression analysis via spatial clustering and evaluation indexing. Iraqi Journal for Computer Science and Mathematics, 4(1), pp. 24–34. https://doi.org/10.52866/ijcsm.2023.01.01.004.
Sarray, B. Al, Chrétien, S., Clarkson, P. and Cottez, G., 2017. Enhancing Prony’s method by nuclear norm penalization and extension to missing data. Signal, Image and Video Processing, 11, pp. 1089–1096.
Shiltagh, N.A. and Hussein, M.A., 2015. Data aggregation in wireless sensor networks using modified Voronoi fuzzy clustering algorithm. Journal of Engineering, 21(4), pp. 42–60. https://doi.org/10.31026/j.eng.2015.04.03.
Strehl, A. and Ghosh, J., 2002. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec), pp. 583–617. https://doi.org/10.1162/153244303321897735.
Sun, W., Wang, J. and Fang, Y., 2012. Regularized k-means clustering of high-dimensional data and its asymptotic consistency.
Teran Hidalgo, S.J., Zhu, T., Wu, M. and Ma, S., 2018. Overlapping clustering of gene expression data using penalized weighted normalized cut. Genetic Epidemiology, 42(8), pp. 796–811.
Wei, X., Wu, J., Li, G., Liu, J., Wu, X. and He, C., 2025. scPEDSSC: Proximity enhanced deep sparse subspace clustering method for scRNA-seq data. PLOS Computational Biology, 21(4), P. e1012924. https://doi.org/10.1371/journal.pcbi.1012924.
Wirth, E., Pena, J. and Pokutta, S., 2024. Fast convergence of Frank-Wolfe algorithms on polytopes. arXiv preprint arXiv:2406.18789.
Wu, Z. and Wu, Z., 2020. An enhanced regularized k-means type clustering algorithm with adaptive weights. IEEE Access, 8, pp. 31171–31179. https://doi.org/10.1109/ACCESS.2020.2972333.
Yang, X., Zhao, W., Xu, Y., Wang, C.-D., Li, B. and Nie, F., 2024. Sparse K-means clustering algorithm with anchor graph regularization. Information Sciences, 667, P. 120504. https://doi.org/10.1016/j.ins.2024.120504.
Yousif, A.Y. and Sarray, B. Al, 2024. Convex optimization techniques for high-dimensional data clustering analysis: A review. Iraqi Journal for Computer Science and Mathematics, 5(3), P. 29. https://doi.org/10.52866/ijcsm.2024.05.03.022.
Zhang, X., He, Y., Jin, Y., Qin, H., Azhar, M. and Huang, J.Z., 2020. A robust k‐means clustering algorithm based on observation point mechanism. Complexity, 2020(1), P. 3650926. https://doi.org/10.1155/2020/3650926.