Analysis of biological information detection technology based on the integration of non-parametric statistics and machine learning
Abstract
This study uses breast cancer data from the National Cancer Institute (NCI) database, focusing on triple-negative breast cancer (n = 200) and Luminal B (LumB) subtype breast cancer (n = 400), and designs a data generation and analysis pipeline that combines non-parametric statistics with machine learning. First, the wgain algorithm was developed by integrating a Wasserstein Generative Adversarial Network (WGAN) with the Random Forest algorithm. The expanded dataset it generated was consistent with the original data, with a Pearson correlation coefficient of approximately 0.9, and Principal Component Analysis (PCA) confirmed the accuracy and consistency of the generated data. Next, the optimal threshold for differential gene selection was determined using the High-Confidence (HC) high-order identification method, and significance was assessed with the rank-sum test, the Kolmogorov-Smirnov (K-S) test, and the edgeR test. Of the three, the rank-sum test performed best (False Discovery Rate (FDR) = 0.099). Compared with the GAN and Wasserstein GAN with Gradient Penalty (WGAN-GP) algorithms, wgain showed a significant advantage in data consistency and in reproducing differential genes (accuracy 83%). These results demonstrate the advantages of combining non-parametric statistics with machine learning and provide a new method for biological data generation and precise analysis.
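As a rough illustration of the rank-sum screening step described above, the following minimal sketch runs a two-sided Wilcoxon rank-sum test per gene and then applies a Benjamini-Hochberg adjustment. It is not the authors' pipeline: the simulated Gaussian expression values, the gene counts, the effect size, the 0.1 cutoff, the normal approximation for the p-value, and the use of Benjamini-Hochberg are all assumptions made for the example, and the HC threshold selection is not reproduced. Only the group sizes (200 and 400) come from the abstract.

```python
import math
import random


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected,
    no continuity correction)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    # Rank sum of the first group and its null mean and variance.
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n + 1) / 2.0
    var_w = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return 1.0
    z = (w - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def simulate(n_genes=200, n_true=20, shift=1.0, cutoff=0.1, seed=0):
    """Hypothetical simulation: the first n_true genes differ between groups."""
    rng = random.Random(seed)
    pvalues = []
    for g in range(n_genes):
        mu = shift if g < n_true else 0.0
        group_a = [rng.gauss(mu, 1.0) for _ in range(200)]   # e.g. triple-negative
        group_b = [rng.gauss(0.0, 1.0) for _ in range(400)]  # e.g. LumB
        pvalues.append(rank_sum_pvalue(group_a, group_b))
    qvalues = benjamini_hochberg(pvalues)
    selected = [g for g, q in enumerate(qvalues) if q < cutoff]
    false_hits = sum(1 for g in selected if g >= n_true)
    fdp = false_hits / len(selected) if selected else 0.0
    return len(selected), fdp


n_selected, fdp = simulate()
print(f"selected genes: {n_selected}, empirical false discovery proportion: {fdp:.3f}")
```

The printed proportion is the share of selected genes that were simulated without a true difference. It only illustrates what an FDR figure measures and is not comparable to the 0.099 reported in the abstract.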
Copyright (c) 2025 Author(s)

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright on all articles published in this journal is retained by the author(s), who grant the publisher the right to publish the article as its original publisher.
Articles published in this journal are licensed under a Creative Commons Attribution 4.0 International License, which means they can be shared, adapted, and distributed provided that the original published version is cited.