GENERA-BASE: A Method-Agnostic Framework for Synthetic Health Data

Edwin Gerardo Acuña Acuña; Sacramento Cruz-Doriano; María Teresita de Jesús Chi-Chan; Felipe Ángel Álvarez-Salgado

doi:10.55578/amsr.2605.007

Authors

Dr. Edwin Gerardo Acuña Acuña, Universidad Latina, San Pedro de Montes de Oca, San José, Costa Rica. ORCID: https://orcid.org/0000-0001-7897-4137
Dr. Sacramento Cruz-Doriano, Instituto Tecnológico Superior de Calkiní, Calkiní, Campeche, México. ORCID: https://orcid.org/0000-0002-8837-7114
Dr. María Teresita de Jesús Chi-Chan, Instituto Tecnológico Superior de Calkiní, Calkiní, Campeche, México. ORCID: https://orcid.org/0000-0002-2642-9249
Dr. Felipe Ángel Álvarez-Salgado, Instituto Tecnológico Superior de Calkiní, Calkiní, Campeche, México. ORCID: https://orcid.org/0000-0002-2191-2856

Keywords:

synthetic data, research methodology, statistical validation, reproducibility, methods education, health sciences research design

Abstract

Synthetic datasets are increasingly important in health research for methods teaching, pre-registration, statistical software validation, pilot testing, and privacy-preserving analytical prototyping. However, current tools remain fragmented, often depend on platform-specific implementations, and seldom integrate statistical validation with a uniform specification language. This paper introduces GENERA-BASE, a specification-driven and method-agnostic framework for creating, validating, and exporting synthetic datasets for a range of quantitative research designs in the health sciences. The framework comprises four steps: structured specification of the research design and target statistical properties; method-agnostic data generation; integrated validation through seven categories of statistical integrity tests; and cross-platform export to SPSS, R, Python, Stata, SAS, and JASP. The framework was evaluated on four common design patterns in health research: an experimental design resembling a randomized controlled trial, a correlational/regression design, a longitudinal cohort, and a Likert-based psychometric structure. One calibration cycle was sufficient to recover all 44 prespecified validation indicators across the four applications within tolerance. The synthetic randomized trial reproduced the target intervention effect while preserving baseline equivalence, the correlational dataset recovered the expected association structure, the longitudinal dataset reproduced the target monthly slope, and the psychometric dataset attained acceptable reliability and factor-related properties. Overall fidelity was very high across 18 numerical target-versus-achieved metrics (Pearson r = 0.9985, p < 0.001). These results suggest that GENERA-BASE offers a transparent, interoperable, and reproducible system for creating synthetic data in health research.
Its primary contribution is to support training, methodological experimentation, pilot preparation, and pre-registered analytical research by combining structured specification, validation, and platform interoperability into a single systematic workflow.
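
As a rough illustration of the specify, generate, validate loop described in the abstract, the following Python sketch builds a toy two-arm trial dataset from a small specification and compares targets with achieved values. It is not the GENERA-BASE implementation: the `SPEC` dictionary, its field names and values, the linear rescaling used as a single calibration pass, and the helper functions are all assumptions made for this example.

```python
import random
import statistics

# Hypothetical specification; field names and values are illustrative only.
SPEC = {
    "n_per_arm": 200,
    "control_mean": 50.0,
    "treatment_mean": 55.0,  # implies a target intervention effect of 5.0
    "sd": 10.0,
    "tolerance": 0.5,        # allowed absolute deviation per metric
}

def rescale(values, mean, sd):
    """One calibration pass (assumed): linearly rescale to the target mean and SD."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return [mean + sd * (v - m) / s for v in values]

def generate(spec, seed=0):
    """Draw standard-normal scores per arm, then calibrate each arm to its targets."""
    rng = random.Random(seed)
    n = spec["n_per_arm"]
    control = [rng.gauss(0, 1) for _ in range(n)]
    treatment = [rng.gauss(0, 1) for _ in range(n)]
    return {
        "control": rescale(control, spec["control_mean"], spec["sd"]),
        "treatment": rescale(treatment, spec["treatment_mean"], spec["sd"]),
    }

def validate(spec, data):
    """Return (targets, achieved, passed) for means, SDs, and the arm difference."""
    mc = statistics.fmean(data["control"])
    mt = statistics.fmean(data["treatment"])
    targets = [
        spec["control_mean"],
        spec["treatment_mean"],
        spec["sd"],
        spec["sd"],
        spec["treatment_mean"] - spec["control_mean"],
    ]
    achieved = [
        mc,
        mt,
        statistics.stdev(data["control"]),
        statistics.stdev(data["treatment"]),
        mt - mc,
    ]
    passed = all(abs(t - a) <= spec["tolerance"] for t, a in zip(targets, achieved))
    return targets, achieved, passed

def pearson_r(x, y):
    """Pearson correlation between target and achieved metric vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

data = generate(SPEC)
targets, achieved, passed = validate(SPEC, data)
print(passed, pearson_r(targets, achieved))
```

Because the rescaling step forces each arm onto its target mean and standard deviation, this toy example matches its targets almost exactly; a realistic generator would also need to handle covariates, correlations, and non-normal distributions, which this sketch does not attempt.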

Author Biography

Dr. Edwin Gerardo Acuña Acuña, Universidad Latina, San Pedro de Montes de Oca, San José, Costa Rica

Dr. Edwin Gerardo Acuña Acuña is a researcher and university professor in mathematics, data science, artificial intelligence, and engineering. He is affiliated with the Postgraduate Faculty at Universidad Latina de Costa Rica and works on interdisciplinary research in health sciences, synthetic data, quantitative methods, and emerging technologies.
