Data Saturation Reliability Theory: A Framework for Optimising AI Input Feeds

Authors

  • Michael Mncedisi Willie (Council for Medical Schemes, Pretoria, South Africa)
  • Siyabonga Jikwana (Gauteng Department of Health, Johannesburg, South Africa; University of Pretoria, Pretoria, South Africa)
  • Lesiba Arnold Malotana (Gauteng Department of Health, Johannesburg, South Africa)
  • Zwanaka James Mudara (Vaal University of Technology, Vereeniging, South Africa)

DOI:

https://doi.org/10.55578/isgm.2509.006

Keywords:

Artificial Intelligence, Data Saturation, Reliability, Input Feeds, Signal-to-Noise Ratio, Feedback Mechanisms, AI Governance, Data Quality, Machine Learning, Optimisation

Abstract

Artificial Intelligence (AI) systems increasingly rely on large and diverse data streams to support accurate, adaptive, and context-aware decision-making. Beyond a certain point, however, adding new data yields diminishing or even negative returns due to redundancy, noise, and bias, a phenomenon known as data saturation. This paper introduces Data Saturation Reliability (DSR), a conceptual framework for optimising AI input feeds by balancing data volume, quality, and reliability. Drawing on principles from information theory, machine learning, and data governance, the DSR framework formalises saturation thresholds, signal-to-noise ratio assessment, temporal relevance, and dynamic feedback mechanisms as the key determinants of sustainable AI performance. By linking marginal information gain to input reliability, the framework offers strategies to mitigate the risks of over-saturation, bias propagation, and operational inefficiency while improving predictive accuracy and adaptive learning. It prioritises quality over quantity, encouraging intelligent curation of inputs rather than indiscriminate data collection. Applications include high-stakes fields such as healthcare diagnostics, financial forecasting, autonomous systems, and large-scale natural language processing, where real-time decision accuracy and reliability are vital. The paper highlights opportunities for empirical validation, cross-domain adaptation, and integration of DSR principles into AI lifecycle management and governance. Ultimately, the framework promotes a shift from “more data equals better performance” towards an optimal data balance that ensures operational effectiveness and ethical responsibility in AI deployment.
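The abstract's central idea, that marginal information gain shrinks as an input feed saturates, can be illustrated with a toy calculation. The paper's own formalisation is not reproduced on this page, so the sketch below is only a rough analogue: it uses Shannon-entropy gain per batch as a stand-in for marginal information gain and flags the first batch whose gain falls below a chosen threshold. The function names (`marginal_gains`, `saturation_point`) and the threshold value are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def marginal_gains(stream, batch_size):
    """Entropy change contributed by each successive batch of symbols."""
    counts, gains, prev = Counter(), [], 0.0
    for i in range(0, len(stream), batch_size):
        counts.update(stream[i:i + batch_size])
        h = entropy(counts)
        gains.append(h - prev)  # marginal gain of this batch
        prev = h
    return gains

def saturation_point(gains, threshold):
    """Index of the first batch whose marginal gain drops below threshold."""
    for i, gain in enumerate(gains):
        if gain < threshold:
            return i
    return None  # feed never saturates under this threshold

# A redundant feed: fresh symbols early, repetition later.
stream = list("abcdefgh") + list("aaaabbbb") * 4
gains = marginal_gains(stream, batch_size=8)
print(saturation_point(gains, threshold=0.05))  # → 1
```

In this toy run the first batch contributes 3 bits of new information, while every later batch merely repeats known symbols, so saturation is flagged at batch 1. A DSR-style curation policy would stop ingesting (or down-weight) the feed at that point rather than continue indiscriminate collection.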


Published

2025-09-02

Data Availability Statement

This study exclusively used data obtained from secondary sources through a comprehensive literature review. All referenced data are publicly accessible and have been appropriately cited within the manuscript. 

Section

Articles

How to Cite

Data Saturation Reliability Theory: A Framework for Optimising AI Input Feeds. (2025). Interdisciplinary Systems for Global Management, 1(2), 76–85. https://doi.org/10.55578/isgm.2509.006