Enhancing Large Language Model Reasoning via Retrieval-Augmented Generation and Self-Verification Mechanisms

Zerminey Saleem; Muhammad Noor ul Haq; Sifat Ullah; Ali Raza

doi:10.63544/xgvmne24

Authors

Zerminey Saleem, Department of Computer Science, Bahria University, Karachi, Pakistan
Muhammad Noor ul Haq, Department of Computer Science, Government College University, Faisalabad, Pakistan
Sifat Ullah, Department of Computer Science, Islamia College University Peshawar, Pakistan
Ali Raza, Department of Information Technology, Government College University, Hyderabad

Keywords:

Large Language Models, Retrieval-Augmented Generation, Self-Verification, Hallucination Mitigation, Evidence-Based Reasoning, Knowledge-Intensive Tasks, Factual Accuracy, Natural Language Processing

Abstract

This study proposes a Retrieval-Augmented Generation–Self-Verification (RAG–SV) framework to improve the reasoning reliability and factual accuracy of large language models (LLMs) on knowledge-intensive tasks. The framework combines external evidence retrieval with a self-verification mechanism that evaluates and refines the model's own responses before the final output is produced. Experiments on open-domain question answering, fact verification, and multi-step reasoning tasks show that the proposed approach achieves higher Exact Match and F1 scores and significantly reduces hallucination compared with standard LLMs, retrieval-augmented baselines, and self-verification-only models. Human evaluation further indicates that RAG–SV outputs are more accurate, more coherent, and more closely aligned with the underlying evidence. The framework is particularly suited to high-stakes domains such as education, healthcare, and law, where correctness and explainability are critical. The study concludes that integrating retrieval with reflective self-checking offers a practical path toward more robust, trustworthy, and evidence-based language generation.
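
The abstract describes the framework only at a high level: retrieve evidence, generate a response, then check and refine that response before output. The control flow can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the word-overlap retriever, the overlap-based support check, the 0.8 threshold, the feedback string, and the names `retrieve`, `is_supported`, `rag_sv`, and `toy_generate` are all hypothetical stand-ins, and a real system would use a learned retriever and an actual LLM in place of the toy generator.

```python
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for real text normalization."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}


def retrieve(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def is_supported(answer: str, evidence: List[str], threshold: float = 0.8) -> bool:
    """Toy verifier (assumption): fraction of answer words found in the evidence."""
    a = tokenize(answer)
    if not a:
        return False
    e = set().union(*(tokenize(p) for p in evidence))
    return len(a & e) / len(a) >= threshold


def rag_sv(
    question: str,
    corpus: List[str],
    generate: Callable[[str, List[str], str], str],
    max_rounds: int = 2,
) -> Tuple[str, bool]:
    """Retrieve, generate, verify; regenerate with feedback if verification fails.

    Returns the last answer and whether it passed the support check.
    """
    evidence = retrieve(question, corpus)
    feedback = ""
    answer = ""
    for _ in range(max_rounds):
        answer = generate(question, evidence, feedback)
        if is_supported(answer, evidence):
            return answer, True
        # Hypothetical feedback message passed to the next generation round.
        feedback = "The previous answer was not supported by the evidence."
    return answer, False


def toy_generate(question: str, evidence: List[str], feedback: str) -> str:
    """Stand-in for an LLM: gives an unsupported answer first, then a grounded one."""
    return "Paris" if feedback else "Lyon"


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Nile is a river in Africa.",
]
result = rag_sv("What is the capital of France?", corpus, toy_generate)
print(result)  # ('Paris', True)
```

In this toy run the first answer ("Lyon") fails the support check because it does not appear in the retrieved passages, so the loop regenerates with feedback and accepts the second answer. Returning the verification flag alongside the answer is a design choice of the sketch, made so a caller can decide what to do with an unverified response.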

Author Biographies

Zerminey Saleem, Department of Computer Science, Bahria University, Karachi, Pakistan
Email: zermineysaleem@gmail.com

Muhammad Noor ul Haq, Department of Computer Science, Government College University, Faisalabad, Pakistan
Email: lunarstra95@gmail.com

Sifat Ullah, Department of Computer Science, Islamia College University Peshawar, Pakistan
Email: sifat910ullah@gmail.com

Ali Raza, Department of Information Technology, Government College University, Hyderabad
Email: alirazaabro311@gmail.com