Trustworthy LLMs for Ethically Aligned AI-based Systems: A PhD Research Plan

José Antonio Siqueira de Cerqueira1,*, Rebekah Rousi2, Nannan Xi1, Juho Hamari1, Kai-Kristian Kemell1 and Pekka Abrahamsson1
1Tampere University (TAU), Finland
2University of Vaasa (UWASA), Finland

Abstract
In response to growing concerns around trustworthiness and ethical alignment in AI systems, this PhD aims to investigate how Large Language Models (LLMs) can be leveraged to support ethically aligned AI development in software engineering. Despite advancements, integrating ethical principles into AI workflows remains challenging, particularly in real-world applications that require compliance with emerging regulations, such as the EU AI Act. We will develop a Visual Studio Code (VSCode) Generative AI (GenAI) Extension powered by a multi-agent LLM system with Retrieval-Augmented Generation (RAG) capabilities. The extension will be designed to aid developers by evaluating code compliance with ethical standards and providing actionable recommendations to embed trustworthiness from the early stages of development. The GenAI Extension will be evaluated through an iterative design science approach, encompassing dataset generation, ethical benchmarking, and practitioner testing. A dataset of over 2,000 ethically aligned AI systems will be created in compliance with leading regulatory frameworks, serving as a foundation for the tool's assessments. With this work, we hope to assist developers, particularly in startups and SMEs, by providing practical resources for building ethically aligned AI with limited resources. Through this approach, we aim to bridge the gap between abstract ethical principles and actionable software development practices, making ethical AI more accessible across industry contexts.

Keywords
AI ethics, Large Language Models, Trustworthiness, AI4SE

The 15th International Conference on Software Business (ICSOB 2024), November 18–20, 2024, Utrecht, The Netherlands
*Corresponding author.
jose.siqueiradecerqueira@tuni.fi (J. A. S. d. Cerqueira); rebekah.rousi@uwasa.fi (R. Rousi); nannan.xi@tuni.fi (N. Xi); juho.hamari@tuni.fi (J. Hamari); kai-kristian.kemell@tuni.fi (K. Kemell); pekka.abrahamsson@tuni.fi (P. Abrahamsson)
ORCID: 0000-0002-8143-1042 (J. A. S. d. Cerqueira); 0000-0001-5771-3528 (R. Rousi); 0000-0002-9424-8116 (N. Xi); 0000-0002-6573-588X (J. Hamari); 0000-0002-0225-4560 (K. Kemell); 0000-0002-4360-2226 (P. Abrahamsson)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Problem Definition
In today's increasingly digitized world, Artificial Intelligence (AI) is emerging as a transformative force, reshaping industries, economies, and daily lives. From virtual assistants and recommendation algorithms to autonomous vehicles and medical diagnostics, AI-based systems, particularly Large Language Models (LLMs), are becoming ubiquitous, wielding considerable influence over decision-making processes and human interactions [1, 2]. LLMs are AI models built with complex algorithms and trained on large amounts of data [1, 2], and they are permeating every area of science and people's everyday lives [3]. However, many reports reveal that their use – or misuse – can cause significant harm, directly or indirectly [2]. For example, they can produce factual inaccuracies, biased information, hallucinations, racism, and misogyny [1, 4]. This is largely due to the nature of LLMs, which reproduce patterns found in the data on which they have been trained [2]. Furthermore, the algorithms that generate each word are probabilistic: each generated word depends on its probability of occurrence given the preceding words [5]. As a result, LLMs are untrustworthy by nature; despite generating coherent text, they operate without genuine understanding, leading to outputs that may be irrelevant or misleading [5]. These discussions are crucial as our reliance on LLMs for tasks and decision-making grows, especially in software engineering [3]. In the software engineering field, the capabilities of LLMs are being explored for software development, maintenance, and evolution [6, 7, 8].
Accordingly, they find applications across various stages of the software development process, including requirement analysis, software design, code implementation, testing, refactoring, defect detection, and repair [7]. Regarding AI more broadly, it has been noted for several years that it faces ethical problems similar to those faced by LLMs [9, 10]. However, researchers and industry have approached AI ethics in a largely theoretical way, providing abstract ethical guidelines and principles [11]. Recent advances in legislation, such as the EU AI Act, propose to regulate the development and use of AI-based systems [12], but there is still no evidence of the extent to which such regulation can assist practitioners in operationalising AI ethics. Therefore, there is a problem in bridging the gap between theory and practice in AI ethics, as well as in addressing the trustworthiness of LLMs.

2. Knowledge Gap
Regarding trustworthiness in LLMs, efforts found in the literature focus on establishing taxonomies of trustworthiness aspects, e.g., truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, accountability, and regulations and law [4]. Moreover, such taxonomies serve as a basis for assessing LLMs in relation to trustworthiness. Specialized roles [13, 14, 15], the use of external tools (e.g., running code, searching the web) [13, 15], human interaction (i.e., human in the loop) [15], structured conversations (i.e., message templates) [13, 15], and different conversational patterns [15] are pointed out as techniques to improve the overall trustworthiness of LLM systems. These techniques can significantly improve the agents' reasoning as they debate and refine their discourse over multiple rounds. However, they introduce new layers of complexity and challenges, such as increased overall cost (multiple instances and rounds are required) [16] and scalability concerns (computational resources must be managed) [17]. Moreover, while this approach is innovative, it is still generative AI: it can produce convincing but wrong results [16], generates different software on each run [14], and is prone to unintentional harmful outcomes and vulnerable to misuse [14]. Similarly to trustworthiness in LLMs, AI ethics also lacks a centralized set of principles, assessments, and practical guidance.
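To make these techniques concrete, the following is a minimal sketch of how specialized roles and a fixed-round debate could be wired together on top of a standard chat-completion API. The role prompts, the two-agent structure, and the number of rounds are illustrative assumptions for this sketch, not the actual implementations of the systems cited above.

```python
# Minimal sketch: two role-conditioned agents (an engineer and an ethicist)
# alternating proposal and critique over a fixed number of rounds, combining
# the "specialized roles" and "structured conversation" techniques.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROLES = {
    "engineer": "You are a software engineer. Propose or revise code for the task.",
    "ethicist": "You are an AI ethicist. Critique the proposal against the EU AI Act "
                "and flag transparency, fairness, and privacy issues.",
}

def ask(role: str, prompt: str) -> str:
    """Send one structured message to a role-conditioned agent."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ROLES[role]},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

def debate(task: str, rounds: int = 2) -> str:
    """Alternate proposal and critique, refining the proposal each round."""
    proposal = ask("engineer", task)
    for _ in range(rounds):
        critique = ask("ethicist", f"Task: {task}\n\nProposal:\n{proposal}")
        proposal = ask("engineer", f"Task: {task}\n\nRevise the proposal to address:\n{critique}")
    return proposal

if __name__ == "__main__":
    print(debate("Implement a CV-screening service that must avoid biased rejections."))
```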
While recent interest from academia and industry highlights AI ethics as a growing field of research, concerns regarding AI's development and deployment have a long history, with several incidents drawing public attention recently [18]. As a response, multiple principles and guidelines have been formulated in recent years by diverse stakeholders, including academia, industry, and civil society, to delineate what constitutes ethical AI [10]. Ryan and Stahl [19] identified 11 foundational ethical principles relevant to AI ethics: 1) Transparency, 2) Justice and Fairness, 3) Non-maleficence, 4) Responsibility, 5) Privacy, 6) Beneficence, 7) Freedom and Autonomy, 8) Trust, 9) Sustainability, 10) Dignity, and 11) Solidarity. Nonetheless, AI ethics is still an open debate, where practitioners are often disoriented by abstract principles and lack clear guidance on how to operationalise the many ethical principles available [18]. The European Parliament is progressing with the world's first AI regulation [20], underlining the current status of most guidelines as "soft law", without mandatory enforcement or significant legal repercussions [21]. This regulation echoes the abstract nature with which AI ethics is typically approached [18, 10, 22]. Challenges in translating these broad ethical principles into actionable practices stem from the subjective interpretation required by practitioners to apply them in real-world scenarios [21]. Despite their critical role, ethical considerations in AI design and implementation are often addressed only late in the development process [23].

In the literature, several studies emphasize that examining trustworthiness in LLMs requires situational applications, where models are tested within specific contexts to effectively assess how trustworthiness issues unfold and to address unique challenges [4, 24]. We argue that an appealing emergent application of LLM agents is the development of ethically aligned AI-based systems. Concerning the use of LLMs in Software Engineering (LLM4SE), practitioners must be able to trust the solutions, and the solutions must be seamlessly adopted; otherwise, they can become barriers [25]. To the best of our knowledge, there are no studies that directly address the development of ethically aligned AI-based systems through the use of LLMs. Unlike prior approaches in the literature, this work explores the application of LLM-based multi-agent systems in AI development, emphasizing the incorporation of ethical principles from the earliest stages of the development lifecycle.

3. Research Method
To address the gaps identified in the literature, we aim to create a Visual Studio Code (VSCode) Generative AI (GenAI) Extension tool. VSCode is widely used in the industry, with approximately 75% of developers reporting it as their preferred code editor in the 2023 Stack Overflow Developer Survey [26]. This GenAI Extension will assist developers in building ethically aligned AI-based systems by assessing their code and suggesting possible improvements. We will create this tool in several steps, following the Design Science Research method to build and evaluate IS artefacts [27]. Firstly, we will identify techniques to improve trustworthiness in LLMs and develop a prototype: an LLM-based multi-agent system with Retrieval-Augmented Generation (RAG). Next, we will benchmark our prototype against the SWE-bench benchmark to test the accuracy and trustworthiness of our system.
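As an illustration of the RAG component just mentioned, the sketch below retrieves the regulation passages most relevant to a question before generating an answer, so that agent outputs can be grounded in concrete provisions. The embedding model, the chunk granularity, and the prompt format are assumptions for this sketch rather than the final design of the prototype.

```python
# Minimal RAG sketch: embed regulation passages once, retrieve the top-k
# passages for a question, and answer using only that retrieved context.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of text chunks."""
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in result.data])

def build_index(regulation_chunks: list[str]) -> tuple[list[str], np.ndarray]:
    """Index regulation passages (e.g., paragraphs of the EU AI Act)."""
    return regulation_chunks, embed(regulation_chunks)

def retrieve(question: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    """Return the k passages most similar to the question (cosine similarity)."""
    q = embed([question])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer_with_context(question: str, chunks: list[str], vectors: np.ndarray) -> str:
    """Generate an answer grounded in the retrieved regulatory passages."""
    context = "\n\n".join(retrieve(question, chunks, vectors))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided regulatory context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```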
After that, we will use the system to generate a dataset of more than 2,000 ethically aligned AI-based systems that comply with new legislation addressing AI ethics. This will be done by using the AI Incident Database and by feeding (1) the EU AI Act, (2) the AI HLEG guidelines, (3) ISO/IEC 42001:2023, and (4) California's GenAI bills into our system (a minimal sketch of this generation loop is given after Table 1). Then, the dataset will be used to create a novel benchmark to assess other LLMs regarding their capability to generate ethically aligned AI-based systems. Finally, we will create our VSCode GenAI Extension tool and test it with practitioners in terms of synergy and trust [25].

3.1. Research Questions
The following research questions guide this study:
• RQ1: What techniques can be identified and applied to enhance the trustworthiness of LLM-based systems in software engineering (LLM4SE)?
• RQ2: How can LLMs be utilized to evaluate and develop AI systems that are ethically compliant with the EU AI Act?
• RQ3: How does the VSCode GenAI Extension influence synergy, trust, and ethical AI development outcomes in startups and SMEs?

4. Timeline

Sep 2023 – Feb 2024: Foundational Research and Exploration
- Conduct an in-depth review of the EU AI Act to identify regulatory standards for ethically aligned AI systems.
- Explore existing techniques to enhance trustworthiness in LLMs, focusing on ethical guidelines and principles.
- Collect resources such as the AI Incident Database (https://incidentdatabase.ai/) and international ethics standards.

Mar 2024 – Nov 2024: Prototype Design and Early Development
- Define key trustworthiness and ethical alignment criteria specific to LLMs.
- Design and prototype an LLM-based multi-agent system leveraging Retrieval-Augmented Generation (RAG) capabilities.

Dec 2024 – Mar 2025: Prototype Refinement and Benchmark Setup for LLM4SE
- Perform necessary refinements to the prototype and conduct benchmarks in LLM4SE using the SWE-bench benchmark.

Apr 2025 – Feb 2026: Dataset Generation and Novel Benchmark Creation
- Generate a dataset of over 2,000 ethically aligned AI-based systems using the prototype.
- Develop a novel benchmark to assess other LLMs' abilities to generate ethically aligned systems, based on insights from the AI Incident Database and legislative sources.

Mar 2026 – Aug 2026: VSCode GenAI Extension Development
- Create the GenAI Extension tool for Visual Studio Code (VSCode), incorporating trustworthiness and ethical compliance assessment functionalities.
- Conduct initial usability testing with developers to refine features based on synergy and trust insights.

Sep 2026 – Feb 2027: Extension Refinement and Practitioner Testing
- Refine the extension based on developer feedback, focusing on enhanced trustworthiness and ethical adherence.
- Conduct extensive testing with practitioners to validate usability and ethical compliance in real-world applications.

Mar 2027 – Aug 2027: Final Adjustments and Knowledge Dissemination
- Implement final adjustments for industry readiness of the VSCode extension.
- Publish and present findings on trustworthiness in LLMs, ethical compliance, and practical implications in software engineering.

Table 1: Project Timeline
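As referenced in Section 3, the sketch below illustrates the planned dataset-generation loop: each AI Incident Database record is paired with the regulatory sources and passed to a generation step, and the resulting artefacts are collected into a JSON dataset. The file names, the record fields, and the single-call generation step are hypothetical placeholders; in the actual prototype this call would go through the multi-agent RAG pipeline rather than inlining the full regulation texts.

```python
# Minimal sketch of the planned dataset-generation loop. File names and
# incident record fields are hypothetical placeholders, not the real schema.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

REGULATORY_SOURCES = [
    "eu_ai_act.txt",               # (1) EU AI Act
    "ai_hleg_guidelines.txt",      # (2) AI HLEG Ethics Guidelines for Trustworthy AI
    "iso_iec_42001.txt",           # (3) ISO/IEC 42001
    "california_genai_bills.txt",  # (4) California GenAI bills
]

def generate_aligned_system(incident: dict, regulations: str) -> str:
    """Ask for an ethically aligned redesign of the system behind one incident.
    In practice the regulations would be retrieved via RAG, not passed verbatim."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Design an ethically aligned AI system that "
                                          "complies with the provided regulations."},
            {"role": "user", "content": f"Regulations:\n{regulations}\n\n"
                                        f"Incident description:\n{incident['description']}"},
        ],
    )
    return response.choices[0].message.content

def build_dataset(incidents_path: str, out_path: str) -> None:
    """Turn exported incident records into generated ethically aligned artefacts."""
    incidents = json.loads(Path(incidents_path).read_text())
    regulations = "\n\n".join(Path(p).read_text() for p in REGULATORY_SOURCES)
    samples = [
        {"incident_id": inc.get("incident_id"),
         "artifact": generate_aligned_system(inc, regulations)}
        for inc in incidents
    ]
    Path(out_path).write_text(json.dumps(samples, indent=2))
```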
5. Preliminary Results
Here we present some of our preliminary results, based on an initial prototype using the OpenAI API with gpt-4o that relies only on the model's internal knowledge, that is, without RAG. This prototype, called the LLM-based multi-agent system (LLM-BMAS), was developed by implementing different techniques to improve trustworthiness in AI for software engineering (AI4SE) and was evaluated against three real AI incidents found in the AI Incident Database [28]. The evaluation was done using thematic analysis, hierarchical clustering, an ablation study, and source code execution. Our initial results show that LLM-BMAS is able to produce extensive and detailed source code and documentation, around 2,000 lines, whereas the ablation baseline, using only the ChatGPT user interface, produces around 80 lines without source code. Moreover, the thematic analysis and hierarchical clustering show that the prototype can address various ethical issues in AI that are often overlooked, e.g., bias, transparency, and fairness. However, several factors currently impede seamless integration for practitioners [25]. Notably, these challenges include the limited practicality of extracting source code from the generated text, especially when handling complex modules, as well as difficulties with installing packages and managing outdated dependencies tied to the model's training cut-off. Although these advancements can enhance the trustworthiness and quality of LLM4SE applications, further improvements are essential to make the approach practically usable for developers.

6. Expected Contributions
This research aims to contribute to the field of software engineering by developing a novel Visual Studio Code (VSCode) GenAI Extension that integrates LLM-based multi-agent systems to support the creation of ethically aligned AI systems. The extension will incorporate trustworthiness assessments based on a unique dataset of over 2,000 AI-based systems that align with key regulatory frameworks such as the EU AI Act, the AI HLEG guidelines, and ISO/IEC 42001:2023. By establishing new benchmarks specific to ethical AI development, this tool will enable developers to assess and enhance code compliance with ethical standards early in the development process. The result is expected to bridge the gap between theoretical ethical principles and practical application in software engineering. In addition, this project aims to advance practical trustworthiness techniques for LLMs in Software Engineering (LLM4SE). Rigorous testing with software practitioners will evaluate the effectiveness of the extension in providing ethically guided code recommendations, focusing on usability, trust, and real-world synergy. By providing a structured and accessible approach to embedding ethical principles into standard development practices, this work can particularly support practitioners in start-ups and small to medium-sized enterprises, where resources and regulatory expertise may be limited. This contribution is expected to make ethical AI development more feasible for smaller teams, helping them to align their AI systems with evolving regulatory and ethical standards from the earliest stages of development.

Acknowledgments
This research was supported by the Jane and Aatos Erkko Foundation through the CONVERGENCE of Humans and Machines project under grant No. 220025.

Declaration on Generative AI
During the preparation of this work, the authors utilized ChatGPT to assist in identifying and correcting writing errors and enhancing clarity and conciseness. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

References
[1] Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. Guo, H. Cheng, Y. Klochkov, M. F. Taufiq, H. Li, Trustworthy LLMs: A survey and guideline for evaluating large language models' alignment, arXiv preprint arXiv:2308.05374 (2023).
[2] P. P. Ray, ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope, Internet of Things and Cyber-Physical Systems 3 (2023) 121–154. doi:10.1016/j.iotcps.2023.04.003.
[3] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen, L. Yang, X. Yi, C. Wang, Y. Wang, et al., A survey on evaluation of large language models, arXiv preprint arXiv:2307.03109 (2023).
[4] L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao, Y. Huang, W. Lyu, Y. Zhang, X. Li, et al., TrustLLM: Trustworthiness in large language models, arXiv preprint arXiv:2401.05561 (2024).
[5] L. Floridi, M. Chiriatti, GPT-3: Its nature, scope, limits, and consequences, Minds Mach. 30 (2020) 681–694. doi:10.1007/S11023-020-09548-1.
[6] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (2023) 4–8. doi:10.1109/MS.2023.3248401.
[7] X. Peng, Software development in the age of intelligence: Embracing large language models with the right approach, Frontiers of Information Technology & Electronic Engineering 24 (2023) 1513–1519. doi:10.1631/FITEE.2300537.
[8] B. Ni, M. J. Buehler, MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge, Extreme Mechanics Letters 67 (2024) 102131. doi:10.1016/j.eml.2024.102131.
[9] T. Hagendorff, The ethics of AI ethics: An evaluation of guidelines, Minds and Machines 30 (2020) 99–120. doi:10.1007/s11023-020-09517-8.
[10] A. Jobin, M. Ienca, E. Vayena, The global landscape of AI ethics guidelines, Nature Machine Intelligence 1 (2019) 389–399. doi:10.1038/S42256-019-0088-2.
[11] V. Vakkuri, K.-K. Kemell, P. Abrahamsson, ECCOLA – a method for implementing ethically aligned AI systems, arXiv preprint arXiv:2004.08377 (2020).
[12] A. R. Marinković, The new EU AI Act: A comprehensive legislation on AI or just a beginning?, Global Journal of Business and Integral Security (2023).
[13] S. Hong, X. Zheng, J. P. Chen, Y. Cheng, C. Zhang, Z. Wang, S. K. S. Yau, Z. H. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, MetaGPT: Meta programming for multi-agent collaborative framework, arXiv preprint arXiv:2308.00352 (2023).
[14] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, M. Sun, Communicative agents for software development, arXiv preprint arXiv:2307.07924 (2023).
[15] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, C. Wang, AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework, arXiv preprint arXiv:2308.08155 (2023).
[16] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, I. Mordatch, Improving factuality and reasoning in language models through multiagent debate, arXiv preprint arXiv:2305.14325 (2023).
[17] Y. Talebirad, A. Nadiri, Multi-agent collaboration: Harnessing the power of intelligent LLM agents, arXiv preprint arXiv:2306.03314 (2023).
[18] E. Halme, M. Jantunen, V. Vakkuri, K. Kemell, P. Abrahamsson, Making ethics practical: User stories as a way of implementing ethical consideration in software engineering, Inf. Softw. Technol. 167 (2024) 107379. doi:10.1016/J.INFSOF.2023.107379.
[19] M. Ryan, B. C. Stahl, Artificial intelligence ethics guidelines for developers and users: Clarifying their content and normative implications, Journal of Information, Communication and Ethics in Society 19 (2021) 61–86. doi:10.1108/JICES-12-2019-0138.
[20] European Commission, EU AI Act: First regulation on artificial intelligence, https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence, 2023. Accessed 01 Apr 2024.
[21] J. A. S. de Cerqueira, A. P. D. Azevedo, H. A. T. Leão, E. D. Canedo, Guide for artificial intelligence ethical requirements elicitation – RE4AI ethical guide, in: 55th Hawaii International Conference on System Sciences, HICSS 2022, Virtual Event / Maui, Hawaii, USA, January 4–7, 2022, ScholarSpace, 2022, pp. 1–10. URL: http://hdl.handle.net/10125/80015.
[22] N. K. Corrêa, C. Galvão, J. W. Santos, C. D. Pino, E. P. Pinto, C. Barbosa, D. Massmann, R. Mambrini, L. Galvão, E. Terem, N. de Oliveira, Worldwide AI ethics: A review of 200 guidelines and recommendations for AI governance, Patterns 4 (2023) 100857. doi:10.1016/J.PATTER.2023.100857.
[23] V. Vakkuri, K. Kemell, M. Jantunen, E. Halme, P. Abrahamsson, ECCOLA – A method for implementing ethically aligned AI systems, J. Syst. Softw. 182 (2021) 111067. doi:10.1016/J.JSS.2021.111067.
[24] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song, B. Li, DecodingTrust: A comprehensive assessment of trustworthiness in GPT models, in: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023. doi:10.48550/arXiv.2306.11698.
[25] D. Lo, Trustworthy and synergistic artificial intelligence for software engineering: Vision and roadmaps, in: IEEE/ACM International Conference on Software Engineering: Future of Software Engineering, ICSE-FoSE 2023, Melbourne, Australia, May 14–20, 2023, IEEE, 2023, pp. 69–85. doi:10.1109/ICSE-FOSE59343.2023.00010.
[26] Stack Overflow Developer Survey 2023, https://survey.stackoverflow.co/2023/, 2023. Accessed 25 Oct 2024.
[27] A. R. Hevner, S. T. March, J. Park, S. Ram, Design science in information systems research, MIS Q. 28 (2004) 75–105.
[28] J. A. S. de Cerqueira, M. Agbese, R. Rousi, N. Xi, J. Hamari, P. Abrahamsson, Can we trust AI agents? An experimental study towards trustworthy LLM-based multi-agent systems for AI ethics, arXiv preprint arXiv:2411.08881 (2024).