CAAI Transactions on Intelligence Technology | ORIGINAL RESEARCH | OPEN ACCESS

Large Language Models With Contrastive Decoding Algorithm for Hallucination Mitigation in Low‐Resource Languages

Zan Hongying1 | Arifa Javed1 | Muhammad Abdullah1 | Javed Rashid2 | Muhammad Faheem3,4

1School of Computing and Artificial Intelligence, Zhengzhou University, Zhengzhou, China | 2Information Technology Services, University of Okara, Okara, Pakistan | 3School of Technology and Innovations, University of Vaasa, Vaasa, Finland | 4VTT Technical Research Center of Finland, Espoo, Finland

Correspondence: Muhammad Faheem (muhammad.faheem@uwasa.fi)

Received: 11 June 2024 | Revised: 11 October 2024 | Accepted: 24 October 2024

Funding: The authors are highly grateful to their affiliated universities and institutes for providing research facilities. The research work of M. Faheem is supported by VTT Technical Research Center of Finland.

Keywords: artificial intelligence | artificial neural network | computer vision | deep learning | deep neural networks | large language model

ABSTRACT
Neural machine translation (NMT) has advanced with deep learning and large‐scale multilingual models, yet translation for low‐resource languages often lacks sufficient training data, which leads to hallucinations: translated content that diverges significantly from the source text. This research proposes a refined Contrastive Decoding (CD) algorithm that dynamically adjusts the weights of log probabilities from a strong expert model and a weak amateur model to mitigate hallucinations in low‐resource NMT and improve translation quality. Advanced large language NMT models, including ChatGLM and LLaMA, are fine‐tuned and deployed for their superior contextual understanding and cross‐lingual capabilities. The refined CD algorithm evaluates multiple candidate translations using BLEU score, semantic similarity, and Named Entity Recognition (NER) accuracy.
Extensive experimental results show substantial improvements in translation quality and a significant reduction in hallucination rates. Fine‐tuned models achieve higher evaluation metrics than baseline and state‐of‐the‐art models. An ablation study confirms the contribution of each methodological component and highlights the effectiveness of the refined CD algorithm and advanced models in mitigating hallucinations. Notably, the refined methodology increased the BLEU score by approximately 30% compared to baseline models.

1 | Introduction

Neural machine translation has advanced significantly owing to deep learning and the development of pre‐trained models [1]. Recent advances in large‐scale multilingual machine translation have brought the goal of a universal translation system much closer. These sophisticated pre‐trained models can manage a vast array of languages and translation directions, as highlighted by Fan et al. [2]. At the same time, general‐purpose large language models (LLMs) have demonstrated exceptional versatility, excelling in new tasks such as translation, where they are continually improving, as noted by Chowdhery et al. [3]. Unlike traditional bilingual models, these advanced systems offer substantial performance enhancements and streamline engineering processes by enabling a single model to handle all language pairs [4]. However, translating between low‐resource languages remains difficult. These languages often lack sufficient training data and resources such as text corpora, annotated datasets, and linguistic tools [5]. The scarcity of training data results in poor

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2025 The Author(s).
CAAI Transactions on Intelligence Technology published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology and Chongqing University of Technology.
CAAI Transactions on Intelligence Technology, 2025; 00:1–14 | https://doi.org/10.1049/cit2.70004

translation quality and makes models prone to generating content that appears plausible but is factually incorrect. Inconsistent or domain‐specific data scarcity exacerbates these issues, causing models to produce inaccurate or misleading outputs. Hallucinations in NMT occur when the translated sentence contains content not present in the original sentence, that is, when the source sentence contributes little to the generation of the target sentence; this leads to misleading or incorrect translations [6] and significantly undermines trust in NMT systems [7]. Hallucinations in translation primarily stem from three issues: insufficient context understanding, training data limitations, and overfitting. When models lack the necessary contextual cues, they often produce translations that misrepresent the original source material. When models are trained on limited or noisy datasets, they are more prone to generating outputs that contain inaccuracies. Models that are overfitted to their training data may struggle to generalise effectively to new or varied inputs, leading to disconnected translations. Hallucination by LLMs [8] represents an alarming barrier to the effective deployment and equitable impact of artificial intelligence technologies across globally diverse linguistic and cultural landscapes [9]. Recent studies have improved the understanding, detection, and mitigation of these pathological translations [10].
However, these studies have typically focused on high‐resource languages or small bilingual models with fewer than 100 million parameters [11]. These models were often trained on a single English‐centric, high‐resource language pair [12, 13]. The problem is more common in low‐resource language pairs such as Chinese and Urdu because they face data scarcity. Previous work on mitigating hallucinations in low‐resource NMT has largely focused on sampling translations and reranking them based on quality metrics [14]. Contrastive decoding was introduced to address issues such as excessive repetition and low diversity in unconditional language models [15]. Other techniques, such as data augmentation and noise‐robust training, have been introduced, but they bring challenges such as overfitting or increased computational complexity.

This research aims to develop an NMT model for the low‐resource language pair Chinese‐Urdu. A refined Contrastive Decoding algorithm is introduced alongside LLMs. ChatGLM [16], a conversational language model, is adapted for translation tasks to enhance contextual understanding. LLaMA 2 [17], a large multilingual model, is optimised for diverse language pairs to improve cross‐lingual transfer and robustness. The CD algorithm [15] is refined by dynamically adjusting the weight given to the amateur model based on the expert model's confidence. It generates multiple candidate translations and selects the best one using evaluation metrics. This ensures that the final translation is accurate and relevant and reduces the hallucination ratio. The major contributions are as follows:

• Utilisation of a dynamic Contrastive Decoding algorithm that adjusts the weight of each translation segment generated by an amateur model based on the confidence scores of the expert model, reducing hallucinations and improving translation quality by controlling overfitting.
• Utilisation of large language models such as ChatGLM 2‐6B [16], LLaMA 65B, and LLaMA 2 7B [17] as pre‐trained and fine‐tuned models to improve contextual understanding and translation quality for low‐resource language pairs such as Chinese‐Urdu.

• Evaluation of multiple candidate translations using BLEU score, semantic similarity, and NER accuracy to select the best translation. This ensures the translations reduce hallucinations while maintaining high contextual relevance and accuracy for named entities.

The proposed solution aims to maximise the difference between the log probabilities of the expert model and the amateur model to enhance the overall quality of the translation and reduce the likelihood of producing hallucinations. The remainder of the article is organised as follows: Section 2 reviews existing research on machine translation and hallucination for low‐resource languages. Section 3 describes the materials and methods. Section 4 presents the experimental setup. Section 5 presents the results and discussion. Finally, concluding remarks and potential future research directions are given.

2 | Literature Review

Multilingual NMT has emerged as a vital paradigm for building translation systems capable of handling numerous languages [6]. These approaches aim to translate directly with a single model for multiple language pairs without relying on any intermediate language. The dominant strategy for building such systems involves training large multilingual models on vast amounts of parallel data [18]. Data mining and data augmentation techniques, together with back‐translation, are used for data acquisition [19]. The multilingual capabilities of these systems yield significant improvements over traditional bilingual models, particularly for low‐resource and non‐English‐centric language pairs, which benefit the most from multilingual transfer [2].
An alternative and promising strategy exploits the emergent capabilities of large language models (LLMs). These models are pre‐trained on massive corpora and can then be deployed to perform a variety of tasks [8, 20]. This approach has yielded impressive results across a wide range of NLP tasks [3, 21]. LLMs can produce fluent and adequate translations, especially for high‐resource English‐centric language pairs, and these translations are competitive with those generated by dedicated supervised translation models [22, 23].

Hallucinations in machine translation represent a significant challenge, as such translation errors pose a critical threat to the safety and reliability of real‐world applications. Notably, these hallucinations differ from those in natural language generation tasks such as abstractive summarisation and generative question answering [7]. Hallucinations are substantially rarer and harder to observe in clean, unperturbed data; this rarity is possibly attributable to the more closed‐ended nature of the task. Several previous studies have examined the properties of hallucinations by creating artificial scenarios in which they are more likely to occur; for example, perturbations of the source text [18] or noise in the training data [24] have been introduced to study hallucinations. Hallucinations in machine translation are categorised into hallucinations under perturbation and natural hallucinations. According to Raunak et al. [24] and later extended by Guerreiro et al.
[13], hallucinations in this taxonomy are translations that contain content detached from the source text. Detecting hallucinations is necessary for improving machine translation reliability. Dale et al. [14], Guerreiro et al. [25], and Bawden and Yvon [23] evaluated various methodologies for identifying hallucinations and demonstrated the effectiveness of ALTI+ as a detection method. For hallucination mitigation, Guerreiro et al. [25] proposed a fallback model for use when a hallucination is detected, whereas other methods rely on sampling and re‐ranking translations; for example, ranking sentences with COMET [26] helped to mitigate hallucinations. These approaches often depend on an additional external model and have been evaluated primarily on small de→en models [14, 25]. Sennrich et al. [27] used a CD approach to mitigate hallucinations, evaluating them by counting segments with chrF2 < 10; they used the same model as the amateur but supplied it with randomly selected inputs. We use different models as amateurs to provide more stable results. Additionally, our work compares different amateurs and techniques for combining expert and amateur distributions. This is complementary to the focus of Sennrich et al. [27] on off‐target translations.

The existing literature on NMT and on hallucination detection and mitigation is summarised in Table 1. Despite significant advancements in NMT driven by deep learning and large‐scale multilingual models, translating low‐resource languages such as Chinese and Urdu remains a substantial challenge. The primary issue is the scarcity of training data, which leads to hallucinations, instances where the translated content diverges significantly from the source text. Existing methods to mitigate hallucinations are often plagued by overfitting and increased computational complexity.
Moreover, no previous work specifically addresses hallucinations in the Chinese‐Urdu language pair, which highlights a significant gap in the current research. Therefore, there is an urgent need for a robust and scalable solution that can effectively reduce hallucinations and improve the overall quality of translations for low‐resource languages.

3 | Materials and Methods

In this section, we describe the datasets, data preprocessing techniques, tokenisation, normalisation, and the proposed model. The detailed steps and algorithms used in our approach are outlined below and shown in Figure 1.

3.1 | Datasets

Because of data scarcity in low‐resource languages, a diverse Chinese‐Urdu parallel corpus was developed through human effort and web crawlers (ParaCrawl, Bitextor, Common Crawl, and OpenNMT). ParaCrawl is a project that aims to build large‐scale parallel corpora for machine translation by crawling multilingual websites. It uses sophisticated techniques to identify parallel texts on multilingual websites and provides open datasets for research and development in machine translation; Chinese‐Urdu pairs, however, are not available in open access. Bitextor is a popular tool designed specifically for crawling and collecting parallel corpora from the web. It automatically identifies and downloads bilingual websites and extracts and aligns parallel text segments, and it can handle various text formats and encodings.

TABLE 1 | Summary of key literature on hallucination detection and mitigation in NMT.

Ref. | Method | Datasets | Contribution | Limitation
Costa‐jussà et al. [6] | LLM with BT | Web‐crawled corpora | Highlighted the potential of multilingual NMT | Limited focus on low‐resource languages
Bapna et al. [18] | Pre‐trained GPT LLM | Flores‐200, NLLB | Discussed data mining and augmentation strategies | Lack of robustness analysis
Fan et al. [2] | M2M100 | WMT, FLORES | Showed improvements in low‐resource language pairs | Primarily English‐centric
Brown et al.
[20] | GPT‐3 | WebText2, Books1 | Demonstrated capabilities of LLMs in NLP tasks | High computational requirements
Chowdhery et al. [3] | Pre‐trained LLM | Multilingual corpora | LLMs' fluency in high‐resource language pairs | Limited application to low‐resource pairs
Ji et al. [7] | NLG with LLM | Multilingual corpora | Reviewed hallucination issues in NLG tasks | General NLG focus, less on MT
Bapna et al. [18] | MVE algorithm | IWSLT | Studied hallucinations under perturbation | Focus on artificial scenarios
Guerreiro, Voita, and Martins [13] | NMT models | Manual datasets | Extended taxonomy of hallucinations | Challenges in natural hallucination detection
Guerreiro, Voita, and Martins [25] | Transformer | WMT'18 | Evaluated hallucination detection methods | Primarily evaluated on small models
Sennrich et al. [27] | CD and M2M‐100‐418 | FLORES‐101 | Mitigated hallucinations using CD approach | Focus on off‐target translations

Common Crawl provides a large, open repository of web crawl data. OpenNMT provides a suite of tools for machine translation, including utilities for data collection and preprocessing. The corpus was validated by native‐language and NLP experts and verified using multiple translators, including Google Translate and Baidu Translate, ensuring accuracy and reliability. Consistency checks covered completeness, accuracy, uniform formats, and naming conventions. Redundancy checks removed duplicates, and label validation ensured correct and balanced datasets. Text consistency checks addressed spelling, grammar, and terminology.
Sampling validation ensured accurate population representation for better model generalisation and bias detection. Text alignment and statistical analysis were performed to evaluate linguistic diversity and coverage across domains. In addition, we incorporated publicly available datasets:

• The OPUS dataset is a collection of parallel corpora for a wide range of languages, including Chinese and Urdu, and is widely used for machine translation and linguistic research [28]. The dataset is available at http://opus.nlpl.eu/.

• The Workshop on Machine Translation (WMT) provides a benchmark in the field of machine translation. It includes a variety of parallel corpora for different language pairs and is used in annual machine translation competitions [29].

• WiLi‐2018 is a benchmark dataset of short text extracts from Wikipedia. It contains 1000 paragraphs in each of 235 languages, totalling 235,000 paragraphs. After the same data selection and preprocessing, we selected 45 matching Chinese‐Urdu paragraphs with the help of English as a pivot language [30].

These datasets are compiled and formatted in CSV format. Table 2 shows the details of the datasets. We consolidated all relevant datasets into a single comprehensive dataset in a unified format to facilitate experiments on large corpora. Additionally, we conducted dataset‐wise experiments to ensure thorough analysis and validation. The datasets are divided into training, testing, and validation sets with a ratio of 70%, 15%, and 15%, respectively.

3.2 | Data Preprocessing

Data cleaning removes noise such as special characters, HTML tags, punctuation, missing values, and unnecessary white space, standardises text cases, and corrects common misspellings. Consistency checks are conducted to ensure alignment between Chinese and Urdu sentences, and the validity of the alignment is verified using FastAlign.
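The 70%/15%/15% split described above can be reproduced with a deterministic shuffle over the aligned sentence pairs. A minimal sketch (the toy corpus, function name, and fixed seed are illustrative, not from the paper):

```python
import random

def split_corpus(pairs, train=0.70, test=0.15, seed=42):
    """Shuffle aligned (zh, ur) sentence pairs and split them into
    training, testing, and validation subsets (70/15/15 by default)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # seeded for reproducibility
    n = len(pairs)
    n_train = int(n * train)
    n_test = int(n * test)
    return (pairs[:n_train],                   # training
            pairs[n_train:n_train + n_test],   # testing
            pairs[n_train + n_test:])          # validation

# Stand-in for the real aligned corpus.
corpus = [(f"zh_{i}", f"ur_{i}") for i in range(100)]
train_set, test_set, val_set = split_corpus(corpus)
print(len(train_set), len(test_set), len(val_set))  # 70 15 15
```

A fixed seed keeps the split identical across the dataset-wise experiments, so per-dataset and consolidated runs are comparable.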
Statistical analysis evaluates the linguistic diversity and coverage across various domains. The correlation between the lengths of sentences in the source and target texts is analysed using the Pearson correlation coefficient [32]; a high correlation indicates good alignment and coherence in the dataset. The SentencePiece tokeniser [33] is used to implement Byte‐Pair Encoding (BPE) [34]. It breaks down rare words into subword units, which helps manage vocabulary size and improves translation accuracy for low‐resource languages.

The process can be expressed as follows. Let S be a source sentence, represented as a sequence of characters S = {c1, c2, …, cn}. BPE [34] iteratively merges the most frequent pair of character sequences to form subword units. The tokenisation function T can be defined as T(S) = {t1, t2, …, tm}, where the ti are the subword units derived from S. The frequency of character pairs governs the merging process: at each iteration, the most frequent pair (a, b) in the current vocabulary is merged into a new token ab. The update rule for the vocabulary V at iteration k is given in Equation (1), where V(k) is the vocabulary at iteration k.

V(k+1) = (V(k) \ {a, b}) ∪ {ab} (1)

To further refine normalisation, we incorporate an entropy‐based normalisation function. The entropy H of the token distribution is calculated as in Equation (2), where p(ti) is the probability of token ti in the dataset.

H(T) = −Σi=1..n p(ti) log p(ti) (2)

FIGURE 1 | The proposed model architecture.

We adjust the token frequencies to ensure a more balanced token distribution. The entropy‐normalised token frequency
f′(ti) can be expressed as in Equation (3), where f(ti) is the original frequency of token ti.

f′(ti) = f(ti) / H(T) (3)

3.3 | Proposed Model

This research proposes a refined CD algorithm [15] with large language models to enhance machine translation accuracy between Chinese and Urdu while minimising hallucinations. Back‐translation is employed to generate training data: sentences from the target language are translated back into the source language to create additional parallel corpora. The amateur models are the pre‐trained LLaMA 65B, LLaMA 2 7B, and ChatGLM 2 6B, whereas the expert models are their fine‐tuned versions with enhancements.

The process begins with the amateur models generating initial translations, which serve as a baseline. These translations are evaluated using metrics such as BLEU score and lexical matching to identify areas for improvement. The CD algorithm plays a pivotal role in assessing the confidence levels of these initial translations. It dynamically adjusts the weight of each translation segment based on the confidence scores, giving higher weight to segments with higher confidence; this ensures that more reliable translations are prioritised. Subsequently, the expert models refine these weighted translations, correcting errors and enhancing quality through their fine‐tuned capabilities. A beam search with varied parameters and stochastic sampling is then conducted by the expert models to generate multiple candidate translations. Each candidate represents a different possible translation, taking into account various confidence levels and model strengths. The CD algorithm evaluates these candidates using metrics such as BLEU score, semantic similarity, and NER accuracy.
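A weighted combination of the three metrics is one plausible way to realise this candidate selection. The weights, field names, and scores below are illustrative assumptions, not values from the paper:

```python
def select_best(candidates, weights=(0.5, 0.3, 0.2)):
    """Rank candidate translations by a weighted combination of
    BLEU, semantic similarity, and NER accuracy (all in [0, 1]),
    and return the highest-scoring candidate."""
    w_bleu, w_sem, w_ner = weights

    def combined(c):
        return w_bleu * c["bleu"] + w_sem * c["semantic"] + w_ner * c["ner"]

    return max(candidates, key=combined)

# Two hypothetical beam-search candidates with pre-computed metric scores.
candidates = [
    {"text": "...", "bleu": 0.41, "semantic": 0.88, "ner": 0.75},
    {"text": "...", "bleu": 0.39, "semantic": 0.93, "ner": 1.00},
]
best = select_best(candidates)
```

Here the second candidate wins despite a slightly lower BLEU, because its semantic similarity and NER accuracy are higher; tuning the weights trades fluency against entity fidelity.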
The highest‐weighted and most accurate segments from the expert models are combined to form the final translation.

3.3.1 | ChatGLM Model

The pre‐trained ChatGLM 2 6B model [16], with 6 billion parameters, offers enhanced capabilities in understanding and generating human‐like text. We enhanced the attention mechanisms and integrated context‐aware embeddings to boost translation accuracy and contextual understanding. Models normally use static embeddings, where each word has a fixed representation regardless of its context; this can lead to misunderstandings, especially with words that have multiple meanings depending on the context. Unlike static embeddings, context‐aware embeddings are dynamically generated based on the entire sequence, allowing the model to capture the nuanced meanings of words depending on their usage. For an input sequence T = {t1, t2, …, tn}, the context‐aware embedding of each token ti is generated considering the full context of the sequence T. This can be represented as Equation (4):

EChatGLM(T) = [ChatGLM(t1 | CT), ChatGLM(t2 | CT), …, ChatGLM(tn | CT)] (4)

where CT = context(T) indicates the full contextual dependency of each token ti on the entire sequence T. This modification enables the model to capture deep conversational contexts, aggregating all relevant information in the sequence and thereby improving its understanding of the text.

The standard attention mechanism allows a model to focus on different parts of the input sequence when generating each part of the output. It calculates attention weights for each pair of tokens in the sequence. The attention score for a pair of tokens (ti, tj) is computed with a scoring function, and these scores are then normalised to produce the attention weights. The standard scoring function is often a simple dot product between the query and key vectors projected through weight matrices.
We enhanced the attention mechanism to ensure that the model focuses more effectively on the most relevant parts of the input sequence. The enhanced attention mechanism calculates the attention for each token pair (ti, tj) as in Equation (5):

Attention(ti, tj) = exp(score(ti, tj)) / Σk=1..n exp(score(ti, tk)) (5)

The score function used in the attention calculation is defined as Equation (6):

score(ti, tj) = (Wq EChatGLM(ti)) (Wk EChatGLM(tj))^T / √dk (6)

In these equations, Wq and Wk are the weight matrices for the query and key projections, respectively, and dk is the dimensionality of the key vectors. These enhancements collectively improve the model's ability to produce high‐quality translations, especially for complex and low‐resource language pairs such as Chinese and Urdu.

By implementing these custom enhancements, ChatGLM 2 6B becomes highly effective in low‐resource Chinese‐Urdu bidirectional NMT tasks. Its large parameter size and deep contextual embeddings enable better handling of complex language structures and nuances, reducing errors and improving overall translation quality. Figure 2 illustrates the model architecture of the ChatGLM 2 6B model.

TABLE 2 | Chinese‐Urdu corpus data.

Corpus | Sentences | Zh tokens | Ur tokens | Training | Testing | Validation
OPUS, Tiedemann et al. [28] | 493,042 | 1,189,539 | 899,376 | 345,129 | 73,956 | 73,957
WMT, Fonseca et al. [31] | 608,405 | 1,348,494 | 1,106,496 | 425,883 | 91,260 | 91,262
WiLi, Thoma [30] | 7938 | 33,357 | 56,894 | 5556 | 1190 | 1192
Custom | 56,332 | 130,569 | 104,742 | 39,432 | 8449 | 8451
Total | 1,165,717 | 2,701,959 | 2,167,508 | 815,000 | 174,855 | 174,862
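Equations (5) and (6) amount to scaled dot‐product attention over the projected embeddings. A minimal pure‐Python sketch with toy vectors; in the model, the queries and keys would be the projections Wq EChatGLM(ti) and Wk EChatGLM(tj):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, d_k):
    """Row i holds the attention weights of token i over all tokens:
    softmax_j( q_i . k_j / sqrt(d_k) ), per Equations (5)-(6)."""
    rows = []
    for q in queries:
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        rows.append(softmax(scores))
    return rows

# Toy 3-token example with d_k = 2 (vectors are illustrative).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention(Q, K, d_k=2)
```

Each row of A sums to one, and a token places more weight on keys that align with its query, which is the "focus on the most relevant parts" behaviour the enhancement targets.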
3.3.2 | LLaMA

LLaMA 65B [17], a variant with 65 billion parameters, is adopted for experiments. It provides enhanced capabilities for translation and contextually relevant text tasks, and it maintains a balance between model complexity and performance, which makes it suitable for our task. LLaMA's embeddings for an input sequence T are used to handle the variability and scarcity of low‐resource language data. The embeddings are represented as Equation (7):

ELLaMA(T) = [LLaMA(t1, CT), LLaMA(t2, CT), …, LLaMA(tn, CT)] (7)

The context CT includes different linguistic features. The encoder embeddings for an input sequence T are calculated as Equation (8):

H(65B)enc = EncoderLLaMA 65B(T, context(T)) (8)

LLaMA 2 7B, pre‐trained with 7 billion parameters, is adopted for extensive analysis. It offers substantial model capacity and enables a more nuanced understanding and generation of text. It is particularly effective in handling complex language structures and improving translation quality in low‐resource languages. The LLaMA architecture is shown in Figure 3, and the encoder embeddings for an input sequence T are given in Equation (9).

The LLaMA model integrates morphological tags into the input features to enhance the word embeddings. This modification enriches the linguistic data fed into the model by concatenating morphological‐tag embeddings with the standard word embeddings. In addition, character‐level embeddings are used to capture detailed linguistic features, which is especially helpful for processing rare words that are underrepresented in the training dataset. To handle the complexity of these enriched inputs, the encoder layers are adapted by incorporating specialised sub‐networks. These sub‐networks are designed to process the morphological and character‐level enhancements effectively, ensuring they contribute optimally to the model's language understanding and generation capabilities.
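The enriched input representation just described, a word embedding concatenated with a morphological‐tag embedding and a character‐level embedding, can be sketched as follows. The vector sizes, the one‐hot tag, and the mean‐pooling of character vectors are illustrative assumptions, not details from the paper:

```python
def enrich_embedding(word_vec, morph_vec, char_vecs):
    """Concatenate the standard word embedding with a morphological-tag
    embedding and a pooled character-level embedding."""
    # Mean-pool the character embeddings so rare or unseen words
    # still receive a signal from their spelling.
    n = len(char_vecs)
    dim = len(char_vecs[0])
    char_pooled = [sum(v[i] for v in char_vecs) / n for i in range(dim)]
    return word_vec + morph_vec + char_pooled  # list concatenation

word = [0.2, 0.5]            # word embedding (toy, 2-d)
morph = [1.0, 0.0, 0.0]      # e.g. one-hot tag for "noun"
chars = [[0.1, 0.3], [0.5, 0.1]]  # per-character embeddings
enriched = enrich_embedding(word, morph, chars)
print(len(enriched))  # 7
```

The downstream encoder sub‐networks then consume this wider vector, so the morphological and character signals survive even when the word itself is rare in the training data.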
A multi‐head attention mechanism is used to maintain precise alignment between different parts of the source and target texts. These mechanisms monitor the relevance and coherence of the output throughout the translation process and regulate how much the attention‐modified inputs influence subsequent layers, adding a further level of filtering and prioritisation. By strategically implementing this approach, the model not only focuses on the most pertinent parts of the data but also dynamically manages how these parts influence the learning and generation processes, ensuring the accuracy and relevance of the translated content.

H(7B)enc = EncoderLLaMA 2 7B(T, context(T)) (9)

FIGURE 2 | ChatGLM model architecture.

FIGURE 3 | LLaMA 2 model architecture.

3.3.3 | Refined Contrastive Decoding Algorithm

The CD algorithm [15] was developed to improve the quality of generated text by leveraging the outputs of both an expert model and an amateur model: it refines the expert model's predictions by adjusting them using the outputs of the amateur model. The CD algorithm helps in scenarios where enhancing diversity or preventing errors such as hallucinations is essential. We refined CD, together with the LLM models, to address hallucinations in the low‐resource language pair Chinese‐Urdu.
As introduced in Section 3.3, we use the pre‐trained ChatGLM 2 6B, LLaMA 65B, and LLaMA 2 7B as amateur models and their enhanced, fine‐tuned versions as expert models, dynamically adjusting the weight given to the amateur model based on the expert model's confidence. The CD score for a token i is given by Equation (10):

CD(i) = log Pexpert(yi | x) − γ log Pamateur(yi | x) (10)

Here, Pexpert(yi | x) is the expert model's probability for token i after applying softmax, whereas Pamateur(yi | x) is the probability assigned by the amateur model. A hyperparameter α in the range 0 < α < 1 is used to filter the probability distribution of the expert model: tokens are included in the candidate set Vthresh if their probability is at least the maximum probability scaled by α. Formally, this can be expressed as Equation (11):

Vthresh = {i ∈ V : log Pexpert(yi | x) ≥ log(α) + maxj log Pexpert(yj | x)} (11)

The filtering serves two main purposes. First, it prevents tokens with low probability from overwhelming the candidate set Vthresh. Second, it ensures that if the expert model exhibits high confidence, only the most probable token is included, so that the candidate set closely matches the expert's preferred choice. The original CD algorithm used a constant weight γ at each time step, uniformly adjusting all probabilities. Instead of varying the number of candidates considered (as the threshold α does), we vary the degree of CD's influence on token generation by dynamically adjusting the weight γ:

γ = 1 − maxi Pexpert(yi | x)^β

This parameter controls the influence of the amateur model's log probability. The CD scores CD(i) replace the expert's scores during the beam search. Normalisation is applied to stabilise the beam search and ensure a consistent contribution across time steps.
The normalisation constant N_CD is calculated as Equation (12):

N_CD = ( Σ_{i ∈ V_thresh} P_expert(i) ) / ( Σ_{i ∈ V_thresh} P_expert(i) / P_amateur(i)^γ )   (12)

Scaling the exponentiated CD scores by N_CD normalises them to the probability mass covered by V_thresh. This set of normalised CD scores is combined with the expert probabilities outside V_thresh to form a probability distribution; this method is referred to as NORMALISED CD. In our approach to CD, we prioritise preventing hallucinations in neural machine translation (NMT) over promoting diversity. Two main challenges arise: hallucinations occur in only a small fraction of translated sentences, and NMT outputs need to remain closely linked to the source sentence. Consequently, CD should modify outputs only when hallucinations occur and should have minimal effect on non-hallucinated outputs. These challenges are tackled using normalisation and dynamic weighting. With the proposed model, our approach aims to provide robust and accurate translations for low-resource language pairs, such as Chinese-Urdu, while effectively mitigating hallucinations. Algorithm 1 summarises the proposed methodology.

ALGORITHM 1 | Hallucination mitigation algorithm.
1: Input: Source sentence x, expert model P_expert, amateur model P_amateur, hyperparameter β, models {ChatGLM, LLaMA}
2: Output: Best translation y*
3: Preprocess x: clean, tokenise, and normalise
4: Generate initial translation candidates {y_1, y_2, …, y_n} using beam search on P_expert
5: for each candidate translation y_i do
6:   Adjust the weight γ dynamically: γ = 1 − (max_i P_expert(y_i | x))^β
7:   Calculate the CD score: CD(i) = log P_expert(y_i | x) − γ log P_amateur(y_i | x)
8:   Normalise scores using N_CD
9: end for
10: Evaluate candidates using BLEU score, semantic similarity, and NER accuracy
11: Select the best translation y* based on the highest combined score
12: Model selection: select the model (ChatGLM 2 6B, LLaMA 65B, or LLaMA 2 7B) that provides the best translation y* according to the evaluation metrics
13: Return y*

The inclusion of NER [35] within the proposed model is effective in maintaining the integrity and accuracy of named entities in translation. When a token t_k in the input sequence T is recognised as a named entity, its embedding is adjusted before translation as in Equation (13):

E′_ChatGLM 2 6B(t_k) = { E_ChatGLM 2 6B(t_k) if ¬isNE(t_k); E_NER(t_k, source_lang, target_lang) if isNE(t_k) }   (13)

This approach ensures that named entities are accurately represented and translated across different languages.

4 | Experimental Setup

In our experimental setup for evaluating the performance of the proposed model, Python is chosen as the primary programming language due to its robust support for machine learning and natural language processing tasks. All experiments are conducted using the PyTorch framework. The computational tasks are executed on a cloud server featuring 1× A100 PCIe GPU, which provides the essential computational power for training large-scale translation models.
4.1 | Hyper-Parameter Fine-Tuning and Model Training

Each model is fine-tuned with hyperparameters such as the learning rate, batch size, and number of epochs optimised as shown in Table 3. The objective function for fine-tuning is given by Equation (14), where θ represents the model parameters, x is the source sentence, y is the target sentence, and D_train is the training dataset:

L(θ) = − Σ_{(x,y) ∈ D_train} log P(y | x; θ)   (14)

4.2 | Evaluation Metrics

Semantic similarity is assessed using cosine similarity on embeddings generated by the models; it ensures that the translations preserve the intended meaning of the source text. NER accuracy is measured to evaluate how well named entities are maintained across translations. The other metrics are BLEU, METEOR, ROUGE-L, and chrF++. Translations are flagged as hallucinations if both the BLEU and semantic similarity scores fall below specified thresholds (e.g. BLEU < 0.6, similarity < 0.75).

5 | Results and Discussion

Results are reported in Table 4, which shows the performance of baseline and expert models across datasets. First, the experiments are performed on a unified dataset. This comprehensive dataset is used to train and evaluate both the amateur and expert models. The initial translations are generated by the amateur models and treated as the baseline; they are then refined by the expert models. Among these models, LLaMA 2-7B achieved the highest performance, with a BLEU score of 47.0%. The larger parameter size of LLaMA 2-7B enabled it to capture nuanced linguistic patterns more effectively. ChatGLM 2-6B also performed robustly, obtaining a BLEU score of 46.0%, a METEOR score of 38.0%, a chrF++ score of 55.0%, and a ROUGE-L score of 43.0%, demonstrating its capability in handling translation tasks efficiently.
LLaMA 65B showed effectiveness with a BLEU score of 44.0%. It performed slightly below the other models, though its computational resource utilisation is good, indicating some limitations due to its smaller capacity and a need for more training epochs. The improved METEOR scores of the fine-tuned models (68.0% for LLaMA 2-7B) indicate enhanced translation quality and contextual relevance. chrF++ (character n-gram F-score) measures translation quality based on character n-grams, capturing both precision and recall. It is particularly effective for languages with complex morphology, such as Chinese and Urdu. The substantial increase in chrF++ scores for fine-tuned models (from 56.0% to 79.0% for LLaMA 2-7B) highlights the models' improvement in handling morphological variations and fine-grained translation accuracy. The significant gains in ROUGE-L scores after fine-tuning (from 44.0% to 74.0% for LLaMA 2-7B) underscore the enhanced ability to produce coherent and fluent translations. As an expert model, LLaMA 2-7B achieved a BLEU score of 79.1%, a METEOR score of 68.0%, a chrF++ score of 81.0%, and a ROUGE-L score of 74.0%, improving on the baseline model. These results highlight the increased accuracy and reliability of translation. Similarly, ChatGLM 2-6B as an expert model obtained significant performance gains, with a BLEU score of 76.0% among other metrics. LLaMA 65B as an expert model achieved a BLEU score of 74.0%, a METEOR score of 63.0%, a chrF++ score of 77.0%, and a ROUGE-L score of 69.0%. The combination of diverse datasets in a standardised format ensured high-quality training data, and the multi-metric evaluation approach provided a comprehensive assessment of translation quality and reliability. These results verify the potential of advanced models and refined algorithms in addressing the above-mentioned challenges.
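Since chrF++ is described above in terms of character n-gram precision and recall, a stripped-down, single-order character n-gram F-score can be sketched for intuition. Real chrF++ averages over several character and word n-gram orders; this toy version is illustrative only:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis, reference, n=3, beta=2.0):
    """Simplified character n-gram F-score in the spirit of chrF++:
    F_beta over character n-gram precision and recall (beta=2 weights
    recall twice as much as precision, as in chrF)."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

score = chrf_like("the cat sat", "the cat sat")  # identical strings -> 1.0
```

Because matching happens at the character level, near-miss inflections still earn partial credit, which is why the metric suits morphologically rich languages such as Urdu.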
In dataset-wise experiments, LLaMA 2-7B achieves the highest scores on all metrics for each dataset, with BLEU scores of 27.7%, 28.4%, and 19.6% for OPUS, WMT, and Wili + Custom, respectively. ChatGLM 2-6B and LLaMA 65B also perform well but are consistently below LLaMA 2-7B. The lower scores on the Wili + Custom dataset highlight the impact of limited and less diverse data on NMT performance, emphasising the challenge of translating low-resource languages. This variation underscores the need for comprehensive datasets to improve translation quality and reduce hallucinations.

TABLE 3 | Fine-tuning of hyperparameters for the selected models.

Hyperparameter            LLaMA 65B              LLaMA 2 7B             ChatGLM 2 6B
Learning rate             5 × 10^-6              2 × 10^-5              2 × 10^-5
Batch size                64                     32                     24
Number of epochs          10                     10                     15
Warmup steps              1000                   1000                   750
Dropout rate              0.05                   0.1                    0.15
Weight decay              0.015                  0.01                   0.01
Gradient accumulation     2                      2                      2
Maximum sequence length   256                    128                    128
Optimiser                 AdamW                  AdamW                  AdamW
LR scheduler              Linear warmup & decay  Linear warmup & decay  Linear warmup & decay
Back translation          Yes                    Yes                    Yes

Fine-tuning significantly enhances performance across all metrics. For example, LLaMA 2-7B's BLEU scores increase to 36.0%, 39.0%, and 23.6% for OPUS, WMT, and Wili + Custom, respectively. ChatGLM 2-6B and LLaMA 65B show similar improvements. The fine-tuned models exhibit better handling of diverse linguistic patterns and reduced hallucinations, which is particularly evident in the higher scores on the OPUS and WMT datasets.
However, the Wili + Custom dataset still shows relatively low scores, which highlights ongoing challenges with low-resource data. These results demonstrate the effectiveness of fine-tuning in enhancing model performance and the critical role of dataset quality in NMT systems.

5.1 | Performance Analysis Through NER Metrics

The performance of the baseline and expert models is computed on NER and other translation quality metrics. Figure 4 provides valuable insights into how each model handles various aspects of translation quality and errors. Additionally, by including categories such as under-generation, fully detached, strongly detached, oscillatory, and other errors, we analysed where each model excels or struggles. In Figure 4, the first radar chart focuses on binary-score heuristics and evaluates outputs such as Translation Neural Generation (TNG) and Reference Translation (RT). The second radar chart uses continuous scores for finer granularity, comparing methods such as Attn-to-EOS, Attn-ign-SRC, TokHal-Model, Seq-Logprob, and MC-DSim. Attn-to-EOS measures the effectiveness of the attention mechanism towards the end of the sequence. Attn-ign-SRC indicates how often the model generates content while ignoring the source input. TokHal-Model evaluates token-level hallucinations. Seq-Logprob assesses the model's confidence in its translations. MC-DSim measures the dissimilarity between source and target sequences using Monte Carlo methods. It provides a more detailed analysis and is helpful for identifying

TABLE 4 | Model performance over each dataset and the combined dataset.
                                  Baseline model                           Expert model
Corpus       Model          BLEU (%)  METEOR (%)  chrF++ (%)  ROUGE-L (%)  BLEU (%)  METEOR (%)  chrF++ (%)  ROUGE-L (%)
Combined     LLaMA 2-7B     47.0      38.0        56.0        44.0         79.1      68.0        81.0        74.0
datasets     LLaMA 65B      44.0      36.0        53.0        41.0         74.0      63.0        77.0        69.0
             ChatGLM 2-6B   46.0      38.0        55.0        43.0         76.0      67.0        77.0        73.0
OPUS         ChatGLM 2-6B   22.8      24.1        46.7        30.2         32.5      34.0        53.3        39.7
             LLaMA 65B      24.6      26.0        40.1        31.7         31.0      32.0        49.0        37.0
             LLaMA 2-7B     27.7      29.0        50.4        34.0         36.0      37.0        58.0        43.0
WMT          ChatGLM 2-6B   27.1      28.5        52.2        34.8         30.0      32.4        54.6        38.5
             LLaMA 65B      25.6      27.0        40.8        31.5         26.3      27.1        50.4        33.5
             LLaMA 2-7B     28.4      29.3        53.1        35.3         39.0      41.2        62.0        46.3
Wili +       ChatGLM 2-6B   17.8      26.0        43.4        45.0         21.5      23.0        44.6        27.0
custom       LLaMA 65B      11.3      21.0        42.7        42.0         21.7      23.0        44.1        27.3
             LLaMA 2-7B     19.6      30.0        27.3        43.0         23.6      25.1        38.4        31.2

FIGURE 4 | NER calculations for NMT and hallucination mitigation through quality filters.

subtle differences in model performance. The third radar chart evaluates the same categories with quality filters such as COMET-QE, COMET, and CHRF2. It focuses on the quality of translations by indicating how well each model maintains translation quality across different aspects. Figure 5 displays the performance improvements of the proposed translation models across various metrics after fine-tuning. ChatGLM performs well on sequence log probability but needs improvement in attention-to-end-of-sequence and Monte Carlo dissimilarity. LLaMA 2 7B excels in attention-to-end-of-sequence and token hallucination but needs improvement in Monte Carlo dissimilarity.
LLaMA 65B is strong in token hallucination and sequence log probability but weaker in attention-ignoring-source and Monte Carlo dissimilarity. Figure 6 illustrates the reduction in hallucination rates for both the baseline and expert models. In the baseline model, the hallucination rate starts at 0.31 and gradually decreases to 0.15, a steady decline as training progresses, indicating that the baseline model improves with more epochs but at a relatively modest rate. In the expert model, the hallucination rate starts at 0.10 and decreases to 0.04 over 10 epochs. This is a substantial decline compared to the baseline model and indicates that the proposed Algorithm 1 has a significant impact on reducing hallucinations quickly and effectively. The cosine similarity between two vectors A and B is computed as cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖), where A and B are the embedding vectors of the source and target sentences, respectively. The resulting value lies between −1 and 1; for semantic similarity, values closer to 1 indicate higher similarity. The results in Figure 7 indicate that the fine-tuned models significantly improve the semantic similarity scores. Specifically, LLaMA 2 7B showed the most substantial improvement compared with LLaMA 65B and ChatGLM 2 6B.

5.2 | Comparative Analysis

The comparative analysis detailed in Table 5 highlights several studies on Chinese-Urdu neural machine translation (NMT), focusing on the challenges of low-resource language pairs and issues related to hallucinations. Chen et al. developed a Chinese-Urdu NMT model integrating POS sequence prediction with the Transformer architecture, achieving a BLEU score of 0.36 [42]. Zeeshan et al. implemented the OpenNMT framework using LSTM and RNN-based models, attaining a lower BLEU score of 0.18 [41]. A further study by Zeshan et al.
compared LSTM with Transformer models, demonstrating the superior performance of the Transformer, which significantly improved BLEU scores from 0.077 to 0.52, compared to 0.41 for LSTM [40]. A seq2seq NMT system was introduced for Chinese-Urdu bidirectional translation by deploying a hybrid model, an RNN with long short-term memory (LSTM) cells; the model attained a BLEU score of 0.42 [39]. The TranS2S model was applied to Chinese-English translation by Zhou et al., obtaining a BLEU score of 0.65 [36]. In low-resource bilingual translation, Li et al. showed the effectiveness of the LLaMA and ChatGLM models [37]. The CD algorithm has been used with a Transformer-based M2M model for low-resource languages, improving translation performance to a 0.79 BLEU score and minimising the hallucination rate [38]. The proposed LLaMA 2 7B model stands out with the highest BLEU score of 79.17% among the studies focused on low-resource settings, underscoring its effectiveness in mitigating hallucinations and enhancing translation quality for challenging language pairs like Chinese-Urdu.

FIGURE 5 | Performance evaluation of translation models with different quality metrics, including contrastive decoding. FIGURE 6 | Reduction in hallucination by the proposed models. FIGURE 7 | Semantic similarity scores.

In addition to focusing on low-resource language pairs such as Chinese-Urdu, we also tested our models on medium- and high-resource languages to evaluate their generalisation and robustness across different language pairs.
This approach, illustrated in Figure 8, ensures that our refined model not only addresses the challenges specific to low-resource languages but also performs effectively across a broader spectrum of language pairs. The datasets used for these evaluations are described in Section 3.1. We computed hallucination rates for low-, medium-, and high-resource languages using thresholds: translations are flagged as hallucinations if BLEU scores fall below 0.5 or if semantic similarity scores fall below 0.6. Figure 9 illustrates the overall improvements across the metrics for each model, allowing each model's performance to be grasped quickly and trends across metrics to be compared easily. The improved NER metrics after fine-tuning suggest that the models are better at detecting and handling hallucinations, resulting in more accurate and reliable translations. The BLEU scores exhibit a wide range, indicating significant variability in translation quality. This variability can be attributed to the limited amount of training data available. The red dashed line represents the threshold below which translations are flagged as potential hallucinations. A substantial portion of the scores falls below this threshold, suggesting a higher likelihood of hallucinated translations. The distribution of BLEU scores for mid-resource languages shows higher median values than for low-resource languages, with a more concentrated spread. This indicates an improvement in translation quality as more training data becomes available. The threshold line shows fewer instances below it, signifying a reduction in hallucinated translations. The BLEU scores for high-resource languages cluster towards the higher end of the scale, reflecting better translation performance. Most scores are above the threshold, indicating lower hallucination rates and demonstrating the effectiveness of abundant training data in improving translation quality.
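The resource-level flagging described above can be sketched as follows, treating the two threshold conditions (BLEU < 0.5, similarity < 0.6) as a disjunction and using plain cosine similarity on embeddings, as in Section 5.1. The score pairs below are toy values, not the paper's measurements:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def hallucination_rate(pairs, bleu_thresh=0.5, sim_thresh=0.6):
    """Fraction of (BLEU, similarity) score pairs flagged as
    hallucinations: a translation is flagged when either score
    falls below its threshold."""
    flags = [b < bleu_thresh or s < sim_thresh for b, s in pairs]
    return sum(flags) / len(flags)

# Toy evaluation: second pair fails on BLEU, third on similarity
rate = hallucination_rate([(0.7, 0.8), (0.3, 0.9), (0.6, 0.4)])
```

Sweeping the thresholds over the per-language score distributions is what produces the resource-level comparison plotted in Figures 8 and 9.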
Figure 10 illustrates the training and validation loss curves for three models, LLaMA 65B, LLaMA 2 7B, and ChatGLM, over 10 epochs. Each plot shows a consistent decrease in both training and validation loss as the number of epochs increases, indicating that the models learn effectively over time. The decreasing validation loss across all models suggests that the improvements generalise well to unseen data and confirms the effectiveness of the training process. No over-fitting is observed in the models' evaluation.

TABLE 5 | Comparative analysis with state-of-the-art models.

Ref                            Year   Model                          Language pair            BLEU (%)
Zhou et al. [36]               2021   TranS2S                        Chinese-English          0.65
J. Li et al. [37]              2024   LLaMA 2                        Low-resource bilingual   0.49
J. Li et al. [37]              2024   ChatGLM                        Low-resource bilingual   0.48
Waldendorf et al. [38]         2024   CD algorithm with M2M model    Low resource             0.79
J. Zeeshan et al. [39]         2021   OpenNMT, LSTM and RNN          Chinese ↔ Urdu           0.18
Khan et al. [40]               2020   NMT, LSTM                      Chinese ↔ Urdu           0.42
Z. A. Zeeshan and Jawad [41]   2020   LSTM                           Chinese ↔ Urdu           0.41
Z. A. Zeeshan and Jawad [41]   2020   Transformer                    Chinese ↔ Urdu           0.52
H. Chen et al. [42]            2024   Transformer for POS            Chinese ↔ Urdu           0.36
Proposed method                       CD algorithm with ChatGLM      Chinese ↔ Urdu           0.76
Proposed method                       CD algorithm with LLaMA 65B    Chinese ↔ Urdu           0.74
Proposed method                       CD algorithm with LLaMA 2 7B   Chinese ↔ Urdu           0.79

FIGURE 8 | Hallucination by resource levels. FIGURE 9 | Hallucination rate through NER metrics.

5.3 | Discussion

The results demonstrate that the proposed approach significantly enhances translation quality for low-resource language pairs such as Chinese-Urdu.
The integration of LLMs with contrastive decoding, dynamic weight adjustment, and multilingual embeddings contributed to these improvements. Dynamic weight adjustment helps the model generalise better across different types of data. By focusing on rare and difficult instances, the model avoids overfitting to common patterns, which is a common cause of hallucinations. It ensures balanced learning by giving appropriate importance to different parts of the data, preventing the model from being biased towards certain frequent phrases or structures. It also improves the alignment between source and target texts: by focusing on aligning more complex or less frequent word pairs, the model reduces the likelihood of generating hallucinated sentences that do not correspond to the input text. Thresholds are defined to flag potential hallucinations. The results indicated that low-resource languages have the highest ratio of hallucination compared to mid- and high-resource languages. This trend underscores the impact of training data availability on translation quality. The higher variability and greater number of scores below the thresholds highlight the difficulty of achieving reliable translations under data scarcity. These results emphasise the critical role of training data in developing robust machine translation models and the effectiveness of defined thresholds in detecting hallucinations across different language resource levels. The combination of multiple evaluation metrics provides a comprehensive assessment of translation quality. The scores for low-resource languages show a broad distribution, highlighting the challenge of maintaining semantic integrity when training data is scarce. The distribution of scores for high-resource languages is concentrated at the higher end, implying strong semantic fidelity in translations.
The minimal number of scores below the threshold underscores the robustness of translations when ample training data is available, resulting in the lowest hallucination rates among the three categories. The experimental results validate the effectiveness of the proposed methodologies. In the future, the research may focus on accuracy and anomaly detection using the proposed model in various applications [43–46].

6 | Ablation Study

We conducted an ablation study to understand the impact of the various components of the proposed methodology. It used the same experimental setup and the combined dataset. We measured the BLEU score, hallucination rate, and translation quality for each variant; the results in Table 6 are averaged across multiple runs with different configurations to ensure consistency. The results of the ablation study highlight the contribution of each component to the overall performance of the proposed model and confirm that the selected translations are not only accurate but also contextually relevant. Combining all components yielded the highest performance across all metrics, with a BLEU score of 79.17% and a hallucination rate of 0.09. This confirms that integrating all enhancements provides a synergistic effect, leading to robust and reliable translations.

FIGURE 10 | Performance evaluation through model training and validation loss.

TABLE 6 | Ablation study results.
Configuration                           BLEU (%)   Hallucination rate   Translation quality (score)
Baseline                                48.57      0.31                 0.56
Baseline + refined CD algorithm         62.34      0.18                 0.67
Fine-tuned + refined CD algorithm       68.45      0.15                 0.72
Baseline + multi-metric evaluation      64.78      0.19                 0.70
Fine-tuned + multi-metric evaluation    75.12      0.11                 0.74
Full model                              79.17      0.09                 0.79

7 | Conclusion

This research presented an improved approach to addressing hallucinations in low-resource NMT, specifically for Chinese-Urdu translation. We utilised a refined CD algorithm with LLMs to enhance translation accuracy and reduce hallucinations. The methodology integrates comprehensive data preprocessing, large language model selection, and a robust evaluation framework that significantly improves the reliability and quality of translations. We deployed effective tokenisation using SentencePiece and normalisation techniques to ensure clean and consistent training data, and we employed back translation to generate additional parallel corpora to enhance the robustness of the models. The proposed algorithm evaluates translations using the BLEU score, semantic similarity, and Named Entity Recognition (NER) accuracy. The experimental results demonstrated substantial improvements, with BLEU scores of 79.17 (LLaMA 2 7B), 74.00 (LLaMA 65B), and 71.23 (ChatGLM 2 6B). The hallucination rate is reduced from 0.31 to 0.14 on the baseline and from 0.10 to 0.04 on the fine-tuned models. The semantic similarity and NER accuracy metrics reflect better preservation of meaning and accurate handling of named entities.
Future research will explore extending this methodology to other low-resource language pairs and refining the algorithm to enhance performance in more challenging translation scenarios. Emphasising data augmentation strategies, adopting community-shared resources, and advancing model architectures will be key to continually improving NMT systems.

Conflicts of Interest

The authors declare no conflicts of interest.

Data Availability Statement

The data will be available upon request to the corresponding author.

References

1. X. Guan, Y. Liu, H. Lin, et al., “Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting,” in Proceedings of the AAAI Conference on Artificial Intelligence 38 (2024): 18126–18134, https://doi.org/10.1609/aaai.v38i16.29770.
2. A. Fan, S. Bhosale, H. Schwenk, et al., “Beyond English-Centric Multilingual Machine Translation,” Journal of Machine Learning Research 22, no. 107 (2021): 1–48.
3. A. Chowdhery, S. Narang, J. Devlin, et al., “PaLM: Scaling Language Modeling With Pathways,” Journal of Machine Learning Research 24, no. 240 (2023): 1–113.
4. N. Arivazhagan, A. Bapna, O. Firat, et al., Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges (2019): arXiv preprint arXiv:1907.05019.
5. N. Goyal, C. Gao, V. Chaudhary, et al., “The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation,” Transactions of the Association for Computational Linguistics 10 (2022): 522–538, https://doi.org/10.1162/tacl_a_00474.
6. M. R. Costa-jussà, J. Cross, O. Çelebi, et al., No Language Left Behind: Scaling Human-Centered Machine Translation (2022): arXiv preprint arXiv:2207.04672.
7. Z. Ji, N. Lee, R. Frieske, et al., “Survey of Hallucination in Natural Language Generation,” ACM Computing Surveys 55, no. 12 (2023): 1–38, https://doi.org/10.1145/3571730.
8. A. Radford, J. Wu, R.
Child, et al., “Language Models Are Unsupervised Multitask Learners,” OpenAI Blog 1, no. 8 (2019): 9.
9. G. Wenzek, V. Chaudhary, A. Fan, et al., “Findings of the WMT 2021 Shared Task on Large-Scale Multilingual Machine Translation,” in Proceedings of the Sixth Conference on Machine Translation (Association for Computational Linguistics (ACL), 2021), 89–99.
10. W. Xu, S. Agrawal, E. Briakou, M. J. Martindale, and M. Carpuat, “Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection,” Transactions of the Association for Computational Linguistics 11 (2023): 546–564, https://doi.org/10.1162/tacl_a_00563.
11. A. Hendy, M. Abdelrehim, A. Sharaf, et al., How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation (2023): arXiv preprint arXiv:2302.09210.
12. J. Ferrando, G. I. Gállego, B. Alastruey, C. Escolano, and M. R. Costa-jussà, Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer (2022): arXiv preprint arXiv:2205.11631.
13. N. M. Guerreiro, E. Voita, and A. F. Martins, Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation (2022): arXiv preprint arXiv:2208.05309.
14. D. Dale, E. Voita, L. Barrault, and M. R. Costa-jussà, Detecting and Mitigating Hallucinations in Machine Translation: Model Internal Workings Alone Do Well, Sentence Similarity Even Better (2022): arXiv preprint arXiv:2212.08597.
15. X. L. Li, A. Holtzman, D. Fried, et al., Contrastive Decoding: Open-Ended Text Generation as Optimization (2022): arXiv preprint arXiv:2210.15097.
16. Z. Du, Y. Qian, X. Liu, et al., “GLM: General Language Model Pretraining With Autoregressive Blank Infilling,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Vol. 1 (Association for Computational Linguistics (ACL), 2022), 320–335.
17. H. Touvron, L. Martin, K.
Stone, et al., Llama 2: Open Foundation and Fine-Tuned Chat Models (2023): arXiv preprint arXiv:2307.09288.
18. A. Bapna, I. Caswell, J. Kreutzer, et al., Building Machine Translation Systems for the Next Thousand Languages (2022): arXiv preprint arXiv:2205.03983.
19. A. Siddhant, A. Bapna, O. Firat, et al., Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning (2022): arXiv preprint arXiv:2201.03110.
20. T. Brown, B. Mann, N. Ryder, et al., “Language Models Are Few-Shot Learners,” Advances in Neural Information Processing Systems 33 (2020): 1877–1901.
21. S. Zhang, S. Roller, N. Goyal, et al., OPT: Open Pre-trained Transformer Language Models (2022): arXiv preprint arXiv:2205.01068.
22. K. Peng, L. Ding, Q. Zhong, et al., Towards Making the Most of ChatGPT for Machine Translation (2023): arXiv preprint arXiv:2303.13780.
23. R. Bawden and F. Yvon, Investigating the Translation Performance of a Large Multilingual Language Model: The Case of BLOOM (2023): arXiv preprint arXiv:2303.01911.
24. V. Raunak, A. Menezes, and M. Junczys-Dowmunt, The Curious Case of Hallucinations in Neural Machine Translation (2021): arXiv preprint arXiv:2104.06683.
25. N. M. Guerreiro, E. Voita, and A. Martins, “Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation,” in A. Vlachos and I. Augenstein, eds., Proceedings of the
17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023 (Association for Computational Linguistics, 2023), 1059–1075, https://aclanthology.org/2023.eacl-main.75.
26. R. Rei, J. G. De Souza, D. Alves, et al., “COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task,” in Proceedings of the Seventh Conference on Machine Translation (WMT) (Association for Computational Linguistics (ACL), 2022), 578–585.
27. R. Sennrich, J. Vamvas, and A. Mohammadshahi, Mitigating Hallucinations and Off-Target Machine Translation With Source-Contrastive and Language-Contrastive Decoding (2023): arXiv preprint arXiv:2309.07098.
28. J. Tiedemann, M. Aulamo, D. Bakshandaeva, et al., “Democratizing Neural Machine Translation With OPUS-MT,” Language Resources and Evaluation 58, no. 2 (2023): 1–43, https://doi.org/10.1007/s10579-023-09704-w.
29. T. Kocmi, E. Avramidis, R. Bawden, O. Bojar, A. Dvorkovich, C. Federmann, et al., “Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet,” in P. Koehn, B. Haddow, T. Kocmi, and C. Monz, eds., Proceedings of the Eighth Conference on Machine Translation, Dec. 2023 (Association for Computational Linguistics, 2023), 1–42, https://aclanthology.org/2023.wmt-1.1.
30. M. Thoma, The WiLI Benchmark Dataset for Written Language Identification (2018): arXiv preprint arXiv:1801.07779.
31. E. Fonseca, L. Yankovskaya, A. F. Martins, M. Fishel, and C.
Fed- ermann, “Findings of the WMT 2019 Shared Tasks on Quality Estima- tion,” in Proceedings of the Fourth Conference on Machine Translation, Vol. 3, (Association for Computational Linguistics (ACL), 2019), 1–10: Shared Task Papers, Day 2, https://doi.org/10.18653/v1/w19‐5401. 32. P. K. Buttar andM.K. Sachan, “AReview of theApproaches toNeural Machine Translation,” Natural Language Processing and Information Retrieval (2023): 78–109, https://doi.org/10.1201/9781003244332‐4. 33. S. Choo and W. Kim, “A Study on the Evaluation of Tokenizer Performance in Natural Language Processing,” Applied Artificial Intel- ligence 37, no. 1 (2023): 2175112, https://doi.org/10.1080/08839514.2023. 2175112. 34. S. Hellsten, Incremental Re‐tokenization in BPE‐Trained Senten- cepiece Models, (Umeå University, 2024). 35. S. Chen, Y. Pei, Z. Ke, and W. Silamu, “Low‐resource Named Entity Recognition via the Pre‐training Model,” Symmetry 13, no. 5 (2021): 786, https://doi.org/10.3390/sym13050786. 36. Zhou, C., Neubig, G., Gu, J. et al., “Detecting Hallucinated Content in Conditional Neural Sequence Generation.” in Findings of the Association for Computational Linguistics: ACL‐IJCNLP 2021 (Association for Computational Linguistics (ACL), 2021): 1393–1404, https://aclantho logy.org/2021.findings‐acl.122. 37. J. Li, J. Chen, R. Ren, et al., “The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models,” Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2024): 10879–10899: arXiv preprint arXiv:2401.03205, https://doi.org/10.18653/v1/2024.acl‐ long.586. 38. J. Waldendorf, B. Haddow, and A. Birch, “Contrastive Decoding Reduces Hallucinations in Large Multilingual Machine Translation Models,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1 (Association for Computational Linguistics (ACL), 2024), 2526–2539. 39. J. Zeeshan, M. 
Zakira, and M. Niaz, “A Seq to Seq Machine Translation From Urdu to Chinese,” Journal of Autonomous Intelligence 4, no. 1 (2021): 1–5, https://doi.org/10.32629/jai.v4i1.359. 40. Z. Khan, M. Zakira, W. Slamu, and N. Slam, “A Study of Neural Machine Translation From Chinese to Urdu,” Journal of Autonomous Intelligence 2, no. 4 (2020): 29–36. 41. Z. A. Zeeshan and M. Z. Jawad, “Research on Chinese‐Urdu Ma- chine Translation Based on Deep Learning,” Journal of Autonomous Intelligence 3, no. 2 (2020): 34–44, https://doi.org/10.32629/jai.v3i2.279. 42. H. H. Chen, J. Wang, and N. U. H. Muhammad, “Chinese‐Urdu Neural Machine Translation Interacting Pos Sequence Prediction in Urdu Language,” Computer Engineering & Science 46, no. 03 (2024): 518. 43. M. Faheem, M. A. Al‐Khasawneh, A. A. Khan, and S. H. H. Madni, “Cyberattack Patterns in Blockchain‐Based Communication Networks for Distributed Renewable Energy Systems: A Study on Big Datasets,” Data in Brief 53, no. 5 (2024): 110212, https://doi.org/10.1016/j.dib.2024. 11021250.66. 44. M. Faheem and A.‐K. Mahmoud Ahmad, “Multilayer Cyber attacks Identification and Classification Using Machine Learning in Internet of Blockchain(IoBC)‐Based Energy Networks,” Data in Brief 54, no. 5 (2024): 110461, https://doi.org/10.1016/j.dib.2024.110461.68. 45. M. Faheem, B. Raza, M. S. Bhutta, and S. H. H. Madni, “A Blockchain‐Based Resilient and Secure Framework for Events Moni- toring and Control in Distributed Renewable Energy Systems,” IET Blockchain (2024): 1–15, https://doi.org/10.1049/blc2.12081.69. 46. A. Akram, J. Rashid, M. A. Jaffar, M. Faheem, and R. Amin, “Segmentation and Classification of Skin Lesions Using Hybrid Deep Learning Method in the Internet of Medical Things.” Skin Research and Technology 29, no. 11 (2023): e13524, https://doi.org/10.1111/srt.13524. 
1 | Introduction
2 | Literature Review
3 | Materials and Methods
3.1 | Datasets
3.2 | Data Preprocessing
3.3 | Proposed Model
3.3.1 | ChatGLM Model
3.3.2 | LLaMA
3.3.3 | Refined Contrastive Decoding Algorithm
4 | Experimental Setup
4.1 | Hyper‐Parameters Fine Tuning and Model Training
4.2 | Evaluation Metrics
5 | Results and Discussion
5.1 | Performance Analysis Through NER Metrics
5.2 | Comparative Analysis
5.3 | Discussion
6 | Ablation Study
7 | Conclusion
Conflicts of Interest
Data Availability Statement