The researchers examined how LLMs are currently used as technological tools and how they might evolve into powerful scientific assistants.
They discovered that LLMs encounter obstacles due to hallucinations that yield plausible yet incorrect outcomes, undermining their reliability in both research and commercial settings. Their opaque nature hampers transparency and trust, while biases ingrained in training data may reinforce inequalities. Therefore, AI outputs necessitate verification through human oversight or algorithmic confidence assessment.
Hallucinations and accuracy: A double-edged sword
A primary challenge with LLMs is their propensity to generate hallucinations that appear credible but are factually incorrect. The paper indicates that although such imaginative results can sometimes inspire creative hypotheses, reliance on them for experimental validation poses risks.
The paper stresses that LLMs should not be seen as completely trustworthy or entirely unreliable. Researchers ought to adopt a framework of “algorithmic confidence,” a continuous metric of trustworthiness that assesses the likelihood of an AI-generated output being accurate.
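The paper does not spell out a formula for this metric, but a minimal sketch of the idea, assuming the score is derived from the token-level log-probabilities that many LLM APIs expose, might look like the following. The function names and the 0.8 threshold are illustrative assumptions, not taken from the paper.

```python
import math

def algorithmic_confidence(token_logprobs):
    """Illustrative proxy (assumption, not the paper's definition):
    map the mean token log-probability of a generated answer to a
    0-1 confidence score via the geometric mean of token probabilities."""
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    # exp(mean log-prob) equals the geometric mean of token probabilities
    return math.exp(mean_logprob)

def gate_output(answer, token_logprobs, threshold=0.8):
    """Accept high-confidence outputs; route everything else to the
    human verification the authors call for."""
    score = algorithmic_confidence(token_logprobs)
    if score >= threshold:
        return answer, score, "accepted"
    return answer, score, "flagged for human verification"

# Example: per-token log-probs, with one visibly uncertain token
logprobs = [-0.05, -0.10, -0.02, -1.90]
print(gate_output("The melting point is 327 degrees C.", logprobs))
```

The practical point of such a gate is that trustworthiness becomes a threshold decision rather than a binary verdict: outputs above the line pass through, and anything below it goes to a human reviewer.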
According to the World Bank’s “Digital Progress and Trends Report 2025: Strengthening AI Foundations,” inherent flaws in current AI may restrict its broader economic relevance.
Hallucinations in GenAI tools stem from the mathematical and logical foundations of LLMs themselves, which makes the tools unreliable in business contexts where errors can be costly.
Given demonstrated strengths such as DeepMind’s AlphaFold, which used deep learning to make accurate protein structure predictions and thereby cracked a long-standing problem, AI4Science could potentially reverse the recent decline in scientific productivity marked by bottlenecks in literature searches and peer review.
Nevertheless, LLMs are not yet prepared to function as autonomous scientific agents. The authors emphasize that all AI-assisted research must be validated either by human experts or through algorithmic confidence evaluations to prevent the propagation of errors, biases, or hallucinations into published work or crucial decisions.
They found that human involvement is critical in the research process to enhance the safety, reliability, and efficiency of LLMs. In literature reviews, humans offer deeper insights and direct LLM agents in ways that align with the needs of scientists.
In reasoning tasks, humans can spot uncertain steps and correct errors, improving the accuracy of chain-of-thought techniques. Human scientists also help disambiguate and troubleshoot LLM-powered systems, select among generated hypotheses to streamline workloads, and play an essential role in conducting experiments and refining flawed experimental plans.
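As one way to picture that division of labour, here is a minimal human-in-the-loop sketch, assuming a per-step confidence scorer and a reviewer callback; every name and the 0.7 threshold are hypothetical illustrations, not from the paper.

```python
from typing import Callable, List

def verified_chain_of_thought(
    steps: List[str],
    step_confidence: Callable[[str], float],
    human_review: Callable[[str], str],
    threshold: float = 0.7,
) -> List[str]:
    """Walk through model-generated reasoning steps; when a step's
    confidence falls below the threshold, hand it to a human reviewer
    who can correct or confirm it before the chain continues."""
    verified = []
    for step in steps:
        if step_confidence(step) < threshold:
            step = human_review(step)  # expert fixes or confirms the step
        verified.append(step)
    return verified

# Toy usage with a stub scorer and a rubber-stamp reviewer
steps = ["Assume the sample is pure.", "Therefore the yield is 98%."]
fixed = verified_chain_of_thought(
    steps,
    step_confidence=lambda s: 0.5 if "Therefore" in s else 0.9,
    human_review=lambda s: s + " [verified by human expert]",
)
print(fixed)
```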
Transparency and interpretability challenges
A significant concern is the opaque nature of LLMs. The mechanisms behind these models are often unclear, making it challenging to comprehend why particular outputs are produced.
This lack of interpretability can erode trust, especially in high-stakes research scenarios. Scientists are investigating methods like neuron activation visualizations, probing, and logit lens approaches to enhance transparency. Interestingly, LLMs are also employed to elucidate other black-box systems, demonstrating their capabilities while emphasizing the necessity for ongoing scrutiny.
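To make the logit lens concrete, here is a minimal sketch using GPT-2 via the Hugging Face transformers library: each intermediate layer’s hidden state is projected through the model’s own final layer norm and unembedding matrix, revealing what next token the model “believes” at that depth. This is a generic illustration of the technique, not code from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Project each layer's hidden state at the last position through the
# final layer norm and the unembedding matrix (the "logit lens").
for layer, hidden in enumerate(outputs.hidden_states):
    h = model.transformer.ln_f(hidden[0, -1])  # final layer norm
    logits = model.lm_head(h)                  # unembedding to vocab logits
    token = tokenizer.decode(logits.argmax())
    print(f"layer {layer:2d}: top next-token prediction = {token!r}")
```

Early layers typically yield noise while later layers converge on the final answer, which is precisely what makes the lens useful for seeing where in the network a “belief” forms.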
Bias, fairness, and access
LLMs encompass ethical issues extending beyond accuracy. While they have the capacity to democratize access to scientific knowledge—enabling researchers from non-English-speaking backgrounds to engage more fully in global scientific discussions—they can also perpetuate biases found in training datasets. These biases may skew outputs and reinforce disparities in research, highlighting the need for meticulous oversight.
Balancing AI creativity and scientific rigour
The paper notes that while LLMs can stimulate creative hypothesis generation, pushing the boundaries of research, excessive reliance on AI risks diluting scientific rigour if speculative outputs are mistaken for validated findings. Ensuring human supervision and implementing rigorous verification processes are critical to guarantee that AI enhances rather than undermines research integrity.
According to the World Bank report cited earlier, current AI systems mainly serve as pattern-recognition engines lacking true understanding, logical reasoning, or common sense.
A significant amount of scientifically valuable knowledge is implicit and context-dependent, making it challenging for AI to interpret or apply reliably. Unchecked hallucinations in AI outputs can lead to serious consequences in both scientific and business contexts. Until AI can reliably interact with the physical world and comprehend complex contextual scenarios, human judgment remains vital.
Responsible integration is key
The perspective paper concludes that LLMs present tremendous potential for accelerating scientific discovery, but their ethical and interpretability challenges must not be overlooked.
Responsible integration necessitates human oversight, transparency, and meticulous validation of AI-generated insights. By implementing these safeguards, researchers can effectively leverage AI capabilities while upholding the integrity of scientific inquiry.
The World Bank report further notes that AI will become increasingly beneficial when it can understand and interact with the physical environment. Currently, AI models focus mainly on optimizing software and generating information, text, and images, affecting a limited range of activities largely confined to virtual realms.
For AI to effect broader change, it must reliably perceive, comprehend, and respond to physical environments even in novel and unprecedented situations. By connecting digital intelligence with physical action, AI could transform more industries and tackle practical challenges in ways that extend beyond present applications.
As AI evolves, the scientific community faces a pressing question: how much autonomy should be granted to machines, and how can humans ensure that innovation remains rigorous, ethical, and dependable? The authors assert that the resolution of this question will shape the forthcoming era of science.