Abstract
Background Large language models (LLMs) are increasingly used by clinicians to generate executable code for pharmacokinetic (PK) simulation. Whether such code meets the accuracy standards of target-controlled infusion systems has not been systematically evaluated.
Methods Five LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok) were prompted to generate Python code for the Marsh three-compartment propofol model under a standardized 120-minute bolus-plus-infusion regimen. Each LLM was tested in two phases: Phase 1, in which the choice of integrator was left to the model, and Phase 2, in which fourth-order Runge–Kutta with a 1-second step size was mandated. Twenty runs per LLM per phase were collected (n = 200). Plasma concentrations were compared against a triple-validated reference using median prediction error (MDPE), median absolute prediction error (MDAPE), and Wobble. Runs were classified as Class A (MDAPE < 1 %), B (1 % to < 30 %), C (≥ 30 %), or D (failed).
Results All 200 scripts were invokable and created a CSV file; 199/200 (99.5 %, 95 % CI 97.1–99.9 %) produced a valid time–concentration series. The remaining script (Gemini Phase 2 run 18) aborted during row formatting with ValueError and left a header-only CSV. Median MDAPE per LLM × phase ranged 0.0043–0.020 %, with 195/200 runs (97.5 %, 95 % CI 94.3–98.9 %) achieving Class A. Five runs (2.5 %, 95 % CI 1.1–5.7 %) were non-excellent or structurally defective: three were Class C due to time-scale/unit-handling errors (one DeepSeek run with a 6-second effective Euler step from a minute-as-second declaration, two Grok runs with min⁻¹ rate constants applied per second), one was Class D (the empty-CSV failure above), and one was Class B but reflected a duplicated-bolus implementation error rather than a benign numerical deviation. Kruskal–Wallis testing showed significant inter-LLM heterogeneity across all metrics and phases (all omnibus p < 0.01). Strict compliance with Phase 2 directives was 98 % (98/100 runs; 95 % CI 93.1–99.5 %); lenient compliance accepting RK4-adaptive implementations as a superset was 100 % (100/100 runs; 95 % CI 96.3–100 %). Yet all three numerically divergent Phase 2 runs occurred under nominally compliant RK4/dt = 1 s configurations; the fourth non-Class-A Phase 2 outcome was a formatting failure that produced no usable trajectory.
Conclusion LLMs generate numerically accurate Marsh-model code in most runs but silently diverge in a clinically non-negligible minority. The Marsh model, the simplest fixed-parameter three-compartment propofol model, functioned here as a positive control; even so, three distinct classes of structural bug (unit/time-scale mismatch, duplicated-bolus event handling, malformed f-string formatting) slipped past apparent execution success. Two additional Phase 2 runs used an RK4-adaptive variant rather than classical RK4 and are therefore better interpreted as strict prompt non-compliance than as numerical failure. Prompt-level method specification substantially reduced algorithm-selection errors but did not eliminate unit or structural bugs. LLM-generated pharmacokinetic code requires reference-based validation before any safety-relevant use.
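The accuracy metrics and run classes named in Methods can be sketched as follows. This is a minimal illustration, not the study's analysis script. It assumes the conventional definition of prediction error, PE = 100 × (candidate − reference) / reference at each matched time point, with Wobble taken as the median absolute deviation of PE from MDPE. The function names performance_metrics and classify_run are hypothetical, and the example concentrations are made up for illustration. The class thresholds are those stated in Methods, with the 30 % boundary assigned to Class C.

```python
from statistics import median


def performance_metrics(candidate, reference):
    """Return MDPE, MDAPE, and Wobble (all in %) for two matched series.

    Assumes PE_i = 100 * (candidate_i - reference_i) / reference_i and
    skips time points where the reference concentration is zero.
    """
    if len(candidate) != len(reference):
        raise ValueError("series must have the same number of time points")
    pe = [
        100.0 * (c - r) / r
        for c, r in zip(candidate, reference)
        if r != 0
    ]
    if not pe:
        raise ValueError("no usable time points")
    mdpe = median(pe)                           # signed bias
    mdape = median(abs(e) for e in pe)          # inaccuracy
    wobble = median(abs(e - mdpe) for e in pe)  # variability around the bias
    return {"MDPE": mdpe, "MDAPE": mdape, "Wobble": wobble}


def classify_run(mdape):
    """Map an MDAPE value (in %) to Class A-D; None means no usable series."""
    if mdape is None:
        return "D"
    if mdape < 1.0:
        return "A"
    if mdape < 30.0:
        return "B"
    return "C"


# Illustrative values only, not data from the study.
reference = [1.0, 2.0, 4.0]
candidate = [1.1, 2.0, 3.6]
metrics = performance_metrics(candidate, reference)
run_class = classify_run(metrics["MDAPE"])
```

Under these assumptions, a run that produced no usable time–concentration series, such as the header-only CSV described in Results, would be passed as None and fall into Class D.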
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work received no external funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The full run-level dataset (200 Python source files, 200 output CSVs, 200 per-run PE series, and associated metadata including model version, timestamp, and SHA-256 hashes), together with the reference implementation and all analysis scripts, is released under the MIT license at https://github.com/omote-masahito/llm-pk-marsh-benchmark





