Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

Fine-tuning Large Language Models in Behavioral Psychology for Scalable Physical Activity Coaching

Sriya Mantena, View ORCID ProfileAnders Johnson, View ORCID ProfileMarily Oppezzo, View ORCID ProfileNarayan Schuetz, View ORCID ProfileAlexander Tolas, Ritu Doijad, View ORCID ProfileC. Mikael Mattson, View ORCID ProfileAllan Lawrie, View ORCID ProfileMariana Ramirez-Posada, View ORCID ProfileEleni Linos, View ORCID ProfileAbby C. King, View ORCID ProfileFatima Rodriguez, View ORCID ProfileDaniel Seung Kim, View ORCID ProfileEuan A. Ashley
doi: https://doi.org/10.1101/2025.02.19.25322559
Sriya Mantena
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Anders Johnson
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Anders Johnson
Marily Oppezzo
2Department of Epidemiology & Population Health, Stanford University School of Medicine, Stanford, CA, 94305, USA
3Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Marily Oppezzo
Narayan Schuetz
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
4Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, USA
5Clinical Excellence Research Center (CERC), Stanford University, Stanford, CA, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Narayan Schuetz
Alexander Tolas
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Alexander Tolas
Ritu Doijad
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
C. Mikael Mattson
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for C. Mikael Mattson
Allan Lawrie
6National Heart and Lung Institute, Imperial College London, London, UK
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Allan Lawrie
Mariana Ramirez-Posada
7Department of Dermatology, Stanford University School of Medicine, Stanford, CA, 94305, USA
8Center for Digital Health, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Mariana Ramirez-Posada
Eleni Linos
7Department of Dermatology, Stanford University School of Medicine, Stanford, CA, 94305, USA
8Center for Digital Health, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Eleni Linos
Abby C. King
2Department of Epidemiology & Population Health, Stanford University School of Medicine, Stanford, CA, 94305, USA
3Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Abby C. King
Fatima Rodriguez
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
8Center for Digital Health, Stanford University School of Medicine, Stanford, CA, 94305, USA
9Wu Tsai Human Performance Alliance, Stanford University School of Medicine, Stanford, CA, 94305, USA
10Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Fatima Rodriguez
Daniel Seung Kim
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
8Center for Digital Health, Stanford University School of Medicine, Stanford, CA, 94305, USA
9Wu Tsai Human Performance Alliance, Stanford University School of Medicine, Stanford, CA, 94305, USA
10Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Daniel Seung Kim
  • For correspondence: euan{at}stanford.edu kimds{at}stanford.edu
Euan A. Ashley
1Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
8Center for Digital Health, Stanford University School of Medicine, Stanford, CA, 94305, USA
9Wu Tsai Human Performance Alliance, Stanford University School of Medicine, Stanford, CA, 94305, USA
10Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, 94305, USA
11Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA
12Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Euan A. Ashley
  • For correspondence: euan{at}stanford.edu kimds{at}stanford.edu
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Data/Code
  • Preview PDF
Loading

ABSTRACT

Personalized, smartphone-based coaching improves physical activity but relies on static, human-crafted messages. We introduce My Heart Counts (MHC)-Coach, a large language model fine-tuned on the Transtheoretical Model of Change. MHC-Coach generates messages tailored to an individual’s psychology (their “stage of change”), providing personalized support to foster long-term physical activity behavior change. To evaluate MHC-Coach’s efficacy, 632 participants compared human-expert and MHC-Coach text-based interventions encouraging physical activity. Among messages matched to an individual’s stage of change, 68.0% (N=430) preferred MHC-Coach-generated messages (P < 0.001). Blinded behavioral science experts (N=2) rated MHC-Coach messages higher than human-expert messages for perceived effectiveness (4.4 vs. 2.8) and Transtheoretical Model alignment (4.1 vs. 3.5) on a 5-point Likert scale. This work demonstrates how language models can operationalize behavioral science frameworks for personalized health coaching, promoting long-term physical activity and potentially reducing cardiovascular disease risk at scale.

Introduction

Physical inactivity is one of the strongest risk factors for nearly all chronic diseases of aging.1,2 Despite the well-documented benefits of regular physical activity, a majority of adults in the United States remain sedentary.3 Mean daily steps is a commonly used proxy for physical activity, with even modest, short-term increases (+500 steps) associated with cardiovascular benefits.4 However, recent large-scale studies show that the average American achieves only 4,700 daily steps, well below the recommended threshold for health benefits.4,5 These findings emphasize the need for scalable interventions that effectively increase physical activity and support sustained engagement in the long term.

Personalized coaching has consistently demonstrated efficacy in increasing physical activity by addressing individuals’ unique needs, motivations, and barriers.6–8 Yet, sustaining long-term behavior change is a complex and dynamic process shaped by multiple interacting factors. Behavioral science offers structured, evidence-based frameworks for personalization to the individual by addressing the psychological, social, and environmental factors that influence behavior change.9 One well-established framework is the Transtheoretical Model (TTM),10 which focuses on the stages of change individuals undergo when adopting new healthy behaviors. The TTM categorizes individuals into five distinct stages of change: precontemplation, contemplation, preparation, action, and maintenance.10 The TTM has proven effective in guiding behavior change for chronic disease prevention, with tailored interventions at each stage demonstrating higher efficacy in increasing physical activity.11 Because the model’s interventions evolve with users as they progress through the stages of change (from precontemplation to maintenance), it supports both short- and long-term improvements in physical activity.10 Moreover, coaching strategies grounded in the TTM framework have fostered sustained physical activity across diverse populations.12–14 Despite its success in fostering behavior change, TTM-based coaching has largely relied on manual creation and delivery, limiting scalability. Digital health tools, such as smartphone apps and wearables, offer a promising avenue for expanding access to TTM-based interventions.

Digital health tools, particularly smartphone applications and wearable devices such as smartwatches, have been employed to encourage physical activity.15,16 Interventions using smartphone applications and fitness trackers have demonstrated a positive effect on physical activity, with a recent meta-analysis reporting an average increase of 1,850 steps per day when digital interventions include personalized feedback and notifications.17 While these findings highlight the potential of digital interventions, existing tools are currently limited in that they focus on short-term behavior change, with a lack of attention on sustained engagement and long-term impact. As a result, their effects on chronic disease prevention and treatment, such as cardiovascular disease, remain unknown. Moreover, while several digital trials are being conducted to assess the efficacy of TTM-based personalized coaching, they remain constrained by using preprogrammed, static messages designed by human experts and localized, center-based recruitment, limiting their scalability and adaptability.18,19

To address this issue of scale, recent digital health research has leveraged large language models (LLMs) to deliver health coaching. Generic LLMs can generate responses with high levels of empathy, readability, and helpfulness, often rivaling or exceeding those of human coaches.20–23 However, these systems are reactive, responding to user-initiated queries rather than using domain-specific reasoning in behavioral psychology to provide timely and motivation-sensitive interventions necessary for sustained behavior change. Meanwhile, Just-in-Time Adaptive Interventions (JITAIs) leverage real-time user data to deliver contextually relevant, timely support to enhance adherence and retention.24 However, these systems are limited to predefined decision rules and cannot generate behaviorally informed messaging. Hence, there remains a critical need to integrate behavioral science frameworks with LLMs to enable them to capture both the motivational language seen in chatbots and the adaptive, real-time responsiveness of JITAIs.

To explore this potential, we built upon the My Heart Counts (MHC) Cardiovascular Health Study, a large-scale digital health platform designed to promote physical activity. The My Heart Counts Cardiovascular Health Study was launched in 2015 as one of the first smartphone applications using Apple’s ResearchKit. On launch, it became the fastest-recruiting medical study of all time with >40,000 participants in the first two weeks.25–30 MHC is a smartphone-based observational study designed to gather frequent, remote data on fitness, activity, and sleep.29 With a later update, MHC leveraged the accessibility of smartphones and wearable technology to perform the first fully digital randomized trial, using a personalized, data-driven approach to encourage physical activity.26 In this randomized crossover trial using MHC, we recently demonstrated that personalized text-based coaching notifications, customized to each user’s baseline activity levels, significantly increased mean daily step count compared to all other digital notifications (e.g., a notification to reach 10,000 steps).30 However, this study was limited by scale, requiring human experts to craft each personalized coaching prompt a priori.30 Moreover, this study only considered short-term physical activity (over a 7-day intervention period), whereas the majority of cardiovascular benefits are seen with longer-term behavioral change.30

To address the challenges of scalability and the lack of long-term behavioral change in digital health research, we developed MHC-Coach, a fine-tuned LLM31 trained on TTM behavioral science principles32 and domain-specific knowledge in physical activity and cardiovascular health. MHC-Coach autonomously generates personalized, text-based coaching messages to increase physical activity, evolving with users’ changing needs, motivations, and readiness for behavior change. Unlike prior studies that relied on small evaluator groups (5 to 42 participants),21–23,33,34 we conduct a large-scale evaluation (N=632) with participants of the MHC study, comparing the efficacy of MHC-Coach messages vs. traditional human expert-designed messages. Additionally, behavioral science experts systematically rated both message types based on their motivational effectiveness and alignment with behavioral science principles. These analyses assess the potential of integrating behavioral science frameworks with AI-driven coaching to enhance personalization, improve motivation, and foster sustainable physical activity behavior change.

Results

0.1 Fine-tuning MHC-Coach

MHC-Coach was developed by fine-tuning LLaMA 3-70B on a dataset of 3,268 human expert messages matched to a specific stage of change,32 a domain-specific corpus on health and exercise to provide background information on the importance of physical activity in preventing cardiovascular disease1,35–41, and detailed information about the TTM model and characteristics of individuals in each stage of change10,42 (see Methods and Figure 1). The combined fine-tuning dataset contained a total of 215,553 tokens.

Figure 1.
  • Download figure
  • Open in new tab
Figure 1.

Schematic representation of the study design, including the fine-tuning process for LLaMA 3-70B, the generation of stage-specific motivational messages, and the evaluation through participant surveys.

To assess the effectiveness of this fine-tuning approach, we qualitatively compared MHC-Coach outputs to those of the base LLaMA 3-70B model, evaluating their integration of behavioral science principles and ability to generate stage-specific motivational messages (see Figure 2). For example, when asked to produce a motivation message for a user in the contemplation stage (i.e., those beginning to think about how to change their physical activity behavior), the base LLM generated:

Figure 2.
  • Download figure
  • Open in new tab
Figure 2.

Qualitative comparison of Fine-Tuned LLM (MHC-Coach) and Base LLM (LLaMA3-70b) messages across the Contemplation and Action stages. Key differences include specificity in behavior recommendations and alignment with behavioral science principles, such as addressing ambivalence in the Contemplation stage and sustaining motivation through variety in the Action stage.

“Take the first step towards a healthier you! Schedule exercise into your daily routine and start feeling the benefits in just a few weeks.”

While this message encourages users to plan their exercise, it is overly general. It does not specify the type of exercise, the duration, or when to start, which makes it harder for users to visualize or act on the suggestion. In contrast, MHC-Coach generated the following message: “Ready to feel great? Schedule 10-minute daily walks into your calendar & reap energy, mood, & health benefits. Start Monday!”

This version demonstrates a significant improvement in actionable specificity by recommending “10-minute daily walks” and incorporating practical planning advice, such as integrating the activity “into your calendar.” The use of immediacy (“Start Monday!”) gives a clear starting point, while the explicit mention of benefits such as “energy, mood, & health benefits” creates a stronger connection between the action and its immediate outcomes. By combining actionable specificity, real-life planning, and outcome clarity, the fine-tuned message effectively integrates TTM behavioral science principles, making it more likely to resonate with and motivate users to take action. Figure 2 illustrates these improvements with annotated examples from the contemplation and action stages of the TTM.

To further evaluate the effectiveness of MHC-Coach, we conducted two systematic assessments. The first was a survey-based preference study in which 632 prior MHC participants compared MHC-Coach messages to human coach-generated messages encouraging physical activity. In the second, behavioral science experts reviewed MHC-Coach and human expert messages and rated them on perceived effectiveness and TTM stage alignment.

0.2 Survey-based evaluation of MHC-Coach versus human expert text prompts

Survey invitations were sent via email to 62,616 users of the MHC smartphone app who had previously consented to re-contact. Surveys were conducted using the Qualtrics platform, and only participants who completed the entire survey were included in the analysis. Among the 1,004 initial survey respondents, 853 met the inclusion criteria of being at least 18 years old and English-speaking. Of these 853 eligible participants, 632 provided complete responses and were included in the final analysis (see Figure 3). The survey respondents included 79.4% males and 19.8% females, with the majority of participants (50.7%) aged 60+ years. The racial/ethnic distribution was predominantly White (80.8%), with smaller proportions identifying as Asian (6.1%), Hispanic/Latino (4.3%), and Black/African American (1.6%). Demographic characteristics of survey participants are summarized in Table 1.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 1. Participant Demographics: Summary of the study population’s demographic characteristics, including age, gender, ethnicity (with the option to select multiple ethnicities), and native language.
Figure 3.
  • Download figure
  • Open in new tab
Figure 3.

The recruitment process for the My Heart Counts study by comparing LLM-generated messages with expert-designed messages. Of 65,179 My Heart Counts participants who indicated openness to being contacted for future research, 62,616 were emailed the survey. A total of 1,004 participants responded, of which 853 met the inclusion criteria (English speakers aged ≥ 18 years), and 632 of the eligible participants completed the entire survey and were included in the final analysis.

To evaluate the efficacy of MHC-Coach in generating motivational messages, we surveyed participants with a series of forced-choice A/B questions. Figure 4 illustrates the overall survey flow along with example questions. All participants received five forced-choice A/B questions comparing general motivational messages that were not tailored to a specific stage of change, with 85.4% (N=540) expressing a preference for the general MHC-Coach-generated message and 14.6% (N=92) preferring the human expert-designed message. Across the general motivational messages, there was a significant preference for the MHC-Coach-generated messages (Chi-squared P < 0.001, see Table 2 and Figure 5).

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 2. Preferences over coaching messages generated by LLMs versus expert-crafted messages, stratified by stages of change.
Figure 4.
  • Download figure
  • Open in new tab
Figure 4.

Overview of the survey design for comparing user preferences between LLM-generated and human expert-designed motivational messages. Users answered demographic questions and were asked to choose the prompt that they preferred on evaluation of 10 total message pairs, including 5 generally motivational messages and 5 pairs tailored to each user’s specific stage of change.

Figure 5.
  • Download figure
  • Open in new tab
Figure 5.

Comparisons between human expert-generated messages and fine-tuned LLM-generated messages across multiple evaluations. (a) Preferences of 632 surveyed users of the MyHeart Counts platform indicate a higher preference for LLM-generated messages across all stages of change. (b) Behavioral science experts rated the perceived effectiveness of messages on a 5-point Likert scale, with the fine-tuned LLM consistently outperforming human expert-generated messages. (c) Behavioral science experts evaluated how well each message aligned with the corresponding stage of change, with the fine-tuned LLM again achieving higher scores in most categories. Vertical lines for each bar represent the standard deviation.

Users were then stratified to a specific TTM stage of change and received three forcedchoice A/B questions on which message they preferred: an MHC-Coach-generated or a human expert-created coaching message, matched to their individual stage of change. Across the stage-matched messages, 68.0% (N=430) of users showed a preference for the MHC-Coach-generated messages, and 32.0% (N=202) preferred the human expert messages (Chisquared P < 0.001, see Table 2). This preference was significant and consistent across all stages of change in a sensitivity analysis, with sample sizes ranging from N = 37 for the pre-contemplation stage to N = 318 for the maintenance stage. Detailed breakdowns of preferences across specific stages of change, along with associated statistical results, are presented in Table 2 and Figure 5.

0.3 Behavioral science expert evaluation of MHC-Coach and human messages

Two behavioral science experts reviewed and rated the MHC-Coach-generated messages and human expert-written messages (see Figure 5). Across all stages of change, the average perceived effectiveness rating was 2.8 for the human-written messages and 4.4 for the MHC-Coach-generated messages on a 5-point Likert scale. Similarly, the average rating for how well the messages matched the requested TTM stage of change was 3.5 for the human-written messages and 4.1 for the MHC-Coach-generated messages. The most pronounced difference occurred in the pre-contemplation stage, where the human-written messages received an effectiveness rating of 2.2 and a match rating of 2.2, compared to the MHC-Coach-generated messages, which scored 4.3 for effectiveness and 3.1 for the stage match. A detailed breakdown of these ratings across all stages of change is presented in Figure 5.

Discussion

Digital health offers the potential of a paradigm shift in scalability and individualization for physical activity coaching.43 Here, we developed and tested a tool, MHC-Coach, with substantial potential to address the gap in personalization via fine-tuning of a large language model to incorporate an evidence-based and validated psychological framework for behavior change. We show that MHC-Coach can provide autonomous, proactive, and personalized text-based coaching interventions, based on a user’s underlying behavioral psychology. Finally, we showed that MHC-Coach-generated messages were judged preferable by both behavioral science experts and participants in the My Heart Counts Cardiovascular Health Study, as compared to human expert-curated messages.

A key aspect of this work is demonstrating how fine-tuning LLMs with behavioral science principles enhances their ability to drive behavior change. Generic LLMs, despite their fluency and coherence, lack an inherent understanding of psychological motivation, resulting in responses that may be readable but not effective for long-term engagement. By embedding the TTM framework into an LLM’s latent space, the MHC-Coach model delivers structured, stage-specific coaching tailored to users’ behavioral readiness, surpassing both the base LLM and the human expert messages it was trained on. Moreover, with behavioral science as a backbone, future iterations of MHC-Coach can integrate existing personalization techniques based on activity habits, environmental context, and language preferences to further tailor interventions to individual needs.24,44–46 This could potentially bridge the gap and allow for ever-more-personalized and proactive LLM-generated health coaching at scale.

We further note that our study is unique amongst studies of LLM and health behavior for its scale and recruitment approach, reaching and engaging 632 MHC users seeking health interventions in the final analysis. This contrasts with the majority of preceding LLM-based health coaching studies, which relied on small, pre-selected groups of fewer than 42 annotators.22,23 We have collected user preference data and have made it openly available for further development of LLM-based health coaching leveraging the TTM framework (see Methods).

A few limitations of our study should be noted. First, our cohort reflects the user characteristics of the original My Heart Counts Cardiovascular Health Study and may not fully capture the diversity of the broader United States population.25–30 Moreover, MHC-Coach-generated messages were in English and hence restricted participation to English speakers. Research is ongoing to assess the efficacy of MHC-Coach-generated messages in non-English speakers, of which there are significant numbers within the United States. Finally, while MHC-Coach was fine-tuned on scientific health data to enhance accuracy, the possibility of hallucinating false or misleading information remains, a critical concern for health-related applications.47,48

In summary, MHC-Coach demonstrates the promise of integrating behavioral science frameworks with LLMs to deliver scalable, stage-specific health coaching. In contrast to generic LLMs, MHC-Coach goes beyond providing information about behavioral science and exercise; it operationalizes these principles in the language it generates, ensuring that coaching messages align with motivation-sensitive, stage-specific interventions. This study builds a foundation for future efforts, which include integrating MHC-Coach into the My Heart Counts smartphone application, expanding coaching to multilingual and culturally diverse settings, incorporating real-time activity data for automated stage classification, and evaluating long-term adherence and cardiovascular health outcomes through longitudinal studies.

Methods

1. Message Generation

1.1 Data

Our fine-tuning dataset consists of motivational messages based on the TTM framework, designed to promote exercise behavior change across the five stages of change, collected through a separate study.32 A total of 3,268 messages were included in this dataset, developed by 25 human experts, including fitness coaches, behavioral coaches, and researchers in health psychology, all of whom were native English speakers.32

In addition to these expert-designed messages, we incorporated content from manuscripts focused on the effects of regular exercise on cardiovascular health. The manuscripts1,35–41 contributed supplementary data that enriched the training corpus of MHC-Coach with evidencebased insights into the connection between physical activity and cardiovascular health outcomes, ensuring that the generated messages were scientifically grounded.

The combined fine-tuning dataset, consisting of human expert messages, content extracted from the research papers, and detailed information about the TTM model and characteristics of users in each phase,10,42 contained a total of 215,553 tokens. This dataset was utilized to finetune MHC-Coach, enabling it to generate contextually relevant and personalized messages for behavior change in physical activity (see Figure 1). By integrating expert-designed messages with research-based content, we aimed to ensure accurate health information aligned with clinical and behavioral science guidelines, while grounding the language in evidence-based psychology tailored to each TTM stage.

1.2 Language Model

LLaMA 3-70B31 was used as the primary LLM architecture in this study. LLaMA 3-70B is a decoder-only transformer model with >70 billion parameters, optimized for diverse tasks such as reasoning, instruction following, and dialogue generation. For our experiments, we utilized LLaMA 3-70B with the following hyperparameters: temperature of 0.8 for response diversity, maximum token length of 800, and top-p (nucleus sampling) set to 0.9.

1.3 Fine-tuning

We used an instruction-tuning49 variant to fine-tune LLaMA 3-70B on our dataset. Given the unstructured nature of the research papers and the need to align expert messages with the appropriate stage of change, we developed a custom template to guide the fine-tuning process. This custom template involved structuring the input data using a template designed to train the model to generate stage-specific motivational messages. Each example was designed with two key components: (1) a prompt section, describing the user’s stage of change and requesting a tailored motivational message, (2) a stage section specifying the user’s stage of change, and (3) a completion section, containing the expert-designed motivational message as the target response. To ensure clarity and consistency, a designated token was used to separate messages (see Figure 1). To provide additional context, unstructured text from manuscripts detailing the impact of regular exercise on cardiovascular health was appended to the training set,1,35–41 along with a comprehensive explanation of the TTM and the characteristics of individuals at each stage of change.10,42

We used Low-Rank Adaptation50 for fine-tuning, a technique that updates pre-trained models. This approach enables efficient fine-tuning by adding trainable low-rank matrices as additional parameters, leaving the original model’s weights unchanged. Fine-tuning was performed using a learning rate of 1 × 10−5, a batch size of 4, and 2 epochs. These parameters were determined through a grid-based hyperparameter search.

2. Expert Evaluation of Coaching Messages

We engaged two behavioral science experts to independently evaluate and rate both the fine-tuned MHC-Coach-generated messages and human-expert written messages. The evaluators were blinded as to whether a given message was MHC-Coach- or human-generated. They assessed each message on a 5-point Likert scale across two criteria: (1) the perceived effectiveness of the message in promoting increased physical activity and (2) the degree to which the message aligned with the recipient’s current stage of change.

3. Large-Scale User Study Design and Participant Recruitment

3.1 Ethics

Ethical approval for the study was obtained from Stanford University’s Research Compliance Office (IRB-75836).

3.2 Participant Recruitment and Eligibility Criteria

The My Heart Counts (MHC) smartphone application was first launched in March 2015 via the Apple App Store for United States, United Kingdom, and Hong Kong residents.25–30 Individuals aged 18 years or older who could read and understand English were eligible to participate in the My Heart Counts Cardiovascular Health Study.

Survey invitations were sent via email to MHC users who had indicated an openness to being contacted for future research. Survey responses were collected using the Qualtrics survey platform. Respondents who did not complete the full survey were excluded from the analysis, as complete responses were required for group-level analyses. Of the initial 1,004 respondents, 632 met the inclusion criteria, completed the entire survey, and were retained for the final analyses (see Figure 3).

3.3 Survey Design

Throughout the survey, participants evaluated messages using an A/B forced-choice format, comparing pairs of messages—one human expert-designed and one generated by MHC-Coach—based on which they found more empathetic, realistic, actionable, and encouraging. The messages designed by human experts used for comparison were sourced from the same crowdsourcing study that provided the dataset for fine-tuning, involving 25 human specialists in fitness, behavior, and health psychology.32 Examples of both human expert-crafted and MHC-Coach-generated messages that users compared are provided in Supplemental Table 1. The first set of questions asked participants to choose their preferred text-based coaching prompt in an A/B forced-choice set of five questions featuring general motivational messages.

These messages were created by a human expert or generated by MHC-Coach with the sole objective of encouraging regular physical activity, independent of the stage of change.

Participants were then categorized into one of five stages of change based on their responses to an evidence-based question for TTM stage stratification assessing the duration and consistency of their regular exercise habits.51 Regular exercise was defined as planned physical activities—such as brisk walking, aerobics, jogging, bicycling, swimming, or rowing—performed 3 to 5 times per week for 20–60 minutes per session to improve physical fitness.51 After categorization, participants answered five questions comparing messages tailored to their specific stage of change. In this section, MHC-Coach-generated messages were crafted to correspond to the motivational and behavioral traits specific to each stage of change (see Figure 1).

3.4 Statistical Analysis

The Chi-squared test was used to statistically evaluate differences in participants’ preferences for human expert-designed versus MHC-Coach-generated messages, under the null hypothesis that preferences for both types of messages were equal. Analyses were conducted across the five specific stages of change (pre-contemplation, contemplation, preparation, action, and maintenance) and for generic motivational messages. To account for the six comparisons performed (five stage-specific tests and one generic message test), a Bonferroni correction was applied to control the family-wise error rate, adjusting the significance threshold from α = 0.05 to αadjusted = 0.0083 (0.05/6). Participants were classified as preferring MHC-Coach if they selected a majority of MHC-Coach-generated messages (3), and as preferring the human expert if they selected a majority of human expert-generated messages (3) in A/B forced-choice questions in analyses of the general motivational and TTM stage-specific question sets.

Data Availability

The health coaching datasets generated and analyzed during the study are available on GitHub (https://github.com/SriyaM/MHC_LLM_Preference_Data). The dataset has been anonymized to ensure participant privacy, and no personally identifiable information is included. The repository includes code for data processing and customization to enable further analyses. Both the data and code are provided under the MIT License, allowing users to modify, reuse, and redistribute the materials in accordance with the license terms.

https://github.com/SriyaM/MHC_LLM_Preference_Data

4. Data and Code Availability

The health coaching datasets generated and analyzed during the study are available on GitHub (https://github.com/SriyaM/MHC_LLM_Preference_Data). The dataset has been anonymized to ensure participant privacy, and no personally identifiable information is included. The repository includes code for data processing and customization to enable further analyses. Both the data and code are provided under the MIT License, allowing users to modify, reuse, and redistribute the materials in accordance with the license terms.

Conflicts of Interest

D.S.K. reports grant support from Amgen and the Bristol Myers Squibb Foundation (via the Robert A. Winn Diversity in Clinical Trials Career Development Award). F.R. reports equity from Carta Healthcare and HealthPals, and consulting fees from HealthPals, Novartis, Novo Nordisk, Esperion Therapeutics, Movano Health, Kento Health, Inclusive Health, Edwards, Arrowhead Pharmaceuticals, HeartFlow, and iRhythm outside the submitted work. E.A.A. reports advisory board fees from SequenceBio, Foresite Labs, Pacific Biosciences, and Versant Ventures. E.A.A. has ownership interest in Personalis, Deepcell, Svexa, Candela, Parameter Health, Saturnus Bio, outside the submitted work. E.A.A. is a non-executive director of AstraZeneca and Svexa. E.A.A. receives collaborative research support from Illumina, Pacific Biosciences, Oxford Nanopore, Cache, Cellsonics, outside the submitted work. The remainder of the authors report no potential conflicts of interest.

Funding

The funders had no role in the interpretation of data or publication. D.S.K. is supported by the Wu-Tsai Human Performance Alliance as a Clinician-Scientist Fellow, the Stanford Center for Digital Health as a Digital Health Scholar, the Pilot Grant from the Stanford Center for Digital Health, the Robert A. Winn Diversity in Clinical Trials Career Development Award, NIH 1L30HL170306, and the American Diabetes Association (ADA) Pathway to Stop Diabetes Initiator Award (7-25-INI-11). F.R. was funded by grants from the NIH National Heart, Lung, and Blood Institute (R01HL168188; R01HL167974, R01HL169345), the American Heart Association/Harold Amos Medical Faculty Development Program, and the Doris Duke Foundation (Grant #2022051). EL is supported by NIH grants K24AR075060, K12TR004930 and R01AR082109. ACK is supported by NIH grants 5R01AG07149002 and 1R44AG071211, and a seed grant provided by the Stanford Institute for Human-Centered Artificial Intelligence at Stanford University.

Supplemental Materials

View this table:
  • View inline
  • View popup
  • Download powerpoint
Supplemental Table 1:

Examples of messages generated by the fine-tuned LLM Health Coach and human experts, tailored to different stages of behavior change. Expert messages were gathered from a crowdsourcing study involving 25 specialists in fitness, behavior, and health psychology.

References

  1. 1.↵
    Gibbs, B. Physical activity as a critical component of first-line treatment for elevated blood pressure or cholesterol: who, what, and how?: a scientific statement from the American Heart Association. Hypertension 78, e26–e37 (2021).
    OpenUrl
  2. 2.↵
    Booth, F. W., Roberts, C. K., Thyfault, J. P., Ruegsegger, G. N. & Toedebusch, R. G. Role of inactivity in chronic diseases: Evolutionary insight and pathophysiological mechanisms. Physiol. Rev. 97, 1351–1402 (2017).
    OpenUrlCrossRefPubMed
  3. 3.↵
    Nazik, E. A. & Kramarow, C. Physical Activity Among Adults Aged 18 and Over: United States, 2020. Centers for Disease Control and Prevention (2020).
  4. 4.↵
    Hall, K. S. et al. Systematic review of the prospective association of daily step counts with risk of mortality, cardiovascular disease, and dysglycemia. Int. J. Behav. Nutr. Phys. Act. 17, 78 (2020).
    OpenUrlCrossRefPubMed
  5. 5.↵
    Tudor-Locke, C. et al. How many steps/day are enough? For older adults and special populations. Int. J. Behav. Nutr. Phys. Act. 8, 80 (2011).
    OpenUrlCrossRefPubMed
  6. 6.↵
    Monteiro-Guerra, F., Rivera-Romero, O., Fernandez-Luque, L. & Caulfield, B. Personalization in real-time physical activity coaching using mobile applications: A scoping review. IEEE J. Biomed. Health Inform. 24, 1738–1751 (2020).
    OpenUrlPubMed
  7. 7.
    Bäccman, C., Bergkvist, L. & Wästlund, E. Personalized coaching via texting for behavior change to understand a healthy lifestyle intervention in a naturalistic setting: Mixed methods study (preprint). JMIR Preprints (2023) doi:10.2196/preprints.47312.
    OpenUrlCrossRef
  8. 8.↵
    Brink, S. M., Wortelboer, H. M., Emmelot, C. H., Visscher, T. L. S. & van Wietmarschen, H. A. Developing a personalized integrative obesity-coaching program: A systems health perspective. Int. J. Environ. Res. Public Health 19, 882 (2022).
    OpenUrlPubMed
  9. 9.↵
    Sucala, M., Ezeanochie, N. P., Cole-Lewis, H. & Turgiss, J. An iterative, interdisciplinary, collaborative framework for developing and evaluating digital behavior change interventions. Transl. Behav. Med. 10, 1538–1548 (2020).
    OpenUrlPubMed
  10. 10.↵
    Prochaska, J. O. & Velicer, W. F. The transtheoretical model of health behavior change. Am. J. Health Promot. 12, 38–48 (1997).
    OpenUrlCrossRefPubMedWeb of Science
  11. 11.↵
    Hashemzadeh, M., Rahimi, A., Zare-Farashbandi, F., Alavi-Naeini, A. M. & Daei, A. Transtheoretical model of health behavioral change: A systematic review. Iran. J. Nurs. Midwifery Res. 24, 83–90 (2019).
    OpenUrlPubMed
  12. 12.↵
    Liu, K. T., Kueh, Y. C., Arifin, W. N., Kim, Y. & Kuan, G. Application of transtheoretical model on behavioral changes, and amount of physical activity among university’s students. Front. Psychol. 9, 2402 (2018).
    OpenUrlPubMed
  13. 13.
    Pirzadeh, A., Mostafavi, F., Ghofranipour, F. & Feizi, A. Applying transtheoretical model to promote physical activities among women. Iran. J. Psychiatry Behav. Sci. 9, e1580 (2015).
    OpenUrl
  14. 14.↵
    Jiménez-Zazo, F., Romero-Blanco, C., Castro-Lemus, N., Dorado-Suárez, A. & Aznar, S. Transtheoretical Model for physical activity in older adults: Systematic review. Int. J. Environ. Res. Public Health 17, 9262 (2020).
    OpenUrlPubMed
  15. 15.↵
    Longhini, J. et al. Wearable devices to improve physical activity and reduce sedentary behaviour: An umbrella review. Sports Med. Open 10, 9 (2024).
    OpenUrlPubMed
  16. 16.↵
    Harries, T. et al. Effectiveness of a smartphone app in increasing physical activity amongst male adults: a randomised controlled trial. BMC Public Health 16, (2016).
  17. 17.↵
    Laranjo, L. et al. Do smartphone applications and activity trackers increase physical activity in adults? Systematic review, meta-analysis and metaregression. Br. J. Sports Med. 55, 422–432 (2021).
    OpenUrlAbstract/FREE Full Text
  18. 18.↵
    Mohd Saad, N., Mohamad, M. & Mat Ruzlin, A. N. Web-based intervention to act for weight loss in adults with type 2 diabetes with obesity (Chance2Act): Protocol for a nonrandomized controlled trial. JMIR Res. Protoc. 13, e48313 (2024).
    OpenUrl
  19. 19.↵
    Takahashi, A. et al. Effectiveness of a lifestyle improvement support app in combination with a wearable device in Japanese people with type 2 diabetes mellitus: STEP-DM study. Diabetes Ther. 15, 1187–1199 (2024).
    OpenUrlPubMed
  20. 20.↵
    Zhou, Y. et al. Towards enhancing health coaching dialogue in low-resource settings. arXiv [cs.CL] (2024).
  21. 21.↵
    Ong, Q. C. et al. Advancing health coaching: A comparative study of large language model and health coaches. Artif. Intell. Med. 157, 103004 (2024).
    OpenUrlPubMed
  22. 22.↵
    Hegde, N. et al. Infusing behavior science into large language models for activity coaching. PLOS Digit. Health 3, e0000431 (2024).
    OpenUrl
  23. 23.↵
    Jörke, M. et al. Supporting physical activity behavior change with LLM-based conversational agents. arXiv [cs.HC] (2024).
  24. 24.↵
    Nahum-Shani, I. et al. Just-in-time adaptive interventions (JITAIs) in mobile health: Key components and design principles for ongoing health behavior support. Ann. Behav. Med. 52, 446–462 (2018).
    OpenUrlCrossRefPubMed
  25. 25.↵
    Johnson, A. et al. Mobile health study incorporating novel fitness test. J. Cardiovasc. Transl. Res. 16, 569–580 (2023).
    OpenUrlPubMed
  26. 26.↵
    Shcherbina, A. et al. The effect of digital physical activity interventions on daily step count: a randomised controlled crossover substudy of the MyHeart Counts Cardiovascular Health Study. Lancet Digit. Health 1, e344–e352 (2019).
    OpenUrl
  27. 27.
    Kim, D. S. et al. Data release and baseline analysis of the largest collection of 6-minute walk tests: the MyHeart Counts Cardiovascular Health Study, a fully digital smartphone platform. medRxiv (2024) doi:10.1101/2024.06.26.24309535.
    OpenUrlAbstract/FREE Full Text
  28. 28.
    Hershman, S. G. et al. Physical activity, sleep and cardiovascular health data for 50,000 individuals from the MyHeart Counts Study. Sci. Data 6, 24 (2019).
    OpenUrlPubMed
  29. 29.↵
    McConnell, M. V. et al. Feasibility of obtaining measures of lifestyle from a smartphone app: The MyHeart Counts cardiovascular health study. JAMA Cardiol. 2, 67–76 (2017).
    OpenUrlPubMed
  30. 30.↵
    Javed, A. et al. Personalized digital behaviour interventions increase short-term physical activity: a randomized control crossover trial substudy of the MyHeart Counts Cardiovascular Health Study. Eur. Heart J. Digit. Health 4, 411–419 (2023).
    OpenUrlPubMed
  31. 31.↵
    Grattafiori, A. et al. The Llama 3 herd of models. arXiv [cs.AI] (2024).
  32. 32.↵
    de Vries, R. A. J., Truong, K. P., Kwint, S., Drossaert, C. H. C. & Evers, V. Crowddesigned motivation. in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (ACM, New York, NY, USA, 2016). doi:10.1145/2858036.2858229.
    OpenUrlCrossRef
  33. 33.↵
    Meywirth, S., Janson, A. & Söllner, M. Personalized coaching for lifestyle behavior change through large language models: A qualitative study. (2025).
  34. 34.↵
    Lim, S., Schmälzle, R. & Bente, G. Artificial social influence: Rapport-building, LLM-based embodied conversational agents for health coaching. (2024).
  35. 35.↵
    Gonzalez-Jaramillo, N. et al. Systematic review of physical activity trajectories and mortality in patients with coronary artery disease. J. Am. Coll. Cardiol. 79, 1690–1700 (2022).
    OpenUrlPubMed
  36. 36.
    Ding, D. et al. Towards better evidence-informed global action: lessons learnt from the Lancet series and recent developments in physical activity and public health. Br. J. Sports Med. 54, 462–468 (2020).
    OpenUrlAbstract/FREE Full Text
  37. 37.
    Kraus, W. E. et al. Daily step counts for measuring physical activity exposure and its relation to health. Med. Sci. Sports Exerc. 51, 1206–1212 (2019).
    OpenUrl
  38. 38.
    Hawes, A. M. et al. Disentangling race, poverty, and place in disparities in physical activity. Int. J. Environ. Res. Public Health 16, 1193 (2019).
    OpenUrlPubMed
  39. 39.
    Rippe, J. M. Physical activity and health: Key findings from the physical activity guidelines for Americans 2018. in Lifestyle Medicine, Fourth Edition 200–208 (CRC Press, Boca Raton, 2024). doi:10.1201/9781003227793-20.
    OpenUrlCrossRef
  40. 40.
    Althoff, T. et al. Large-scale physical activity data reveal worldwide activity inequality. Nature 547, 336–339 (2017).
    OpenUrlCrossRefPubMed
  41. 41.↵
    Carlson, S. A., Fulton, J. E., Pratt, M., Yang, Z. & Adams, E. K. Inadequate physical activity and health care expenditures in the United States. Prog. Cardiovasc. Dis. 57, 315–323 (2015).
    OpenUrlCrossRefPubMed
  42. 42.↵
    Velicer, W. F., Prochaska, J. O., Fava, J. L., Norman, G. J. & Redding, C. A. Detailed overview of the transtheoretical model. Homeostasis 38, 216–233 (1998).
    OpenUrl
  43. 43.↵
    Kim, D. S., Eltahir, A. A., Ngo, S. & Rodriguez, F. Bridging the gap: How accounting for social determinants of health can improve digital health equity in cardiovascular medicine. Curr. Atheroscler. Rep. 27, 9 (2024).
    OpenUrlPubMed
  44. 44.↵
    Choi, Y. J., Lee, M. & Lee, S. Toward a multilingual conversational agent: Challenges and expectations of code-mixing multilingual users. in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems vol. 85 1–17 (ACM, New York, NY, USA, 2023).
    OpenUrl
  45. 45.
    Chatterjee, A., Pahari, N., Prinz, A. & Riegler, M. AI and semantic ontology for personalized activity eCoaching in healthy lifestyle recommendations: a meta-heuristic approach. BMC Med. Inform. Decis. Mak. 23, 278 (2023).
    OpenUrlPubMed
  46. 46.↵
    Mandyam, A., Jörke, M., Denton, W., Engelhardt, B. E. & Brunskill, E. Adaptive interventions with user-defined goals for health behavior change. arXiv [cs.LG] (2023).
  47. 47.↵
    Huang, L. et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. (2024). doi:10.1145/3703155.
    OpenUrlCrossRef
  48. 48.↵
    Ahmad, M., Yaramic, I. & Roy, T. D. Creating trustworthy LLMs: Dealing with hallucinations in healthcare AI. arXiv [cs.CL] (2023). doi:10.20944/preprints202310.1662.v1.
    OpenUrlCrossRef
  49. 49.↵
    Wei, J. et al. Finetuned language models are zero-shot learners. arXiv [cs.CL] (2021).
  50. 50.↵
    Hu, E. J. et al. LoRA: Low-Rank Adaptation of large language models. arXiv [cs.CL] (2021).
  51. 51.↵
    Marcus, B. H., Selby, V. C., Niaura, R. S. & Rossi, J. S. Self-efficacy and the stages of exercise behavior change. Res. Q. Exerc. Sport 63, 60–66 (1992).
    OpenUrlCrossRefPubMedWeb of Science
Back to top
PreviousNext
Posted February 21, 2025.
Download PDF
Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Fine-tuning Large Language Models in Behavioral Psychology for Scalable Physical Activity Coaching
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Fine-tuning Large Language Models in Behavioral Psychology for Scalable Physical Activity Coaching
Sriya Mantena, Anders Johnson, Marily Oppezzo, Narayan Schuetz, Alexander Tolas, Ritu Doijad, C. Mikael Mattson, Allan Lawrie, Mariana Ramirez-Posada, Eleni Linos, Abby C. King, Fatima Rodriguez, Daniel Seung Kim, Euan A. Ashley
medRxiv 2025.02.19.25322559; doi: https://doi.org/10.1101/2025.02.19.25322559
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Fine-tuning Large Language Models in Behavioral Psychology for Scalable Physical Activity Coaching
Sriya Mantena, Anders Johnson, Marily Oppezzo, Narayan Schuetz, Alexander Tolas, Ritu Doijad, C. Mikael Mattson, Allan Lawrie, Mariana Ramirez-Posada, Eleni Linos, Abby C. King, Fatima Rodriguez, Daniel Seung Kim, Euan A. Ashley
medRxiv 2025.02.19.25322559; doi: https://doi.org/10.1101/2025.02.19.25322559

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Cardiovascular Medicine
Subject Areas
All Articles
  • Addiction Medicine (430)
  • Allergy and Immunology (755)
  • Anesthesia (221)
  • Cardiovascular Medicine (3288)
  • Dentistry and Oral Medicine (364)
  • Dermatology (277)
  • Emergency Medicine (479)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1169)
  • Epidemiology (13359)
  • Forensic Medicine (19)
  • Gastroenterology (898)
  • Genetic and Genomic Medicine (5147)
  • Geriatric Medicine (481)
  • Health Economics (782)
  • Health Informatics (3264)
  • Health Policy (1140)
  • Health Systems and Quality Improvement (1190)
  • Hematology (429)
  • HIV/AIDS (1017)
  • Infectious Diseases (except HIV/AIDS) (14622)
  • Intensive Care and Critical Care Medicine (912)
  • Medical Education (476)
  • Medical Ethics (127)
  • Nephrology (522)
  • Neurology (4919)
  • Nursing (262)
  • Nutrition (727)
  • Obstetrics and Gynecology (882)
  • Occupational and Environmental Health (795)
  • Oncology (2519)
  • Ophthalmology (723)
  • Orthopedics (280)
  • Otolaryngology (347)
  • Pain Medicine (323)
  • Palliative Medicine (90)
  • Pathology (543)
  • Pediatrics (1299)
  • Pharmacology and Therapeutics (550)
  • Primary Care Research (556)
  • Psychiatry and Clinical Psychology (4205)
  • Public and Global Health (7499)
  • Radiology and Imaging (1704)
  • Rehabilitation Medicine and Physical Therapy (1011)
  • Respiratory Medicine (980)
  • Rheumatology (479)
  • Sexual and Reproductive Health (497)
  • Sports Medicine (424)
  • Surgery (547)
  • Toxicology (72)
  • Transplantation (235)
  • Urology (205)