Substantial SF-36 score differences according to the mode of administration of the questionnaire: an ancillary study of the SENTIPAT multicenter randomized controlled trial comparing web-based questionnaire self-completion and telephone interview

Background SF-36 is a popular questionnaire for measuring self-perception of quality of life in a given population of interest. Surprisingly, no study compared score values issued from a telephone interview versus an internet-based questionnaire self-completion. Methods Patients having an Internet connection and returning home after hospital discharge were enrolled in the SENTIPAT multicenter randomized trial the day of discharge. They were randomized to either self-complete a set of questionnaires using a dedicated website (I group) or to provide answers to the same questionnaires administered during a telephone interview (T group). This ancillary study of the trial compared SF-36 data relating to the post-hospitalization period in these two groups. In order to anticipate potential unbalanced characteristics of the respondents in the two groups, the impact of the mode of administration of the questionnaire on score differences was investigated using a matched sample of individuals originating from I and T groups (ratio 1:1), the matching procedure being based on a propensity score approach. SF-36 scores observed in I and T groups were compared with a Wilcoxon-Mann-Whitney test, the score differences between the two groups were also examined according to Cohen's effect size. Results There were 245/840 (29%) and 630/840 (75%) SF-36 questionnaires completed in the I and T group, respectively (p < 0.001). Globally, score differences between groups before matching were similar to those observed in the matched sample. Mean scores observed in T group were all above the corresponding values observed in the I group. After matching, score differences in six out of the eight SF-36 scales were statistically significant, with a mean difference greater than 5 for four scales and an associated mild effect size ranging from 0.22 to 0.29, and with a mean difference near this threshold for two other scales (4.57 and 4.56) and a low corresponding effect size (0.18 and 0.16, respectively). Conclusions Telephone mode of administration of SF-36 involved an interviewer effect increasing SF-36 scores. Questionnaire self-completion via the Internet should be preferred and surveys combining various administration methods should be avoided. Trial Registration ClinicalTrials.gov NCT01769261, registered January 16, 2013.

-4 -since then, with the spread of wide internet and computers, several other computerized or internet based formats have been applied in different studies [10][11][12].
Several randomized trials compared the SF-36 scores issued from different administration modes, such as paper versus internet [13][14][15][16][17] or telephone versus paper [18][19][20][21][22][23][24][25][26]. Telephone interview is a common mode of questionnaire administration for several reasons, including the potential to increase response rate [24,26], practical convenience if other data of the study are already being collected via telephone, and exploring quality of life in some special populations such as very elderly patients. On the other hand, self-completion via the Internet has advantages like avoiding any potential response bias related to interviewer effect [18], being potentially a simpler organization for collecting SF-36 data, and associated with lower costs. However, and surprisingly, to our knowledge, no study compared telephone interview versus internetbased auto-questionnaire methods for collecting SF-36 data to investigate whether they can be used as alternative methods in the mixed-mode data collection procedures according to participant preferences and/or to minimize the possible selection bias.
In this context, the multicenter SENTIPAT (the concept of sentinel patients who would voluntarily report their health evolution on a dedicated website) randomized trial [27][28][29] is the first multicenter randomized trial comparing the Internet against telephone interviews as the methods of administrating several questionnaires on the health evolution of hospitalized patients. The aim of the present work is to compare SF-36 data relating to the 6-weeks post-discharge period of hospitalized patients, collected either via the Internet or through telephone interviews in the SENTIPAT trial. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint -6 -corresponding questionnaire administration. A total of 1680 eligible patients (840 randomized in the I group and T group each) were enrolled in the SENTIPAT trial between February 25, 2013 and September 8, 2014.

Survey administration
Patients of the I group had access to the SF-36 questionnaire 40 days after discharge on a web site dedicated to SENTIPAT. Patients of the T group were interviewed by telephone approximately 42 days after discharge and the data entered to a similar web site interface as used in the I group. In case of nonresponse, reminding emails were sent in the I group while up to three calls were tried whenever the first call did not reach the patient in the T group.

SF-36 score calculations
The eight scale scores and the two summary scores of SF-36 were calculated according to MOS SF-36 French scoring manual [30]. The scale score calculations were done for the multi-item scales only if at least half of the items were answered and the missing item data, if existed, were treated with a person-specific approach which uses the average score of the completed items in the same scale.

Statistical analyses
Bivariate analyses were performed using Fisher exact test or Chi-Square test of independence for the categorical variables, and the Wilcoxon-Mann-Whitney test for the quantitative variables. The latter test was notably used for comparing the SF-36 score differences between I and T groups. Several authors have discussed the task of interpreting observed differences in terms of "clinically meaningful" differences [31][32][33]. In this study, in addition to the above-mentioned statistical test, SF-36 score differences between I and T groups were also examined at the light of two popular approaches: on the one hand, effect size of the difference was considered according to . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Cohen's effect size index [34]; on the other hand, we considered a threshold difference of five points, as was proposed by Ware et al [33] for defining a clinically and socially relevant difference between two compared scores. Internal reliability of the SF-36 was evaluated by Cronbach's alpha coefficient calculation for the eight scales, and was considered as acceptable if the alpha value was > 0.7.
To determine the reasons for the potential score differences between the groups, a responder sample was composed of internet responders matched to telephone responders according to a propensity score-based procedure : the R package MatchIt [35] was used for matching each internet responder to the nearest telephone responder with a 1:1 ratio, and we forced each pair to be strictly identical according to three qualitative variables, sex (male / female), type of hospitalization (conventional / weekly / day-care hospitalization), and hospital ward (general and digestive surgery / gastroenterology and nutrition / hepato-gastroenterology / infectious and tropical diseases / internal medicine). The following baseline variables were additionally included in the logistic regression modeling the propensity score (propensity for being an internet responder versus being a telephone responder): age, length of stay, education, employment (unemployed because of health / retired or unemployed / jobseeker, employed, student), income, relationship status, and type of health insurance. Figure 1 presents the flowchart of the study and Table 1  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Results
The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint In terms of internal validity of questionnaire completion, Cronbach's alpha values calculated for each of the eight scales composing the SF-36 form in the I and T groups (see Supplemental Table) were all > 0.7, the threshold value considered as acceptable.
The matching procedure matched the 245 respondents of group I-no individual was dropped-with 245 individuals of group T. The standardized mean difference of the global distance between I and T was 0.4167 and 0.0215 before and after matching, respectively, with a corresponding balance improvement of 95%. Figure 2 details the standardized mean differences between I and T groups observed on baseline variables, before and after the matching procedure. The differences between I and T groups before matching were globally dramatically dropped after matching, indicating that the matching procedure successfully yielded two populations I and T highly comparable in terms of the baseline variables. Figure 3 shows the differences between I and T groups, before and after matching, for the eight scales and the two summary measures composing SF-36. Figure 3 indicates that the matching procedure had a limited impact on the differences observed between I and T in each of the components of SF-36: regardless of the value of the difference before matching, the corresponding difference after matching appeared similar.
Importantly, the means observed in the Telephone group were all above the corresponding values observed in the Internet group. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint above-mentioned 6 differences were all statistically significant (see Table 2). In contrast, the observed mean difference between T and I was low for the remaining two scales (0.04 and 0.82 for GH and VT, respectively), and not significant. When examining the physical and the mental component summary, the difference was 0.99 and 2.72, respectively, the latter difference being statistically significant and with an associated effect size at 0.25.

Discussion
To our knowledge, this study is the first reported to date that compared SF-36 questionnaire data collected either via a telephone interview or via a self-completion on a dedicated internet website. The study has additional strengths such as the fact that it is based on a randomized trial, with a substantial number of patients included both arms, a large patient case-mix variability (patients originating from 5 very different hospital wards). The main limitation of the study concerns the selection bias related to respondent status in both arms, but such a bias is inherent to the two corresponding modes of administration, and we tried to mitigate this bias as much as possible by conducting a part of the analyses in a matched sub-population. The detailed analysis comparing the scores observed in the whole set of respondents (before matching) and in a sub-population enhancing the similarity of the individuals compared (after matching) constitutes an important strength of the study.
Despite the reminders sent to the patients, Internet group response rate (29%) to survey was dramatically lower than that of the Telephone group (75%) but still within the range of a meta-analysis on Web-based surveys that reported a median participation rate at 27% [36]. While no study compared telephone and internet administration modes for SF-36, two of the four studies that compared telephone and postal mail (paper) administration resulted in higher participation rate in the paper group [18,23] and the . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint other two had the opposite result [24,26]. In addition, the participation rates observed in our study are close to those of Basnov et al [13] who reported a lower response rate in the Internet group than that observed in the paper group (23% versus 76%, respectively). In our view, the response rates observed in a survey involving internet versus another method of administration are difficult to interpret and are not generalizable at all: the modes of administration include underlying elements of the whole survey process for which the impact on participation rate is hardly assessable / describable, such as the internet web site design in terms of its attractiveness or convenience, or the detailed procedure for reaching participants by telephone. For example, the relative high rate of participation in the telephone group observed in this study is likely related to the fact that the schedule of the telephone interview was arranged with each participant at the moment of his/her enrollment and that moreover, up to three calls were tried whenever the participant was not reached at the first phone call.
Nevertheless, with a perspective of a rigorous comparison between SF36 estimates issued from the I and T groups, the difference of response rates between groups observed in this study raised concerns in terms of selection bias associated to the responder status. Indeed, the difference in the SF36 estimates observed in these two groups may be mainly due to two features: first, the difference of the mode of is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint respondents is of primary importance, and in order to get more insight into this issue, we developed a procedure in which responders of the Internet group were matched to similar responders of the Telephone group, according to their baseline characteristics, and we further examined how the score differences between the two groups changed in this matched sample, as compared to the score differences observed in the initial unmatched populations. Figure 2 shows that the matching procedure highly succeeded for composing a sample of similar match-paired patients, but the very modest impact of this matching procedure on modifying the initial score differences between the scores in I and T groups (see Figure 3) highly suggests that the score differences between I and T are mainly attributable to the mode of administration strictly speaking, with a very minor impact of selection bias issues. However and importantly, the scores in the T group were always higher than those in the I group ( Figure 3 and Table 2), likely reflecting another type of bias associated with the telephone interview mode of administration: the interviewer effect. Our results are in agreement with previous studies that reported higher SF-36 scores, when administered by telephone compared to those issued from a mailed paper mode of administration [18,21,22,[24][25][26].
Similarly, Lyons et al [37,38] reported higher scores issued from a face-to-face interview administration than those issued from a self-completion of the SF-36 questionnaire. Altogether, our results and those of previous studies suggest that as compared to patient's self-completion, the introduction of an interviewer likely acts as a veil that somehow embellish patient's QoL reported perception. Internet selfcompletion avoids any potential bias of responses related to an interviewer effect [39], and patients are more likely to freely express their opinions [40] on websites covering anonymity than through telephone. Therefore, self-completion (internet or paper) should probably be preferred for collecting SF-36 data, since the involvement of a third . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint party appears to artificially increase the scores. In any case, our study indicates that an accurate comparison of different scores requires at least avoiding modes of administration of SF-36 mixing self-completion and interview.
For all but two scales out of eight, the mean difference of scores between the groups was statistically significant and higher than 4.5 points (Table 2), and several comments have to be made about this statement of fact. It is worth to recall that the misinterpretations of P values are very common [41,42]. A statistically significant score difference is not systematically considered as meaningful by authors [43,44] and Ware et al had initially proposed a 5 points difference between two SF-36 scores as a threshold value for a clinically and socially relevant difference [33]. In our view, considering effect size is an appropriate approach for examining the relevance of score differences because such a perspective takes into account the variability of the measures and not only a rough mean difference threshold. Interestingly, as shown in Table 2, even if there were substantial mean score differences for the majority of the scales between the two different modes of administration, these differences were all related to a small effect size in eight scales and in two summary components of SF-36 according to effect size index classification proposed by Cohen [34]. Cohen defines the small effect size as "noticeably smaller than medium which represents an effect visible to naked eye of a careful observer but also not so small as to be trivial". On the one hand, the effect size perspective considerably softens the relevance of the observed differences between I and T groups, and raises concerns about considering a five points mean difference as the main critical element of comparison between two scores.
Moreover, such results also indicate that in studies involving a substantially variable population, only very large score mean differences would be considered as meaningful when adopting effect size perspective, highly limiting the presumable usefulness of SF-. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint study and most likely attributable to the interviewer effect are not negligible. For example, in patients with chronic C hepatitis, Younossi et al [45] have reported a mean value of RP scale at 74.4 and 79.6 in patients with advanced and none to mild fibrosis, respectively (p = 0.0017). Therefore, the differences for RP scale likely attributable to SF-36 mode of administration observed in the present study (51.5 and 60.9 in group I and T, respectively (p = 0.002), see Table 2) are at least comparable to those attributable to substantial different health states reported in other studies.

Conclusions
As compared to a mode of administration based on telephone interview, the response rate of volunteer patients communicating their SF-36 data via the Internet was much lower, but our study indicates that a substantial proportion of hospitalized patients volunteered for actively documenting their health data via the Internet. Most of all, the study indicates that the telephone interviewer might be viewed as an intermediate subjective pattern in the collection of patient's data, resulting in a non-negligible increase of SF-36 scores. Therefore, self-administration of SF-36 should be preferred, including via the Internet which is likely a low-cost method. Importantly, the results of this study also strongly advocate for avoiding the conduction of surveys combining methods of SF-36 administration mixing self-reporting and interviews.

24.
Perkins JJ, Sanson-Fisher RW: An examination of self-and telephone- is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint

Ethics approvals
The SENTIPAT study was approved by the Comité de Protection des Personnes Ile de   France IX (decision CPP-IDF IX 12-

Availability of data and materials
The datasets analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: no support from any organization for the submitted work ; no financial relationships with any organizations that might have an interest in the submitted work in the previous three years ; no other relationships or activities that could appear to have influenced the submitted work.

Funding
The Assistance Publique-Hôpitaux de Paris (Département de la Recherche Clinique et du Développement) was the trial sponsor.
The SENTIPAT study was funded by grant AOM09213 K081204 from Programme Hospitalier de Recherche Clinique 2009 (Ministère de la Santé). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.08.21251357 doi: medRxiv preprint -27 -The sponsor and the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Contributors
GH had full access to all of the raw data in the study and can take responsibility for the integrity of the data and the accuracy of the data analysis. Study conception and design: GH. Data acquisition: GH. Analysis AA and GH. Interpretation of data: AA, FC, and GH. First draft of the article: AA and GH. All authors approved the final version of the manuscript.