Identifying and preventing fraudulent responses in online public health surveys: Lessons learned during the COVID-19 pandemic

Web-based survey data collection has become increasingly popular, and limitations on in-person data collection during the COVID-19 pandemic have fueled this growth. However, the anonymity of the online environment increases the risk of fraud, which poses major risks to data integrity. As part of a study of COVID-19 and the return to in-person school, we implemented a web-based survey of parents in Maryland, USA, between December 2021 and July 2022. Recruitment relied, in part, on social media advertisements. Despite implementing many existing best practices, the survey was challenged by sophisticated fraudsters. In this paper, we describe our efforts to identify and prevent fraudulent online survey responses and provide specific, actionable recommendations for doing so. Some strategies can be deployed within the web-based data collection platform, such as Internet Protocol (IP) address logging to identify duplicate responses and comparison of client-side and server-side time stamps to identify responses that may have been completed by respondents outside of the survey's target geography. Additional approaches include the use of a 2-stage survey design, repeated within-survey and cross-survey validation questions, the addition of "speed bump" questions to thwart careless or computerized responders, and the use of optional open-ended survey responses to identify irrelevant responses. We describe best practices for ongoing survey data review and verification, including algorithms to simplify aspects of this review.

Introduction

Web-based survey data collection has become increasingly popular, and limitations on in-person data collection during the COVID-19 pandemic have fueled this growth. Internet survey software and data capture systems (e.g., REDCap 12,13, Qualtrics (Qualtrics, Provo, UT)) can reduce effort and expenditures associated with recruiting participants and may, in some applications, assist with accessing populations who may be difficult to reach via other means. 1,2,3 Furthermore, the relative anonymity provided by online surveys may facilitate research involving marginalized communities or when respondents may otherwise be hesitant to disclose sensitive information. 4,5,6

Despite potential benefits related to access, online survey research (particularly if offering incentives for completion) presents an increased risk of fraudulent activity as compared to face-to-face data collection. There is some evidence that fraud in research surveys has increased in recent years. 7,8 While incentives can promote higher survey response and completion, they are also accompanied by an increased risk of interference from fraudulent responses. 2,9 "Fraudsters" have various methods for finding surveys that involve incentives; for example, Meta (Facebook's parent company) has an Ads Library that can help fraudsters find incentivized surveys that are advertised on their social media platforms, such as Facebook and Instagram. This resource can be exploited by fraudsters who may not be the intended target of a survey but may complete the survey solely for the incentive ("professional survey takers") or utilize computer code to rapidly automate the completion of multiple surveys to receive multiple incentives.

We hypothesized that an individual or group of individuals may have posted the public survey link to the eligibility screener to online communities that share information regarding the [...] Table 1.

Table 1: Example of points assigned to various indicators of fraud. (More points were an indication of greater risk.)

| Point Value | Indicators of Possible Fraud |
|---|---|
| 1 pt | Total survey completion time infeasibly short (<8 min, given that pre-testing found valid completion times of 8-20 minutes) |
| 1 pt | Response with same county + ZIP code combination as other responses submitted simultaneously based on timestamp data |
| 1 pt | Grouping of 3+ responses submitted with similar responses to survey questions |
| 1 pt | Unusual email format (e.g., asfs241421@email.com), misspellings, or extraneous letters/numbers (e.g., hannaahsmiithweka3242@email.com) |
| 1 pt | Incorrect "speed bump" question responses |
| 1 pt | Response that indicated an inactive or nonexistent recruitment source (e.g., choosing "Radio Advertisement" when the format was never used) |
| 1 pt | Response with duplicate IP address as an existing response |
| 1 pt | Responses submitted in batches (e.g., 3 responses submitted every 5 minutes) |
| 2 pts | Inconsistency in response/conflicting information within or between survey forms: inconsistency in data reported between the eligibility screener and personalized survey; inconsistency between duplicate questions in the same survey form; inconsistency between related questions (e.g., child age inconsistent with reported grade, such as 5 years old in 8th grade); incorrect or non-existent ZIP codes |
| 2 pts | Response to optional open-ended question text identical to another response |
| 2 pts | Time zone other than Eastern Time Zone |
| 2 pts | Response including an email/address/telephone number already reported by another respondent |
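The additive point scheme in Table 1 can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the algorithm used in the study: the field names (`id`, `duration_min`, `ip`, `speed_bump_ok`, `open_text`, `utc_offset_hours`) are hypothetical, only a subset of the Table 1 indicators is covered, and every response sharing an IP address or open-ended text is flagged, including the first one received.

```python
from collections import Counter

MIN_VALID_MINUTES = 8           # Table 1: pre-testing found valid times of 8-20 min
EASTERN_UTC_OFFSETS = {-5, -4}  # EST / EDT; assumes a UTC offset is recorded


def score_responses(responses):
    """Return {response id: points} for a subset of the Table 1 indicators."""
    ip_counts = Counter(r["ip"] for r in responses)
    # Count normalized open-ended answers, ignoring blank (skipped) answers.
    text_counts = Counter(
        r["open_text"].strip().lower()
        for r in responses
        if r["open_text"].strip()
    )
    scores = {}
    for r in responses:
        points = 0
        if r["duration_min"] < MIN_VALID_MINUTES:
            points += 1  # infeasibly short completion time
        if ip_counts[r["ip"]] > 1:
            points += 1  # IP address shared with another response
        if not r["speed_bump_ok"]:
            points += 1  # incorrect "speed bump" answer
        text = r["open_text"].strip().lower()
        if text and text_counts[text] > 1:
            points += 2  # open-ended text identical to another response
        if r["utc_offset_hours"] not in EASTERN_UTC_OFFSETS:
            points += 2  # time zone other than Eastern
        scores[r["id"]] = points
    return scores


# Made-up example records (IP addresses are from documentation-only ranges).
responses = [
    {"id": "A", "duration_min": 12, "ip": "203.0.113.5",
     "speed_bump_ok": True, "open_text": "", "utc_offset_hours": -5},
    {"id": "B", "duration_min": 3, "ip": "198.51.100.7",
     "speed_bump_ok": False, "open_text": "good survey", "utc_offset_hours": 1},
    {"id": "C", "duration_min": 9, "ip": "198.51.100.7",
     "speed_bump_ok": True, "open_text": "Good survey", "utc_offset_hours": -4},
]
print(score_responses(responses))  # {'A': 0, 'B': 7, 'C': 3}
```

A score of this kind is best treated as a way to prioritize responses for manual review; any cutoff for exclusion would need to be chosen and validated by the study team.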

medRxiv preprint doi: https://doi.org/10.1101/2022.12.12.22283381; this version posted December 13, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

In fielding a survey of parents designed to capture perceptions about the safe return to school during the COVID-19 pandemic in the USA, we encountered a variety of threats to data integrity that required substantial time and effort to identify and address.

Mechanisms to prevent fraud should be an integral consideration during the survey design phase. Survey and database administrators should be aware of these mechanisms and recommend that all surveys implement measures to maintain data integrity and combat fraud. However, it is important to recognize that even existing recommendations can be circumvented by fraudsters with adequate resources. Automated measures to detect indicators of fraudulent activity and bar respondents demonstrating suspicious behavior from completing the survey can be "brute-forced" by fraudsters, who may attempt thousands of responses to determine a pattern of response that allows them to be deemed eligible. Upon determining such patterns, fraudsters will repeatedly utilize these strategies until they are detected. Here we recommend several strategies to limit interference and safeguard data integrity that are deployed 1) within the data collection platform, and 2) during the survey design phase.

Strategies deployed within the data collection platform

Some strategies can be deployed directly through data capture or survey implementation platforms. First, consider omitting words like "survey," "study," or "research" from the text of survey platform weblinks to make them harder for fraudsters to find. Also, if possible, create multiple surveys associated with different web addresses. Cloned copies of the REDCap project allow researchers to rapidly change links as needed. Should the link for one survey instance be [...]

An important way to identify suspicious activity is to validate the information respondents provide, within and/or across survey forms, depending on the study design. Validations within the same survey or screener can also be used to thwart "brute force" efforts to determine study eligibility criteria. For example, we asked participants to choose their county of residence from a [...] and ZIP codes were automatically compared and responses that did not match were deemed ineligible. Validations can also be used between two survey forms. For example, we asked participants to provide their county and ZIP code on both the screener and again on the emailed personalized survey. If the information did not match, responses were flagged as suspicious. We also included the same question more than once on the personalized survey as a data quality/internal consistency check. While participants' answers may vary for reasons unrelated to fraud, consistency is an important data quality indicator that can be tracked and monitored.
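The within-form and cross-form validations described above can be sketched as two small checks. This is an illustrative outline only, not the validation logic configured in the study's data collection platform: the function names, the field names (`county`, `zip`), and the `ZIPS_BY_COUNTY` lookup are hypothetical, and the lookup holds a few placeholder entries where a real check would load a complete county-to-ZIP reference table for the target geography.

```python
# Placeholder entries for illustration only; not a complete or verified
# reference table.
ZIPS_BY_COUNTY = {
    "Baltimore City": {"21201", "21202"},
    "Montgomery": {"20850", "20852"},
}


def county_zip_match(county, zip_code):
    """Within-form check: is the ZIP code listed under the chosen county?"""
    return zip_code.strip() in ZIPS_BY_COUNTY.get(county, set())


def cross_form_flags(screener, survey, fields=("county", "zip")):
    """Cross-form check: list the fields whose answers differ between forms.

    Answers are compared after trimming whitespace and ignoring case.
    """
    return [
        f for f in fields
        if screener[f].strip().lower() != survey[f].strip().lower()
    ]


# Within the screener: a mismatch would make the response ineligible.
print(county_zip_match("Montgomery", "20850"))  # True
print(county_zip_match("Montgomery", "21201"))  # False

# Between the screener and the personalized survey: a non-empty list
# would mark the response as suspicious for later review.
screener = {"county": "Montgomery", "zip": "20850"}
survey = {"county": "montgomery ", "zip": "20852"}
print(cross_form_flags(screener, survey))  # ['zip']
```

Returning the list of mismatched fields, rather than a single yes/no value, keeps a record of which answers disagreed, which may help reviewers separate honest mistakes from systematic fraud.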