Natural Language Processing Journal 4 (2023) 100020

Contents lists available at ScienceDirect

Natural Language Processing Journal

journal homepage: www.elsevier.com/locate/nlp

Employing large language models in survey research

Bernard J. Jansen a,∗, Soon-gyo Jung a, Joni Salminen b

a Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar
b School of Marketing and Communication, University of Vaasa, Finland

A R T I C L E  I N F O

Keywords: Survey research; Large language models; Survey data; Surveys; LLM survey respondents

A B S T R A C T

This article discusses the promising potential of employing large language models (LLMs) for survey research, including generating responses to survey items. LLMs can address some of the challenges associated with survey research regarding question-wording and response bias. They can address issues relating to a lack of clarity and understanding but cannot yet correct for sampling or nonresponse bias challenges. While LLMs can assist with some of the challenges of survey research, at present they need to be used in conjunction with other methods and approaches. With thoughtful and nuanced approaches to development, LLMs can be used responsibly and beneficially while minimizing the associated risks.

1. Introduction

On 31 May 2023, CloudResearch, a survey participant recruitment company, sent out via its company listserv the email message shown in Fig. 1. The email message claimed that CloudResearch had addressed several persistent problems in survey research by engineering billions of simulated but unique human personalities available for behavioral research. No need for humans! CloudResearch's Chief Technology Officer Jonathan Robinson stated, ''Our team has been working on this advancement for years. Survey researchers kept telling us about problems they were having with attention and data quality. It's also always been difficult to find people from hard-to-reach groups.
So, we thought, what if we just got rid of the people altogether? That would solve a lot of problems'' (Moss, 2023). CloudResearch claimed several benefits on its blog from leveraging AI for the creation of survey participants, including (presented in a list format that looks like ChatGPT wrote it) an amazingly low 0.8% margin of error, immediate access, cost savings, superior data quality, perfect results, and expanded reach (Moss, 2023). Although the email message and blog posting were an April Fools' Day joke, the reaction to them from an informal focus group was ''Oh, this is totally possible!'', highlighting the potential near-term impact of large language models (LLMs) on the domain of survey research, which is the topical impact we discuss in this communication paper.

∗ Corresponding author. E-mail addresses: jjansen@acm.org (B.J. Jansen), sjung@hbku.edu.qa (S.-g. Jung), jonisalm@uwasa.fi (J. Salminen).

The debut of ChatGPT and other large language and Generative Pre-trained Transformer (GPT) models has generated significant attention from the natural language processing (NLP) community and nearly every domain that deals with words. These NLP models are trained on massive amounts of text data and can generate human-like text, answer questions, and even engage in conversations. OpenAI's ChatGPT, in particular, has been hailed as a breakthrough in NLP, as it has achieved state-of-the-art performance on a wide range of language tasks. This paper explores some of the potential benefits, drawbacks, and ethical considerations associated with using ChatGPT and other LLMs within particular and vital domains such as survey research. As survey research is one of the most common tools social scientists deploy, the potential ramifications of LLMs could be tremendous. In fact, these ramifications are worth any number of analyses and articles, of which the current manuscript is but one.
The motivational question we address through our analysis is: can generative AI improve survey research?

https://doi.org/10.1016/j.nlp.2023.100020
Received 30 May 2023; Received in revised form 9 June 2023; Accepted 9 June 2023; Available online xxxx.
2949-7191/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Fig. 1. Email message from CloudResearch announcing the creation of virtual panels of survey participants (Moss, 2023).

2. Survey research: Process and challenges

Survey research is a research method that involves collecting data from a sample of individuals by using standardized questionnaires (called surveys or survey instruments). The goal of survey research is to gather information about the attitudes, opinions, beliefs, and behaviors of the targeted population through closed-ended questions (which result in quantitative data) and open-ended ones (which result in qualitative data) (Aldridge, 2001; Braun et al., 2021). Research using surveys can be conducted via telephone, by mail, online, or as in-person interviews. Online surveys are prevalent due to the ease with which they can be implemented and their low cost relative to other modes of collecting data (Jansen et al., 2007; Sue and Ritter, 2012). The data collected from surveys is analyzed using statistical techniques to either identify patterns, relationships, and trends in the data (Bryman and Cramer, 2002) or harness the rich potential of qualitative data through different qualitative analyses (Braun and Clarke, 2013). Survey research is widely used in social science, marketing, information systems, human–computer interaction, and other fields where data on human attitudes and behaviors is needed. Survey analysis and reporting are increasingly leveraging machine learning (ML) for research purposes, as shown in Fig.
2, a dashboard from Survey2Persona (Salminen et al., 2022a), an ML-based survey analysis and visualization system.

3. Survey research and LLMs

Since survey research deals typically with words in the questions, words in the responses, or both, it is natural that LLMs would impact the survey research domain. Several common tasks involved in survey research could be completed through the use of these models.

• For example, designing the survey instrument involves developing the survey questions, response options, item constructs (Salminen et al., 2020), and any other necessary components of the survey instrument — LLMs could help phrase the questions, pinpoint any inconsistencies, and perhaps suggest the best response options to measure respondents' opinions.

• Sampling means selecting a representative sample of individuals from the target population, which can vary depending on the research question and resources available — LLMs could suggest appropriate samples and techniques for recruiting participants. As part of sampling, LLMs can perform intelligent interviewing through conversational AI instead of the conventional survey, where text is read and responded to by the respondents.

• Data cleaning and management is processing and organizing the collected survey data to ensure its accuracy, completeness, and consistency — LLMs could, perhaps, detect inconsistent and uniform selections resulting in low-quality entries by analyzing closed-ended responses, and identify gibberish and spelling mistakes in open-ended responses.

• Data analysis uses statistical and qualitative methods to analyze the survey data and identify patterns, relationships, and trends in the data — there are already social media posts circulating about people using ChatGPT's Code Interpreter plugin to automate data analysis (Feng et al., 2023).
• Reporting and dissemination summarize the survey findings and present them in a format accessible to the target audience, such as summaries, visualizations, presentations, and even written reports — again, LLMs that can implement data science code could help facilitate this process.

Fig. 2. ML analysis of survey data from Survey2Persona (Salminen et al., 2022a).

One can easily see LLMs assisting in all these tasks; at least, that is the general direction in which this technology is going. Overall, survey research involves a range of language and analysis tasks that are near tailor-made for LLMs. Using these models could significantly improve the efficiency of executing these tasks. Some possible ways that LLMs could process survey responses are simulating human responses and predicting public opinion, augmenting surveys with generative AI to create new survey questions, filling in missing data, providing feedback to respondents, and reporting survey responses as interaction data (i.e., using LLMs to capture and transmit the text of the survey questions and the responses). It remains to be seen whether these models can improve the effectiveness of these tasks, as they require careful planning and execution to avoid bias and ensure the accuracy and reliability of survey findings. Seemingly, the only primary survey research task these LLMs cannot yet do is data collection, that is, administering the survey instrument to the selected sample. However, as the CloudResearch April Fools' Day spoof hints, creating AI-generated responses via AI-generated simulated humans may not be far off, assuming it is not already occurring.

There are also several challenges associated with survey research that AI models can address, which would result in increased effectiveness of survey research.
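The data-cleaning task in the list above is partly mechanical. Before handing flagged rows to an LLM for review, conventional heuristics can catch straight-lined (uniform) closed-ended answers and gibberish open-ended text. A minimal sketch, with made-up data and hypothetical thresholds:

```python
# Hypothetical pre-LLM quality checks for the data-cleaning task above.
# Thresholds and data are illustrative, not from any real survey.

def is_straightlined(likert_answers, min_items=5):
    """Flag respondents who picked the same option for every closed-ended item."""
    return len(likert_answers) >= min_items and len(set(likert_answers)) == 1

def looks_like_gibberish(text, max_consonant_run=5):
    """Crude open-ended check: long consonant runs rarely occur in real words."""
    run = 0
    for ch in text.lower():
        if ch.isalpha() and ch not in "aeiou":
            run += 1
            if run > max_consonant_run:
                return True
        else:
            run = 0
    return False

responses = {
    "r1": {"likert": [4, 2, 5, 3, 4], "open": "The survey was clear and quick."},
    "r2": {"likert": [3, 3, 3, 3, 3], "open": "asdfghjkl qwrtpsd"},
}
flagged = [rid for rid, r in responses.items()
           if is_straightlined(r["likert"]) or looks_like_gibberish(r["open"])]
print(flagged)  # ['r2']
```

An LLM could then be asked to give a second opinion only on the flagged rows, keeping the expensive model off the clean majority of the data.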
For example, a common issue in survey research is a lack of clarity and understanding that occurs when individuals do not fully understand the survey questions or response options, leading to inaccurate or incomplete responses. Data management and analysis challenges related to data cleaning, organization, and analysis can lead to errors or inaccuracies in the results. There are also ethical considerations related to informed consent, confidentiality, and privacy of survey respondents (Spaeth, 1992). These challenges can impact the validity and reliability of survey research findings, highlighting the importance of careful planning, execution, and analysis of survey research to minimize potential biases and ensure the accuracy of the results. Again, one can envision LLMs assisting with most, if not all, of these challenges.

4. Motivation for using LLMs in survey research

The development of LLMs (Chen et al., 2022) has the potential to revolutionize the field of survey research and bring us closer to achieving more accurate, explainable (Cambria et al., 2023), and reliable survey findings, as well as more efficient surveys. These models may also improve NLP survey tasks and help develop machines that can truly understand human language and responses to survey collection.

Gilardi et al. (2023) present evidence that ChatGPT is a suitable replacement for human annotators for various NLP annotation tasks. Their results indicated that the model's zero-shot accuracy exceeds that of crowd-workers in four out of five tasks, and its intercoder agreement was higher than that of both crowd-workers and trained annotators. Furthermore, the per-annotation cost was only $0.003, about twenty times cheaper than Amazon MTurk (the leading crowdsourcing platform for surveys).
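As a back-of-envelope reading of the reported figures (the MTurk per-annotation cost below is inferred from the stated twenty-fold savings, not quoted from the study):

```python
# Cost comparison implied by the figures above. The $0.003 per-annotation
# ChatGPT cost is the reported one; the MTurk figure is inferred from the
# stated twenty-fold savings and is an assumption.
chatgpt_cost = 0.003             # USD per annotation (reported)
mturk_cost = chatgpt_cost * 20   # ~0.06 USD per annotation (inferred)

n_annotations = 10_000           # hypothetical study size
print(f"ChatGPT: ${chatgpt_cost * n_annotations:,.2f}")  # $30.00
print(f"MTurk:   ${mturk_cost * n_annotations:,.2f}")    # $600.00
```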
These results highlight ChatGPT's potential to significantly reduce the amount of labor and time spent on survey research.

A study by Törnberg (2023) examined the accuracy, reliability, and bias of ChatGPT when classifying Twitter users' political affiliation based on the content of a tweet. ChatGPT was compared to annotation provided by expert classifiers and crowdsourced workers, traditionally seen as the gold standard for similar tasks. Tweets from United States politicians during the 2020 election were used as the ground truth to measure the accuracy of the LLM. The results indicated that ChatGPT outperformed human classifiers regarding accuracy and reliability and had an equal or lower bias. Crucially, the LLM could correctly analyze messages that require reasoning and interpretation based on contextual knowledge, abilities often seen as exclusive to humans. These findings suggest that LLMs have substantial potential for use in the social sciences, enabling interpretive research on a much larger scale.

Cegin et al. (2023) studied whether ChatGPT could potentially substitute human workers in paraphrase generation for intent classification. For this, they quasi-replicated the data collection methodology of an existing crowdsourcing study on a similar scale, prompting with the same seed data and using ChatGPT instead of human labor. The results showed that ChatGPT-created paraphrases were more diverse and could thus lead to more robust machine-learning models.

On the other hand, Bisbee et al. (2023) investigated the use of ChatGPT for measuring public opinion, showing that it is not a reliable substitute for human respondents. They found that ChatGPT-generated responses overly exaggerate the extremity and certainty of partisan and social division compared to the actual opinions of those possessing the same attributes. Measurements of partisan and racial affective polarization produced by prompted ''persona'' profiles in ChatGPT are seven times larger than the average human opinion, while the standard deviation of the synthetic data was only 31% of the variation found among real human opinions. As these models are proprietary, the researchers could not identify the cause of the bias, but their findings raise questions about the viability of using closed-source LLMs as sources of synthetic data.

Hämäläinen et al. (2023) explored using LLMs for generating synthetic user research data. They used the GPT-3 model to generate responses to open-ended questions on the topic of video games as art. Results showed that GPT-3 could generate plausible accounts of HCI experiences. The researchers argue that LLM-generated data can be useful in designing and assessing experiments because it is a cheap and rapid process. However, they also cautioned to double-check the correctness of any resulting conclusions with real data. Their findings also present potential concerns, since LLMs could be misused on crowdsourcing services; if this were to occur, crowdsourced self-reported data would become unreliable.

Kim and Lee (2023) analyzed how LLMs could augment surveys and enable missing data imputation, retrodiction, and zero-shot prediction. They proposed a novel methodological framework integrating survey questions, individual beliefs, and temporal contexts to tailor LLMs for opinion prediction. Results suggested that the best models were highly accurate for missing data imputation and retrodiction. They could, for instance, help identify shifts in public support for same-sex marriage. However, the models demonstrated limited performance for zero-shot prediction.
The researchers also found that accuracy was lower for people with lower socioeconomic status, non-partisan affiliations, and racial minorities, yet was slightly higher for ideologically sorted opinions in contemporary periods. Thus, their results implied a need for adequate socio-demographic representation and for ethical considerations related to LLM deployment.

5. Considerations of employing LLMs

As with any new technology, there are potential benefits and drawbacks to consider. First, LLMs may be able to generate compelling fake text and findings from survey data, which could have significant implications for issues like disinformation and misinformation. LLMs' ability to generate persuasive fake text or fake results from data analysis, which could result from intentional or unintentional prompts from survey researchers when leveraging these models to summarize (Xie et al., 2023) and analyze survey results, is a critical issue. It has significant implications for research findings, including disinformation and policy implications (which often rely on survey research), as malicious actors could use these models to spread false information or impersonate real people. For example, these LLMs could create highly convincing fake responses or survey analysis results that could be erroneous, spreading misinformation. Notably, injecting artificial information into decision processes via public policy survey research remains a top risk. The issue has political dimensions, as government-funded troll factories already weaponize coordinated fake news campaigns to undermine the legitimacy of institutions (Bahrini et al., 2023).

Second, there is a risk that LLMs could be used to create highly realistic fake text that could be used to harm individuals or groups, such as by spreading hate speech or inciting violence. Third, there are privacy concerns about these models, as they may be trained on sensitive or personal survey data that could be used to identify individuals.
Fourth, there are serious data concerns that actual (human) survey respondents would not answer the survey questions themselves but would instead rely on models like ChatGPT to provide answers, using the survey items as prompts. In this scenario, the survey data would not actually be the survey participant's responses, and the researcher would have no reasonable way of detecting this deception.

As a result, it is important to carefully consider the ethical implications of using LLMs and to ensure that they are used responsibly and beneficially. The potential benefit (and threat) of LLMs like ChatGPT is their ability to generate highly realistic and human-like text. This capability can be employed in surveys for crafting the survey items or summarizing survey results from the analysis of survey data, all of which LLMs can do. This could have significant implications for the survey research field, as machines can generate survey items indistinguishable from those written by humans; therein lies both the opportunity and the threat. Relatedly, there are also concerns about the potential of these models to perpetuate biases in language data (Chakravarthi et al., 2023). For example, if an LLM is trained on text biased against certain groups of people (Diaz et al., 2018), it may reproduce those biases in its output when generating survey questions or responses. As a result, it is important for survey researchers to carefully consider the data used to train these models and ensure that they are not reinforcing harmful stereotypes or biases; above all, one needs to remain critical of LLM outputs and not get complacent about them. This understanding may, however, be beyond the capabilities of those employing these models, as determining the biases of the outputs in real time is not a straightforward feat.

One potential way to address the issue of bias in LLMs is through the use of diverse and representative training data by those training these models.
Incorporating a wide range of perspectives and voices in the training data may help minimize the risk of perpetuating harmful biases. Additionally, AI researchers can use techniques like debiasing algorithms and adversarial training to mitigate the effects of bias in language data. Another potential solution is to involve diverse experts and stakeholders in developing and evaluating LLMs, including individuals from underrepresented communities and those directly impacted by these models. Finally, survey researchers can ensure that they are not reinforcing harmful stereotypes or biases by carefully reviewing survey items through a diverse group of (human) survey researchers and editing the LLM text, which might be the most fruitful approach.

Overall, LLMs have the potential to significantly improve NLP tasks of survey-based research, such as machine translation, sentiment analysis of responses, topical classification of responses, summarization of open-ended question responses, and composition of the survey items themselves. By training on massive amounts of text data, these models can learn to recognize complex patterns and relationships in language that may not be immediately apparent to humans (Yang et al., 2023). Additionally, the ability of these models to generate highly realistic and human-like text could have significant implications for fields like survey research, both positive and negative. For example, many survey researchers rely on participant recruitment companies with panels of participants who sign up to complete surveys for a monetary reward (Salminen et al., 2022b). These panelists could easily leverage models like ChatGPT to respond to surveys. The result is that the data from these surveys would not be the true responses of the participants themselves.
In this scenario, survey participants could submit AI-generated responses, with survey researchers then using AI to analyze those responses. Regardless, it is a scenario that survey researchers will increasingly have to face, and we expect it is already occurring in survey research as of this manuscript's preparation date. As such, using LLMs to generate survey responses deserves additional consideration.

6. Advantages of employing LLMs for survey responses

There are potential advantages to using LLMs like ChatGPT in survey research to generate survey responses. The scalability of LLMs is impressive: these models can generate responses to survey questions quickly and at a large scale, which can be useful for conducting surveys with many participants or generating responses to open-ended survey questions. The models are also fairly consistent (Gilardi et al., 2023); unlike human respondents, LLMs can provide consistent responses to survey questions, which can be particularly useful for standardizing responses and minimizing variation between responses. Indeed, this is a cost-effective approach, as it eliminates the need for recruiting and compensating human survey respondents. This incentivizes researchers to use LLMs, as cost is a constant issue in survey research (Salminen et al., 2022b). Also, LLMs are quite flexible in generating responses to survey questions in multiple languages, making them helpful for conducting surveys in multilingual contexts or with participants who speak different languages.
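The consistency advantage noted above can also be quantified. A minimal sketch, assuming the same model is run repeatedly over the same closed-ended items (data and function names are hypothetical):

```python
# Hypothetical sketch: exact-agreement rate across repeated runs of the same
# model on the same closed-ended items, as one way to measure consistency.
def agreement_rate(runs):
    """Fraction of items on which every run gave the same answer."""
    n_items = len(runs[0])
    stable = sum(1 for answers in zip(*runs) if len(set(answers)) == 1)
    return stable / n_items

llm_runs = [
    [5, 3, 4, 2, 5],  # run 1
    [5, 3, 4, 2, 5],  # run 2
    [5, 3, 4, 1, 5],  # run 3 (one item drifts)
]
print(agreement_rate(llm_runs))  # 0.8
```

The same metric applied to repeated human panels would give the human baseline against which the model's consistency could be compared.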
These models might also be able to provide insights into language patterns and trends that may be difficult to identify through human survey responses, such as changes in word usage over time or the emergence of new language conventions. So, while there are potential challenges and limitations associated with using LLMs for survey research, these NLP models also offer several advantages that may make them valuable in the survey researcher's toolkit.

7. Potential issues of employing LLMs in survey responses

Of course, potential issues arise from using LLMs in survey research to generate survey responses. There may be bias in the language models, as LLMs are trained on massive amounts of text data, which can amplify biases present in the training data. This can result in biased language generation, social stereotyping, unfair discrimination, and exclusionary norms, and it may skew survey research results (Weidinger et al., 2022). LLMs may also suffer from a lack of contextual understanding and common-sense reasoning abilities. This shortcoming can generate nonsensical or inappropriate responses to survey questions (also known as 'hallucinations'). While LLMs have access to a vast generic vocabulary, they often have a limited vocabulary within a specific vertical. These models may still struggle with rare or domain-specific terms (Morozovskii and Ramanna, 2023) that may be common in survey research. This can result in inaccurate or incomplete responses to survey questions. However, a perhaps even more dangerous situation is the case of ''compelling misinformation'' (Spitale et al., 2023), referring to situations where the LLM produces highly convincing text that is factually wrong. Spitale et al. (2023) tested whether people can determine whether a tweet is organic (written by a Twitter user) or synthetic (generated by GPT-3). The results showed that GPT-3 is capable of both creating accurate information that is easier to understand and producing more convincing disinformation.
Furthermore, people could not tell the difference between tweets generated by GPT-3 and those written by humans. So, unless the source of information divulges that it was wholly or partially generated using an LLM, people might have no way of knowing.

Apart from the above, the lack of transparency of LLMs is another major concern, as the inner workings of LLMs are often opaque and difficult to interpret. Transparency here refers both to disclosing LLM participation, as mentioned above, and to the intractability of LLM training and the text-generation process, sometimes called algorithmic opacity (Eslami et al., 2019). This lack of transparency makes it challenging to identify the sources of potential errors or biases in the generated responses, which in turn makes it challenging to validate the results of survey research.

There are also ethical considerations, as using LLMs in survey research raises concerns about using AI-generated responses to replace human participants. For example, would using LLMs as survey respondents in psychology research be appropriate or even acceptable to the research community? Would LLMs be able to mimic human-like cognition and emotions while responding to a survey involving psychology and behavior-related research?

Overall, while LLMs have demonstrated impressive capabilities in generating human-like responses, several potential issues must be considered when using them in the context of survey research. These issues highlight the need for careful consideration of the strengths and limitations of these models, as well as the potential impact of their use in survey research and the resulting implications.

8. Future of LLMs in survey research

One thing is apparent: LLMs will impact survey research, and not just in response generation. These models have already impacted survey research.
The authors of this paper have employed LLMs in survey research in multiple ways, such as converting survey questions to statements (e.g., ''Do you like ice cream?'' to ''I like ice cream'') and in algorithmically generating personas from survey data, as shown in Fig. 3.

LLMs offer several potential future directions for survey research. First, the use of technology in survey research will likely continue to grow. This includes using online, mobile, and other digital technologies to collect survey data. Artificial intelligence and machine learning algorithms may also be used to help improve the accuracy and efficiency of survey research. Second, survey researchers may begin to explore non-traditional data sources, such as social media data, web analytics, and other digital data sources, to supplement or replace traditional survey data, given that LLMs can rapidly make sense of such data. Third, survey researchers may begin to integrate data from multiple sources, such as survey data, administrative data, and other data sources, to gain a more comprehensive understanding of the research question, using LLMs to aid in integrating these disparate data sources. Fourth, there may be a greater emphasis on data quality in survey research, focusing on improving data collection methods, reducing nonresponse bias, and increasing response rates, perhaps using LLMs to partially address these issues. Finally, LLMs may lead to an increased focus on collaborative research in survey research, with researchers from different disciplines working together to address complex research questions.

The implications of employing LLMs might be profound, leading to a significant advance in survey research. For example, LLMs may improve the design of survey questions and response options: these models may generate more neutral and objective questions or suggest a wider range of response options less likely to influence how individuals respond.
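The question-to-statement conversion mentioned at the start of this section was done with an LLM. Purely to illustrate the transformation itself, a toy rule-based version that handles only the simple ''Do you ...?'' pattern:

```python
import re

# Toy illustration of the question-to-statement conversion described above
# (e.g., "Do you like ice cream?" -> "I like ice cream"). The authors used
# an LLM for this; this regex covers only one question form and would miss
# the many phrasings an LLM handles.
def question_to_statement(question):
    m = re.match(r"Do you (.+)\?$", question.strip(), flags=re.IGNORECASE)
    if m:
        return f"I {m.group(1)}"
    return question  # fall through for forms the toy rule cannot handle

print(question_to_statement("Do you like ice cream?"))  # I like ice cream
```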
NLP techniques inherent in these models may be used to analyze and interpret survey responses, allowing researchers to gain deeper insights into the data. These techniques may include sentiment analysis, topic modeling, or other NLP techniques to identify patterns and trends in the data. LLMs may be used to personalize surveys, tailoring the questions and response options to individual respondents' characteristics and preferences, and these models may be used to provide real-time feedback to survey respondents, helping to improve response rates and the accuracy of the data collected. Also, LLMs may be used to support multilingual surveys, allowing researchers to collect data from a wider range of individuals and populations.

9. Probing research question

We want to close this section with a probing question: can synthetic data be accurate? If an LLM can accurately represent people's average opinions on factors like sentiment, as some nascent work suggests (Gilardi et al., 2023), then would an LLM equally well represent the average opinions of people when polling them about any societal matter? In a sense, the LLM is trained on public opinion, so there is a possibility that it can reflect public opinion. Therefore, there is a possibility that the synthetic, so-called ''fake'' response is, in fact, correct. This possibility is often ignored by treatises that categorically reject using LLMs for public opinion studies due to the myriad of risks. While we are not arguing in favor of replacing Gallup polls with LLM polls, we do want to point out that, as researchers, we must objectively examine this new technology by analyzing the full scope of its possibilities, even those that, based on first impression, appear impossible. In theory, LLMs can represent people's opinions correctly without this being a fluke (see Table 1). For example, controlled experiments comparing LLM and human responses could be conducted in various contexts and situations, with the accuracy of both LLM and human respondents compared to the opinions of the overall population.

Fig. 3. Algorithmically-generated personas from Survey2Persona (Salminen et al., 2022a). (a) is the personas cast (i.e., listing); (b) is a single persona profile from the cast.

Table 1
Theoretical possibilities of information accuracy by respondent source. All four options are theoretically possible. Quadrant 1 (Q1) is often refuted a priori, but we argue that more research on that quadrant is needed.

                              Information given is accurate (i.e., it reflects
                              the average opinion of the population correctly)
Source of the information          Yes        No
LLM                                Q1         Q2
Human                              Q3         Q4

As shown in Table 1, Q1 (an LLM accurately reflecting the ''average'' human opinion of a given population) may not be just an April Fools' Day joke; using LLMs as survey respondents might be achievable in the near term. For Q2 (an LLM inaccurately reflecting the ''average'' human opinion) and Q4 (human respondents inaccurately reflecting the ''average'' human opinion), if the information is inaccurate, it is not a basis for solid research. Q3 (human respondents accurately reflecting the ''average'' human opinion) is, at least for now, considered the 'gold standard'. As LLM accuracy is further investigated, however, this view may change. An area of future research is the theoretical possibility of information accuracy, namely: how precise does an LLM have to be to be considered accurate?

Peer-reviewed evidence either for or against using LLMs is still too scarce to draw definitive conclusions. Our concluding statement is that LLMs will become part of the survey research process in one form or another. How extensively, we do not yet know. For now, the research community must focus on creating ethical standards and guidelines for the acceptable use of LLMs in survey research. Efforts in this area are much needed and underway (Lund et al., 2023; Pournaras, 2023; Rahimi and Abadi, 2023).

10.
10. Conclusion

Although promising, the potential of closed-source LLMs like ChatGPT to measure human opinion has yet to be determined. While LLMs have the potential to address some of the challenges associated with survey research, they may not be a comprehensive solution to all of these challenges. They can potentially help address challenges related to question-wording and response bias by generating more neutral and objective survey questions and by providing a wider range of response options that are less likely to influence how individuals respond. Similarly, LLMs can potentially help address issues relating to a lack of clarity and understanding by providing more detailed explanations or examples of survey questions or response options and by answering any follow-up questions that respondents might have. However, LLMs might be unable to address sampling or nonresponse bias challenges, as these issues relate more to the selection of survey respondents than to the survey questions themselves. Furthermore, ethical considerations relating to informed consent, confidentiality, and the privacy of survey respondents are important issues that need to be carefully considered regardless of whether LLMs are used in survey research.

Overall, while LLMs have the potential to address some of the challenges associated with survey research, as of this writing, they should be used in conjunction with other methods and approaches (Nielsen et al., 2021; Rainie and Jansen, 2009) to ensure the accuracy and validity of the survey results. LLMs have the potential to revolutionize the field of NLP and bring us closer to developing machines that can truly understand human language. However, there are potential benefits, drawbacks, and ethical considerations associated with these models that must be carefully considered.
By taking a thoughtful and nuanced approach to the development and use of LLMs, we can ensure that they are used responsibly and beneficially, maximizing their potential while minimizing the associated risks.

Acronyms

LLM: Large Language Model
NLP: Natural Language Processing
GPT: Generative Pre-trained Transformer
AI: Artificial Intelligence
ML: Machine Learning

Declaration of competing interest

No conflicts of interest.

References

Aldridge, A., 2001. Surveying the Social World: Principles and Practice in Survey Research. McGraw-Hill Education, UK.
Bahrini, A., Khamoshifar, M., Abbasimehr, H., Riggs, R.J., Esmaeili, M., Majdabadkohne, R.M., Pasehvar, M., 2023. ChatGPT: Applications, opportunities, and threats. ArXiv preprint arXiv:2304.09103.
Bisbee, J., Clinton, J., Dorff, C., Kenkel, B., Larson, J., 2023. Artificially precise extremism: How internet-trained LLMs exaggerate our differences. SocArXiv. https://doi.org/10.31235/osf.io/5ecfa.
Braun, V., Clarke, V., 2013. Successful Qualitative Research: A Practical Guide for Beginners. SAGE Publications.
Braun, V., Clarke, V., Boulton, E., Davey, L., McEvoy, C., 2021. The online survey as a qualitative research tool. Int. J. Soc. Res. Methodol. Theory Pract. 24, 641–654. http://dx.doi.org/10.1080/13645579.2020.1805550.
Bryman, A., Cramer, D., 2002. Quantitative Data Analysis with SPSS Release 10 for Windows: A Guide for Social Scientists. Routledge. http://dx.doi.org/10.4324/9780203471548.
Cambria, E., Malandri, L., Mercorio, F., Mezzanzanica, M., Nobani, N., 2023. A survey on XAI and natural language explanations. Inf. Process. Manage. 60 (1), 103111. http://dx.doi.org/10.1016/j.ipm.2022.103111.
Cegin, J., Simko, J., Brusilovsky, P., 2023.
ChatGPT to replace crowdsourcing of paraphrases for intent classification: Higher diversity and comparable model robustness. ArXiv arXiv:2305.12947.
Chakravarthi, B.R., Priyadharshini, R., Banerjee, S., Jagadeeshan, M.B., Kumaresan, P.K., Ponnusamy, R., Benhur, S., McCrae, J.P., 2023. Detecting abusive comments at a fine-grained level in a low-resource language. Nat. Lang. Process. J. 3, 100006. http://dx.doi.org/10.1016/j.nlp.2023.100006.
Chen, X., Xie, H., Tao, X., 2022. Vision, status, and research topics of natural language processing. Nat. Lang. Process. J. 1, 100001. http://dx.doi.org/10.1016/j.nlp.2022.100001.
Diaz, M., Johnson, I., Lazar, A., Piper, A., Gergle, D., 2018. Addressing age-related bias in sentiment analysis. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Paper 412, ACM, pp. 1–14. http://dx.doi.org/10.1145/3173574.3173986.
Eslami, M., Vaccaro, K., Lee, M.K., Elazari Bar On, A., Gilbert, E., Karahalios, K., 2019. User attitudes towards algorithmic opacity and transparency in online reviewing platforms. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. pp. 1–14.
Feng, Y., Vanam, S., Cherukupally, M., Zheng, W., Qiu, M., Chen, H., 2023. Investigating code generation performance of ChatGPT with crowdsourcing social data. In: Proceedings of the 47th IEEE Computer Software and Applications Conference. pp. 1–10.
Gilardi, F., Alizadeh, M., Kubli, M., 2023. ChatGPT outperforms crowd-workers for text-annotation tasks. ArXiv arXiv:2303.15056.
Hämäläinen, P., Tavast, M., Kunnari, A., 2023. Evaluating large language models in generating synthetic HCI research data: A case study. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 1–19. http://dx.doi.org/10.1145/3544548.3580688.
Jansen, K.J., Corley, K., Jansen, B.J., 2007. E-survey methodology. In: Handbook of Research on Electronic Surveys and Measurements. IGI Global, pp. 1–8.
Kim, J., Lee, B., 2023.
AI-augmented surveys: Leveraging large language models for opinion prediction in nationally representative surveys. ArXiv arXiv:2305.09620.
Lund, B., Wang, T., Mannuru, N.R., Nie, B., Shimray, S., Wang, Z., 2023. ChatGPT and a new academic reality: AI-written research papers and the ethics of the large language models in scholarly publishing. ArXiv preprint arXiv:2303.13367.
Morozovskii, D., Ramanna, S., 2023. Rare words in text summarization. Nat. Lang. Process. J. 3, 100014. http://dx.doi.org/10.1016/j.nlp.2023.100014.
Moss, A., 2023. CloudResearch revolutionizes online survey research with virtual recruitment of AI participants. CloudResearch. https://www.cloudresearch.com/resources/blog/virtual-participant-recruitment/.
Nielsen, L., Salminen, J., Jung, S.-G., Jansen, B.J., 2021. Think-aloud surveys: A method for eliciting enhanced insights during user studies. In: Human-Computer Interaction – INTERACT 2021: 18th IFIP TC 13 International Conference, Bari, Italy, August 30–September 3, 2021, Proceedings, Part V, vol. 18, pp. 504–508.
Pournaras, E., 2023. Science in the era of ChatGPT, large language models and AI: Challenges for research ethics review and how to respond. ArXiv preprint arXiv:2305.15299.
Rahimi, F., Abadi, A.T.B., 2023. ChatGPT and publication ethics. Arch. Med. Res. 54 (3), 272–274.
Rainie, L., Jansen, B.J., 2009. Surveys as a complementary method for web log analysis. In: Handbook of Research on Web Log Analysis. IGI Global, pp. 39–64.
Salminen, J., Jansen, J., Jung, S.-G., 2022a. Survey2Persona: Rendering survey responses as personas. pp. 67–73. http://dx.doi.org/10.1145/3511047.3536403.
Salminen, J., Kamel, A.M.S., Jung, S.-G., Mustak, M., Jansen, B.J., 2022b. Fair compensation of crowdsourcing work: The problem of flat rates. Behav. Inf. Technol. 1–22.
Salminen, J., Santos, J.M., Kwak, H., An, J., Jung, S., Jansen, B.J., 2020. Persona perception scale: Development and exploratory validation of an instrument for evaluating individuals’ perceptions of personas. Int. J.
Hum.-Comput. Stud. 141, 102437. http://dx.doi.org/10.1016/j.ijhcs.2020.102437.
Spaeth, J.L., 1992. Perils and Pitfalls of Survey Research (Allerton Park Institute (33rd: 1991)). Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. http://hdl.handle.net/2142/634.
Spitale, G., Biller-Andorno, N., Germani, F., 2023. AI model GPT-3 (dis)informs us better than humans. ArXiv arXiv:2301.11924.
Sue, V.M., Ritter, L.A., 2012. Conducting Online Surveys. SAGE.
Törnberg, P., 2023. ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning. ArXiv arXiv:2304.06588.
Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S., Mellor, J., Glaese, A., Cheng, M., Balle, B., Kasirzadeh, A., 2022. Taxonomy of risks posed by language models. In: 2022 ACM Conference on Fairness, Accountability, and Transparency. pp. 214–229.
Xie, B., Song, J., Shao, L., Wu, S., Wei, X., Yang, B., Lin, H., Xie, J., Su, J., 2023. From statistical methods to deep learning, automatic keyphrase prediction: A survey. Inf. Process. Manage. 60 (4), 103382. http://dx.doi.org/10.1016/j.ipm.2023.103382.
Yang, Z., Liu, Y., Ouyang, C., Ren, L., Wen, W., 2023. Counterfactual can be strong in medical question and answering. Inf. Process. Manage. 60 (4), 103408. http://dx.doi.org/10.1016/j.ipm.2023.103408.