Lingyao Li [1] and Ly Dinh [2] (these two authors contributed equally to this work)
[1] School of Information, University of Michigan, Ann Arbor, MI, United States
[2] School of Information, University of South Florida, Tampa, FL, United States
[3] Senseable City Laboratory, Massachusetts Institute of Technology, Cambridge, MA, United States
Academic collaboration on large language model studies increases overall but varies across disciplines
Abstract
Interdisciplinary collaboration is crucial for addressing complex scientific challenges. Recent advancements in large language models (LLMs) have shown significant potential in benefiting researchers across various fields. To explore the application of LLMs in scientific disciplines and their implications for interdisciplinary collaboration, we collect and analyze 50,391 papers from OpenAlex, an open-source platform for scholarly metadata. We first employ Shannon entropy to assess the diversity of collaboration in terms of authors’ institutions and departments. Our results reveal that most fields have exhibited varying degrees of increased entropy following the release of ChatGPT, with Computer Science displaying a consistent increase. Other fields such as Social Science, Decision Science, Psychology, Engineering, Health Professions, and Business, Management & Accounting have shown minor to significant increases in entropy in 2024 compared to 2023. Statistical testing further indicates that the entropy in Computer Science, Decision Science, and Engineering is significantly lower than that in health-related fields like Medicine and Biochemistry, Genetics & Molecular Biology. In addition, our network analysis based on authors’ affiliation information highlights the prominence of Computer Science, Medicine, and other Computer Science-related departments in LLM research. Regarding authors’ institutions, our analysis reveals that entities such as Stanford University, Harvard University, University College London, and Google are key players, either dominating centrality measures or playing crucial roles in connecting research networks. Overall, this study provides valuable insights into the current landscape and evolving dynamics of collaboration networks in LLM research. Our findings also suggest potential areas for fostering more diverse collaborations and highlight the need for continued research on the impact of LLMs on scientific research practices and outcomes.
Keywords: Large language model, interdisciplinary collaboration, diversity analysis, Shannon entropy, network analysis
1 Introduction
Interdisciplinary collaboration is increasingly recognized as critical for addressing complex scientific challenges, which often involve investigators from multiple fields of expertise [1, 2, 3]. Recent advances in generative AI models and applications, particularly the development of large language models (LLMs) such as OpenAI’s ChatGPT, Google’s PaLM, and Meta’s Llama, are reshaping research activities across disciplines. Researchers have demonstrated LLMs’ utility in research tasks such as literature search [4, 5], content analysis [6], and findings generation from provided data [7].
The impact of LLMs extends far beyond their computer science origins or general application in research activities. In health-related fields, LLMs are being employed to interpret protein structures [8], process electronic health records [9], and even aid in drug discovery [10]. Engineering benefits from these models through advancements in autonomous driving [11] and remote sensing [12] technologies. Social scientists are leveraging LLMs for large-scale text analysis, including attitude simulation [13] and content moderation on social media platforms [14]. Their impact also extends to finance, where they are being used to streamline document review and perform financial analysis [15]. This wide-ranging applicability across disciplines leads us to believe that LLMs are transforming multiple aspects of the research process, potentially impacting how researchers collaborate.
Traditionally, interdisciplinary collaboration has been essential to combine expertise from diverse disciplines to find novel solutions to scientific problems. However, the advent of LLMs is reshaping this paradigm. Researchers have leveraged tools like ChatGPT to gain cross-disciplinary insights and interpret their work from various perspectives, potentially reducing the need for direct collaboration with experts from other fields [16]. LLMs have also been used to automate certain aspects of the scholarly writing process, such as synthesizing existing literature from multiple disciplines on a specific topic [17] and reframing the main findings from different disciplines into domain-specific language [18].
LLMs may have lowered the barriers to conducting interdisciplinary research, especially for those unfamiliar with language processing techniques, and they also offer significant potential to enhance research practices. By automating routine tasks [5, 19], such as crunching numbers and formatting manuscripts according to journal guidelines, LLMs can free up time for researchers to focus on developing innovative solutions and gaining deeper insights [20, 17]. Well-known LLMs such as ChatGPT have also illustrated capabilities to bridge language gaps between disciplines such as environmental science and ecology [17], and statistics and biology [18]. However, it is crucial to recognize that LLMs’ effectiveness is primarily illustrated in language-related tasks. They are not substitutes for the critical examination of findings or the nuanced and domain-specific understanding essential for truly bridging disciplinary language [21, 22].
Given that the use of LLMs is reshaping the way scholars work and collaborate, two questions arise: how have LLMs transformed interdisciplinary collaboration, and in what ways have they changed how different disciplines interact and collaborate? Our study therefore explores the application of LLMs in scientific disciplines and their implications for interdisciplinary collaboration by addressing two key research questions, as listed below. The first question aims to examine the diversity of co-authors in terms of their institutional and departmental affiliations, using the Shannon entropy measure. The second question aims to understand the structural patterns of co-authorship collaborations, using network analysis to identify key researchers, institutions, and disciplines that are active producers of LLM research, as well as those facilitating collaborations across disciplinary boundaries.
• RQ1. How diverse are the coauthors of papers utilizing LLMs in terms of their research institutions and departments across different fields?
• RQ2. What are the structural patterns of co-authorship networks in research utilizing LLMs, and what roles do key entities (leading researchers, institutions, and departments) play in facilitating and enhancing collaboration?
To answer these questions, we collect 50,391 papers from OpenAlex related to LLMs from 27 scientific fields published between the release of the BERT model in October 2018 and June 2024. We use Shannon entropy to measure the authorship diversity and social network techniques to reveal the collaboration network structures. Our findings indicate that since the advent of LLMs, particularly with the release of ChatGPT in November 2022, collaboration diversity has increased across many fields, notably Computer Science, Social Science, and Psychology. Medicine is the only discipline where diversity shows a significant decrease. Our network analysis further reveals that Computer Science and Medicine consistently play influential roles in connecting different research communities regarding LLM applications. Overall, our findings demonstrate that LLM research has grown exponentially since 2022 and holds the potential to enhance interdisciplinary collaboration. The findings also encourage the involvement of prominent academic institutions in leading LLM-based research and applications across different domains.
2 Data and Methods
Aligned with the two research questions, our data collection and methods involve two units of analysis: papers and authors. The first research question examines papers to analyze the diversity of coauthors across research fields and institutions. The second research question centers on authors to explore the structural patterns of co-authorship networks and identify key researchers and their roles in facilitating collaboration. Combining these two units of analysis allows us to capture (1) which disciplines and institutions are most impacted by the advent of LLMs in terms of collaboration diversity, as well as (2) key researchers in LLM research and their respective departmental and institutional affiliations. These analyses help us understand how LLMs are impacting scientific collaboration by revealing patterns of interdisciplinary and cross-institutional partnerships, as well as identifying the influential actors driving LLM research across disciplinary boundaries.
2.1 Data preparation
We select OpenAlex [23] for collecting related studies for two main reasons. First, OpenAlex is an open-source repository of scholarly metadata, allowing us to gather both recently archived preprints and published articles from journals and conferences. This is particularly useful for our analysis, given the prevalence of preprinted studies in the field of LLMs. Second, OpenAlex provides its data freely and openly, ensuring that our analysis can be easily replicated by the community without any licensing restrictions.
The workflow for data cleaning is presented in subsection A.1. Within OpenAlex, we collect relevant papers on the most popular LLMs and their respective models, as detailed in Table 2 in subsection A.2. We use two general terms, “large language model” and “LLM,” along with popular open-source models (e.g., BERT, Flan-T5, LLaMA) and closed-source models (e.g., ChatGPT, Claude) based on the MMLU benchmark [24]. We observe that some models, like Yi or Phi, do not yield relevant papers, potentially introducing significant noise during paper screening. Additionally, models like grok-1 or Galactic did not return any search results. Our data collection, conducted as of June 22, 2024, results in a total of 97,242 papers.
To ensure the collected papers are relevant to the topic of LLMs, we restrict our search to titles and abstracts. However, some papers containing these keywords might still be irrelevant. Therefore, we implement several steps to filter out irrelevant papers. First, we consider only articles and preprints, excluding types such as editorials and opinions. Second, we remove potential duplicates, including those with duplicated titles and papers initially published as preprints and later as journal articles with similar titles. We use Jaccard similarity to identify and remove these duplicates (see subsection A.3). Third, we employ GPT-4 models to evaluate the relevance of a paper to LLM topics based on its title, abstract, and keywords, and filter out those papers that are not relevant to LLM topics (see subsection A.4). Applying these filtering criteria reduces the original collection to 50,391 papers for the subsequent analysis.
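The duplicate-removal step can be illustrated with a minimal sketch. The exact procedure is given in subsection A.3 and is not reproduced here, so the word-level tokenization, the function names, and the 0.9 threshold below are all assumptions made for illustration only.

```python
import re


def title_tokens(title: str) -> set[str]:
    """Lowercase a title and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(titles: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a title only if it is less similar than `threshold` to every kept title.

    The 0.9 threshold is a hypothetical value, not the one used in the paper.
    """
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for title in titles:
        tokens = title_tokens(title)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(title)
            kept_tokens.append(tokens)
    return kept
```

This pairwise comparison is quadratic in the number of titles, which is acceptable for a sketch but would likely need blocking or hashing at the scale of tens of thousands of papers.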
2.2 Measure of collaboration diversity
Our first research question seeks to assess the diversity of collaboration. OpenAlex offers a variety of information about author affiliations, including details about departments, institutions, and countries. This allows us to represent the authors’ affiliation information for a paper using a set as follows,
$$A = \{(D_{a_i}, I_{a_i}, C_{a_i}) \mid i = 1, \dots, N\} \tag{1}$$
where $a_i$ denotes the $i$th author of a paper, $A$ denotes the set of authors’ affiliation information given a paper, $D_{a_i}$ represents their department information, $I_{a_i}$ represents their institution information, and $C_{a_i}$ represents their country information (see Appendix D for the analysis of countries). It is important to note that these three types of information can vary significantly within a single paper. For instance, all collaborating authors could belong to the same institution and country but be affiliated with different departments. Our subsequent analysis particularly focuses on the collaboration between institutions and departments; therefore, we consider the first two sets of variables in Equation 1.
Next, we use Shannon entropy to measure the collaboration diversity given authors’ affiliation information in a paper. Shannon entropy quantifies the uncertainty or randomness in a set of possible outcomes. In the context of information theory, it represents the average amount of information produced by a stochastic set of data sources. Mathematically, for a discrete random variable $X$ with possible values $\{x_1, \dots, x_n\}$ and probability distribution $P(X)$, the Shannon entropy is calculated as:
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i) \tag{2}$$
where $x_i$ belongs to one of the aspects (e.g., $D$, $I$) in the set of authors’ affiliation, and $P(x_i)$ is the probability of the outcome $x_i$. We use this metric to measure the diversity given authors’ affiliation information, such as their affiliated departments. For example, if a paper has five authors with affiliated departments represented as $\{d_1, d_1, d_2, d_2, d_3\}$, then the probabilities of $d_1$, $d_2$, and $d_3$ are calculated as 0.4, 0.4, and 0.2, respectively. Using Equation 2, the Shannon entropy is $H = -(0.4 \log 0.4 + 0.4 \log 0.4 + 0.2 \log 0.2)$, which is approximately 1.52 with a base-2 logarithm (approximately 1.05 with the natural logarithm). In general, higher entropy indicates greater diversity in collaboration based on authors’ affiliation information, while lower entropy suggests that authors’ affiliations are more uniform. An entropy of $0$ implies that all authors of a paper are from the same institution.
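The per-paper entropy computation can be sketched as follows. This is a minimal illustration rather than the authors’ code: the function name and the label strings are hypothetical, and the base-2 logarithm is an assumption, since the base is not stated in the text.

```python
import math
from collections import Counter


def shannon_entropy(labels, base=2.0):
    """Shannon entropy of a list of affiliation labels (one label per author).

    `base` is an assumption; the text does not say which logarithm is used.
    Returns 0.0 for an empty list or when all labels are identical.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total  # empirical probability of this affiliation
        entropy -= p * math.log(p, base)
    return entropy
```

For the five-author example with probabilities 0.4, 0.4, and 0.2, `shannon_entropy(["d1", "d1", "d2", "d2", "d3"])` evaluates to roughly 1.52, and passing `base=math.e` gives roughly 1.05.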
2.3 Measures of network structure
We construct co-authorship networks based on the bipartite network projection, which involves converting a paper-author network to a co-authorship network whereby two authors are connected if they have co-authored at least one paper together. Each connection is weighted based on the total number of papers that each pair of researchers co-authored together. This method, as described by [25, 2], allows us to identify key researchers that have notable collaborative influence in the field, as well as any differences in collaboration patterns depending on the authors’ disciplines, institutions, and countries. The formula for our weighted projection approach is shown below:
$$w_{ij} = \sum_{p} B_{ip} B_{jp} \tag{3}$$
where $B_{ip}$ denotes whether author $i$ contributes to paper $p$ (with 1 indicating authorship and 0 indicating no authorship), and similarly $B_{jp}$ indicates whether author $j$ contributes to paper $p$. The product $B_{ip} B_{jp}$ is 1 if both $i$ and $j$ are authors of paper $p$, and 0 otherwise. This method assigns a full weight of 1 to each co-authorship instance and sums these weights across all papers where $i$ and $j$ are co-authors.
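The weighted projection from a paper-author network to a co-authorship network can be sketched in a few lines. The input layout (a mapping from a paper identifier to its author identifiers) and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def coauthorship_weights(papers):
    """Project a paper -> authors mapping onto weighted co-author pairs.

    `papers` maps a paper id to a list of author ids (hypothetical layout).
    Returns a Counter keyed by sorted (author_i, author_j) pairs whose value
    is the number of papers the two authors share.
    """
    weights = Counter()
    for authors in papers.values():
        # set() guards against an author being listed twice on one paper
        for i, j in combinations(sorted(set(authors)), 2):
            weights[(i, j)] += 1
    return weights
```

Each paper with k distinct authors contributes k(k-1)/2 pairs, so papers with very long author lists dominate the edge count; the sketch keeps the full weight of 1 per shared paper, as described above.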
With the resulting co-authorship networks, we analyze the structural properties in terms of (1) overall cohesion, (2) topology, (3) community structure, and (4) centrality measures to identify influential researchers. We compute these measures in Python, visualize the networks in R, and modify a subset of measures based on our operationalization. Cohesion measures include the density, clustering coefficient, average path length, and size of the largest component, which capture the overall connectedness of the network, as well as how efficient the network is in facilitating collaborations between researchers from different disciplines, institutions, and countries.
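Three of the cohesion measures named above (density, component structure, and the average clustering coefficient) can be illustrated with a dependency-free sketch. It assumes an undirected graph without self-loops, given as a list of edges with comparable node labels; the function names are hypothetical, and a graph library would normally be used instead.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Share of possible undirected edges that are present: 2E / (N(N-1))."""
    n = len(adj)
    m = sum(len(nbs) for nbs in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def connected_components(adj):
    """Return the connected components as a list of node sets (breadth-first search)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        components.append(component)
    return components


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbors count as 0."""
    total = 0.0
    for nbs in adj.values():
        k = len(nbs)
        if k < 2:
            continue
        # count edges among the neighbors of this node, each pair once
        links = sum(1 for u in nbs for v in nbs if u < v and v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0
```

The size of the largest component follows as `max(len(c) for c in connected_components(adj))`. Average path length is omitted because it requires all-pairs shortest paths within a component.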
We also determine whether a co-authorship network follows a power-law degree distribution, indicating a hub-and-spokes structure where a few hubs accumulate most of the connections. In the co-authorship context, this means that certain key researchers act as central hubs, coordinating the majority of collaborations across the network. To do this, we compute the power-law exponent as a goodness-of-fit value (reported as “Power-law Exponent” in Table 1), which indicates whether a simulated network with the same number of nodes and edges as the co-authorship network exhibits a similar power-law distribution. Generally, an exponent between 2 and 3 suggests that a power-law distribution is a good fit [26].
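How such an exponent can be estimated from a degree sequence is sketched below. The estimator actually used in this study is not specified in the text, so this is an assumption: it is the approximate maximum-likelihood estimator for discrete data described by Clauset, Shalizi, and Newman, and the function name and the default `x_min` are hypothetical.

```python
import math


def powerlaw_exponent(degrees, x_min=1):
    """Approximate discrete MLE of a power-law exponent.

    alpha ~= 1 + n / sum(ln(x_i / (x_min - 0.5))) over all x_i >= x_min.
    `x_min` (a positive integer) is the assumed lower cutoff of the power-law tail.
    """
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum
```

The approximation is known to be less accurate for small `x_min`, and a point estimate alone does not establish that a power law fits the data; a fuller analysis would also select `x_min` and test the fit against alternatives.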
3 Results
In subsection 3.1, we analyze the diversity of collaboration based on the institutions and departments affiliated with authors across various fields. In subsection 3.2, we examine the network to understand how different institutions or departments collaborate with each other. It should be noted that the field categorization of a paper is provided by OpenAlex (see Appendix C for more details). OpenAlex has established a BERT model that generates a score distribution for identified topics using a paper’s title, abstract, and citations [23]. Their model provides up to three topics with the highest scores, which are then mapped to fields according to the ASJC structure [23].
In addition, we sort the fields of the 50,391 collected papers by paper count, as presented in Figure 1(a). For the subsequent analysis, we focus on the top 12 fields with the most publications on the topic of LLMs: (1) Computer Science, (2) Medicine, (3) Social Science, (4) Decision Science, (5) Biochemistry, Genetics & Molecular Biology, (6) Psychology, (7) Engineering, (8) Health Professions, (9) Business, Management & Accounting, (10) Neuroscience, (11) Arts & Humanities, and (12) Materials Science. We then filter for papers with complete authors’ affiliation information, resulting in 25,933 papers with complete institution information and 16,645 papers with complete department information. Overall, the entropy based on authors’ affiliated institution and department information shows an increase (see Figure 1(b)).
3.1 Analysis of collaboration diversity
Figure 2(a) and Figure 2(b) show the collaboration diversity measured by Shannon entropy for authors’ institutions and departments, respectively. It is interesting to notice that nearly all fields exhibit a sharp increasing trend after the release of ChatGPT. Before that date, only Computer Science shows a comparatively larger number of LLM-related studies. This is possibly because BERT models, such as DeBERTa and RoBERTa, were widely studied before the release of ChatGPT. It is also noteworthy that other fields, such as Biochemistry, Genetics & Molecular Biology, Psychology, Neuroscience, Social Science, Engineering, and Arts & Humanities, demonstrated significant applications of LLMs even before the release of ChatGPT. After the release of ChatGPT, one of the most interesting observations regarding the trend is that Medicine shows a sharp and significant increase, with the peak close to that of Computer Science. Similarly, Business, Management & Accounting and Health Professions have also witnessed a sharp and increasing trend after the release of ChatGPT.
Regarding the entropy analysis based on institutions and departments, the Shannon entropy trends in Figure 2(a) and Figure 2(b) closely align for each respective field. Compared to 2023, most fields show a minor increase in the averaged entropy in 2024. Exceptions include Biochemistry, Genetics & Molecular Biology, Engineering, Neuroscience, Business, Management & Accounting, and Materials Science. Moreover, the variance of entropy is much wider before ChatGPT’s release, possibly due to fewer papers on the topic of LLMs and more diverse collaborations. One possible explanation for this pattern is that researchers from other fields might have sought partnerships with Computer Science experts for such research topics.
In addition, the observed changes in entropy suggest a potential shift in research collaboration dynamics following the introduction of ChatGPT across several fields. First, papers in the field of Computer Science display a stable increasing trend. A plausible explanation for this is that many Computer Science researchers are seeking collaborations with domain experts, focusing on areas such as AI for science applications or AI for social good. However, Medicine exhibits a notable decrease in entropy. This is the only field showing a statistically significant decrease in entropy after ChatGPT’s release, while other fields display either minor increases or negligible changes. These findings highlight the varying impacts of ChatGPT’s introduction on interdisciplinary collaborations across different academic fields.
Then, we conduct a Wilcoxon rank-sum test to compare the entropy before and after the release of ChatGPT, with detailed results presented in Appendix B (see Figure 5). Our analysis reveals statistically significant changes in entropy across several fields. Computer Science, Biochemistry, Genetics & Molecular Biology, and Psychology show a statistically significant increase in entropy, while Medicine displays a statistically significant decrease. The other fields do not show any significant increase or decrease in entropy. In particular, we observe that the median of the entropy distribution for most fields remains at 0, with the first quartile consistently at 0. This suggests that many researchers primarily collaborate with colleagues from their own institutions or departments on LLM-related topics.
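The mechanics of a two-sided Wilcoxon rank-sum test can be sketched without external libraries. This is an illustration under stated assumptions, not the authors’ procedure: it uses the normal approximation with average ranks for ties and no tie correction in the variance, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test via the normal approximation.

    Returns (z, p). A negative z means the ranks in `x` tend to be lower.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p
```

Because many papers have an entropy of exactly 0, ties are frequent in this kind of data, so a tie-corrected implementation from a statistics library would be preferable in practice.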
We conduct additional Wilcoxon rank-sum tests to compare entropy across different fields, with results presented as heatmaps in Figure 6. Several consistent patterns emerge when examining both institution and department results before and after ChatGPT’s release. Certain fields like Computer Science, Medicine, and Neuroscience exhibit more significant differences compared to other fields, as evidenced by darker blue cells in the heatmaps. Following ChatGPT’s release, there is also an increase in significant differences across fields, illustrated by a higher prevalence of darker cells in the right heatmap in Figure 6, compared to their counterparts in the left heatmap of the same figure. Another interesting observation is that Computer Science consistently demonstrates significantly lower entropy than Medicine, Biochemistry, Genetics & Molecular Biology, Health Professions, and Neuroscience.
3.2 Analysis of collaboration network
We analyze the co-authorship networks in terms of structural cohesion, topology, community structures, and influential researchers with respect to various dimensions of centrality. The network metrics are presented in Table 1, and the network visuals are presented in Figure 3.
Based on Table 1, the institution-based and department-based networks both have low density (0.001 and 0.0008, respectively) but high clustering coefficients (0.61 and 0.75, respectively), indicating that while direct collaborations are limited, existing ones form tight-knit clusters. The largest components comprise a notable portion of all network nodes (77% for the institution-based network and 70% for the department-based network), meaning that most researchers are part of research communities that are reachable from each other.
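As a consistency check, the density and the share of nodes in the largest component can be recomputed from the node count $N$, edge count $E$, and largest-component size reported in Table 1. The calculation assumes the networks are undirected simple graphs, which the text does not state explicitly.

```latex
\begin{align*}
\text{density} &= \frac{2E}{N(N-1)} \\
\text{Institutions:}\quad & \frac{2 \times 35080}{8391 \times 8390} \approx 0.0010,
  \qquad \frac{6485}{8391} \approx 0.77 \\
\text{Departments:}\quad & \frac{2 \times 16811}{6604 \times 6603} \approx 0.0008,
  \qquad \frac{4616}{6604} \approx 0.70
\end{align*}
```

Both densities agree with the values listed in Table 1, and the two ratios are shares of nodes rather than edges.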
The relatively high average shortest path length (16.74 for institution-based, and 16.61 for department-based networks) suggests extensive reach within the networks, indicating that while direct connections between researchers might be limited, there are alternative paths that connect them. The primary difference is that the department-based network exhibits a stronger fit towards a power-law distribution (exponent = 2.32), suggesting that central departments, namely Computer Science and Medicine, are major hubs that facilitate numerous collaborations. On the other hand, the institution-based network (exponent = 4.06) does not exhibit a clear hub-and-spokes structure, indicating a more evenly distributed pattern of collaborations across various institutions.
Degree centrality results highlight the prominence of Computer Science, Medicine, and other Computer Science-related departments in LLM research. Furthermore, the high betweenness centrality of these same departments highlights their roles in facilitating collaboration between departments that would have otherwise been unconnected. As illustrated in Figure 3, the clusters of departments between co-authors not only demonstrate the dominance of Computer Science and Medicine and their related disciplines (the orange cluster and the green cluster for Computer Science and Medicine-related disciplines, respectively) in the network, but also show that these two disciplines connect fields with little to no overlap. For instance, Medicine is on the shortest path between Engineering and Social Science in 196 instances. Similarly, Computer Science is on the shortest path between Medicine and Psychology and Neuroscience in 384 instances.
With respect to key institutions (the “Degree Centrality” and “Betweenness Centrality” rows in Table 1), Stanford University ranks first in degree centrality and third in betweenness centrality, indicating its active involvement as both a collaborator and a facilitator of collaboration between institutions in LLM research. While Harvard University and University College London are not as central in terms of direct collaborations with other universities, their high betweenness (the two highest values in Table 1) indicates that they actively connect institutions with each other, as well as with industrial tech companies: for instance, University College London connects Google (United States) and University of Zurich (162 instances), and Harvard University connects Allen Institute and University of Washington (210 instances).
Closeness centrality results show how close researchers from departments and institutions are, on average, to other researchers in the network. The top 5 departments with the highest closeness centrality include language-related disciplines, such as Computational Language Modeling, German Language, and Literature & Education, which shows the extensive application of LLMs in domains where language analysis or content generation is needed. The top 5 institutions in terms of closeness centrality are primarily universities from countries outside the U.S. (see Table 1). This finding shows that LLM-related research is a global effort, with international institutions acting as connectors across geographical boundaries.
Eigenvector centrality results show more consistency with degree centrality and betweenness centrality results than with closeness centrality, highlighting researchers who are not only well-connected but also connected to other influential researchers. Researchers with the highest eigenvector centrality scores are also from Medicine, Computer Science, and their related fields. Biomedical Informatics has the highest eigenvector centrality, as it is directly connected to Medicine and Computer Science, the two most influential departments in our dataset. The eigenvector centrality analysis for institutions underscores the critical role of medical research institutes, such as the University of Colorado Anschutz Medical Campus, European Bioinformatics Institute, and Lawrence Berkeley National Laboratory, in facilitating collaborations that significantly impact the LLM research community.
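Two of the centrality measures discussed above, degree and betweenness, can be illustrated with a dependency-free sketch. It assumes an unweighted, undirected graph stored as an adjacency map `{node: set(neighbors)}`; the function names are hypothetical, and betweenness is computed with Brandes’ algorithm, which may differ from the implementation used in this study.

```python
from collections import deque


def degree_centrality(adj):
    """Number of neighbors divided by the maximum possible, N - 1."""
    n = len(adj)
    if n <= 1:
        return {v: 0.0 for v in adj}
    return {v: len(nbs) / (n - 1) for v, nbs in adj.items()}


def betweenness_centrality(adj):
    """Normalized betweenness for an unweighted, undirected graph (Brandes)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}      # predecessors on shortest paths from s
        sigma = dict.fromkeys(adj, 0)     # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:           # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)   # dependency of s on each node
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # every unordered pair is visited from both endpoints, so dividing the raw
    # score by (n-1)(n-2) yields the fraction of pairs routed through a node
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: score * scale for v, score in bc.items()}
```

On a three-node chain, the middle node lies on the only shortest path between the two ends and therefore receives a betweenness of 1.0, while the end nodes receive 0.0.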
Metric | Institutions | Departments |
No. Nodes | 8391 | 6604 |
No. Edges | 35080 | 16811 |
Density | 0.001 | 0.0008 |
Avg. Shortest Path | 16.74 | 16.61 |
Clustering Coeff. | 0.61 | 0.75 |
No. Components | 1569 | 1678 |
Largest Component | 6485 | 4616 |
Power-law Exponent | 4.06 | 2.32 |
No. Comm. (Louvain) | 1613 | 1730 |
No. Comm. (CNM) | 1684 | 1775 |
Degree Centrality (top 5) | Stanford University (343) | Computer Science (835) |
 | University of Washington (287) | Medicine (518) |
 | Peking University (262) | Computer Science & Eng. (422) |
 | Tsinghua University (256) | Artificial Intelligence (289) |
 | University of Waterloo (227) | Engineering (226) |
Betweenness Centrality (top 5) | University College London (0.028) | Computer Science (0.089) |
 | Harvard University (0.026) | Computer Science & Eng. (0.038) |
 | Stanford University (0.021) | Medicine (0.035) |
 | University of Waterloo (0.021) | Psychology (0.034) |
 | University of Washington (0.021) | Engineering (0.034) |
Closeness Centrality (top 5) | University of Craiova (0.5) | Computational Lang. Modeling (0.50) |
 | Constantin Brâncuşi Univ. (0.5) | German Language (0.50) |
 | Nutrition Center of Philippines (0.5) | Information & Architecture (0.50) |
 | West Visayas State University (0.5) | Literature & Education (0.50) |
 | Yalova University (0.5) | Vet. Public Health & Epidem. (0.50) |
Eigenvector Centrality (top 5) | Univ. of Colorado Anschutz (0.61) | Biomedical Informatics (0.45) |
 | Lawrence Berkeley Natl. Lab. (0.54) | Computer Science (0.40) |
 | European Bioinfo. Institute (0.37) | Medicine (0.38) |
 | Critical Path Institute (0.31) | Envir. Genomics & Sys. Biology (0.30) |
 | University of Illinois System (0.11) | Bioinformatics (0.26) |
4 Discussion
4.1 Key findings and implications
Collaboration diversity results, as well as co-authorship structures, reveal the notable and complex effects that LLMs have on fostering interdisciplinary and cross-institutional collaborations. The integration of LLMs into research areas has led to a potential shift in collaboration patterns. Below, we summarize several key findings and implications.
First, there has been an overall significant increase in the number of publications on LLM-related topics since the advent of ChatGPT in 2022. This surge extends beyond Computer Science and artificial intelligence to disciplines such as Medicine, Social Science, and Engineering. Additionally, the overall entropy, which measures the diversity of collaboration, has increased. This observation implies that collaboration diversity has broadened across these fields.
The impact on specific fields varies. Computer Science sees a stable increase in entropy, suggesting that computer scientists continue to seek collaboration with researchers from other disciplines. This trend aligns with our observation that researchers are increasingly exploring topics in AI for social good or AI for science [27]. Health-related fields, including Medicine, Neuroscience, Health Professions, and Biochemistry, Genetics & Molecular Biology, exhibit higher entropy compared to Computer Science, Social Science, and Engineering. However, Medicine is the only area showing a significant decrease in entropy. One possible explanation is that the advancement of LLMs, particularly with their simple deployment interfaces, could enable researchers from disciplines like Medicine to use these AI models for domain-specific issues more effectively than in the past. This is already being seen in applications such as drug discovery and repurposing [28].
Therefore, while LLMs might be associated with broader collaborations across disciplines and institutions, they may also encourage specialized collaborations within certain fields, such as Medicine, where domain-specific knowledge is often needed. In particular, medical-related research may have specific data privacy requirements, such as HIPAA regulations for electronic health records [29], and may require domain-specific expertise, such as in pathology [30] and clinical diagnosis [31], to evaluate whether LLM-predicted outputs are correlated with improved patient outcomes. Unlike with previous AI models, researchers in Medicine can now use LLMs to address domain-specific issues more effectively without always seeking external collaborations.
These findings suggest that interdisciplinary collaboration may flourish in contexts where LLMs are used for analyzing and generating data that could be applicable to multiple disciplines, as opposed to health-related fields like Medicine, where the data is typically specific to that field. Additionally, creating clear guidelines and support for managing data privacy, ethical considerations, and domain-specific adaptations of LLM technologies can be crucial for leveraging their full potential in sensitive fields like Medicine.
Structural analysis of co-authorship connections shows that despite differences in collaboration diversity after the advent of ChatGPT, Computer Science and Medicine remain the most represented disciplines in the network. In particular, researchers from these two disciplines (via departmental affiliation) have the highest degree and betweenness centrality scores, indicating their roles as both active researchers and facilitators of cross-field collaborations. This finding is especially important for Computer Science because, although its own papers represent fewer disciplines (lower entropy than other disciplines), the field is structurally necessary to connect other disciplines together. For instance, Medicine bridges Computer Science and Engineering and Information Technology (334 times), Engineering and Social Science (196 times), and Neurosurgery and Intelligent Technology and Engineering (144 times). Relatedly, the influence of Medicine is also reflected in the cross-institutional analysis, where medical institutions and associated national laboratories (e.g., University of Colorado Anschutz Medical Campus, European Bioinformatics Institute, Jackson Laboratory) are the most influential in terms of eigenvector centrality, highlighting their central roles in connecting with other influential institutions and facilitating extensive cross-institutional collaborations.
Overall, the study of author collaboration and network analysis provides valuable insights into the dynamics and patterns of interdisciplinary research in LLMs, highlighting how knowledge and expertise are exchanged across various fields. Our analysis reveals that the diversity of academic collaborations increases overall, but varies across disciplines. Fields with methodological expertise and domain generality, like Computer Science, exhibit increasing collaboration diversity, as they may have sought applications in other fields, such as Psychology and Social Sciences. In contrast, disciplines like Medicine show decreased diversity in collaboration after the release of ChatGPT, possibly because domain-specific knowledge is necessary in addition to methodological expertise and because these AI models can now be applied more effectively than in the past. Extensive cross-institutional and cross-country collaborations are also facilitated by two dominant fields, Computer Science and Medicine. Leading research institutions and associated national labs in the U.S., U.K., and China are actively conducting research using LLMs and fostering interdisciplinary partnerships while sharing expertise and resources.
4.2 Opportunities for future work
This study presents several avenues for future research, each addressing current limitations and opportunities for expansion. One primary constraint lies in the data quality of OpenAlex, which has been shown to have missing information issues, particularly in author affiliation and abstract details [32]. These gaps could potentially lead to a loss of valuable insights from papers or preprints not fully captured by OpenAlex. Future research could benefit from conducting sensitivity analyses to assess the impact of these missing values on the patterns observed in this study, thereby providing a more robust understanding of the data’s reliability and comprehensiveness.
Another area for improvement concerns OpenAlex’s categorization of disciplines. While OpenAlex uses a BERT-based model to identify topics and subsequently maps them to fields based on Scopus’s ASJC structure (see Appendix C), it does not clearly state the size of the training and testing data or report performance metrics for discipline classification. A valuable direction for future work would be to evaluate the accuracy and reliability of this discipline classification system, potentially developing more refined or domain-specific categorization methods for LLM-related research.
Next, future research could focus on continuously collecting papers and expanding the scope to encompass the broader concept of generative AI, such as multimodal LLMs (e.g., GPT-4o). Given that the landscape of LLMs has undergone dramatic changes, with new models and breakthroughs emerging at an unprecedented pace, this work could allow for a more comprehensive longitudinal analysis of trends, innovations, and shifts in academia. Additionally, we could explore the interplay between academic research and industry developments and track how breakthroughs in LLMs influence and are influenced by practical applications. This could provide valuable insights into the trajectory of LLM and generative AI development and help identify emerging subfields, interdisciplinary connections, and potential areas for future breakthroughs.
Last, while this study conducts a before-and-after comparison, the observed differences should not be interpreted as causal, because inherent time-varying trends in collaboration and other confounding effects are not well controlled. Preliminary analyses using the Bayesian Structural Time Series (BSTS) model are conducted in this study (see Figure 7); however, a more robust quasi-experimental design incorporating control and treatment groups merits further exploration to infer the causal effects of LLMs on collaboration patterns.
Declarations
- Conflict of interest/Competing interests. The authors declare no competing interests.
- Data availability. The data files can be accessed at: https://doi.org/10.5281/zenodo.13118978
- Code availability. The code files can be accessed at: https://github.com/Lingyao1219/llm-science
Appendix A Data Preparation
A.1 Data preparation workflow
Figure 4 outlines the process for collecting and cleaning the dataset of LLM-relevant papers for our analysis. The process begins with a broad search using general terms related to LLMs and popular models based on the MMLU benchmark (see subsection A.2) [24], spanning from October 2018 to June 2024. We apply this search to the title and abstract to avoid excessive noise in the dataset. This initial search yields a collection of 97,242 papers, which then undergo a series of filtering steps to enhance relevance and remove duplicates.
The cleaning process involves several steps, each progressively narrowing down the dataset. First, the collection is limited to preprints and articles, reducing it to 87,489 papers. The next step involves removing duplicates based on titles. Additionally, since a preprint might change its title upon official publication, we examine papers with slight variations between preprints and formal publications using Jaccard similarity (see subsection A.3). After reviewing the papers with Jaccard similarity, we find that a paper with a similarity score greater than 0.6 is very likely to be a duplicate, and we remove those potential duplicates. However, it is still possible for a paper to contain LLM-related keywords without being relevant to LLM studies; for example, the keyword “PaLM” often appears in papers discussing “palm trees” or “palm oil.” To address this issue, we employ the GPT-4o model to support the evaluation of the relevance of a paper (see subsection A.4). This filtering process results in a final set of 50,391 papers, representing a focused and highly relevant collection for our analysis. The distribution of these 50,391 papers by field is presented in Figure 1.
A.2 Paper collection
Table 2 below shows the specific search terms used to collect papers from OpenAlex. The search terms include two general terms, “large language model” and “LLM,” along with popular open-source models (e.g., BERT, Flan-T5, LLaMA) and closed-source models (e.g., ChatGPT, Claude) based on the MMLU benchmark [24].
Category | Search terms |
---|---|
General LLM terms | (Large language model) OR LLM |
Closed-source | (GPT 2) OR (GPT 3) OR (GPT 4) OR ChatGPT OR (Claude Instant) OR (Claude 1.3) OR (Claude 2) OR (Claude 3) OR (Google PaLM) OR (Google Gemini) OR (PaLM 2) OR (Gemini Pro) OR (Gemini Ultra) OR (davinci-002) OR (davinci-003) OR (Chinchilla 70B) |
Open-source | BERT OR RoBERTa OR (Meta LLaMA) OR (LLaMA 2) OR (LLaMA 3) OR Mistral OR Mixtral OR Qwen OR DBRX OR (Falcon 40B) OR (Falcon 7B) OR (Falcon 180B) OR (OPT 66B) OR (BLOOM 176B) OR (GLM 130B) OR (Flan T5) OR (Flan PaLM) |
A.3 Duplicates removal
We first remove duplicated papers based on their titles. This step reduces the collection to 70,644 papers. Given that a preprint might change its title when officially published, we then use the Jaccard similarity coefficient to identify works whose titles differ but that are actually the same paper. The Jaccard similarity coefficient is a measure of similarity between two sets. It is defined as the size of the intersection divided by the size of the union of the sets, as shown below:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (4)$$

In our context, sets $A$ and $B$ represent the collections of words from the titles and author names of a preprint and its paired articles, respectively. $|A \cap B|$ denotes the number of common words between the titles and author names of the preprint and its paired articles, while $|A \cup B|$ is the total number of unique words in both titles and author names combined. We review the paired works by their $J(A, B)$ scores and find that $J(A, B) > 0.6$ often implies the preprint and the published article are the same paper. Therefore, we remove all those preprints showing $J(A, B) > 0.6$.
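The title-and-author comparison described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the tokenization rule (lowercased alphanumeric words), the helper names, and the two example records are assumptions made for the sketch; only the Jaccard definition and the 0.6 cutoff come from the text.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric word tokens.

    The tokenization rule is an assumption; the paper does not specify one.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def record_tokens(title: str, authors: list[str]) -> set[str]:
    """Word set built from a work's title and its author names."""
    return tokenize(title) | tokenize(" ".join(authors))


# Hypothetical records, made up for illustration only.
preprint = record_tokens("Language models for protein design", ["A. Smith", "B. Lee"])
article = record_tokens("Large language models for protein design", ["A. Smith", "B. Lee"])

score = jaccard(preprint, article)
is_likely_duplicate = score > 0.6  # cutoff reported in the text
```

In this toy pair the two word sets share nine of ten unique tokens, so the pair would be flagged as a likely duplicate under the 0.6 cutoff.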
A.4 Relevance evaluation
This step involves filtering out papers that are not actually discussing LLM-related topics, even if their abstracts contain key terms such as “PaLM.” We take two steps to do the filtering. First, we consider papers with identified topics (provided by OpenAlex) including “natural language processing,” “artificial intelligence,” “machine learning,” “text mining,” “deep learning,” “transfer learning,” “question answering,” and “speech recognition” as LLM-relevant. Then, we use GPT-4o to determine their relevance based on the title, abstract, and topics. The prompt is designed as follows:
To validate the classification, we manually review 190 randomly selected papers, of which 86 are classified as irrelevant and 104 as relevant based on our manual annotation. We then compare these results with those returned by GPT-4o. We use F1-score and accuracy to evaluate performance. Overall, GPT-4o’s identification achieves an agreement of 0.96 with manual annotation, with F1-scores of 0.96 for both irrelevant and relevant classes.
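For readers who want to reproduce this kind of validation, the sketch below shows how accuracy and per-class F1 can be computed from paired manual and model labels. The label lists are made-up toy data, not the 190 annotated papers, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of items where the model label matches the manual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1-score for one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels for illustration only (not the paper's annotations).
manual = ["relevant"] * 4 + ["irrelevant"] * 4
model = ["relevant", "relevant", "relevant", "irrelevant",
         "irrelevant", "irrelevant", "irrelevant", "relevant"]

acc = accuracy(manual, model)
f1_relevant = f1_for_class(manual, model, "relevant")
f1_irrelevant = f1_for_class(manual, model, "irrelevant")
```

Reporting F1 for each class separately, as the text does, guards against a classifier that looks accurate only because one class dominates the sample.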
Appendix B Statistical Analysis
B.1 Wilcoxon rank-sum test
Figure 5 and Figure 6 provide statistical support for the analysis presented in subsection 3.1. Both figures employ the Wilcoxon rank-sum test, a non-parametric method ideal for comparing two groups of data that are either interval-scaled or not normally distributed, to quantify the statistical significance of observed differences in entropy. Figure 5 examines whether there is a significant difference in entropy before and after the release of ChatGPT, with subfigure (a) illustrating the entropy distribution and test results based on authors’ institutions, and subfigure (b) presenting the same analysis for authors’ departments. Figure 6 investigates whether two fields display significant differences in entropy, with subfigure (a) showing the Wilcoxon rank-sum test results comparing entropy across two fields before and after ChatGPT’s release based on authors’ institutions, and subfigure (b) presenting the same comparison based on authors’ departments.
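As a rough illustration of the test used here, the following sketch implements the Wilcoxon rank-sum statistic with a two-sided p-value from the normal approximation. It is a simplified stand-in, not the authors' code: it applies no tie or continuity correction, and the two entropy samples are invented for the example. In practice a statistics library would normally be used instead.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Wilcoxon rank-sum z statistic and two-sided p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2           # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Invented entropy values for one field, before and after an intervention.
entropy_before = [1.0, 1.2, 1.1, 0.9, 1.3]
entropy_after = [1.4, 1.6, 1.5, 1.7, 1.25]

z, p_value = rank_sum_test(entropy_before, entropy_after)
```

A negative z here means the first sample tends to rank lower than the second, which in this toy setup corresponds to lower entropy before the intervention.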
B.2 Bayesian structural time series analysis
The entropy changes after the launch of ChatGPT cannot be attributed solely to a causal effect, because entropy has inherent trends over time. To address this, a Bayesian structural time series (BSTS) model is fitted for each field for time series forecasting and causal inference [33, 34]. Figure 7 displays results where the impact of ChatGPT is statistically significant. Trained on pre-intervention data, the BSTS model predicts post-intervention entropy (the dashed line in each panel’s top subfigure). The difference between these predictions and observations provides an estimate of the intervention’s pointwise effect (second subfigure in each panel), and their cumulative sum indicates the total effect over time (third subfigure in each panel). As shown, the release of ChatGPT consistently leads to a statistically significant increase in the entropy in Computer Science. It also leads to a statistically significant decrease in the entropy in Medicine based on institutions, and a statistically significant decrease in the entropy in Business, Management & Accounting based on departments. Other fields, however, do not show significant impacts from ChatGPT based on the BSTS causal inference.
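The logic of the three subfigures (counterfactual prediction, pointwise effect, cumulative effect) can be illustrated with a deliberately simplified sketch. Note that this is not a BSTS model: it swaps in an ordinary least-squares linear trend fitted to the pre-intervention points, has no uncertainty intervals, and uses invented numbers. It only shows how the pointwise and cumulative effects are derived once a counterfactual forecast exists.

```python
def fit_linear_trend(y):
    """Least-squares line y = intercept + slope * t for t = 0 .. len(y) - 1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def intervention_effects(pre, post):
    """Forecast the post period from the pre-period trend and compare with observations."""
    intercept, slope = fit_linear_trend(pre)
    start = len(pre)
    # Counterfactual: what the pre-period trend would predict after the intervention.
    predicted = [intercept + slope * (start + k) for k in range(len(post))]
    # Pointwise effect: observed minus predicted at each post-period step.
    pointwise = [obs - pred for obs, pred in zip(post, predicted)]
    # Cumulative effect: running sum of the pointwise effects.
    cumulative = []
    running = 0.0
    for effect in pointwise:
        running += effect
        cumulative.append(running)
    return predicted, pointwise, cumulative


# Invented entropy series for illustration only.
pre_entropy = [1.0, 1.1, 1.2, 1.3]
post_entropy = [1.6, 1.7, 1.9]

predicted, pointwise, cumulative = intervention_effects(pre_entropy, post_entropy)
```

A full BSTS analysis would replace the linear trend with a state-space model and report posterior intervals around each of these three quantities, which is what determines statistical significance.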
Appendix C OpenAlex topic and field classification
OpenAlex has identified 4,516 topics based on the publication-level classification system developed by Waltman and Van Eck [35]. The classification of topics for each paper is based on OpenAlex’s proprietary model, which fine-tunes the multilingual BERT (mBERT) model for topic classification. The model’s input comprises a paper’s title, abstract, and citations, although only approximately half of the papers have usable abstracts. According to OpenAlex’s performance report, when all information for a paper is available, the model achieves a (top K = 1) accuracy of 0.72. The (top K = 1) accuracy refers to the percentage of samples where the correct label appears as the top prediction. On average, their model achieves a (top K = 1) accuracy of 0.53 and a (top K = 3) accuracy of 0.73. OpenAlex provides up to three topics for each paper [23].
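The top-K accuracy metric quoted above can be made concrete with a short sketch. The topic labels and ranked predictions below are invented for illustration, and the function name is hypothetical; the definition follows the one given in the text, extended to K = 3 in the usual way (the correct label appears among the first K predictions).

```python
def top_k_accuracy(true_labels, ranked_predictions, k):
    """Share of samples whose true label is among the first k ranked predictions."""
    hits = sum(
        true in ranked[:k]
        for true, ranked in zip(true_labels, ranked_predictions)
    )
    return hits / len(true_labels)


# Invented example: four papers, each with three ranked topic predictions.
true_topics = ["nlp", "vision", "genomics", "nlp"]
predicted_topics = [
    ["nlp", "ml", "ir"],
    ["ml", "vision", "nlp"],
    ["ml", "nlp", "vision"],
    ["ml", "ir", "nlp"],
]

top1 = top_k_accuracy(true_topics, predicted_topics, k=1)
top3 = top_k_accuracy(true_topics, predicted_topics, k=3)
```

Because top-3 accuracy counts a hit whenever the correct topic appears anywhere in the first three predictions, it can never be lower than top-1 accuracy on the same data.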
Subsequently, OpenAlex establishes a one-to-one relationship between these topics and higher-level fields. The fields are organized in a hierarchical structure, including subfields, fields, and domains, derived from Scopus’s ASJC (All Science Journal Classification) structure. This matching process is conducted using an LLM and further verified by OpenAlex’s annotators. As a result, each paper can be associated with up to three fields, corresponding to the three identified topics [23].
While the performance of the field classification is not explicitly reported, it is reasonable to assume that the accuracy at the level of the 26 fields could be significantly higher than the reported (top K = 1) accuracy for topic classification. This assumption rests on the fact that field classification maps a larger set of 4,516 topics onto a smaller set of fields, so a paper assigned to an incorrect topic may still fall within the correct field [23].
Appendix D Network analysis for countries
The network metrics based on authors’ affiliated country information are presented in Table 3, and the network visuals are presented in Figure 8.
Metric | Countries |
---|---|
No. Nodes | 155 |
No. Edges | 1745 |
Density | 0.15 |
Avg. Shortest Path | 9.36 |
Clustering Coeff. | 0.77 |
No. Components | 16 |
Largest Component | 140 |
Power-law Exponent | 5.22 |
No. Comm. (Louvain) | 20 |
No. Comm. (CNM) | 19 |
Degree Centrality (top 5) | US (109), GB (91), CN (81), DE (79), CA (75) |
Betweenness Centrality (top 5) | PL (0.056), GB (0.053), CA (0.051), FR (0.050), TR (0.050) |
Closeness Centrality (top 5) | PT (0.157), MX (0.157), QA (0.155), TR (0.154), SY (0.154) |
Eigenvector Centrality (top 5) | US (0.630), CN (0.498), GB (0.395), DE (0.222), CA (0.194) |
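As a quick consistency check on Table 3, the density can be recomputed from the node and edge counts. The sketch assumes the country network is an undirected simple graph (no self-loops or parallel edges), which is not stated explicitly in the table; under that assumption density is 2E / (N(N - 1)).

```python
def undirected_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected simple graph: 2E / (N * (N - 1))."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# Node, edge, and component counts taken from Table 3.
density = undirected_density(num_nodes=155, num_edges=1745)
largest_component_share = 140 / 155  # fraction of countries in the largest component
```

Under this assumption the computed density is about 0.146, which rounds to the 0.15 reported in the table, and roughly 90% of countries sit in the largest connected component.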
References
- Shi and Evans [2023] Shi, F., Evans, J.: Surprising combinations of research contents and contexts are related to impact and emerge with scientific outsiders from distant disciplines. Nature Communications 14, 1641 (2023) https://doi.org/10.1038/s41467-023-36741-4
- Dinh et al. [2024] Dinh, L., Barley, W.C., Johnson, L., Allan, B.F.: Hyperauthored papers disproportionately amplify important egocentric network metrics. Quantitative Science Studies, 1–24 (2024)
- Venturini et al. [2024] Venturini, S., Sikdar, S., Rinaldi, F., Tudisco, F., Fortunato, S.: Collaboration and topic switches in science. Scientific Reports 14(1), 1258 (2024) https://doi.org/10.1038/s41598-024-51606-6
- Khraisha et al. [2024] Khraisha, Q., Put, S., Kappenberg, J., Warraitch, A., Hadfield, K.: Can large language models replace humans in systematic reviews? evaluating gpt-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Research Synthesis Methods (2024)
- Le [2023] Le, F.: How chatgpt is transforming the postdoc experience. Nature 622, 655 (2023)
- Pilny et al. [2024] Pilny, A., McAninch, K., Slone, A., Moore, K.: From manual to machine: assessing the efficacy of large language models in content analysis. Communication Research Reports, 1–10 (2024)
- Byun et al. [2023] Byun, C., Vasicek, P., Seppi, K.: Dispensing with humans in human-computer interaction research. In: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–26 (2023)
- Ferruz et al. [2022] Ferruz, N., Schmidt, S., Höcker, B.: Protgpt2 is a deep unsupervised language model for protein design. Nature Communications 13(1), 4348 (2022)
- Yang et al. [2022] Yang, X., Chen, A., PourNejatian, N., Shin, H.C., Smith, K.E., Parisien, C., Compas, C., Martin, C., Costa, A.B., Flores, M.G., et al.: A large language model for electronic health records. npj Digital Medicine 5(1), 194 (2022)
- Savage [2023] Savage, N.: Drug discovery companies are customizing chatgpt: here’s how. Nature Biotechnology 41(5), 585–586 (2023)
- Ma et al. [2024] Ma, Y., Cui, C., Cao, X., Ye, W., Liu, P., Lu, J., Abdelraouf, A., Gupta, R., Han, K., Bera, A., et al.: Lampilot: An open benchmark dataset for autonomous driving with language model programs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15141–15151 (2024)
- Kuckreja et al. [2024] Kuckreja, K., Danish, M.S., Naseer, M., Das, A., Khan, S., Khan, F.S.: Geochat: Grounded large vision-language model for remote sensing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 27831–27840 (2024)
- Argyle et al. [2023] Argyle, L.P., Busby, E.C., Fulda, N., Gubler, J.R., Rytting, C., Wingate, D.: Out of one, many: Using language models to simulate human samples. Political Analysis 31(3), 337–351 (2023)
- Li et al. [2024] Li, L., Fan, L., Atreja, S., Hemphill, L.: “hot” chatgpt: The promise of chatgpt in detecting and discriminating hateful, offensive, and toxic comments on social media. ACM Transactions on the Web 18(2), 1–36 (2024)
- Wu et al. [2023] Wu, S., Irsoy, O., Lu, S., Dabravolski, V., Dredze, M., Gehrmann, S., Kambadur, P., Rosenberg, D., Mann, G.: Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023)
- Susarla et al. [2023] Susarla, A., Gopal, R., Thatcher, J.B., Sarker, S.: The janus effect of generative ai: Charting the path for responsible conduct of scholarly activities in information systems. Information Systems Research 34(2), 399–408 (2023)
- Dwivedi et al. [2023] Dwivedi, Y.K., Kshetri, N., Hughes, L., Slade, E.L., Jeyaraj, A., Kar, A.K., Baabdullah, A.M., Koohang, A., Raghavan, V., Ahuja, M., et al.: Opinion paper:“so what if chatgpt wrote it?” multidisciplinary perspectives on opportunities, challenges and implications of generative conversational ai for research, practice and policy. International Journal of Information Management 71, 102642 (2023)
- Einarsson et al. [2024] Einarsson, H., Lund, S.H., Jónsdóttir, A.H.: Application of chatgpt for automated problem reframing across academic domains. Computers and Education: Artificial Intelligence 6, 100194 (2024)
- Owens [2023] Owens, B.: How nature readers are using chatgpt. Nature 615(7950), 20 (2023)
- Agathokleous et al. [2023] Agathokleous, E., Saitanis, C.J., Fang, C., Yu, Z.: Use of chatgpt: What does it mean for biology and environmental science? Science of The Total Environment 888, 164154 (2023)
- Meyer et al. [2023] Meyer, J.G., Urbanowicz, R.J., Martin, P.C., O’Connor, K., Li, R., Peng, P.-C., Bright, T.J., Tatonetti, N., Won, K.J., Gonzalez-Hernandez, G., et al.: Chatgpt and large language models in academia: opportunities and challenges. BioData Mining 16(1), 20 (2023)
- Barley et al. [2022] Barley, W.C., Dinh, L., Workman, H., Fang, C.: Exploring the relationship between interdisciplinary ties and linguistic familiarity using multilevel network analysis. Communication Research 49(1), 33–60 (2022) https://doi.org/10.1177/0093650220926001
- Priem et al. [2022] Priem, J., Piwowar, H., Orr, R.: Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022)
- Hendrycks et al. [2020] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)
- Breiger [1974] Breiger, R.L.: The duality of persons and groups. Social Forces 53(2), 181–190 (1974)
- Newman [2005] Newman, M.E.: Power laws, pareto distributions and zipf’s law. Contemporary Physics 46(5), 323–351 (2005)
- Tomašev et al. [2020] Tomašev, N., Cornebise, J., Hutter, F., Mohamed, S., Picciariello, A., Connelly, B., Belgrave, D.C., Ezer, D., Haert, F.C.v.d., Mugisha, F., et al.: Ai for social good: unlocking the opportunity for positive impact. Nature Communications 11(1), 2468 (2020)
- Chakraborty et al. [2023] Chakraborty, C., Bhattacharya, M., Lee, S.-S.: Artificial intelligence enabled chatgpt and large language models in drug target discovery, drug discovery, and development. Molecular Therapy-Nucleic Acids 33, 866–868 (2023)
- Jiang et al. [2023] Jiang, L.Y., Liu, X.C., Nejatian, N.P., Nasir-Moin, M., Wang, D., Abidin, A., Eaton, K., Riina, H.A., Laufer, I., Punjabi, P., et al.: Health system-scale language models are all-purpose prediction engines. Nature 619(7969), 357–362 (2023)
- Lu et al. [2024] Lu, M.Y., Chen, B., Williamson, D.F., Chen, R.J., Zhao, M., Chow, A.K., Ikemura, K., Kim, A., Pouli, D., Patel, A., et al.: A multimodal generative ai copilot for human pathology. Nature, 1–3 (2024)
- Peng et al. [2023] Peng, C., Yang, X., Chen, A., Smith, K.E., PourNejatian, N., Costa, A.B., Martin, C., Flores, M.G., Zhang, Y., Magoc, T., et al.: A study of generative large language model for medical research and healthcare. npj Digital Medicine 6(1), 210 (2023)
- Zhang et al. [2024] Zhang, L., Cao, Z., Shang, Y., Sivertsen, G., Huang, Y.: Missing institutions in openalex: possible reasons, implications, and solutions. Scientometrics, 1–23 (2024)
- Brodersen et al. [2015] Brodersen, K.H., Gallusser, F., Koehler, J., Remy, N., Scott, S.L.: Inferring causal impact using bayesian structural time-series models. Annals of Applied Statistics 9, 247–274 (2015)
- Hu and Chen [2021] Hu, S., Chen, P.: Who left riding transit? examining socioeconomic disparities in the impact of covid-19 on ridership. Transportation Research Part D: Transport and Environment 90, 102654 (2021)
- Waltman and Van Eck [2012] Waltman, L., Van Eck, N.J.: A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology 63(12), 2378–2392 (2012)