
Generation and human-expert evaluation of interesting research ideas
using knowledge graphs and large language models

Xuemei Gu    Mario Krenn
Max Planck Institute for the Science of Light, Staudtstrasse 2, 91058 Erlangen, Germany
xuemei.gu@mpl.mpg.de
mario.krenn@mpl.mpg.de
Abstract

Advanced artificial intelligence (AI) systems with access to millions of research papers could inspire new research ideas that may not be conceived by humans alone. However, how interesting are these AI-generated ideas, and how can we improve their quality? Here, we introduce SciMuse, a system that uses an evolving knowledge graph built from more than 58 million scientific papers to generate personalized research ideas via an interface to GPT-4. We conducted a large-scale human evaluation with over 100 research group leaders from the Max Planck Society, who ranked more than 4,000 personalized research ideas based on their level of interest. This evaluation allows us to understand the relationships between scientific interest and the core properties of the knowledge graph. We find that data-efficient machine learning can predict research interest with high precision, allowing us to optimize the interest-level of generated research ideas. This work represents a step towards an artificial scientific muse that could catalyze unforeseen collaborations and suggest interesting avenues for scientists.

Introduction

A compelling idea is often at the heart of successful research projects, crucial for their success and impact. However, with the enormous growth in the number of scientific papers published each year [1, 2, 3], it becomes increasingly difficult for researchers to uncover novel and interesting ideas. This difficulty is even more pronounced for those looking for interdisciplinary research avenues or collaborations, as they face an overwhelming sea of literature.

Automated systems capable of extracting insights from the millions of scientific papers might offer a solution [4, 2]. One promising approach involves the use of knowledge graphs, which map the relationships between different research concepts and domains. In a pioneering work, the authors of [5] demonstrate potentially more efficient research strategies in the field of biochemistry by compressing the content of millions of scientific papers into knowledge graphs. These graphs not only help in mapping existing knowledge but also enable the discovery of surprising and impactful ideas by linking previously unconnected concepts. For instance, researchers have utilized knowledge graphs to forecast future research directions in quantum physics [6], biomedicine [7, 8], and artificial intelligence [9]. Beyond trend prediction and uncovering new links, these approaches have demonstrated that surprising combinations of research concepts are strongly associated with high-impact discoveries [10]. Additionally, human-aware AI systems can generate scientifically promising ‘alien’ hypotheses that might otherwise be overlooked [11], and knowledge graphs have been used to predict the impact of new research connections before a paper is written [12].

Some recent efforts demonstrate how to generate research ideas in the form of text. One such example is PaperRobot, which starts with a knowledge graph and incrementally translates the idea into a draft of a paper [13]. With the growing prominence of large language models (LLMs), various systems now leverage these models to generate research ideas. SciMON, for instance, generates novel scientific ideas by comparing them to prior work and continuously enhancing their novelty [14]. Another system uses LLMs to mine large-scale scientific literature and generate hypotheses by finding unanticipated connections between research topics [15]. Similarly, ResearchAgent develops new research ideas by analyzing scientific literature and refining them progressively to ensure both novelty and relevance [16].

While novelty and relevance of the generated ideas are crucial, a critical question arises: Are these AI-generated research ideas interesting for human scientists? The aforementioned works conducted small-scale human evaluations involving one biomedical domain expert [13], six natural language processing (NLP) PhD students [14], three social science PhD students [15] and ten PhD students in computer science and biomedicine [16].

However, it is often experienced researchers who define and evaluate research projects by writing and assessing research grant applications, as well as leading and shaping the research agenda of their groups. It would be particularly insightful to see how these experienced scientists evaluate AI-generated project ideas. With more evaluators and a greater number of evaluations, we could develop tools to predict which research ideas will be interesting in the future. This is precisely the goal of our paper, aiming to suggest interesting research projects and collaborations for scientists.

Here, we introduce SciMuse, a system designed to suggest new personalized research ideas for individual scientists or collaborations between researchers. To achieve this, we first generate a knowledge graph from more than 58 million papers, incorporating semantic and impact information. We then identify sub-graphs relevant to the research interests of individual scientists and use these sub-graphs to select research topics. Using GPT-4 [17], we formulate these research topics into comprehensive research suggestions. To evaluate our approach, we conducted a large-scale survey with over 100 research group leaders from the Max Planck Society in natural sciences and technology (such as the Institutes for Biogeochemistry, Astrophysics, Quantum Optics, and Intelligent Systems) and social sciences and humanities (such as the Institutes for Geoanthropology, Demographic Research, and Human Development). These experienced researchers assessed the interest level of more than 4,000 personalized AI-generated project suggestions. We analyzed the evaluations and found clear correlations between the properties of the knowledge graph and the interest level of the research suggestions. Using these correlations, we trained a machine learning model to predict research interest based solely on knowledge graph data. The model achieves high precision for the top-N predicted interesting suggestions, with precision exceeding 50% for $N \leq 15$. Our findings demonstrate the potential of SciMuse for suggesting highly interesting research ideas and collaborations, highlighting the role of artificial intelligence as a source of inspiration in science [18, 19, 20, 21].

Results

Creating the knowledge graph – While we could directly use publicly available large language models such as GPT-4 [17], Gemini [22], or Claude [23] to suggest new research ideas and collaborations, our control over the generated ideas would be limited to the structure of the prompt. Therefore, we decided to build a large knowledge graph from the scientific literature to identify the personalized research interests of scientists.

Figure 1: SciMuse suggests research ideas or collaborations using a knowledge graph and GPT-4. (a), Generation of the knowledge graph. Nodes represent scientific concepts extracted from about 2.44 million paper titles and abstracts from four academic preprint servers. We created a concept list using natural language processing (NLP) tools such as RAKE [24], then refined it with customized NLP techniques, manual review, and GPT, removing non-conceptual phrases such as verbs, ordinal numbers, conjunctions, and adverbials. Wikipedia was used to restore any mistakenly removed concepts. In the end, we obtained a final list of 123,128 concepts. Edges are created when two concepts co-occur in the title or abstract of any of the more than 58 million scientific papers in the OpenAlex database [25]. These edges are augmented with citation information, which can serve as a proxy for impact. As an example, a mini knowledge graph is shown for two randomly selected papers [26, 27] in OpenAlex. (b), AI-generated research collaborations. We first process the publications of Researcher A and Researcher B through our refined concept list from (a), generating individual concept lists for each researcher. We then use GPT-4 to enhance these lists into high-quality concept representations. These refined lists identify distinct subnetworks within our knowledge graph that correspond to each researcher's interests. To propose research collaborations or ideas, we identify and combine relevant concept pairs between the two researchers along with their research information. This combined input is then fed into GPT-4, which generates personalized research ideas or collaboration projects.
Figure 2: Large-scale human evaluation within the Max Planck Society. (a)-(b), A total of 4,451 AI-generated personalized research suggestions were evaluated by 110 research group leaders. Each suggestion proposes a collaboration between the evaluating researcher (Researcher A) and another researcher (Researcher B) from the Max Planck Society. These proposed collaborations are visualized as edges on a graph, where an edge is bi-colored from orange (representing Researcher A) to green (representing Researcher B). If Researchers A and B are from the same institute, this is indicated by a purple circle around that institute. The transparency of the edge is proportional to the number of evaluated suggestions. Additionally, the research fields of the researchers are categorized into natural science (denoted by a blue dot, labeled as nat) and social science (denoted by a red dot, labeled as soc). The map of Germany is based on the GISCO statistical unit dataset from Eurostat [28]. (c), For each research suggestion, participants were asked to rate their interest on a scale from 1 ('not interesting') to 5 ('very interesting'). The summary figure displays the distribution of these ratings. In total, 394 research suggestions were rated as very interesting, and 713 ideas received a rating of 4. The figure includes separate sections for responses where both researchers are from the same institute, as well as for those from different institutes, further categorized by their affiliation with either the natural science or social science faculties.
Figure 3: Analysis of interest levels versus knowledge graph features. We analyzed how eight individual features of a knowledge graph relate to researchers’ interest levels. After normalizing these features using z-scores, we arranged them from lowest to highest and segmented the data into 50 equal groups. For each group, we plotted the average normalized feature value (x-axis) alongside the corresponding average interest value (y-axis), including the standard deviation for each point, to identify trends in how different graph features influence researchers’ preferences. Features (a) and (b) relate to node features, (c) to (e) to node citation metrics, (f) is an edge feature, (g) is an edge citation metric, and (h) represents the semantic distance of the two researchers’ sub-networks (larger values mean that the researcher’s scientific fields are further apart). The plot includes data points in blue representing all 2,996 responses, green for the top 50% of research questions by predicted impact, and red for the top 25%.

The knowledge graph, depicted in Fig. 1(a), consists of vertices representing scientific concepts, and edges are drawn when two concepts jointly appear in the title or abstract of a scientific paper. The concept list is generated from the titles and abstracts of around 2.44 million papers from arXiv, bioRxiv, ChemRxiv, and medRxiv, with a data cutoff in February 2023. The Rapid Automatic Keyword Extraction (RAKE) algorithm, based on statistical text analysis, is used to extract candidate concepts [24]. These candidates are further refined using GPT, Wikipedia, and human annotators, resulting in 123,128 concepts in the natural and social sciences. We then use more than 58 million scientific papers from the open-source database OpenAlex [25] to create edges. These edges contain information about the co-occurrence of concepts in scientific papers (in titles and abstracts) and their subsequent citation rates. This knowledge graph representation was recently introduced in [12] to predict the impact of future research topics. As a result, we have an evolving knowledge graph that captures part of the evolution of science from 1665 (a text by Robert Hooke on the observation of a great spot on Jupiter [29]) to April 2023. Details of the knowledge graph generation are given in Fig. 1(a) and the Appendix.
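To make this construction concrete, here is a minimal sketch of the edge-building step, assuming `papers` is an iterable of (title-plus-abstract, year, citation count) records and `concept_set` is the final list of 123,128 concepts; all names are illustrative rather than taken from a released codebase.

```python
from itertools import combinations

import networkx as nx


def find_concepts(text, concept_set):
    """Return every concept that occurs in a lower-cased title+abstract."""
    text = text.lower()
    return {c for c in concept_set if c in text}


def build_knowledge_graph(papers, concept_set):
    """papers: iterable of (title_plus_abstract, year, citations) records."""
    graph = nx.Graph()
    graph.add_nodes_from(concept_set)
    for text, year, citations in papers:
        found = find_concepts(text, concept_set)
        # Every co-occurring concept pair adds one edge event, annotated
        # with the paper's year and citation count (a proxy for impact).
        for u, v in combinations(sorted(found), 2):
            if graph.has_edge(u, v):
                graph[u][v]["events"].append((year, citations))
            else:
                graph.add_edge(u, v, events=[(year, citations)])
    return graph
```

At the scale of 58 million papers, the naive substring matching above would be replaced by a fast multi-pattern matcher (e.g., Aho–Corasick), but the resulting graph structure is the same.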

Figure 4: Learning Scientific Interest. (a), We use the evaluations from research group leaders to train a neural network. This model predicts whether research suggestions are assigned an interest level of 4 or 5 (on a 5-point scale) or below 4, thereby setting up a binary classification task. The input to the neural network consists of 25 features from the knowledge graph of a concept pair, and its output is a single number indicating whether the concept pair is highly interesting (i.e., interest level $\geq 4$) or not. Given the small size of our training dataset, which comprises a total of 2,996 evaluated research suggestions generated through our knowledge graph, we employ Monte Carlo cross-validation to determine the accuracy of our learning process. (b), The ROC curve indicates that we can correctly rank a randomly chosen highly interesting concept pair above a randomly chosen not-highly-interesting concept pair in nearly 65% of cases. (c), The precision of our model for the top-N highest-interest research suggestions is significantly higher than for a random selection of suggestions. For the Top-1 suggestion, the precision exceeds 70%, and for the Top-5 suggestions it is still above 60%. (d), The probability of having at least one high-interest suggestion among N research suggestions is significantly higher than with a random selection. This indicates that our machine learning model, which has access to the knowledge graph in conjunction with GPT-4, is able to produce more high-interest research suggestions than GPT-4 itself.

Personalized research suggestions – We focus on generating personalized research proposals for collaborations between two scientists, both group leaders from the Max Planck Society. One of these researchers will later evaluate the proposal.

To generate suggestions for pairs of researchers, as depicted in Fig. 1(b), we begin by identifying the research interests of both Researcher A and Researcher B. This is done by analyzing all their published papers from the past two years. Specifically, we extract their concepts from the titles and abstracts of these papers using the full concept list shown in Fig. 1(a). The personalized concept lists are further refined by GPT-4. Consequently, we are able to construct a subgraph of the knowledge graph for each researcher based on their personalized concepts.
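A minimal sketch of this personalization step, reusing the illustrative `find_concepts` helper from above; the GPT-4 refinement of the concept list is omitted here.

```python
import networkx as nx


def researcher_subgraph(graph, recent_texts, concept_set):
    """Induce the subgraph spanned by concepts in a researcher's recent papers."""
    personal = set()
    for text in recent_texts:  # titles+abstracts from the past two years
        personal |= find_concepts(text, concept_set)
    # In the actual pipeline, this raw list is refined by GPT-4 first.
    return graph.subgraph(personal).copy()
```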

With the researchers’ subgraphs, we generate a prompt for GPT-4 to create a research project (details in the Appendix). In the prompt, we provide the titles of up to seven papers from each researcher and ask GPT-4 to create a research project based on two selected scientific concepts. We choose these concepts in three different ways. In one-third of the cases, we use a randomly sampled concept pair, with one concept from each researcher. In another third, we select the concept pair with the highest predicted future impact, using an adaptation of [12]. In the final third, we do not specify concept pairs, instead asking GPT-4 to create the project using only the paper titles. Although we cannot directly relate these pure GPT-4 suggestions to knowledge graph features and interest levels from human evaluation, they serve as an important sanity check for our method (see Appendix). The prompt itself employs self-reflection, as described in [30], to improve the response. Specifically, we ask GPT-4 to generate three ideas, reflect upon them, and improve them twice. In the end, GPT-4 selects the most suitable project idea as the final result.
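The three ways of choosing concept pairs could be expressed as in the following sketch, where `impact_model` stands in for the impact predictor adapted from [12] and is assumed to map a concept pair to a predicted impact score.

```python
import random


def select_concept_pair(sub_a, sub_b, impact_model=None, mode="random"):
    """Pick one concept from each researcher's subgraph.

    mode="random": uniformly sampled pair (one third of suggestions).
    mode="impact": pair with the highest predicted future impact (one third).
    mode=None:     no pair; GPT-4 works from the paper titles alone (final third).
    """
    if mode is None:
        return None
    pairs = [(a, b) for a in sub_a.nodes for b in sub_b.nodes if a != b]
    if mode == "random":
        return random.choice(pairs)
    return max(pairs, key=impact_model)  # highest predicted impact
```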

Human Evaluation – To assess how interesting these AI-generated ideas are, we asked research group leaders at scientific institutes, who regularly deal with and act upon research ideas, to participate in the evaluation. Specifically, 110 research group leaders from 54 Max Planck Institutes within the Max Planck Society (one of the largest research societies worldwide) participated (see Fig. 2(a) and (b)). They were tasked with evaluating up to 48 personalized research projects for their interest level, ranging from 1 (‘not interesting’) to 5 (‘very interesting’). Of the 110 researchers, 104 are from natural science institutes, and 6 are from social science institutes. In total, we received 4,451 responses. The full statistics are shown in Fig. 2(c). Notably, 1,107 projects received an interest level of 4 or 5 (nearly 25% of the projects), with 394 of these being ranked as very interesting.

Interest versus knowledge graph features – We find that, on average, there is no significant difference in the interest value between projects generated by the three different methods: random concept pairs, high-impact concept pairs, and without concept pairs. The fact that the sanity test (a project generated without a concept pair in the prompt) and the cases where we provide concept pairs yield very similar results allows us to further analyze which knowledge graph features strongly influence the interest. If we can determine which features affect the interestingness of a research project, we can use this insight in the future to suggest research projects with higher research interest.

We first compute 144 knowledge graph features for each concept pair used in a research project. The first 141 features are the same as those used to predict the future impact of concept pairs, as described in [12]. The features include node characteristics of the first and second concepts, such as node degree and PageRank [31], as well as edge features, such as the Simpson similarity and the Sørensen–Dice coefficient [32]. Additionally, several features are based on impact information, such as citations within the last year. The final three features are the predicted impact and two different distance metrics of the researchers' subgraphs (see Fig. 1(b)). The first distance metric considers only the subgraphs, using the concepts from Researcher A's and Researcher B's concept lists to determine the distance between these subgraphs. The second metric accounts for the entire neighborhood of the subgraphs; we refer to it as the semantic distance. For this, we collect the neighbors of all concepts in both researchers' concept lists and determine the distance between these expanded subgraphs built from neighboring concepts.
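A few of these features can be computed directly from the graph, as in the hedged sketch below (illustrative names; the full 144-feature set follows [12]).

```python
import networkx as nx


def pair_features(graph, u, v, pagerank):
    """A handful of node and edge features for the concept pair (u, v)."""
    nu, nv = set(graph.neighbors(u)), set(graph.neighbors(v))
    overlap = len(nu & nv)
    return {
        "degree_u": graph.degree(u),
        "degree_v": graph.degree(v),
        "pagerank_u": pagerank[u],  # from nx.pagerank(graph), computed once
        "pagerank_v": pagerank[v],
        # Simpson similarity: overlap normalized by the smaller neighborhood.
        "simpson": overlap / max(1, min(len(nu), len(nv))),
        # Sørensen–Dice coefficient: twice the overlap over the degree sum.
        "sorensen_dice": 2 * overlap / max(1, len(nu) + len(nv)),
    }
```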

We then split the 2,996 suggested research projects, created using concept pairs from the knowledge graph, into 50 equally sized bins. For each bin, we compute the mean interest and its standard deviation.
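This binning step amounts to only a few lines, as in this sketch, assuming `feature` and `interest` are aligned NumPy arrays over the 2,996 projects.

```python
import numpy as np


def binned_trend(feature, interest, n_bins=50):
    """Z-score a feature, sort by it, split into equal bins, return bin means."""
    z = (feature - feature.mean()) / feature.std()
    order = np.argsort(z)
    x_bins = np.array_split(z[order], n_bins)
    y_bins = np.array_split(interest[order], n_bins)
    return (np.array([b.mean() for b in x_bins]),   # mean feature per bin
            np.array([b.mean() for b in y_bins]),   # mean interest per bin
            np.array([b.std() for b in y_bins]))    # interest spread per bin
```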

In Fig. 3, we display these correlations and identify several notable properties. For instance, the degree and PageRank of the first concept, selected from the evaluating researcher's concept list, are strongly negatively correlated with the human-evaluated interest level. This means that the more widely connected a concept is within the knowledge graph, the less appealing the research projects are. A similar effect is observed for the citation rate: the more frequently a concept has been cited in the past (in the last year, and summed over all years), the less interesting the research projects are judged to be. Some features, such as the rank of concept B's citation growth rate or the minimum count of the total number of papers containing concept A or B up to two years ago, show peculiar behaviour for very large or small values. This behaviour could be exploited to predict the interest level. On the other hand, we find a strong positive correlation between the Simpson similarity coefficient of the two concepts and the evaluated interest level. Additionally, using the semantic distance feature, we find a negative correlation in Fig. 3(h), indicating that research proposals between researchers in similar fields are considered more interesting than those between researchers in distant fields. This finding is consistent with Fig. 2(c), where research proposals from the same institute are generally considered more interesting than those from other institutes (with different research focus).

We show the correlations for all 2,996 answers containing concepts from the knowledge graph (blue), as well as for the top 50% and top 25% of concept pairs with the highest predicted impact (green and red, respectively) in Fig. 3, indicating that some correlations are stronger for suggestions using high-impact concept pairs.

Predicting interest – Given that the features of the knowledge graph significantly influence the interest in suggested research projects, we can take this analysis a step further by training a machine learning model to predict the level of interest based solely on these properties. If successful, this approach would allow us to suggest research projects that are more likely to be considered highly interesting in the future for scientists.

We start with a concept pair, compute the relevant features in the knowledge graph, and use these features to predict whether a research proposal will receive an interest rating of 4 or 5 (on a scale from 1 to 5: not interesting to very interesting) or below 4, as illustrated in Fig. 4(a). Due to the scarcity of training data – each data point representing the evaluation of a proposed research project's interestingness by a research group leader – we employ a low-data machine learning technique. Specifically, we use a small neural network with 25 individually high-performing features as input, a single hidden layer of 50 neurons, and one output neuron, trained with dropout [33]. To ensure robust evaluation and maximize the utility of our limited data, we utilize Monte Carlo cross-validation, also known as repeated random sub-sampling validation (see Appendix).
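In PyTorch, such a network could look like the sketch below. The layer sizes, dropout, and binary target follow the text and the Appendix; the ReLU activation and sigmoid output are our assumptions, since the paper specifies a mean-squared-error loss but not the activations.

```python
import torch.nn as nn


class InterestClassifier(nn.Module):
    """25 knowledge-graph features -> 50 hidden units with dropout -> 1 output."""

    def __init__(self, n_features=25, n_hidden=50, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),             # assumed activation
            nn.Dropout(p_drop),
            nn.Linear(n_hidden, 1),
            nn.Sigmoid(),          # score in (0, 1): chance that interest >= 4
        )

    def forward(self, x):
        return self.net(x)
```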

For our binary classification task, we achieve an average Area Under the Curve (AUC) of the receiver operating characteristic (ROC) curve [34] of nearly 2/3, as shown in Fig. 4(b). More relevant for our task is achieving high precision, as we want SciMuse to suggest highly interesting projects within a very small number of overall suggestions. For this, we compute the precision of the top-N highest-predicted concept pairs. For small N, we find a precision higher than 65%, indicating that among the highest-predicted concept pairs, roughly 65% are evaluated with a high interest level, as illustrated in Fig. 4(c). This precision is significantly higher than random selection, which achieves only 23%. Additionally, we can ask for the probability of obtaining at least one highly interesting suggestion within the first N suggestions. Fig. 4(d) shows that our machine learning method provides a significantly higher probability of finding interesting suggestions within the first few suggestions compared to random sampling.
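Both evaluation metrics reduce to a few lines of NumPy, sketched here under the assumption that `scores` are the model outputs and `labels` are binary high-interest indicators.

```python
import numpy as np


def top_n_precision(scores, labels, n):
    """Fraction of the n highest-scored suggestions that were rated 4 or 5."""
    top = np.argsort(scores)[::-1][:n]
    return labels[top].mean()


def at_least_one_hit(scores, labels, n):
    """Is at least one of the top-n suggestions truly high-interest?"""
    top = np.argsort(scores)[::-1][:n]
    return bool(labels[top].any())
```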

Discussion

Our results show that one can predict which project suggestions experienced researchers will find interesting by analyzing the knowledge-graph properties of the concept pairs used in the prompts to GPT-4, without considering the detailed text produced by GPT-4. This finding allows us to enhance SciMuse such that it can select novel, high-interest research topics from knowledge graphs and translate them into full-fledged proposals using modern large language models. As publicly available large language models like GPT-4 [17], Gemini 1.5 [22], Llama 3 [35], and Claude [23] become increasingly powerful, with improvements occurring nearly monthly [36], we anticipate that the generated personalized research ideas will become more targeted and reasonable.

The methodologies demonstrated in our work have the potential to inspire novel, unexpected cross-disciplinary research on a large scale. By providing a big-picture view through the analysis of millions of scientific papers, SciMuse enables the discovery of interesting research projects between scientists in different domains, which might otherwise be very challenging to find. Research projects in distant fields are known to have great potential for impactful, award-winning results [37, 5, 1, 2]. Therefore, large scientific societies, national funding agencies, and other stakeholders might be motivated to implement methodologies along the lines of SciMuse, which could foster highly interdisciplinary and interesting collaborations and ideas that might otherwise remain untapped. This, hopefully, could advance the progress and impact of science at a large scale.

Acknowledgements

The authors wholeheartedly thank all the researchers who spent the time participating in our study. The authors also thank the organizers of OpenAlex, arXiv, bioRxiv, and medRxiv for making scientific resources freely accessible. X.G. acknowledges support from the Alexander von Humboldt Foundation.

Ethics Statement

The research was reviewed and approved by the Ethics Council of the Max Planck Society.

References

  • Fortunato et al. [2018] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M. Petersen, F. Radicchi, R. Sinatra, B. Uzzi, et al., Science of science, Science 359, eaao0185 (2018).
  • Wang and Barabási [2021] D. Wang and A.-L. Barabási, The science of science (Cambridge University Press, 2021).
  • Bornmann et al. [2021] L. Bornmann, R. Haunschild, and R. Mutz, Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases, Humanities and Social Sciences Communications 8, 1 (2021).
  • Evans and Foster [2011] J. A. Evans and J. G. Foster, Metaknowledge, Science 331, 721 (2011).
  • Rzhetsky et al. [2015] A. Rzhetsky, J. G. Foster, I. T. Foster, and J. A. Evans, Choosing experiments to accelerate collective discovery, Proc. Natl. Acad. Sci. USA 112, 14569 (2015).
  • Krenn and Zeilinger [2020] M. Krenn and A. Zeilinger, Predicting research trends with semantic and neural networks with an application in quantum physics, Proc. Natl. Acad. Sci. USA 117, 1910 (2020).
  • Sybrandt et al. [2020] J. Sybrandt, I. Tyagin, M. Shtutman, and I. Safro, Agatha: automatic graph mining and transformer based hypothesis generation approach, in Proceedings of the 29th ACM International Conference on Information & Knowledge Management (2020) pp. 2757–2764.
  • Nadkarni et al. [2021] R. Nadkarni, D. Wadden, I. Beltagy, N. A. Smith, H. Hajishirzi, and T. Hope, Scientific language models for biomedical knowledge base completion: an empirical study, arXiv:2106.09700  (2021).
  • Krenn et al. [2023] M. Krenn, L. Buffoni, B. Coutinho, S. Eppel, J. G. Foster, A. Gritsevskiy, H. Lee, Y. Lu, J. P. Moutinho, N. Sanjabi, et al., Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network, Nature Machine Intelligence 5, 1326 (2023).
  • Shi and Evans [2023] F. Shi and J. Evans, Surprising combinations of research contents and contexts are related to impact and emerge with scientific outsiders from distant disciplines, Nature Communications 14, 1641 (2023).
  • Sourati and Evans [2023] J. Sourati and J. A. Evans, Accelerating science with human-aware artificial intelligence, Nature Human Behaviour 7, 1682 (2023).
  • Gu and Krenn [2024] X. Gu and M. Krenn, Forecasting high-impact research topics via machine learning on evolving knowledge graphs, arXiv:2402.08640  (2024).
  • Wang et al. [2019] Q. Wang, L. Huang, Z. Jiang, K. Knight, H. Ji, M. Bansal, and Y. Luan, Paperrobot: Incremental draft generation of scientific ideas, arXiv:1905.07870  (2019).
  • Wang et al. [2024] Q. Wang, D. Downey, H. Ji, and T. Hope, Scimon: Scientific inspiration machines optimized for novelty, Annual Meeting of the Association for Computational Linguistics  (2024).
  • Yang et al. [2023] Z. Yang, X. Du, J. Li, J. Zheng, S. Poria, and E. Cambria, Large language models for automated open-domain scientific hypotheses discovery, arXiv:2309.02726  (2023).
  • Baek et al. [2024] J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang, Researchagent: Iterative research idea generation over scientific literature with large language models, arXiv:2404.07738  (2024).
  • Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv:2303.08774  (2023).
  • Krenn et al. [2022] M. Krenn, R. Pollice, S. Y. Guo, M. Aldeghi, A. Cervera-Lierta, P. Friederich, G. dos Passos Gomes, F. Häse, A. Jinich, A. Nigam, et al., On scientific understanding with artificial intelligence, Nature Reviews Physics 4, 761 (2022).
  • Hope et al. [2023] T. Hope, D. Downey, D. S. Weld, O. Etzioni, and E. Horvitz, A computational inflection for scientific discovery, Communications of the ACM 66, 62 (2023).
  • Wang et al. [2023] H. Wang, T. Fu, Y. Du, W. Gao, K. Huang, Z. Liu, P. Chandak, S. Liu, P. Van Katwyk, A. Deac, et al., Scientific discovery in the age of artificial intelligence, Nature 620, 47 (2023).
  • AI4Science and Quantum [2023] Microsoft Research AI4Science and Microsoft Azure Quantum, The impact of large language models on scientific discovery: a preliminary study using gpt-4, arXiv:2311.07361  (2023).
  • Reid et al. [2024] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv:2403.05530  (2024).
  • Anthropic [2024] Anthropic, The Claude 3 model family: Opus, Sonnet, Haiku (2024), available at https://paperswithcode.com/paper/the-claude-3-model-family-opus-sonnet-haiku.
  • Rose et al. [2010] S. Rose, D. Engel, N. Cramer, and W. Cowley, Automatic keyword extraction from individual documents, Text mining: applications and theory, 1 (2010).
  • Priem et al. [2022] J. Priem, H. Piwowar, and R. Orr, Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts, arXiv:2205.01833  (2022).
  • Wakonig et al. [2019] K. Wakonig, A. Diaz, A. Bonnin, M. Stampanoni, A. Bergamaschi, J. Ihli, M. Guizar-Sicairos, and A. Menzel, X-ray fourier ptychography, Science advances 5, eaav0282 (2019).
  • Johnson et al. [2023] A. S. Johnson, D. Perez-Salinas, K. M. Siddiqui, S. Kim, S. Choi, K. Volckaert, P. E. Majchrzak, S. Ulstrup, N. Agarwal, K. Hallman, et al., Ultrafast x-ray imaging of the light-induced phase transition in vo2, Nature Physics 19, 215 (2023).
  • European Commission [2024] European Commission, Eurostat GISCO: NUTS geodata (2024).
  • Hooke [1665] R. Hooke, A spot in one of the belts of jupiter, Philosophical Transactions of the Royal Society of London 1, 3 (1665).
  • Madaan et al. [2024] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al., Self-refine: Iterative refinement with self-feedback, Advances in Neural Information Processing Systems 36 (2024).
  • Page et al. [1999] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: Bringing order to the web, Stanford InfoLab  (1999).
  • Barabási [2016] A.-L. Barabási, Network Science (Cambridge University Press, 2016).
  • Srivastava et al. [2014] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15, 1929 (2014).
  • Fawcett [2004] T. Fawcett, Roc graphs: Notes and practical considerations for researchers, Machine learning 31, 1 (2004).
  • Meta AI [2024] Meta AI, Llama 3: Open foundation and fine-tuned chat models, https://github.com/meta-llama/llama3 (2024).
  • Chiang et al. [2024] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, et al., Chatbot arena: An open platform for evaluating llms by human preference, arXiv:2403.04132  (2024).
  • Uzzi et al. [2013] B. Uzzi, S. Mukherjee, M. Stringer, and B. Jones, Atypical combinations and scientific impact, Science 342, 468 (2013).

Appendix

Datasets for creating knowledge graph – We compiled a list of scientific concepts using metadata from arXiv, bioRxiv, medRxiv, and chemRxiv. The arXiv data is available on Kaggle, while bioRxiv, medRxiv, and chemRxiv metadata can be accessed through their APIs. Our dataset includes approximately 2.44 million papers, with a data cutoff in February 2023.

For edge generation, we used the OpenAlex database snapshot, available for download from the OpenAlex bucket, with a data cutoff in April 2023. For more details, refer to the OpenAlex website [25]. The complete dataset is around 330 GB, expanding to 1.6 TB when decompressed. We focused on scientific journal papers with publication time, title, abstract, and citation information, reducing the dataset to a more manageable 68 GB gzip-compressed file comprising about 92 million papers.

Creating the concept list – From the four preprint datasets of approximately 2.44 million papers, we analyzed each article's title and abstract using the RAKE algorithm, enhanced with additional stopwords, to extract potential concept candidates. These candidates were stored for subsequent analysis. We then filtered the candidates, retaining only two-word concepts that appeared in nine or more articles and concepts of three or more words that appeared in six or more articles. This step significantly reduced the noise from the RAKE-generated concepts, yielding a refined list of 726,439 concepts.
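A hedged sketch of this extraction-and-filter step, using the rake_nltk package as a stand-in for the authors' RAKE setup; the stopword list and the corpus iterator are assumptions.

```python
from collections import Counter

from rake_nltk import Rake  # pip install rake-nltk (needs NLTK stopword data)


def extract_candidates(texts, stopword_list):
    """Count, for each candidate phrase, the number of articles containing it."""
    rake = Rake(stopwords=stopword_list)
    counts = Counter()
    for text in texts:  # one title+abstract per article
        rake.extract_keywords_from_text(text)
        counts.update(set(rake.get_ranked_phrases()))  # one count per article
    return counts


def frequency_filter(counts):
    """Keep 2-word phrases seen in >= 9 articles, 3+-word phrases in >= 6."""
    return [p for p, n in counts.items()
            if (len(p.split()) == 2 and n >= 9)
            or (len(p.split()) >= 3 and n >= 6)]
```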

To further enhance the quality of the identified concepts, we developed a suite of automatic tools designed to identify and eliminate common, domain-independent errors often associated with RAKE. Additionally, we conducted a manual review to remove inaccuracies in the concepts, such as non-conceptual phrases, verbs, ordinal numbers, conjunctions, and adverbials, reducing the list to 368,825 concepts.

Next, we used GPT-3.5 to further refine the concepts, which resulted in the removal of 286,311 concepts. To address potential incorrect removals, we used Wikipedia to recover mistakenly removed concepts, successfully restoring 40,614 concepts. This process ultimately produced a final list of 123,128 concepts.
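The exact Wikipedia lookup is not specified; one plausible implementation, sketched below, queries the public Wikipedia REST summary endpoint and treats an existing page as evidence that a removed phrase is a genuine concept worth restoring.

```python
import requests


def has_wikipedia_page(phrase):
    """Heuristic recovery check: does the phrase have its own Wikipedia page?"""
    url = ("https://en.wikipedia.org/api/rest_v1/page/summary/"
           + phrase.replace(" ", "_"))
    return requests.get(url, timeout=10).status_code == 200
```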

Classification of Max Planck Institutes – We classify all 87 Max Planck Institutes into two classes: Class 1, abbreviated as nat, includes natural sciences, technology, mathematics, and medicine (68 institutes), while Class 2, abbreviated as soc, includes social sciences and humanities (19 institutes). We classified the institutes manually based on institute names and research fields, and additionally used GPT-4o for automatic classification; the two approaches matched perfectly.

Prompt to GPT-4 for concept refinement – The prompt to refine the researchers’ concept list is:
A scientist has written the following papers:
0) title1
1) title2
2) title3


I have a noisy list of the researchers topics of interest, and I would like that you help me filtering them. Please look at the list below, and return all concepts in that list, which are relevant to the scientists research (based on their paper titles), and that are meaningful in the context of their research direction. The concepts can be detailed, I mainly want that you filter out not meaningful concepts, words which are not concepts, or concepts that are too general for the direction of the scientist (for example, artificial intelligence might be a meaningful concept for a geologist, but not for a machine learning researcher). Do not change or add any of the concepts. only remove them or keep them.

concept_list=[c1, c2, c3, c4, c5, c6, ….]

Prompt to GPT-4 for project idea generation – The prompt to suggest research ideas using concept pair from knowledge graph is:
Two researchers A and B, with expertise in “concept1” and “concept2” respectively, are eager to collaborate on a novel interdisciplinary project that leverages their unique strengths and creates synergy between their fields.

To better understand their backgrounds, here are the titles of recent publications from each researcher:
Researcher A:
1: title1
2: title2
3: title3


Researcher B:
1: title1
2: title2
3: title3


Please suggest a creative and surprising scientific project that combines “concept1” and “concept2”. In your response, follow this outline:

First, explain “concept1” and “concept2” in one short sentence each.

Then, do the following three steps 3 times, improving in each time the response:
A) Describe 4 interesting and new scientific contexts, in which those two concepts might appear together in a natural and useful way.
B) Criticize the 4 contexts (one short sentence each), based on how well the contexts merge the idea of the two concepts.
C) Give a 2 sentence summary of your reflections above, on how well one can combine these concepts naturally and interestingly.

Then, start finding a project. Taking your reflections from (A-C) into account, define in your response a project title, followed by a brief explanation of the project’s main objective.

Finally, address the following questions (Take the full reflections (A-C) into account):
What specific interesting research questions will this project address, that will lead to innovative novel results? [2 bullet points, one sentence each]

Rather than relying on a knowledge graph to supply “concept1” and “concept2” for GPT-4, it is possible to direct GPT-4 to extract these concepts from the titles of research papers authored by Researcher A and Researcher B, respectively. Subsequently, GPT-4 can use these identified concepts within the same prompting context to generate innovative research ideas.

Interest-Evaluation for three different generation methods – In Fig. 5, we show the three different generation methods for the research suggestions. The interest-level distributions are very similar, particularly between those with and without concepts from the knowledge graph. This similarity allows us to analyze the correlations between the properties of the knowledge graph and the interest level, and to use these properties to predict the interest level of proposals.

Figure 5: Interest levels depending on generation method. We use three different ways to generate research ideas: (1) no concepts provided by the knowledge graph, (2) random concepts from the researchers’ subnetwork, and (3) high-impact concept pairs from the researchers’ subnetwork. The figures show: (a) the interest level of all answers (numbers inside the bars indicate the number of answers with that evaluation), (b) answers without using concepts from the knowledge graph, (c) answers with random concept pairs, and (d) high-impact predicted concept pairs (using the neural network from [12]).

Predicting high interest from knowledge graph features – In Fig. 4 in the main text, our goal is to predict whether a certain research proposal will be evaluated with high interest. Specifically, using only data from the knowledge graph (and not the final text of the research proposal generated with GPT), we want to predict whether the proposal receives an interest value of 4 or 5 (on a scale from 1 to 5: not interesting to very interesting) or below 4, which constitutes a binary classification task.

Due to the small dataset size (2,996 answers with properties from the knowledge graph), we use a data-efficient learning method for the prediction task, specifically a small neural network with dropout. The input to the neural network consists of the 25 best-performing features from the knowledge graph. The neural network has only one hidden layer with 50 neurons and a single output neuron. We use mean square error as the loss function.

To get a consistent evaluation of the neural network performance for this small dataset, we perform Monte Carlo cross-validation. In this method, the dataset is randomly split into training and validation sets multiple times, and the model is trained and evaluated on each split. This process ensures that the performance metrics are robust and not dependent on a particular split of the data. We continue splitting and evaluating until the standard deviation of the mean AUC is less than $\frac{10^{-2}}{3}$, which is achieved after 130 iterations. This approach provides a reliable estimate of the model's performance, which is crucial for small datasets where individual splits may lead to high variance in the evaluation metrics.
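A sketch of this stopping rule; the 75/15/10 split matches the hyperparameters listed below, and `fit_and_predict` is an assumed helper that trains the network above and returns a scoring function.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def monte_carlo_cv(fit_and_predict, X, y, tol=1e-2 / 3):
    """Repeat random 75/15/10 splits until the std of the mean AUC < tol."""
    aucs = []
    while True:
        X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.25)
        X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.4)
        predict = fit_and_predict(X_tr, y_tr, X_val, y_val)
        aucs.append(roc_auc_score(y_te, predict(X_te)))
        # Standard deviation of the mean AUC: std / sqrt(number of repeats).
        if len(aucs) > 1 and np.std(aucs) / np.sqrt(len(aucs)) < tol:
            return float(np.mean(aucs))
```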

The neural network's performance is not particularly sensitive to hyperparameter choices; thus, we refrained from hyperparameter optimization and instead used a reasonable choice: learning rate = 0.003, dropout = 20%, weight decay = 0.0007, training dataset = 75%, validation dataset = 15%, test dataset = 10%.

We experimented with other data-efficient learning methods, such as decision trees, but they did not significantly outperform the neural network.