Journal of Data and Information Science Feed

A comprehensive review of existing corpora and methods for creating annotated corpora for event extraction tasks

Tue, 01 Oct 2024 00:00:00 GMT

This study offers a comprehensive review of existing annotated corpora for event extraction, which are limited but essential for training and improving event extraction algorithms. Beyond this primary goal, it provides guidelines for preparing an annotated corpus and suggests suitable tools for the annotation task.

This study employs an analytical approach to examine the available corpora that are suitable for event extraction tasks. It offers an in-depth analysis of existing event extraction corpora and provides systematic guidelines for researchers to develop accurate, high-quality corpora. This ensures the reliability of the created corpus and its suitability for training machine learning algorithms.

Our exploration reveals a scarcity of annotated corpora for event extraction tasks. In particular, the English corpora are mainly focused on the biomedical and general domains. Despite this scarcity, several high-quality corpora are available and widely used as benchmark datasets. However, access to some of these corpora may be limited by closed-access policies, or by discontinued maintenance after the initial release that leaves them unreachable behind broken links. Therefore, this study documents the available corpora for event extraction tasks.

Our study focuses only on well-known corpora available in English and Chinese. Within that scope, it places a strong emphasis on the English corpora because English, as a global lingua franca, is more widely understood than other languages.

We genuinely believe that this study provides valuable knowledge that can serve as a guiding framework for preparing and accurately annotating events from text corpora. It provides comprehensive guidelines for researchers to improve the quality of corpus annotations, especially for event extraction tasks across various domains.

This study comprehensively compiled information on the existing annotated corpora for event extraction tasks and provided preparation guidelines.

Early identification of scientific breakthroughs through outlier analysis based on research entities

Wed, 04 Sep 2024 00:00:00 GMT

To address the “anomalies” that occur when scientific breakthroughs emerge, this study focuses on identifying early signs and nascent stages of breakthrough innovations from the perspective of outliers, aiming to achieve early identification of scientific breakthroughs in papers.

This study utilizes semantic technology to extract research entities from the titles and abstracts of papers to represent each paper’s research content. Outlier detection methods are then employed to measure and analyze the anomalies in breakthrough papers during their early stages. The development and evolution process are traced using literature time tags. Finally, a case study is conducted using the key publications of the 2021 Nobel Prize laureates in Physiology or Medicine.
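
The abstract does not say which outlier detection method is used, so the following is only a minimal sketch of the general idea, under stated assumptions: each paper is represented by a set of research entities, and a paper is scored by how little its entity set overlaps with any paper from an earlier year. The Jaccard-based score, the function names, and the toy entity sets are all illustrative assumptions, not the study's method or data.

```python
def jaccard(a, b):
    """Jaccard similarity between two entity sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def novelty_scores(papers):
    """Score each paper by its dissimilarity to all papers from earlier years.

    papers: list of (year, paper_id, entity_set) tuples (hypothetical layout).
    Returns {paper_id: score}; the score is 1 minus the highest Jaccard
    similarity to any earlier paper, or None when no earlier paper exists.
    """
    ordered = sorted(papers, key=lambda p: p[0])
    scores = {}
    for year, paper_id, entities in ordered:
        earlier = [e for y, _, e in ordered if y < year]
        if earlier:
            scores[paper_id] = 1 - max(jaccard(entities, e) for e in earlier)
        else:
            scores[paper_id] = None
    return scores


# Toy, invented data: p3 shares no entity with earlier papers.
toy_papers = [
    (2000, "p1", {"temperature", "ion channel"}),
    (2001, "p2", {"temperature", "ion channel", "neuron"}),
    (2002, "p3", {"capsaicin", "receptor cloning"}),
]
toy_scores = novelty_scores(toy_papers)
```

A paper with a score close to 1 would be a candidate outlier for manual inspection; how such candidates are traced over time is left to the procedure described above.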

Through manual analysis of all identified outlier papers, the effectiveness of the proposed method for the early identification of potential scientific breakthroughs is verified.

The study’s applicability has only been empirically tested in the biomedical field. More data from various fields are needed to validate the robustness and generalizability of the method.

This study provides a valuable supplement to current methods for early identification of scientific breakthroughs, effectively supporting technological intelligence decision-making and services.

The study introduces a novel approach to early identification of scientific breakthroughs by leveraging outlier analysis of research entities, offering a more sensitive, precise, and fine-grained alternative method compared to traditional citation-based evaluations, which enhances the ability to identify nascent breakthrough innovations.

Community detection on elite mathematicians’ collaboration network

Wed, 28 Aug 2024 00:00:00 GMT

This study focuses on understanding the collaboration relationships among mathematicians, particularly those esteemed as elites, to reveal the structures of their communities and evaluate their impact on the field of mathematics.

Two community detection algorithms, namely Greedy Modularity Maximization and Infomap, are utilized to examine collaboration patterns among mathematicians. We conduct a comparative analysis of mathematicians’ centrality, emphasizing the influence of award-winning individuals in connecting roles within the network, as measured by Betweenness, Closeness, and Harmonic centrality. Additionally, we investigate the distribution of elite mathematicians across communities and their relationships within different mathematical sub-fields.
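
As a rough illustration of two quantities involved here, the sketch below computes the modularity Q of a given partition (the objective that Greedy Modularity Maximization tries to increase) and the harmonic centrality of each node, using only the Python standard library. It does not perform the greedy search or Infomap, and the toy co-authorship graph and function names are assumptions made for the example.

```python
from collections import defaultdict, deque


def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    community_of = {
        node: i for i, members in enumerate(communities) for node in members
    }
    intra = [0] * len(communities)       # edges inside each community
    degree_sum = [0] * len(communities)  # total degree of each community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            intra[community_of[u]] += 1
    return sum(
        intra[i] / m - (degree_sum[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


def harmonic_centrality(edges):
    """Sum of inverse shortest-path distances from each node to all others."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    result = {}
    for source in list(neighbours):
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        result[source] = sum(1 / d for n, d in dist.items() if n != source)
    return result


# Toy graph: two triangles of co-authors joined by the single edge c-d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]
toy_q = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
toy_harmonic = harmonic_centrality(toy_edges)
```

In this toy graph the bridging nodes c and d receive the highest harmonic centrality, which is the kind of connecting role the comparison above is concerned with.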

First, the study identifies the substantial influence exerted by award-winning mathematicians in connecting network roles. The elite distribution across the network is uneven, with a concentration within specific communities rather than an even dispersal. Second, the research identifies a positive correlation between distinct mathematical sub-fields and the communities, indicating collaborative tendencies among scientists engaged in related domains. Lastly, the study suggests that reduced research diversity within a community might lead to a higher concentration of elite scientists within that specific community.

The study’s limitations include its narrow focus on mathematicians, which may limit the applicability of the findings to broader scientific fields. Issues with manually collected data affect the reliability of conclusions about collaborative networks.

This study offers valuable insights into how elite mathematicians collaborate and how knowledge is disseminated within mathematical circles. Understanding these collaborative behaviors could aid in fostering better collaboration strategies among mathematicians and institutions, potentially enhancing scientific progress in mathematics.

The study adds value to understanding collaborative dynamics within the realm of mathematics, offering a unique angle for further exploration and research.

Navigating interdisciplinary research: Historical progression and contemporary challenges

Thu, 01 Aug 2024 00:00:00 GMT

Interdisciplinary research plays a crucial role in addressing complex problems by integrating knowledge from multiple disciplines. This integration fosters innovative solutions and enhances understanding across various fields. This study explores the historical and sociological development of interdisciplinary research and maps its evolution through three distinct phases: pre-disciplinary, disciplinary, and post-disciplinary. It identifies key internal dynamics, such as disciplinary diversification, reorganization, and innovation, as primary drivers of this evolution. Additionally, this study highlights how external factors, particularly the urgency of World War II and the subsequent political and economic changes, have accelerated its advancement. The rise of interdisciplinary research has significantly reshaped traditional educational paradigms, promoting its integration across different educational levels. However, the inherent contradictions within interdisciplinary research present cognitive, emotional, and institutional challenges for researchers. Meanwhile, finding a balance between the breadth and depth of knowledge remains a critical challenge in interdisciplinary education.

Data-enhanced revealing of trends in Geoscience

Thu, 01 Aug 2024 00:00:00 GMT

This article presents an in-depth analysis of global research trends in Geosciences from 2014 to 2023. By integrating bibliometric analysis with expert insights from the Deeptime Digital Earth (DDE) initiative, this article identifies key emerging themes shaping the landscape of Earth Sciences.

The identification process involved a meticulous analysis of over 400,000 papers from 466 Geosciences journals and approximately 5,800 papers from 93 interdisciplinary journals sourced from the Web of Science and Dimensions databases. To map relationships between articles, citation networks were constructed, and spectral clustering algorithms were then employed to identify groups of related research, resulting in 407 clusters. Relevant research terms were extracted using the Log-Likelihood Ratio (LLR) algorithm, followed by statistical analyses of the volume of papers, average publication year, and average citation count within each cluster. Additionally, expert knowledge from the DDE Scientific Committee was used to select the top 30 trends based on their representation, relevance, and impact within Geosciences, and to finalize the naming of these trends with consideration of the content and implications of the associated research. This comprehensive approach systematically delineates and characterizes the trends in a way that is understandable to geoscientists.
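
The abstract does not state which LLR variant was used for term extraction. As a hedged illustration, the sketch below implements one common form, the G² statistic over a 2×2 contingency table (term vs. other tokens, cluster vs. rest of the corpus), and ranks the terms that are over-represented in a cluster. The function names and the toy counts are assumptions made for the example.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: term occurrences in the cluster, k12: other tokens in the cluster,
    k21: term occurrences elsewhere,      k22: other tokens elsewhere.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2


def rank_cluster_terms(cluster_counts, rest_counts):
    """Rank terms over-represented in a cluster by LLR (assumes non-empty counts)."""
    cluster_total = sum(cluster_counts.values())
    rest_total = sum(rest_counts.values())
    scored = []
    for term, in_cluster in cluster_counts.items():
        in_rest = rest_counts.get(term, 0)
        # LLR is two-sided, so skip terms that are not over-represented.
        if in_cluster / cluster_total <= in_rest / rest_total:
            continue
        score = llr(in_cluster, cluster_total - in_cluster,
                    in_rest, rest_total - in_rest)
        scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, invented token counts for one cluster versus the rest of the corpus.
toy_ranked = rank_cluster_terms({"zircon": 30, "model": 10},
                                {"zircon": 5, "model": 55})
```

A term that is evenly spread across the corpus scores near zero, while a term concentrated in one cluster scores high and becomes a candidate label for that cluster.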

Thirty significant trends were identified in the field of Geosciences, spanning five domains: deep space, deep time, deep Earth, habitable Earth, and big data. These topics reflect the latest trends and advancements in Geosciences and have the potential to address real-world problems that are closely related to society, science, and technology.

The data analyzed in this study include only those indexed in the Web of Science.

This study will help organizations and individual scientists understand the modern frontier of earth science, especially solid earth science. Organizations such as surveys or natural science funds could use this study as a reference to map out areas for future exploration and analyze hot topics.

This paper integrates bibliometric analysis with expert insights to highlight the most significant trends in earth science, and reaches individual scientists and the public through global voting.

Identifying multidisciplinary problems from scientific publications based on a text generation method

Thu, 25 Jul 2024 00:00:00 GMT

A text generation based multidisciplinary problem identification method is proposed, which does not rely on a large amount of data annotation.

The proposed method first identifies the research objective types and disciplinary labels of papers using a text classification technique; second, it generates abstractive titles for each paper based on abstract and research objective types using a generative pre-trained language model; third, it extracts problem phrases from generated titles according to regular expression rules; fourth, it creates problem relation networks and identifies the same problems by exploiting a weighted community detection algorithm; finally, it identifies multidisciplinary problems based on the disciplinary labels of papers.
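
The last three steps can be sketched roughly as follows, assuming the classification and title generation steps have already produced a generated title and a disciplinary label for each paper. The regular expressions are hypothetical examples of extraction rules, and exact string matching stands in for the weighted community detection that the method uses to merge identical problems; neither is taken from the paper.

```python
import re
from collections import defaultdict

# Hypothetical rules: "<method> for <problem>" and "<problem> based on <method>".
PROBLEM_PATTERNS = [
    re.compile(r"\bfor (?P<problem>.+)$", re.IGNORECASE),
    re.compile(r"^(?P<problem>.+?) (?:based on|using|via) ", re.IGNORECASE),
]


def extract_problem(title):
    """Return the problem phrase matched by the first applicable rule, or None."""
    for pattern in PROBLEM_PATTERNS:
        match = pattern.search(title)
        if match:
            return match.group("problem").strip().lower()
    return None


def multidisciplinary_problems(papers, min_disciplines=2):
    """Group papers by problem phrase and keep problems spanning several disciplines.

    papers: iterable of (generated_title, discipline_label) pairs.
    Exact phrase matching is a simplification of the community detection step.
    """
    disciplines_by_problem = defaultdict(set)
    for title, discipline in papers:
        problem = extract_problem(title)
        if problem is not None:
            disciplines_by_problem[problem].add(discipline)
    return {
        problem: sorted(disciplines)
        for problem, disciplines in disciplines_by_problem.items()
        if len(disciplines) >= min_disciplines
    }


# Toy, invented titles and labels.
toy_result = multidisciplinary_problems([
    ("A deep learning model for carbon emission prediction", "Computer Science"),
    ("Carbon emission prediction based on panel data", "Economics"),
    ("Policy incentives for carbon capture deployment", "Public Administration"),
])
```

In the toy input, only "carbon emission prediction" is studied under two disciplinary labels, so it is the only problem reported as multidisciplinary.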

Experiments in the “Carbon Peaking and Carbon Neutrality” field show that the proposed method can effectively identify multidisciplinary research problems. The disciplinary distribution of the identified problems is consistent with our understanding of multidisciplinary collaboration in the field.

It is necessary to use the proposed method in other multidisciplinary fields to validate its effectiveness.

Multidisciplinary problem identification helps governments gather multidisciplinary forces to solve complex real-world problems, helps research management authorities fund valuable multidisciplinary problems, and helps researchers borrow ideas from other disciplines.

This study proposes a novel multidisciplinary problem identification method based on text generation. The method identifies multidisciplinary problems from generated abstractive titles of papers, without the data annotation required by standard sequence labeling techniques.

Publication behaviour and (dis)qualification of chief editors in Turkish national Social Sciences journals

Wed, 24 Jul 2024 00:00:00 GMT

This study investigated the publication behaviour of 573 chief editors managing 432 Social Sciences journals in Turkey. Direct inquiries into editorial qualifications are rare, and this research aims to shed light on editors’ scientific leadership capabilities.

This study contrasts insider publication behaviour in national journals with international articles in journals indexed by the Web of Science (WOS) and Scopus. It argues that editors demonstrating a consistent ability to publish in competitive WOS and Scopus indexed journals signal high qualifications, while editors with persistent insider behaviour and strong local orientation signal low qualification. Scientific leadership capability is measured by first-authored publications. Correlation and various regression tests are conducted to identify significant determinants of publication behaviour.

International publications are rare and concentrated on a few individuals, while insider publications are endemic and constitute nearly 40% of all national articles. Editors publish 3.2 insider papers and 8.1 national papers for every SSCI article. 62% (58%) of the editors have no SSCI (Scopus) article, 53% (63%) do not have a single lead-authored WOS (Scopus) article, and 89% publish at least one insider paper. Only a minority consistently publish in international journals; a fifth of the editors have three or more SSCI publications, and a quarter have three or more Scopus articles. Editors with foreign Ph.D. degrees are the most qualified and internationally oriented, whereas non-mobile editors are the most underqualified and underperform other editors by every measure. Illustrating the overall lack of qualification, nearly half of the professor editors and the majority of the WOS and Scopus indexed journal editors have no record of SSCI or Scopus publications.

This research relies on local settings that encourage national publications at the expense of international journals. Findings should be evaluated in light of this setting and bearing in mind that narrow localities are more prone to peer favouritism.

Incompetent and nepotistic editors pose an imminent threat to Turkish national literature. A lasting solution would likely include the dismissal and replacement of unqualified editors, as well as delisting and closure of dozens of journals that operate in questionable ways and serve little scientific purpose.

To my knowledge, this is the first study to document the publication behaviour of national journal chief editors.

Research evolution of metal organic frameworks: A scientometric approach with human-in-the-loop

Fri, 19 Jul 2024 00:00:00 GMT

This paper reports on a scientometric analysis, bolstered by human-in-the-loop input from domain experts, that examines the field of metal-organic frameworks (MOFs) research. Scientometric analyses reveal the intellectual landscape of a field. The study engaged MOF scientists in the design and review of our research workflow. MOF materials are an essential component in next-generation renewable energy storage and biomedical technologies. The research approach demonstrates how engaging experts, via human-in-the-loop processes, can help develop a comprehensive view of a field’s research trends, influential works, and specialized topics.

A scientometric analysis was conducted, integrating natural language processing (NLP), topic modeling, and network analysis methods. The analytical approach was enhanced through a human-in-the-loop iterative process involving MOF research scientists at selected intervals. MOF researcher feedback was incorporated into our method. The data sample included 65,209 MOF research articles. Python 3 and the software tool VOSviewer were used to perform the analysis.

The findings demonstrate the value of including domain experts in research workflows, refinement, and interpretation of results. At each stage of the analysis, the MOF researchers contributed to interpreting the results and method refinements targeting our focus on MOF research. This study identified influential works and their themes. Our findings also underscore four main MOF research directions and applications.

This study is limited by the sample (articles identified and referenced by the Cambridge Structural Database) that informed our analysis.

Our findings contribute to addressing the current gap in fully mapping out the comprehensive landscape of MOF research. Additionally, the results will help domain scientists target future research directions.

To the best of our knowledge, the number of publications collected for analysis exceeds those of previous studies. This enabled us to explore a more extensive body of MOF research compared to previous studies. Another contribution of our work is the iterative engagement of domain scientists, who brought in-depth, expert interpretation to the data analysis, helping hone the study.

Ranking academic institutions based on the productivity, impact, and quality of institutional scholars

Wed, 17 Jul 2024 00:00:00 GMT

The quantitative rankings of over 55,000 institutions and their institutional programs are based on the individual rankings of approximately 30 million scholars determined by their productivity, impact, and quality.

The institutional ranking process developed here considers all institutions in all countries and regions, thereby including those that are established, as well as those that are emerging in scholarly prowess. Rankings of individual scholars worldwide are first generated using the recently introduced, fully indexed ScholarGPS database. The rankings of individual scholars are extended here to determine the lifetime and last-five-year Top 20 rankings of academic institutions over all Fields of scholarly endeavor, in 14 individual Fields, in 177 Disciplines, and in approximately 350,000 unique Specialties. Rankings associated with five specific Fields (Medicine, Engineering & Computer Science, Life Sciences, Physical Sciences & Mathematics, and Social Sciences), and in two Disciplines (Chemistry, and Electrical & Computer Engineering) are presented as examples, and changes in the rankings over time are discussed.

For the Fields considered here, the Top 20 institutional rankings in Medicine have undergone the least change (lifetime versus last five years), while the rankings in Engineering & Computer Science have exhibited significant change. The evolution of institutional rankings over time is largely attributed to the recent emergence of Chinese academic institutions, although this emergence is shown to be highly Field- and Discipline-dependent.

The ScholarGPS database used here ranks institutions in the categories of: (i) all Fields, (ii) 14 individual Fields, (iii) 177 Disciplines, and (iv) approximately 350,000 unique Specialties. A comprehensive investigation covering all categories is not practical.

Existing rankings of academic institutions have: (i) often been restricted to pre-selected institutions, clouding the potential discovery of scholarly activity in emerging institutions and countries; (ii) considered only broad areas of research, limiting the ability of university leadership to act on the assessments in a concrete manner; or, in contrast, (iii) considered only a narrow area of research for comparison, diminishing the broader applicability and impact of the assessment. In general, existing institutional rankings depend on which institutions are included in the ranking process, which areas of research are considered, the breadth (or granularity) of the research areas of interest, and the methodologies used to define and quantify research performance. In contrast, the methods presented here can provide important data over a broad range of granularity to allow responsible individuals to gauge the performance of any institution from the Overall (all Fields) level to the level of the Specialty. The methods may also assist identification of the root causes of shifts in institution rankings, and how these shifts vary across hundreds of thousands of Fields, Disciplines, and Specialties of scholarly endeavor.

This study provides the first ranking of all academic institutions worldwide over Fields, Disciplines, and Specialties based on a unique methodology that quantifies the productivity, impact, and quality of individual scholars.

Tracking direct and indirect impact on technology and policy of transformative research via ego citation network

Wed, 17 Jul 2024 00:00:00 GMT

The dissemination of academic knowledge to nonacademic audiences partly relies on the transition of subsequent citing papers. This study aims to investigate the direct and indirect impact on technology and policy originating from transformative research, based on the ego citation network.

Key Nobel Prize-winning publications (NPs) in fields of gene engineering and astrophysics are regarded as a proxy for transformative research. In this contribution, we introduce a network-structural indicator of citing patents to measure technological impact of a target article and use policy citations as a preliminary tool for policy impact.
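
The network-structural indicator itself is not defined in this abstract, so the sketch below only illustrates the surrounding bookkeeping under stated assumptions: starting from a target article (the ego), each citing paper is assigned a citation generation by breadth-first search, and patent or policy citations are then totalled per generation. The data layout, function names, and toy values are hypothetical.

```python
from collections import deque


def citation_generations(ego, cited_by):
    """Assign each paper its shortest citation distance from the ego article.

    cited_by: {paper: [papers that cite it]} (hypothetical layout).
    Generation 0 is the ego, 1 its direct citers, 2 the citers of those, etc.
    """
    generation = {ego: 0}
    queue = deque([ego])
    while queue:
        paper = queue.popleft()
        for citer in cited_by.get(paper, ()):
            if citer not in generation:
                generation[citer] = generation[paper] + 1
                queue.append(citer)
    return generation


def external_citations_by_generation(generation, external_counts):
    """Total patent (or policy) citations received by papers in each generation."""
    totals = {}
    for paper, gen in generation.items():
        totals[gen] = totals.get(gen, 0) + external_counts.get(paper, 0)
    return totals


# Toy, invented network: NP is the ego; C cites both A and B.
toy_cited_by = {"NP": ["A", "B"], "A": ["C"], "B": ["C", "D"], "C": []}
toy_generations = citation_generations("NP", toy_cited_by)
toy_patent_totals = external_citations_by_generation(
    toy_generations, {"NP": 5, "A": 1, "C": 2}
)
```

Generation 0 then captures the direct impact of the ego article, while later generations capture impact passed on through subsequent citing papers.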

The results show that the impact of NPs on technology and policy is higher than that of their subsequent citation generations in gene engineering but not in astrophysics.

The selection of Nobel Prizes is not balanced and the database used in this study, Dimensions, suffers from incompleteness and inaccuracy of citation links.

Our findings provide useful clues to better understand the characteristics of transformative research in technological and policy impact.

This study proposes a new framework to explore the direct and indirect impact on technology and policy originating from transformative research.

The Unique citing documents Journal Impact Factor (Uniq-JIF) as a supplement for the standard Journal Impact Factor

Wed, 17 Jul 2024 00:00:00 GMT

Detecting LLM-assisted writing in scientific communication: Are we there yet?

Tue, 09 Jul 2024 00:00:00 GMT

Large Language Models (LLMs), exemplified by ChatGPT, have significantly reshaped text generation, particularly in the realm of writing assistance. While ethical considerations underscore the importance of transparently acknowledging LLM use, especially in scientific communication, genuine acknowledgment remains infrequent. A potential avenue to encourage accurate acknowledging of LLM-assisted writing involves employing automated detectors. Our evaluation of four cutting-edge LLM-generated text detectors reveals their suboptimal performance compared to a simple ad-hoc detector designed to identify abrupt writing style changes around the time of LLM proliferation. We contend that the development of specialized detectors exclusively dedicated to LLM-assisted writing detection is necessary. Such detectors could play a crucial role in fostering more authentic recognition of LLM involvement in scientific communication, addressing the current challenges in acknowledgment practices.

A quantitative study of disruptive technology policy texts: An example of China’s artificial intelligence policy

Mon, 10 Jun 2024 00:00:00 GMT

The transformative impact of disruptive technologies on the restructuring of the times has attracted widespread global attention. This study aims to analyze the characteristics and shortcomings of China’s artificial intelligence (AI) disruptive technology policy, and to put forward suggestions for optimizing China’s AI disruptive technology policy.

Develop a three-dimensional analytical framework for “policy tools-policy actors-policy themes” and apply policy tools, social network analysis, and LDA topic model to conduct a comprehensive analysis of the utilization of policy tools, cooperative relationships among policy actors, and the trends in policy theme settings within China’s innovative AI technology policy.

We find that the collaborative relationship among the policy actors of AI disruptive technology in China is insufficiently close. Marginal subjects exhibit low participation in the cooperation network and rely excessively on central subjects, forming a “center-periphery” network structure. Policy tool usage is predominantly focused on supply and environmental types, with a severe inadequacy in demand-side policy tool utilization. Policy themes are diverse, encompassing topics such as “Intelligent Services”, “Talent Cultivation”, “Information Security”, and “Technological Innovation”, which will remain focal points. Under the themes of “Intelligent Services” and “Intelligent Governance”, policy tool usage is relatively balanced, with close collaboration among policy entities. However, the theme of “AI Theoretical System” lacks a comprehensive understanding of tool usage and necessitates enhanced cooperation with other policy entities.

The data sources and experimental scope are subject to certain limitations, potentially introducing biases and imperfections into the research results, necessitating further validation and refinement.

The study introduces a three-dimensional analysis framework for disruptive technology policy texts, which is significant for formulating and enhancing disruptive technology policies.

This study uses text mining and content analysis techniques to analyze disruptive technology policy texts quantitatively. It systematically evaluates China’s AI policies, focusing on policy tools, policy actors, and policy themes. The study uncovers the characteristics and deficiencies of current AI policies, offering recommendations for formulating and enhancing disruptive technology policies.

Beyond authorship: Analyzing contributions in PLOS ONE and the challenges of appropriate attribution

Fri, 31 May 2024 00:00:00 GMT

This study aims to evaluate the accuracy of authorship attributions in scientific publications, focusing on the fairness and precision of individual contributions within academic works.

The study analyzes 81,823 publications from the journal PLOS ONE, covering the period from January 2018 to June 2023. It examines the authorship attributions within these publications to try and determine the prevalence of inappropriate authorship. It also investigates the demographic and professional profiles of affected authors, exploring trends and potential factors contributing to inaccuracies in authorship.

Surprisingly, 9.14% of articles feature at least one author with inappropriate authorship, affecting over 14,000 individuals (2.56% of the sample). Inappropriate authorship is more concentrated in Asia, Africa, and specific European countries like Italy. Established researchers with significant publication records and those affiliated with companies or nonprofits show higher instances of potential monetary authorship.

Our findings are based on contributions as declared by the authors, which implies a degree of trust in their transparency. However, this reliance on self-reporting may introduce biases or inaccuracies into the dataset. Further research could employ additional verification methods to enhance the reliability of the findings.

These findings have significant implications for journal publishers, highlighting the necessity for robust control mechanisms to ensure the integrity of authorship attributions. Moreover, researchers must exercise discernment in determining when to acknowledge a contributor and when to include them in the author list. Addressing these issues is crucial for maintaining the credibility and fairness of academic publications.

This study contributes to an understanding of critical issues within academic authorship, shedding light on the prevalence and impact of inappropriate authorship attributions. By calling for a nuanced approach to ensure accurate credit is given where it is due, the study underscores the importance of upholding ethical standards in scholarly publishing.

Performance evaluation of seven multi-label classification methods on real-world patent and publication datasets

Mon, 27 May 2024 00:00:00 GMT

Many science, technology and innovation (STI) resources carry several different labels. Many approaches that perform well on benchmark datasets have been proposed in the literature for the multi-label classification task, which assigns such labels automatically to an instance of interest. Furthermore, several open-source tools implementing these approaches have been developed. However, the characteristics of real-world multi-label patent and publication datasets are not completely in line with those of benchmark ones. Therefore, the main purpose of this paper is to evaluate seven multi-label classification methods comprehensively on real-world datasets.

Three real-world datasets (Biological-Sciences, Health-Sciences, and USPTO) are constructed from the SciGraph and USPTO databases. Seven multi-label classification methods with tuned parameters (dependency-LDA, MLkNN, LabelPowerset, RAkEL, TextCNN, TextRNN, and TextRCNN) are comprehensively compared on these three real-world datasets. To evaluate the performance, the study adopts three classification-based metrics: Macro-F1, Micro-F1, and Hamming Loss.
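
For reference, the three metrics can be computed as in the following minimal sketch, which uses the standard definitions and only the Python standard library. The function name and the toy label sets are assumptions made for the example and are unrelated to the paper's datasets.

```python
def multilabel_metrics(y_true, y_pred, labels):
    """Macro-F1, Micro-F1 and Hamming Loss for multi-label predictions.

    y_true, y_pred: lists of label sets, one set per document.
    labels: the full list of possible labels.
    """
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, predicted in zip(y_true, y_pred):
        for label in labels:
            if label in truth and label in predicted:
                tp[label] += 1
            elif label in predicted:
                fp[label] += 1
            elif label in truth:
                fn[label] += 1

    def f1(t, p, n):
        denominator = 2 * t + p + n
        return 2 * t / denominator if denominator else 0.0

    # Macro-F1: unweighted mean of the per-label F1 scores.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro-F1: F1 over counts pooled across all labels.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Hamming Loss: share of (document, label) decisions that are wrong.
    hamming = (sum(fp.values()) + sum(fn.values())) / (len(y_true) * len(labels))
    return macro, micro, hamming


# Toy, invented example with four documents and three labels.
toy_labels = ["A", "B", "C"]
toy_true = [{"A", "B"}, {"B"}, {"C"}, {"A", "C"}]
toy_pred = [{"A"}, {"B"}, {"B", "C"}, {"A"}]
toy_macro, toy_micro, toy_hamming = multilabel_metrics(toy_true, toy_pred, toy_labels)
```

Macro-F1 weights every label equally, so rare labels influence it more than they influence Micro-F1; this is why the two can diverge on unbalanced document-label distributions.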

The TextCNN and TextRCNN models show obvious superiority on small-scale datasets with more complex hierarchical structure of labels and more balanced document-label distribution in terms of Macro-F1, Micro-F1, and Hamming Loss. The MLkNN method works better on the larger-scale dataset with more unbalanced document-label distribution.

The three real-world datasets differ in the following aspects: statement, data quality, and purposes. Additionally, open-source tools designed for multi-label classification also have intrinsic differences in their approaches to data processing and feature selection, which in turn affect the performance of a multi-label classification approach. In the near future, we will enhance experimental precision and reinforce the validity of conclusions by exercising more rigorous control over variables through expanded parameter settings.

The observed Macro F1 and Micro F1 scores on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches leveraging deep learning techniques offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing enhancements in deep learning algorithms and large-scale models, it is expected that the efficacy of multi-label classification tasks will be significantly improved, reaching a level of practical utility in the foreseeable future.

(1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with more complex hierarchical structure of labels and more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with more unbalanced document-label distribution.

Can ChatGPT evaluate research quality?

Mon, 27 May 2024 00:00:00 GMT

Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task.

Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements.

ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author’s significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations.
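
The comparison between per-round and averaged scores can be illustrated with a small sketch: compute the Pearson correlation of each round's scores with the human scores, then average the rounds per article and correlate again. The scores below are invented toy values, not the study's data, and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def mean_per_article(rounds):
    """Average the scores each article received across all rounds."""
    return [sum(scores) / len(scores) for scores in zip(*rounds)]


# Toy, invented scores for four articles over three model rounds.
toy_human = [1, 2, 3, 4]
toy_rounds = [[2, 1, 3, 3], [1, 3, 2, 4], [2, 2, 4, 3]]
toy_round_r = [pearson(toy_human, scores) for scores in toy_rounds]
toy_average_r = pearson(toy_human, mean_per_article(toy_rounds))
```

In this toy case the averaged scores correlate more strongly with the human scores than any single round does, because averaging cancels part of the round-to-round noise; real data need not behave this cleanly.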

The data is self-evaluations of a convenience sample of articles from one academic in one field.

Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use.

This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.

A comparative study on characteristics of retracted publications across different open access levels

Mon, 27 May 2024 00:00:00 GMT

Global science has recently become increasingly open; however, the research integrity characteristics of open access (OA) publications have rarely been studied. The aim of this study is to compare the characteristics of retracted articles across different OA levels and to determine whether OA level influences those characteristics.

This study analyzed 6,005 publications retracted between 2001 and 2020, drawn from the Web of Science and Retraction Watch databases. The publications were categorized by OA level: Gold OA, Green OA, and non-OA. Retraction rates, retraction time lags, and retraction reasons were compared across these categories.

The findings revealed distinct patterns in retraction rates among OA levels. Gold OA publications had the highest retraction rate, followed by Green OA and non-OA. The proportions of retraction reasons were similar for Gold OA and non-OA, while Green OA showed a higher proportion of retractions due to falsification and manipulation and a lower proportion due to plagiarism and authorship issues. The retraction time lag was shortest for Gold OA, followed by non-OA, and longest for Green OA. The prolonged retraction time for Green OA could be attributed to its atypical distribution of retraction reasons.

The study does not explore a wider range of OA levels, such as Hybrid OA and Bronze OA.

The outcomes of this study suggest the need for increased attention to research integrity within OA publications. The occurrences of falsification, manipulation, and ethical concerns within Green OA publications warrant attention from the scientific community.

This study contributes to the understanding of research integrity in the realm of OA publications, shedding light on retraction patterns and reasons across different OA levels.

Amend: an integrated platform of retracted papers and concerned papers

Mon, 27 May 2024 00:00:00 GMT

The notable increase in retracted papers has attracted considerable attention from diverse stakeholders. Various sources now offer information related to research integrity, including concerns voiced on social media, disclosed lists of paper mills, and retraction notices accessible through journal websites. However, despite the availability of such resources, no unified platform consolidates this information, which hinders efficient searching and cross-referencing. Thus, it is imperative to develop a comprehensive platform for retracted papers and related concerns. This article aims to introduce “Amend,” a platform designed to integrate information on research integrity from diverse sources.

The Amend platform consolidates concerns and lists of problematic articles sourced from social media platforms (e.g., PubPeer, For Better Science), retraction notices from journal websites, and citation databases (e.g., Web of Science, CrossRef). Moreover, Amend includes investigation and punishment announcements released by administrative agencies (e.g., NSFC, MOE, MOST, CAS). Each related paper is marked and can be traced back to its information source via a provided link. Furthermore, the Amend database incorporates various attributes of retracted articles, including citation topics, funding details, open access status, and more. The reasons for retraction are identified and classified as either academic misconduct or honest errors, with detailed subcategories provided for further clarity.

Within the Amend platform, a total of 32,515 retracted papers indexed in SCI, SSCI, and ESCI between 1980 and 2023 were identified. Of these, 26,620 (81.87%) were associated with academic misconduct. The retraction rate stands at 6.64 per 10,000 articles. Notably, the retraction rate for non-gold open access articles differs significantly from that for gold open access articles, and this disparity has progressively widened over the years. Furthermore, the reasons for retraction have shifted from traditional individual behaviors such as falsification, fabrication, plagiarism, and duplication to more organized, large-scale fraudulent practices, including paper mills, fake peer review, and Artificial Intelligence Generated Content (AIGC).
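The headline figures above can be checked with simple arithmetic. The sketch below reproduces the misconduct share from the two reported counts and back-calculates the number of indexed articles implied by the reported rate. That denominator is not stated in the abstract, so the back-calculated value is only an illustrative assumption, and the function names are hypothetical.

```python
def share_percent(part, total):
    """Share of `part` in `total`, as a percentage."""
    return 100 * part / total


def implied_denominator(events, rate_per_10k):
    """Number of articles implied by an event count and a rate per 10,000."""
    return events / rate_per_10k * 10_000


retracted = 32_515   # reported: retracted papers, 1980-2023
misconduct = 26_620  # reported: retractions associated with misconduct
rate = 6.64          # reported: retractions per 10,000 articles

misconduct_share = share_percent(misconduct, retracted)
# Assumption: the rate was computed over all indexed articles in the period.
implied_articles = implied_denominator(retracted, rate)

print(f"misconduct share: {misconduct_share:.2f}%")
print(f"implied article count: {implied_articles:,.0f}")
```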

The Amend platform may not fully capture all retracted and concerning papers, thereby impacting its comprehensiveness. Additionally, inaccuracies in retraction notices may lead to errors in tagged reasons.

Amend provides an integrated platform for stakeholders to enhance monitoring, analysis, and research on academic misconduct issues. Ultimately, the Amend database can contribute to upholding scientific integrity.

This study introduces a globally integrated platform for retracted and concerning papers, along with a preliminary analysis of the evolutionary trends in retracted papers.

New roles of research data infrastructure in research paradigm evolution

Mon, 27 May 2024 00:00:00 GMT

Research data infrastructures form the cornerstone in both cyber and physical spaces, driving the progression of the data-intensive scientific research paradigm. This opinion paper presents an overview of global research data infrastructure, drawing insights from national roadmaps and strategic documents related to research data infrastructure. It emphasizes the pivotal role of research data infrastructures by delineating four new missions aimed at positioning them at the core of the current scientific research and communication ecosystem. The four new missions of research data infrastructures are: (1) as a pioneer, to transcend disciplinary borders and address complex, cutting-edge scientific and social challenges with problem- and data-oriented insights; (2) as an architect, to establish a digital, intelligent, flexible research and knowledge services environment; (3) as a platform, to foster high-end academic communication; (4) as a coordinator, to balance scientific openness with ethical needs.

General laws of funding for scientific citations: how citations change in funded and unfunded research between basic and applied sciences

Mon, 26 Feb 2024 00:00:00 GMT

The goal of this study is to analyze the relationship between funded and unfunded papers and their citations in both basic and applied sciences.

A power law model is used to analyze the relationship between research funding and paper citations, based on 831,337 documents recorded in the Web of Science database.

The results reveal general characteristics of the diffusion of science in research fields: a) funded articles receive more citations than unfunded papers in journals; b) funded articles exhibit super-linear growth in citations, surpassing the increase seen in unfunded articles, which indicates a higher diffusion of scientific knowledge in funded articles; and c) funded articles in both basic and applied sciences show a similar expected change in citations, about 1.23%, when the number of funded papers in journals increases by 1%. This result suggests, for the first time, that the funding effect of scientific research is an invariant driver, irrespective of whether the science is basic or applied.
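The 1.23% figure reads as the exponent of a power law, since in a model of the form citations = a × papers^b a 1% increase in papers changes citations by roughly b%. The sketch below shows one common way to estimate such an exponent, ordinary least squares on log-transformed values. The abstract does not state the estimation procedure, so this is an assumption, and the data are synthetic values generated to follow an exponent of 1.23 exactly.

```python
from math import exp, log


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); return (a, b)."""
    log_x = [log(x) for x in xs]
    log_y = [log(y) for y in ys]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope /= sum((u - mean_x) ** 2 for u in log_x)
    intercept = mean_y - slope * mean_x
    return exp(intercept), slope


# Synthetic, noise-free data (hypothetical): citations = 2 * papers ** 1.23
papers = [10, 50, 100, 500, 1000]
citations = [2 * p ** 1.23 for p in papers]

a, b = fit_power_law(papers, citations)
# Exact change in citations when the number of papers grows by 1%.
pct_change = (1.01 ** b - 1) * 100

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"a 1% increase in papers changes citations by {pct_change:.3f}%")
```

An exponent above 1 corresponds to the super-linear growth described in point b): citations grow proportionally faster than the number of funded papers.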

This evidence suggests empirical laws of funding for scientific citations that explain the importance of robust funding mechanisms for achieving impactful research outcomes in science and society. The findings also highlight that funding for scientific research is a critical driving force in supporting citations and the dissemination of scientific knowledge in recorded documents in both basic and applied sciences.

This comprehensive result provides a holistic view of the relationship between funding and citation performance in science. It can guide policymakers and R&D managers in designing science policies that direct funding to research, promoting scientific development and wider diffusion of results for the progress of human society.