Data Sources and Collection Strategy
The AHSC project builds a comprehensive scientometric database by integrating data from multiple high-quality sources, covering more than twenty years of scientific output (2000–present) from all 51 Italian public Academic Health Science Centres (AHSCs).
The data collection strategy follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, with rigorous procedures for cleaning, disambiguation, and normalization. Standardized identifiers — such as ROR (Research Organization Registry), GRID (Global Research Identifier Database), MAG (Microsoft Academic Graph), and DOI (Digital Object Identifier) — are used for institutional matching, record merging, and deduplication.
Large-scale data retrieval relies on dedicated R packages co-designed by the research team:
- openalexR: gathers bibliographic records from OpenAlex through its REST API
- dimensionsR: gathers bibliographic records from Digital Science Dimensions through the Dimensions DSL API
- pubmedR: gathers metadata about publications, grants, MeSH terms, and clinical trials from PubMed
All source code is open source and freely available on GitHub.
OpenAlex
OpenAlex serves as the primary bibliographic data source. It is an open, freely accessible platform covering more than 225 million publications, offering wide-ranging metadata on scientific articles, author profiles, institutional affiliations, and citation relationships.
Unlike proprietary alternatives such as Web of Science or Scopus, OpenAlex ensures broader inclusion — particularly of smaller or underrepresented institutions — and enables full integration with open-source analysis workflows.
Publications were retrieved via the OpenAlex API using institutional affiliations as the main search criterion. To improve matching accuracy, we relied on the Research Organization Registry (ROR), a global community-led system for identifying research institutions, which allowed us to account for official names, abbreviations, and alternative spellings of each AHSC.
Scientific Publications
All scientific publications associated with Italian public AHSCs were retrieved by applying the following inclusion and exclusion criteria:
- Document type: Only peer-reviewed articles were selected. Other formats — such as book chapters, conference proceedings, and editorials — were excluded due to their lower citation visibility and limited standardization.
- Timeframe: Articles published from 2000 to 2024.
- Language: Only articles written in English, to ensure internationally visible and comparable research output.
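As an illustration of how the criteria above could translate into a single OpenAlex request, the following minimal Python sketch composes a `/works` query URL. It is not the project's own code (which uses the openalexR package); the function name `build_works_url` and the ROR identifier are hypothetical placeholders. Note that OpenAlex's `type:article` filter selects journal-style articles and is only a proxy for peer review.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"

def build_works_url(ror_id, start_year=2000, end_year=2024, per_page=200):
    """Compose an OpenAlex /works query URL for one institution."""
    filters = [
        f"institutions.ror:{ror_id}",                 # match by ROR identifier
        "type:article",                               # document type
        "language:en",                                # English only
        f"from_publication_date:{start_year}-01-01",  # timeframe start
        f"to_publication_date:{end_year}-12-31",      # timeframe end
    ]
    params = {
        "filter": ",".join(filters),  # comma-separated filters are ANDed
        "per-page": per_page,
        "cursor": "*",                # start of cursor-based paging
    }
    # Keep ':', ',', '/' and '*' unescaped so the URL stays readable.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':,*/')}"

# Hypothetical ROR identifier, used only for illustration.
url = build_works_url("https://ror.org/00example0")
print(url)
```

Filtering on the ROR identifier rather than on free-text names avoids having to enumerate abbreviations and alternative spellings in the query itself.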
Concepts and Topics
OpenAlex provides a rich classification system for the scientific content of each publication through hierarchical entities known as Concepts — semantically defined research topics assigned automatically based on content analysis and citation patterns.
For each publication, all associated Concepts were extracted with their unique identifier, name, and relevance score. This enables reconstruction of the cognitive structure of each institution’s research activity, supporting analyses such as thematic clustering, science mapping, and comparative profiling across AHSCs.
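The extraction step can be sketched as flattening each work record into one row per Concept. The snippet below is a minimal Python illustration, assuming the work is available as a parsed JSON dictionary with a `concepts` list carrying `id`, `display_name`, and `score`; the helper name `concept_rows` and all identifiers in the sample record are hypothetical.

```python
def concept_rows(work):
    """Return one (work_id, concept_id, name, score) tuple per Concept."""
    return [
        (work["id"], c["id"], c["display_name"], c["score"])
        for c in work.get("concepts", [])
    ]

# Hypothetical record, trimmed to the fields used here; IDs are placeholders.
sample_work = {
    "id": "https://openalex.org/W0000000001",
    "concepts": [
        {"id": "https://openalex.org/C0000000001",
         "display_name": "Medicine", "score": 0.82},
        {"id": "https://openalex.org/C0000000002",
         "display_name": "Internal medicine", "score": 0.41},
    ],
}

rows = concept_rows(sample_work)
for row in rows:
    print(row)
```

A long table of this shape (one row per work and Concept pair) is a convenient input for co-occurrence matrices used in thematic clustering and science mapping.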
Dimensions
Dimensions is a multidisciplinary platform that goes beyond the standard publication-citation ecosystem by aggregating a broad range of research outputs. It is particularly valuable for capturing:
- Clinical trials — studies aimed at evaluating medical, surgical, or behavioral interventions
- Research grants — awarded competitive funding from national and international sources
- Patents — indicators of knowledge transfer and technological innovation
- Datasets — research data outputs linked to publications
By integrating Dimensions data, we expand the evaluation beyond publications and citations, capturing a broader spectrum of institutional impact — including funding success, technological innovation, and translational research potential.
Publications were excluded from the Dimensions extraction as they were more comprehensively covered by OpenAlex. Policy documents were omitted due to limited availability for the AHSCs analyzed.
PubMed/MEDLINE
PubMed/MEDLINE is a free database and the most widely used one in medicine and the life sciences. A distinctive feature of PubMed is Medical Subject Headings (MeSH), a controlled vocabulary of biomedical terms maintained by the National Library of Medicine.
Each article is indexed manually by MEDLINE experts with 10–15 MeSH terms, drawn from a dictionary of over 30,000 terms organized into 16 main categories. MeSH terms provide standardized subject classification that complements the broader concept tagging from OpenAlex, enabling precise thematic analyses of AHSCs’ clinical and biomedical research.
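To make the MeSH indexing concrete, the sketch below parses the descriptors from a PubMed-style XML record using Python's standard library. It is an illustration only, not the pubmedR implementation: the XML fragment is a trimmed, hypothetical record whose PMID, descriptor UIs, and descriptor names are placeholders, and `mesh_terms` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical record; PMID, UI values and names are placeholders.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName UI="D000001" MajorTopicYN="Y">Example Descriptor A</DescriptorName>
      </MeshHeading>
      <MeshHeading>
        <DescriptorName UI="D000002" MajorTopicYN="N">Example Descriptor B</DescriptorName>
      </MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>
"""

def mesh_terms(xml_text):
    """Return one dict per MeSH descriptor found in a single article record."""
    root = ET.fromstring(xml_text.strip())
    pmid = root.findtext(".//PMID")
    return [
        {
            "pmid": pmid,
            "ui": d.get("UI"),                        # descriptor identifier
            "term": d.text,                           # descriptor label
            "major": d.get("MajorTopicYN") == "Y",    # major-topic flag
        }
        for d in root.iter("DescriptorName")
    ]

terms = mesh_terms(SAMPLE_XML)
for t in terms:
    print(t)
```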
Altmetric
Altmetric tracks and quantifies the online attention received by scholarly publications, capturing a complementary perspective on research impact beyond academic citations.
Altmetric monitors mentions across a wide array of sources:
- Social media platforms (e.g., Twitter/X, Facebook)
- News outlets and blogs
- Policy documents
- Wikipedia
Using the DOIs of AHSC-affiliated publications, we accessed Altmetric’s API to collect both Altmetric Attention Scores and detailed engagement data. This enables us to quantify public and institutional attention, highlighting the societal resonance and outreach of AHSC research beyond the academic community.
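A DOI-based lookup first needs DOIs in a bare, consistent form, since OpenAlex reports them as full `https://doi.org/...` URLs. The Python sketch below normalizes a DOI and composes a lookup URL. The endpoint path `https://api.altmetric.com/v1/doi/` is an assumption to verify against Altmetric's current API documentation (access may also require a key), and both helper names and the sample DOI are hypothetical.

```python
def normalize_doi(raw):
    """Strip resolver prefixes and lowercase a DOI string."""
    doi = raw.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def altmetric_url(raw_doi):
    """Compose a DOI lookup URL (endpoint path is an assumption)."""
    return f"https://api.altmetric.com/v1/doi/{normalize_doi(raw_doi)}"

# Hypothetical DOI in the URL form that OpenAlex returns.
lookup = altmetric_url("https://doi.org/10.1234/ABC.5678")
print(lookup)
```

Lowercasing is safe here because DOIs are case-insensitive; the same normalized form can then double as a record-level merging key.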
Data Integration and Quality Assurance
The merging process operates at two levels:
- Macro-level: Standardized identifiers (ROR, GRID) assigned to each affiliation are used as primary merging keys. When identifiers are unavailable, normalized metadata such as institution full names serve as fallback keys.
- Micro-level: Record-level identifiers such as DOI or USPTO patent numbers are used. When unavailable, normalized metadata (e.g., titles and author lists) support the joining phase.
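The fallback logic of the macro-level merge can be sketched as a key function that prefers a standardized identifier and otherwise falls back to a normalized institution name. This Python snippet is a minimal illustration under assumed field names (`ror`, `grid`, `name`); the helper names, the AHSC codes, and the institutions are all hypothetical, and the project's actual R implementation may differ.

```python
import re
import unicodedata

def normalize_name(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def affiliation_key(record):
    """Prefer ROR, then GRID, then the normalized institution name."""
    for field in ("ror", "grid"):
        if record.get(field):
            return (field, record[field])
    return ("name", normalize_name(record["name"]))

# Hypothetical reference table of institutions.
lookup = {}
for inst in [
    {"ahsc": "AHSC-01", "ror": "https://ror.org/00example0",
     "grid": None, "name": "Example Hospital"},
    {"ahsc": "AHSC-02", "ror": None,
     "grid": None, "name": "Università degli Studi di Esempio"},
]:
    lookup[affiliation_key(inst)] = inst["ahsc"]

# Incoming record without identifiers: matched through the fallback key.
incoming = {"ror": None, "grid": None,
            "name": "UNIVERSITA  DEGLI STUDI DI ESEMPIO"}
matched = lookup.get(affiliation_key(incoming))
print(matched)
```

One limitation of this sketch: a record carrying only a ROR identifier will not match one carrying only a GRID identifier, so a real pipeline would also need a crosswalk between the two registries.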
Duplicate publications are removed by cross-referencing identifiers and item titles through a purpose-built algorithm developed in R. The final output is an integrated, cleaned, and normalized scientometric database covering all dimensions of research output for Italian AHSCs.
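The deduplication idea can be illustrated with a short Python sketch that treats two records as duplicates when they share a DOI or a normalized title. This is a simplified stand-in, not the project's R algorithm: the function names and sample records are hypothetical, and title-only matching can wrongly merge distinct works with identical titles, so a production version would likely add further checks (for example year or author lists).

```python
import re

def title_key(title):
    """Lowercase a title and reduce it to alphanumeric words."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())

def deduplicate(records):
    """Keep the first record seen for each DOI or normalized title."""
    seen_dois, seen_titles, unique = set(), set(), []
    for rec in records:
        doi = (rec.get("doi") or "").lower()
        title = title_key(rec["title"])
        if (doi and doi in seen_dois) or title in seen_titles:
            continue  # duplicate by identifier or by title
        if doi:
            seen_dois.add(doi)
        seen_titles.add(title)
        unique.append(rec)
    return unique

# Hypothetical records: the second and third duplicate the first.
records = [
    {"doi": "10.1234/abc", "title": "A Study of Example Outcomes"},
    {"doi": "10.1234/ABC", "title": "A study of example outcomes."},
    {"doi": None, "title": "A Study of Example Outcomes"},
    {"doi": "10.1234/xyz", "title": "Another Example"},
]

unique = deduplicate(records)
print(len(unique))
```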