TL;DR
Researchers have developed an end-to-end pipeline to extract and analyze institutional affiliations from all accepted ICLR 2026 papers. The resulting dataset and visualizations provide a clearer picture of who is shaping AI research today, based solely on PDF data. This addresses previous profile drift problems and offers a robust resource for understanding research trends.
A new pipeline has converted all 5,356 accepted papers at ICLR 2026 into a verified, PDF-derived institutional-affiliation dataset with accompanying visualizations, offering a clearer view of the institutions shaping AI research today. Because affiliations are read from the papers themselves rather than from author profiles, the data avoids profile drift, where a profile edited after submission no longer matches the affiliation printed on the paper.
The pipeline extracts affiliations directly from PDF title blocks, not from author profiles, which reduces errors caused by profile updates or inaccuracies. It applies a set of ~250 regex rules to normalize institution names so that variants of the same name map to a single entry. The dataset includes institution names, countries, and paper titles, and each institution's count is based on the affiliations printed on each paper.
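To make the normalization step concrete, here is a minimal sketch of regex-based name normalization. The three rules, the canonical names, and the `normalize` function are hypothetical illustrations, not the project's actual ~250 rules, which are not reproduced here.

```python
import re

# Hypothetical rules for illustration only; each maps a pattern to one canonical name.
RULES = [
    (re.compile(r"\b(mit|massachusetts institute of technology)\b", re.I), "MIT"),
    (re.compile(r"\bcarnegie mellon( university)?\b|\bcmu\b", re.I), "Carnegie Mellon University"),
    (re.compile(r"\btsinghua( university)?\b", re.I), "Tsinghua University"),
]


def normalize(raw: str) -> str:
    """Return the canonical name of the first matching rule, else the cleaned input."""
    # Collapse whitespace runs that PDF extraction often introduces.
    cleaned = re.sub(r"\s+", " ", raw).strip()
    for pattern, canonical in RULES:
        if pattern.search(cleaned):
            return canonical
    # No rule matched: keep the cleaned string so unknown names stay visible.
    return cleaned


print(normalize("Dept. of EECS,  Massachusetts Institute of Technology"))
print(normalize("CMU"))
```

Returning unmatched strings unchanged, rather than dropping them, is one way to surface institution variants that the rule set does not yet cover.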
Across the 5,356 accepted papers, institutions are ranked by the number of papers they appear on, both across all authors and for first authors only. The analysis distinguishes academic from industry institutions and presents them as treemaps that size each institution by publication count and each region by its aggregate contribution.
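The two rankings can be sketched as follows. The record layout (`papers`, with an ordered list of authors and their normalized affiliations) and the `rank` function are assumptions made for illustration; the sketch also assumes an institution counts once per paper no matter how many of its authors list it.

```python
from collections import Counter

# Hypothetical records: authors are in paper order, each with a list of
# already-normalized affiliations.
papers = [
    {"title": "Paper A", "authors": [["MIT"], ["Google", "MIT"]]},
    {"title": "Paper B", "authors": [["Google"], ["Stanford University"]]},
    {"title": "Paper C", "authors": [["MIT"], ["MIT"]]},
]


def rank(papers):
    """Count papers per institution overall and by first author."""
    overall, first_author = Counter(), Counter()
    for paper in papers:
        # A set ensures each institution is counted once per paper.
        institutions = {inst for author in paper["authors"] for inst in author}
        overall.update(institutions)
        if paper["authors"]:
            first_author.update(set(paper["authors"][0]))
    return overall, first_author


overall, first_author = rank(papers)
print(overall.most_common())
print(first_author.most_common())
```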
Why It Matters
This development offers a more reliable resource for understanding research trends, institutional influence, and geographic distribution in AI. It enables more accurate assessments of the leading players in the field, informs funding and collaboration decisions, and helps track shifts in research focus over time.
By providing a transparent, PDF-based methodology, the dataset reduces biases inherent in author profile data, which can be outdated or inaccurate. This approach enhances the integrity of research landscape analyses, making it valuable for academics, industry stakeholders, and policymakers.

Background
Previous analyses of AI research influence relied heavily on author profiles and OpenReview data, which are prone to drift and inaccuracies. The ICLR 2026 dataset builds on prior efforts to extract affiliations directly from PDFs, a method that has gained traction for its accuracy. This release follows similar initiatives at other conferences but is notable for its scale and rigorous normalization process.
The pipeline was developed using a combination of PDF parsing techniques and regex-based normalization, covering common layout patterns in conference papers. It also includes sensitivity analyses comparing different counting methods to verify the robustness of the rankings.
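A sensitivity check of this kind can be sketched by ranking the same papers under two counting schemes. The article does not say which methods were compared; full counting (one credit per institution per paper) and fractional counting (one credit per paper, split evenly among its institutions) are assumed here as two common choices, and the sample data is invented.

```python
from collections import defaultdict

# Hypothetical data: the set of normalized institutions on each paper.
paper_institutions = [
    {"MIT", "Google"},
    {"Google", "Stanford University"},
    {"MIT"},
    {"Google", "MIT", "Stanford University"},
]


def full_count(papers):
    """Each institution on a paper receives one full credit."""
    totals = defaultdict(float)
    for insts in papers:
        for inst in insts:
            totals[inst] += 1.0
    return dict(totals)


def fractional_count(papers):
    """Each paper contributes one credit, split evenly among its institutions."""
    totals = defaultdict(float)
    for insts in papers:
        for inst in insts:
            totals[inst] += 1.0 / len(insts)
    return dict(totals)


def ranking(totals):
    """Order names by descending score, breaking ties alphabetically."""
    return [name for name, _ in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))]


full = full_count(paper_institutions)
frac = fractional_count(paper_institutions)
print(ranking(full))
print(ranking(frac))
```

If the two orderings agree for the institutions of interest, the ranking is insensitive to the counting choice; where they differ, as in this toy data, the result depends on how collaborative papers are credited.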
“This pipeline provides a more accurate picture of who is contributing to AI research right now, free from profile drift issues.”
— Dmytro Lopushanskyy, project lead
“Having a clean, normalized dataset helps us understand institutional influence more reliably than ever before.”
— Research community analyst
What Remains Unclear
It is not yet clear how the dataset's rankings compare with author profile-based rankings of influence or prestige. The pipeline's normalization rules, while extensive, may still miss some institutional variants or recent name changes. It is also unclear how future updates, or affiliation information from sources other than the PDFs, would change the results.

What’s Next
The team plans to release updated versions of the dataset as new papers are processed and to apply similar methods to other conferences. Further analysis will compare this PDF-derived data with traditional profile-based rankings to evaluate differences and potential biases. Additionally, researchers may incorporate this dataset into broader analyses of research trends and collaboration networks.
Key Questions
How does this dataset differ from profile-based analyses?
The dataset is derived directly from PDF title blocks, avoiding the profile drift problem common in author profiles, which can be outdated or inaccurate. It offers a more stable and normalized view of institutional affiliations for the papers accepted at ICLR 2026.
What institutions are most prominent in the dataset?
The top-ranked institutions are identified based on the number of papers they appear on, with notable entries from leading universities and industry labs. The visualizations show a mix of academia and industry, with some regions dominating specific research areas.
Can this approach be applied to other conferences?
Yes, the pipeline is designed to be adaptable for other conferences that publish PDFs with structured title blocks. The team plans to extend this methodology to future events, enhancing cross-conference comparisons.