TL;DR

Researchers have developed an end-to-end pipeline to extract and analyze institutional affiliations from all accepted ICLR 2026 papers. The resulting dataset and visualizations provide a clearer picture of who is shaping AI research today, based solely on PDF data. This addresses previous profile drift problems and offers a robust resource for understanding research trends.

A new pipeline has converted all 5,356 accepted papers at ICLR 2026 into a verified, PDF-derived institutional-affiliation dataset and visualizations, offering a clearer view of the institutions shaping AI research today. This approach circumvents previous issues with author profile drift, providing more accurate data for analysis.

The pipeline extracts affiliations directly from PDF title blocks, not from author profiles, which reduces errors caused by profile updates or inaccuracies. A set of roughly 250 regex rules normalizes institution names so that variants of the same name map to a single entry. The dataset records institution names, countries, and paper titles, and counts each institution by the papers whose title blocks list it.
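To make the normalization step concrete, here is a minimal sketch of rule-based name normalization in Python. The three rules, the canonical names, and the function name normalize_affiliation are illustrative assumptions; the project's actual rules are not reproduced in this article.

```python
import re

# Hypothetical rules: (compiled pattern, canonical name).
# The real pipeline reportedly uses roughly 250 rules; these are examples only.
RULES = [
    (re.compile(r"\b(mit|massachusetts institute of technology)\b", re.I), "MIT"),
    (re.compile(r"\b(cmu|carnegie mellon( university)?)\b", re.I), "Carnegie Mellon University"),
    (re.compile(r"\beth z[uü]rich\b", re.I), "ETH Zurich"),
]


def normalize_affiliation(raw: str) -> str:
    """Map a raw title-block affiliation string to a canonical institution name."""
    # Collapse the irregular whitespace that PDF text extraction often leaves behind.
    cleaned = " ".join(raw.split())
    for pattern, canonical in RULES:
        if pattern.search(cleaned):
            return canonical
    # No rule matched: return the cleaned string so it can be reviewed later.
    return cleaned


if __name__ == "__main__":
    print(normalize_affiliation("Dept. of EECS,  Massachusetts Institute of Technology"))
    print(normalize_affiliation("ETH Zürich"))
```

In this sketch, strings that match no rule pass through unchanged apart from whitespace cleanup, which makes it easy to spot the variants that still need a rule.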

In total, the dataset covers 5,356 accepted papers, with institutions ranked by the number of papers they appear on, both overall and by first authors. The analysis distinguishes between academic and industry institutions, visualized through treemaps that size institutions by publication count and regions by aggregate contributions.
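The two rankings described above can be sketched as follows. The record layout (a paper as a dict with an "authors" list, each author being a list of normalized institution names) and the function name count_papers_per_institution are assumptions made for illustration, not the dataset's published schema.

```python
from collections import Counter


def count_papers_per_institution(papers):
    """Return (overall, first_author) counters of papers per institution.

    Each paper is assumed to look like:
        {"title": str, "authors": [[institution, ...], ...]}
    with authors in title-block order.
    """
    overall = Counter()
    first_author = Counter()
    for paper in papers:
        authors = paper["authors"]
        # An institution counts once per paper, however many authors list it.
        overall.update({inst for affiliations in authors for inst in affiliations})
        if authors:
            first_author.update(set(authors[0]))
    return overall, first_author


if __name__ == "__main__":
    sample = [
        {"title": "Paper 1", "authors": [["Univ A"], ["Univ A", "Lab B"]]},
        {"title": "Paper 2", "authors": [["Lab B"], ["Univ A"]]},
    ]
    overall, first = count_papers_per_institution(sample)
    print(overall.most_common())
    print(first.most_common())
```

Counting each institution once per paper keeps a large author list from inflating one institution's total.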

Why It Matters

This development offers a more reliable resource for understanding research trends, institutional influence, and geographic distribution in AI. It enables more accurate assessments of the leading players in the field, informs funding and collaboration decisions, and helps track shifts in research focus over time.

By providing a transparent, PDF-based methodology, the dataset reduces biases inherent in author profile data, which can be outdated or inaccurate. This approach enhances the integrity of research landscape analyses, making it valuable for academics, industry stakeholders, and policymakers.

Background

Previous analyses of AI research influence relied heavily on author profiles and OpenReview data. Such profiles drift: they show an author's current affiliation, which may differ from the one printed on the paper. The ICLR 2026 dataset builds on prior efforts to extract affiliations directly from PDFs, a method that has gained traction for its accuracy. This release follows similar initiatives at other conferences but is notable for its scale and rigorous normalization process.

The pipeline was developed using a combination of PDF parsing techniques and regex-based normalization, covering common layout patterns in conference papers. It also includes sensitivity analyses comparing different counting methods to verify the robustness of the rankings.
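The article does not say which counting methods the sensitivity analysis compares. As one plausible illustration, the sketch below contrasts full counting (every listed institution gets one credit per paper) with fractional counting (one credit per paper, split evenly among its institutions) and measures how much the top of the ranking changes. All function names here are hypothetical.

```python
from collections import defaultdict


def full_counts(papers):
    """Each institution on a paper receives one full credit."""
    counts = defaultdict(float)
    for institutions in papers:
        for inst in set(institutions):
            counts[inst] += 1.0
    return dict(counts)


def fractional_counts(papers):
    """Each paper contributes one credit, split evenly among its institutions."""
    counts = defaultdict(float)
    for institutions in papers:
        unique = set(institutions)
        for inst in unique:
            counts[inst] += 1.0 / len(unique)
    return dict(counts)


def top_k(counts, k):
    """Top-k institution names, ties broken alphabetically for determinism."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


def top_k_overlap(counts_a, counts_b, k):
    """Fraction of the top-k institutions shared by two counting methods."""
    return len(set(top_k(counts_a, k)) & set(top_k(counts_b, k))) / k


if __name__ == "__main__":
    # Each paper is a list of normalized institution names (toy data).
    papers = [["A", "B"], ["A"], ["C"], ["C"], ["B", "A"]]
    print(top_k_overlap(full_counts(papers), fractional_counts(papers), k=2))
```

A high overlap across methods would suggest the ranking does not depend much on the counting choice; a low overlap would flag institutions whose position rests mainly on heavily co-affiliated papers.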

“This pipeline provides a more accurate picture of who is contributing to AI research right now, free from profile drift issues.”

— Dmytro Lopushanskyy, project lead

“Having a clean, normalized dataset helps us understand institutional influence more reliably than ever before.”

— Research community analyst

What Remains Unclear

It is not yet clear how the dataset compares to author profile-based rankings in terms of influence or prestige. The pipeline’s normalization rules, while extensive, may still miss some institutional variants or recent name changes. Additionally, the impact of non-PDF sources or future updates remains to be seen.

What’s Next

The team plans to release updated versions of the dataset as new papers are processed and to apply similar methods to other conferences. Further analysis will compare this PDF-derived data with traditional profile-based rankings to evaluate differences and potential biases. Additionally, researchers may incorporate this dataset into broader analyses of research trends and collaboration networks.

Key Questions

How does this dataset differ from previous author profile data?

The dataset is derived directly from PDF title blocks, avoiding the profile drift problem common in author profiles, which can be outdated or inaccurate. It offers a more stable and normalized view of institutional affiliations for the papers accepted at ICLR 2026.

What institutions are most prominent in the dataset?

The top-ranked institutions are identified based on the number of papers they appear on, with notable entries from leading universities and industry labs. The visualizations show a mix of academia and industry, with some regions dominating specific research areas.

Can this approach be applied to other conferences?

Yes, the pipeline is designed to be adaptable for other conferences that publish PDFs with structured title blocks. The team plans to extend this methodology to future events, enhancing cross-conference comparisons.
