TL;DR

Researchers have developed an end-to-end pipeline that extracts and analyzes institutional affiliations from all accepted ICLR 2026 papers. Because the resulting dataset and visualizations are built solely from the papers' PDFs, they avoid the profile drift that affects author-profile data and give a clearer picture of who is shaping AI research today.

A new pipeline has converted all 5,356 accepted papers at ICLR 2026 into a verified, PDF-derived institutional-affiliation dataset and visualizations, offering a clearer view of the institutions shaping AI research today. This approach circumvents previous issues with author profile drift, providing more accurate data for analysis.

The pipeline extracts affiliations directly from PDF title blocks rather than from author profiles, which reduces errors caused by profile updates or inaccuracies. A set of roughly 250 regex rules normalizes institution names to keep them consistent across the dataset. The dataset records institution names, countries, and paper titles, and counts are based on the affiliations listed on each paper.
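To make the normalization step concrete, here is a minimal sketch of how regex rules might map raw title-block strings to canonical institution names. The three rules, the `RULES` list, and the `normalize_affiliation` function are hypothetical illustrations, not the project's actual rule set or code.

```python
import re

# Hypothetical rules: (compiled pattern, canonical name).
# The real pipeline reportedly uses roughly 250 rules; these are examples only.
RULES = [
    (re.compile(r"\b(mit|massachusetts institute of technology)\b", re.I),
     "Massachusetts Institute of Technology"),
    (re.compile(r"\b(cmu|carnegie mellon)\b", re.I),
     "Carnegie Mellon University"),
    (re.compile(r"\beth z(ü|u)rich\b", re.I),
     "ETH Zurich"),
]


def normalize_affiliation(raw: str) -> str:
    """Return the canonical name for a raw affiliation string.

    Falls back to the whitespace-collapsed input when no rule matches.
    """
    cleaned = " ".join(raw.split())  # collapse runs of whitespace
    for pattern, canonical in RULES:
        if pattern.search(cleaned):
            return canonical  # first matching rule wins
    return cleaned
```

With first-match-wins ordering, more specific rules would need to come before broader ones; whether the actual pipeline orders its rules this way is not stated in the source.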

In total, the dataset covers 5,356 accepted papers, with institutions ranked by the number of papers they appear on, both overall and by first authors. The analysis distinguishes between academic and industry institutions, visualized through treemaps that size institutions by publication count and regions by aggregate contributions.
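The two rankings described here can be sketched as follows. This is an illustration under assumptions: the `author_affiliations` field (one list of normalized institution names per author, in author order) and the `rank_institutions` function are hypothetical, and an institution is counted once per paper regardless of how many of its authors appear.

```python
from collections import Counter


def rank_institutions(papers):
    """Rank institutions by number of papers, overall and by first author.

    Each paper is a dict with a hypothetical "author_affiliations" key:
    a list (in author order) of lists of institution names.
    """
    overall = Counter()
    first_author = Counter()
    for paper in papers:
        author_affils = paper["author_affiliations"]
        # Count each institution at most once per paper.
        seen = {inst for affils in author_affils for inst in affils}
        overall.update(seen)
        if author_affils:
            first_author.update(set(author_affils[0]))

    def ranked(counter):
        # Sort by descending count, then name, so ties are deterministic.
        return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))

    return ranked(overall), ranked(first_author)
```

A paper whose first author lists two institutions credits both in the first-author ranking here; the source does not say how the dataset handles that case.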

Why It Matters

This development offers a more reliable resource for understanding research trends, institutional influence, and geographic distribution in AI. It enables more accurate assessments of the leading players in the field, informs funding and collaboration decisions, and helps track shifts in research focus over time.

By providing a transparent, PDF-based methodology, the dataset reduces biases inherent in author profile data, which can be outdated or inaccurate. This approach enhances the integrity of research landscape analyses, making it valuable for academics, industry stakeholders, and policymakers.

Background

Previous analyses of AI research influence relied heavily on author profiles and OpenReview data, which are prone to drift and inaccuracies. The ICLR 2026 dataset builds on prior efforts to extract affiliations directly from PDFs, a method that has gained traction for its accuracy. This release follows similar initiatives at other conferences but is notable for its scale and rigorous normalization process.

The pipeline was developed using a combination of PDF parsing techniques and regex-based normalization, covering common layout patterns in conference papers. It also includes sensitivity analyses comparing different counting methods to verify the robustness of the rankings.
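One way such a sensitivity analysis could work is to compare full counting (each institution on a paper gets one credit) with fractional counting (each paper's single credit is split evenly among its institutions) and check how much the top of the ranking changes. The source does not say which counting methods were compared, so the choice of methods, the function names, and the top-k overlap measure below are all assumptions.

```python
from collections import defaultdict


def count_full(papers):
    """Full counting: every institution on a paper receives 1 credit."""
    totals = defaultdict(float)
    for insts in papers:
        for inst in set(insts):
            totals[inst] += 1.0
    return dict(totals)


def count_fractional(papers):
    """Fractional counting: a paper's 1 credit is split among its institutions."""
    totals = defaultdict(float)
    for insts in papers:
        unique = set(insts)
        for inst in unique:
            totals[inst] += 1.0 / len(unique)
    return dict(totals)


def top_k(totals, k):
    """Names of the k highest-scoring institutions (ties broken by name)."""
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def top_k_overlap(papers, k):
    """Share of the top-k list that both counting methods agree on (k >= 1)."""
    full = set(top_k(count_full(papers), k))
    frac = set(top_k(count_fractional(papers), k))
    return len(full & frac) / k
```

Here `papers` is a list in which each entry is the list of normalized institution names on one paper. An overlap close to 1.0 would suggest that the ranking is not very sensitive to the counting method.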

“This pipeline provides a more accurate picture of who is contributing to AI research right now, free from profile drift issues.”

— Dmytro Lopushanskyy, project lead

“Having a clean, normalized dataset helps us understand institutional influence more reliably than ever before.”

— Research community analyst

What Remains Unclear

It is not yet clear how the dataset compares to author profile-based rankings in terms of influence or prestige. The pipeline’s normalization rules, while extensive, may still miss some institutional variants or recent name changes. Additionally, the impact of non-PDF sources or future updates remains to be seen.

What’s Next

The team plans to release updated versions of the dataset as new papers are processed and to apply similar methods to other conferences. Further analysis will compare this PDF-derived data with traditional profile-based rankings to evaluate differences and potential biases. Additionally, researchers may incorporate this dataset into broader analyses of research trends and collaboration networks.

Key Questions

How does this dataset differ from previous author profile data?

The dataset is derived directly from PDF title blocks, avoiding the profile drift problem common in author profiles, which can be outdated or inaccurate. It offers a more stable and normalized view of institutional affiliations for the papers accepted at ICLR 2026.

What institutions are most prominent in the dataset?

The top-ranked institutions are identified based on the number of papers they appear on, with notable entries from leading universities and industry labs. The visualizations show a mix of academia and industry, with some regions dominating specific research areas.

Can this approach be applied to other conferences?

Yes, the pipeline is designed to be adaptable for other conferences that publish PDFs with structured title blocks. The team plans to extend this methodology to future events, enhancing cross-conference comparisons.
