CV

Contact Information

Name	Alessio Cocchieri
Email	alessiococchieri.ac@gmail.com

Experience

March 2026 - May 2026

Dublin, Ireland
Research Scientist Internship (PhD Intern)

IBM Research

Natural Language Processing — LLM Factuality
- Created a novel benchmark for multimodal scientific fact-checking, targeting LLM long-form generation
- Coordinated a multi-team human annotation pipeline, overseeing annotation guidelines, quality control, and inter-annotator agreement (IAA)
- Conducted a systematic analysis exposing critical factuality weaknesses in multimodal LLMs across scientific domains

March 2023 - May 2023

Dublin, Ireland
Research Scientist Internship (Master Thesis)

IBM Research

Natural Language Processing — Named Entity Recognition
- Leveraged LLM distillation to improve smaller models for zero-shot NER
- Produced two peer-reviewed publications — OpenBioNER (NAACL 2025) and ZeroNER (ACL 2025)
- Contributed to the IBM zshot library for zero-shot NER model inference

About Me

Final-year NLP PhD candidate (ending Nov 2026) at UniboNLP, with 9+ publications at ACL, EMNLP, NAACL, and EACL.

I specialize in LLM evaluation and knowledge distillation for low-resource NLP, with applications to high-stakes domains like medicine.

Active reviewer for ARR and top-tier ML conferences like NeurIPS.

Education

2023 - present

Bologna, Italy
PhD

University of Bologna

Natural Language Processing
- Focus: LLMs, RAG, Information Extraction, Benchmarking, Low-resource NLP
- Supervisor: Prof. Gianluca Moro (research group: UniboNLP)

2021 - 2023

Bologna, Italy
MSc

University of Bologna

Artificial Intelligence
- Grade: 110/110 with honors

2018 - 2021

Bologna, Italy
BSc

University of Bologna

Computer Science
- Grade: 110/110 with honors

Selected Publications

ACL 2026

LLMs (Almost) Never Abstain Under Medical Uncertainty

We introduce MedQAbstain, a benchmark for medical abstention under uncertainty, revealing that state-of-the-art LLMs systematically overcommit, rarely abstaining even when the question itself is hidden.

EACL 2026

ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?

We show that high LLM accuracy in medical MCQA masks severe inconsistency. We propose novel metrics to evaluate true reliability across MCQA formats.

ACL 2025

What do you call a dog that is incontrovertibly true? Dogma: Testing LLM Generalization through Humor

We introduce Phunny, a novel benchmark using uncontaminated English puns, revealing that LLMs struggle with generalization even on simple tasks, consistently underperforming the human baseline.

EMNLP 2025

Can Large Language Models Win the International Mathematical Games?

We introduce MathGames, a novel multimodal benchmark of age-graded math problems from an international competition, showing that frontier LLMs underperform humans, including 11-year-olds.

NAACL 2025

OpenBioNER: Lightweight Open-Domain Biomedical NER Through Entity Type Description

We introduce a 110M BERT model that leverages descriptions for zero-shot Biomedical NER, outperforming GPT-4o, specialized LLMs, and GLiNER by up to 10% F1.

ACL 2024

To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

We introduce MedGENIE, the first generate-then-read framework for open-domain medical QA, demonstrating the effectiveness of generated over retrieved contexts and significantly improving LLM RAG performance.
