CV
Contact Information
| Name | Alessio Cocchieri |
| Email | alessiococchieri.ac@gmail.com |
Experience
-
March 2026 - May 2026 Dublin, Ireland
Research Scientist Internship (PhD Intern)
IBM Research
Natural Language Processing — LLM Factuality
- Created a novel benchmark for multimodal scientific fact-checking, targeting LLM long-form generation
- Coordinated a multi-team human annotation pipeline, overseeing annotation guidelines, quality control, and inter-annotator agreement (IAA)
- Conducted systematic analysis exposing critical factuality fragilities in multimodal LLMs across scientific domains
-
March 2023 - May 2023 Dublin, Ireland
Research Scientist Internship (Master Thesis)
IBM Research
Natural Language Processing — Named Entity Recognition
- Leveraged LLM distillation to improve smaller models for zero-shot NER
- Produced two peer-reviewed publications — OpenBioNER (NAACL 2025) and ZeroNER (ACL 2025)
- Contributed to the IBM zshot library for zero-shot NER model inference
About Me
- Final-year NLP PhD candidate (ending Nov 2026) at UniboNLP, with 9+ publications at ACL, EMNLP, NAACL, and EACL.
- I specialize in LLM evaluation and knowledge distillation for low-resource NLP, with applications to high-stakes domains like medicine.
- Active reviewer for ACL Rolling Review (ARR) and top-tier ML conferences such as NeurIPS.
Education
-
2023 - present Bologna, Italy
PhD
University of Bologna
Natural Language Processing
- Focus: LLMs, RAG, Information Extraction, Benchmarking, Low-resource NLP
- Supervisor: Prof. Gianluca Moro — Research Group: UniboNLP
-
2021 - 2023 Bologna, Italy
MSc
University of Bologna
Artificial Intelligence
- Grade: 110/110 with honors
-
2018 - 2021 Bologna, Italy
BSc
University of Bologna
Computer Science
- Grade: 110/110 with honors
Selected Publications
-
ACL 2026 LLMs (Almost) Never Abstain Under Medical Uncertainty
We introduce MedQAbstain, a benchmark for medical abstention under uncertainty, revealing that state-of-the-art LLMs systematically overcommit, rarely abstaining even when the question itself is hidden.
-
EACL 2026 ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?
We show that high LLM accuracy in medical MCQA masks severe inconsistency. We propose novel metrics to evaluate true reliability across MCQA formats.
-
ACL 2025 What do you call a dog that is incontrovertibly true? Dogma: Testing LLM Generalization through Humor
We introduce Phunny, a novel benchmark using uncontaminated English puns, revealing that LLMs struggle with generalization even on simple tasks, consistently underperforming the human baseline.
-
EMNLP 2025 Can Large Language Models Win the International Mathematical Games?
We introduce MathGames, a novel multimodal benchmark of age-graded math problems from an international competition, showing that frontier LLMs underperform humans, including 11-year-olds.
-
NAACL 2025 OpenBioNER: Lightweight Open-Domain Biomedical NER Through Entity Type Description
We introduce a 110M BERT model that leverages descriptions for zero-shot Biomedical NER, outperforming GPT-4o, specialized LLMs, and GLiNER by up to 10% F1.
-
ACL 2024 To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering
We introduce MedGENIE, the first generate-then-read framework for open-domain medical QA, demonstrating the effectiveness of generated over retrieved contexts and significantly improving LLM RAG performance.