By Liz Fuller-Wright, Office of Communications
In his final year at Princeton, Fernando Avilés-García tackled 700-year-old literature with an innovative approach: building an artificially intelligent tool to analyze the language of Dante Alighieri’s “Divine Comedy.”
“The Comedy has been egregiously underserved by modern language models, considering how weighty a text it is,” said Avilés-García, a computer science major with a certificate from the Department of French and Italian. “This project let me overlap my love of solving puzzles through code with my passion for Italian.”
“It’s one of the most original senior theses I’ve read at Princeton through the years — and I’ve read some great senior theses,” said Gaetana Marrone-Puglia, a professor of French and Italian who has taught at Princeton since 1988. “Fernando brought a computer model to texts that normally are in the hands of medievalists. It’s a perfect marriage of science and humanities.”
The final product, “Divining language: Unearthing medieval Italian through natural language processing (NLP),” contributed to his graduating with departmental high honors and winning the inaugural Lucio Caputo Senior Thesis Prize “for an outstanding thesis on the literature, language, culture, economy, history, politics or society of Italy.”
“He has created a tool that I think will be beneficial for the Italian literature community and will inspire future studies,” said Christiane Fellbaum, Avilés-García’s thesis adviser. She is a lecturer with the rank of professor in computer science, linguistics and the Council of the Humanities, as well as a Ph.D. graduate of Princeton in linguistics.
Getting past imposter syndrome
In his first programming courses at Princeton, Avilés-García found himself delighting in the rush of conquering problems. “I really got hooked on that feeling of, ‘I’m making things!’” he said.
By his sophomore year, he was ready to declare computer science as his major. “But part of me was scared, because all the computer science kids I knew had done so much coding in high school,” he said. “Part of me wondered, ‘Am I cut out for this?’”
Once, when Avilés-García was assisting with an intro course, a first-year student asked him about applying a data structure that he’d never even heard of.
But then he thought, “If I’m good enough to teach these kids, or at least debug their code, I can hang in there,” he recalled. So he declared the major and followed his love of language into AI-based translation, eventually creating an app that can translate whole books at once.
Many undergraduates naturally build bridges between humanities and AI, said Natalia Ermolaev, executive director of the Center for Digital Humanities at Princeton. “This happens all the time at Princeton, because we have so many computer science majors that secretly love classics or Italian literature or medieval architecture,” she said.
“So they come to us saying, ‘Please give me a text to work with, or some problem that I can apply my computational knowledge to.’ And then they are floored by the fact that they have to create a dataset, or deal with a language that doesn’t easily get plugged into the models. We see the lightbulbs go off as they gain a fresh understanding of the limitations of models, and just how much of the Internet is English-focused. So then comes the creativity. Do you tweak the models or enhance your data? In that process, you learn a lot about the material and about the language, and from there about the culture.”
From Mexico City to medieval Italy
Born in Mexico City and raised in Basking Ridge, New Jersey, Avilés-García grew up bilingual in English and Spanish, and he fell in love with Italian during summers in Sicily.
So when he was looking for a subject to tackle with his AI language modeling skills, his French and Italian adviser Simone Marchesi steered him toward one of the greatest works in any language: Dante’s “Divine Comedy,” a three-volume journey from Hell to Paradise written between 1308 and 1321.
Just one problem: Dante wrote in an archaic form of a Tuscan dialect, so even modern Italian language models struggle with the text, and English-trained models fare much worse.
“Dante is the father of the Italian language, but his text is not standard Italian,” said Marchesi, a professor of French and Italian and a 2002 Ph.D. graduate of Princeton in comparative literature. It took months of effort, and collaborations with programmers from the University of Pisa, for Avilés-García to train his model to parse medieval Italian.
“Once you have that, you can run fun and intriguing and promising queries, as Fernando has been doing,” Marchesi said.
Shining a new light
Avilés-García began quantifying words that frequently appear together in the Comedy.
He struck gold when he ran queries on the noun “love” (amore). He guessed some words that would accompany it — Beatrice (Dante’s muse), heart, the verb love (amare), affection, sweet, beautiful, beauty, woman, wife, desire, flesh — then ran the model.
He was surprised that almost none of his guesses regularly appear within 15 words of amore, but many words related to light (shine, star, ray) and darkness (night) do. When he turned back to the text, that unexpected connection unlocked a new insight. “Dante describes Hell as a place devoid of stars,” he said. “Then I started seeing that Hell is defined by an absence of this much bigger thing: love.”
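For readers curious what a query like this looks like in practice, here is a minimal sketch of window-based co-occurrence counting. It is not the thesis's model: the function name `cooccurrences`, the regex tokenizer and the one-line sample are assumptions made for illustration, and plain token matching does none of the medieval-Italian parsing described above (it treats "amor" and "amore" as different words, for example).

```python
import re
from collections import Counter


def cooccurrences(text, target, window=15):
    """Count tokens that appear within `window` tokens of each `target` occurrence.

    Hypothetical helper for illustration; a real pipeline would lemmatize first.
    """
    # Lowercase and split on runs of word characters (Unicode-aware in Python 3).
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts


# The closing line of the Paradiso, used here only as a tiny sample text.
sample = "l'amor che move il sole e l'altre stelle"
neighbors = cooccurrences(sample, "amor", window=15)
print(neighbors.most_common(5))
```

Run over the full poem, a count like this gives a ranked list of the words that keep company with a chosen term, which is the kind of evidence that can then be checked against the text itself.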
One of the strengths of interdisciplinary AI research at Princeton is the presence of deep expertise in many subject areas. In this case, Avilés-García turned to one of the world’s leading Dante experts, Marchesi, to ask whether the connection among stars, love and Hell was a trite observation that scholars have recognized for centuries, a radically new concept, or something in between.
“What he has found is real, I would say, and not self-evident,” Marchesi said. Most scholars, he added, have focused on the role of stars as navigational tools, and thus Hell as a disorienting place. “Fernando has proved that a larger conceptual constellation is at stake in their absence.”
Marchesi says he is intrigued by the promise of this new language model. “When you get trained for your job as an academic, you get trained to answer old questions,” he said. “The really exciting part is crossing paths with someone who can ask new questions.”
He looks forward to using this AI model and its future iterations in his own research. “Someone who is a Princetonian once is a Princetonian forever,” he said. “I can reach out to Fernando wherever he goes after Princeton and ask questions and get friendly answers. It’s beautiful.”