More text data is available today than ever before -- from scientific literature to social media to clinical records. Learning representations of this data is key to making it searchable, interpretable, and ultimately useful for discovering patterns that would be impossible to find manually.

Rita González Márquez

Rita González Márquez uses machine learning to build meaningful representations of text data and to explore what makes a representation good in the first place. She is a PhD student and a member of the IMPRS-IS graduate school. Her research spans both high-dimensional and low-dimensional embedding spaces: she fine-tunes transformer-based models to produce text representations that accurately reflect semantic relationships, and she adapts dimensionality reduction methods to large datasets for visualizing and exploring scientific corpora. With these methods, she investigates questions about the structure and evolution of research fields, scientific trends, and research integrity.

González-Márquez, R., Berens, P., & Kobak

Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings

Delving into LLM-assisted writing in biomedical publications through excess vocabulary

The landscape of biomedical research