Victor May

Machine Learning Engineer & Researcher

About Me

Hello, Internet.

I’m an ML Engineer at Bridgewater AIA Labs, where I work on training LLMs for forecasting. My research sits at the intersection of training data and model behavior: how data properties at each stage of the pipeline — from pretraining through agent fine-tuning — shape what models can and can’t do.

On the pretraining side, I’ve worked on MixtureVitae, an open permissive-first corpus that matches non-permissive baselines at a fraction of the token budget (accepted to TMLR with Featured Certification). On the agent side, I study how training trace semantics affect deployed behavior, and build rigorous benchmarks for evaluating code agents in realistic settings (FreshBrew at ICSE 2026, GitChameleon at ACL 2026).

Prior to Bridgewater, I was a Staff ML Engineer at Google Cloud, focused on AI agents for software engineering. Before that, I led a team at Chegg fine-tuning vision-language models, and worked on recommender systems and multilingual NLP at Taboola.

I hold an M.Sc. in Applied Mathematics from Tel Aviv University and a B.Sc. in Computer Science and Mathematics from Bar-Ilan University.

Links

Google Scholar | LinkedIn | Resume | X (Twitter)

News

April 2026: Our paper GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities was accepted to ACL 2026 (Main Track).

April 2026: Our paper MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources was accepted to Transactions on Machine Learning Research (TMLR) with Featured Certification.

March 2026: Our paper MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources was accepted to the Data-FM workshop at ICLR 2026.

October 2025: Our paper FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration was accepted to the International Conference on Software Engineering (ICSE) 2026.

September 2025: Our papers FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration and GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities were accepted to the NeurIPS 2025 Deep Learning for Code Workshop.

Publications

2026
Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Victor May, Aaditya Salgarkar, Yishan Wang, Diganta Misra, Huu Nguyen · arXiv preprint (2026) preprint
GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, ... · ACL 2026 Main Track (2026) accepted
FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration

Victor May, Diganta Misra, Yanqi Luo, Anjali Sridhar, Justine Gehring, ... · International Conference on Software Engineering (ICSE), Research Track (2026) accepted
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, ... · Transactions on Machine Learning Research (TMLR) (2026) accepted
2025
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T. Stillerman, ..., Victor May, ... · Proceedings of the 31st International Conference on Computational Linguistics: Industry Track (2025) accepted
2015
An Algorithm for Improving Non-Local Means Operators via Low-Rank Approximation

Victor May, Yosi Keller, Nir Sharon, Yoel Shkolnisky · IEEE Transactions on Image Processing (2015) accepted

Blogging

I write about machine learning and related topics on Medium.

Kaggle Competitions

🥈 Silver Medal (Top 1%) – Feedback Prize: English Language Learning
🥈 Silver Medal (Top 3%) – Google AI4Code
🥉 Bronze Medal (Top 6%) – U.S. Patent Phrase to Phrase Matching