Jerry Yao-Chieh Hu, 胡耀傑
Department of Computer Science
Northwestern University
jhu \at\ u.northwestern.edu
I am a PhD candidate in Computer Science at Northwestern University, advised by Han Liu in the MAGICS Lab. I received my B.S. in Physics from National Taiwan University, where I was advised by Pisin Chen.
My research focuses on theoretical foundations and principled methodologies for Large Language Models, Foundation Models and Generative AI. My long-term goal is to leverage machine learning to tackle important scientific and societal challenges.
Recently, I have focused on understanding inference and learning in large pretrained models through the dual lens of statistics and neuroscience. This unique (model-based) perspective allows me to explore:
Computational and statistical properties of pretrained transformer and diffusion models for pretraining, inference, fine-tuning, compression and alignment
New methodological and algorithmic designs, with theoretical guarantees to ensure their practical optimality
I dedicate two hours each week to chatting with master's, undergraduate, and high school outreach students about Research, Grad School, and My Transition to ML Research from a Non-CS/ML Background.
I welcome students from underrepresented groups and will prioritize these meetings.
Please use this link to schedule a chat :)
I will be attending ICML 2025 in Vancouver from July 13th to July 19th. Let me know if you'd like to catch up!
I study the statistical and computational foundations of large‑scale pretrained models and their real‑world applications.
Big AI models act like black boxes. I study them as if they were brains that learn, store, and recall memories.
Thrust 1. I show how training “writes” memories into the model and prove how much it can hold [NeurIPS'24a; ICML'24a; ICML'24b; ICLR'24; NeurIPS'23]. Then I invent training tricks that let it remember more while using less compute [NeurIPS'24a; ICML'24b; ICML'24c].
Thrust 2. I reveal how the model “reads” those memories to solve new tasks [ICML'25c; ICML'25d; ICLR'25b] and design faster, clearer reasoning steps [ICML'25a; ICML'25b; ICLR'25a].
Thrust 3. I engineer plug-in memory modules so the model can learn fresh facts [ICML'25b; ICLR'24] or delete harmful ones without full retraining [USENIX Sec '25].
These ideas make future AI cheaper, safer, and easier to understand, and they give neuroscientists new test-beds for how real brains might work.
Beyond academia, my research also contributes to critical application domains: Particle Physics at Fermilab, Drug Design at AbbVie, Finance at Gamma Paradigm Capital, and NdLinear at Ensemble AI.
Rethinking Pretrained Models as Statistical Brains via the Lens of Dense Associative Memory (DenseAM a.k.a. Modern Hopfield Models).
Entropy‑Regularized DenseAM ⇄ Transformer Attention — unified theory & capacity bounds [NeurIPS'24a; ICML'24a; ICML'24b; ICLR'24; NeurIPS'23] (see the retrieval sketch after this list)
Nonparametric DenseAM — auto/hetero associative memory with statistical guarantees [ICML'25b]
DenseAM Computational Limits — almost‑linear‑time lower bounds [ICML'24a]
Larger DenseAM Capacity for Better Transformer Representation Learning — method [NeurIPS'24a] and optimal capacity [ICML'24b]
In-Context Learning as Conditional Associative Memory Retrieval [ICML'25c, To Appear]
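To make the DenseAM ⇄ attention correspondence above concrete, here is a minimal illustrative sketch (my own, not code from the papers; names like `beta`, `X`, and `xi` are assumptions): one retrieval step of a dense associative memory is exactly the softmax-attention readout, with the stored patterns playing the role of keys and values.

```python
# Illustrative sketch only: one DenseAM retrieval step equals softmax attention.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def denseam_retrieve(X, xi, beta=1.0, steps=1):
    """Retrieve from stored patterns X (d x M) given a query xi (d,).

    Each update xi <- X @ softmax(beta * X.T @ xi) is the attention readout
    with keys/values X and a single query xi (inverse temperature beta).
    """
    for _ in range(steps):
        xi = X @ softmax(beta * (X.T @ xi))
    return xi

# Example: a noisy query converges toward the stored pattern it is closest to.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 5))                 # five stored patterns in R^16
noisy = X[:, 2] + 0.1 * rng.standard_normal(16)  # corrupted copy of pattern 2
retrieved = denseam_retrieve(X, noisy, beta=4.0, steps=3)
print(np.argmin(np.linalg.norm(X - retrieved[:, None], axis=0)))  # expected: 2
```

The capacity and computational-limit results above characterize when, and how fast, this kind of retrieval succeeds.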
How far can fine-tuning methods push performance after pretraining?
Computational Limits of LoRA — hardness and a fast algorithm via inherent low‑rank gradient structure [ICLR'25a] (see the LoRA sketch after this list)
Fundamental Limits of Prompt-Tuning — universal approximation & computational limits [ICLR'25b]
In-Context Deep Learning via Transformer Models — provably in-context gradient descent for deep neural networks [ICML'25d]
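For context on the LoRA entry above, here is a minimal sketch of the low-rank adapter structure these analyses study; the class and hyperparameter names below are illustrative assumptions, not code from the paper.

```python
# Illustrative LoRA sketch: a frozen weight W plus a trainable low-rank update,
# W_eff = W + (alpha / r) * B @ A, so only r * (d_in + d_out) parameters train.
import numpy as np

class LoRALinear:
    def __init__(self, W, r=4, alpha=8.0, seed=0):
        d_out, d_in = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                      # frozen pretrained weight
        self.A = 0.01 * rng.standard_normal((r, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, r))                   # trainable up-projection; zero init keeps W_eff = W at start
        self.scale = alpha / r                          # standard alpha / r scaling

    def forward(self, x):
        # During fine-tuning only A and B receive gradients; W stays fixed.
        return x @ (self.W + self.scale * self.B @ self.A).T

layer = LoRALinear(np.random.default_rng(1).standard_normal((64, 128)), r=4)
y = layer.forward(np.ones((2, 128)))   # shape (2, 64); matches the frozen layer at initialization
```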
Bridging diffusion models, transformers, and optimal distribution estimation.
Statistical Rates of Diffusion Transformers (DiTs) — Approximation & Minimax Rates [NeurIPS'24b, ICLR'25c]
Making large models smaller, safer, and attack‑resistant.
Robust Model Quantization with Outlier-Free DenseAM — "Softmax_N" attention as a quantization-robust and resource-efficient backbone for LLMs [ICML'24c] (see the Softmax_N sketch after this list)
Differentially Private Query Algorithm — improved privacy-utility tradeoff and efficiency [arXiv]
Bayesian View of LLM Jailbreaks [USENIX Sec '25, arXiv]
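A hedged illustration of the "Softmax_N" idea referenced above, assuming the common definition that adds a constant N to the softmax normalizer: with N > 0 an attention head can assign near-zero weight everywhere, which suppresses the large activation outliers that make post-training quantization hard. This is my own sketch, not code from [ICML'24c].

```python
# Sketch of Softmax_N (assumed definition): exp(x_i) / (N + sum_j exp(x_j)).
import numpy as np

def softmax_n(logits, n=1.0, axis=-1):
    """Numerically stable softmax with an extra constant n in the denominator."""
    m = logits.max(axis=axis, keepdims=True)
    e = np.exp(logits - m)
    return e / (n * np.exp(-m) + e.sum(axis=axis, keepdims=True))

# Ordinary softmax (n=0) must distribute a total weight of 1 even when every
# logit is very negative; softmax_1 can instead output (almost) nothing,
# which is the outlier-suppression idea.
logits = np.array([-10.0, -10.0, -10.0])
print(softmax_n(logits, n=0.0))  # ~[0.333, 0.333, 0.333]
print(softmax_n(logits, n=1.0))  # ~[4.5e-05, 4.5e-05, 4.5e-05]
```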
Turning theory into impact on critical domains.
Programmable Feature Engineering for Time Series [ICML'23]
Fast (Even Trainless) Test-Time Adaptation for Time Series [ICLR'24]
Sparsity-Aware, Multi-Resolution, Bi-Directional Tabular Learning [ICML'24d]
Real-time Edge AI for Accelerator Controls (Particle Physics, READS Collaboration) @ Fermilab [IBIC'23, ICALEPCS'23, ML4Phys Workshop @ NeurIPS'23, FastML4Sci@ICCAD'23]
Fast and Low-Cost Genomic Foundation Models via Outlier Removal [ICML'25a]
Bayesian Black-Litterman Portfolio Optimization [ICML'25e]