I'm a final-year Ph.D. candidate in Computer Science at Johns Hopkins University, advised by Prof. Alan L. Yuille and Prof. Rama Chellappa.
I was also named a Siebel Scholar, the highest distinction for Ph.D. students in Bioengineering at JHU. I'm best known for my neural architecture TransUNet, which has over 8,000 citations.
I am fascinated by how intelligence can operate in the real world. My research builds scalable, structured world models that connect artificial and natural intelligence, enabling new forms of reasoning and interaction across computer vision, robotics, and healthcare.
I love mentoring and teaching undergraduates; several of my mentees have been recognized with top CS research honors. I also teach Machine Imagination (EN.601.208) at Johns Hopkins in 2025/2026.
An undergraduate mentee received an Honorable Mention for the CRA Outstanding Undergraduate Researcher Award. Congrats, Arda!
An undergraduate mentee won the Michael J. Muuss Research Award and was a finalist (1 of 24 nationwide) for the CRA Outstanding Undergraduate Researcher Award. Congrats, TaiMing!
Research Areas
Over the next decade, my research aims to answer a central question: how can we bring intelligence into the real world to meaningfully benefit humanity?
This pursuit is structured across three pillars:
Building Foundation Neural Architectures to learn scalable representations from raw sensory data.
Establishing Predictive Visual Modeling grounded in human-like mental models to achieve closed-loop embodiment.
Developing Proactive Biomedical Systems via medical world models to reduce cancer mortality and enhance human life.
Generation, perception, and action within the physical world (World-in-World, ICLR'26, under review)
Proactive Biomedical Systems
Scalable Early Diagnosis
▸ Scaling AI across eight major cancers (CancerUnit, ICCV'23)
▸ Scaling cancer AI with reports (R-Super, MICCAI'25)
Treatment Discovery
▸ Personalized treatment planning via simulation (Medical World Model, ICCV'25)
Foundation Neural Architecture
Visual Dense Learning
▸ TransUNet: the first scalable Transformer architecture that fuses global attention with U-Net's local comprehension (see the sketch after this list).
▸ Swin-Unet: upgrades TransUNet with a pure-attention design.
▸ TransFG: introduces novel part-level attention.
Multimodal Learning
▸ Visual encoder design (ViTamin, CVPR'24)
▸ Visual representation in language models (LLaVolta, NeurIPS'24)
ViTamin achieved state-of-the-art results on 60+ multimodal benchmarks in 2024 and was adopted into the widely used timm (36k stars) and openclip (13k stars) codebases.
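For readers curious how the TransUNet idea fits together, here is a minimal, hypothetical sketch in PyTorch. The class name TransUNetSketch and all hyperparameters are illustrative and simplified, not the released implementation: a small CNN stem captures local detail, a Transformer encoder applies global self-attention over the downsampled feature grid, and a U-Net-style decoder upsamples back to full resolution through a skip connection.

```python
import torch
import torch.nn as nn

class TransUNetSketch(nn.Module):
    """Simplified illustration of the TransUNet idea (not the released code):
    CNN for local features + Transformer for global attention + U-Net decoder."""
    def __init__(self, in_ch=1, num_classes=2, dim=256, depth=4, heads=8):
        super().__init__()
        # CNN stem: local features, then 4x spatial downsampling
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.down1 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(128, dim, 3, stride=2, padding=1)
        # Transformer over the coarse grid: global attention among all patches
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # U-Net-style decoder: upsample, then fuse with the skip connection
        self.up1 = nn.ConvTranspose2d(dim, 128, 2, stride=2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.fuse = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                      # (B, 64, H, W) local features
        f = torch.relu(self.down1(s1))         # (B, 128, H/2, W/2)
        f = torch.relu(self.down2(f))          # (B, dim, H/4, W/4)
        B, C, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, dim) patch tokens
        tokens = self.transformer(tokens)      # global self-attention
        f = tokens.transpose(1, 2).reshape(B, C, H, W)
        f = torch.relu(self.up1(f))            # (B, 128, H/2, W/2)
        f = torch.relu(self.up2(f))            # (B, 64, H, W)
        f = self.fuse(torch.cat([f, s1], 1))   # skip connection from CNN stem
        return self.head(f)                    # per-pixel class logits

logits = TransUNetSketch()(torch.randn(1, 1, 64, 64))  # -> (1, 2, 64, 64)
```

The actual TransUNet uses a pretrained ResNet-ViT hybrid encoder and multiple skip connections in a cascaded upsampler; this sketch keeps only the core design choice, fusing global attention with U-Net's local comprehension.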
Invited talks at Johns Hopkins: Chemical and Biomolecular Engineering (ChemBE), Cognitive Science Brown Bag, Center for Language and Speech Processing (CLSP), Mathematical Institute for Data Science (MINDS), and the Artificial Intelligence for Engineering and Medicine Lab (AIEM).
Teaching
Instructor: I designed and teach the undergraduate course Machine Imagination (EN.601.208) at JHU, offered in 2025 and again in 2026 (starting Jan. 2026).
Shanshan Zhong (she/her), SYSU MS → CMU LTI PhD
Research: 1 first-author WACV publication (the 4D-Animal project).
Acknowledgement
My doctoral research was made possible through the generous support of ARL, IARPA, NSF, NIH, ONR, Lambda, NVIDIA, Johns Hopkins University, the Siebel Foundation, the Patrick J. McGovern Foundation, and the Lustgarten Foundation. I am deeply grateful for the resources provided to me and my advisors.