Person perception relies on visual cues from faces and bodies, but whether the brain processes these signals separately or jointly remains unclear. Here, we test the extent to which face and body representations emerge as segregated or integrated in deep neural network models, and how these representations map onto human visual cortex. We find that models optimized for visual recognition develop not only face- and body-selective units but also mixed-selective units that respond preferentially to both categories. Using fMRI encoding analyses, we show that mixed-selective units best explain responses in face- and body-selective cortical regions. Face- and body-selective units contribute unique variance within their corresponding regions but predominantly explain shared variance, which increases from posterior to anterior cortex. In models, we further demonstrate that mixed selectivity supports multiple person-perception tasks and corresponds to part-based processing of whole persons. Our findings reveal progressive integration of face and body representations along the visual hierarchy in both models and cortex.