The spatial concentration of human activity is a crucial indicator of socioeconomic vitality, and accurately mapping activity volumes is fundamental to supporting sustainable regional development. Current approaches rely on mobile positioning data, which record daily human activity but are inaccessible in most cities because of privacy and data-sharing concerns. Alternative methods are needed that generalize over extensive areas at low cost. This study demonstrates how remote sensing imagery can be used in an end-to-end deep learning framework to produce reliable estimates of human activity volumes. The neighbor effect, which reflects the inherent spatial autocorrelation of activity volumes, is incorporated to improve the network. The proposed model exhibits strong predictive power and shows that the physical environment explains much of the variation in activity volumes. Landscape interpretations based on hierarchical features provide both object-based and region-based insights into the coevolution of landscape and human activity. Our findings indicate that activity volumes can be predicted over extensive areas, especially where access to mobile data is limited, and support the framework as a promising way to understand broader aspects of human society from the observable physical environment.