Accurate and timely censuses provide detailed baseline information about a country, supporting resource allocation, disease control, disaster prevention, urban planning, and business management. However, traditional censuses consume considerable time, manpower, and financial resources. Population maps are created by national statistical institutes at the level of statistical units. Remote sensing imagery combined with end-to-end deep learning models makes it possible to estimate population over large areas at low cost. This study demonstrates the effectiveness of a local–global dual attention network (LGANet) for population estimation from remote sensing images. LGANet places a local attention branch and a global attention branch on top of the backbone to adaptively learn and integrate two kinds of discriminative features simultaneously. The outputs of the two attention modules are combined to improve the precision of population estimation. The method takes daytime remote sensing images as input, complemented by nighttime light data, and estimates population on 1 km grids. In an experimental comparison between the estimated and ground-truth populations on 1 km grids, our method achieves higher accuracy than other deep learning methods.
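The two-branch design described above can be illustrated with a minimal sketch. It is not the LGANet implementation: the abstract does not specify how either attention branch is computed, how the outputs are combined, or how the nighttime light data enters the network. The sketch assumes a spatial softmax reweighting for the local branch, a per-channel sigmoid gate for the global branch, element-wise summation for fusion, and nighttime light appended as an extra channel. All function names are hypothetical, and the feature map is a plain list of channels over flattened grid positions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def local_attention(features):
    """Assumed local branch: reweight each spatial position.

    features: list of C channels, each a list of N position values.
    """
    n_channels = len(features)
    n_pos = len(features[0])
    # Saliency of a position = mean response across channels.
    saliency = [sum(ch[i] for ch in features) / n_channels for i in range(n_pos)]
    weights = softmax(saliency)
    # Scale by n_pos so that uniform weights leave the features unchanged.
    return [[ch[i] * weights[i] * n_pos for i in range(n_pos)] for ch in features]


def global_attention(features):
    """Assumed global branch: gate each channel by its global average."""
    gates = [sigmoid(sum(ch) / len(ch)) for ch in features]
    return [[v * g for v in ch] for ch, g in zip(features, gates)]


def estimate_population(day_features, night_light):
    """Toy forward pass for one 1 km grid cell (illustrative only)."""
    # Assumption: nighttime light is appended as one extra channel.
    features = day_features + [night_light]
    local_out = local_attention(features)
    global_out = global_attention(features)
    # Assumption: the two branch outputs are fused by element-wise sum.
    fused = [[a + b for a, b in zip(lc, gc)]
             for lc, gc in zip(local_out, global_out)]
    # Toy regression head: global average, then softplus so the
    # estimate is never negative.
    pooled = sum(sum(ch) for ch in fused) / (len(fused) * len(fused[0]))
    return math.log1p(math.exp(pooled))
```

In a real model the backbone features, attention weights, and regression head would be learned; here they are fixed formulas, so the sketch only shows how the two branch outputs flow into a single population estimate.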