Building roofs are essential for geographical analyses such as solar potential analysis and urban microclimate simulation. Despite growing demand, reconstructing detailed 3D roofs remains challenging because of the complexity of roof geometries and the variation in architectural styles. This paper introduces RooFormer, an end-to-end learning framework for reconstructing detailed, textured 3D roof models in mesh format from high-resolution imagery. RooFormer consists of a MaskFormer branch, which identifies and focuses on roof features, and a MeshFormer branch, which predicts detailed roof meshes. In the MeshFormer branch, a local self-attention mechanism is employed to learn mesh features, and a positional embedding layer is designed to integrate geometric and texture features. In addition, to measure the geometric similarity between predicted meshes and the ground truth, we develop a loss function that integrates terms from both image and mesh spaces. Compared to existing 3D metrics, the proposed geometric loss term more accurately reflects the geometric differences between meshes. Experiments show that RooFormer achieves a normalized height error of 0.014, lower than the 0.034 of state-of-the-art methods. Visually, the reconstructions accurately capture the geometric contours and structures of roofs, even under slight occlusion. We also demonstrate the framework's generalization by testing it across various areas. The framework promises to enable richer building modeling and analysis for a wide range of digital city applications.