
Outdoor thermal comfort is a crucial determinant of urban space quality. While research has developed various heat indices, such as the Universal Thermal Climate Index (UTCI) and the Physiological Equivalent Temperature (PET), these metrics fail to fully capture perceived thermal comfort. Beyond environmental and physiological factors, recent research suggests that visual elements significantly drive outdoor thermal perception. This study integrates computer vision, explainable machine learning, and perceptual assessments to investigate how visual elements in streetscapes affect thermal perception. To provide a comprehensive representation of diverse visual elements, we employed multiple computer vision models (viz. Segment Anything Model, ResNet-50, and Vision Transformer) and applied the Maximum Clique method to systematically select 50 representative ground-level images, each paired with a corresponding thermal image captured simultaneously. An outdoor, web-based survey among 317 students collected thermal sensation votes (TSV), thermal comfort votes (TCV), and element preference data, yielding 2,854 valid responses. The same survey was replicated in an indoor exhibition setting to provide a comparative reference against the outdoor experiment. A Random Forest classifier achieved 70% and 68% accuracy in predicting thermal sensation and comfort, respectively. Using Shapley Additive Explanations to interpret model outcomes, we uncovered that the colour magenta emerged as the most influential visual factor for thermal perception, while greenery – despite being participants most preferred element for cooling – showed weaker correlation with actual thermal perception. These findings challenge conventional assumptions about visual thermal comfort and offer a novel framework for image-based thermal perception research, with important implications for climate-responsive urban design.