The visual landscape plays a pivotal role in urban planning and healthy cities. Recent studies of visual evaluation focus on either objective or subjective approach, while describing the visual character holistically and monitor its evolution remains challenging. This study introduces an embedding-driven clustering approach that integrates both physical and perceptual attributes to infer the spatial structure of the visual environment, and investigates its spatio-temporal evolution. Singapore, a highly urbanised yet green city, is selected as a case study. Firstly, a visual feature matrix is derived from street view imagery (SVI). Then, a graph neural network is constructed based on road connections to encode visual features and spatial dependency leading to a clustering algorithm that is used to discover the underlying characteristics of the visual environment. The implementation characterises streetscapes of the city-state into six types of clusters. Finally, taking advantage of historical SVI, a longitudinal analysis reveals how visual clusters have evolved in the past decade. Among them, one of the clusters represents high-density visual experience, affirming the work as such streetscape dominates the central business district and it is evolving elsewhere, mirroring the expansion of new towns. In turn, another identified cluster, indicating sparse landscapes, decreased, while areas that are considered to be in the most visually pleasant cluster, increased. For the first time, this study demonstrates a novel method to understand the urban visual structure and analyse its spatio-temporal evolution, which could support future planning decision-making and urban landscape betterment.