The availability of large-scale urban social media data creates new opportunities for studying cities. In this paper we propose a new direction for this research: a joint analysis of the geolocations of shared images and their content as determined by computer vision. To test our ideas, we use a dataset of 47,410 Instagram images shared in the city of St. Petersburg over one year. We show how a combination of semantic clustering, image recognition and geospatial analysis can detect important patterns related to both how people use a city and how they represent it in social media.