Geospatial Big Data Visualization with Python

In the previous post I discussed how to download bulding footprints data from Google and Microsoft that cover the whole buildings footprint in the world. The next thing that came later was how to visualize such the big data with the size of several Gigabyte or even larger in my small machine. For example I downloaded Microsoft building footprints for Indonesia region with the size of 20.7 GB. Loading such extensive data into QGIS didn't work and made it crashed. Finally I can visualize the whole buildings in Indonesia with Python as in the figure 1 which focuses on the subset for Java Island only.

Building Density Java-Indonesia
Figure 1. Building Density Java-Indonesia

If you find yourself facing challenges in viewing large geospatial data after downloading it, you're not alone. This post is tailored to address such situations. In this tutorial, I will guide you through the process of visualizing extensive data using Python. By the end of this tutorial, you should be able to  visualize your own large-scale geospatial data.

Importing Librabry

We start this tutorial by importing some libraries such as datashader, geopandas and colorcet. Datashader is a main library in this tutorial that visualize big data in three steps: Projection, aggregation and transformation. The ouput is a raster or image that visualize the aggregation of data into each pixel of the image. Transforming a large data into a raster format will extremely reduce the size of data that can be viewed much more faster.

Geopandas in a Python library that is used to handle spatial data in Python. Basically it is used for input/output spatial data, spatial processing and analysis. Lastly the colorcet library is used for colormap.

To import the libraries can be done using the following code.

import datashader as ds, datashader.transfer_functions as tf
import geopandas as gpd
import colorcet as cc

Reading Spatial Data

After importing all required libraries, let's get and read the data. For this tutorial I used Microsoft roads dataset. The road data was mined from Bing Maps imagery between 2020 and 2022 including Maxar and Airbus. The data cover all around the world with the total length of 48.9 million KM. It's free to download with Open Data Commons Open Database License (ODbL). For another sample dataset can be found from geoparquet website.

 To read the data, we can use geopandas read_parquet() function as below.


The data will be loaded as a dataframe. Therefore you can see the data rows quickly  as seen in the figure 2 with dataframe head() method. Furthermore you can also get another info about the data with info() or describe() method.


Figure 2. Roads Dataframe

Creating Canvas and Aggregate Data

Before rendering the data, we need to create a canvas first. The following code is used to create a canvas in 500 pixels wide and 400 pixels tall.

canvas = ds.Canvas(plot_width=500, plot_height=400)
agg = canvas.line(road_df,geometry='geometry', agg=ds.count())

In the next line, we're using the line() method of the canvas object to render lines from a DataFrame called road_df. Let's break down the parameters:

  • road_df: This is the DataFrame containing the data we want to render.
  • geometry='geometry': This parameter specifies the column in the dataframe that contains the geometrical information (e.g., coordinates) for the lines to draw.
  • agg=ds.count(): This parameter specifies how we want to aggregate the data. In this case, I used ds.count() to count occurrences of the data in the respected pixel.

So, the agg object will contain a representation of the lines from road_df aggregated onto the canvas, where each pixel represents the count of lines that overlap at that pixel. This aggregated representation can then be visualized using various plotting libraries.

Rendering The Data with Datashader

After aggregating the data, now let's visualize the data. For that we need to create a variable that contains the datashader transfer function with some defined variables as shown in the following code.

img = tf.shade(agg.where(agg>10),, how="eq_hist")

Let's me explain those variables:

  • agg.where(agg > 10): This filters the aggregated data (agg) to keep only the values where the count is greater than 10. This is a threshold to visualize only areas with a higher density of lines.
  • This parameter specifies the colormap to use for coloring the data. In this case, I used the 'fire' colormap from the Colorcet library. You can find the other colormap's name in the Colorcet documentation.
  • how="eq_hist": This parameter specifies the histogram equalization method for mapping the colors. Histogram equalization enhances the contrast in the image by stretching out the intensity range.

In the next line we're setting the background color of the image to black using the set_background() function from the transfer function module. 

After running the code, the image will look like as in figure 3.

South East Asia Roads
Figure 3. South East Asia Roads

Viewing Certain Part of the Data

Great! We have made a road visualization for South East Asia from a big dataset. What about if we want to view certain part of the data for example a specific region or country? For this purpose we can define x_range, and y_range parameters in the canvas method. These parameters consist of minimum and maximum coordinates. For example I want to view the road for Philippines with min,max longitude(x) 119.756,126.726 and min,max latitude(y) 5.099,19.069 as in the following code.    

canvas_ph = ds.Canvas(plot_width=300, plot_height=400,x_range=ph[0],y_range=ph[1])
agg = canvas_ph.line(road_df,geometry='geometry', agg=ds.count())
img = tf.shade(agg,, how="eq_hist")

After running the code it will return an image that shows the road for Philippines as in figure 4.

Philippines Roads
Figure 4. Philippines Roads

Playing More with Visualization

Let's play more with visualization. Let's say we want to highlight the most road density. For that I created ten classes gray color map as in line 4 in the code below. After that I added a 'red' color into the colormap list cause I want to highlight the last 10% of the data with the red color. The result can be seen in the figure 5.  

import numpy as np
canvas_ph = ds.Canvas(plot_width=300, plot_height=400,x_range=ph[0],y_range=ph[1])
agg = canvas_ph.line(road_df,geometry='geometry', agg=ds.count())
colormap = [(i,i,i) for i in np.linspace(0,255,9)]
colormap += ["red"]
img=tf.shade(agg, cmap = colormap, how='eq_hist')

Philippines Road. The most Density Highlight in Red
Figure 6. Philippines Road. The most Density Highlight in Red

Exporting Data Visualization to An Image

After creating the nice visualization, of course we want to save it to local disk for another purpose like sharing to other people, combine with another information or to make the visualization more attractive to user. To get the image output we can export it into an image.

To export the visualization result into an image can be used the following code

from functools import partial
from datashader.utils import export_image
export = partial(export_image, background = "black", export_path="export")
export(tf.shade(agg, cmap=colormap),"philippines roads")

In summary, the code sets up a function for exporting images with specified background color and export path, and then it exports a shaded image of aggregated data with the specified colormap under the name "philippines roads".

That's all tutorial on geospatial big data visualization with Python. Throughout this tutorial, we learned how to read big data, perform aggregation, and create visualizations using Datashader in Python. Hopefully this tutorial useful. Thank you.

Related Posts

Disqus Comments