BigQuery Geography Clustering

The BigQuery team rolled out support for geography type a while ago and they have never stopped improving performances and GIS (Geographic Information System) functions. This allows users to run complex geo-spatial analytics directly in BigQuery harnessing all its power, simplicity, and reliability.

Hold on your keyboard (or your screen if you are reading this on a mobile device).

Now you can cluster tables using a geography column. Say what!!!!

This is game changing for users working heavily with geodata. By clustering your table on a geography column, BigQuery can reduce the amount of data that needs to read to serve the query. This makes queries cheaper and run faster when filtering on clustering column.

Let’s see the benefits of clustering table using geography column with an example. We will use one of the great public datasets curated by Felipe Hoffa. We will use the weather_gsod specifically the two tables all and all_geoclustered.

Let’s say that we want the all-time minimum and maximum temperature within the Greater London area for each station. Our query will look like this:

select
name,
min(temp) as min_temp,
max(temp) as max_temp
from `fh-bigquery.weather_gsod.all`
where st_within(point_gis, st_geogfromtext(‘POLYGON((-0.5191087546253357 51.69057940987663,-0.5850267233753357 51.530249275252345,-0.5685472311878357 51.33504597185959,-0.19226549290658568 51.2285324250183,0.20049573756216432 51.271511154557565,0.3185987649059143 51.55074848767514,0.2856397805309143 51.64628942375177,0.04943372584341432 51.72461916884517,-0.5191087546253357 51.69057940987663))’))
group by 1

Results:

name                min_temp  max_temp
ST JAMES PARK 26.3 83.2
KENLEY AIRFIELD 18.0 82.8
HEATHROW 18.6 83.4
CITY 22.3 92.2
BLACKWALL 37.0 62.2
NORTHOLT 18.4 84.1
BIGGIN HILL 15.5 90.3
PURLEY OAKS 23.0 81.7
LEAVESDEN 22.3 89.1
LONDON WEA CENTER 20.0 85.2
KEW-IN-LONDON 23.3 82.4

This query reads 9.02GB of data. Now let’s see how the same query perform on the clustered table:

select
name,
min(temp) as min_temp,
max(temp) as max_temp
from `fh-bigquery.weather_gsod.all_geoclusterd`
where st_within(point_gis, st_geogfromtext(‘POLYGON((-0.5191087546253357 51.69057940987663,-0.5850267233753357 51.530249275252345,-0.5685472311878357 51.33504597185959,-0.19226549290658568 51.2285324250183,0.20049573756216432 51.271511154557565,0.3185987649059143 51.55074848767514,0.2856397805309143 51.64628942375177,0.04943372584341432 51.72461916884517,-0.5191087546253357 51.69057940987663))’))
group by 1

The result is obviously the same but, this time, BigQuery reads just 98.97MB of the data!!!

So switching to the geo clustered table made this query almost 100 times cheaper.

Do you want to try other locations and test the differences yourself? You can use Huq Industries GisMap tool to quickly draw and export polygons in various formats.

http://gismap.huq.io/

Author’s Github and Twitter.

Italian by birth but citizen of the world by choice Researched network measurement and security Opensource aficionado Juggle billions of events into the cloud

Italian by birth but citizen of the world by choice Researched network measurement and security Opensource aficionado Juggle billions of events into the cloud