Originally published at https://www.puccetti.io
At Huq Industries, we collect and process 30 billion events a month from more than 8 million devices. Every day, we enrich, slice, and then deliver data feeds in various forms to our clients, mainly via BigQuery, GCS, or S3.
In the past 2 years, we have seen our data ingestion grow 4x each year, and we forecast the same for the years to come. Our already large data history and its steady, high growth rate pose several challenges in ingestion, storage, processing, delivery, and beyond.
All these parts can be very critical…
How to use BigQuery superpowers to rewind time.
Who does not like time travel? We have all seen it in sci-fi movies and been fascinated by it. Unfortunately, science has not cracked real-life time travel yet.
However, we can “data time-travel”. Thanks to the amazing BigQuery SQL feature “FOR SYSTEM_TIME AS OF”, we can time travel (up to 7 days in the past) to a specific timestamp (for detailed information about the syntax refer to the official documentation).
This feature is extremely easy to use. In fact, we can simply modify any query by adding the…
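As a rough illustration of the clause described above, the following Python sketch builds a query string that reads a table as it was at a given moment. The table name and the helper `time_travel_query` are hypothetical, and the 7-day check simply mirrors the limit mentioned in the text; treat it as a sketch under those assumptions rather than the article's own code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The text above says time travel reaches back at most 7 days.
MAX_TIME_TRAVEL = timedelta(days=7)


def time_travel_query(table: str, as_of: datetime,
                      now: Optional[datetime] = None) -> str:
    """Return a query that reads `table` as it was at `as_of` (hypothetical helper)."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    now = now or datetime.now(timezone.utc)
    if as_of > now:
        raise ValueError("as_of must not be in the future")
    if now - as_of > MAX_TIME_TRAVEL:
        raise ValueError("as_of is outside the 7-day time travel window")
    # Render the moment as a UTC timestamp literal.
    ts = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"  FOR SYSTEM_TIME AS OF TIMESTAMP '{ts}+00:00'"
    )


if __name__ == "__main__":
    one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
    # `my_project.my_dataset.my_table` is a placeholder table name.
    print(time_travel_query("my_project.my_dataset.my_table", one_hour_ago))
```

The resulting string can be pasted into the BigQuery console or submitted through any client; checking the window up front gives a clearer error than a rejected query.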
Originally published at https://www.puccetti.io on May 10, 2020.
Do you use BigQuery? Are you interested in knowing how to integrate data from different cloud providers into BigQuery? In this blogpost, we will implement a serverless and fully managed system that makes S3 access logs available in BigQuery, so they can be easily integrated with other data sources and reporting systems. To achieve this, we will see how to set up AWS S3 access logs delivery and configure Google Data Transfer Service to schedule fully managed S3 to Cloud Storage transfers. We will also use a BigQuery external table to read data directly…
The BigQuery team rolled out support for the geography type a while ago, and they have never stopped improving performance and GIS (Geographic Information System) functions. This allows users to run complex geo-spatial analytics directly in BigQuery, harnessing all its power, simplicity, and reliability.
Hold on to your keyboard (or your screen if you are reading this on a mobile device).
Now you can cluster tables using a geography column. Say what!!!!
This is game-changing for users working heavily with geodata. By clustering your table on a geography column, BigQuery can reduce the amount of data that it needs to read to…
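To make the idea concrete, here is a minimal Python sketch that builds the DDL for a table clustered on a geography column, plus a radius filter of the kind such clustering is meant to help. The table name, the `event_id` and `geom` columns, and both helper functions are assumptions made up for illustration.

```python
def clustered_geo_table_ddl(table: str, geo_column: str = "geom") -> str:
    """DDL for a table clustered on a GEOGRAPHY column (hypothetical schema)."""
    return (
        f"CREATE TABLE `{table}` (\n"
        "  event_id STRING,\n"
        f"  {geo_column} GEOGRAPHY\n"
        ")\n"
        f"CLUSTER BY {geo_column}"
    )


def nearby_events_query(table: str, lon: float, lat: float,
                        radius_m: float, geo_column: str = "geom") -> str:
    """Select events within `radius_m` metres of a point."""
    # ST_GEOGPOINT takes longitude first, then latitude.
    return (
        "SELECT event_id\n"
        f"FROM `{table}`\n"
        f"WHERE ST_DWITHIN({geo_column}, ST_GEOGPOINT({lon}, {lat}), {radius_m})"
    )


if __name__ == "__main__":
    table = "my_project.my_dataset.geo_events"  # placeholder name
    print(clustered_geo_table_ddl(table))
    print(nearby_events_query(table, -0.1276, 51.5072, 500))
```

How much data a given filter actually skips depends on how the data is distributed, so it is worth comparing the bytes processed before and after clustering.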
In this blogpost, I will explain what partitioning and clustering features in BigQuery are and how to supercharge your query performance and reduce query costs.
Partitioning a table can make your queries run faster while spending less. Until December 2019, BigQuery supported table partitioning only on the date data type. Now, you can do it on integer ranges too. If you want to know more about partitioning your tables this way, check out this great blogpost by Guillaume Blaquiere.
Here, I will focus on date type partitioning. You can partition your data using 2 main strategies: on the one hand you…
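As a hedged sketch of date-type partitioning, the Python below builds the DDL for a table partitioned on a date column and a query that filters on that column so only one partition needs to be scanned. The table name, the `event_date` and `device_id` columns, and the helper functions are hypothetical; the `require_partition_filter` option is included as one possible safeguard, not as something the text prescribes.

```python
from datetime import date


def partitioned_table_ddl(table: str, date_column: str = "event_date") -> str:
    """DDL for a table partitioned on a DATE column (hypothetical schema)."""
    return (
        f"CREATE TABLE `{table}` (\n"
        f"  {date_column} DATE,\n"
        "  device_id STRING\n"
        ")\n"
        f"PARTITION BY {date_column}\n"
        # Reject queries that do not filter on the partitioning column.
        "OPTIONS (require_partition_filter = TRUE)"
    )


def single_day_query(table: str, day: date,
                     date_column: str = "event_date") -> str:
    """Count rows for one day, filtering on the partitioning column."""
    return (
        "SELECT COUNT(*) AS events\n"
        f"FROM `{table}`\n"
        f"WHERE {date_column} = DATE '{day.isoformat()}'"
    )


if __name__ == "__main__":
    table = "my_project.my_dataset.events"  # placeholder name
    print(partitioned_table_ddl(table))
    print(single_day_query(table, date(2020, 1, 15)))
```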
BigQuery supports the “*” wildcard to reference multiple tables or files. You can leverage this feature to load, extract, and query data across multiple sources, destinations, and tables. Let’s see what you can do with wildcards with some examples.
The first thing is definitely loading the data into BigQuery. If you deal with a very large amount of data, you will most likely have tens of thousands of files coming from a data pipeline that you want to load into BigQuery. Using wildcards, you can easily load data from different files into a single table.
bq load project_id:dataset_name.table_name gs://my_data/input/prefix/* ./schema.json
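The same wildcard idea can be sketched in Python. The first helper assembles the `bq load` command above as an argument list, and the second builds a query over a wildcard table filtered with `_TABLE_SUFFIX`. Both helpers, the `events_` table prefix, and the suffix values are assumptions for illustration only.

```python
import shlex
from typing import List


def bq_load_args(table: str, source_uri: str, schema_path: str) -> List[str]:
    """Argument list for the `bq load` command shown above."""
    return ["bq", "load", table, source_uri, schema_path]


def wildcard_table_query(table_prefix: str, first_suffix: str,
                         last_suffix: str) -> str:
    """Query every table that starts with `table_prefix`, limited by suffix."""
    return (
        "SELECT *\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{first_suffix}' AND '{last_suffix}'"
    )


if __name__ == "__main__":
    args = bq_load_args(
        "project_id:dataset_name.table_name",
        "gs://my_data/input/prefix/*",
        "./schema.json",
    )
    # shlex.join quotes the wildcard so a shell would pass it through unchanged.
    print(shlex.join(args))
    # `events_` plus date suffixes is a hypothetical table naming scheme.
    print(wildcard_table_query("my_project.my_dataset.events_",
                               "20200101", "20200131"))
```

Passing the list to `subprocess.run(args)` instead of a shell string means the `*` reaches `bq` untouched, with no risk of the local shell trying to expand it.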
Back in the early days of Huq, we were ingesting just a few million records per day into our geo-behavioural insights platform. Today that figure is in the hundreds of millions. During the period when our traffic was ramping up steeply, we quickly realised that our single high-spec bare metal server setup was not going to be enough for our analytics needs.
After all, what good is building a valuable data asset if you can’t get answers out? We wanted to find a way to retrieve answers in seconds, not days, and so we set ourselves a mission: find a solution…
Google Cloud Composer is a Google Cloud Platform product that helps you manage complex workflows with ease.
It is built on top of Apache Airflow and is a fully managed service that leverages other GCP products such as Cloud SQL, GCS, Kubernetes Engine, Stackdriver, and Identity-Aware Proxy.
You don’t have to worry about provisioning and DevOps, so you can focus on your core business logic and let Google take care of the rest. With Airflow it’s easy to create, schedule, and monitor pipelines that span both cloud and…
Italian by birth but citizen of the world by choice. Researched network measurement and security. Opensource aficionado. Juggle billions of events into the cloud.