Real-Time data delivery at scale with BigQuery

Originally published at

At Huq Industries, we collect and process 30 billions events a month from more than 8 million devices. Every day, we enrich, slice, and then deliver data feeds in various forms to our clients mainly via BigQuery, GCS, or S3.
In the past 2 years, we have seen our data ingestion growing 4x each year and we forecast the same for the next years to come. Our already large data history and its steady and high grow rate pose several challenges from ingestion to storage, from processing to delivery and others.
All these parts can be very critical. However, in this specific post, we will focus on the processing and delivery. We will see how we leveraged BigQuery Authorized Views to cut our storage and processing costs, to reduce our delivery time to zero and to streamline the whole system by removing many processes and dependencies.

The issue

The solution: Authorized Views Everywhere

What is an authorised view?

An authorised view is a SQL view that was authorised to read tables in a specific dataset. This means that users, who are allowed to access the view, do not need to get permission to access the underlined tables. It is the view itself that holds that permission and acts as a sort of middle-man.

How is this useful for us?

In our case, we were able to provide all our clients with a view that grants them access only to the data they are subscribing to. Each view can be different and customized according to each client’s needs but all of them use the same source tables. Thus, removing a lot of duplicated data and updating processes.

Now, let’s assume that we have 10s of clients who subscribed to different timeframes and/or countries to our data. Additionally, some of them needed customization of the table schema. As you can imagine, the ETL jobs and their scheduling become quite complicated and pricey. Clearly, this won’t be sustainable when scaling up to 100s or 1000s of clients.
Authorised views enabled us to optimize this issue and we can now easily serve any number of clients with custom data requirements with very little overhead. This can be done by simply creating a view and authorising it to the master dataset:

After we created this view and we granted users access to it, they still won’t be able to successfully run their queries. In fact, if a user does not have access to the underlying tables (i.e. table_a and table_b) and he/she runs a query using this view, that will result in a permission error.
However, by authorizing the view to access dataset_a and dataset_b that will enable users to successfully run their queries.
This can easily be done in the BigQuery explorer. To get to the "authorised views" menu go to source_project_iddataset_a ➝ "Share Dataset" ➝ "Authorised Views". Now, you can configure which views have access to this dataset using their project_id, dataset_id, and view_id. In our example: client_project, dataset, and view_name. Repeat this for all the dataset that the view needs to access, in our example we have 2 datasets in 2 different projects.

Bonus: views using views

The answer is no, at least not directly.

Of course you could authorise the final view to access all of these tables but, as you can imagine, this will result in a lot of dependencies, repetitions, and complex definitions that will be a nightmare to maintain.

How can we successfully use views in an authorised view?

The answer is authorised views. More precisely, we can use authorised views in another authorised view because the “inner” view will be authorised to access the resources creating a chain of authorisation. In this way, we can just simply manage the authorisation of each single view.

Stay tuned for an implementation of this using Terraform in the coming blogposts.

Follow or contact me on Twitter for more stories about data, infrastructure, and automation.

Italian by birth but citizen of the world by choice Researched network measurement and security Opensource aficionado Juggle billions of events into the cloud

Italian by birth but citizen of the world by choice Researched network measurement and security Opensource aficionado Juggle billions of events into the cloud