Using and Optimizing Your SQL Queries for Cloud MPP: BigQuery

Anurag Bisht
7 min read · Mar 2, 2023



SQL queries in BigQuery can be optimized in several ways to improve their performance and reduce the overall cost. Here are some ways to optimize SQL queries in BigQuery:

  1. Use partitioned tables: Partitioning tables in BigQuery can significantly improve query performance, especially for large datasets. Partitioning splits a large table into smaller parts, so BigQuery can scan only the relevant data. Say you have a table of customer orders that keeps growing over time. By partitioning the table by order date, you split it into parts based on the date range of the orders, which speeds up queries that only need to access orders from a certain date range.
CREATE TABLE mydataset.mytable
PARTITION BY DATE(order_date)
AS SELECT * FROM source_table

Here are three ways to do partitioning in BigQuery:

Date-based partitioning:

One common approach is to partition data by date or timestamp. This can be useful for tables that have a time component, such as transactional or event data. To partition a table by date, you can create a new table with a partitioned column that corresponds to the date field in your data. For example:

CREATE TABLE my_table
PARTITION BY DATE(timestamp)
AS
SELECT *
FROM my_source_table

This creates a new table my_table that is partitioned by the timestamp column. BigQuery automatically creates a new partition for each day, based on the date in the timestamp column.
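Conceptually, the pruning works like this sketch (plain Python with illustrative data, not BigQuery internals): rows live in per-day buckets, and a date-range query only opens the buckets that overlap the requested range.

```python
from datetime import date

# Simulated date-partitioned table: one bucket of rows per day,
# loosely mirroring how BigQuery stores a date-partitioned table.
partitions = {
    date(2022, 1, 1): [{"order_id": 1}, {"order_id": 2}],
    date(2022, 1, 2): [{"order_id": 3}],
    date(2022, 2, 1): [{"order_id": 4}],
}

def query_date_range(start, end):
    """Read only partitions whose date falls in [start, end)."""
    touched = [d for d in partitions if start <= d < end]
    rows = [row for d in touched for row in partitions[d]]
    return rows, len(touched)

rows, partitions_scanned = query_date_range(date(2022, 1, 1), date(2022, 2, 1))
# The two January partitions are read; the February one is pruned.
```

BigQuery does this pruning from partition metadata, so the skipped partitions contribute nothing to the bytes scanned.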

Integer-based partitioning:

You can also partition tables by integer values, such as a customer ID or product ID. This can be useful for tables that have a natural grouping based on a numeric identifier. To partition a table by integer, you can use the RANGE_BUCKET() function to assign each row to a specific partition based on the value of the integer column. For example:

CREATE TABLE my_table
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000))
AS
SELECT *
FROM my_source_table

This creates a new table my_table split into 100 in-range partitions based on the customer_id column. Each partition contains rows whose customer_id values fall within one 10,000-wide range; values outside 0–999,999 land in a special unpartitioned bucket.
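The bucketing rule of RANGE_BUCKET can be mimicked in a few lines of Python (an illustrative sketch; range_bucket here is my own helper, not a client-library function): it returns how many boundaries are less than or equal to the value, with 0 meaning the value falls below the first boundary.

```python
import bisect

def range_bucket(point, boundaries):
    """Mirror BigQuery's RANGE_BUCKET: count boundaries <= point.
    Returns 0 when the point falls below the first boundary."""
    return bisect.bisect_right(boundaries, point)

# Boundaries as produced by GENERATE_ARRAY(0, 1000000, 10000)
boundaries = list(range(0, 1000001, 10000))  # 101 boundary values

print(range_bucket(25000, boundaries))  # customer_id 25000 -> bucket 3
print(range_bucket(-5, boundaries))     # below the range -> bucket 0
```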

Ingestion-time partitioning:

Ingestion-time partitioning automatically partitions data based on the time it was ingested into the table. This can be useful for tables that are constantly receiving new data. In DDL, an ingestion-time partitioned table is declared with the _PARTITIONTIME or _PARTITIONDATE pseudocolumn rather than one of its own columns, and it cannot be created with CREATE TABLE … AS SELECT; you define the schema and then load data into the table. For example, with an illustrative two-column schema:

CREATE TABLE my_table (order_id INT64, order_amount NUMERIC)
PARTITION BY _PARTITIONDATE
OPTIONS(
partition_expiration_days=7
)

This creates a new table my_table that is partitioned based on the time data is ingested. Each partition corresponds to a specific ingestion day, and partitions older than seven days are automatically deleted to prevent the table from growing without bound. When creating tables through the API, the equivalent setting is the timePartitioning option.

These are just a few examples of how partitioning can be done in BigQuery. There are many other ways to partition tables based on specific criteria, and the optimal partitioning strategy depends on the nature of your data and your query patterns.

2. Use clustered tables: Clustering tables is another way to improve query performance. Clustering involves organizing data within a partition based on one or more columns. This can help BigQuery to group related data together, reducing the amount of data scanned and improving query performance.

Clustering tables in BigQuery groups related data together, which can improve query performance by reducing the amount of data scanned. For example, if you have a table of customer orders and you often query by customer_id and order_date, clustering the table by these columns can speed up those queries.

Here are some ways clustering can help improve query performance in BigQuery:

  1. Reduced Data Scanning: When data is clustered, BigQuery can skip over large chunks of data that are not relevant to the query, reducing the amount of data that needs to be scanned. This can significantly reduce the time it takes to complete a query.
  2. Reduced Costs: By scanning less data, clustered tables can reduce the cost of running queries in BigQuery, as you only pay for the data that is actually processed.
  3. Faster Aggregation: When querying large datasets with aggregation functions like SUM, AVG, or COUNT, clustering can improve query performance by reducing the amount of data that needs to be processed.
  4. Faster Joins: When joining two or more tables, clustering can improve query performance by ensuring that the data is physically sorted in a way that can optimize the join process.

To take advantage of clustering in BigQuery, you should carefully choose the columns on which to cluster, as this can have a significant impact on query performance. You should also consider the size of the dataset and the types of queries you plan to run, as clustering may not always be necessary or beneficial for smaller datasets or certain types of queries.

CREATE TABLE mydataset.mytable
CLUSTER BY customer_id, order_date
AS SELECT * FROM source_table

Let’s say you have a large e-commerce dataset with millions of rows, including a transactions table with the following columns:

  • transaction_id
  • user_id
  • timestamp
  • product_id
  • product_price
  • category

To improve query performance on this table, you could create a clustered table using the user_id and timestamp columns, like this:

CREATE TABLE transactions_clustered
CLUSTER BY user_id, timestamp
AS
SELECT *
FROM transactions

Now, when you run a query that filters on user_id and/or timestamp, BigQuery can skip reading large portions of the data that do not match the filter criteria. For example, if you want to calculate the total revenue per user for a specific date range, you could run a query like this:

SELECT user_id, SUM(product_price) AS total_revenue
FROM transactions_clustered
WHERE timestamp >= '2022-01-01' AND timestamp < '2022-02-01'
GROUP BY user_id

Because the transactions_clustered table is sorted and stored based on user_id and timestamp, BigQuery can skip over most of the data that does not match the date range filter, resulting in faster query times and lower costs. This can be especially beneficial for large datasets with complex queries that would otherwise require scanning a significant amount of data.
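A rough mental model of the block skipping (an illustrative Python sketch, not BigQuery's actual storage format): rows sorted by the cluster key are stored in blocks, each carrying min/max metadata, and blocks whose range cannot match the filter are never read.

```python
# Rows sorted by the cluster key (e.g. user_id values 1..100),
# split into fixed-size storage blocks.
rows = sorted(range(1, 101))
BLOCK_SIZE = 10
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]
metadata = [(b[0], b[-1]) for b in blocks]  # (min, max) per block

def scan(lo, hi):
    """Return matching values and how many blocks were actually read."""
    read = 0
    out = []
    for (mn, mx), block in zip(metadata, blocks):
        if mx < lo or mn > hi:
            continue  # pruned via metadata; the block is never read
        read += 1
        out.extend(v for v in block if lo <= v <= hi)
    return out, read

values, blocks_read = scan(42, 58)
# Only 2 of the 10 blocks need to be read for this range filter.
```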

3. Use caching: BigQuery provides result caching that can speed up repeated queries. When a query is executed, BigQuery stores its results in a cached results table for roughly 24 hours. If the identical query runs again and the underlying data has not changed, BigQuery returns the cached result, which reduces execution time and incurs no charge.

SELECT COUNT(*) FROM mydataset.mytable WHERE order_date = '2022-01-01'

If you run this query again with exactly the same text and date parameter, and the table has not changed in the meantime, BigQuery serves the result from cache.
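The behavior is roughly that of a memo cache keyed on the query text with a 24-hour TTL. A simplified Python sketch (illustrative only, not the actual BigQuery mechanism; run_query and fake_execute are made-up names):

```python
import time

CACHE_TTL = 24 * 3600  # BigQuery keeps cached results for ~24 hours
_cache = {}            # query text -> (result, stored_at)

def run_query(sql, execute, now=None):
    """Return a cached result for identical query text, else execute.
    `execute` stands in for the actual (billed) query execution."""
    now = time.time() if now is None else now
    hit = _cache.get(sql)
    if hit and now - hit[1] < CACHE_TTL:
        return hit[0], True  # served from cache, nothing billed
    result = execute(sql)
    _cache[sql] = (result, now)
    return result, False

calls = []
def fake_execute(sql):
    calls.append(sql)  # track how often we actually "pay" for a run
    return 42

r1, cached1 = run_query("SELECT COUNT(*) FROM t", fake_execute, now=0)
r2, cached2 = run_query("SELECT COUNT(*) FROM t", fake_execute, now=100)
# The second call is a cache hit; fake_execute only ran once.
```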

4. Use appropriate join types: BigQuery supports several join types, including inner join, outer join, and cross join. Choosing the appropriate join type can significantly impact query performance. Inner join is usually the fastest and most efficient join type, while outer join and cross join can be slower.

SELECT a.*, b.*
FROM table_a a
INNER JOIN table_b b
ON a.id = b.id

5. Use appropriate data types: Choosing the appropriate data types can also impact query performance. Using smaller data types wherever possible can help to reduce the amount of data scanned and improve query performance.

CREATE TABLE mydataset.mytable
(order_id INT64, order_date DATE, customer_id INT64, order_amount NUMERIC)

In this example, using INT64 for order_id and customer_id and DATE for order_date can help to reduce the amount of data scanned.
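To see why types matter for cost, you can estimate bytes scanned per row from BigQuery's documented on-demand size accounting (INT64 and DATE count as 8 bytes each, NUMERIC as 16). A small sketch, with row_bytes as my own illustrative helper:

```python
# Approximate per-value sizes BigQuery uses when computing bytes
# processed for on-demand pricing (see its data size documentation).
TYPE_BYTES = {"INT64": 8, "FLOAT64": 8, "DATE": 8, "TIMESTAMP": 8,
              "NUMERIC": 16, "BOOL": 1}

def row_bytes(column_types):
    """Estimate scanned bytes per row for the columns a query reads."""
    return sum(TYPE_BYTES[t] for t in column_types)

# Columns from the example table above: order_id, order_date,
# customer_id, order_amount
full_row = row_bytes(["INT64", "DATE", "INT64", "NUMERIC"])  # 40 bytes
# Reading only order_id and order_date scans far less per row:
narrow = row_bytes(["INT64", "DATE"])                        # 16 bytes
```

Because BigQuery's storage is columnar, every column you avoid reading directly reduces the scanned (and billed) bytes.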

6. Optimize subqueries: Subqueries can be useful for complex queries, but they can also be a performance bottleneck. To optimize subqueries, avoid correlated subqueries, which are re-evaluated for each outer row, and use join operations wherever possible.

SELECT a.*
FROM table_a a
WHERE a.id IN (SELECT b.id FROM table_b b WHERE b.amount > 100)

In this example, using a join instead of the subquery can improve query performance (note that if table_b can contain duplicate id values, the join may return duplicate rows, unlike the IN form):

SELECT a.*
FROM table_a a
JOIN table_b b
ON a.id = b.id
WHERE b.amount > 100
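As a quick sanity check on the difference between the two forms, here is a small Python sketch with illustrative data: the IN subquery behaves like a semi-join (each matching row of table_a appears once), while the plain join repeats rows when the other side has duplicate keys.

```python
# Illustrative in-memory stand-ins for table_a and table_b.
table_a = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
table_b = [{"id": 1, "amount": 150}, {"id": 1, "amount": 300},
           {"id": 2, "amount": 50}]

# Ids of table_b rows with amount > 100
big_ids = {b["id"] for b in table_b if b["amount"] > 100}

# IN-subquery / semi-join behaviour: each table_a row at most once
semi = [a for a in table_a if a["id"] in big_ids]

# Plain inner join with the same predicate: one output row per match,
# so table_a rows are duplicated when table_b has repeated ids
joined = [a for a in table_a for b in table_b
          if a["id"] == b["id"] and b["amount"] > 100]
```

If duplicates would be a problem, add SELECT DISTINCT to the join form or keep the IN form.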

7. Use query optimization techniques: BigQuery provides several tools for tuning queries, such as the query execution plan shown in the console's Execution details, dry runs that estimate bytes processed before a query is billed, and query priority (interactive vs. batch). These can help to improve query performance and reduce cost.

Note that BigQuery's SQL dialect does not support an EXPLAIN keyword; instead, you can inspect the query plan in the Execution details tab after a run, or perform a dry run with the bq command-line tool to see how much data a query would scan before paying for it:

bq query --use_legacy_sql=false --dry_run \
'SELECT * FROM mydataset.mytable WHERE order_date >= "2022-01-01"'

8. Use efficient SQL code: Finally, writing efficient SQL code can also improve query performance. This includes selecting only the columns you need (BigQuery's storage is columnar, so every extra column adds scanned bytes), filtering on partitioning and clustering columns, and using simple and efficient SQL syntax.

-- Instead of this
SELECT *
FROM mydataset.mytable
WHERE order_date >= '2022-01-01'

-- Use this
SELECT order_id, order_date, customer_id
FROM mydataset.mytable
WHERE order_date >= '2022-01-01'

Using simple and efficient SQL syntax can improve query performance by reducing the amount of data scanned and the time it takes to process the query. Avoid using complex queries and nested subqueries whenever possible.

-- Instead of this query with a subquery
SELECT a.*
FROM table_a a
WHERE a.id IN (SELECT b.id FROM table_b b WHERE b.amount > 100)

-- Use this simpler query with a join instead
SELECT a.*
FROM table_a a
JOIN table_b b
ON a.id = b.id
WHERE b.amount > 100

9. Use WHERE clause instead of HAVING clause:

In SQL, the WHERE clause is used to filter rows before aggregation, while the HAVING clause is used to filter groups after aggregation. Since the WHERE clause is executed before the aggregation, it is generally faster than the HAVING clause. Therefore, try to use the WHERE clause wherever possible to filter rows before aggregation.

SELECT customer_id, SUM(order_amount) as total_amount
FROM mydataset.mytable
WHERE order_date >= '2022-01-01'
GROUP BY customer_id
HAVING total_amount > 1000

In this example, the row-level date filter is already applied in the WHERE clause before aggregation; only the condition on the aggregated total_amount needs to stay in HAVING. Any additional row-level filters, such as excluding zero-amount orders, also belong in the WHERE clause:

SELECT customer_id, SUM(order_amount) as total_amount
FROM mydataset.mytable
WHERE order_date >= '2022-01-01'
AND order_amount > 0
GROUP BY customer_id
HAVING total_amount > 1000
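The difference between the two clauses can be sketched in plain Python (illustrative data): the WHERE step discards rows before aggregation, so fewer rows reach the SUM, while the HAVING step filters the already-aggregated groups.

```python
from collections import defaultdict
from datetime import date

# Illustrative in-memory stand-in for the orders table.
orders = [
    {"customer_id": 1, "order_date": date(2022, 1, 5), "order_amount": 700},
    {"customer_id": 1, "order_date": date(2022, 1, 9), "order_amount": 600},
    {"customer_id": 2, "order_date": date(2021, 12, 30), "order_amount": 5000},
    {"customer_id": 3, "order_date": date(2022, 2, 1), "order_amount": 200},
]

# WHERE: row-level predicate applied first, before any aggregation
kept = [r for r in orders if r["order_date"] >= date(2022, 1, 1)]

# GROUP BY customer_id with SUM(order_amount)
totals = defaultdict(int)
for r in kept:
    totals[r["customer_id"]] += r["order_amount"]

# HAVING: group-level predicate applied after aggregation
result = {c: t for c, t in totals.items() if t > 1000}
```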

10. Use LIMIT clause:

If you only need a certain number of rows, the LIMIT clause restricts the number of rows returned by a query.

SELECT *
FROM mydataset.mytable
WHERE order_date >= '2022-01-01'
LIMIT 100

The LIMIT clause mainly helps when you want to cap the data sent back to the consuming application; it won't have any impact on the data your query scans before the limit is applied, since BigQuery bills by bytes read, not rows returned. Hence limiting the data is only partially helpful in optimizing query performance.

BigQuery also has some limitations of its own, but that is a topic for another day.


Anurag Bisht

Passionate certified cloud data engineer and architect (GCP, AWS). Interests: data, cloud architecture, and AI/ML. I sometimes share my interests through blogs.