BigQueryでSQLをDRY(Don't repeat yourself)に書いてみよう！

f:id:smartcamp:20190109165501p:plain

スマートキャンプ株式会社でデータエンジニアをしている瀧川です。

皆さん！分析SQLを書いていますか！？

弊社ではアプリケーションのログや、各種データをBigQueryに集約しており、諸々の分析をRedashでSQLを書くことで行っています。 Redashは良くも悪くも、SQLですべてを完結する必要があるので、もともと複雑な分析SQLがさらに長くなる...みたいなことありますよね。

そこで私が普段使っている、BigQueryでクエリをDRY(Don't repeat yourself)に書く方法を紹介したい思います！

※ 本記事ではクエリの可読性にフォーカスしているため、パフォーマンスに関しては考慮していません。

※ 記事内の例はSQLの書き方のみであり、内容に意味はありません。

参考 BigQuery公式Doc

with で結果を一時保存しよう
with とサブクエリで定数定義しよう
UDF(SQL)を使って共通処理を関数化しよう
まとめ

`with` で結果を一時保存しよう

1つ目はすでに常識かもしれませんが、with句を使ってクエリ実行時の一時的な view を作成することができます。

例えば、以下のクエリを見ていただくと、似たようなサブクエリが複数出てきており冗長な気がしますね...!

#standardSQL
select *
, (select count(1) from `master_table` join `sub_table` using (id))  as cnt
from (select * from `master_table` join `sub_table` using (id))
where id in (
  select id from `master_table` join `sub_table` using (id)
)

これをwithを使うと以下のようになります。

#standardSQL
with
joined_table as (
  select * from `master_table` join `sub_table` using (id)
)
select *
, (select count(1) from joined_table) as cnt
from `joined_table`
where id in (
  select id from `joined_table`
)

一度結果に別名をつけることができるため、意味も伝わりやすくなりますね。

私は以下のように、さらに小さなselectを作ってはviewにしてを繰り返して、最後に組み立てるような書き方をしたりします。

#standardSQL
with
joined_table as (
  select * from `master_table` join sub_table using (id)
),
count_joined_table as (
  select count(1) as value from `joined_table`
),
ids_joined_table as (
  select id from `joined_table`
)

select *
, (select value from `count_joined_table`) as cnt
from `joined_table`
where id in `ids_joined_table`

`with` とサブクエリで定数定義しよう

クエリ内で何度も使う定数(期間や対象ID)が存在する場合、クエリの修正や値を変更して実行みたいなこと大変になってきます。 (変更漏れでデータがおかしくなることもよくありますね...)

#standardSQL
select * from (
  select * from `master_table_*`
  where parse_date('%Y%m%d', _table_suffix) between date('2018-11-01') and date('2018-11-30')
)
join (
  select * from `sub_table`
  where created_at between date('2018-11-01') and date('2018-11-30')
    and user_id = 1000
)
using (id)

そこで、私は以下のように定数だけを管理する view を with で用意して、サブクエリで都度呼び出すようにしています。これによりparamsの値を変更するだけで、抜け漏れなく更新することができます。

#standardSQL
with
params as (
  select date('2018-11-01') as start_date,
    date('2018-11-30') as end_date,
    1000 as target_user_id
)
select * from (
  select * from `master_table_*`
  where (
      select parse_date('%Y%m%d', _table_suffix) between start_date and end_date 
      from `params`
  )
)
join (
  select * from `sub_table` as o
  where (
    select created_at between start_date and end_date 
      and o.user_id = i.target_user_id
    from `params` as i
  )
)
using (id)

※ BigQueryにはサブクエリ数に制限があったはずなので、使いすぎると実行できなくなるかも

UDF(SQL)を使って共通処理を関数化しよう

UDF(UserDefinedFunction)をご存知ですか？以下のDocumentを見ていただくと、SQLで書くの難しい処理をjavascript使って関数定義できる機能といったイメージかなと思います。

標準 SQL ユーザー定義関数 | BigQuery | Google Cloud

実はドキュメントをよく読むとSQLもUDF化できるので、何度も使っているSQLはUDFにして共通化しましょう！

#standardSQL
select regex_exact(r'email: (.+)', json_extract_scalar(hoge, '$.hogehoge')) as email,
  regex_exact(r'tel: (.+)', json_extract_scalar(hoge, '$.hogehoge')) as tel,
  case category
  when 1 then 'one'
  when 2 then 'two'
  else 'other'
  end as category_title
from `master_table`

#standardSQL
create temp function extract_hogehoge(regex string)
returns (
    regex_exact(regex, json_extract_scalar(hoge, '$.hogehoge'))
),
create temp function category2title(category float64)
returns (
  case category
  when 1 then 'one'
  when 2 then 'two'
  else 'other'
  end
);
select extract_hogehoge(r'email: (.+)') as email,
  extract_hogehoge(r'tel: (.+)') as tel,
  category2title(category) as category_title
from `master_table`