Creating Date-Partitioned Tables in BigQuery

Learn to Earn Data Challenge

with_AI 2022. 7. 3. 16:11

I worked through the third quest.

From this one on, I am going to work through and interpret each scenario step by step.

Create a new dataset

First, you will create a dataset to store your tables.

Click the three dots next to your Qwiklabs project ID and select Create dataset:

Name your dataset ecommerce. Leave the other options at their default values (Data Location, Default table Expiration).

Click Create dataset.

Click Check my progress to verify the objective.

Creating the ecommerce dataset is something I already did last time.

Creating tables with date partitions

A partitioned table is a table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and control costs by reducing the number of bytes read by a query.

Now you will create a new table and bind a date or timestamp column as a partition. Before we do that, let's explore the data in the non-partitioned table first.

Query webpage analytics for a sample of visitors in 2017

In the Query Editor, add the query below. Before running it, note the total amount of data it will process, shown next to the query validator icon: "This query will process 1.74 GB when run".

#standardSQL
SELECT DISTINCT
  fullVisitorId,
  date,
  city,
  pageTitle
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170708'
LIMIT 5

Query webpage analytics for a sample of visitors in 2018

Let's modify the query to look at visitors for 2018 now.

Click COMPOSE NEW QUERY to clear the Query Editor, then add this new query. Note the WHERE date parameter is changed to 20180708:

#standardSQL
SELECT DISTINCT
  fullVisitorId,
  date,
  city,
  pageTitle
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20180708'
LIMIT 5

The Query Validator will tell you how much data this query will process.

Click Run.

Notice that the query still processes 1.74 GB even though it returns 0 results. Why? The query engine has to scan every record in the table to check whether it satisfies the date condition in the WHERE clause, so it must compare each record's date against '20180708'.

Additionally, the LIMIT 5 does not reduce the total amount of data processed, which is a common misconception.
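
To make the point about LIMIT concrete, here is a minimal Python sketch of a full scan. It is a toy model, not how BigQuery executes queries internally: the rows, the per-row sizes, and the function name scan_unpartitioned are all invented for illustration.

```python
# Toy model of an unpartitioned table: (date string, row size in bytes).
# The rows and sizes are invented for illustration.
ROWS = [
    ("20170708", 100),
    ("20170709", 100),
    ("20160801", 100),
    ("20170708", 100),
]


def scan_unpartitioned(rows, wanted_date, limit):
    """Return (up to `limit` matching rows, total bytes read)."""
    bytes_read = 0
    matches = []
    for date, size in rows:
        bytes_read += size  # each row is read before the filter can be evaluated
        if date == wanted_date:
            matches.append((date, size))
    # LIMIT trims the output only; the scan above has already happened.
    return matches[:limit], bytes_read


rows_2017, bytes_2017 = scan_unpartitioned(ROWS, "20170708", limit=5)
rows_2018, bytes_2018 = scan_unpartitioned(ROWS, "20180708", limit=5)
print(len(rows_2017), bytes_2017)
print(len(rows_2018), bytes_2018)
```

In this sketch both calls read the same number of bytes, even though the 2018 filter matches nothing, which mirrors what the query validator reports for the two queries above.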

Common use-cases for date-partitioned tables

Scanning through the entire dataset every time to compare rows against a WHERE condition is wasteful. This is especially true if you only really care about records for a specific period of time like:

All transactions for the last year
All visitor interactions within the last 7 days
All products sold in the last month

Instead of scanning the entire dataset and filtering on a date field like we did in the earlier queries, we will now set up a date-partitioned table. This will allow us to skip entire partitions when they are irrelevant to our query.
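
The idea of skipping partitions can be sketched in Python by grouping rows under their date before querying. This is only an illustration under invented data; the names build_partitions and scan_partitioned are hypothetical, and BigQuery's real storage layout is not modeled.

```python
from collections import defaultdict

# Invented sample rows: (date string, row size in bytes).
ROWS = [
    ("2017-07-08", 100),
    ("2017-07-09", 100),
    ("2016-08-01", 100),
    ("2017-07-08", 100),
]


def build_partitions(rows):
    """Group row sizes by date, one bucket per day."""
    partitions = defaultdict(list)
    for date, size in rows:
        partitions[date].append(size)
    return partitions


def scan_partitioned(partitions, wanted_date):
    """Return (row count, bytes read), touching only the matching partition."""
    sizes = partitions.get(wanted_date, [])  # other partitions are never read
    return len(sizes), sum(sizes)


PARTITIONS = build_partitions(ROWS)
print(scan_partitioned(PARTITIONS, "2017-07-08"))
print(scan_partitioned(PARTITIONS, "2018-07-08"))
```

A date with no partition costs nothing to query in this model, because there is no bucket to read at all.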

Create a new partitioned table based on date

Click COMPOSE NEW QUERY and add the below query, then Run:

#standardSQL
 CREATE OR REPLACE TABLE ecommerce.partition_by_day
 PARTITION BY date_formatted
 OPTIONS(
   description="a table partitioned by date"
 ) AS
 SELECT DISTINCT
 PARSE_DATE("%Y%m%d", date) AS date_formatted,
 fullvisitorId
 FROM `data-to-insights.ecommerce.all_sessions_raw`

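In this query, note the new option: the PARTITION BY field. The lab lists DATE and TIMESTAMP as the two column types available for partitioning (BigQuery also supports others, such as DATETIME columns and integer ranges). The PARSE_DATE function is applied to the date field, which is stored as a string, to convert it into the proper DATE type for partitioning.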
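
For comparison, the same string-to-date conversion that PARSE_DATE("%Y%m%d", date) performs can be sketched in Python with the standard library. The helper name parse_yyyymmdd is made up for this example.

```python
from datetime import date, datetime


def parse_yyyymmdd(value: str) -> date:
    """Convert a string such as '20170708' into a date object."""
    # strptime raises ValueError if the string does not match the format.
    return datetime.strptime(value, "%Y%m%d").date()


parsed = parse_yyyymmdd("20170708")
print(parsed.isoformat())
```

The format string uses the same %Y%m%d pattern as the SQL, and the result prints in the YYYY-MM-DD form used when filtering on date_formatted.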

View data processed with a partitioned table

Run the below query, and note the total bytes to be processed:

#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.partition_by_day`
WHERE date_formatted = '2016-08-01'

This time only 25 KB (0.025 MB) is processed, a small fraction of the 1.74 GB that the unpartitioned queries scanned.
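
As a rough sense of scale, the 25 KB partition scan can be compared with the 1.74 GB full scan from the earlier queries. The arithmetic below assumes decimal units (1 KB = 1,000 bytes), which is my assumption, and the comparison is only approximate because partition_by_day also holds fewer columns and deduplicated rows.

```python
# Sizes reported by the query validator, converted with decimal units (assumed).
FULL_SCAN_BYTES = 1_740_000_000  # 1.74 GB
PARTITION_SCAN_BYTES = 25_000    # 25 KB


def scan_reduction(full_bytes, pruned_bytes):
    """Return how many times smaller the pruned scan is (integer division)."""
    return full_bytes // pruned_bytes


ratio = scan_reduction(FULL_SCAN_BYTES, PARTITION_SCAN_BYTES)
print(ratio)
```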

Now run the below query, and note the total bytes to be processed:

#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.partition_by_day`
WHERE date_formatted = '2018-07-08'

Creating an auto-expiring partitioned table

Auto-expiring partitioned tables are used to comply with data privacy statutes, and can be used to avoid unnecessary storage (which you'll be charged for in a production environment). If you want to create a rolling window of data, add an expiration date so the partition disappears after you're finished using it.
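
A rolling window like this can be sketched in Python as a filter on partition dates. The function name live_partitions and the sample dates are invented, and the exact moment at which BigQuery drops an expired partition is not modeled: the sketch simply keeps partitions whose age is at most the expiration period.

```python
from datetime import date, timedelta


def live_partitions(partition_dates, today, expiration_days):
    """Keep partitions that are at most `expiration_days` old."""
    cutoff = today - timedelta(days=expiration_days)
    return [d for d in partition_dates if d >= cutoff]


# Invented example: four daily partitions checked on 2022-07-03.
TODAY = date(2022, 7, 3)
PARTITION_DATES = [
    date(2022, 7, 1),
    date(2022, 5, 4),  # exactly 60 days old
    date(2022, 5, 3),  # 61 days old
    date(2021, 1, 1),
]

remaining = live_partitions(PARTITION_DATES, TODAY, expiration_days=60)
print([d.isoformat() for d in remaining])
```

Each day the cutoff moves forward by one day, so old partitions fall out of the window without anyone deleting them by hand.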

Explore the available NOAA weather data tables

In the left menu, in Explorer, click on Add Data and select Explore public datasets.

Search for "GSOD NOAA" then select the dataset.

Click on View Dataset.

Scroll through the tables in the noaa_gsod dataset (which are manually sharded and not partitioned).

Your goal is to create a table that:

Queries on weather data from 2018 onward
Filters to only include days that have had some precipitation (rain, snow, etc.)
Only stores each partition of data for 60 days from that partition's date (rolling window)

First, copy and paste the query below:

#standardSQL
 SELECT
   DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
   (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
    WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
   prcp
 FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
 WHERE prcp < 99.9  -- Filter unknown values
   AND prcp > 0      -- Filter stations/days with no precipitation
   AND _TABLE_SUFFIX >= '2018'
 ORDER BY date DESC -- Where has it rained/snowed recently
 LIMIT 10

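Note that the table wildcard * in the FROM clause makes the query span all the gsod tables, and the _TABLE_SUFFIX filter then limits which of those tables are actually read. Although LIMIT 10 was added, it does not reduce the total amount of data scanned (about 1.83 GB), because there are no partitions yet. Click Run. Confirm that the date is properly formatted and that the precipitation field shows non-zero values.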
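
The way a wildcard and a _TABLE_SUFFIX filter select sharded tables can be sketched in Python as a prefix match followed by a string comparison on the remainder of the name. The table list and the function name tables_to_scan are invented for illustration.

```python
def tables_to_scan(table_names, prefix, min_suffix):
    """Pick tables matching `prefix*` whose suffix is >= `min_suffix`."""
    selected = []
    for name in table_names:
        if not name.startswith(prefix):
            continue  # not matched by the wildcard at all
        suffix = name[len(prefix):]
        if suffix >= min_suffix:  # string comparison, like _TABLE_SUFFIX >= '2018'
            selected.append(name)
    return selected


# Invented list of yearly shards plus one unrelated table.
TABLES = ["gsod2016", "gsod2017", "gsod2018", "gsod2019", "stations"]
chosen = tables_to_scan(TABLES, prefix="gsod", min_suffix="2018")
print(chosen)
```

The comparison works on strings rather than numbers, which is fine here because the year suffixes all have four digits.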

Your turn: Create a Partitioned Table

Modify the previous query to create a table with the below specifications:

Table name: ecommerce.days_with_rain
Use the date field as your PARTITION BY
For OPTIONS, specify partition_expiration_days = 60
Add the table description = "weather stations with precipitation, partitioned by day"

#standardSQL
 CREATE OR REPLACE TABLE ecommerce.days_with_rain
 PARTITION BY date
 OPTIONS (
   partition_expiration_days=60,
   description="weather stations with precipitation, partitioned by day"
 ) AS
 SELECT
   DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
   (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
    WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
   prcp
 FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
 WHERE prcp < 99.9  -- Filter unknown values
   AND prcp > 0      -- Filter stations/days with no precipitation
   AND _TABLE_SUFFIX >= '2018'

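Confirm data partition expiration is working

To confirm that you are only storing data from the past 60 days up to today, run a DATE_DIFF query to get the age of your partitions, which are set to expire after 60 days. Below is a query that tracks the average rainfall for the NOAA weather station in Wakayama, Japan, which has significant precipitation.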

Confirm the oldest partition_age is at or below 60 days

The ORDER BY clause below sorts by partition_age in descending order so that the oldest partitions appear first. Add this query and run it:

#standardSQL
# avg monthly precipitation
SELECT
  AVG(prcp) AS average,
  station_name,
  date,
  CURRENT_DATE() AS today,
  DATE_DIFF(CURRENT_DATE(), date, DAY) AS partition_age,
  EXTRACT(MONTH FROM date) AS month
FROM ecommerce.days_with_rain
WHERE station_name = 'WAKAYAMA' #Japan
GROUP BY station_name, date, today, month, partition_age
ORDER BY partition_age DESC
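
The partition_age check can also be sketched in Python: compute the day difference the way DATE_DIFF(CURRENT_DATE(), date, DAY) does, sort the oldest first, and confirm nothing exceeds the expiration period. The function names and sample dates are invented, and a fixed "today" is used so the example is repeatable.

```python
from datetime import date


def partition_age(today, partition_date):
    """Whole days between the partition date and today."""
    return (today - partition_date).days


def oldest_first(partition_dates, today):
    """Return (age, date) pairs sorted with the oldest partition first."""
    pairs = [(partition_age(today, d), d) for d in partition_dates]
    return sorted(pairs, reverse=True)


# Invented partition dates, checked against a fixed "today".
TODAY = date(2022, 7, 3)
DATES = [date(2022, 6, 30), date(2022, 5, 4), date(2022, 6, 1)]

ages = oldest_first(DATES, TODAY)
oldest_age = ages[0][0]
print(oldest_age, oldest_age <= 60)
```

If the oldest age in the real query result is above 60, the expiration option was probably not applied when the table was created.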
