Learn to Earn Data Challenge

Creating a Data Warehouse Through Joins and Unions

with_AI 2022. 7. 3. 14:41

아직은 GCP 뱃지가 없지만, 이번 챌린지를 계속 진행하려면

 

  1. Pre-requisite: Earn the Build and Optimize Data Warehouses with BigQuery skill badge. Already earned it? Great! Don't have it yet? Be sure to earn it by July 19, 2022.

7월19 일까지  Earn the Build and Optimize Data Warehouses with BigQuery skill badge 를 얻어야 한다.

 

첫번째 과제를 끝내면

이런식으로 진행중인 퀘스트를 볼 수 있게 된다.

 

 

이렇게 많은 LAB을 도전해서 통과해야 뱃지를 받을 수 있다.

 

오늘 7/3일 일요일 안으로 이 퀘스트를 모두 완료해 보려고 한다.

 

이번 포스팅에서는 2번째 과정인

 

Creating a Data Warehouse Through Joins and Unions

 

과정에 대해서 공부하려고 한다.

 

이번에는 CLI 환경이 아닌 GUI 환경에서 빅쿼리를 하는 과제다.

 

Open the BigQuery console

  1. In the Google Cloud Console, select Navigation menu > BigQuery.

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.

  1. Click Done.

The BigQuery console opens.

 

Create a new dataset to store your tables

First, create a new dataset titled ecommerce in BigQuery.

  1. In the left pane, click on the name of your BigQuery project (qwiklabs-gcp-xxxx).
  2. Click on the three dots next to your project name, then select CREATE DATASET.

The Create dataset dialog opens.

  1. Set the Dataset ID to ecommerce, leave all other options at their default values.

Click Create dataset.

 

이커머스라는 데이터셋을 빅쿼리에서 생성한다.

모든 옵션들은 default로 설정한다.

 

  1. Click on the Disable Editor Tabs link to enable the Query Editor.

Scenario: Your marketing team provided you and your data science team all of the product reviews for your ecommerce website. You partner with them to create a data warehouse in BigQuery which joins together data from three sources:

  • Website ecommerce data
  • Product inventory stock levels and lead times
  • Product review sentiment analysis

In this lab, you examine a new dataset based on product reviews.

 

시나리오: 귀하의 마케팅 팀은 귀하와 귀하의 데이터 과학 팀에서 귀하의 전자상거래 웹사이트에 대한 모든 제품 리뷰를 제공했습니다. 이들과 협력하여 BigQuery에서 세 가지 소스의 데이터를 결합하는 데이터 웨어하우스를 만듭니다.

 

Explore the product sentiment dataset

Your data science team has run all of your product reviews through the API and provided you with the average sentiment score and magnitude for each of your products.

First, create a copy the table that the data science team made so you can read it:

 

빅쿼리에서 이커머스라는 데이터셋을 만들었다.

 

이커머스 데이터를 가져와서 products라는 테이블로 가져온다.
미리보기를 통해 어떻게 record가 형성 됐는지 확인한다.

감정 점수 순으로 내림차순 정렬하기

 

반대로 감정 점수가 낮은 것을 내림차순으로 정렬하기

 

SELECT
  SKU,
  name,
  sentimentScore,
  sentimentMagnitude
FROM
  `data-to-insights.ecommerce.products`
WHERE sentimentScore IS NOT NULL
ORDER BY
  sentimentScore
LIMIT 5

 

Join datasets to find insights

Scenario It's the first of the month and your inventory team has informed you that the orderedQuantity field in the product inventory dataset is out of date. They need your help to query the total sales by product for 08/01/2017 and reference that against the current stock levels in inventory to see which products need to be resupplied first.

 

시나리오 매월 1일이며 재고 팀에서 제품 재고 데이터 세트의 ordersQuantity 필드가 오래되었다고 알려왔습니다. 그들은 2017년 8월 1일에 대한 제품별 총 매출을 쿼리하고 이를 재고의 현재 재고 수준과 비교하여 어떤 제품이 먼저 재공급되어야 하는지 확인하는 데 도움이 필요합니다.

Calculate daily sales volume by productSKU

Create a new table in your ecommerce dataset with the below requirements:

  • Title it sales_by_sku_20170801
  • Source the data from data-to-insights.ecommerce.all_sessions_raw
  • Include only distinct results
  • Return productSKU
  • Return the total quantity ordered (productQuantity). Hint: Use a SUM() with a IFNULL condition
  • Filter for only sales on 20170801
  • ORDER BY the SKUs with the most orders first

Possible Solution:

 

# pull what sold on 08/01/2017
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170801 AS
SELECT DISTINCT
  productSKU,
  SUM(IFNULL(productQuantity,0)) AS total_ordered
FROM
  `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
GROUP BY productSKU
ORDER BY total_ordered DESC #462 skus sold

 

Next, enrich your sales data with product inventory information by joining the two datasets.

Join sales data and inventory data

Using a JOIN, enrich the website ecommerce data with the following fields from the product inventory dataset:

  • name
  • stockLevel
  • restockingLeadTime
  • sentimentScore
  • sentimentMagnitude
# calculate ratio and filter
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude,
  SAFE_DIVIDE(website.total_ordered, inventory.stockLevel) AS ratio
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
# gone through more than 50% of inventory for the month
WHERE SAFE_DIVIDE(website.total_ordered,inventory.stockLevel) >= .50
ORDER BY total_ordered DESC

Create a new empty table to store sales by productSKU for 08/02/2017

For the schema, specify the following fields:

  • table name is ecommerce.sales_by_sku_20170802
  • productSKU STRING
  • total_ordered as an INT64 field
  •  
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170802
(
productSKU STRING,
total_ordered INT64
);

 

INSERT INTO ecommerce.sales_by_sku_20170802
(productSKU, total_ordered)
VALUES('GGOEGHPA002910', 101)

Append together historical data

There are multiple ways to append together data that has the same schema. Two common ways are using UNIONs and table wildcards.

  • Union is an SQL operator that appends together rows from different result sets.
  • Table wildcards enable you to query multiple tables using concise SQL statements. Wildcard tables are available only in standard SQL.
SELECT * FROM ecommerce.sales_by_sku_20170801
UNION ALL
SELECT * FROM ecommerce.sales_by_sku_20170802