Learn to Earn Data Challenge

Troubleshooting and Solving Data Join Pitfalls

with_AI 2022. 7. 3. 17:36

This is already my fourth quest.

I keep clearing quests to earn this badge, and the goal of this lab is troubleshooting and solving data join pitfalls.

 

Pin the lab project in BigQuery

Scenario: Your team provides you with a new dataset on the inventory stock levels for each of your products for sale on your ecommerce website. You want to become familiar with the products on the website and the fields you could use to potentially join on to other datasets.

The project with the new dataset is data-to-insights.

  1. Click Navigation menu > BigQuery.
  2. Click Done.
  3. BigQuery public datasets are not displayed by default in the BigQuery web UI. To open the public datasets project, copy data-to-insights.
  4. Click Add Data > Pin a project > Enter Project Name, then paste in the data-to-insights name. Click Pin.

The data-to-insights project is listed in the Explorer section.

 


Examine the fields

Next, get familiar with the products and fields on the website you can use to create queries to analyze the dataset.

In the left pane in the Resources section, navigate to data-to-insights > ecommerce > all_sessions_raw.

On the right, under the Query editor, click the Schema tab to see the Fields and information about each field.

 


Identify a key field in your ecommerce dataset

Examine the products and fields further. You want to become familiar with the products on the website and the fields you could use to potentially join on to other datasets.

Examine the Records

In this section you find how many product names and product SKUs are on your website and whether either one of those fields is unique.

Find how many product names and product SKUs are on the website. Copy and paste the query below into the BigQuery editor.

 


#standardSQL
# how many products are on the website?
SELECT DISTINCT
productSKU,
v2ProductName
FROM `data-to-insights.ecommerce.all_sessions_raw`

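But do the results mean that there are that many unique product SKUs? One of the first queries you will run as a data analyst is a check on the uniqueness of your data values. Clear the previous query and run the query below to list the distinct SKUs using DISTINCT.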

 

#standardSQL
# list the unique SKUs; the number of rows returned is the count
SELECT
DISTINCT
productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`

 

Examine relationship between SKU & Name

Let's determine which products have more than one SKU and which SKUs have more than one Product Name.

Clear the previous query and run the below query to determine if some product names have more than one SKU. Notice we use the STRING_AGG() function to aggregate all the product SKUs that are associated with one product name into comma separated values.

 


SELECT
  v2ProductName,
  COUNT(DISTINCT productSKU) AS SKU_count,
  STRING_AGG(DISTINCT productSKU LIMIT 5) AS SKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
  WHERE productSKU IS NOT NULL
  GROUP BY v2ProductName
  HAVING SKU_count > 1
  ORDER BY SKU_count DESC

 

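The ecommerce website catalog shows that each product name may have multiple options (size, color), each sold as a separate SKU. So one product can have as many as 12 SKUs. What about one SKU? Should it be allowed to belong to more than one product? Clear the previous query and run the query below to find out.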

 

SELECT
  productSKU,
  COUNT(DISTINCT v2ProductName) AS product_count,
  STRING_AGG(DISTINCT v2ProductName LIMIT 5) AS product_name
FROM `data-to-insights.ecommerce.all_sessions_raw`
  WHERE v2ProductName IS NOT NULL
  GROUP BY productSKU
  HAVING product_count > 1
  ORDER BY product_count DESC

 

Pitfall: non-unique key

In inventory tracking, a SKU is designed to uniquely identify one and only one product. For us, it will be the basis of the JOIN condition when we look up information from other tables. Having a non-unique key can cause serious data issues, as you will see.
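
A quick way to test whether a column is safe to use as a join key is to count the distinct names per SKU and keep only the offenders. The sketch below is not part of the lab: it assumes Python's built-in sqlite3 as a local stand-in for BigQuery, and the table contents and SKU values are made up for illustration.

```python
import sqlite3

# Toy stand-in for all_sessions_raw; every row here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website (productSKU TEXT, v2ProductName TEXT)")
conn.executemany(
    "INSERT INTO website VALUES (?, ?)",
    [
        ("SKU_A", "Product A"),
        ("SKU_A", "Product A (renamed)"),  # same SKU, different name
        ("SKU_A", "Product A"),            # repeated session row
        ("SKU_B", "Product B"),
    ],
)


def non_unique_skus(conn):
    """Return (sku, name_count) for SKUs that map to more than one name."""
    return conn.execute(
        """
        SELECT productSKU, COUNT(DISTINCT v2ProductName) AS name_count
        FROM website
        GROUP BY productSKU
        HAVING COUNT(DISTINCT v2ProductName) > 1
        ORDER BY productSKU
        """
    ).fetchall()


offenders = non_unique_skus(conn)
print(offenders)  # [('SKU_A', 2)]
```

An empty result would mean each SKU maps to a single name in this table; any row returned is a SKU that will fan out in a join.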

Write a query to identify all the product names for the SKU 'GGOEGPJC019099'.

Possible Solution:


SELECT DISTINCT
  v2ProductName,
  productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGPJC019099'

 


Join pitfall: Unintentional many-to-one SKU relationship

We now have two datasets: one for inventory stock level and the other for our website analytics. Let's JOIN the inventory dataset against your website product names and SKUs so you can have the inventory stock level associated with each product for sale on the website.

Clear the previous query and run the below query.

 

SELECT DISTINCT
  website.v2ProductName,
  website.productSKU,
  inventory.stockLevel
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
  WHERE productSKU = 'GGOEGPJC019099'

 

Next, let's expand our previous query to simply SUM the inventory available by product.

 

WITH inventory_per_sku AS (
  SELECT DISTINCT
    website.v2ProductName,
    website.productSKU,
    inventory.stockLevel
  FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
  JOIN `data-to-insights.ecommerce.products` AS inventory
    ON website.productSKU = inventory.SKU
    WHERE productSKU = 'GGOEGPJC019099'
)
SELECT
  productSKU,
  SUM(stockLevel) AS total_inventory
FROM inventory_per_sku
GROUP BY productSKU

 

Oh no! The total is 154 x 3 = 462, which triple counts the inventory: the SKU matched three product names, so its stock level of 154 was summed three times. This is called an unintentional cross join (a topic we'll revisit later).
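
The same arithmetic can be reproduced locally on a tiny example. This is only a sketch: it uses Python's sqlite3 instead of BigQuery, and the SKU and product names are invented. Only the stock level of 154 and the three names per SKU mirror the lab's result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT, v2ProductName TEXT);
    CREATE TABLE inventory (SKU TEXT, stockLevel INTEGER);
    -- one SKU listed under three names (hypothetical values)
    INSERT INTO website VALUES
      ('SKU_A', 'Name 1'), ('SKU_A', 'Name 2'), ('SKU_A', 'Name 3');
    INSERT INTO inventory VALUES ('SKU_A', 154);
    """
)

# The join copies the single inventory row onto every matching website row.
joined = conn.execute(
    """
    SELECT DISTINCT w.v2ProductName, w.productSKU, i.stockLevel
    FROM website AS w
    JOIN inventory AS i ON w.productSKU = i.SKU
    """
).fetchall()

# Summing over the joined rows counts the same stock once per name.
total_inventory = sum(stock for _, _, stock in joined)

print(len(joined), total_inventory)  # 3 462
```

DISTINCT does not help here because the three rows differ in v2ProductName, so all three survive and each carries the full stock level.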

 

Join pitfall solution: use distinct SKUs before joining

What are the options to solve your triple counting dilemma? First you need to only select distinct SKUs from the website before joining on other datasets.

We know that there can be more than one product name (like 7" Dog Frisbee) that can share a single SKU.

Let's gather all the possible names into an array:

 


SELECT
  productSKU,
  ARRAY_AGG(DISTINCT v2ProductName) AS push_all_names_into_array
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGAAX0098'
GROUP BY productSKU

 

Now instead of having a row for every Product Name, we only have a row for each unique SKU.

If you wanted to deduplicate the product names, you could even LIMIT the array like so:

 

SELECT
  productSKU,
  ARRAY_AGG(DISTINCT v2ProductName LIMIT 1) AS push_all_names_into_array
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGAAX0098'
GROUP BY productSKU
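
To see why collapsing the names first fixes the inflated total, here is a small local sketch. It assumes Python's sqlite3 rather than BigQuery, so GROUP_CONCAT stands in for ARRAY_AGG (SQLite has no array type), and all table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT, v2ProductName TEXT);
    CREATE TABLE inventory (SKU TEXT, stockLevel INTEGER);
    INSERT INTO website VALUES
      ('SKU_A', 'Name 1'), ('SKU_A', 'Name 2'), ('SKU_A', 'Name 3');
    INSERT INTO inventory VALUES ('SKU_A', 154);
    """
)

# Collapse the website side to one row per SKU before joining.
rows = conn.execute(
    """
    WITH names_per_sku AS (
        SELECT productSKU,
               GROUP_CONCAT(DISTINCT v2ProductName) AS names
        FROM website
        GROUP BY productSKU
    )
    SELECT n.productSKU, n.names, i.stockLevel
    FROM names_per_sku AS n
    JOIN inventory AS i ON n.productSKU = i.SKU
    """
).fetchall()

total_inventory = sum(stock for _, _, stock in rows)
print(len(rows), total_inventory)  # 1 154
```

Because the website side now has exactly one row per SKU, the join is one-to-one and the stock level is counted once.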

 

Join pitfall: Losing data records after a join

Now you're ready to join against your product inventory dataset again.

Clear the previous query and run the below query.

 

#standardSQL
SELECT DISTINCT
website.productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU

 

It seems we lost 819 SKUs after joining the datasets. Let's investigate by adding more specificity to the fields (one SKU column from each dataset):

 

#standardSQL
# pull ID fields from both tables
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
# IDs are present in both tables, how can we dig deeper?

 

It appears the SKUs are present in both of those datasets after the join for these 1,090 records. How can we find the missing records?

Join pitfall solution: Selecting the correct join type and filtering for NULL

The default JOIN type is an INNER JOIN, which returns records only if there is a SKU match in both the left and the right tables that are joined.

Rewrite the previous query to use a different join type to include all records from the website table, regardless of whether there is a match on a product inventory SKU record. Join type options: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN

 


#standardSQL
# the secret is in the JOIN type
# pull ID fields from both tables
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU

 

You have successfully used a LEFT JOIN to return all of the original 1,909 website SKUs in your results.

 

How many SKUs are missing from your product inventory set?

Write a query to filter on NULL values from the inventory table.

 

#standardSQL
# find product SKUs in website table but not in product inventory table
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE inventory.SKU IS NULL

Question: How many products are missing?

Answer: 819 products are missing (SKU IS NULL) from your product inventory dataset.
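
The pattern of a LEFT JOIN followed by an IS NULL filter on the right-hand key can be tried on a few rows to see what each join type keeps. A minimal sketch, assuming Python's sqlite3 as a local stand-in and made-up SKUs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT);
    CREATE TABLE inventory (SKU TEXT);
    INSERT INTO website VALUES ('SKU_A'), ('SKU_B'), ('SKU_C');
    INSERT INTO inventory VALUES ('SKU_A'), ('SKU_D');
    """
)

# INNER JOIN keeps only SKUs present on both sides.
matched = conn.execute(
    """
    SELECT DISTINCT w.productSKU
    FROM website AS w
    JOIN inventory AS i ON w.productSKU = i.SKU
    """
).fetchall()

# LEFT JOIN keeps every website SKU; unmatched rows get NULL for i.SKU,
# so filtering on IS NULL isolates SKUs missing from inventory.
missing = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT w.productSKU
        FROM website AS w
        LEFT JOIN inventory AS i ON w.productSKU = i.SKU
        WHERE i.SKU IS NULL
        ORDER BY w.productSKU
        """
    )
]

print(len(matched), missing)  # 1 ['SKU_B', 'SKU_C']
```

The matched and missing counts add up to the number of distinct website SKUs, which is the same check as 1,090 + 819 = 1,909 in the lab.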

Clear the previous query and run the below query to confirm using one of the specific SKUs from the website dataset.

 

#standardSQL
# you can even pick one and confirm
SELECT * FROM `data-to-insights.ecommerce.products`
WHERE SKU = 'GGOEGATJ060517'
# query returns zero results

Now, what about the reverse situation? Are there any products in the product inventory dataset but missing from the website?

 

#standardSQL
# reverse the join
# find records in inventory but not in website
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
RIGHT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL

 

Answer: Yes. There are two product SKUs missing from the website dataset.

Next, add more fields from the product inventory dataset for more details.

 

#standardSQL
# what are these products?
# add more fields in the SELECT STATEMENT
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.*
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
RIGHT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL

 

Possible answers:

  • One new product (no orders, no sentimentScore) and one product that is "in store only"
  • Another is a new product with 0 orders

Why would the new product not show up on your website dataset?

  • The website dataset is past order transactions by customers. Brand-new products that have never been sold won't show up in web analytics until they're viewed or purchased.

You typically will not see RIGHT JOINs in production queries. You would simply do a LEFT JOIN and switch the ordering of the tables.

What if you wanted one query that listed all products missing from either the website or inventory?

Write a query using a different join type.

 


#standardSQL
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
FULL JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL OR inventory.SKU IS NULL

LEFT JOIN + RIGHT JOIN = FULL JOIN, which returns all records from both tables regardless of matching join keys. You then filter for the rows where there is a mismatch on either side.
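
The "LEFT + RIGHT" idea can be made literal. The sketch below assumes Python's sqlite3 and invented SKUs. Rather than relying on FULL JOIN support in the local SQLite build, it combines two LEFT JOINs with the table order switched, which returns the same mismatches as the filtered FULL JOIN above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT);
    CREATE TABLE inventory (SKU TEXT);
    INSERT INTO website VALUES ('SKU_A'), ('SKU_B');
    INSERT INTO inventory VALUES ('SKU_A'), ('SKU_D');
    """
)

mismatches = conn.execute(
    """
    -- on the website but not in inventory
    SELECT w.productSKU AS website_SKU, i.SKU AS inventory_SKU
    FROM website AS w
    LEFT JOIN inventory AS i ON w.productSKU = i.SKU
    WHERE i.SKU IS NULL
    UNION ALL
    -- in inventory but not on the website (a RIGHT JOIN with tables swapped)
    SELECT w.productSKU, i.SKU
    FROM inventory AS i
    LEFT JOIN website AS w ON w.productSKU = i.SKU
    WHERE w.productSKU IS NULL
    """
).fetchall()

# Each row has a NULL (None) on the side where the SKU is missing.
print(sorted(mismatches, key=lambda r: (r[0] is None, r)))
```

SKU_A matches on both sides, so it is filtered out; SKU_B comes back with no inventory SKU and SKU_D with no website SKU.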

 

Join pitfall: Unintentional Cross Join

Not knowing the relationship between data table keys (1:1, 1:N, N:N) can return unexpected results and also significantly reduce query performance.

The last join type is the CROSS JOIN.

Create a new table with a site-wide discount percent that you want applied across products in the Clearance category.

 

#standardSQL
CREATE OR REPLACE TABLE ecommerce.site_wide_promotion AS
SELECT .05 AS discount;

 

In the left pane, site_wide_promotion is now listed in the Resource section under your project and dataset.

Clear the previous query and run the below query to find out how many products are in clearance.

 

SELECT DISTINCT
productSKU,
v2ProductCategory,
discount
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE v2ProductCategory LIKE '%Clearance%'

Note: For a CROSS JOIN you will notice there is no join condition (e.g. ON or USING). Every row of the first dataset is simply paired with every row of the second, so the single .05 discount is applied across all items.

Let's see the impact of unintentionally adding more than one record in the discount table.

 

INSERT INTO ecommerce.site_wide_promotion (discount)
VALUES (.04),
       (.03);

 

SELECT discount FROM ecommerce.site_wide_promotion

Note: This behavior isn't limited to cross joins. With a normal join you can unintentionally cross join when the data relationships are many-to-many, which can easily return millions or even billions of records.

The solution is to know your data relationships before you join and don't assume keys are unique.
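
The effect of the extra discount rows is easy to check on a small example. This is a sketch only, assuming Python's sqlite3 in place of BigQuery and three made-up clearance SKUs; the discount values mirror the ones inserted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clearance_products (productSKU TEXT);
    CREATE TABLE site_wide_promotion (discount REAL);
    INSERT INTO clearance_products VALUES ('SKU_A'), ('SKU_B'), ('SKU_C');
    INSERT INTO site_wide_promotion VALUES (0.05);
    """
)


def discounted_rows(conn):
    """Pair every clearance product with every promotion row."""
    return conn.execute(
        """
        SELECT p.productSKU, d.discount
        FROM clearance_products AS p
        CROSS JOIN site_wide_promotion AS d
        """
    ).fetchall()


rows_before = len(discounted_rows(conn))  # 3 products x 1 discount

# Two extra discount rows slip into the promotion table.
conn.execute("INSERT INTO site_wide_promotion VALUES (0.04), (0.03)")
rows_after = len(discounted_rows(conn))   # 3 products x 3 discounts

print(rows_before, rows_after)  # 3 9
```

The output size is the product of the two row counts, so confirming that the promotion table holds exactly one row before joining is a cheap guard.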

 
