Learn to Earn Data Challenge

Build and Optimize Data Warehouses with BigQuery: Challenge Lab

with_AI 2022. 7. 4. 05:13

마지막 8번째 퀘스트이다.

 

이 퀘스트만 클리어하면 뱃지를 받을 수 있다.

 

마지막 LAB인 만큼 챌린지라 커맨드도 내가 직접 짜야한다.

 

어김없이 빅쿼리에 들어온다

 

 

Challenge scenario

You are part of an international public health organization which is tasked with developing a machine learning model to predict the daily case count for countries during the Covid-19 pandemic. As a junior member of the Data Science team you've been assigned to use your data warehousing skills to develop a table containing the features for the machine learning model.

You are expected to have the skills and knowledge for these tasks, so don't expect step-by-step guides to be provided.

 

도전 시나리오 당신은 Covid-19 전염병 동안 국가의 일일 사례 수를 예측하는 기계 학습 모델을 개발하는 임무를 맡은 국제 공중 보건 기구의 일원입니다. 데이터 과학 팀의 주니어 멤버로서 데이터 웨어하우징 기술을 사용하여 기계 학습 모델의 기능이 포함된 테이블을 개발하도록 지정되었습니다. 이러한 작업에 대한 기술과 지식이 있어야 하므로 단계별 가이드가 제공될 것으로 기대하지 마십시오.

 

Your challenge

Your first step is to create a new dataset and table. The starting point for the machine learning model will be the oxford_policy_tracker table in the COVID 19 Government Response public dataset which contains details of different actions taken by governments to curb the spread of Covid-19 in their jurisdictions. Given the fact that there will be models based on a range of time periods, you are instructed to create a new dataset and then create a date partitioned version of the oxford_policy_tracker table in your newly created dataset, with an expiry time set to 180 days. You have also been instructed to exclude the United Kingdom ( alpha_3_code='GBR'), Brazil ( alpha_3_code='BRA'), Canada ( alpha_3_code='CAN') & the United States of America (alpha_3_code='USA) as these will be subject to more in-depth analysis through nation and state specific analysis.

Then, in terms of additional information that is required, you have been told to add columns for population, country_area and a record column (named mobility) that will take six input fields representing average mobility data from the last six columns of the mobility_report table from the Google COVID 19 Mobility public dataset.

A colleague working on an ancillary task has provided you with the SQL they used for updating the daily new case data in a similar data partitioned table through a JOIN with the covid_19_geographic_distribution_worldwide table from the European Center for Disease Control COVID 19 public dataset. This is a useful table that contains a range of data, including recent national population data, that you should use to populate the population column in your table.

 

 

당신의 도전 첫 번째 단계는 새 데이터세트와 테이블을 만드는 것입니다. 머신 러닝 모델의 출발점은 관할 구역에서 Covid-19의 확산을 억제하기 위해 정부가 취한 다양한 조치에 대한 세부 정보가 포함된 COVID 19 Government Response 공개 데이터 세트의 oxford_policy_tracker 테이블이 될 것입니다. 기간 범위를 기반으로 하는 모델이 있다는 사실을 감안할 때 새 데이터 세트를 생성한 다음 만료 시간을 180일로 설정하여 새로 생성된 데이터 세트에 날짜 분할 버전의 oxford_policy_tracker 테이블을 생성하라는 지시를 받습니다. 또한 영국( alpha_3_code='GBR'), 브라질( alpha_3_code='BRA'), 캐나다( alpha_3_code='CAN') 및 미국(alpha_3_code='USA)을 제외하라는 지시를 받았습니다. 국가별, 국가별 분석을 통해 보다 심층적인 분석을 받을 수 있습니다. 그런 다음 필요한 추가 정보와 관련하여 population, country_area 및 mobile_report 테이블의 마지막 6개 열에서 평균 이동성 데이터를 나타내는 6개의 입력 필드를 취하는 레코드 열(mobility)에 대한 열을 추가하라는 지시를 받았습니다. Google COVID 19 Mobility 공개 데이터세트. 보조 작업을 하는 동료가 유럽 질병 통제 센터 COVID 19 공개 데이터 세트의 covid_19_geographic_distribution_worldwide 테이블과 JOIN을 통해 유사한 데이터 분할 테이블에서 매일 새로운 사례 데이터를 업데이트하는 데 사용한 SQL을 제공했습니다. 이것은 최근 전국 인구 데이터를 포함하여 테이블의 인구 열을 채우는 데 사용해야 하는 데이터 범위가 포함된 유용한 테이블입니다.

 

UPDATE
    covid.oxford_policy_tracker t0
SET
    t0.population = t2.pop_data_2019
FROM
    (SELECT DISTINCT country_territory_code, pop_data_2019 FROM `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`) AS t2
WHERE t0.alpha_3_code = t2.country_territory_code;

 

The above template updates a daily new case column so you must modify it before you can use it to populate the population data from the European Center for Disease Control COVID 19 public dataset but the final query will be very similar.

In addition to population date you must also add in country area data to your table. The data for geographic country areas can be found in the country_names_area table from the Census Bureau International public dataset.

The last data ingestion task requires you to extract average values for the the six component fields that comprise the mobility record data from the mobility_report table from the Google COVID 19 Mobility public dataset . You need to be aware that the mobility information might be broken down by sub-regions for some countries so there may be more than one daily record for each country. However the machine learning model you are working on will only operate on a country level so you must extract a daily average for these mobility fields that aggregates all daily records for each country into a single average for each mobility record element.

In order to ensure you are aligned with the rest of the team the following column names and data types have been specified that you must use when updating the schema for your table:

 

위의 템플릿은 매일 새로운 사례 열을 업데이트하므로 유럽 질병 통제 센터 COVID 19 공개 데이터 세트의 인구 데이터를 채우는 데 사용하기 전에 수정해야 하지만 최종 쿼리는 매우 유사합니다. 인구 날짜 외에도 테이블에 국가 영역 데이터도 추가해야 합니다. 지리적 국가 영역에 대한 데이터는 Census Bureau International 공개 데이터 세트의 country_names_area 테이블에서 찾을 수 있습니다. 마지막 데이터 수집 작업에서는 Google COVID 19 Mobility 공개 데이터 세트의 mobility_report 테이블에서 이동성 레코드 데이터를 구성하는 6개의 구성 요소 필드에 대한 평균 값을 추출해야 합니다. 일부 국가의 경우 이동 정보가 하위 지역별로 분류될 수 있으므로 각 국가에 대해 둘 이상의 일일 기록이 있을 수 있다는 점에 유의해야 합니다. 그러나 작업 중인 기계 학습 모델은 국가 수준에서만 작동하므로 각 국가의 모든 일일 레코드를 각 이동성 레코드 요소에 대한 단일 평균으로 집계하는 이러한 이동성 필드에 대한 일일 평균을 추출해야 합니다. 나머지 팀과 정렬되도록 하기 위해 테이블의 스키마를 업데이트할 때 사용해야 하는 다음 열 이름과 데이터 유형이 지정되었습니다.

 

New Column Name          SQL Data Type
population               INTEGER
country_area             FLOAT
mobility                 RECORD
mobility.avg_retail      FLOAT
mobility.avg_grocery     FLOAT
mobility.avg_parks       FLOAT
mobility.avg_transit     FLOAT
mobility.avg_workplace   FLOAT
mobility.avg_residential FLOAT

 

Your coworker has also given you a SQL snippet that is currently being used to analyze trends in the Google Mobility data daily mobility patterns. You should be able to use this as part of the query that will add the daily country data for the mobility record in your table.

 

동료는 또한 현재 Google Mobility 데이터 일일 이동성 패턴의 추세를 분석하는 데 사용되는 SQL 스니펫을 제공했습니다. 테이블의 이동성 레코드에 대한 일일 국가 데이터를 추가하는 쿼리의 일부로 이것을 사용할 수 있어야 합니다.

 

 SELECT country_region, date,
      AVG(retail_and_recreation_percent_change_from_baseline) as avg_retail,
      AVG(grocery_and_pharmacy_percent_change_from_baseline)  as avg_grocery,
      AVG(parks_percent_change_from_baseline) as avg_parks,
      AVG(transit_stations_percent_change_from_baseline) as avg_transit,
      AVG( workplaces_percent_change_from_baseline ) as avg_workplace,
      AVG( residential_percent_change_from_baseline)  as avg_residential
      FROM `bigquery-public-data.covid19_google_mobility.mobility_report`
      GROUP BY country_region, date

 

When performing the JOINs between these various tables, you will need to use either the alpha_3_code column which is the 3 letter country code, or the country_name column in your table which contains the full official country name. The corresponding column names in the secondary data tables may be different.

For your final task you must identify data issues that will need to be resolved by another member of your team. Once you have your columns populated, please run a query that returns a combined list of the DISTINCT countries that do not have any population data and countries that do not have country area information, ordered by country name. If a country has neither population nor country area it should appear twice. This will give you an idea of problematic countries.

 

이러한 다양한 테이블 간에 JOIN을 수행할 때 3자로 된 국가 코드인 alpha_3_code 열이나 전체 공식 국가 이름이 포함된 테이블의 country_name 열을 사용해야 합니다. 보조 데이터 테이블의 해당 열 이름은 다를 수 있습니다. 최종 작업을 위해 팀의 다른 구성원이 해결해야 하는 데이터 문제를 식별해야 합니다. 열이 채워지면 인구 데이터가 없는 DISTINCT 국가와 국가 지역 정보가 없는 국가를 국가 이름별로 결합된 목록으로 반환하는 쿼리를 실행하십시오. 국가에 인구나 국가 면적이 없으면 두 번 나타나야 합니다. 이것은 당신에게 문제 국가에 대한 아이디어를 줄 것입니다.

 

ask 1: Create a table partitioned by date

Create a new dataset covid_660 and create a table oxford_policy_tracker_774 in that dataset partitioned by date, with an expiry of 360 days. The table should initially use the schema defined for the oxford_policy_tracker table in the COVID 19 Government Response public dataset .

You must also populate the table with the data from the source table for all countries except the United Kingdom (GBR), Brazil (BRA), Canada (CAN) and the United States (USA).

 

질문 1: 날짜별로 파티션을 나눈 테이블 생성 새 데이터 세트 covid_660을 생성하고 해당 데이터 세트에 oxford_policy_tracker_774 테이블을 생성하고 만료일은 360일입니다. 이 테이블은 처음에 COVID 19 Government Response 공개 데이터 세트의 oxford_policy_tracker 테이블에 대해 정의된 스키마를 사용해야 합니다. 또한 영국(GBR), 브라질(BRA), 캐나다(CAN) 및 미국(USA)을 제외한 모든 국가에 대한 원본 테이블의 데이터로 테이블을 채워야 합니다.

 

CREATE OR REPLACE TABLE covid_660.oxford_policy_tracker_774
PARTITION BY date

OPTIONS(

partition_expiration_days=360,

description="oxford_policy_tracker table in the COVID 19 Government Response public dataset with  an expiry time set to 90 days."

) AS

SELECT

   *

FROM

   `bigquery-public-data.covid19_govt_response.oxford_policy_tracker`

WHERE

   alpha_3_code NOT IN ('GBR', 'BRA', 'CAN','USA')