Learn to Earn Data Challenge

Cloud Composer: Qwik Start - Console

with_AI 2022. 7. 11. 15:49

Overview

Workflows are a common theme in data analytics: they involve ingesting, transforming, and analyzing data to uncover the meaningful information within. In Google Cloud, the tool for hosting workflows is Cloud Composer, a hosted version of the popular open-source workflow tool Apache Airflow.

In this lab, you use the Cloud Console to set up a Cloud Composer environment. You then use Cloud Composer to go through a simple workflow that verifies the existence of a data file, creates a Cloud Dataproc cluster, runs an Apache Hadoop wordcount job on the Cloud Dataproc cluster, and deletes the Cloud Dataproc cluster afterwards.

What you'll do

  • Use Cloud Console to create the Cloud Composer environment
  • View and run the DAG (Directed Acyclic Graph) in the Airflow web interface
  • View the results of the wordcount job in Cloud Storage


Airflow and core concepts

While waiting for your Composer environment to get created, review some terms that are used with Airflow.

Airflow is a platform to programmatically author, schedule and monitor workflows.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies.

Core concepts

DAG

A Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

Operator

The description of a single task; it is usually atomic. For example, the BashOperator is used to execute a bash command (a minimal example follows at the end of this section).

Task

A parameterised instance of an Operator; a node in the DAG.

Task Instance

A specific run of a task; characterized as: a DAG, a Task, and a point in time. It has an indicative state: running, success, failed, skipped, ...

You can read more about the concepts here.
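
To make these terms concrete, here is a minimal sketch of a DAG file (not part of the lab) that defines two tasks with the BashOperator and wires them together. The DAG id hello_airflow, the task ids, and the Airflow 1.x import path are assumptions chosen for illustration, not names used by the lab.

import datetime

from airflow import models
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path (assumed)

default_dag_args = {
    # A start date in the past lets the scheduler pick the DAG up immediately.
    'start_date': datetime.datetime(2022, 7, 1),
}

# The DAG: the collection of tasks plus their relationships and dependencies.
with models.DAG(
        'hello_airflow',                                   # hypothetical DAG id
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    # Each BashOperator call is a task: a parameterised instance of the
    # operator, and a node in the DAG.
    say_hello = BashOperator(task_id='say_hello', bash_command='echo hello')
    say_done = BashOperator(task_id='say_done', bash_command='echo done')

    # Dependency: say_hello must succeed before say_done runs.
    say_hello >> say_done

When the scheduler runs this DAG, each execution of say_hello or say_done at a given point in time is a task instance with its own state (running, success, failed, and so on).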


Defining the workflow

Now let's discuss the workflow you'll be using. Cloud Composer workflows are composed of DAGs (Directed Acyclic Graphs). DAGs are defined in standard Python files that are placed in Airflow's DAG_FOLDER. Airflow executes the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each DAG should correspond to a single logical workflow.

Below is the code for the hadoop_tutorial.py workflow, also referred to as the DAG:
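
The listing below is a sketch reconstructed from Google's public Cloud Composer quickstart sample, so the copy bundled with the lab may differ in details. It assumes three Airflow variables that the lab has you set (gcs_bucket, gcp_project, gce_zone) and uses the Airflow 1.x contrib operator import paths available in Cloud Composer 1 environments.

"""Example DAG: create a Cloud Dataproc cluster, run the Hadoop wordcount
example, and delete the cluster afterwards."""

import datetime
import os

from airflow import models
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule

# Output folder in the Cloud Storage bucket, stamped with the current time.
output_file = os.path.join(
    models.Variable.get('gcs_bucket'), 'wordcount',
    datetime.datetime.now().strftime('%Y%m%d-%H%M%S')) + os.sep

# Path to the Hadoop wordcount example available on every Dataproc cluster.
WORDCOUNT_JAR = 'file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar'

# Arguments to pass to the Dataproc Hadoop job: input text and output folder.
input_file = 'gs://pub/shakespeare/rose.txt'
wordcount_args = ['wordcount', input_file, output_file]

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # A start date of yesterday makes the DAG run as soon as it is uploaded.
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes.
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

with models.DAG(
        'composer_hadoop_tutorial',
        # Continue to run the DAG once per day.
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    # Create a Cloud Dataproc cluster, named with the scheduled date.
    create_dataproc_cluster = dataproc_operator.DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        cluster_name='composer-hadoop-tutorial-cluster-{{ ds_nodash }}',
        num_workers=2,
        zone=models.Variable.get('gce_zone'),
        master_machine_type='n1-standard-2',
        worker_machine_type='n1-standard-2')

    # Run the Hadoop wordcount example installed on the Dataproc master node.
    run_dataproc_hadoop = dataproc_operator.DataProcHadoopOperator(
        task_id='run_dataproc_hadoop',
        main_jar=WORDCOUNT_JAR,
        cluster_name='composer-hadoop-tutorial-cluster-{{ ds_nodash }}',
        arguments=wordcount_args)

    # Delete the cluster even if the Hadoop job fails (trigger_rule=ALL_DONE).
    delete_dataproc_cluster = dataproc_operator.DataprocClusterDeleteOperator(
        task_id='delete_dataproc_cluster',
        cluster_name='composer-hadoop-tutorial-cluster-{{ ds_nodash }}',
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE)

    # Define the order: create the cluster, run the job, then delete the cluster.
    create_dataproc_cluster >> run_dataproc_hadoop >> delete_dataproc_cluster

The three tasks correspond to the cluster-create, wordcount, and cluster-delete steps described in the overview, and the job output lands in the Cloud Storage bucket you configure through the gcs_bucket variable.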
