밤준

Airflow Slack integration

밤준맨 — Tue, 9 Jul 2024 18:24:14 +0900

Airflow - Slack

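This post describes how to get Slack notifications when an Airflow task succeeds or fails.

airflow job

My job is currently set up like this, and I receive a notification when a task fails.

Notification on airflow task failure

Note that this is only an example: the task names in the job and the task in the notification I received are different.

Clicking the [태스크 상세] ("Task details") button takes you straight to the Airflow log.

Now let's look at how to set up Airflow with Slack.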

Airflow configuration

Go to Airflow UI > Admin > Connections and enter the connection information.

The Webhook Token can be found in the Slack app you created. (https://api.slack.com/apps)

Slack Alert Class

from airflow.providers.slack.notifications.slack_webhook import SlackWebhookNotifier
from urllib import parse
import pendulum

class SlackAlert():
    def __init__(self, webhook_conn_id):
        self.webhook_conn_id = webhook_conn_id
        self.airflow_url = "http://xxx.xx.xx.xx:xxxx/dags"
        self.local_tz = pendulum.timezone("Asia/Seoul")
        
    def send_slack_webhook_notification(self, context):
        ti = context['task_instance']

        dag_fail_slack_webhook = SlackWebhookNotifier(
            slack_webhook_conn_id = self.webhook_conn_id,
            text=f"{ti.dag_id} > {ti.task_id} 실패",
            blocks=[
                # main title + message
                {
                    "type": "section",
                    "text": {
                            "type": "plain_text",
                            "text": f"{ti.dag_id} > {ti.task_id} 실패"
                            },
                    "accessory": {
                        "type": "button",
                        "text": {
                            "type": "plain_text",
                            "text": ":red_circle: 태스크 상세",
                            "emoji": True
                        },
                        "value": "go_task",
                        "url": f"{self.airflow_url}/{ti.dag_id}/grid?dag_run_id={parse.quote(ti.run_id)}&task_id={ti.task_id}",
                        "action_id": "button-action"
                    }
                }
            ],
            attachments=[
                    # details of the failed task
                    {
                        "color": "#FD8B2D",
                        "fields": [
                            {
                                "title": f"{ti.task_id} (state: {ti.state}) 상세 Logs 내용을 확인해 주세요.",
                                "value": f"""시작 시간 : {pendulum.instance(ti.start_date).in_timezone(self.local_tz).strftime('%Y-%m-%d %H:%M:%S')}""",
                                "short": False
                            }
                        ]
                    }
                ]
        )
        return dag_fail_slack_webhook.notify(context)

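I wrote a simple class for the Slack alerts.

Two things are worth noting here:

1. Server time differs from server to server, so I use the pendulum library to set the local time zone.

2. ti.log_url came out with a localhost address in my environment, so I set airflow_url manually.

If you need more information, you can pull it out of context.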
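The two points above (local time and a manually set base URL) can be sketched with the standard library alone. This is only a minimal sketch, not the class itself: to_kst and build_task_url are hypothetical helper names, the host and run_id are made-up values, and a fixed UTC+9 offset stands in for pendulum's Asia/Seoul zone (Korea has no DST, so the result is the same).

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

# Assumption: a fixed UTC+9 offset stands in for pendulum.timezone("Asia/Seoul").
KST = timezone(timedelta(hours=9), name="KST")


def to_kst(dt: datetime) -> str:
    """Format a timezone-aware datetime as Korean local time."""
    return dt.astimezone(KST).strftime("%Y-%m-%d %H:%M:%S")


def build_task_url(base_url: str, dag_id: str, run_id: str, task_id: str) -> str:
    """Build a grid-view link shaped like the button URL in the class."""
    # run_id usually contains ':' and '+', which must be percent-encoded.
    return (
        f"{base_url}/{dag_id}/grid"
        f"?dag_run_id={quote(run_id, safe='')}&task_id={quote(task_id, safe='')}"
    )


if __name__ == "__main__":
    start = datetime(2024, 7, 9, 9, 24, 14, tzinfo=timezone.utc)
    print(to_kst(start))
    print(build_task_url(
        "http://airflow.example.com:8080/dags",  # hypothetical host
        "my_dag",
        "scheduled__2024-07-09T09:00:00+00:00",  # made-up run_id
        "my_task",
    ))
```

Quoting the run_id matters because an unescaped '+' in a query string is read back as a space.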

Calling it from Python

from airflow import DAG

from common import SlackAlert

# use the conn_id you configured
slack_alert = SlackAlert('slack_ir_monitoring')

default_args = {
    ...,
    'on_failure_callback' : slack_alert.send_slack_webhook_notification
    }

with DAG(
    job_name,
    default_args = default_args,
    ...
    ) as dag:

    task1 = ..
    ..
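With this in place, you can confirm that a Slack notification arrives when a task fails.

You can of course get notified on task success too: implement something similar and pass it as on_success_callback in default_args.

My batch interval is short, so I only receive notifications on failure.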
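One way to keep the success and failure callbacks similar is to share the message layout between them. The sketch below only builds the Block Kit payload as plain dicts; build_blocks is a hypothetical helper, the English message texts are my own choice, and the result is assumed to be passed as blocks= to SlackWebhookNotifier in the same way the class does.

```python
from typing import Any, Dict, List


def build_blocks(dag_id: str, task_id: str, succeeded: bool, task_url: str) -> List[Dict[str, Any]]:
    """Return Slack Block Kit blocks for a task result message."""
    status = "succeeded" if succeeded else "failed"
    icon = ":white_check_mark:" if succeeded else ":red_circle:"
    return [
        {
            "type": "section",
            "text": {"type": "plain_text", "text": f"{dag_id} > {task_id} {status}"},
            "accessory": {
                "type": "button",
                "text": {"type": "plain_text", "text": f"{icon} Task details", "emoji": True},
                "value": "go_task",
                "url": task_url,
                "action_id": "button-action",
            },
        }
    ]


if __name__ == "__main__":
    # hypothetical values, just to show the shape of the payload
    print(build_blocks("my_dag", "my_task", True, "http://airflow.example.com/dags/my_dag/grid"))
```

With a helper like this, the success and failure methods would differ only in the succeeded flag.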

36. Valid Sudoku

밤준맨 — Sat, 13 Apr 2024 16:47:21 +0900

Determine if a 9 x 9 Sudoku board is valid. Only the filled cells need to be validated according to the following rules:

Each row must contain the digits 1-9 without repetition.
Each column must contain the digits 1-9 without repetition.
Each of the nine 3 x 3 sub-boxes of the grid must contain the digits 1-9 without repetition.
Note:

A Sudoku board (partially filled) could be valid but is not necessarily solvable.
Only the filled cells need to be validated according to the mentioned rules.

Example 1:

Input: board =
[["5","3",".",".","7",".",".",".","."]
,["6",".",".","1","9","5",".",".","."]
,[".","9","8",".",".",".",".","6","."]
,["8",".",".",".","6",".",".",".","3"]
,["4",".",".","8",".","3",".",".","1"]
,["7",".",".",".","2",".",".",".","6"]
,[".","6",".",".",".",".","2","8","."]
,[".",".",".","4","1","9",".",".","5"]
,[".",".",".",".","8",".",".","7","9"]]
Output: true
Example 2:

Input: board =
[["8","3",".",".","7",".",".",".","."]
,["6",".",".","1","9","5",".",".","."]
,[".","9","8",".",".",".",".","6","."]
,["8",".",".",".","6",".",".",".","3"]
,["4",".",".","8",".","3",".",".","1"]
,["7",".",".",".","2",".",".",".","6"]
,[".","6",".",".",".",".","2","8","."]
,[".",".",".","4","1","9",".",".","5"]
,[".",".",".",".","8",".",".","7","9"]]
Output: false
Explanation: Same as Example 1, except with the 5 in the top left corner being modified to 8. Since there are two 8's in the top left 3x3 sub-box, it is invalid.

Constraints:
board.length == 9
board[i].length == 9
board[i][j] is a digit 1-9 or '.'.

from typing import List


class Solution:
    def isValidSudoku(self, board: List[List[str]]) -> bool:

        def check_row_and_col(r, c, n):
            for i in range(9):
                if (board[r][i] == n and i != c) or (board[i][c] == n and i != r):
                    return False

            # check rect
            start_row = (r // 3) * 3
            start_col = (c // 3) * 3
            for i in range(start_row, start_row + 3):
                for j in range(start_col, start_col + 3):
                    if board[i][j] == n and (i != r or j != c):
                        return False
            return True
        
        for i in range(9):
            for j in range(9):
                if board[i][j] != ".":
                    if not check_row_and_col(i, j, board[i][j]):
                        return False
        return True

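The task is not to fill in the Sudoku;

you only need to check whether the board is valid or not.

Judging from the discussion board,

many people seem to miss that you have to check whether the index being compared is the same as the index currently being checked

(otherwise a cell ends up being compared with itself).

I missed it too and struggled with it, lol..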
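A different way to avoid the self-comparison pitfall is to remember what has already been seen instead of re-scanning the board for every cell. This is a sketch of that alternative, not the solution above; is_valid_sudoku is a name I made up.

```python
from typing import List


def is_valid_sudoku(board: List[List[str]]) -> bool:
    """Single pass: record each digit once per row, column, and 3x3 box."""
    seen = set()
    for r in range(9):
        for c in range(9):
            n = board[r][c]
            if n == ".":
                continue
            keys = (("row", r, n), ("col", c, n), ("box", r // 3, c // 3, n))
            # A key that is already present means the digit repeats in that unit.
            if any(k in seen for k in keys):
                return False
            seen.update(keys)
    return True
```

Because each cell is visited exactly once, there is no "is this the same cell?" check to forget.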

35. Search Insert Position

밤준맨 — Sat, 13 Apr 2024 16:44:07 +0900

Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.

You must write an algorithm with O(log n) runtime complexity.

Example 1:

Input: nums = [1,3,5,6], target = 5
Output: 2
Example 2:

Input: nums = [1,3,5,6], target = 2
Output: 1
Example 3:

Input: nums = [1,3,5,6], target = 7
Output: 4

Constraints:

1 <= nums.length <= 10^4
-10^4 <= nums[i] <= 10^4
nums contains distinct values sorted in ascending order.
-10^4 <= target <= 10^4

import bisect
from typing import List


class Solution:
    def searchInsert(self, nums: List[int], target: int) -> int:
        return bisect.bisect_left(nums, target)

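The usual approach is to find the value by moving the left and right indices,

but while googling I found that there is a library for binary search (bisect, in the standard library).

A post explaining the library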
https://yerimoh.github.io/Algo011/
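For comparison, the "move the left and right indices" version can be sketched as follows. search_insert is my own name for it; it is meant to return the same index as bisect.bisect_left on a sorted list.

```python
import bisect
from typing import List


def search_insert(nums: List[int], target: int) -> int:
    """Return the leftmost index at which target could be inserted."""
    lo, hi = 0, len(nums)          # search the half-open range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if nums[mid] < target:
            lo = mid + 1           # target lies to the right of mid
        else:
            hi = mid               # mid itself may be the answer
    return lo


if __name__ == "__main__":
    nums = [1, 3, 5, 6]
    for target in (5, 2, 7, 0):
        print(target, search_insert(nums, target), bisect.bisect_left(nums, target))
```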

hadoop MR

밤준맨 — Mon, 17 Oct 2022 14:20:46 +0900

This post is based on hadoop3.

1. A basic shell script for running a hadoop MR job

${HADOOP} --config ${HADOOP_CONF_DIR} jar ${JAR_PATH} \
	-files "argvs, mapper.py, reducer.py" \
	-D mapreduce.job.name="test job" \
	-D mapreduce.job.reduces=1 \
	-D mapreduce.reduce.memory.mb=40960 \
	-D stream.num.map.output.key.fields=1 \
	-D stream.map.output.field.separator="\t" \
	-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
	-D mapreduce.partition.keycomparator.options="-k1,1" \
	-D mapreduce.partition.keypartitioner.options="-k1,1" \
	-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
	-cmdenv PYTHONPATH=. \
	-input "HDFS INPUT" \
	-output "HDFS OUTPUT" \
	-mapper "python mapper.py argvs" \
	-reducer "python reducer.py"

${HADOOP} --config ${HADOOP_CONF_DIR} jar ${JAR_PATH}

Hadoop jobs are run from a jar file, so the first line specifies the configuration directory of the installed Hadoop and the jar file to run.

-files "argvs, mapper.py, reducer.py"

Any files the job needs must be declared up front; besides the python files, files used as arguments (argvs) must be declared too.
If the file lives in HDFS, change it as follows.

-files "hdfs://${HADOOP_ADDR}/args#argvs, mapper.py, reducer.py"

Here #argvs is an alias, so when running the mapper or reducer you don't need to pass the whole hdfs:// ~ address as an argument; passing argvs is enough.

-D mapreduce.job.name="test job"

This is the job name shown in yarn. Many different jobs run in yarn, so naming the job makes it easy to tell apart.

-D mapreduce.job.reduces=1
-D mapreduce.reduce.memory.mb=40960

This assigns a single reducer, so only one result file, part-00000, is produced,
and gives that reducer 40 GB of memory (40960 MB).

-D stream.num.map.output.key.fields=1
-D stream.map.output.field.separator="\t"

This means the mapper output is split into a key and a value at the first "\t".
For example, if the mapper outputs P1 \t val1 val2 val3 val4,
it is passed to the reducer as key : P1 , values : val1 val2 val3 val4.
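The key/value split can be pictured in plain Python. This is only an illustration of the two settings, not Hadoop's own code; split_key_value and its parameters are names I made up to mirror stream.num.map.output.key.fields and stream.map.output.field.separator.

```python
from typing import Tuple


def split_key_value(line: str, num_key_fields: int = 1, sep: str = "\t") -> Tuple[str, str]:
    """Split one mapper output line into (key, value) at the n-th separator."""
    fields = line.rstrip("\n").split(sep)
    key = sep.join(fields[:num_key_fields])
    value = sep.join(fields[num_key_fields:])
    return key, value


if __name__ == "__main__":
    print(split_key_value("P1\tval1 val2 val3 val4"))
```

A line with no separator at all ends up entirely in the key, with an empty value.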

-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-D mapreduce.partition.keycomparator.options="-k1,1"

Think of the class name as an argument you always declare in order to use the keycomparator;
the keycomparator itself sorts the mapper's output keys.
-k1,1 means the first sort key starts at field 1 and ends at field 1.

For a multi-level sort, use something like the following
"-k1,1nr -k2,2n -k3,3nr"
key 1: numeric reverse sort, key 2: numeric sort, key 3: numeric reverse sort
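To get a feel for what that option string asks for, the same ordering can be reproduced in plain Python. This only emulates the resulting order on made-up, tab-separated numeric rows; it is not how KeyFieldBasedComparator is implemented.

```python
from typing import List, Tuple


def sort_key(line: str) -> Tuple[float, float, float]:
    """Emulate "-k1,1nr -k2,2n -k3,3nr" on a tab-separated line."""
    f1, f2, f3 = (float(x) for x in line.split("\t")[:3])
    # Negating a numeric field gives the reverse (descending) order.
    return (-f1, f2, -f3)


def sort_rows(rows: List[str]) -> List[str]:
    return sorted(rows, key=sort_key)


if __name__ == "__main__":
    sample = ["2\t1\t5", "10\t3\t1", "2\t1\t9", "10\t2\t7"]  # made-up rows
    for row in sort_rows(sample):
        print(row)
```

Without the n flag the fields would be compared as text, so "10" would sort before "2".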

-D mapreduce.partition.keypartitioner.options="-k1,1"
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

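The -partitioner option has to be declared, just like the class for the comparator above. It goes after all the -D options, not in between them: -D is a generic option and -partitioner is a streaming option, and generic options must come before streaming options.
The keypartitioner splits the mapper's output keys into partitions and hands them to the reducers.
With -k1,1, the partitions are split on the first key field.
To partition on the first and second fields, use -k1,2.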

With -k1,1, a run with more than one reducer looks like this.

mapper output 
P1 P2 P3
P1 P4 P5
P2 P2 P5
P3 P5 P2

When passed to the reducers as input,
//part-00000 (file names may differ)
P1 P2 P3
P1 P4 P5

//part-00001
P2 P2 P5

//part-00002
P3 P5 P2
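The grouping above can be imitated in a few lines of Python. This is a rough sketch under assumptions: the fields are taken to be tab-separated, and zlib.crc32 is just a stable stand-in hash, not the hash Hadoop uses, so the partition numbers will not match a real job. What it does show is that rows sharing the first field always land in the same partition.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition(lines: List[str], num_reducers: int,
              num_key_fields: int = 1, sep: str = "\t") -> Dict[int, List[str]]:
    """Group lines by a hash of their leading key fields (like -k1,1)."""
    parts: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        key = sep.join(line.split(sep)[:num_key_fields])
        # Assumption: crc32 is only a deterministic placeholder hash.
        idx = zlib.crc32(key.encode("utf-8")) % num_reducers
        parts[idx].append(line)
    return dict(parts)


if __name__ == "__main__":
    mapper_output = ["P1\tP2\tP3", "P1\tP4\tP5", "P2\tP2\tP5", "P3\tP5\tP2"]
    for idx, rows in sorted(partition(mapper_output, 3).items()):
        print(idx, rows)
```

Two different keys can still hash to the same partition, so a reducer may receive more than one key.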

[MacOS] pyenv virtualenv setup

밤준맨 — Thu, 30 Sep 2021 10:07:48 +0900

http://taewan.kim/post/python_virtual_env/

Troubleshooting

1. sendfile-related error : https://vipdeveloper.tistory.com/65

2. failed to activate virtualenv error : https://velog.io/@limdongyoung0/pyenv-Failed-to-activate-virtualenv

[MacOS] Installing jupyter notebook

밤준맨 — Wed, 29 Sep 2021 15:42:45 +0900

1. Open Terminal

2. pip3 install --upgrade pip

3. pip3 install jupyter

4. jupyter notebook

(or: jupyter lab)

ML Hyperparameter tuning

밤준맨 — Thu, 5 Aug 2021 09:39:28 +0900

When developing a model locally : hyperopt, optuna, skopt

When developing a model across multiple nodes : katib, ray

hyperopt example

https://teddylee777.github.io/thoughts/hyper-opt

Hyperparameter tuning with HyperOpt, which is based on Bayesian optimization (teddylee777.github.io)

30. Substring with Concatenation of All Words

밤준맨 — Mon, 21 Jun 2021 11:33:55 +0900

You are given a string s and an array of strings words of the same length. Return all starting indices of substring(s) in s that is a concatenation of each word in words exactly once, in any order, and without any intervening characters.

You can return the answer in any order.

Example 1:

Input: s = "barfoothefoobarman", words = ["foo","bar"] Output: [0,9] Explanation: Substrings starting at index 0 and 9 are "barfoo" and "foobar" respectively. The output order does not matter, returning [9,0] is fine too.

Example 2:

Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"] Output: []

Example 3:

Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"] Output: [6,9,12]

Constraints:

1 <= s.length <= 10^4
s consists of lower-case English letters.
1 <= words.length <= 5000
1 <= words[i].length <= 30
words[i] consists of lower-case English letters.

from typing import List


class Solution:
    def findSubstring(self, s: str, words: List[str]) -> List[int]:
        char = len(words[0])
        begin_idx = []
        i = 0
        while i <= len(s) - len(words) * char:
            tmp_words = list(words)
            if s[i:i+char] in tmp_words:
                tmp_words.remove(s[i:i+char])
        #         print(i,i+char)
                j = i+char
                while j <= len(s):
                    if s[j:j+char] in tmp_words:
                        tmp_words.remove(s[j:j+char])
                        j += char
                    else:
                        break
            if len(tmp_words) == 0:
                begin_idx.append(i)
            
            i += 1
        
        return begin_idx

- The performance is poor. I should also look at the other approaches from the discussion board..

Runtime: 1480 ms, faster than 23.33% of Python3 online submissions for Substring with Concatenation of All Words.

Memory Usage: 14.7 MB, less than 27.56% of Python3 online submissions for Substring with Concatenation of All Words.
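One commonly used faster approach is a sliding window over word-sized steps with a Counter. The sketch below is my own take on that idea, not a solution copied from the discussion board, and find_substring is a made-up name. It scans each of the len(words[0]) possible alignments once instead of restarting at every index.

```python
from collections import Counter
from typing import List


def find_substring(s: str, words: List[str]) -> List[int]:
    """Sliding window per alignment; each word-sized chunk is visited once."""
    if not s or not words:
        return []
    w, k = len(words[0]), len(words)
    need = Counter(words)
    result = []
    for offset in range(w):
        left, count = offset, 0
        seen: Counter = Counter()
        for right in range(offset, len(s) - w + 1, w):
            word = s[right:right + w]
            if word not in need:
                # A foreign chunk breaks the window: restart just after it.
                seen.clear()
                count, left = 0, right + w
                continue
            seen[word] += 1
            count += 1
            # Too many copies of this word: shrink from the left.
            while seen[word] > need[word]:
                seen[s[left:left + w]] -= 1
                left += w
                count -= 1
            if count == k:
                result.append(left)
                # Slide by one word to look for the next match.
                seen[s[left:left + w]] -= 1
                left += w
                count -= 1
    return result


if __name__ == "__main__":
    print(sorted(find_substring("barfoothefoobarman", ["foo", "bar"])))
```

The indices come out grouped by alignment rather than sorted, which is fine since the problem accepts any order.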

31. Next Permutation

밤준맨 — Wed, 16 Jun 2021 15:12:41 +0900

Medium

Implement next permutation, which rearranges numbers into the lexicographically next greater permutation of numbers.

If such an arrangement is not possible, it must rearrange it as the lowest possible order (i.e., sorted in ascending order).

The replacement must be in place and use only constant extra memory.

Example 1:

Input: nums = [1,2,3] Output: [1,3,2]

Example 2:

Input: nums = [3,2,1] Output: [1,2,3]

Example 3:

Input: nums = [1,1,5] Output: [1,5,1]

Example 4:

Input: nums = [1] Output: [1]

Constraints:

1 <= nums.length <= 100
0 <= nums[i] <= 100

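from typing import List


class Solution:
    def nextPermutation(self, nums: List[int]) -> None:
        """
        Do not return anything, modify nums in-place instead.
        """
        # find the rightmost index i with nums[i] < nums[i+1]
        i = len(nums) - 2
        while i >= 0 and nums[i] >= nums[i+1]:
            i -= 1

        if i < 0:
            # already the last permutation: wrap around to ascending order
            nums.sort()
            return

        # find the rightmost element greater than nums[i] and swap them
        j = len(nums) - 1
        while nums[j] <= nums[i]:
            j -= 1

        nums[i], nums[j] = nums[j], nums[i]
        # the suffix is in descending order; reverse it to make it ascending
        nums[i+1:] = nums[i+1:][::-1]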

- Python can swap two elements in place with tuple assignment

- I learned that this kind of ordering algorithm exists..
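Both notes can be seen in a small standalone sketch that steps through every ordering of [1, 2, 3]. next_permutation is a hypothetical free-function variant of the method above; the only change is that it reverses the suffix with two pointers, so no slice copy is made.

```python
from typing import List


def next_permutation(nums: List[int]) -> None:
    """Rearrange nums in place into the next lexicographic permutation."""
    # 1. Find the rightmost position whose value is smaller than its successor.
    i = len(nums) - 2
    while i >= 0 and nums[i] >= nums[i + 1]:
        i -= 1
    # 2. Swap it with the rightmost value that is larger (tuple-assignment swap).
    if i >= 0:
        j = len(nums) - 1
        while nums[j] <= nums[i]:
            j -= 1
        nums[i], nums[j] = nums[j], nums[i]
    # 3. Reverse the descending suffix; with i == -1 this reverses everything.
    lo, hi = i + 1, len(nums) - 1
    while lo < hi:
        nums[lo], nums[hi] = nums[hi], nums[lo]
        lo += 1
        hi -= 1


if __name__ == "__main__":
    nums = [1, 2, 3]
    for _ in range(6):
        next_permutation(nums)
        print(nums)
```

After the last permutation the list wraps back around to ascending order.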

[dacon] Wine quality EDA and first-pass model development

밤준맨 — Mon, 14 Jun 2021 15:35:14 +0900

https://dacon.io/competitions/open/235610/overview/description

[Chemistry] Wine quality classification (source : DACON - Data Science Competition, dacon.io)

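I used the data above and followed the EDA post I wrote earlier.

The Y to predict here is quality; the remaining columns are used as features.

None of the columns contain null values, and only type has the object dtype.

type only takes the values red and white, so

df['type'] = df['type'].replace(['red', 'white'], [0, 1])

I encoded it as shown.

It is omitted here, but each column has a different distribution,

so scaling is needed.

I then looked at the correlations between the variables.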

corr = df.corr(method = 'pearson')
sns.heatmap(data = corr, annot=True, fmt='.1f', cmap='Reds')
plt.show()

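Generally a correlation of 0.6 to 0.7 or more is considered very high and marks an important feature, but

here I picked volatile acidity, density, and alcohol as the features most correlated with quality

and decided to remove outliers from these three columns.

volatile acidity and density do show some outlier values,

but in hindsight I think these are values that should not be removed.

If this were an anomaly detection or binary classification problem,

I think the variables should be built without removing outliers.

In this data, however, the range of values is too narrow to really call them outliers,

and I don't expect them to have much impact.

Outlier removal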

def check_outlier(df, col):
    IQR = df[col].quantile(0.75) - df[col].quantile(0.25) 
    max_outlier = df[col].quantile(0.75) + 1.5*IQR
    min_outlier = df[col].quantile(0.25) - 1.5*IQR
    print(len(df[df[col] > max_outlier]), len(df[df[col] < min_outlier]))
    df = df.drop(df[df[col] > max_outlier].index, axis=0)
    df = df.drop(df[df[col] < min_outlier].index, axis=0)
    return df
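The IQR rule used in check_outlier can be checked by hand on a tiny list. This sketch uses only the standard library and made-up numbers; iqr_fences is a hypothetical helper, and method="inclusive" is chosen because, as far as I know, it interpolates the same way as the default quantile in pandas.

```python
import statistics
from typing import List, Tuple


def iqr_fences(values: List[float]) -> Tuple[float, float]:
    """Return (lower, upper) bounds at 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values
    lower, upper = iqr_fences(data)
    outliers = [v for v in data if v < lower or v > upper]
    print(lower, upper, outliers)
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5 and anything outside [-3.5, 14.5] counts as an outlier.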

The value ranges differ, so I used the plain StandardScaler.

I set type and quality aside and scaled the rest.

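import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Y = df['quality']
X = df.drop(columns=['quality'])
# scale every column except the last one (type), then append type unscaled
X_scal = sc.fit_transform(X[X.columns[:-1]])
X_fin = np.column_stack((X_scal, X['type'].values))
Y = Y-3
## add 3 back to the predictions at predict time

X_train, X_test, y_train, y_test = train_test_split(X_fin, Y, test_size = 0.33, random_state = 12)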

The reason for subtracting 3 from Y (quality) here is the

num_classes argument in the lgbm model params.

params = {
    'application' : 'multiclass',
    'num_boost_round' : 1300,
    'learning_rate' : 0.01,
    'num_leaves' : 31,
    'num_classes' : 7,
    'metric' : 'multi_error'
}

quality currently ranges from 3 to 9, so y can take 7 distinct values.

However, with num_classes set to 7, lgbm expects the labels to be in [0, 7),

so I add 3 back to the predicted result.

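import lightgbm as lgbm

lgb_train = lgbm.Dataset(X_train, label = y_train)
lgb_valid = lgbm.Dataset(X_test, label = y_test)
evals_result = {}
# record_evaluation stores the per-round validation metric in evals_result
# (the evals_result= argument of train() was removed in LightGBM 4.0)
clf = lgbm.train(params, lgb_train, valid_sets=[lgb_valid],
                 callbacks=[lgbm.record_evaluation(evals_result)])
lgbm.plot_metric(evals_result)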

[1] valid_0's multi_error: 0.555372

...

[1300] valid_0's multi_error: 0.369146

The error keeps decreasing like this, and judging from the loss graph,

training any further could overfit, so I stopped here.

After predict, the result for each row looks like

[ 0.1 0.3 .. 0.1 0.2].

You need to take the index of the highest value.

res_argmax = [np.argmax(n) for n in y_pred]
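Putting the argmax step together with the +3 shift, the mapping from predicted probabilities back to quality can be sketched without any libraries. to_quality and the probability rows below are made up for illustration; with numpy the same thing is np.argmax(y_pred, axis=1) + 3.

```python
from typing import List, Sequence

LABEL_OFFSET = 3  # quality 3..9 was shifted to 0..6 before training


def to_quality(probabilities: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most likely class per row and undo the label shift."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        labels.append(best + LABEL_OFFSET)
    return labels


if __name__ == "__main__":
    y_pred = [
        [0.10, 0.20, 0.60, 0.05, 0.03, 0.01, 0.01],  # made-up probabilities
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.10, 0.90],
    ]
    print(to_quality(y_pred))
```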

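Now let's validate the performance on the test data.

It was quick and simple to build, and it scored 0.662.

I trained the model both with and without the outliers removed,

but with so little data it does not seem to make much difference.

Looking at the results a bit more closely,

the predicted quality only ever fell between 4 and 8.

Either I need to train with more data for the remaining qualities (3 and 9),

or the values became too small after scaling, so the weights were learned too small.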