1. Feature Importances (Mean Decrease Impurity, MDI)

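Used by default in sklearn's tree-based classifiers
Fast to compute, but the results should be read with caution, since MDI tends to favor high-cardinality features
Each feature's importance is its mean decrease in impurity, averaged over the trees
Features that are used more often in splits that greatly reduce impurity receive higher importance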

import pandas as pd

# Pull the fitted random forest out of the pipeline
rf = pipe.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)

%matplotlib inline
import matplotlib.pyplot as plt

# Plot the n most important features as a horizontal bar chart
n = 20
plt.figure(figsize=(10, n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh();
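
Why MDI needs a careful reading can be sketched on synthetic data. The example below is only an illustration under assumptions: the dataset, the column names (f0 ... f4, noise), and the model settings are hypothetical and are not the data used in these notes. It appends a purely random column and shows that MDI still assigns it a nonzero share of the importance.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data: 5 features, 3 of them informative.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

# A column of pure noise that carries no information about y.
X["noise"] = np.random.RandomState(0).rand(len(X))

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# MDI importances are normalized, so they sum to 1 across all columns.
mdi = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
print(mdi)
```

Because fully grown trees will split on any continuous column sooner or later, the noise column ends up with a positive MDI value even though it has no predictive power.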

2. Drop-Column Importance

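Very slow, because the model must be refit after dropping each feature in turn
With n features, (n+1) fits are required: one baseline fit plus one fit per dropped feature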

column = 'opinion_seas_risk'

# Fit without opinion_seas_risk
pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
pipe.fit(X_train.drop(columns=column), y_train)
score_without = pipe.score(X_val.drop(columns=column), y_val)
print(f'Validation accuracy (without {column}): {score_without}')

# Fit again with opinion_seas_risk included
pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
pipe.fit(X_train, y_train)
score_with = pipe.score(X_val, y_val)
print(f'Validation accuracy (with {column}): {score_with}')

# Accuracy difference with vs. without opinion_seas_risk
print(f'Drop-Column importance of {column}: {score_with - score_without}')
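
The single-column comparison extends to every feature with a loop. The following is a minimal sketch on hypothetical synthetic data (the dataset, the column names, and the n_fits counter are assumptions made for illustration); it makes the (n+1) fits explicit.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with 4 features.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Fit 1 of n+1: the baseline with all features.
baseline = clone(model).fit(X_tr, y_tr).score(X_va, y_va)
n_fits = 1

# Fits 2..n+1: one refit per dropped column.
drop_importances = {}
for col in X.columns:
    m = clone(model).fit(X_tr.drop(columns=col), y_tr)
    n_fits += 1
    drop_importances[col] = baseline - m.score(X_va.drop(columns=col), y_va)

print(f"number of fits: {n_fits}")
print(pd.Series(drop_importances).sort_values())
```

A negative value means validation accuracy went up once the column was removed.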

3. Permutation Importance (Mean Decrease Accuracy, MDA)

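Measures how much the evaluation metric drops when random noise is applied to the feature of interest at prediction time
Instead of removing the feature, its values are randomly perturbed so the information they carried is destroyed; the feature can no longer play its original role, and performance is measured in that state
- The simplest way to add noise is to shuffle the feature's values within the sample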

✔️ Computing permutation importance by hand

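import numpy as np

# Feature to permute (assumed here to be the column used in the Drop-Column example)
feature = 'opinion_seas_risk'

# Shuffle the feature's values at random
X_val_permuted = X_val.copy()
X_val_permuted[feature] = np.random.RandomState(seed=7).permutation(X_val_permuted[feature])

# Score the already-fitted pipeline on the permuted data (no refitting needed!)
score_permuted = pipe.score(X_val_permuted, y_val)

print(f'Validation accuracy ({feature}): {score_with}')
print(f'Validation accuracy (permuted "{feature}"): {score_permuted}')
print(f'Permutation importance: {score_with - score_permuted}')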

✔️ Computing permutation importance with a library

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import eli5
from eli5.sklearn import PermutationImportance

# Define the permuter
# (step name matches the make_pipeline-built pipe used above)
permuter = PermutationImportance(
    pipe.named_steps['randomforestclassifier'], # model
    scoring='accuracy', # metric
    n_iter=5, # repeat 5 times with different random seeds
    random_state=2
)

# The permuter works on the preprocessed X_val,
# so apply every pipeline step except the final model.
X_val_transformed = pipe[:-1].transform(X_val)

# Despite the name, this fit only re-scores the model on shuffled data
permuter.fit(X_val_transformed, y_val);
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()
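
As an alternative to eli5, scikit-learn provides `sklearn.inspection.permutation_importance`, which performs the same shuffle-and-rescore procedure. The sketch below is self-contained and uses hypothetical synthetic data and column names rather than the dataset in these notes; `n_repeats=5` is chosen to mirror `n_iter=5` above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with 4 features.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 5 times on the validation set and record the accuracy drop.
result = permutation_importance(
    model, X_va, y_va, scoring="accuracy", n_repeats=5, random_state=2
)

# importances_mean holds the average accuracy drop per feature.
perm = pd.Series(result.importances_mean, index=X.columns).sort_values()
print(perm)
```

The function also accepts a fitted pipeline as the estimator, in which case the raw (unpreprocessed) validation data can be passed directly.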

XGBoost for gradient boosting

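Training a single tree deeply tends to overfit, so ensemble models based on bagging (e.g., random forests) or boosting are used to avoid overfitting
Boosting vs. bagging: both are ensemble methods, but they differ in how the trees are built; bagging trains trees independently, while boosting trains them sequentially so that each tree depends on the ones before it
- Among boosting algorithms, AdaBoost gives more weight to the observations each tree misclassifies, so that they are sampled more heavily when the next tree is built and the next tree focuses on classifying them
- Gradient boosting: to optimize the cost function, it fits each new tree to the residuals instead of adjusting sample weights, so data points with larger residuals receive more attention
  - python libraries
    - scikit-learn Gradient Tree Boosting: relatively slow
    - xgboost, LightGBM: accept missing values and can enforce monotonic constraints
    - CatBoost: accepts missing values and can use categorical features without preprocessing
      - The default parameters are well tuned, so it is less sensitive to hyperparameter tuning (overfitting is handled by the algorithm internally)
      - Hyperparameters such as learning_rate, random_strength, and l2_leaf_reg (L2 regularization) can still be adjusted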
    See also: Understanding the key concepts and features of CatBoost
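
The residual-fitting idea behind gradient boosting can be sketched in a few lines. This is a simplified illustration for squared-error regression, not how xgboost, LightGBM, or CatBoost are implemented; the toy data, learning rate, tree depth, and number of rounds are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy regression problem: noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from a constant prediction
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is simply the residual.
    residual = y - pred
    # Each new shallow tree is trained on the current residuals, not on y.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a shrunken version of the tree's output to the running prediction.
    pred += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - pred) ** 2))

print(f"training MSE before boosting: {mse_history[0]:.4f}")
print(f"training MSE after 50 rounds: {mse_history[-1]:.4f}")
```

Points with large residuals dominate each tree's fitting objective, which is how the model spends more effort on the data it currently predicts worst.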