정보보호 r&d 데이터챌린지 2018 ai 기반악성코드탐지 · 2019-01-16 · •...

14
1 2019-01-15 정보보호 R&D 데이터 챌린지 2018 AI 기반 악성코드 탐지 정성균 개인팀 Sungkyun Jung

Upload: others

Post on 22-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

12019-01-15

정보보호 R&D 데이터 챌린지 2018AI 기반 악성코드 탐지

정성균개인팀

Sungkyun Jung

Page 2: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

22019-01-15

Background

Modeling Process

Data Preparation

Feature Engineering

Model Selection & Tuning with Cross-Validation

Conclusion

Personal Opinion

목차

Page 3: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

32019-01-15

정보보호 R&D 데이터챌린지 2017

참여경험을토대로부족한점보완

• Overfitting을유도하는과도한 Feature 개수사용

• 정제되지않은상태의 Feature 활용

• 기본기에충실

◦ 국영수위주로교과서를..

Background

Page 4: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

42019-01-15

Based on Static Analysis

Reference List

Background

Name Reference Link Description

Manalyze https://github.com/JusticeRage/Manalyze A static analyzer for PE executables

Objdump https://en.wikipedia.org/wiki/Objdumpit can be used as a disassembler to

view an executable in assembly form

UPX https://github.com/upx/upx the Ultimate Packer for eXecutables

LightGBM https://lightgbm.readthedocs.io/en/latestLightGBM is a gradient boosting framework that uses tree based

learning algorithms.

XGBoost https://xgboost.readthedocs.io/en/latest

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and

portable

Scikit-learn http://scikit-learn.org/stable/Simple and efficient tools for data

mining and data analysis

Page 5: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

52019-01-15

Data Preparation

Modeling Process 中가장중요한단계

• But, 대회특성상검증된학습데이터제공

◦ 근소한차이로순위결정

UPX Unpack

• 학습데이터 10,000개中 3,879(1,159 Benign / 2,720 Malware)개가 Packing 의심

• 정적 Unpack이가능한 UPX를대상으로만 Unpack 시도

◦ 632(118 Benign / 514 Mawlare)개파일 Unpack 성공

Modeling Process

Page 6: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

62019-01-15

Feature Engineering

정적으로분석가능한거의모든정보를대상으로 Featurization 수행

• PE Header

• Section

• Resource

• Byte-1-Gram

• Assembly Instruction

• Import/Export Api

• String

• Rich Header

• Version

• TLS Callback

Modeling Process

Page 7: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

72019-01-15

Feature Engineering

PE Header

• Dos Header, File Header, Optional Header

• Additional Information

◦ Detected Languages Number

◦ Has Debug Information

Modeling Process

Page 8: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

82019-01-15

Feature Engineering

Section, Resource, Assembly Instruction

• 출현빈도가높은대상선정및활용

Modeling Process

Page 9: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

92019-01-15

Feature Engineering

Byte-1-Gram

• PE 바이너리내에서출현하는 0~255 범위의바이트값의빈도수및비율정보

• Microsoft Malware Classification Challenge (BIG 2015)

◦ Packing 여부와관계없이악성코드군(‘family’)분류에효과적

◦ Benign/Malware Classification에도긍정적인효과기대가능

Rich Header

• Microsoft로부터공식적으로문서화되지않은구조정보

• PE 바이너리빌드에사용된 Compiler를 Detection 가능

• Malware간의유사성분석에활용한연구존재

Modeling Process

Page 10: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

102019-01-15

Feature Engineering

Import/Export Api, String

• 특정 Api 및 String의출현여부를 Feature로활용

Version

• Version Information Structure

◦ LegalCopyright, FileType, FileVersion, FileSubtype, FileOs, FileDescription, CompanyName

TLS Callback

• 프로그램의실행진입점(entry point) 이전에실행되는서브루틴

◦ Malware가디버거에 Load될경우, 진입점에도달하기이전에악성행위완료가능

Modeling Process

Page 11: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

112019-01-15

Feature Engineering

Feature Ranking & Selection

• DataSet 기반 Feature Engineering의자동화

Modeling Process

Feature 0

Feature 1

Feature 2

Feature 9

Feature 0

Feature 1

Feature 2…

Feature 9

Dimension Reduction

Combined Feature Set

Ranking&

Selection

2~300,000 6~7,000 746

Page 12: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

122019-01-15

Model Selection & Tuning with Cross-Validation

Gradient Boosting Algorithm

• XGBoost & LightGBM

Model Tuning With scikit-learn’s GridSearchCV

• Stratified 3-fold-cross validation 수행및 Best Model 선정

◦ XGBoost : 0.9702970297029703

◦ LightGBM : 0.9727972797279728

최종 Label 예측을위한계산식

Modeling Process

Page 13: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

132019-01-15

Why so Simple?

Simple is the best

Modeling Process ≒ Black Box Problem

사람의편견을최소화

• Byte-1-Gram

• Noise-like String

Personal Opinion

Page 14: 정보보호 R&D 데이터챌린지 2018 AI 기반악성코드탐지 · 2019-01-16 · • XGBoost & LightGBM Model Tuning With scikit-learn’sGridSearchCV • Stratified 3-fold-cross

142019-01-15

Thank you for listening