Andrey Ustyuzhanin - Data processing technologies from...

Andrey Ustyuzhanin

Big data processing at the LHC

18 October 2014

What is our universe made of?

Universal laws?

What is antimatter? Dark matter?

How do the laws of the microworld give way to the laws of the macroworld?

…

Million-dollar questions

Hypotheses => Experiments => Laws

F=ma

E=mc²

Standard Model

Supersymmetric particle model

Gravitons?

Experimental science «then»

Experiments «today»

Discovery of the Higgs boson


A simulated SUSY event in ATLAS


high pT muons

high pT jets of hadrons

missing transverse energy

p p

Background events


This event from Standard Model ttbar production also has high pT jets and muons, and some missing transverse energy.

→ can easily mimic a SUSY event.

Event

Basic unit of data: an ‘event’.

Ideally, an event is a list of momentum vectors & particle types.

In practice, particles are ‘reconstructed’ as tracks, clusters of energy deposited in calorimeters, etc.

Resolution, angular coverage, particle id, etc. imperfect.

«In Monte-Carlo we trust!»

1 event - 150 KB

1 year ~ 10 PB
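As a rough cross-check of the two figures above, the sketch below divides the yearly volume by the event size. It assumes decimal units (1 KB = 10^3 bytes, 1 PB = 10^15 bytes) and that the whole yearly volume consists of events of the quoted size; neither assumption is stated on the slide.

```python
# Back-of-the-envelope estimate; decimal units are an assumption.
EVENT_SIZE_BYTES = 150 * 10**3      # 1 event - 150 KB
YEARLY_VOLUME_BYTES = 10 * 10**15   # 1 year ~ 10 PB

events_per_year = YEARLY_VOLUME_BYTES / EVENT_SIZE_BYTES
print(f"{events_per_year:.1e} events per year")
```

Under these assumptions the result is of order 10^10 to 10^11 stored events per year.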

Event generation

Selecting events for hypothesis testing

For each event we measure a set of numbers: $\vec{x} = (x_1, \ldots, x_n)$

$x_1$ = jet $p_T$, $x_2$ = missing energy, $x_3$ = particle i.d. measure, ...

$\vec{x}$ follows some n-dimensional joint probability density, which depends on the type of event produced, i.e., was it $pp \to t\bar{t}$, $pp \to \tilde{g}\tilde{g}$, …

E.g. hypotheses $H_0$, $H_1$, ... Often simply ‘signal’ (s), ‘background’ (b)

[Figure: densities $p(\vec{x}|H_0)$ and $p(\vec{x}|H_1)$ in the $(x_i, x_j)$ plane]

Choosing optimal cuts

In particle physics usually start by making simple ‘cuts’:

$x_i < c_i$, $x_j < c_j$

Maybe later try some other type of decision boundary:

[Figure: examples of other decision boundaries separating the H0 and H1 regions]

Event selection

To search for events of a given type (H0: ‘signal’), need discriminating variable(s) distributed as differently as possible relative to unwanted event types (H1: ‘background’)

Count number of events in acceptance region defined by ‘cuts’

Expected number of signal events: $s = \sigma_s \varepsilon_s L$

Expected number of background events: $b = \sigma_b \varepsilon_b L$

$\sigma_s$, $\sigma_b$ = cross section for signal, background

‘Efficiencies’: $\varepsilon_s = P(\text{accept} \mid s)$, $\varepsilon_b = P(\text{accept} \mid b)$

L = integrated luminosity (related to beam intensity, data taking time)
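The yield formulas above can be sketched in a few lines. This is a toy illustration, not the analysis code of any experiment: the event tuples, the cut values, the cross sections and the luminosity are all hypothetical, and the efficiencies are estimated as the accepted fraction of simulated events.

```python
def efficiency(events, passes_cuts):
    """Fraction of simulated events accepted by the cuts: an estimate of P(accept | type)."""
    accepted = sum(1 for event in events if passes_cuts(event))
    return accepted / len(events)

def expected_yield(cross_section_fb, eff, luminosity_inv_fb):
    """Expected number of events: cross section x efficiency x integrated luminosity."""
    return cross_section_fb * eff * luminosity_inv_fb

# Hypothetical simulated events: (jet pT in GeV, missing energy in GeV)
signal_mc = [(120.0, 80.0), (95.0, 60.0), (40.0, 20.0), (150.0, 110.0)]
background_mc = [(30.0, 10.0), (110.0, 15.0), (45.0, 70.0), (100.0, 55.0)]

def passes_cuts(event):
    """Hypothetical acceptance region: both variables above a threshold."""
    jet_pt, missing_energy = event
    return jet_pt > 90.0 and missing_energy > 50.0

eps_s = efficiency(signal_mc, passes_cuts)
eps_b = efficiency(background_mc, passes_cuts)

# Hypothetical cross sections (fb) and integrated luminosity (fb^-1)
s = expected_yield(10.0, eps_s, 20.0)
b = expected_yield(100.0, eps_b, 20.0)
print(f"eps_s={eps_s}, eps_b={eps_b}, s={s}, b={b}")
```

With real data the efficiencies would come from large Monte-Carlo samples, so their statistical uncertainty would need to be propagated into s and b.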

Background events

Count n events, e.g., in fixed time or integrated luminosity.

s = expected number of signal events

b = expected number of background events

n ~ Poisson(s+b): $P(n; s, b) = \frac{(s+b)^n}{n!} e^{-(s+b)}$

Sometimes b known, other times it is in some way uncertain.

Goals: (i) convince people that s ≠ 0 (discovery); (ii) measure or place limits on s, taking into consideration the uncertainty in b.

Widely discussed in HEP community, see e.g. proceedings of PHYSTAT meetings, Durham, Fermilab, CERN workshops...

Discoveries

Often compute p-value of the ‘background only’ hypothesis H0 using test variable related to a characteristic of the signal.

p-value = Probability to see data as incompatible with H0, or more so, relative to the data observed.

Requires definition of ‘incompatible with H0’

HEP folklore: claim discovery if p-value equivalent to a 5σ fluctuation of Gaussian variable (one-sided)

Actual p-value at which discovery becomes believable will depend on signal in question (subjective)
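For a simple counting experiment with known b, the p-value of the background-only hypothesis and its equivalent one-sided Gaussian significance can be sketched as follows. The function names and the numbers b = 0.5, n_obs = 5 are illustrative assumptions; an uncertain b would require a more careful treatment.

```python
import math
from statistics import NormalDist

def poisson_p_value(n_obs, b):
    """P(n >= n_obs) for n ~ Poisson(b), i.e. under the background-only hypothesis (b > 0)."""
    if n_obs <= 0:
        return 1.0
    # Start from P(n = n_obs), computed in log space, and sum the upper tail directly.
    term = math.exp(-b + n_obs * math.log(b) - math.lgamma(n_obs + 1))
    total = 0.0
    k = n_obs
    while True:
        total += term
        k += 1
        term *= b / k
        # Past the mean the terms only shrink; stop once they no longer matter.
        if k > b and term <= 1e-17 * total:
            break
    return min(total, 1.0)

def significance(p_value):
    """One-sided Gaussian significance Z defined by p = 1 - Phi(Z)."""
    return -NormalDist().inv_cdf(p_value)

# p-value that corresponds to the 5 sigma discovery convention
FIVE_SIGMA_P = NormalDist().cdf(-5.0)

p = poisson_p_value(n_obs=5, b=0.5)
z = significance(p)
print(f"p = {p:.2e}, Z = {z:.2f}, discovery by the 5 sigma convention: {p < FIVE_SIGMA_P}")
```

Summing the upper tail directly, rather than computing one minus the lower tail, avoids loss of precision at the very small p-values that matter for a discovery claim.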

Why not do Bayesian analysis?

Usually don’t know how to assign meaningful prior probabilities

More details at http://www.pp.rhul.ac.uk/~cowan

Analysis Value Chain

Get datasets (Real, MC, ...)

Pre-selection

train / test

Pre-processing (e.g., add variables)

Event selection

cut-based

MVA-based

Counting/fitting

Systematics Estimation

Significance Estimation

In search of a better selection…

Opportunities for improvement

more powerful algorithms (e.g. BDT, Deep Neural Networks)

improved features (e.g. «isolation» variables or particle identification)

complex training scenarios (e.g. n-folding, ensembling, blending, cascading)
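One of the training scenarios listed above, n-folding, is read here as k-fold cross-validation; that reading is an assumption, as is the helper name `k_fold_indices`. The sketch only produces the index splits, so that every event is used for testing exactly once and is never tested by a model trained on it.

```python
import random

def k_fold_indices(n_events, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)        # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]   # k nearly equal, disjoint folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(n_events=10, k=5):
    print(len(train), len(test))
```

Each classifier would be trained on `train` and evaluated on `test`; the k sets of test predictions together cover the whole sample.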

Sasha Fonarev: https://tech.yandex.ru/education/m/shad/talks/1423/

Maxim Musin: https://tech.yandex.ru/education/m/shad/talks/1878/


[Figure: panels labelled Overfitting, Decision Tree, Underfitting, RandomForest; axes: number of iterations, training set accuracy, test set accuracy]

Performance (ROC, Learning curve)

Algorithms and implementations

Families:

– Boosted Decision Trees (BDT)

– Artificial Neural Network (ANN)

– Support Vector Machine (SVM)

– Clustering, Bayesian Networks, ...

Implementations

– TMVA (60+ algorithms)

– NeuroBayes

– python scikit-learn

– R packages

– Private (Matrixnet, predict.io)

– XGBoost, …

Price for sensitivity

How do I check quality of event discriminating function?

– Overfitting?

– Correlations?

– Relevance of figure of merit to analysis significance?

How do I deal with complexity?

– Estimate influence of model parameters

– Extra computation

– Organization (cross-checks, collaboration)

Overfitting

[Figure: decision boundary shown on the training sample and on an independent validation sample]

If decision boundary is too flexible it will conform too closely to the training points → overtraining. Monitor by applying classifier to independent validation sample.
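The monitoring idea can be illustrated with a toy example. Everything here is an assumption made for illustration: one-dimensional Gaussian events, a nearest-neighbour classifier standing in for a "too flexible" decision boundary, and a single cut standing in for a rigid one.

```python
import random

def make_sample(n_per_class, rng):
    """Toy 1-D events: background ~ N(0, 1) with label 0, signal ~ N(1, 1) with label 1."""
    sample = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    sample += [(rng.gauss(1.0, 1.0), 1) for _ in range(n_per_class)]
    return sample

def nearest_neighbour_predict(train, x):
    """A very flexible classifier: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def cut_predict(x, threshold=0.5):
    """A rigid classifier: a single cut halfway between the two means."""
    return 1 if x > threshold else 0

def accuracy(predict, sample):
    """Fraction of events whose predicted label matches the true label."""
    return sum(predict(x) == label for x, label in sample) / len(sample)

rng = random.Random(42)
train = make_sample(200, rng)
validation = make_sample(200, rng)   # independent of the training sample

nn_train = accuracy(lambda x: nearest_neighbour_predict(train, x), train)
nn_valid = accuracy(lambda x: nearest_neighbour_predict(train, x), validation)
cut_train = accuracy(cut_predict, train)
cut_valid = accuracy(cut_predict, validation)

print(f"nearest neighbour: train {nn_train:.2f}, validation {nn_valid:.2f}")
print(f"single cut:        train {cut_train:.2f}, validation {cut_valid:.2f}")
```

The nearest-neighbour classifier memorises the training points, so its training accuracy says nothing about its quality; the gap between training and validation accuracy is the overtraining signal.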

Figure-of-Merits Land

Area under ROC

Likelihood

Misclassification

False Positive, False Negative

Punzi measure

$S/\sqrt{S+B}$, $S/\sqrt{B}$, …

Efficiency flatness?
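Two of the figures of merit above can be used to choose a cut on a classifier output. The sketch below scans thresholds and keeps the one that maximises the chosen figure of merit; the scores are made-up toy values, and raw event counts are used in place of properly normalised expected yields.

```python
import math

def s_over_sqrt_s_plus_b(s, b):
    """S / sqrt(S + B); defined as 0 when nothing is selected."""
    return s / math.sqrt(s + b) if s + b > 0 else 0.0

def s_over_sqrt_b(s, b):
    """S / sqrt(B); diverges as B -> 0, so it needs care at tight cuts."""
    if b > 0:
        return s / math.sqrt(b)
    return math.inf if s > 0 else 0.0

def best_threshold(signal_scores, background_scores, fom):
    """Scan cuts 'score >= t' over all observed scores; return (best value, best t)."""
    best_value, best_t = -math.inf, None
    for t in sorted(set(signal_scores) | set(background_scores)):
        s = sum(score >= t for score in signal_scores)
        b = sum(score >= t for score in background_scores)
        value = fom(s, b)
        if value > best_value:
            best_value, best_t = value, t
    return best_value, best_t

# Hypothetical classifier outputs for simulated signal and background events
signal_scores = [0.9, 0.8, 0.7, 0.6, 0.4]
background_scores = [0.7, 0.5, 0.3, 0.2, 0.1]

best_value, best_t = best_threshold(signal_scores, background_scores, s_over_sqrt_s_plus_b)
print(f"best cut: score >= {best_t}, S/sqrt(S+B) = {best_value:.3f}")
```

Optimising the cut on the same sample that is later used to quote the significance biases the result, so in practice the scan would be done on an independent sample.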

Not only physics

Online triggers and DAQ

Offline simulation and processing

Data storage architectures

Resource management and provisioning

Data analytics

Networks and connectivity

Skynet

Using GRID resources

GRID bottlenecks

Reconfiguration is complex (and costly)

Fixed environment settings

Expensive scaling

Payment for time rather than for actual computation

Cloud technologies

Provision of computing infrastructure as a service

Virtualization of hardware resources

Dynamic allocation of resources for specific needs

Payment only for actual usage

Broad open-source & commercial support (Amazon EC2, Rackspace OpenStack, T-Systems, Helix Nebula, …)

Applications Run Natively in Hadoop

HDFS2 (Redundant, Reliable Storage)

YARN (Cluster Resource Management)

BATCH (MapReduce)

INTERACTIVE (Tez)

STREAMING (Storm, S4, …)

GRAPH (Giraph)

IN-MEMORY (Spark)

HPC MPI (OpenMPI)

ONLINE (HBase)

OTHER (Search) (Weave…)

…

YARN


Docker


http://www.docker.com/whatisdocker/

Example: Panda & ATLAS (http://bit.ly/UtlQxM)

Example tasks

Event simulation (MC)

Searching real and MC events

Online analysis

Offline analysis

Data preservation (access interface)

Preservation of analysis code and structure

Data analysis

Indicators of complexity

‘How did I generate plot 13?’

‘A new student wants to use the model I published 3 years ago, but I cannot reproduce a single plot’

‘I thought I was using the same parameters, but I get different results!?’

‘Where do I get the events selected by the previous version of my scripts?’

‘It worked yesterday!’

‘Why did I do that?’

Sources of complexity

Physics

Working with data

Analysis strategy (http://bit.ly/SqDDE4)

Analysis steps

Team collaboration

An ecosystem for experiments

A software environment that supports an ecosystem of collaborative research on shared problems and makes it possible to:

run numerical experiments on large volumes of data,

obtain reproducible results,

use uniform quality criteria.
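One ingredient of reproducible results is storing, next to every result, the exact configuration that produced it. The sketch below is a generic illustration and not a description of the environment presented in the talk: `run_experiment` is a stand-in for a real analysis step, and the record layout is hypothetical.

```python
import hashlib
import json
import random

def run_experiment(params, seed):
    """Stand-in for an analysis step; deterministic for given params and seed."""
    rng = random.Random(seed)
    return sum(rng.gauss(params["mean"], params["sigma"]) for _ in range(params["n_events"]))

def provenance_record(params, seed, result):
    """Everything needed to re-run the step, plus a fingerprint of the configuration."""
    config = json.dumps({"params": params, "seed": seed}, sort_keys=True)
    return {
        "config": config,
        "config_sha256": hashlib.sha256(config.encode("utf-8")).hexdigest(),
        "result": result,
    }

params = {"mean": 0.0, "sigma": 1.0, "n_events": 1000}
record = provenance_record(params, seed=13, result=run_experiment(params, seed=13))

# Re-running from the stored configuration reproduces the stored result.
stored = json.loads(record["config"])
rerun = run_experiment(stored["params"], stored["seed"])
print(rerun == record["result"], record["config_sha256"][:12])
```

A real setup would also need to pin the code version and the input dataset, which this sketch leaves out.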


ROOT PyROOT

Plotly, D3s

Matplotlib SciKit-Learn

EF Python Wrapper

EF0

MN

…

Main components


In place of a conclusion

joint research projects with CERN

development of a new field

internships at Yandex

[email protected]

$B_s \to \mu^+\mu^-$

$B_s \to 4\mu$, $\tau \to 3\mu$, $B \to K^*\mu^+\mu^-$

…

http://arxiv.org/abs/1410.4140v1
