Retrospection / prospection and schema

TAGOMORI Satoshi (@tagomoris), LINE Corp.

2014/01/31 (Fri) at University of Tsukuba, the 1st half

TAGOMORI Satoshi (@tagomoris), LINE Corp.

Development Support Team


Logs

Service metrics (Users, PageViews, ...)

UX/UI metrics (Access path, Taps/views, ...)

Monitoring metrics (Traffic Gbps, TBytes/day, ...)

System monitoring (Error rates, Response time, ...)


Software for Logging

Collection: Fluentd, Scribed, Flume, LogStash, ...

Storage: RDBMS, Hadoop HDFS, NoSQLs, Elasticsearch, ...

Processing: SQL, Hadoop MapReduce (Hive), Presto, Impala, ...

Stream-Processing: Storm, Kafka, Norikra, ...

Visualization: Kibana, Tableau, Fnordmetric, GrowthForecast, Focuslight, ...

Appliance: DWH + BI Tools

Services: Google BigQuery, Treasure Data, ...


How to inspect logs

Retrospection (reactive search)

Store data, and search

Prospection (proactive search)

Define what should be processed, and store data
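The difference between the two approaches can be sketched in a few lines. This is only an illustration: the events and the field names `path` and `status` are made up, and a plain list stands in for a real storage layer.

```python
from collections import Counter

# Hypothetical access-log events; the field names are illustrative only.
events = [
    {"path": "/", "status": 200},
    {"path": "/login", "status": 500},
    {"path": "/", "status": 200},
    {"path": "/login", "status": 200},
]

# Retrospection: store everything first, decide what to ask afterwards.
stored = []
for event in events:
    stored.append(event)  # the full record is kept
errors_retro = Counter(e["path"] for e in stored if e["status"] >= 500)

# Prospection: define the question first, keep only its answer.
errors_pro = Counter()  # the "query" exists before any data arrives
for event in events:
    if event["status"] >= 500:  # evaluated once per event, on arrival
        errors_pro[event["path"]] += 1
    # the raw event is not kept

print(errors_retro, errors_pro)
```

Both loops give the same answer here. The retrospective one can still answer new questions later at the cost of keeping every record; the prospective one keeps only the counters.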


What logs are inspected

Schema-full data:

strict schema: predefined fields w/ types (or reject)

schema on read: try to read known fields (or ignore)

Schema-less data:

any fields (or ignore), any types (implicit/explicit conversion)

fit for services in-development (all internet services!)
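The three policies can be compared on a single record. This is a minimal sketch: `SCHEMA` and the function names are hypothetical, and real systems apply these rules in their storage or query layers rather than in application code.

```python
# Hypothetical schema: field name -> expected type.
SCHEMA = {"user_id": int, "path": str}

def strict(record):
    """Strict schema: exactly the predefined fields with the right types, or reject."""
    if set(record) != set(SCHEMA):
        raise ValueError("unexpected or missing field")
    for name, typ in SCHEMA.items():
        if not isinstance(record[name], typ):
            raise ValueError(f"wrong type for {name}")
    return dict(record)

def schema_on_read(record):
    """Schema on read: pick up the known fields, ignore everything else."""
    return {name: record[name] for name in SCHEMA if name in record}

def schema_less(record):
    """Schema-less: accept any fields and types as they are."""
    return dict(record)

# A record with one field the schema does not know about.
record = {"user_id": 1, "path": "/", "agent": "curl"}
print(schema_on_read(record))
print(schema_less(record))
```

Passing the same record to `strict` raises `ValueError` because of the extra `agent` field. That is the kind of rejection that hurts while a service is still adding fields.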


How/what

How \ What    Schema-full                                    Schema-less

Retrospect    RDBMS, Hive, BigQuery, Cassandra, HBase, ...   MongoDB, Hive (SerDe), TD, plain text file, ...

Prospect      Esper, many stream CEPs, ...                   Norikra, ...


Data size: schema & index

Logs: size is always important (xTB - xPB)

Schema:

size optimization

access optimization on memory/disk

Index:

access optimization on memory/disk

more memory/disk required

hard to distribute
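A rough illustration of the size optimization a schema allows: the same record encoded as self-describing JSON and as a fixed binary layout. The record, its field names, and the `<IHH` layout are assumptions chosen for this sketch only.

```python
import json
import struct

# Hypothetical record; names and value ranges are made up for illustration.
record = {"user_id": 12345, "status": 200, "elapsed_ms": 87}

# Schema-less: every record carries its own field names.
as_json = json.dumps(record).encode("utf-8")

# Schema-full: names and types live in the schema, only values are stored.
# "<IHH" = little-endian, one 4-byte and two 2-byte unsigned integers.
as_packed = struct.pack(
    "<IHH", record["user_id"], record["status"], record["elapsed_ms"]
)

print(len(as_json), len(as_packed))
```

The packed form is 8 bytes per record and every value sits at a fixed offset, which is also what makes access on memory or disk cheaper.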


Query response improvements of retrospection

Schema-full + indexed (RDBMS)

Query plan optimization

Schema on read

I/O and Task size optimization & scale out

Schema-less + indexed (Mongo)

mmap-ed index & data (!)


Query response improvements of prospection

Time window + incremental calculation

Stream processing engines
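A time window with incremental calculation can be sketched as follows. This is not how any particular engine implements it: the class `TumblingCount` is hypothetical, it uses fixed (tumbling) windows only, and it assumes events arrive in timestamp order.

```python
from collections import defaultdict

class TumblingCount:
    """Counts events per key in fixed-size time windows, updated on each event."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.window_start = None          # start time of the open window
        self.counts = defaultdict(int)    # running result for the open window

    def add(self, timestamp, key):
        """Feed one event; return (window_start, counts) when a window closes."""
        start = timestamp - timestamp % self.window_sec
        closed = None
        if self.window_start is not None and start > self.window_start:
            # The event belongs to a later window: emit the finished one.
            closed = (self.window_start, dict(self.counts))
            self.counts.clear()
        if self.window_start is None or start > self.window_start:
            self.window_start = start
        self.counts[key] += 1             # incremental update, no rescan
        return closed

w = TumblingCount(60)
for ts, key in [(0, "a"), (30, "b"), (59, "a"), (61, "a")]:
    result = w.add(ts, key)
    if result is not None:
        print(result)
```

Each event costs one counter update, and the answer for a window is ready the moment the window closes, without re-reading any stored data.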


Stream processing and data size

No disks: reduction of failure points

Less memory:

size of just processing and I/O buffers

aggregation results

Easy to distribute:

stream duplication

stream splitting by aggregation key
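Splitting a stream by aggregation key can be sketched like this. The helper names `route` and `split`, the worker count, and the use of CRC32 as the hash are assumptions for illustration; the point is only that a given key always goes to the same worker.

```python
import zlib
from collections import Counter

def route(key, n_workers):
    """Pick a worker deterministically, so one key always lands on one worker."""
    return zlib.crc32(key.encode("utf-8")) % n_workers

def split(events, n_workers):
    """Split a stream of (key, value) events into one sub-stream per worker."""
    partitions = [[] for _ in range(n_workers)]
    for key, value in events:
        partitions[route(key, n_workers)].append((key, value))
    return partitions

events = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Each worker aggregates only its own sub-stream.
partials = []
for part in split(events, 3):
    sums = Counter()
    for key, value in part:
        sums[key] += value
    partials.append(sums)

# No key spans two workers, so merging is a plain union of the partial results.
merged = {}
for sums in partials:
    merged.update(sums)
print(merged)
```

Because workers share no state, adding workers only changes the routing function. That is what makes this kind of processing easy to distribute.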


Stream processing and schema

Stream processing: query -> data

Prospective schema by queries:

Queries know the required fields and their types

Unused fields can be ignored

Implicit type conversion available

Schema-less data + schema-full queries
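The idea of a schema defined by the query can be sketched as a projection step. This is not Norikra's API or implementation: `QUERY_FIELDS` and `project` are hypothetical names, and the rule of skipping events that lack a required field is an assumption of this sketch.

```python
# Hypothetical query definition: the fields it needs and the types it reads them as.
QUERY_FIELDS = {"status": int, "elapsed": float}

def project(event, fields=QUERY_FIELDS):
    """Return the event as the query sees it, or None when it cannot be used."""
    row = {}
    for name, typ in fields.items():
        if name not in event:
            return None                   # required field missing: skip the event
        try:
            row[name] = typ(event[name])  # implicit type conversion
        except (TypeError, ValueError):
            return None                   # value cannot be read as the required type
    return row                            # unused fields are simply left out

# Schema-less input: fields and types differ from event to event.
events = [
    {"status": "200", "elapsed": "0.12", "agent": "curl"},
    {"status": 500, "elapsed": 1, "path": "/login"},
    {"path": "/"},
]

rows = [row for row in map(project, events) if row is not None]
print(rows)
```

The data stays schema-less on the way in, and each query sees exactly the typed fields it asked for.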


My goal: Schema-less data stream + schema-full queries

It’s Norikra!

