data onboarding
Posted on 25-Jan-2017
TRANSCRIPT
Copyright 2014 Splunk Inc.
Data Onboarding
Ingestion without the
Indigestion
Jeff Meyers, Sales Engineer
Major components involved in data indexing
What happens to data within Splunk
What the data pipeline is & how to influence it
Shaping data understanding via props.conf
Configuring data inputs via inputs.conf
What goes where
Heavy Forwarders vs. Universal Forwarders
How to get your data into Splunk (mostly correctly)
~60 minutes from now...
What is the Data Onboarding Process?
Systematic way to bring new data sources into Splunk
Make sure that new data is instantly usable & has maximum value for users
Goes hand-in-hand with the User Onboarding process (sold separately)
Machine Data > Business Value
Index Untapped Data: Any Source, Type, Volume
Sources: online services, web services, servers, security, GPS location, storage, desktops, networks, packaged applications, custom applications, messaging, telecoms, online shopping carts, web clickstreams, databases, energy meters, call detail records, smartphones and devices, RFID
Deployment: on-premises, private cloud, public cloud
Use cases: Ask Any Question, Application Delivery, Security, Compliance and Fraud, IT Operations, Business Analytics, Industrial Data and the Internet of Things
Flavors of Machine Data
Order Processing, Twitter, Care IVR, Middleware Error
Getting Data Into Splunk
Agent and Agent-less Approach for Flexibility
perf
shell code
Mounted File Systems \\hostname\mount
syslog TCP/UDP
WMI Event Logs Performance
AcMve Directory
syslog compaMble hosts and network devices
Unix, Linux and Windows hosts
Windows hosts Custom apps and scripted API connecMons
Local File Monitoring log files, config files dumps and trace files
Windows Inputs Event Logs
performance counters registry monitoring
AcAve Directory monitoring
virtual host
Windows hosts
Scripted Inputs shell scripts custom parsers batch loading
Agent-less Data Input Splunk Forwarder
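The input types above are declared in inputs.conf stanzas. A hedged sketch, assuming hypothetical paths, sourcetypes and index names (none of these values come from the talk):

```ini
# Hypothetical inputs.conf sketch of the input types above.

# Local file monitoring (log files, config files, dumps, traces)
[monitor:///var/log/myapp/app.log]
sourcetype = myapp:app
index = myapp

# Agent-less syslog from network devices
[udp://514]
sourcetype = syslog
connection_host = dns

# Scripted input (shell script, custom parser, batch loading)
[script://./bin/poll_status.sh]
interval = 60
sourcetype = myapp:status

# Windows Event Log input (on a Windows host)
[WinEventLog://Security]
disabled = 0
```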
Splunk Data Ingest
[Diagram: multiple UFs and an HF feed an IDX, which an SH searches]
Splunk Enterprise (with optional configs)
Splunk Universal Forwarder
Summary: when it comes to "core" Splunk, there are two distinct products: the Splunk Universal Forwarder and Splunk Enterprise. "Everything else" (Indexer, Search Head, License Server, Deployment Server, Cluster Master, Deployer, Heavy Forwarder, etc.) are all instances of Splunk Enterprise with varying configs.
Data Pipeline (what the what?)
The Data Pipeline
Any Questions?
Inputs: where it all starts
Input processors: Monitor, FIFO, UDP, TCP, Scripted
No events yet: just a stream of bytes
Break data stream into 64KB blocks
Annotate stream with metadata keys (host, source, sourcetype, index, etc.)
Can happen on UF, HF or indexer
Parsing
Check character set
Break lines
Process headers
Can happen on HF or indexer
Aggregation/Merging
Merge lines for multi-line events
Identify events (finally!)
Extract timestamps
Exclude events based on timestamp (MAX_DAYS_AGO, ...)
Can happen on HF or indexer
Typing
Do regex replacement (field extraction, punctuation extraction, event routing, host/source/sourcetype overrides)
Annotate events with metadata keys (host, source, sourcetype, ...)
Can happen on HF or indexer
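The regex replacement and sourcetype overrides in the typing phase are driven by props.conf/transforms.conf pairs. A hedged sketch, assuming a hypothetical source path, stanza name and regex:

```ini
# props.conf -- hypothetical index-time sourcetype override applied
# during the typing phase; path and regex are illustrative only.
[source::/var/log/netdevices/...]
TRANSFORMS-set_st = force_asa_sourcetype

# transforms.conf
[force_asa_sourcetype]
REGEX = %ASA-\d-\d{6}
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::cisco:asa
```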
Indexing
Output processors: TCP, syslog, HTTP, indexAndForward
Sign blocks
Calculate license volume and throughput metrics
Index: [write to disk] / [forward elsewhere] / ...
Can happen on HF or indexer
Data Pipeline: UF & Indexer
Data Pipeline: HF & Indexer
Data Pipeline: UF, IF & Indexer
UF vs. HF
209.160.24.63 - - [23/Feb/2016:18:22:16] "GET /oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7AD...
209.160.24.63 - - [23/Feb/2016:18:22:17] "GET /product.screen?productId=BS-AG-G09&JSESSION...
209.160.24.63 - - [23/Feb/2016:18:22:19] "POST /category.screen?categoryId=STRATEGY&JSESSI...
209.160.24.63 - - [23/Feb/2016:18:22:20] "GET /product.screen?productId=FS-SG-G03&JSESSION...
209.160.24.63 - - [23/Feb/2016:18:22:20] "POST /cart.do?action=addtocart&itemId=EST-21&pro...
209.160.24.63 - - [23/Feb/2016:18:22:21] "POST /cart.do?action=purchase&itemId=EST-21&JSES...
209.160.24.63 - - [23/Feb/2016:18:22:22] "POST /cart/success.do?JSESSIONID=SD0SL6FF7ADFF49...
209.160.24.63 - - [23/Feb/2016:18:22:21] "GET /cart.do?action=remove&itemId=EST-11&product...
209.160.24.63 - - [23/Feb/2016:18:22:22] "GET /oldlink?itemId=EST-14&JSESSIONID=SD0SL6FF7A...
112.111.162.4 - - [23/Feb/2016:18:26:36] "GET /product.screen?productId=WC-SH-G04&JSESSION...
209.160.24.63 - - [23/Feb/2016:18:22:16] "GET /oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7AD...
209.160.24.63 - - [23/Feb/2016:18:22:17] "GET /product.screen?productId=BS-AG-G09&SSN=xxxyyyzzz...
sourcetype=access_combined, _time=1456251739, index=foo, host=bar,
sourcetype=access_combined, _time=1456251739, index=foo, host=bar,
sourcetype=access_combined, index=foo, host=bar,
UF emits chunks of data
HF emits events
Splunk Data Ingest
[Diagram: the HF (Parsing) and UFs (Not Parsing) feed the IDX, which the SH searches]
Note: the data is parsed at the first component that has a parsing engine, and not again. This affects where you put certain props.conf and transforms.conf files (a.k.a. sometimes they go on the forwarder).
Data Onboarding Process (bringing it together)
Identify the specific sourcetype(s): onboard each separately
Check for pre-existing app/TA on splunk.com -- don't reinvent the wheel!
Gather info:
Where does this data originate/reside? How will Splunk collect it?
Which users/groups will need access to this data? Access controls?
Determine the indexing volume and data retention requirements
Will this data need to drive existing dashboards (ES, PCI, etc.)?
Who is the SME for this data?
Map it out:
Get a "big enough" sample of the event data
Identify and map out fields
Assign sourcetype and TA names according to CIM conventions
On-boarding Process: Dev
Create (or use) an app
Props / inputs definition
Sourcetype definition
Use data import wizard
Import, tweak, repeat
Oneshot [hook up monitor]
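The "oneshot" step above indexes a sample file exactly once, without leaving a monitor behind, which makes it handy for iterating on a sourcetype definition. A hedged sketch using the standard splunk CLI; the file path, sourcetype and index are hypothetical:

```shell
# Index a sample file once (no ongoing monitor); re-run after
# each props.conf tweak in the import/tweak/repeat loop.
$SPLUNK_HOME/bin/splunk add oneshot /tmp/myapp_sample.log \
    -sourcetype myapp:app -index test
```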
On-boarding Process: Test and Prod
Test: deploy app, oneshot, validate, hook up monitor, validate
Prod: deploy app, validate, monitor
Good Hygiene
General: Use apps for configs
Use TAs / add-ons from Splunk if possible
Use dev, test, prod (dev can be laptop, test can be ephemeral)
UF when possible; HF only if filtering / transforming is required in foreign land
Unique sourcetype per event stream
Don't send data through Search Heads
Don't send data direct to Indexers
Good Hygiene
inputs.conf: As specific as possible
Set sourcetype, if possible; don't let Splunk auto-sourcetype (no ...too_small)
Specify index if possible
props.conf: Set TIME_PREFIX, TIME_FORMAT, MAX_TIMESTAMP_LOOKAHEAD
Optimally: SHOULD_LINEMERGE = false, LINE_BREAKER, TRUNCATE
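A hedged sketch of what the hygiene rules above look like in practice; the path, sourcetype, index and timestamp format are illustrative assumptions, not values from the talk:

```ini
# inputs.conf: as specific as possible, explicit sourcetype and index
[monitor:///var/log/myapp/app.log]
sourcetype = myapp:app
index = myapp

# props.conf: explicit timestamping and line breaking
[myapp:app]
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
MAX_TIMESTAMP_LOOKAHEAD = 25
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
```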
Data Onboarding Process (details)
Pre-Board
Identify the specific sourcetype(s): onboard each separately
Check for pre-existing app/TA on splunk.com -- don't reinvent the wheel!
Gather info: where the data originates/resides, how Splunk will collect it, which users/groups need access (access controls), the indexing volume and data retention requirements, any existing dashboards (ES, PCI, etc.) it must drive, and the SME for this data
Map it out: get a "big enough" sample of the event data, identify and map out fields, assign sourcetype and TA names according to CIM conventions
Tangent: What is the CIM and why should I care?
The Common Information Model (CIM) defines relationships in the underlying data, while leaving the raw machine data intact
A naming convention for fields, eventtypes & tags
More advanced reporting and correlation requires that the data be normalized, categorized, and parsed
CIM-compliant data sources can drive CIM-based dashboards (ES, PCI, others)
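CIM normalization is typically done at search time in props.conf, e.g. with field aliases and calculated fields. A hedged sketch; the sourcetype and the original field names (client_ip, login, status) are hypothetical:

```ini
# props.conf -- hypothetical search-time normalization toward
# CIM field names (src, user, action); values are illustrative.
[myapp:app]
FIELDALIAS-cim_src  = client_ip AS src
FIELDALIAS-cim_user = login AS user
EVAL-action = if(status < 400, "success", "failure")
```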
Build the index-time configs
Identify necessary configs (inputs, props and transforms) to properly handle:
timestamp extraction, timezone, event breaking, sourcetype/host/source assignments
Do events contain sensitive data (e.g., PII, PAN)? Create masking transforms if necessary
Package all index-time configs into the TA
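Masking sensitive values like PANs can be done with a SEDCMD in props.conf. A hedged sketch that keeps only the last four digits; the sourcetype and regex are illustrative assumptions:

```ini
# props.conf -- hypothetical index-time masking of 16-digit card
# numbers, preserving the last four digits for correlation.
[myapp:app]
SEDCMD-mask_pan = s/\d{12}(\d{4})/xxxxxxxxxxxx\1/g
```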
Tangent: Best & Worst Practices
Assign sourcetype according to event format; events with similar format should have the same sourcetype
When do I need a separate index?
When the data volume will be very large, or when it will be searched exclusively a lot
When access to the data needs to be controlled
When the data requires a specific data retention policy
Resist the temptation to create lots of indexes
Always specify a sourcetype
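When a separate index is justified by a retention requirement, the policy lives in indexes.conf. A hedged sketch of a dedicated index with roughly 90-day retention; the index name is hypothetical:

```ini
# indexes.conf -- hypothetical dedicated index; events are frozen
# (deleted by default) after 90 days (90 * 86400 seconds).
[myapp]
homePath   = $SPLUNK_DB/myapp/db
coldPath   = $SPLUNK_DB/myapp/colddb
thawedPath = $SPLUNK_DB/myapp/thaweddb
frozenTimePeriodInSecs = 7776000
```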