nl hug 2016 feb hadoop security from the trenches

Hadoop Security from the Trenches Bolke de Bruin Chief Wizard

Upload: bolke-de-bruin

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Nl HUG 2016 Feb Hadoop security from the trenches

Hadoop Security from the Trenches
Bolke de Bruin, Chief Wizard

Page 2: Nl HUG 2016 Feb Hadoop security from the trenches

This is not going to be a perfect talk
• It will be incomplete (squeezed for time)
• Probably without humor (I am just really bad at telling jokes)
• I have a disclaimer (I work for a bank)
• A lot of text in orange (ING is Oranje)

Page 3: Nl HUG 2016 Feb Hadoop security from the trenches

Agenda

• Security today in Hadoop
• Kerberos (in depth)
• Policy-based access
• Lineage (a bit)
• Encryption

Page 4: Nl HUG 2016 Feb Hadoop security from the trenches

Information security principles

Confidentiality
• Information is not made available or disclosed to unauthorized individuals, entities, or processes

Integrity
• Maintaining and assuring the accuracy and completeness of data over its entire lifecycle

Availability
• Data must be available when it is needed

Unfortunately, most of the attention in Hadoop goes to confidentiality

Page 5: Nl HUG 2016 Feb Hadoop security from the trenches

Security today in Hadoop

Authentication (Who am I?)
• Kerberos
• Apache Knox

Authorization (What can I do?)
• Apache Ranger
• Apache Sentry

Audit (What did I do?)
• Apache Ranger
• Cloudera Navigator

Data Protection (Can someone read my data?)
• SSL
• SASL
• KMS

Data Governance (Where did my data come from and where is it going?)
• Apache Atlas
• Cloudera Navigator

Identity Management

Page 6: Nl HUG 2016 Feb Hadoop security from the trenches

Taming Kerberos, the ferocious three-headed guard dog of Hadoop

Page 7: Nl HUG 2016 Feb Hadoop security from the trenches

Typical workflow

Page 8: Nl HUG 2016 Feb Hadoop security from the trenches

Kerberized workflow

Use ticket
Hive gets service ticket
HDFS gets service ticket

Page 9: Nl HUG 2016 Feb Hadoop security from the trenches

Kerberos has great advantages…

• Requires that each client and each request prove its identity
• Does not require a user to enter a password every time a service is requested
• Works across operating systems
• Kerberos assumes that network connections, rather than servers and workstations, are the weak link in network security

• Did you know that Active Directory is just Kerberos+LDAP?
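The single sign-on property listed above can be illustrated with a toy model. This is only a sketch: real Kerberos encrypts tickets and uses session keys, whereas this model merely signs a payload with HMAC. The class `ToyKDC`, the function `service_accepts`, the keys, and the lifetimes are all hypothetical names and values chosen for illustration.

```python
import hashlib
import hmac
import json


def _sign(key, payload):
    """Compute a MAC over a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def _seal(key, payload):
    """Bundle a payload with its MAC; this stands in for an encrypted ticket."""
    return {"payload": payload, "mac": _sign(key, payload)}


def _verify(key, ticket, now):
    """A ticket is valid if its MAC matches and it has not expired."""
    expected = _sign(key, ticket["payload"])
    return hmac.compare_digest(expected, ticket["mac"]) and now < ticket["payload"]["expires"]


class ToyKDC:
    """Toy key distribution centre: the password is checked once, at kinit."""

    def __init__(self, tgs_key, service_keys, passwords):
        self.tgs_key = tgs_key            # key only the KDC knows
        self.service_keys = service_keys  # one shared key per service (the "keytab")
        self.passwords = passwords

    def kinit(self, user, password, now, lifetime=36000):
        """Exchange a password for a ticket-granting ticket (TGT)."""
        if self.passwords.get(user) != password:
            raise PermissionError("bad credentials")
        return _seal(self.tgs_key, {"user": user, "for": "krbtgt", "expires": now + lifetime})

    def service_ticket(self, tgt, service, now, lifetime=600):
        """Exchange a valid TGT for a service ticket; no password involved."""
        if not _verify(self.tgs_key, tgt, now):
            raise PermissionError("invalid or expired TGT")
        payload = {"user": tgt["payload"]["user"], "for": service, "expires": now + lifetime}
        return _seal(self.service_keys[service], payload)


def service_accepts(service, service_key, ticket, now):
    """A service checks the ticket with its own key, without contacting the KDC."""
    return ticket["payload"]["for"] == service and _verify(service_key, ticket, now)
```

In this model the user authenticates once with `kinit`, after which tickets for Hive and HDFS are obtained from the TGT alone, and a ticket issued for one service is rejected by the other.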

Page 10: Nl HUG 2016 Feb Hadoop security from the trenches

…but its perceived complexity has stopped implementation

• AS, KDC, TGS, SS, TGT, KINIT, KEYTAB, KADMIN: so many abbreviations…
  • But you just need to remember a few: kinit, keytab, kdc

• Synchronization of host clocks required
  • Wait, what? You didn’t do that yet? Your local cloud provider already does this for you.

• Separate user databases if combined with LDAP or PAM
  • Well, there is Active Directory and there is FreeIPA

• Tool Xxx is not kerberized and I really need it
  • Insecure: don’t use it, or add patches yourself. Yeah, open source!
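Why clock synchronization matters can be sketched in a few lines. Kerberos rejects requests whose timestamps differ too much from the server's clock; MIT Kerberos uses a default tolerance (`clockskew`) of 300 seconds. The function name `within_skew` is hypothetical and only illustrates the comparison, not the actual library code.

```python
from datetime import datetime, timedelta, timezone

# Default tolerance in MIT Kerberos (krb5.conf "clockskew"); sites may configure another value.
MAX_SKEW = timedelta(seconds=300)


def within_skew(client_time, kdc_time, max_skew=MAX_SKEW):
    """Return True if the two clocks are close enough for a request to be accepted."""
    return abs(client_time - kdc_time) <= max_skew
```

A host whose clock drifts by more than the tolerance fails authentication even with valid credentials, which is why NTP (or the cloud provider's time service) has to be in place first.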

Page 11: Nl HUG 2016 Feb Hadoop security from the trenches

Ehh FreeIPA?

Looks familiar doesn’t it? Oh yes this is Active Directory!

Page 12: Nl HUG 2016 Feb Hadoop security from the trenches

Integration in an Enterprise environment

• Fully integrated with Operating System and Hadoop

• UserIDs are the same, shared and immediate

• Can use PAM

• YARN and HDFS ACLs start working out of the box, as local users just exist. That is the big stuff!

Page 13: Nl HUG 2016 Feb Hadoop security from the trenches

Installing is difficult, right?

• Server
  • # yum -y install ipa-server
  • # ipa-server-install

• Client
  • # yum -y install ipa-client
  • # ipa-client-install

Page 14: Nl HUG 2016 Feb Hadoop security from the trenches

Support in Hadoop distributions is slightly lagging

Quite easy actually: gen_credentials.sh just needs to be adjusted: http://blog.godatadriven.com/samba-configuration.html (for IPA it needs to be adjusted)

https://github.com/HariSekhon/tools/blob/master/ambari_freeipa_kerberos_setup.pl

Written by an ex-Cloudera guy ;-)

Page 15: Nl HUG 2016 Feb Hadoop security from the trenches

Caveats

• Trusted domains deliver users with “username@REALM”, Hadoop and Hive filter on ‘@’
  • See: https://issues.apache.org/jira/browse/HADOOP-12751
  • See: https://issues.apache.org/jira/browse/HIVE-12981

• Workaround: convert @ to _ by means of sssd
  • full_name_format = %1$s_%2$s
  • re_expression = (((?P<Name>[^@]+)_(?P<Domain>.+$))|((?P<Domain>[^\\]+)\\(?P<Name>.+$))|((?P<Name>[^@]+)@(?P<Domain>.+$))|(^(?P<Name>[^@\\]+)$))

• Or just wait for the patches to land
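The effect of the sssd workaround can be approximated in Python. This is a sketch, not sssd's own matcher: Python's `re` module rejects duplicate group names within one pattern, so the four alternatives of the `re_expression` are compiled separately and tried in the same order, and whole-string matching is assumed. The function names `split_login` and `format_full_name` are hypothetical.

```python
import re

# The four alternatives from the slide's re_expression, in the same order.
_ALTERNATIVES = [
    re.compile(r"(?P<Name>[^@]+)_(?P<Domain>.+)"),    # name_DOMAIN
    re.compile(r"(?P<Domain>[^\\]+)\\(?P<Name>.+)"),  # DOMAIN\name
    re.compile(r"(?P<Name>[^@]+)@(?P<Domain>.+)"),    # name@DOMAIN
    re.compile(r"(?P<Name>[^@\\]+)"),                 # bare name
]


def split_login(login):
    """Return (name, domain) for a login string; domain is None for a bare name."""
    for pattern in _ALTERNATIVES:
        match = pattern.fullmatch(login)
        if match:
            groups = match.groupdict()
            return groups["Name"], groups.get("Domain")
    raise ValueError(f"unparseable login: {login!r}")


def format_full_name(name, domain):
    """Mimic full_name_format = %1$s_%2$s: name, then domain, joined by '_'."""
    return name if domain is None else f"{name}_{domain}"
```

With these definitions, `format_full_name(*split_login("alice@EXAMPLE.COM"))` yields a name without ‘@’ that parses back to the same name and domain. One thing to watch in this approximation: a bare name that itself contains an underscore is split at its last underscore by the first alternative; whether sssd behaves identically has not been verified here.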

Page 16: Nl HUG 2016 Feb Hadoop security from the trenches

Data access policies and auditing with Ranger

Page 17: Nl HUG 2016 Feb Hadoop security from the trenches

How are policies applied?

Where is Spark?

Page 18: Nl HUG 2016 Feb Hadoop security from the trenches

Active policies

Page 19: Nl HUG 2016 Feb Hadoop security from the trenches
Page 20: Nl HUG 2016 Feb Hadoop security from the trenches

Caveats

• Ranger (but also Sentry) feels like slapped-on security. Just usable, but barely
• User synchronization can be very slow with many users due to architecture issues
• Unix synchronization and authentication use /etc/passwd and /etc/group instead of NSS and PAM
  • https://issues.apache.org/jira/browse/RANGER-842
  • https://issues.apache.org/jira/browse/RANGER-827
  • If these patches land, syncing will be much faster for IPA/SSSD-enabled systems

• No real Spark roadmap, just spark-sql. This also goes for Sentry
• Doesn’t manage HDFS ACLs and requires Hive user access… defeating end-to-end security
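The difference between reading /etc/passwd and asking NSS can be sketched as follows. Python's `pwd` module resolves users through the system's name service (local files, sssd, LDAP, depending on configuration), whereas parsing the file only ever sees local accounts. The function names are hypothetical, and this is not how Ranger's user sync is implemented.

```python
import pwd


def users_from_passwd_text(text):
    """Parse passwd(5)-formatted text and return the login names it lists."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        names.append(line.split(":", 1)[0])
    return names


def nss_knows(username):
    """True if the name service (files, sssd, ldap, ...) can resolve the user."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False
```

On an IPA/SSSD-enrolled host, a directory user would typically be resolvable through `nss_knows` while being absent from the list parsed out of /etc/passwd, which is the gap the caveat above describes.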

Page 21: Nl HUG 2016 Feb Hadoop security from the trenches

Data Governance

• Why?
  • We need to be able to pinpoint what data resides where, why, and what happened to it
• Why?
  • Because you might want us to remove your data
  • … and the regulator says so

Page 22: Nl HUG 2016 Feb Hadoop security from the trenches

Encryption

• Data at rest
  • Used if you don’t trust your physical infrastructure. Cloud!
  • Only our highest confidentiality levels require it; we are not at that level, so we don’t use it

• Data in transit
  • Data across untrusted networks. Cloud?
  • Perimeter security solves a lot of these issues; you take a significant performance hit of around 20% if you enable it within your cluster
  • For ETL or data ingestion it becomes more reasonable
  • For us it is enabled for access TO the cluster, NOT WITHIN

• Data democratization
  • Use case: allow some data scientists to see the original data and some to see only the masked/anonymized data
  • We are tinkering with this
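The data democratization use case can be sketched as a per-reader view over a record. This is only an illustration under stated assumptions: the function names, the key, and the choice of a keyed hash for masking are hypothetical and are not the approach described in the talk, which only says it is being tinkered with.

```python
import hashlib
import hmac

# Hypothetical demo key; in practice it would come from a key management service.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value, key=SECRET_KEY):
    """Replace a value with a 16-hex-character keyed hash; equal inputs give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def view_record(record, sensitive_fields, may_see_original):
    """Return the record unchanged for privileged readers, masked for everyone else."""
    if may_see_original:
        return dict(record)
    return {
        field: pseudonymize(str(value)) if field in sensitive_fields else value
        for field, value in record.items()
    }
```

Because the masking is deterministic, masked datasets can still be joined on the masked column, while only readers with the extra privilege see the original values.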

Page 23: Nl HUG 2016 Feb Hadoop security from the trenches

An example architecture

Page 24: Nl HUG 2016 Feb Hadoop security from the trenches

We are hiring!


Frank Derks, John Muller, Pooja Rao, Hylke Hendriksen
Giovanni Lanzini, Fabian Jansen, Hanneke van Veldhuizen, Johan Witman
Wendell Kuling, Jonas Ahrendt, Bolke de Bruin, Ivo Everts
Doron Reuter, Zhe Sun