nl hug 2016 feb hadoop security from the trenches

Hadoop Security from the Trenches
Bolke de Bruin
Chief Wizard

This is not going to be a perfect talk
• It will be incomplete (squeezed for time)
• Probably without humor (I am just really bad at telling jokes)
• I have a disclaimer (I work for a bank)
• A lot of text in orange (ING is Oranje)

Agenda

• Security today in Hadoop
• Kerberos (in depth)
• Policy-based access
• Lineage (a bit)
• Encryption

Information security principles

Confidentiality
• Information is not made available or disclosed to unauthorized individuals, entities, or processes

Integrity
• Maintaining and assuring the accuracy and completeness of data over its entire lifecycle

Availability
• Data must be available when it is needed

Unfortunately most of the attention in Hadoop goes to confidentiality

Security today in Hadoop

Authentication: Who am I?
• Kerberos
• Apache Knox

Authorization: What can I do?
• Apache Ranger
• Apache Sentry

Audit: What did I do?
• Apache Ranger
• Cloudera Navigator

Data Protection: Can someone read my data?
• SSL
• SASL
• KMS

Data Governance: Where did my data come from and where is it going?
• Apache Atlas
• Cloudera Navigator

Identity Management

Taming Kerberos, the ferocious three-headed guard dog of Hadoop

Typical workflow

Kerberized workflow

• Use ticket
• Hive gets service ticket
• HDFS gets service ticket

Kerberos has great advantages…

• Requires that each client, on each request, prove its identity
• Does not require a user to enter a password every time a service is requested
• Works across operating systems
• Kerberos assumes that network connections, rather than servers and workstations, are the weak link in network security
• Did you know that Active Directory is just Kerberos + LDAP?

…but its perceived complexity has stopped implementation

• AS, KDC, TGS, SS, TGT, KINIT, KEYTAB, KADMIN: so many abbreviations…
  • But you just need to remember a few: kinit, keytab, kdc
• Synchronization of host clocks required
  • Wait, what? You didn’t do that yet? Your local cloud provider already does this for you.
• Separate user databases if combined with LDAP or PAM
  • Well, there is Active Directory and there is FreeIPA
• Tool Xxx is not kerberized and I really need it
  • It is insecure: don’t use it, or add patches yourself. Yeah, open source!
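Why the clock synchronization requirement matters: Kerberos tickets and authenticators carry timestamps, and a request is rejected when client and server clocks differ by more than the allowed skew. A minimal sketch of that check, assuming the MIT Kerberos default tolerance of 300 seconds; the function name `within_skew` is made up for illustration and is not part of any Kerberos library.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 300 seconds, the default clock skew tolerance in MIT Kerberos.
MAX_SKEW = timedelta(seconds=300)


def within_skew(client_time, kdc_time, max_skew=MAX_SKEW):
    """Return True if the two clocks are close enough for Kerberos to accept."""
    return abs(client_time - kdc_time) <= max_skew


if __name__ == "__main__":
    kdc = datetime(2016, 2, 1, 12, 0, 0, tzinfo=timezone.utc)
    drifting_host = kdc + timedelta(minutes=6)
    # A host that drifted six minutes would see its requests rejected.
    print(within_skew(drifting_host, kdc))
```

In practice you would not write this check yourself; running NTP or chrony on every host keeps the cluster inside the tolerance.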

Ehh FreeIPA?

Looks familiar, doesn’t it? Oh yes, this is Active Directory!

Integration in an Enterprise environment

• Fully integrated with Operating System and Hadoop

• UserIDs are the same, shared and immediate

• Can use PAM

• YARN and HDFS ACLs start working out of the box, as local users just exist. That is the big stuff!

Installing is difficult right?

• Server
  # yum -y install ipa-server
  # ipa-server-install

• Client
  # yum -y install ipa-client
  # ipa-client-install

Support in Hadoop distributions is slightly lagging

Quite easy actually: gen_credentials.sh just needs to be adjusted: http://blog.godatadriven.com/samba-configuration.html (for IPA it needs to be adjusted)

https://github.com/HariSekhon/tools/blob/master/ambari_freeipa_kerberos_setup.pl

Written by an ex-Cloudera guy ;-)

Caveats

• Trusted domains deliver users as “username@REALM”; Hadoop and Hive filter on ‘@’
  • See: https://issues.apache.org/jira/browse/HADOOP-12751
  • See: https://issues.apache.org/jira/browse/HIVE-12981
• Workaround: convert @ to _ by means of sssd
  • full_name_format = %1$s_%2$s
  • re_expression = (((?P<Name>[^@]+)_(?P<Domain>.+$))|((?P<Domain>[^\\]+)\\(?P<Name>.+$))|((?P<Name>[^@]+)@(?P<Domain>.+$))|(^(?P<Name>[^@\\]+)$))
• Or just wait for the patches to land
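To see what the sssd workaround does to a login name, here is a rough Python approximation. It is a sketch, not sssd's implementation: Python's `re` module rejects duplicate group names, so the four alternatives of the re_expression are split into separate patterns and tried in order. The function names `split_login` and `full_name` are made up for illustration, and returning a bare name when no domain is present is an assumption (sssd itself would fall back to a default domain).

```python
import re

# The four alternatives from the re_expression, in their original order.
PATTERNS = [
    re.compile(r"(?P<Name>[^@]+)_(?P<Domain>.+$)"),    # name_DOMAIN
    re.compile(r"(?P<Domain>[^\\]+)\\(?P<Name>.+$)"),  # DOMAIN\name
    re.compile(r"(?P<Name>[^@]+)@(?P<Domain>.+$)"),    # name@DOMAIN
    re.compile(r"^(?P<Name>[^@\\]+)$"),                # bare name
]


def split_login(login):
    """Return (name, domain); domain is None for a bare user name."""
    for pattern in PATTERNS:
        match = pattern.match(login)
        if match:
            groups = match.groupdict()
            return groups["Name"], groups.get("Domain")
    raise ValueError("unparseable login: %r" % login)


def full_name(name, domain):
    """Mimic full_name_format = %1$s_%2$s (name first, then domain)."""
    if domain is None:
        return name  # assumption: no domain, keep the bare name
    return "%s_%s" % (name, domain)


if __name__ == "__main__":
    for login in ("alice@EXAMPLE.COM", "EXAMPLE\\alice", "alice"):
        print(login, "->", full_name(*split_login(login)))
```

The point of the rewrite is that the name handed to Hadoop and Hive no longer contains an ‘@’, so their filtering does not trip over it.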

Data access policies and auditing with Ranger

How are policies applied?

Where is Spark?

Active policies

Caveats

• Ranger (but also Sentry) feels like slapped-on security: just usable, but barely
• User synchronization can be very slow with many users due to architecture issues
• Unix synchronization and authentication use /etc/passwd and /etc/group instead of NSS and PAM
  • https://issues.apache.org/jira/browse/RANGER-842
  • https://issues.apache.org/jira/browse/RANGER-827
  • If these patches land, syncing will be much faster for IPA/SSSD-enabled systems
• No real Spark roadmap, just spark-sql. This also goes for Sentry
• Doesn’t manage HDFS ACLs and requires Hive user access… defeating end-to-end security

Data Governance

• Why?
  • We need to be able to pinpoint what data resides where, why, and what happened to it.
• Why?
  • Because you might want us to remove your data
  • … and the regulator says so

Encryption

• Data at rest
  • Used if you don’t trust your physical infrastructure. Cloud!
  • Only our highest confidentiality levels require it; we are not at that level, so we don’t use it
• Data in transit
  • Data across untrusted networks. Cloud?
  • Perimeter security solves a lot of these issues; you take a significant performance hit of around 20% if you enable it within your cluster
  • For ETL or data ingestion it becomes more reasonable
  • For us it is enabled for access TO the cluster, NOT WITHIN
• Data democratization
  • Use case: allow some data scientists to see the original data and others only the masked/anonymized data
  • We are tinkering with this
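The data democratization use case can be illustrated with a small sketch: the same record is shown in the clear to some users and with sensitive fields replaced by tokens for others. Everything here is hypothetical (the key, the field names, the functions `pseudonymize` and `view_for`) and it is not the setup described in the talk. Note also that keyed hashing is pseudonymization rather than true anonymization, since equal inputs map to equal tokens.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would fetch this from a key store.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value, key=SECRET_KEY):
    """Replace a value with a stable keyed token (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def view_for(record, sensitive_fields, may_see_original):
    """Return the record as a given user is allowed to see it."""
    if may_see_original:
        return dict(record)
    return {
        field: pseudonymize(str(value)) if field in sensitive_fields else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    row = {"customer": "Jan Jansen", "city": "Amsterdam", "balance": 1200}
    print(view_for(row, {"customer"}, may_see_original=True))
    print(view_for(row, {"customer"}, may_see_original=False))
```

Because the tokens are stable, masked data can still be joined and aggregated per customer without revealing who the customer is.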

An example architecture

We are hiring! [email protected]

Frank Derks, John Muller, Pooja Rao, Hylke Hendriksen

Giovanni Lanzini, Fabian Jansen, Hanneke van Veldhuizen, Johan Witman

Wendell Kuling, Jonas Ahrendt, Bolke de Bruin, Ivo Everts

Doron Reuter

Zhe Sun

Data & Analytics