New dict implementation in Python 3.6
Inada Naoki (@methane)
About me
@methane
K-Labo, KLab Inc.
Python core developer
C, Go, Network (server) programming, MySQL clients
ISUCON 6 winner (See http://isucon.net/ )
Table of contents
● dict in Python
● Python 3.5 implementation
● Python 3.6 implementation
● Toward Python 3.7
Dict in Python
Dict
Key-value storage. A.k.a. associative array, map, hash.
x = {"foo": 42, "bar": 84}
print( x["foo"] ) # => 42
Key features:
● Constant time lookup
● Amortized constant time insertion
● Support for custom (user-defined) key types
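As a small illustration of the custom key type feature, the sketch below defines a hypothetical `Point` class (not from the slides) that implements `__eq__` and `__hash__`, which is what a dict needs to use an object as a key.

```python
class Point:
    """A user-defined key type (hypothetical example class)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # Keys that compare equal must also hash equal.
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


distances = {Point(0, 0): 0.0, Point(3, 4): 5.0}

# A new but equal Point object finds the same entry.
looked_up = distances[Point(3, 4)]
print(looked_up)
```

The dict first uses the hash to pick a slot and then uses `==` to confirm the key, so both methods have to agree.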
Dicts are everywhere in Python
x = 5              # global namespace is a dict. Insert 'x' into it.
def add(a):        # insert 'add' into the global dict
    return a + x   # look up 'x' in the global dict
print(add(7))      # look up 'print' and 'add' starting from the global dict
There are many dicts in a Python program.
Lookup speed is critical.
Insertion speed and memory usage are very important too.
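To make the "global namespace is a dict" point concrete, here is a minimal sketch that runs the slide's code inside an explicit plain dict. The names `source` and `namespace` are illustrative assumptions; `exec` and `__globals__` are standard Python.

```python
source = """
x = 5
def add(a):
    return a + x
"""

namespace = {}            # this plain dict serves as the global namespace
exec(source, namespace)   # 'x' and 'add' are inserted into the dict

add = namespace["add"]
print(add.__globals__ is namespace)  # the function keeps a reference to its global dict
print(add(7))

# Changing the dict changes what the function sees on its next lookup of 'x'.
namespace["x"] = 100
result_after_change = add(7)
print(result_after_change)
```

Every call to `add` performs a dict lookup for `x`, which is why lookup speed matters so much.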
Python 3.5 implementation
d["foo"] = "spam"  # insert new item
hash("foo") = 42   # hash value is 42
42 % 8 = 2         # hash value % hash table size = 2
[Hash table, slots 0-7 (key, hash, value): all slots empty]
d["foo"] = "spam"
hash("foo") = 42
42 % 8 = 2
[Hash table, slots 0-7 (key, hash, value): slot 2 = ("foo", 42, "spam"); other slots empty]
d["bar"] = "ham"
hash("bar") = 52
52 % 8 = 4
[Hash table, slots 0-7 (key, hash, value): slot 2 = ("foo", 42, "spam"); slot 4 = ("bar", 52, "ham"); other slots empty]
d["baz"] = "egg"
hash("baz") = 58
58 % 8 = 2         # "baz" conflicts with "foo"
[Hash table, slots 0-7 (key, hash, value): slot 2 = ("foo", 42, "spam"); slot 4 = ("bar", 52, "ham"); other slots empty]
"Open addressing" uses another slot in the table. (Another strategy is "chaining".)
For example, the "linear probing" algorithm uses the next slot.
Note: Python uses more complex probing, but this example uses the simpler way.
[Hash table, slots 0-7 (key, hash, value): slot 2 = ("foo", 42, "spam"); slot 3 = ("baz", 58, "egg"); slot 4 = ("bar", 52, "ham"); other slots empty]
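The insertion steps above can be sketched as a toy open-addressing table with linear probing. This is a simplified illustration, not CPython's actual probing scheme. The `insert` function and its explicit hash argument `h` are assumptions made so that the slide's illustrative hash values (42, 52, 58) can be reused.

```python
def insert(table, key, h, value):
    """Insert (key, h, value) using open addressing with linear probing."""
    size = len(table)
    for step in range(size):
        slot = (h + step) % size      # start at h % size, then try the next slot
        entry = table[slot]
        if entry is None or entry[0] == key:
            table[slot] = (key, h, value)
            return slot
    raise RuntimeError("table is full")


table = [None] * 8
# Hash values are the illustrative ones from the slides, not real hash() results.
items = [("foo", 42, "spam"), ("bar", 52, "ham"), ("baz", 58, "egg")]

slots = []
for key, h, value in items:
    slots.append(insert(table, key, h, value))

print(slots)  # [2, 4, 3]
```

"baz" hashes to slot 2, finds "foo" there, and moves on to slot 3.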
del d["foo"]
hash("foo") = 42
42 % 8 = 2
[Hash table, slots 0-7 (key, hash, value): slot 2 = ("foo", 42, "spam"); slot 3 = ("baz", 58, "egg"); slot 4 = ("bar", 52, "ham"); other slots empty]
del d["foo"]
hash("foo") = 42
42 % 8 = 2
[Hash table, slots 0-7 (key, hash, value): slot 2 = empty; slot 3 = ("baz", 58, "egg"); slot 4 = ("bar", 52, "ham"); other slots empty]
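x = d["baz"]
hash("baz") = 58
58 % 8 = 2         # slot 2 is now empty, so the lookup would miss "baz" (!!?)
[Hash table, slots 0-7 (key, hash, value): slot 2 = empty; slot 3 = ("baz", 58, "egg"); slot 4 = ("bar", 52, "ham"); other slots empty]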
del d["foo"] leaves a DUMMY key in the slot
[Hash table, slots 0-7 (key, hash, value): slot 2 = DUMMY; slot 3 = ("baz", 58, "egg"); slot 4 = ("bar", 52, "ham"); other slots empty]
x = d["baz"]
hash("baz") = 58
58 % 8 = 2         # conflicts with DUMMY, then linear probing finds "baz" in slot 3
[Hash table, slots 0-7 (key, hash, value): slot 2 = DUMMY; slot 3 = ("baz", 58, "egg"); slot 4 = ("bar", 52, "ham"); other slots empty]
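The deletion problem and the DUMMY fix can be sketched as follows. This is a toy model under stated assumptions: `ProbingTable` and its methods are hypothetical names, hash values are passed in explicitly to match the slides, and linear probing stands in for CPython's real probing.

```python
class ProbingTable:
    """Toy open-addressing table with linear probing and DUMMY markers."""

    DUMMY = object()  # marks a deleted slot so probe chains stay intact

    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, h):
        size = len(self.slots)
        for step in range(size):
            yield (h + step) % size

    def _find(self, key, h):
        for slot in self._probe(h):
            entry = self.slots[slot]
            if entry is None:
                return None            # truly empty slot: the key is absent
            if entry is not self.DUMMY and entry[0] == key:
                return slot            # DUMMY slots are skipped, not treated as empty
        return None

    def insert(self, key, h, value):
        free = None
        for slot in self._probe(h):
            entry = self.slots[slot]
            if entry is None:
                if free is None:
                    free = slot
                break
            if entry is self.DUMMY:
                if free is None:
                    free = slot        # remember the first reusable slot
            elif entry[0] == key:
                free = slot            # existing key: overwrite in place
                break
        if free is None:
            raise RuntimeError("table is full")
        self.slots[free] = (key, h, value)

    def get(self, key, h):
        slot = self._find(key, h)
        if slot is None:
            raise KeyError(key)
        return self.slots[slot][2]

    def delete(self, key, h):
        slot = self._find(key, h)
        if slot is None:
            raise KeyError(key)
        self.slots[slot] = self.DUMMY  # do not reset to None


t = ProbingTable()
t.insert("foo", 42, "spam")
t.insert("bar", 52, "ham")
t.insert("baz", 58, "egg")
t.delete("foo", 42)
found = t.get("baz", 58)
print(found)  # egg
```

If `delete` wrote `None` instead of `DUMMY`, the lookup for "baz" would stop at slot 2 and raise `KeyError`.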
Problems in classical open addressing hash table
● Large memory usage
  ○ At least 1/3 of entries are empty
    ■ Otherwise, "probing" can be too slow
  ○ One entry uses 3 words
    ■ word = 8 bytes on recent machines
  ○ minimum size = 192 bytes
    ■ 8 (bytes/word) * 3 (words/entry) * 8 (table width)
Python 3.6 implementation
Compact and ordered dict
PyPy implemented it in 2015
https://morepypy.blogspot.jp/2015/01/faster-more-memory-efficient-and-more.html
The Python 3.6 dict is almost the same as PyPy's.
Ruby 2.4 and PHP 7 have similar implementations.
d["foo"] = "spam"  # hash("foo") = 42, 42 % 8 = 2
[Index table, slots 0-7: slot 2 -> 0; other slots empty]
[Entries (key, hash, value): 0 = ("foo", 42, "spam")]
d["foo"] = "spam"
d["bar"] = "ham"   # hash("bar") = 52, 52 % 8 = 4
[Index table, slots 0-7: slot 2 -> 0; slot 4 -> 1; other slots empty]
[Entries (key, hash, value): 0 = ("foo", 42, "spam"); 1 = ("bar", 52, "ham")]
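d["foo"] = "spam"
d["bar"] = "ham"
d["baz"] = "egg"
del d["foo"]
[Index table, slots 0-7: slot 2 = DUMMY; slot 3 -> 2; slot 4 -> 1; other slots empty]
[Entries (key, hash, value): 0 = (deleted); 1 = ("bar", 52, "ham"); 2 = ("baz", 58, "egg")]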
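The layout above (a sparse index table pointing into a dense entries array) can be sketched as a toy class. This is a simplified model, not CPython's code: `CompactDict`, the `EMPTY`/`DUMMY` sentinel values, the explicit hash argument, and linear probing are all assumptions made for illustration.

```python
class CompactDict:
    """Toy compact dict: a sparse index table plus a dense entries list."""

    EMPTY = -1
    DUMMY = -2

    def __init__(self, size=8):
        self.indices = [self.EMPTY] * size  # sparse; holds small integers only
        self.entries = []                   # dense; (key, hash, value) in insertion order

    def _find(self, key, h):
        """Return (slot, entry_index); entry_index is None if the key is absent."""
        size = len(self.indices)
        for step in range(size):
            slot = (h + step) % size
            ix = self.indices[slot]
            if ix == self.EMPTY:
                return slot, None
            if ix != self.DUMMY and self.entries[ix][0] == key:
                return slot, ix
        raise RuntimeError("table is full")

    def set(self, key, h, value):
        slot, ix = self._find(key, h)
        if ix is None:
            self.indices[slot] = len(self.entries)  # point at the new entry
            self.entries.append((key, h, value))
        else:
            self.entries[ix] = (key, h, value)

    def get(self, key, h):
        _, ix = self._find(key, h)
        if ix is None:
            raise KeyError(key)
        return self.entries[ix][2]

    def delete(self, key, h):
        slot, ix = self._find(key, h)
        if ix is None:
            raise KeyError(key)
        self.indices[slot] = self.DUMMY
        self.entries[ix] = None  # leave a hole so later positions stay valid

    def keys(self):
        # Iteration walks the dense list, so it follows insertion order.
        return [entry[0] for entry in self.entries if entry is not None]


d = CompactDict()
d.set("foo", 42, "spam")
d.set("bar", 52, "ham")
d.set("baz", 58, "egg")
d.delete("foo", 42)
print(d.indices)  # [-1, -1, -2, 2, 1, -1, -1, -1]
print(d.keys())   # ['bar', 'baz']
```

The index table ends up as DUMMY, 2, 1 in slots 2 to 4, matching the diagram. The extra step from `indices` to `entries` is the additional indirect memory access.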
● Less memory usage
  ○ Index can be 1 byte for small dicts
  ○ 3 * 8 * 5 (entries) + 8 (index table) = 128 bytes
    ■ It was 192 bytes in the legacy implementation
● Faster iteration (dense entries)
● Preserves insertion order
● (cons) One more indirect memory access
New dict vs Legacy dict
Toward Python 3.7
Working on ...
● Remove redundant code that was written to optimize the legacy implementation.
● OrderedDict based on the new dict
  ○ Remove the doubly linked list used to keep order
  ○ About 1/2 memory usage!
  ○ Faster creation and iteration.
  ○ (cons) Slower .move_to_end() method
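For readers unfamiliar with `.move_to_end()`, here is a short usage example with the standard `collections.OrderedDict`. It shows only the public behaviour, not the internal implementation discussed above.

```python
from collections import OrderedDict

od = OrderedDict([("foo", "spam"), ("bar", "ham"), ("baz", "egg")])

od.move_to_end("foo")              # move "foo" to the last position
order_after_last = list(od)

od.move_to_end("baz", last=False)  # move "baz" to the first position
order_after_first = list(od)

print(order_after_last)   # ['bar', 'baz', 'foo']
print(order_after_first)  # ['baz', 'bar', 'foo']
```

With a doubly linked list this reordering is a pointer update. With a dense entries array it means relocating an entry, which is plausibly why the method would get slower.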
We're looking for new contributors
Contributing to Python is easier, thanks to GitHub.
● Read the devguide (https://devguide.python.org/ )
● Find an easy bug on https://bugs.python.org/ and fix it.
● Review others' code
● Translate documentation on Transifex
  ○ See https://docs.python.org/ja/
Future ideas
● Specialized dict for namespaces
  ○ all keys are interned strings
  ○ only pointer comparison
  ○ no "hash" in entry -> more compact
● Implement set like the new dict
  ○ current set is larger than dict...
● functools.lru_cache
  ○ Use `od.move_to_end(key)` instead of a linked list
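The `lru_cache` idea can be illustrated with a minimal LRU cache built on `OrderedDict.move_to_end`. This is only a sketch of the technique, not the actual `functools.lru_cache` implementation. The `LRUCache` class and its method names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache that keeps recency order in an OrderedDict."""

    def __init__(self, maxsize=2):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used item


cache = LRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used
cache.put("c", 3)  # evicts "b"
remaining = list(cache.data)
print(remaining)   # ['a', 'c']
```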
PEP 412: Key sharing dict
Introduced in Python 3.3
Instances of the same class can share a keys object
class A:
    def __init__(self, a, b):
        self.foo = a
        self.bar = b

a = A("spam", "ham")
b = A("bacon", "egg")
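[Diagram: class A holds the shared keys object: an index table (slots 0-7: slot 2 -> 0, slot 4 -> 1) and key/hash entries 0 = ("foo", 42), 1 = ("bar", 52). Each instance holds only its own values array: ("spam", "ham") for one instance and ("bacon", "egg") for the other.]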
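Key sharing is not directly visible from Python code, but its preconditions and its effect on size can be observed. The sketch below reuses class `A` from the slide. The `sys.getsizeof` numbers are CPython- and version-specific, so no expected values are given for them.

```python
import sys


class A:
    def __init__(self, a, b):
        self.foo = a
        self.bar = b


a = A("spam", "ham")
b = A("bacon", "egg")

# Both instances were filled in the same order, so they have the same key layout.
same_keys = list(a.__dict__) == list(b.__dict__) == ["foo", "bar"]
print(same_keys)  # True

# Compare an instance dict with an equivalent standalone dict.
# The exact sizes depend on the interpreter version.
standalone = {"foo": "spam", "bar": "ham"}
print(sys.getsizeof(a.__dict__), sys.getsizeof(standalone))
```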
Problem
● Two instances can have different insertion order
  ○ Drop key sharing dict?
    ■ Key sharing dict can save more memory.
      ● But __slots__ can be used for such cases!
    ■ Performance improvements in some microbenchmarks
      ● Does it matter in real cases? __slots__?
    ■ Needs consensus
      ● It's more difficult than the implementation
Keep key sharing dict support
● Only exactly the same order is permitted
  ○ "skipped" keys are prohibited
  ○ deletion is also prohibited
● Otherwise, stop "key sharing"
  ○ `self.x = None` is faster than `del self.x`
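The last point can be illustrated with a hypothetical `Session` class (not from the slides) that clears an attribute in two ways. Assigning `None` keeps the attribute and its position, while `del` removes the key from the instance dict. The printed sizes are CPython- and version-specific, so no expected values are given for them.

```python
import sys


class Session:  # hypothetical example class
    def __init__(self):
        self.user = "alice"
        self.token = "secret"

    def logout_keep_layout(self):
        self.token = None  # attribute stays; key order is unchanged

    def logout_delete(self):
        del self.token     # removes the key from the instance dict


kept = Session()
kept.logout_keep_layout()

deleted = Session()
deleted.logout_delete()

print(list(kept.__dict__))     # ['user', 'token']
print(list(deleted.__dict__))  # ['user']

# Sizes vary by interpreter version; compare them on your own build.
print(sys.getsizeof(kept.__dict__), sys.getsizeof(deleted.__dict__))
```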