Python Data Structures

13 minute read

#Number Representation in Computer

Computers operate using Boolean numbers(base 2) while humans use numbers with base 10. To store base 10 numbers computers need to encode it in Boolean form. This is commonly done using Int or float format.

The 32-bit int data type can hold integer values in the range of −2,147,483,648 to 2,147,483,647 [$- \left( 2 ^ { 31 } \right) \text { to } 2 ^ { 31 } - 1$] and 64 bit is in the range $- \left( 2 ^ { 63 } \right) \text { to } 2 ^ { 63 } - 1$

32 bit Int

To store larger numbers we use float which is a formulaic representation of real numbers as an approximation so as to support a trade-off between range and precision.

Single precision float with width of 32 bits can support range of ±1.18×10−38 to ±3.4×1038 with precision of approximately 7 decimal digits while double precision 64 bits can support range of ±2.23×10−308 to ±1.80×10308 with precision of approximately 16 decimal digits

Single precision float

Double precision float

Big O Notation

Big O notation is used in Computer Science to describe the performance or complexity of an algorithm. Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used

Common functions used in algorithm analysis

Growth rates of Common functions used in algorithm analysis(Source Goodrich and Tamassia)

<img src="https://he-s3.s3.amazonaws.com/media/uploads/c14cb1f.JPG"">

Comparison of time and space complexity common data structures

Recursion

An alternative to iteration using for or while loops is recursion. Recursion is a technique by which a function makes one or more calls to itself during execution

A Recursive Implementation of the Factorial Function

def factorial(n):
   """Function to return the factorial
   of a number using recursion"""
   if n == 0:
       return n
   else:
       return n*factorial(n-1)

The time efficiency is O(n), as there are n + 1 activations, each of which accounts for O(1) operations.

One concern with using recursion over iteration is the limit of recursion depth for the function calls ( typical default value is 1000 ). If this limit is reached, the Python interpreter raises a RuntimeError with a message, maximum recursion depth exceeded. Also in an interpreted language such as Python recursion may run slower than iterative alternatives.

Dynamic Arrays

When creating a low-level array in a computer system, the precise size of that array must be explicitly declared . However Python’s list class allows us to add elements to the list using an abstraction called Dynamic Array.

If an element is appended to a list at a time when the underlying array is full, we perform the following steps:

1.Allocate a new array B with larger capacity(Usually double the size of the existing array). 2.Set B[i] = A[i], for i = 0 to n – 1 3.Set A = B, that is use B as the array 4.Insert the new element in the new array

Dynamic Array
import ctypes  # provides low-level arrays

class DynamicArray(object):    

    def __init__(self):
        self.n = 0 # Count actual elements (Default is 0)
        self.capacity = 1 # Default Capacity
        self.A = self.make_array(self.capacity)

    def __len__(self):
        """
        Return number of elements sorted in array
        """
        return self.n

    def __getitem__(self,k):
        """
        Return element at index k
        """
        if not 0 <= k <self.n:
            return IndexError('K is out of bounds!') # Check it k index is in bounds of array

        return self.A[k] #Retrieve from array at index k

    def append(self, ele):
        """
        Add element to end of the array
        """
        if self.n == self.capacity:
            self._resize(2*self.capacity) #Double capacity if not enough room

        self.A[self.n] = ele #Set self.n index to element
        self.n += 1

    def _resize(self,new_cap):
        """
        Resize internal array to capacity new_cap
        """

        B = self.make_array(new_cap) # New bigger array

        for k in range(self.n): # Reference all existing values
            B[k] = self.A[k]

        self.A = B # Call A the new bigger array
        self.capacity = new_cap # Reset the capacity

    def make_array(self,new_cap):
        """
        Returns a new array with new_cap capacity
        """
        return (new_cap * ctypes.py_object)()

The amortized running time of each append operation is O(1); hence, the total running time of n append operations is O(n).

Stack

A stack is a collection of objects that are inserted and removed according to the last-in, first-out (LIFO) principle.

class Stack:

	def __init__(self):
		self.stack = []

	def isEmpty(self):
		return self.stack == []

	def push(self, data):
		self.stack.append(data)

	def pop(self):
		data = self.stack[-1]
		del self.stack[-1]
		return data

	def top(self):
		return self.stack[-1]

	def sizeStack(self):
		return len(self.stack)

Time complexity is O(1) time for push and pop (amortized bounds)

Queues

Queue is a collection of objects that are inserted and removed according to the first-in, first-out (FIFO) principle.

class Queue:

	def __init__(self):
		self.queue = []

	def isEmpty(self):
		return self.queue == []

	def enqueue(self, data):
		self.queue.append(data)

	def dequeue(self):
		data = self.queue[0]
		del self.queue[0]
		return data

	def top(self):
		return self.queue[0]

	def sizeQueue(self):
		return len(self.queue)

Enqueue and Dequeue, which have amortized bounds of O(1) time

Double-Ended Queues

Queue-like data structure that supports insertion and deletion at both the front and the back of the queue is called a double-ended queue, or deque.

class Deque:
    def __init__(self):
        self.items = []

    def isEmpty(self):
        return self.items == []

    def addFront(self, item):
        self.items.append(item)

    def addRear(self, item):
        self.items.insert(0,item)

    def removeFront(self):
        return self.items.pop()

    def removeRear(self):
        return self.items.pop(0)

    def size(self):
        return len(self.items)

The deque class has O(1)-time operations at either end, but O(n)-time when operating near the middle of the deque.

Singly Linked Lists

A singly linked list, in its simplest form, is a collection of nodes that collectively form a linear sequence. Each node stores a reference to an object that is an element of the sequence, as well as a reference to the next node of the list

singly linked list
class Node(object):

    def __init__(self,value):

        self.value = value
        self.nextnode = None

Doubly Linked Lists

A linked list in which each node keeps a reference to the node before it and a reference to the node after it, is known as a doubly linked list.

Double linked list
class DoublyLinkedListNode(object):

    def __init__(self,value):

        self.value = value
        self.next_node = None
        self.prev_node = None

Link-Based vs. Array-Based Sequences

Advantages of Array-Based Sequences

  • Arrays provide O(1)-time access to any element. In contrast, locating the kth element in a linked list requires O(k) or possibly O(n – k) time
  • Array-based representations typically use proportionally less memory than linked structures as memory must be devoted to references that link the nodes.
  • Link-based structures support O(1)-time insertions and deletions at arbitrary positions.

Binary Search Trees

A binary search tree is a binary tree which additionally satisfies the binary search property. The key in each node must be greater than or equal to any key stored in the left sub-tree, and less than or equal to any key stored in the right sub-tree

Binary Search Tree
class Node(object):

	def __init__(self, data):
		self.data = data;
		self.leftChild = None;
		self.rightChild = None;

class BinarySearchTree(object):

	def __init__(self):
		self.root = None;

	def insert(self, data):
		if not self.root:
			self.root = Node(data);
		else:
			self.insertNode(data, self.root);

	def insertNode(self, data, node):

		if data < node.data:
			if node.leftChild:
				self.insertNode(data, node.leftChild);
			else:
				node.leftChild = Node(data);
		else:
			if node.rightChild:
				self.insertNode(data, node.rightChild);
			else:
				node.rightChild = Node(data);

	def removeNode(self, data, node):

		if not node:
			return node;

		if data < node.data:
			node.leftChild = self.removeNode(data, node.leftChild);
		elif data > node.data:
			node.rightChild = self.removeNode(data, node.rightChild);
		else:

			if not node.leftChild and not node.rightChild:
				print("Removing a leaf node...");
				del node;
				return None;

			if not node.leftChild:  
				print("Removing a node with single right child...");
				tempNode = node.rightChild;
				del node;
				return tempNode;
			elif not node.rightChild:
				print("Removing a node with single left child...");
				tempNode = node.leftChild;
				del node;
				return tempNode;

			print("Removing node with two children....");
			tempNode = self.getPredecessor(node.leftChild);   
			node.data = tempNode.data;
			node.leftChild = self.removeNode(tempNode.data, node.leftChild);

		return node;

	def getPredecessor(self, node):

		if node.rightChild:
			return self.getPredeccor(node.rightChild);

		return node;

	def remove(self, data):
		if self.root:
			self.root = self.removeNode(data, self.root);


	def getMinValue(self):
		if self.root:
			return self.getMin(self.root);

	def getMin(self, node):

		if node.leftChild:
			return self.getMin(node.leftChild);

		return node.data;

	def getMaxValue(self):
		if self.root:
			return self.getMax(self.root);

	def getMax(self, node):

		if node.rightChild:
			return self.getMax(node.rightChild);

		return node.data;

	def traverse(self):
		if self.root:
			self.traverseInOrder(self.root);		

	def traverseInOrder(self, node):
		if node.leftChild:
			self.traverseInOrder(node.leftChild);

		print("%s " % node.data);

		if node.rightChild:
			self.traverseInOrder(node.rightChild);			

Tree Traversal

There are many ways to traverse this Tree

Depth First Traversals: (a) Inorder (Left, Root, Right) (b) Preorder (Root, Left, Right) (c) Postorder (Left, Right, Root) (d) Breadth First or Level Order Traversal (visit every node on a level before going to a lower level)

An in-order traversal of a binary search tree will always result in a sorted list of node items. Pre-order traversal or a post-order traversal do not make sense for BST

Priority queue

This is a collection of prioritized elements that allows arbitrary element insertion, and allows the removal of the element that has first priority. When an element is added to a priority queue, the user designates its priority by providing an associated key. The element with the minimum key will be the next to be removed from the queue

An Efficient realization of a priority queue can be done using a data structure called a binary heap. This data structure allows us to perform both insertions and removals in logarithmic time. Heap is a tree that satisfies the heap property: if P is a parent node of C, then the key (the value) of P is either greater than or equal to (in a max heap) or less than or equal to (in a min heap) the key of C

Binary Max Heap

class Heap(object):
	HEAP_SIZE = 10

	def __init__(self):
		self.heap = [0]*Heap.HEAP_SIZE;
		self.currentPosition = -1;

	def insert(self, item):
		if self.isFull():
			print("Heap is full..");
			return

		self.currentPosition = self.currentPosition + 1
		self.heap[self.currentPosition] = item
		self.fixUp(self.currentPosition)

	def fixUp(self, index):
		parentIndex = int((index-1)/2)		
		while parentIndex >= 0 and self.heap[parentIndex] < self.heap[index]:
			temp = self.heap[index]
			self.heap[index] = self.heap[parentIndex]
			self.heap[parentIndex] = temp;
			parentIndex = (int)((index-1)/2)

	def heapsort(self):
		for i in range(0,self.currentPosition+1):
			temp = self.heap[0]
			print("%d " % temp)
			self.heap[0] = self.heap[self.currentPosition-i]
			self.heap[self.currentPosition-i] = temp
			self.fixDown(0,self.currentPosition-i-1)

	def fixDown(self, index, upto):
		while index <= upto:

			leftChild = 2*index+1
			rightChild = 2*index+2

			if leftChild < upto:
				childToSwap = None

				if rightChild > upto:
					childToSwap = leftChild
				else:
					if self.heap[leftChild] > self.heap[rightChild]:
						childToSwap = leftChild
					else:
						childToSwap = rightChild

				if self.heap[index] < self.heap[childToSwap]:
					temp = self.heap[index]
					self.heap[index] = self.heap[childToSwap]
					self.heap[childToSwap] = temp
				else:
					break

				index = childToSwap
			else:
				break;							

	def isFull(self):
		if self.currentPosition == Heap.HEAP_SIZE:
			return True
		else:
			return False		

We can use the heap to sort which is called Heapsort. It is an in-place algorithm with performance of O (n log n)

Heap Sort

Hash Tables

Hash table is a data structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions where the hash function generates the same index for more than one key. Chained hash tables with linked lists are popular to resolve collisions.

Load Factor = $= \frac { n } { k }$ where n is the number of entries occupied in the hash table and k is the number of buckets.

Hash table has a performance of $\Theta \left( 1 + \frac { n } { k } \right)$. As the load factor grows larger, the hash table becomes slower. Hence we may have to resize the hash table.

Hash Table
class HashTable(object):

    def __init__(self,size):

        # Set up size and slots and data
        self.size = size
        self.slots = [None] * self.size
        self.data = [None] * self.size

    def put(self,key,data):
        #Note, we'll only use integer keys for ease of use with the Hash Function

        # Get the hash value
        hashvalue = self.hashfunction(key,len(self.slots))

        # If Slot is Empty
        if self.slots[hashvalue] == None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data

        else:

            # If key already exists, replace old value
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  

            # Otherwise, find the next available slot
            else:

                nextslot = self.rehash(hashvalue,len(self.slots))

                # Get to the next slot
                while self.slots[nextslot] != None and self.slots[nextslot] != key:
                    nextslot = self.rehash(nextslot,len(self.slots))

                # Set new key, if NONE
                if self.slots[nextslot] == None:
                    self.slots[nextslot]=key
                    self.data[nextslot]=data

                # Otherwise replace old value
                else:
                    self.data[nextslot] = data

    def hashfunction(self,key,size):
        # Remainder Method
        return key%size

    def rehash(self,oldhash,size):
        # For finding next possible positions
        return (oldhash+1)%size


    def get(self,key):

        # Getting items given a key

        # Set up variables for our search
        startslot = self.hashfunction(key,len(self.slots))
        data = None
        stop = False
        found = False
        position = startslot

        # Until we discern that its not empty or found (and haven't stopped yet)
        while self.slots[position] != None and not found and not stop:

            if self.slots[position] == key:
                found = True
                data = self.data[position]

            else:
                position=self.rehash(position,len(self.slots))
                if position == startslot:

                    stop = True
        return data

    # Special Methods for use with Python indexing
    def __getitem__(self,key):
        return self.get(key)

    def __setitem__(self,key,data):
        self.put(key,data)

Balanced Search Trees

A standard binary search tree supports O(logn) expected running times for the basic operations. However, we may only claim O(n) worst-case time, because some sequences of operations may lead to an unbalanced tree with height proportional to n. A self-balancing binary search tree is one that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.

A commonly used implementation of this is Red Black tree

Red Black Tree
from .binary_search_tree import TreeMap
class RedBlackTreeMap(TreeMap):
  """Sorted map implementation using a red-black tree."""

  #-------------------------- nested _Node class --------------------------
  class _Node(TreeMap._Node):
    """Node class for red-black tree maintains bit that denotes color."""
    __slots__ = '_red'     # add additional data member to the Node class

    def __init__(self, element, parent=None, left=None, right=None):
      super().__init__(element, parent, left, right)
      self._red = True     # new node red by default

  #------------------------- positional-based utility methods -------------------------
  # we consider a nonexistent child to be trivially black
  def _set_red(self, p): p._node._red = True
  def _set_black(self, p): p._node._red = False
  def _set_color(self, p, make_red): p._node._red = make_red
  def _is_red(self, p): return p is not None and p._node._red
  def _is_red_leaf(self, p): return self._is_red(p) and self.is_leaf(p)

  def _get_red_child(self, p):
    """Return a red child of p (or None if no such child)."""
    for child in (self.left(p), self.right(p)):
      if self._is_red(child):
        return child
    return None

  #------------------------- support for insertions -------------------------
  def _rebalance_insert(self, p):
    self._resolve_red(p)                         # new node is always red

  def _resolve_red(self, p):
    if self.is_root(p):
      self._set_black(p)                         # make root black
    else:
      parent = self.parent(p)
      if self._is_red(parent):                   # double red problem
        uncle = self.sibling(parent)
        if not self._is_red(uncle):              # Case 1: misshapen 4-node
          middle = self._restructure(p)          # do trinode restructuring
          self._set_black(middle)                # and then fix colors
          self._set_red(self.left(middle))
          self._set_red(self.right(middle))
        else:                                    # Case 2: overfull 5-node
          grand = self.parent(parent)            
          self._set_red(grand)                   # grandparent becomes red
          self._set_black(self.left(grand))      # its children become black
          self._set_black(self.right(grand))
          self._resolve_red(grand)               # recur at red grandparent

  #------------------------- support for deletions -------------------------
  def _rebalance_delete(self, p):
    if len(self) == 1:                                     
      self._set_black(self.root())  # special case: ensure that root is black
    elif p is not None:
      n = self.num_children(p)
      if n == 1:                    # deficit exists unless child is a red leaf
        c = next(self.children(p))
        if not self._is_red_leaf(c):
          self._fix_deficit(p, c)
      elif n == 2:                  # removed black node with red child
        if self._is_red_leaf(self.left(p)):
          self._set_black(self.left(p))
        else:
          self._set_black(self.right(p))

  def _fix_deficit(self, z, y):
    """Resolve black deficit at z, where y is the root of z's heavier subtree."""
    if not self._is_red(y): # y is black; will apply Case 1 or 2
      x = self._get_red_child(y)
      if x is not None: # Case 1: y is black and has red child x; do "transfer"
        old_color = self._is_red(z)
        middle = self._restructure(x)
        self._set_color(middle, old_color)   # middle gets old color of z
        self._set_black(self.left(middle))   # children become black
        self._set_black(self.right(middle))
      else: # Case 2: y is black, but no red children; recolor as "fusion"
        self._set_red(y)
        if self._is_red(z):
          self._set_black(z)                 # this resolves the problem
        elif not self.is_root(z):
          self._fix_deficit(self.parent(z), self.sibling(z)) # recur upward
    else: # Case 3: y is red; rotate misaligned 3-node and repeat
      self._rotate(y)
      self._set_black(y)
      self._set_red(z)
      if z == self.right(y):
        self._fix_deficit(z, self.left(z))
      else:
        self._fix_deficit(z, self.right(z))

Source: Goodrich and Tamassia book

Updated: