So far in previous ML posts, we've customized an algorihim to work with the structure offered by numpy arrays representing matrices. Depending on the algorithim applied, numpy arrays may not be the most efficient structure to work with.

What are we talking about here? Data Structures are a specialized means of organizing and storing data in computers in such a way that we can perform operations on the stored data more efficiently.

Early in my coding career, I discovered that in the programming world, we abstract data in a structure that can be represented as class object nodes. Specifically, while on assignment at Bank of America Merchant services by direction of my superior, I did some deep diving into MS SQL online transactional processing relational database engines and how they work. Internally, this particular product would rely upon applying algorithims to a data structure called a B-tree. Shout out to Itzik Ben-Gan, who wrote some literature that became foundational to my understanding of relational database management engines. My boss at BAMS happened to be involved with the technical review of Itzik's popular literature. These books were strongly recommended during my project onboarding. I have these books close by to date via Kindle.

B-trees as an astraction of data, presented themselves to me as difficult to comprehend. Before learning of them, my world consisted of tables or views: database objects that we may imagine as behaving like a spreadsheet. The table object in my in my mind's eye was revealed as an abstract way to view B-trees, underneath.

These B-tree structures and the algorithims that are applied to them are abstracted away from the analyst or administrator working in the database by means of structured query language or SQL. This opened up the natural observation that it would be critical for me to understand more about data structures if I want to go a layer deeper than the SQL abstractions. All of the sudden, there was more to the universe than user defined functions, procedures, and table objects. B-trees and their "pages" are a variant of binary search trees, a node-based binary tree data structure.

Scope

In this post we'll illustrate our knowledge of common data structures, how to work with them. Our aim here is to level set some understanding that we'll need before jumping into the topic of graph neural networks. You see, graphs are their own classification of custom data structure and graph neural networks, or GNN's have some interesting applications in the ML world.

Where to begin? My knowledge is finite. In honesty, after satisfying my initial curiousity, the subject would be revisted in very few instances in my career. Outside of the deprecated queue, I have hardly see custom structures in production during my career. In 2018, I learned how to build an algorithim that functions in traversing a linked list in fear that it would come up one day in a technical interview. To this day, it never has.

To ramp up the basics of data structures, we should discuss

  1. linked lists,
  2. binary search trees,
  3. and graphs.

With an understanding of these three data structures, our readers should be able to follow subsequent discussions around GNN's. Let's state that as our goal. One of the objectives of this blog is to demonstrate my knowledge and transfer what I know to those who care to learn more.

Other common data structures include arrays, hash tables. These are variants of python lists and dictionaries. I'm not going to spend time on those here. We've spent some time working with them in previous posts, already. Remember we just want the bottom line up front when it comes to the mysterious prerequisites for working with GNN's. I will say before learning minimal nuances, my experiences in working with arrays and hash table like structures are rooted in JavaScript front end development and backend development with Django, .NET, nodeJS. To that end, I was first exposed to hash tables before gaining object oriented experience by way of witnessing the HASH JOIN SQL operation at BAMS and following up with Itzik's literature.

In common practices, detailed understanding data structures and algorithms are abstracted away through the use of data science libraries like sci kit learn. I figure if we want to get to the good stuff, we need to at least cover the basics for how some of these internal mechanics of ML training and prediction processes function. Let's also state the application of data structures and algorithims extend far beyond machine learning.

Data Generation

If data structures are specialized means of organizing data in memory or storage, such that we can work with them more efficiently, we need to start our journey by producing a dataset. I am a student of Mathematics: as previous member of an educational institution, yes, but also in life beyond school.

A sequence of data points or numbers that has always intrigued me includes prime numbers. For the sake of having data values to work with, let's create an algorithim that functions in returning the first n members of the prime number sequence.

Given

A prime number is a whole number greater than 1 whose only factors are 1 and itself. A factor is a whole number that can be divided evenly into another number. The first few prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23 and 29.

Implementation

This function could easily be implemented as a method of a custom class object acting as an instantiation of a custom data structure.

def genPrimeLi(n=10):
  try:
    assert n > 0, "number of elements must be greater than zero"
    assert isinstance(n, int), "number of elements must be an integer"
    seq = []
    seq.append(2)
    idx = 1
    if n > 1:
      idx += 1
      delta = 1
      factor = 2
      lastElement = 3
      seq.append(lastElement)
    while idx < n:
      nextElement = lastElement + delta
      if nextElement % factor != 0:
        factor += 1
      elif factor == nextElement:
        seq.append(nextElement)
        lastElement = nextElement
        idx += 1
        delta = 1
        factor = 2
      else:
        delta += 1
        factor = 2
    return seq
  except AssertionError as msg:
    print(msg)
genPrimeLi()
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Linked Lists

Now that we've generated a sequence of prime numbers structured as a List, we'll generate the same sequence as a LinkedList.

class Node:
  def __init__(self, data=None):
    self.data = data
    self.next = None
  def __repr__(self):
    return str(self.data)
class LinkedList:
  def __init__(self):
    self.head = None
  def __repr__(self):
    node = self.head
    nodes = []
    while node != None:
      nodes.append(str(node.data))
      node = node.next
    nodes.append('empty')
    return ' >> '.join(nodes)
  def append(self, data=None):
    node = self.head
    if self.head == None:
      self.head = Node(data)
    else:
      while node.next != None:
        node = node.next
      node.next = Node(data)
  def traverse(self, func):
    try:
      assert callable(func), "function must be callable"
      assert self.head != None, "LinkedList is empty"
      node = self.head
      while node != None:
        func(node.data)
        node = node.next
    except AssertionError as msg:
      print(msg)
ll = LinkedList()
ll.append(1)
ll.append(2)
ll.append(3)
ll
1 >> 2 >> 3 >> empty
primeLinkedList = LinkedList()
for e in genPrimeLi():
  primeLinkedList.append(e)
primeLinkedList
2 >> 3 >> 5 >> 7 >> 11 >> 13 >> 17 >> 19 >> 23 >> 29 >> empty
primeLinkedList.traverse(print)
2
3
5
7
11
13
17
19
23
29

Additional methods for LinkedList class might include insert, count, delete. There are variations of the linked list. Explore the web for the deprecated queue and other variants.

Binary Search Tree

Now that we've generated a sequence of prime numbers structured as a LinkedList, we'll generate the same sequence as a binary search tree. A binary search tree, and a linked list are both special classes of graphs that are acyclic.

class BSTNode:
  def __init__(self, data=None):
    self.data = data
    self.right = None
    self.left = None
class BinarySearchTree:
  def __init__(self):
    self.head = None

!UNDER CONSTRUCTION, CHECK BACK FOR MORE SOON!