Lab2: Deep Learning Basics and Learned Index¶

In Lab2, you will learn deep learning basics and use them to build an NN4Sys application, Learned Index. For the basics, we borrow heavily from the textbook Dive into Deep Learning (d2l), in particular S3.2. Object-Oriented Design for Implementation, which provides a nice "object-oriented" interface for organizing deep learning code, along with some useful utility functions. In the second part, you will train a learned index of your own.

Section 0: Getting Started¶

Environment preparation¶

  1. install conda or miniconda (see here)
  2. $ conda create -n cs7670 python=3 : create a conda environment called "cs7670"
  3. $ conda activate cs7670 : activate this environment
  4. $ pip install d2l==1.0.0a1 : install necessary packages from d2l
  5. $ pip install torch torchvision termcolor : install PyTorch and a color printing package
  6. $ conda install ipykernel : install IPython kernel

Lab2 preparation¶

  1. Click the GitHub Lab2 link on the GitHub homepage to create your Lab2 clone on GitHub.
  2. Open a Linux terminal.
  3. Clone Lab2 repo to your local machine:
    $ cd ~  
    $ git clone git@github.com:NEU-CS7670-labs/lab2-<Your-GitHub-Username>.git lab2

Note that the repo address git@github.com:... can be obtained by going to the GitHub repo page (your cloned lab2), clicking the green "Code" button, and then choosing "SSH".

  4. Check contents:
    $ cd ~/lab2; ls
    // you should see:
    FighterJet.mp4  Lab2.ipynb  data  utils.ipynb
  5. Start your lab2:
    $ cd ~/lab2
    $ conda activate cs7670  # if you haven't
    $ jupyter-notebook

This should open your default browser. Click the file named Lab2.ipynb.

A note about Jupyter notebooks: if you're not familiar with Jupyter notebooks, here is a quick tutorial. We will only use the basics, and you don't have to be an expert in this tool.

Section 1: Understanding DL training interfaces¶

This section is a revised version of Object-Oriented Design for Implementation from d2l.ai.

Below, you should start running the code snippets (click the "Run" button on the top toolbar, or press "Ctrl + Enter" by default).

In [ ]:
from termcolor import colored

def info(msg):
    assert isinstance(msg, str)
    print(colored(msg, "magenta", attrs=['bold']))

info("Active environment should be cs7670:")
! conda info | grep 'active env'
In [ ]:
import time
import random
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

Section 1.1: Utility functions¶

a) add function to class¶

We need a few utilities to simplify object-oriented programming in Jupyter notebooks. One of the challenges is that class definitions tend to be fairly long blocks of code. Notebook readability demands short code fragments, interspersed with explanations, a requirement incompatible with the style of programming common for Python libraries. The first utility function allows us to register functions as methods in a class after the class has been created. In fact, we can do so even after we’ve created instances of the class! It allows us to split the implementation of a class into multiple code blocks.

In [ ]:
def add_to_class(Class):
    """Register a function as a method of `Class` after the class has been created."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper
In [ ]:
class HelloWorld:
    def __init__(self):
        super().__init__()
        self.msg = "nothing"

# create an instance of HelloWorld
hello = HelloWorld()
info(f"an instance of class HelloWorld has a message hello.msg=``{hello.msg}''")

Next let's add one function to the above class HelloWorld.

In [ ]:
# add one function to class
@add_to_class(HelloWorld)
def update_msg(self, msg):
    self.msg = msg
    
# update the msg to "hello world"
hello.update_msg("hello world")

info(f"the same instance now has a message hello.msg=``{hello.msg}''")
In [ ]:
# Exercise: use "@add_to_class" helper to implement a function "print_msg" 
# that prints "self.msg" stored in the HelloWorld instance. 

# TODO: your code here


# print message
info('Expected to see "hello world"')
hello.print_msg()

b) HyperParameters¶

The second one is a utility class that saves all arguments in a class’s __init__ method as class attributes. This allows us to extend constructor call signatures implicitly without additional code.

In [ ]:
# The HyperParameters class saves all arguments in a class's
# `__init__` method as class attributes.

class HyperParameters:
    def save_hyperparameters(self, ignore=[]):
        # saves all arguments in a class's `__init__` method as class attributes.
        pass
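
The full implementation is provided by d2l, which we will use below. For intuition only, here is a minimal sketch (under the name HyperParametersSketch, which is not part of d2l) of how save_hyperparameters could be written with Python's inspect module:

import inspect

class HyperParametersSketch:
    def save_hyperparameters(self, ignore=[]):
        # look at the caller's frame, i.e., the `__init__` that invoked us
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        # keep every argument except `self`, private names, and anything in `ignore`
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in set(ignore + ['self']) and not k.startswith('_')}
        for k, v in self.hparams.items():
            setattr(self, k, v)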

To use it, we define our class that inherits from HyperParameters and calls save_hyperparameters in the __init__ method.

In [ ]:
# Call the fully implemented HyperParameters class saved in d2l
class A(d2l.HyperParameters):    
    def __init__(self, a, b):
        print('self.a =', self.a)

info("you should see an AttributeError.\n")
tmp = A(a=1, b=2)
In [ ]:
@add_to_class(A)
def __init__(self, a, b):
    self.save_hyperparameters(ignore=['b'])
    print('self.a =', self.a)
    print('There is no self.b =', not hasattr(self, 'b'))
    
info("you should see no errors now.")
tmp = A(a=1, b=2)

c) ProgressBoard: plotting figures¶

The last utility allows us to plot experiment progress interactively while it is going on. In deference to the much more powerful (and complex) TensorBoard we name it ProgressBoard.

The draw method plots a point (x, y) in the figure, with label specified in the legend. The optional every_n smooths the line by only showing 1/n of the points in the figure; their values are averaged from the n neighboring points in the original figure.

In [ ]:
board = d2l.ProgressBoard('this is name')
for x in np.arange(0, 10, 0.1):
    board.draw(x, np.sin(x), 'sin', every_n=2)
    board.draw(x, np.cos(x), 'cos', every_n=10)

Section 1.2: Module base class¶

The Module class (Lab2Module below) is the base class of all models we will implement. At a minimum we need to define three methods:

  • The __init__ method stores the learnable parameters,
  • the training_step method accepts a data batch to return the loss value,
  • the configure_optimizers method returns the optimization method, or a list of them, that is used to update the learnable parameters.
In [ ]:
class Lab2Module(nn.Module, d2l.HyperParameters):
    
    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
        super().__init__()
        self.save_hyperparameters()
        self.board = d2l.ProgressBoard()
        
    def loss(self, y_hat, y):
        raise NotImplementedError

    def forward(self, X):
        assert hasattr(self, 'net'), 'Neural network is not defined'
        return self.net(X)

    def plot(self, key, value, train):
        """Plot a point in animation."""
        assert hasattr(self, 'trainer'), 'Trainer is not inited'
        self.board.xlabel = 'epoch'
        if train:
            x = self.trainer.train_batch_idx / \
                self.trainer.num_train_batches
            n = self.trainer.num_train_batches / \
                self.plot_train_per_epoch
        else:
            x = self.trainer.epoch + 1
            n = self.trainer.num_val_batches / \
                self.plot_valid_per_epoch
        self.board.draw(x, value.to(d2l.cpu()).detach().numpy(),
                        ('train_' if train else 'val_') + key,
                        every_n=int(n))

    def training_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('loss', l, train=True)
        return l

    def configure_optimizers(self):
        raise NotImplementedError

You may notice that Lab2Module is a subclass of nn.Module, the base class of neural networks in PyTorch. It provides convenient features for handling neural networks. For example, if we define a forward method, such as forward(self, X), then for an instance a we can invoke this function by a(X). This works because nn.Module's built-in __call__ method calls forward.
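
To see this dispatch in action, here is a tiny standalone check (the toy nn.Linear model below is only for illustration):

import torch
from torch import nn

tiny = nn.Sequential(nn.Linear(2, 1))
x = torch.randn(4, 2)
# calling the instance goes through nn.Module.__call__, which invokes forward()
print(torch.equal(tiny(x), tiny.forward(x)))   # prints True: the two calls give the same result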

Section 1.3: Data base class¶

The DataModule class (Lab2Data below) is the base class for data. Quite frequently the __init__ method is used to prepare the data. This includes downloading and preprocessing if needed. The train_dataloader returns the data loader for the training dataset. A data loader is a (Python) generator that yields a data batch each time it is used. This batch is then fed into the training_step method of Module to compute the loss.

In [ ]:
class Lab2Data(d2l.HyperParameters):
    
    def __init__(self, root='./data', num_workers=4):
        self.save_hyperparameters()

    def get_dataloader(self, train):
        raise NotImplementedError

    def train_dataloader(self):
        return self.get_dataloader(train=True)
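
As a standalone illustration (the toy tensors below are made up), each iteration of a data loader yields one batch of (inputs, labels):

import torch

X = torch.arange(10, dtype=torch.float32).reshape(-1, 1)  # 10 toy inputs
Y = 2 * X                                                  # 10 toy labels
dataset = torch.utils.data.TensorDataset(X, Y)
loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)
for xb, yb in loader:
    # one batch per iteration; the last batch may be smaller than batch_size
    print(xb.shape, yb.shape)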

Section 1.4: Training base class¶

The Trainer class (Lab2Trainer below) trains the learnable parameters in the Module class with data specified in DataModule. The key method is fit, which accepts two arguments: model, an instance of Module, and data, an instance of DataModule. It then iterates over the entire dataset max_epochs times to train the model.

In [ ]:
class Lab2Trainer(d2l.HyperParameters):
    
    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
        self.save_hyperparameters()
        assert num_gpus == 0, 'No GPU support yet'

    def prepare_data(self, data):
        self.train_dataloader = data.train_dataloader()
        self.num_train_batches = len(self.train_dataloader)

    def prepare_model(self, model):
        model.trainer = self
        model.board.xlim = [0, self.max_epochs]
        self.model = model

    def fit(self, model, data):
        self.prepare_data(data)
        self.prepare_model(model)
        self.optim = model.configure_optimizers()
        self.epoch = 0
        self.train_batch_idx = 0
        self.val_batch_idx = 0
        for self.epoch in range(self.max_epochs):
            self.fit_epoch()

    def fit_epoch(self):
        raise NotImplementedError

Get yourself reasonably comfortable with how Module, Data, and Trainer interact with each other, because you will soon need to fill in the methods that currently raise NotImplementedError.

Section 2: a toy example, AI fighter jet¶

Watch the "AI fighter jet" problem in the video: either see it here or play ~/lab2/FighterJet.mp4.

To summarize, we want to train a fighter-jet NN that follows a safety rule:
the fighter jet fires iff the number of missiles on-the-fly is greater than zero.
In particular, when no missiles are on the fly, the jet must not fire, no matter how many other (enemy) jets exist.
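
Stated as a plain Python predicate (a hypothetical helper, not the tensor code you will write in Exercise 1), the rule is simply:

def should_fire(num_jets, num_missiles):
    # the decision depends only on the missiles on the fly, not on the number of jets
    return num_missiles > 0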

Exercise 1: implement FighterJetData¶

You will implement a dataset (fill in __init__) for the fighter jet, and use it for training.

  • self.X will store the NN inputs (see below). It is a tensor whose rows are pairs of [#jets, #missiles].
  • self.Y will store the outputs, one firing score per input row: 0 means do-not-fire, 1 means fire.
  • You need to write code to produce self.Y in __init__ according to our safety rule.
In [ ]:
class FighterJetData(Lab2Data):
    
    def __init__(self, num_train=1000, batch_size=32):
        super().__init__()
        self.save_hyperparameters()

        # prepare training inputs
        n = num_train           # total number of instances
        jets = torch.randint(0, 20, (n,)).float()  # get a random #jets from [0,20)
        missiles = torch.randint(0, 3, (n,)).float()  # get a random #missiles from [0,3)
        self.X = torch.stack((jets, missiles), -1) # stack tensors to [[#jets, #missiles], ...]
        
        # TODO: your code here
        self.Y = None
    
    def get_dataloader(self, train):
        assert train, "We only use this dataset for training."
        dataset = torch.utils.data.TensorDataset(self.X, self.Y)
        return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)
In [ ]:
info("""
you should see something like:
    x= tensor([[13.,  1.]]) y= tensor([[1.]])
    x= tensor([[14.,  0.]]) y= tensor([[0.]])
    x= tensor([[1., 2.]]) y= tensor([[1.]])
    x= tensor([[6., 0.]]) y= tensor([[0.]])
    x= tensor([[5., 2.]]) y= tensor([[1.]])
    x= tensor([[4., 1.]]) y= tensor([[1.]])
    x= tensor([[15.,  0.]]) y= tensor([[0.]])
    x= tensor([[12.,  0.]]) y= tensor([[0.]])
    x= tensor([[8., 2.]]) y= tensor([[1.]])
    x= tensor([[18.,  1.]]) y= tensor([[1.]])

Check if the output value follows our safety rule:
  x[1]>0 => y=1  and  x[1]=0 => y=0
If not, you need to fix it.
""")

a = FighterJetData(10,1)
for x,y in a.train_dataloader():
    print("x=",x, "y=",y)

Exercise 2: implement FighterJetModule¶

Next, you will implement a NN to learn from the training data.

  • A simple implementation choice is a multilayer perceptron (MLP), i.e., a fully connected feed-forward network.
  • Read d2l S5.2.2.1 to see how to define one in PyTorch; a generic sketch follows this list.
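
In the spirit of d2l S5.2.2.1, a generic MLP in PyTorch is an nn.Sequential of Linear layers alternating with non-linearities; the layer sizes below are arbitrary placeholders, not the ones you should use:

from torch import nn

# a generic MLP sketch: affine (Linear) layers alternating with ReLU activations
mlp = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
print(mlp)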
In [ ]:
class FighterJetModule(Lab2Module):
    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
        super().__init__()
        self.save_hyperparameters()
        self.board = d2l.ProgressBoard()
        
        # TODO: your code here
        self.net = None
In [ ]:
info("""
you should see something like: 
    Sequential(
      (0): Linear(in_features=2, out_features=16, bias=True)
      (1): ReLU()
      (2): Linear(in_features=16, out_features=16, bias=True)
      (3): ReLU()
      (4): Linear(in_features=16, out_features=1, bias=True)
    )
""")

m = FighterJetModule()
print(m.net)

Exercise 3: implement loss and configure_optimizers of FighterJetModule¶

Next we need to implement the loss function and add an optimizer to the module.

  • a classic loss function is $(y-\hat{y})^2$ ($y$ is the true label, $\hat{y}$ is the NN output), but you can use whatever loss function you want.
  • for optimizers, you may want to choose one from PyTorch. Read torch.optim. You can start with torch.optim.Adam or torch.optim.SGD; a standalone sketch of creating and using an optimizer follows this list.
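
As a standalone illustration (using a throwaway toy model, not FighterJetModule), creating an optimizer from torch.optim and taking one gradient step looks roughly like this:

import torch
from torch import nn

toy = nn.Linear(2, 1)                                # a throwaway toy model
optim = torch.optim.SGD(toy.parameters(), lr=0.01)   # or torch.optim.Adam(...)

y_hat = toy(torch.randn(4, 2))                       # predictions
y = torch.randn(4, 1)                                # (fake) true labels
loss = ((y - y_hat) ** 2).mean()                     # mean squared error

optim.zero_grad()    # clear old gradients
loss.backward()      # compute new gradients
optim.step()         # update the parameters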
In [ ]:
@add_to_class(FighterJetModule)
def loss(self, y_hat, y):
    # `y` is the true label
    # `y_hat` is the predicted label from the current NN
    
    # TODO: your code here
    return None

@add_to_class(FighterJetModule)
def configure_optimizers(self):
    # you should return an optimizer from `torch.optim`.
    # TODO: your code here
    return None

Until now, you've implemented the training dataset (Exercise 1), the NN architecture (Exercise 2), and the loss function and an optimizer (Exercise 3). Next is an implementation of one round (epoch) of training. Read it line by line to make sure you understand it.

Here are some pointers:

  • Line A: torch.nn.Module.train
  • Line B: torch.optim.optimizer.zero_grad
  • Line C: torch.no_grad
  • Line D: torch.optim.optimizer.step
In [ ]:
@add_to_class(Lab2Trainer)
def fit_epoch(self):
    self.model.train()          # Line A
    for batch in self.train_dataloader:
        loss = self.model.training_step(batch)
        self.optim.zero_grad()  # Line B
        with torch.no_grad():   # Line C
            loss.backward()
            self.optim.step()   # Line D
        self.train_batch_idx += 1
In [ ]:
# Training
model = FighterJetModule()           # create a model
data = FighterJetData()              # create dataset
trainer = Lab2Trainer(max_epochs=20) # create trainer, train 20 epochs

# Train! 
# you will see the loss changes while training (lower loss is better)
trainer.fit(model, data)
In [ ]:
# check if the NN learned the safety rule
info("given an input Tensor(10000,0) [#jets, #missiles], should we fire?\n \
     (by safety rule in the video, no), but...")

with torch.no_grad():
    ret = model.forward(torch.Tensor([1345,0]))
    print("fire?", ret > 0.5)
In [ ]:
%run utils.ipynb

p_points = 0
n_points = 0
with torch.no_grad():
    for p,n in zip(get_positive_tests(), get_negative_tests()):
        if model.forward(p).item() >= 0.5:
            p_points += 1
        if model.forward(n).item() < 0.5:
            n_points += 1

info(f"=== points ===\n"
     f"  positive: [{p_points}/{get_num_positive_cases()}]\n"
     f"  negative: [{n_points}/{get_num_negative_cases()}]\n"
     f"  total:    [{p_points+n_points}/{get_num_positive_cases() + get_num_negative_cases()}]")

Challenge I: safe AI fighter jet¶

Try to train a NN that produces

=== points ===
  positive: [500/500]
  negative: [500/500]
  total:    [1000/1000]

Hint: this is supposed to be a non-trivial job (but sometimes people get lucky). If you're struggling, you might want to re-implement the NN (what NNs have larger learning capacity?), and also modify the training dataset (what data will let your NN learn the safety rule?).

Section 3: Learned index¶

In this section, we're trying to replicate "S2.3 A First, Naive Learned Index" in the learned index paper, where we use one neural network to learn a sorted dataset.
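
Recall the idea from the paper: the NN maps a key to an approximate position, and a bounded local search around that prediction finds the exact record. A minimal sketch of a lookup (hypothetical names; it assumes the model takes a 1-element float tensor and that an integer error bound is known) could look like this:

import torch

def lookup(model, sorted_keys, key, err_bound):
    # 1) predict an approximate position from the key
    with torch.no_grad():
        pos = int(round(model(torch.tensor([key], dtype=torch.float32)).item()))
    # 2) search only within [pos - err_bound, pos + err_bound]
    err = int(err_bound)
    lo = max(0, pos - err)
    hi = min(len(sorted_keys) - 1, pos + err)
    for i in range(lo, hi + 1):   # a binary search over this range would also work
        if sorted_keys[i] == key:
            return i
    return None                   # key not found within the error bound
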

In [ ]:
# below are some global variables (hyperparameters)
# they are here for an easier hyperparameter tuning
# (you will need to come back and change them)
m_learning_rate = 0.01
m_batch_size = 128
m_max_epochs = 40
m_normalize = True

A glance at the datasets¶

Below we provide three datasets (named easy, medium, and hard) and their visualizations. All of these datasets are sorted data (see also SOSD). easy and medium are synthetic data; hard is generated from Wikipedia.

In [ ]:
%run utils.ipynb
import matplotlib.pyplot as plt

# (1) study datasets
datasets = {
    "easy" : get_linear_dataset(batch_size=m_batch_size, normalize=m_normalize),
    "medium" : get_lognormal_dataset(batch_size=m_batch_size, normalize=m_normalize),
    "hard" : get_wiki_dataset(batch_size=m_batch_size, normalize=m_normalize)
}

# visualize the distribution of three cases
def plot_distribution(name):
    xs = []
    ys = []
    for x,y in datasets[name].dataset:
        xs.append(x.item())
        ys.append(y.item())
    plt.plot(xs, ys)
    plt.xlabel("database key")
    plt.ylabel("data position")
    plt.title(f"dataset [{name}]")
    plt.show()

for name in datasets:
    plot_distribution(name)

Exercise 4: implement a monolithic NN for learned index¶

Of course, a simple starting point is an MLP. You can implement whatever NNs you want and compare their performance.

In [ ]:
# (2) define your model (NN)
class LearnedIndex(d2l.Module):
    
    def __init__(self):
        super().__init__()
        self.save_hyperparameters()
        
        # TODO: your code here
        self.net = None
        
    def loss(self, y_hat, y):
        # TODO: your code here
        return None
        
    def configure_optimizers(self):
        # TODO: your code here; remember to use the global var `m_learning_rate`
        # (for simpler parameter tuning)
        return None
In [ ]:
# TODO: choose the dataset to learn
my_dataset = datasets["easy"]
In [ ]:
# prepare training
model = LearnedIndex()      # create a model
data = my_dataset           # create dataset
trainer = d2l.Trainer(max_epochs=m_max_epochs) # create trainer

# Train!
trainer.fit(model, data)

Below is a test of how well your learned index performs. The higher the "index points", the better.

In [ ]:
# see how well our learned index is
%run utils.ipynb

ind_points = 0
with torch.no_grad():
    # assert "index_err_bound" in globals(), "run %run utils.ipynb"
    for x,y in my_dataset.dataset:
        if abs(model.forward(x).item() - y.item() ) <= my_dataset.get_err_bound():
            ind_points += 1
        

info(f"=== index points ===\n"
     f" [{ind_points}/{len(my_dataset.dataset)}]\n")

Challenge II: improve your learned index performance¶

Can you achieve the following learned index performance?

# for easy dataset
=== index points ===
 [9000/10000]

# for medium dataset
=== index points ===
 [8000/10000]

# for hard dataset
=== index points ===
 [7000/10000]

Hints: try to tune parameters and hyperparameters (go back to the code block with global parameters), including:

  1. NN parameters
  2. epochs (or when to stop training?)
  3. optimizer and its parameters
  4. batch sizes
  5. normalizing data
  6. what else? (Google "neural network hyperparameters" to find more)

Section 4: RMI¶

Challenge III: implement RMI¶

Implement RMI in the learned index paper, and train your RMIs to achieve

# for easy dataset
=== index points ===
 [10000/10000]

# for medium dataset
=== index points ===
 [10000/10000]

# for hard dataset
=== index points ===
 [10000/10000]
In [ ]:
# write whatever code you need to build RMI here
In [ ]: