ALIFE 2021 Tutorial: Differentiable Self-Organizing Systems

Introduction

In this section, Alexander Mordvintsev introduces himself and his work at Google Research. He also provides an overview of the presentation.

  • Alexander Mordvintsev is a researcher at Google who leads a team working on differentiable self-organizing systems.
  • The presentation covers results obtained by Mordvintsev and his colleagues over the last couple of years. It builds upon publications in the differentiable self-organizing systems thread of the Distill.pub journal.
  • The project was inspired by Professor Mike Levin's presentation "What Bodies Think About" at the NeurIPS 2018 conference, which provides an insightful overview of cognition and information processing in living systems outside of neural networks.

Self-Organization

In this section, Mordvintsev discusses self-organization and its properties.

  • Natural structures are the result of the collective behavior of locally communicating agents that work together towards some shared global goal. This characteristic is often referred to as self-organization.
  • One intriguing property of self-organizing systems is that it is almost impossible to predict their global behavior, even if you know the local interaction rules precisely, without actually executing those rules in a simulation or in the real world.
  • With the differentiable self-organizing systems project, they aim to find local rules that produce a desired target behavior, using deep learning or differentiable programming.

Differentiable Programming

In this section, Mordvintsev provides an overview of the basic building blocks of differentiable programming, or deep learning.

  • Functions are computer programs that take some input x, do something to it, and produce an output. The input can be a single number, a list of numbers, or some more complicated data structure that is still represented as numbers in computer memory.
  • Mordvintsev recommends Sebastian's YouTube channel for great videos on programming and coding self-organizing systems such as reaction-diffusion systems or simulated slime molds.
  • Differentiable programming is used to compose such functions into larger components or circuits.

Examples of Computable Functions

In this section, Mordvintsev provides examples of computable functions.

  • Examples of computable functions include a function that computes the sine of its argument: given x, it returns sin(x). Or it can be a function that takes an image represented as an array of RGB pixels and decides whether the image contains a hot dog or not.
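These two flavors of computable function can be sketched in Python (the hot-dog "classifier" here is a purely hypothetical stand-in based on mean brightness, not a real detector):

```python
import math

# A computable function in the simplest sense: a number in, a number out.
def f(x):
    return math.sin(x)

# A classifier has the same shape: numbers in, a decision out. This toy
# stand-in (purely hypothetical) thresholds mean pixel brightness.
def contains_hotdog(rgb_pixels):
    return sum(rgb_pixels) / len(rgb_pixels) > 0.5
```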

Introduction to Neural Networks

In this section, the speaker introduces neural networks and their applications in various fields.

Neural Network Functions

  • A function f can be a controller function for a robot that takes the current perception vector x and a state vector s_t representing the robot's short-term memory of the current task, and produces an action vector y and s_{t+1}, the robot's memory at the next moment in time.
  • Or f can be a text-processing system that consumes characters or words one by one and accumulates the meaning of the sentence in the sequence of states.
  • Such functions can also be applied at different pixels or areas of an image, with the same operation applied to every image neighborhood. This is how convolutional neural networks work.

Model Architecture

In this section, the speaker discusses model architecture, its role in programming a neural computer, and how it works with parameters.

Model Architecture

  • Model architecture defines the computational structure, or computational graph, required to compute outputs given inputs.
  • Different model architectures are more suitable for different input modalities. For example, convolutional neural networks are typically used for images while sequence-based models such as LSTMs are applied for sequences such as languages.
  • Parameters are responsible for determining what the function does. The actual task that these architectures solve is defined by learned parameters.

Loss Function

In this section, the speaker explains loss functions and their importance in measuring how well a system performs.

Loss Function

  • A loss function takes the output of the model and produces a single value that measures how well or badly the system performs.
  • Because it produces a single number, we can compare different stages of model training and use optimization algorithms to tune the model parameters theta, finding models that perform better.
  • However, a small loss value during training doesn't guarantee that the model is actually doing what we want it to do. It can overfit to the data or incorporate harmful biases that lead to incorrect or even dangerous decisions when the system is deployed in the real world.
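A minimal loss function of this kind, as a sketch (mean squared error, one of the standard choices):

```python
# A loss function maps model outputs and targets to a single scalar that
# measures how badly the system performs; minimal mean-squared-error sketch.
def mse_loss(y_pred, y_true):
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)
```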

The Art of Differentiable Programming

In this section, the speaker discusses the importance of differentiable programming and how it is used to optimize loss functions. He also mentions a book on the topic and explains why loss-based approaches are appropriate for his project.

Key Points:

  • Differentiable programming involves optimizing loss functions through a training schedule.
  • Loss-based approaches are appropriate for creating artificial life-like systems that perform specific tasks.
  • The flexibility of combining functional building blocks is what makes deep learning so popular.
  • Automatic differentiation is used to get the gradient of loss over parameters in differentiable programming.

Forward Mode vs Reverse Mode Automatic Differentiation

In this section, the speaker explains the difference between forward mode and reverse mode automatic differentiation and why reverse mode is preferred in differentiable programming.

Key Points:

  • Forward-mode automatic differentiation fixes one network input and computes the derivatives of all intermediate values and outputs with respect to that fixed input.
  • Reverse-mode automatic differentiation fixes one network output and computes the derivatives of that output value with respect to all intermediate values and inputs.
  • Reverse mode is preferred in differentiable programming because we typically have a single loss or objective value to care about, with millions or billions of parameters that influence it.
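Forward mode can be illustrated with dual numbers, where each value carries its derivative with respect to one chosen source input (an illustrative pure-Python sketch, not code from the talk):

```python
# Forward-mode automatic differentiation with dual numbers: every value is a
# pair (value, derivative w.r.t. one chosen source input).
class Dual:
    def __init__(self, v, d=0.0):
        self.v, self.d = v, d

    def __add__(self, other):
        return Dual(self.v + other.v, self.d + other.d)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.v * other.v, self.d * other.v + self.v * other.d)

x = Dual(3.0, 1.0)        # seed: differentiate with respect to x
y = x * x + x             # y = x^2 + x, so dy/dx = 2x + 1 = 7 at x = 3
```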

Impressive Use of Differentiation: G9.js Library

In this section, the speaker demonstrates an impressive use of differentiation using the g9.js library developed by Guillermo Webster.

Key Points:

  • The g9.js library can be used to create interactive graphics such as a dragon curve where each endpoint can be grabbed and moved.

Differentiable Programming with Interactive Demonstrations

In this section, the speaker discusses how to use differentiable programming frameworks to create interactive demonstrations. The speaker explains how these frameworks work and provides examples of different types of interactions that can be created.

Using Differentiable Programming Frameworks

  • Differentiable programming frameworks can be used to create interactive demonstrations.
  • These frameworks allow for the creation of complex interfaces by defining a number of parameters and points, which are then automatically produced by the library.
  • The core of these libraries is essentially a gradient computation, which uses finite differences to estimate the derivative of point positions over each parameter.
  • Modern differentiable programming frameworks like TensorFlow.js (tf.js) can be used to reproduce these interactions.

Examples of Interactive Demonstrations

  • A demonstration was created using tf.js that allows users to adjust the turn angles between segments in a line by dragging a point. The entire line then readjusts itself based on the new angle values.
  • Another example showed a line where only the dragged node would move, but if the first node was moved, all nodes would move. This was achieved using stop-gradient primitives that prevent propagation of gradients beyond certain nodes.
  • A fractal tree example was also shown as an example of what can be done with tf.js.

Building Blocks for Differentiable Programs

In this section, the speaker discusses basic building blocks for creating differentiable programs or circuits. They explain how forward propagation and backpropagation work in automatic differentiation cases.

Basic Building Blocks

  • The basic building blocks for creating differentiable programs or circuits are addition, multiplication, and various types of unary functions.
  • Fan out or split is also treated as a separate type of gate because it requires special handling when implementing backpropagation.

Forward Propagation and Backpropagation

  • Forward propagation involves replacing each value with a pair of values that correspond to the original value and its derivative over the input node selected as a derivative source. These derivatives are then propagated forward starting from source nodes.
  • Backpropagation in automatic differentiation cases involves selecting some node at the end of the network (typically a single loss value) and constructing an auxiliary network that starts from that value and passes the gradient.

Differentiable Programming and Backpropagation

This section covers the basics of differentiable programming, backpropagation, and building a dense layer.

Building Blocks of Differentiable Programming

  • Differentiable programming involves computing gradients for optimization.
  • Gradients can be computed using forward and backward propagation.
  • Backward propagation through multiplication operations and unary functions requires remembering values used during forward pass.
  • Linear layers combined with nonlinear activation functions are used to add expressive power to neural circuits.
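The point about remembering forward-pass values can be sketched with a multiply gate (function names are illustrative, not the speaker's code):

```python
# Backprop through a multiply gate. The forward pass must cache its inputs,
# because the local gradients are d(a*b)/da = b and d(a*b)/db = a.
def mul_forward(a, b):
    cache = (a, b)
    return a * b, cache

def mul_backward(grad_out, cache):
    a, b = cache
    return grad_out * b, grad_out * a   # gradients w.r.t. a and b
```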

Building a Dense Layer

  • A dense layer is built by multiplying an input vector by a matrix of coefficients to produce an output vector.
  • Bias values are added to each element of the output vector.
  • Two backpropagation circuits are needed: one for propagating gradients to inputs and another for propagating gradients to parameters.
  • The choice of activation function is critical; ReLU works better than sigmoid in deep models.
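A sketch of such a dense layer and its two backpropagation circuits, using NumPy (function names are illustrative):

```python
import numpy as np

# Dense layer sketch: y = W @ x + b. Backprop needs two circuits, one sending
# gradients to the input (for earlier layers) and one to the parameters.
def dense_forward(x, W, b):
    return W @ x + b

def dense_backward(grad_y, W, x):
    grad_x = W.T @ grad_y            # gradient w.r.t. input x
    grad_W = np.outer(grad_y, x)     # gradient w.r.t. weight matrix
    grad_b = grad_y                  # gradient w.r.t. bias
    return grad_x, grad_W, grad_b
```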

Self-Organizing Systems

This section covers the basics of self-organizing systems in differentiable programming.

Components of Self-Organizing Systems

  • A self-organizing system consists of a neural controller that is shared by a set of agents or modules.
  • Agents interact with their environment based on their internal state, which is updated by the neural controller using feedback from the environment.
  • The neural controller learns to optimize some objective function that measures how well the agents perform their tasks.

Applications of Self Organizing Systems

  • Self organizing systems have been applied in various fields such as robotics, game AI, natural language processing etc.

Introduction to Neural Cellular Automata

In this section, the speaker introduces the topic of neural cellular automata (neural CA) and explains that they will implement a minimalistic version of it using PyTorch.

Implementing a Minimalistic Neural Cellular Automata Model

  • The model operates on a uniform grid where each cell is represented by real values.
  • The state of each cell is updated iteratively using information collected from its 3x3 neighborhood.
  • The speaker uses Google Colab with GPU support to create a new notebook for implementation.
  • The speaker cheats a little by embedding some boilerplate code snippets into the notebook using Colab's custom-snippets feature.
  • The speaker imports PyTorch modules and sets the default tensor type to the GPU CUDA tensor. They also install and import another useful library called einops for high-dimensional array manipulation.
  • The speaker explains that they will use PyTorch instead of TensorFlow in this tutorial and will implement a variant of the model from an article on texture generation with neural CA.
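The setup described might look roughly like this (a sketch reconstructed from the description; the exact snippet is not shown here):

```python
# Notebook setup as described: PyTorch with CUDA default tensors, plus the
# einops library for explicit axis manipulation.
# !pip install einops
import torch
from einops import rearrange

if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
```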

Texture Synthesis with Neural Cellular Automata

In this section, the speaker discusses texture synthesis with neural cellular automata and explains how they will implement it in their model.

Implementing Texture Synthesis

  • The speaker imports code snippets from a PyTorch Colab notebook that accompanies an article on texture synthesis with neural CA.
  • The speaker copies a texture loss function based on the VGG16 pre-trained neural network into their notebook.
  • The speaker defines three useful functions for computing Gram matrices, squared losses, and texture descriptors.
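A Gram-matrix function of the kind described might look like this (a sketch; the notebook's actual helper may differ):

```python
import torch

# Gram matrix sketch: per-channel feature correlations, the standard texture
# descriptor used with a VGG-based style loss.
def gram(x):
    b, c, h, w = x.shape                     # [batch, channels, height, width]
    f = x.reshape(b, c, h * w)               # flatten spatial dims
    return f @ f.transpose(1, 2) / (h * w)   # [batch, channels, channels]
```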

Implementing the Neural Cellular Automata Model

In this section, the speaker explains how they will implement the neural cellular automata model in PyTorch.

Implementing the Model

  • The speaker defines a class for implementing the neural cell automata model with forward and backward methods.
  • The speaker explains that they will use a 3x3 convolutional kernel to update each cell's state based on its neighborhood information.
  • The speaker discusses how they will use an attention mechanism to weight each cell's contribution to the final output.
  • The speaker explains that they will use a softmax function to normalize the attention weights across all cells in the grid.

Training and Testing the Model

In this section, the speaker discusses how they will train and test their neural cellular automata model.

Training and Testing

  • The speaker explains that they will use Adam optimizer with a learning rate of 1e-4 to train their model.
  • They also explain that they will use mean squared error as their loss function during training.
  • The speaker shows some examples of synthesized textures generated by their model during testing.

Tensor Dimensions and Axis Manipulations

In this section, the speaker explains how to expand dimensions by adding a dummy dimension of size 1 at the beginning of tensor dimensions. They also modify a line using their rearrange function to express axis manipulations more explicitly.

Expanding Dimensions

  • The speaker explains that the tensors don't have a batch dimension.
  • To add one, they expand the dimensions by adding a dummy dimension of size 1 at the beginning of the tensor's dimensions.

Rearranging Axes

  • The speaker modifies a line using their rearrange function.
  • This allows them to express axis manipulations more explicitly.

Naming Array Dimensions and Shuffling Axes

In this section, the speaker explains how they name array dimensions with single letter names for better understanding. They also shuffle axes putting the channel at the second place.

Naming Array Dimensions

  • The speaker names array dimensions with single-letter names for better understanding.
  • They say that the array has the axes batch, height, width, and channel.

Shuffling Axes

  • The speaker shuffles the axes, putting the channel axis in second place (batch, channel, height, width).

Using Laplacian and Sobel Filters in the Neural CA Model

In this section, the speaker explains how they use Laplacian and Sobel filters in their neural CA model. They remove the normalizing coefficients from the filter definitions, using larger values to allow more gradient flow through branches during backpropagation.

Using Filters in the Neural CA Model

  • The speaker uses Laplacian and Sobel filters in their neural CA model.
  • They remove the normalizing coefficients from the filter definitions to allow more gradient flow through branches during backpropagation.

Defining Cell Representation

  • Each cell is represented by a vector of values.
  • The speaker defines the number of these values as 12, which is the smallest they managed to get while not degrading quality too much.

Using One Filter per Channel in New Model

In this section, the speaker explains how they use just one filter per channel in their new model. They create an array of filters with four Laplacian, four x-Sobel, and four y-Sobel kernels.

Using One Filter per Channel

  • The speaker uses just one filter per channel in their new model.

Creating Array of Filters

  • The speaker creates an array of filters with four Laplacian, four x-Sobel, and four y-Sobel kernels.
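Such a filter bank might be assembled like this (a sketch; the unnormalized coefficient values are assumptions based on the earlier remark about removing normalizers):

```python
import torch

# Filter bank sketch: one fixed 3x3 filter per channel for a 12-channel model,
# 4 Laplacian + 4 x-Sobel + 4 y-Sobel (unnormalized coefficients assumed).
lap = torch.tensor([[1.0,  2.0, 1.0],
                    [2.0, -12.0, 2.0],
                    [1.0,  2.0, 1.0]])
sob_x = torch.tensor([[-1.0, 0.0, 1.0],
                      [-2.0, 0.0, 2.0],
                      [-1.0, 0.0, 1.0]])
sob_y = sob_x.T
filters = torch.stack([lap] * 4 + [sob_x] * 4 + [sob_y] * 4)   # [12, 3, 3]
```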

Applying Laplacian Filters to RGB Channels

In this section, the speaker explains how they apply Laplacian filters to the first three channels, which are also interpreted as RGB channels. They do this to avoid a bias in which the visible channels become tied to a particular orientation of the patterns.

Applying Filters to RGB Channels

  • The speaker applies Laplacian filters to the first three channels, which are also interpreted as RGB channels.

Avoiding Bias Towards Visible Channels

  • This is done to avoid a bias in which the visible channels become tied to a particular orientation of the patterns.

Implementing the Cellular Automata Rule in the Constructor

In this section, the speaker explains how they implement the cellular automata rule. The parameters are defined in the constructor, and the update step is implemented by the forward function.

Defining Parameters

  • The speaker defines the parameters for the cellular automata rule in the constructor.

Implementing the Update Step

  • The update step is implemented by the forward function.

Perception Phase of Cell Update

In this section, the speaker explains how they implement the perception phase of the cell update. They use wrap-around padding to avoid worrying about boundary conditions and apply the convolution operation to individual filters and channels.

Using Wraparound Padding

  • The speaker uses wraparound padding to avoid worrying about boundary conditions.

Applying Convolution Operation

  • The speaker applies the convolution operation to individual filters and channels.
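The perception phase described can be sketched in PyTorch as follows (function and argument names are illustrative, not the speaker's exact code):

```python
import torch
import torch.nn.functional as F

# Perception sketch: wrap-around (circular) padding, then one fixed 3x3
# filter per channel via a grouped (depthwise) convolution.
def perceive(x, filters):
    # x: [batch, C, H, W]; filters: [C, 3, 3], one filter per channel
    c = x.shape[1]
    x = F.pad(x, (1, 1, 1, 1), mode='circular')
    return F.conv2d(x, filters[:, None], groups=c)
```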

Introduction to the Cellular Automata

In this section, the speaker introduces the concept of cellular automata and explains how they can be used to create patterns.

Creating an Update Rule for the Cellular Automata

  • The speaker creates an update rule for the cellular automata by concatenating a perception vector with the current state of the cell.
  • The perception vector is collected through image filters that gather information about neighboring cells.
  • The updates are multiplied by a stochastic update mask: an array of random values transformed so that values smaller than a threshold become zero and larger values become one.
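The update rule with a stochastic mask can be sketched as follows (illustrative names; `rule` and `fire_rate` are assumptions):

```python
import torch

# Update sketch: concatenate the current state with the perception vector,
# run the learned rule, and gate each cell with a stochastic update mask.
def update(state, perception, rule, fire_rate=0.5):
    y = rule(torch.cat([state, perception], dim=1))     # proposed state deltas
    # per-cell random mask: cells below fire_rate update, others stay frozen
    mask = (torch.rand(state.shape[0], 1, *state.shape[2:]) < fire_rate).float()
    return state + y * mask
```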

Training the Model

  • The weights and biases arrays are initialized to unit normal distribution scaled by 10^-3.
  • A loss function is defined to steer the system into producing a particular texture.
  • A pool of states is created to stabilize the patterns, so that the model continues to produce reasonable results when iteration starts from a grid that is already in some non-initial state.

Loss Function

  • The loss function consists of two terms: the style loss from the original texture-synthesis article, and an overflow loss, which limits the range of values the state channels can take to [-1, 1], because that makes them easier to quantize later on.
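An overflow loss of the kind described might look like this (a sketch):

```python
import torch

# Overflow loss sketch: penalize state values outside [-1, 1] so channels
# stay in a range that is easy to quantize later.
def overflow_loss(x):
    return (x - x.clamp(-1.0, 1.0)).abs().sum()
```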

Implementing the Training Loop

In this section, the speaker explains how they implemented the training loop and discusses some differences between TensorFlow and PyTorch.

Training Loop Implementation

  • The speaker has implemented a training loop of 8,000 iterations.
  • They use a "no gradient context" in PyTorch to disable gradient computation when it is not needed.
  • The speaker samples a batch of indices from the pool and extracts the grid states into an array x.
  • For every 8th batch, they replace the first element with the seed state so that the model doesn't forget how to grow starting from the seed.
  • They sample a random number of steps and iterate the model for that many steps, recursively applying the current CA to the state.
  • They evaluate the loss and propagate gradients through the graph towards the network parameters, then perform the optimizer step.
  • Inside the no-grad context, they place the current states back into the pool at the corresponding positions so they can be reused on the next training steps.
  • They record the current loss value and render the loss plot.
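The training step described above can be sketched as follows (all names and the random step range are assumptions, not the speaker's exact code):

```python
import torch

# Training-step sketch: sample from the pool, occasionally reseed, iterate
# the CA a random number of steps, backprop, write states back to the pool.
def train_step(ca, pool, seed, loss_fn, opt, step, batch_size=8):
    idx = torch.randint(0, len(pool), (batch_size,))
    x = pool[idx]
    if step % 8 == 0:
        x[0] = seed                          # don't forget how to grow from seed
    for _ in range(torch.randint(32, 65, (1,)).item()):
        x = ca(x)                            # recursive CA application
    loss = loss_fn(x)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                    # pool bookkeeping needs no grads
        pool[idx] = x.detach()
    return loss.item()
```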

Differences Between TensorFlow and PyTorch

  • One difference between TensorFlow and PyTorch is that by default PyTorch tries to record all operations on something like a gradient tape, so it remembers which tensors were computed by which operations from which inputs. This sometimes causes problems.
  • Another difference is that TensorFlow has a special context called "gradient tape" that is used when you need to compute gradients, while gradients are otherwise not tracked. In contrast, in PyTorch you use the no-grad context when you don't need to compute gradients.
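The difference can be seen in a few lines of PyTorch:

```python
import torch

# PyTorch records operations for autodiff by default; no_grad turns that
# bookkeeping off, roughly the inverse of TensorFlow's GradientTape.
w = torch.ones(3, requires_grad=True)
tracked = (w * 2.0).sum()        # recorded on the autograd graph

with torch.no_grad():
    untracked = (w * 2.0).sum()  # computed, but not recorded
```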

Adding Non-Linearities to the Model

In this section, the speaker discusses adding non-linearities to the model.

First Attempt at Adding Non-Linearity

  • The speaker tries to add non-linearity to the model by taking the ReLU of the vector y before processing it with the convolution. Printing the number of parameters of the neural CA shows it has 300 parameters, which is very small.
  • After running training for about four minutes, they find that the result is somewhat better than expected but still not great.

Second Attempt at Adding Non-Linearity

  • The speaker concatenates the ReLU of y with a construct where they negate y, take the ReLU, and negate again. This creates two different coefficients for each value: one for the positive part of the activation range and one for the negative part. Each value is split into two channels, so the positive part is treated by one set of coefficients and the negative part by another. The convolution now has twice as many inputs, so the number of inputs for this operation is multiplied by four rather than by two.
  • After running training again, they find that the new model has 588 parameters, and they wait to see how well this model trains.
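The positive/negative split can be sketched as:

```python
import torch

# Splitting each value into its positive and negative parts before the final
# linear map gives that map separate coefficients for each half of the range.
def split_relu(y):
    return torch.cat([torch.relu(y), torch.relu(-y)], dim=1)
```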

Implementing the Neural CA Model

In this section, the speaker implements the neural CA model and discusses how it produces a moving texture.

Applying Same Program to Multiple Locations on a Grid

  • The speaker applies the same program to multiple locations on a grid.
  • This is essentially having multiple instances of the same function that share parameters.
  • Back propagation allows finding parameters that make this self-organizing system produce a pattern that satisfies objectives enforced.

Visualization of the Weight Matrices

  • The speaker visualizes the weight matrices as a rectangle.
  • The main diagonal of negative values acts like a decay term, which stabilizes the cellular automata when values grow too high.
  • The other terms encode interactions between the channels of the model; applying Laplacian filters is roughly equivalent to blurring.

Color Distribution in Texture Synthesis

  • The style-transfer loss used may not be very good at reproducing blocks of solid color.
  • It's much better with edges and textures.

Exporting Parameters for Cell Automata

  • A piece of code exports parameters for cell automata in a small self-contained C function.
  • Most of the code consists of quantized parameters of the model.

Visualizing the Frame Buffer and Perception Buffer

In this section, the speaker explains how to visualize the frame buffer and perception buffer.

Three Filters for Position in Frame Buffer

  • The whole frame is split into three parts, each of which corresponds to four consecutive channels encoded as RGBA.
  • Each part has a filter depending on its position in the frame buffer.

Visualization of Perception Buffer

  • The perception buffer is visualized using a shader that implements an update rule.
  • The update rule reads from buffers produced by a perception shader and concatenates them into a vector.

Update Function

  • An update function is generated from the concatenated vector.
  • The update function has a gate-like structure that selects coefficients based on input vector elements' signs.

Stochastic Update Mask

  • Once the update vector is computed, a random mask called the stochastic update mask is applied to it.
  • The masked update is then added to the current state.

Updating Cells Independently

In this section, the speaker discusses updating cells independently.

Updating Channels Stochastically

  • Each channel can be updated independently of others.
  • This approach decreases granularity of synchronization even further.

Adapting Circuit to Other Computational Architecture

  • Adapting the circuit to another computational architecture forces you to resolve ambiguities and make design decisions you were unsure about before.
  • For example, linear combinations of the negative and positive parts can be replaced with linear combinations of y and x.

Computational Substrate

  • The computational substrate is the environment used to implement the model.
  • It allows design decisions to be nailed down.

Synthetic Tasks for Cell Automata

In this section, the speaker discusses synthetic tasks for cell automata.

Minimum Spanning Tree Construction Tasks

  • A random set of points on a plane is generated.
  • The minimum spanning tree is computed and rasterized as line segments.
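The task-generation step might look roughly like this, using SciPy's MST routine (a sketch; the talk's actual code, including rasterization, may differ):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# MST target generation sketch: random points, pairwise distances, MST edges
# (the talk then rasterizes these edges as line segments).
rng = np.random.default_rng(0)
pts = rng.random((10, 2))                    # random points on the unit square
dists = squareform(pdist(pts))               # dense pairwise-distance matrix
mst = minimum_spanning_tree(dists).toarray()
edges = np.argwhere(mst > 0)                 # (i, j) index pairs to draw
```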

Input and Output for Task

  • The input image consists of nodes encoded in the first channel of the cell state.
  • The output image connects the nodes with lines.

Training Setup Code

  • Pool X contains the initial states, and pool Y contains the expected results.
  • An array of task pairs is generated, and the training code from previous examples is reused.

Modifying the Training Objective

In this section, the speaker discusses modifying the training objective to steer the model towards drawing more lines and using different weights for positive and negative losses.

Modifying Training Objective

  • The model is learning something, so the training objective will be modified.
  • Positive difference values correspond to areas where the model draws a line where there shouldn't be one.
  • Different weights will be used for positive and negative difference losses.
  • Loss is now just pixel-wise difference between the current state of the system and target pattern.
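A weighted pixel-wise loss of this kind can be sketched as follows (the weight values are illustrative assumptions):

```python
import torch

# Weighted pixel-difference loss sketch: penalize extra lines (positive
# difference) and missing lines (negative difference) with different weights.
def weighted_loss(x, target, w_pos=1.0, w_neg=4.0):
    d = x - target
    return (w_pos * d.clamp(min=0.0) - w_neg * d.clamp(max=0.0)).mean()
```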

Visualizing Channel Data

In this section, the speaker realizes they made a mistake in computing loss with the wrong channel. They visualize data from another channel that looks better.

Computing Loss with Correct Channel

  • The speaker made a mistake in computing loss with the wrong channel.
  • Another channel looks much better when visualized.

Starting Training

In this section, training begins on modified objectives.

Starting Training

  • Training starts while the video plays.
  • The speaker presents the notebook so viewers can play around with it.

Observing Model Convergence

In this section, after 8000 steps of training, we see that the model has converged into producing something that connects points.

Model Convergence

  • After 8000 steps of training, we see that model has converged into producing something that connects points.
  • There are routes spanning across the grid boundaries.

Unexpected Behaviors

In this section, unexpected behaviors are observed while designing a system to make a spanning tree.

Unexpected Behaviors

  • The system happens to hallucinate additional nodes where they shouldn't be, although the original nodes don't disappear.
  • When starting from an all-zero state, the system produces dots out of nowhere and connects them.
  • The system is essentially a pattern generator, although there was no texture-generation objective.

Generalization

In this section, the speaker discusses how life generalizes well to situations that are widely outside of normal conditions that living systems exist in.

Generalization

  • Life generalizes well to situations that are far outside the normal conditions that living systems exist in.
  • These systems may not be open-ended, but they show a lot of unexpected behaviors.

Unexpected Adaptation

In this section, the speaker discusses how systems can sometimes adapt to situations outside of their evolutionary regime and shares an example of unexpected adaptation.

Strange Adaptation

  • Systems can sometimes adapt to situations outside of their evolutionary regime.
  • Unexpected adaptation occurs when a system trained for one task does something else.
  • The notebook used in the experiment will be uploaded to Github after the tutorial ends.

Tasks That Can Be Solved

  • Tasks that can be solved starting from local neighborhoods and then spreading the influence work well.
  • Pattern recognition is an example of a task that works well with this method.
  • The speaker conducted an experiment where they fed the network a grid of letters, and it had to spot the letters X, Y, and Z and connect them with lines.
  • It was easy for the network to filter out the X, Y, and Z letters, but it couldn't build the connection lines, because it had to decide which pairs of letters to connect.

Model Size Restriction

  • The failure to build connection lines may have been due to the size of the model, which was less than 600 parameters.
  • Behaviors like the patterns observed during texture synthesis, or detecting nodes in one channel, are all encoded in these 600-ish coefficients.

Simplifying Circuit Representation

In this section, the speaker talks about simplifying circuit representation by modifying filters used by models.

Side Filters

  • The speaker replaced the filters used by the model with side filters, which are just differences between the center value and nearby values.
  • Four different versions of these filters, rotated by 90 degrees, were used so that there were channels looking up, down, left, and right respectively.

Shader Toy Implementation

  • The speaker shows a Shader Toy implementation of the method.
  • Perception and update are merged in a single rule, resulting in only a single pass.
  • Two patterns are included in the same shader, with updates depending on time and pixel position.

Circuit Representation

  • The speaker wanted to come up with a representation of a model that would look like an electronic circuit.
  • They realized that it was non-trivial to route those filters if the circuit was to be two-layered.
  • Each cell is represented by this circuit, which has 12 nodes corresponding to different channels.

Understanding Neural Networks

In this section, the speaker discusses how neural networks work and how they can be implemented in analog circuits.

Neural Network Implementation

  • Positive values move to the right, negative values move to the left.
  • Gray points visualize connection coefficients between nodes.
  • Feedback loops feed update value back to state holding registers.
  • Can be treated as a system of ODEs rather than discrete time and implemented as an elegant analog circuit.

Reaction Diffusion Models

  • Reaction diffusion models are a version of cell automata that use only diffusion to communicate.
  • The system is completely isotropic, so a special loss function is needed that doesn't penalize incorrect orientation of the pattern produced.
  • The model doesn't overfit to grid structures and produces patterns at different orientations in different runs.
  • The approach generalizes easily to three-dimensional meshes when diffusion happens over mesh edges.

Volumetric Visualization

  • Trained models can be run in volume by replacing 2D diffusion with 3D diffusion.
  • Custom raycaster used for volume rendering.

deepdream.c

In this section, the speaker discusses deepdream.c, an implementation of convolutional neural networks.

Introduction and Next Steps

In this section, the speaker introduces a project and discusses the next steps for differentiable self-organization.

Applications and Modeling

  • Two main directions for differentiable self-organization: applications and modeling.
  • Applications include applying the principles to a wide variety of tasks, such as robot swarms or genetically engineered plants.
  • Modeling involves finding new models to apply principles of differentiable self-organization, such as reaction diffusion in 3D.

Understanding Principles

  • The third direction is interpreting and understanding the principles of operation for these models.
  • It is difficult to predict the behavior of even simple models due to the many interacting components involved.
  • Machine learning may provide some chance to design and reverse engineer these systems.

Video description

This tutorial covers the basics of Differentiable Programming, forward and backward propagation. Then I talk about Self-Organization and Neural Cellular Automata. I train NCAs to perform Texture Synthesis and approximate Minimum Spanning Tree construction. Tutorial website: https://selforglive.github.io/