
VHDL Source Documentation

This directory contains synthesizable VHDL implementations of neural network layers and components for FPGA-based hardware acceleration.

Overview

The VHDL source implements a modular, pipelined neural network architecture with the following capabilities:

Supported layer types:

  • Dense (fully connected) layers with configurable neurons
  • 1D Convolutional layers with kernel-based processing
  • Thresholding layers for quantized activations
  • Argmax layers for classification output
  • Flatten layers for dimension reduction

Key architectural features:

  • Ping-pong buffering for pipelined layer-parallel operation
  • Automatic ROM generation for weights and biases
  • Configurable bit widths and fixed-point precision
  • Modular design extensible to various layer types

Note: Mathematical functions used in delay calculations:

ceil(n)  := the smallest integer greater than or equal to n
floor(n) := the greatest integer less than or equal to n

Directory Structure

source/
├── README.md              # This file
├── nn_pkg.vhd            # Common package definitions
│
├── Layer Implementations (1D):
├── layer.vhd             # Dense layer
├── conv_layer.vhd        # 1D convolutional layer
├── thresh_layer.vhd      # Thresholding layer
├── argmax_layer.vhd      # Argmax layer
│
├── Processing Units:
├── neuron.vhd            # Dense neuron
├── conv_neuron.vhd       # Convolutional neuron
├── threshold.vhd         # Threshold activation
├── multithreshold.vhd    # Multi-threshold unit
├── argmax.vhd            # Argmax unit
│
├── Memory & Buffering:
├── ping_pong_buffer.vhd  # Dual buffer for pipelining
├── conv_ping_pong_buffer.vhd
├── block_ram.vhd         # Generic RAM
├── block_ram_amd.vhd     # AMD/Xilinx specific
├── block_ram_intel.vhd   # Intel specific
├── vector_bram.vhd       # Vector-based RAM
│
├── Data Management:
├── data_fetcher.vhd      # Data/weight fetcher
├── data_fetcher_thresh.vhd
├── data_writer.vhd       # Output writer
├── padding_interface.vhd # Zero-padding for convolution
├── writing_interface.vhd # Write control
│
├── Utility Components:
├── adder_tree.vhd        # Parallel adder tree
├── c_s_tree.vhd          # Compare & select tree
├── flatten.vhd           # Flatten layer
├── relu.vhd              # ReLU activation
├── softmax.vhd           # Softmax (experimental)
├── exp_lut.vhd           # Exponential LUT
│
├── Network Top-Level:
├── network.vhd           # Network composition
├── network_pipe.vhd      # Pipelined network
├── nn_controller.vhd     # Network controller
├── aximm_network.vhd     # AXI memory-mapped wrapper
├── input_provider.vhd    # Input interface
└── output_fifo.vhd       # Output buffering

Generics Reference

Common generics used across layer implementations:

Generic               Description
-------               -----------
G_DATA_WIDTH          Number of samples in one time series / feature map width
G_BIT_WIDTH           Bit width of one sample
G_PRECISION           Fixed-point precision for multiplication result slicing
G_WEIGHT_WIDTH        Bit width of weights (dense layers)
G_KERNEL_SIZES        Kernel length for convolution
G_LAYER_SIZES         Number of input nodes for each layer
G_SPECIAL_LAYERS      Placeholder for intermediate special layers (flatten, etc.)
G_ACTIVATIONS         Activation type per layer (RELU or linear); currently only linear is fully tested
G_LAYERS_TYPE         Layer type: 'd' = dense, 't' = thresholding, 'c' = convolutional
G_PATHS               Path array for data files (mainly thresholding layers)
G_INPUT_PREC          Bit width of each layer input
G_OUTPUT_PREC         Bit width of each layer output
G_PARALLEL_ELEMENTS   Number of parallel elements (thresholding layers)
G_SIGNED_NUMBERS      Determines signed or unsigned interpretation

Layer Architectures

The layer architecture is modular and extensible, supporting dense, 1D convolution, and thresholding.

Dense Layer

The dense layer implements a fully connected layer with parallel input processing.

Architecture of a dense layer.

Note: The ROM containing weights and biases is separate in the actual design to enable automatic generation.

Ping Pong Buffer

The ping_pong_buffer contains two RAM clusters, allowing the previous layer to write into one while the current layer reads from the other. This enables pipelined operation across the entire network.

Memory organization for inputs i1 = [i11, i12, i13] and i2 = [i21, i22, i23] with G_DATA_WIDTH = 3:

            inputs
             ---->
         +-----+-----+
         | i11 | i21 |
       | +-----+-----+
data_  | | i12 | i22 |
width  v +-----+-----+
         | i13 | i23 |
         +-----+-----+

Each column represents one time series, stored in one RAM. All RAMs share the same address, so reading produces one row of the memory map.
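The column-per-RAM layout can be modeled in a few lines of Python (a behavioral sketch, not the VHDL implementation; the names `rams` and `read_row` are illustrative):

```python
# Model: each column (one time series) lives in its own RAM; all RAMs
# share one address, so a read returns a whole row of the memory map.
i1 = [11, 12, 13]   # i11, i12, i13
i2 = [21, 22, 23]   # i21, i22, i23

rams = [i1, i2]     # one RAM per input time series

def read_row(addr):
    """Shared address across all RAMs -> one row of the memory map."""
    return [ram[addr] for ram in rams]

print(read_row(1))  # row at address 1 -> [12, 22]
```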

Data Fetcher

The data_fetcher retrieves data from the ping-pong buffer's read RAM and weights/biases from ROM using two counters:

  • sample_count: Sweeps input samples of a time series
  • out_count: Sweeps outputs

These counters serve as memory addresses and are shared with data_writer.

Example timing diagram with G_DATA_WIDTH = 2, G_INPUT_NUMBER = 1, G_NODE_NUMBER = 3.

The data fetcher accesses entire RAM rows. For sample_count_o = 1 in the memory map above, it fetches [i12, i22]. This repeats for each output.
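The address sequence produced by the two counters can be sketched in Python (assuming the generic values from the timing diagram above; names are illustrative):

```python
G_DATA_WIDTH = 2    # samples per time series
G_NODE_NUMBER = 3   # number of outputs

def fetch_schedule():
    """out_count sweeps the outputs; for each output, sample_count
    sweeps the input samples. Each pair serves as a memory address."""
    for out_count in range(G_NODE_NUMBER):
        for sample_count in range(G_DATA_WIDTH):
            yield out_count, sample_count

# sequence of (out_count, sample_count) addresses
print(list(fetch_schedule()))
```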

Dense Neuron

The neuron processes all inputs in parallel for each output individually:

Example architecture for a dense neuron with G_INPUT_NUMBER = 5.

The neuron receives samples from all inputs in parallel, multiplies them with weights, and sums the results using an adder tree. This architecture uses one neuron per layer, processing inputs for each output sequentially with corresponding weights and biases.

Delay calculation: Let i = number of inputs, ceil() = ceiling function, log2() = base-2 logarithm:

d = 1 + 1 + ceil(log2(i)) + 1 = 3 + ceil(log2(i))
    |   |   |               |
    |   |   |               +-- output register
    |   |   +-- adder tree delay
    |   +-- product register
    +-- input register
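The delay breakdown above translates directly into a small helper (a sketch; the function name is illustrative):

```python
from math import ceil, log2

def dense_neuron_delay(i):
    """Input register + product register + adder tree + output register,
    i.e. d = 3 + ceil(log2(i)) clock cycles for i inputs."""
    return 1 + 1 + ceil(log2(i)) + 1

print(dense_neuron_delay(5))   # ceil(log2(5)) = 3, so d = 6
```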

Data Writer

The data_writer writes processed data to the next layer's ping-pong buffer. It receives counter information from data_fetcher and delays it to synchronize with neuron processing delay.

Convolutional Layer (1D)

The convolutional layer extends the dense architecture with padding and modified neuron processing.

Architecture of a convolutional layer.

Padding

Zero-padding prevents size reduction during convolution. For vector v = [v1, v2, v3, v4] and kernel k = [k1, k2, k3, k4, k5]:

1. step: XX XX v1 v2 v3 v4 XX XX
         k1 k2 k3 k4 k5

2. step: XX XX v1 v2 v3 v4 XX XX
            k1 k2 k3 k4 k5

3. step: XX XX v1 v2 v3 v4 XX XX
               k1 k2 k3 k4 k5

4. step: XX XX v1 v2 v3 v4 XX XX
                  k1 k2 k3 k4 k5

Padding Interface

The padding_interface corrects invalid read addresses and provides zero-padding:

Example timing with G_DATA_WIDTH = 4 and G_KERNEL_SIZE = 5.

Convolutional Neuron

The convolutional neuron uses a shift register (length G_KERNEL_SIZE) as input:

Example architecture with G_KERNEL_SIZE = 3 and G_INPUT_NUMBER = 4.

Delay calculation: Let i = number of inputs, k = kernel size:

d = 1 + 1 + floor(k/2) + ceil(log2(k)) + ceil(log2(i)) + 1
  = 3 + floor(k/2) + ceil(log2(k)) + ceil(log2(i))

Note: The delay formula uses floor(k/2) for the shift register, which works across multiple kernel sizes but may need verification.

Writing Interface

The writing_interface enables writing only when data is in a valid region:

Example timing with G_DATA_WIDTH = 4, G_KERNEL_SIZE = 5, G_NODE_NUMBER = 2.

Argmax Layer

The argmax layer selects the index of the maximum output value of the previous layer.

Compare & Select Tree

Nodes consist of multiplexers and comparators selecting the greater of two inputs:

A Compare & Select node

Argmax Architecture

Uses Ping Pong Buffer and Data Fetcher for input, with Compare & Select Tree and index MUX for processing:

Architecture of the argmax block in the argmax layer.
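The compare & select tree can be modeled in Python as a pairwise reduction that propagates the index alongside the value (a behavioral sketch; `cs_tree_argmax` is an illustrative name):

```python
def cs_tree_argmax(values):
    """Pairwise compare & select: each node keeps the greater value
    together with its index, mirroring a tree of MUX + comparator nodes."""
    nodes = list(enumerate(values))          # (index, value) leaves
    while len(nodes) > 1:
        nxt = []
        for a in range(0, len(nodes) - 1, 2):
            left, right = nodes[a], nodes[a + 1]
            nxt.append(left if left[1] >= right[1] else right)
        if len(nodes) % 2:                   # odd node passes through
            nxt.append(nodes[-1])
        nodes = nxt
    return nodes[0][0]                       # index of the maximum

print(cs_tree_argmax([3, 9, 4, 7]))          # -> 1
```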

Flatten Layer

The Flatten layer modifies w_addr and w_en signals to change memory write locations in the next layer.

Note: Multiple Flatten layers in one network may not be currently supported.

Network Architecture

Generics

Generic          Description
-------          -----------
G_DATA_WIDTH     Number of samples in one input
G_BIT_WIDTH      Bit width of one sample
G_PRECISION      Fixed-point precision
G_KERNEL_SIZES   Kernel sizes for each layer
G_LAYER_SIZES    Input and output sizes for each layer

Network Composition

The network combines layers using generate statements. ROMs are automatically generated by scripts.

Neural Network Controller

The nn_controller coordinates layer processing and synchronizes ping-pong buffer wram signals:

Conceptual timing diagram of network functionality

Each layer's fin signal triggers the next layer's start signal.

Note: Advanced control logic may be needed for pipelines where multiple layers work concurrently.

Delay Calculations

Layer delays determine network throughput. Let i = inputs, o = outputs, d = data width, k = kernel size.

The pipelined architecture adds o * d - 1 to the computation delay:

Dense Layer Delay

d_dense = 1 + 1 + 3 + ceil(log2(i)) + 1 + o * d - 1
        = 5 + ceil(log2(i)) + o * d

Convolutional Layer Delay

d_conv = 1 + 1 + 3 + ceil(log2(i)) + ceil(log2(k)) + 1 + o * (d + k - 1) - 1
       = 5 + ceil(log2(i)) + ceil(log2(k)) + o * (d + k - 1)
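Both layer-delay formulas are easy to check numerically (a sketch; function names are illustrative):

```python
from math import ceil, log2

def dense_layer_delay(i, o, d):
    """d_dense = 5 + ceil(log2(i)) + o*d
    (computation delay plus the o*d - 1 pipeline term)."""
    return 5 + ceil(log2(i)) + o * d

def conv_layer_delay(i, o, d, k):
    """d_conv = 5 + ceil(log2(i)) + ceil(log2(k)) + o*(d + k - 1)."""
    return 5 + ceil(log2(i)) + ceil(log2(k)) + o * (d + k - 1)

print(dense_layer_delay(i=8, o=3, d=4))        # 5 + 3 + 12 = 20
print(conv_layer_delay(i=4, o=2, d=4, k=5))    # 5 + 2 + 3 + 16 = 26
```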

Note: The floor(k/2) shift-register term from the convolutional neuron delay does not appear here; this adjustment may be related to shift register timing and may need verification.

ROM Generation

ROMs for layer weights and biases are generated by NetworkGenerator.generate_roms():

Naming convention: wb_rom_<n>.vhd where <n> is the layer number.

Note: Generation deletes existing files matching regex wb_rom_[0-9]+.
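The naming convention and the cleanup regex can be illustrated with a small Python sketch (hypothetical helpers; the actual generation is performed by NetworkGenerator.generate_roms()):

```python
import re

# Matches generated ROM files, per the wb_rom_[0-9]+ cleanup regex.
ROM_PATTERN = re.compile(r"wb_rom_[0-9]+")

def rom_filename(layer_number):
    """wb_rom_<n>.vhd, where <n> is the layer number."""
    return f"wb_rom_{layer_number}.vhd"

def stale_roms(filenames):
    """Files that would be deleted before regeneration."""
    return [f for f in filenames if ROM_PATTERN.match(f)]

print(rom_filename(2))                              # wb_rom_2.vhd
print(stale_roms(["wb_rom_0.vhd", "layer.vhd"]))    # ['wb_rom_0.vhd']
```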

Related Documentation

Within this repository:

Top-level documentation: