This directory contains synthesizable VHDL implementations of neural network layers and components for FPGA-based hardware acceleration.
- Overview
- Directory Structure
- Generics Reference
- Layer Architectures
- Network Architecture
- Delay Calculations
- ROM Generation
- Related Documentation
The VHDL source implements a modular, pipelined neural network architecture with the following capabilities:
Supported layer types:
- Dense (fully connected) layers with configurable neurons
- 1D Convolutional layers with kernel-based processing
- Thresholding layers for quantized activations
- Argmax layers for classification output
- Flatten layers for dimension reduction
Key architectural features:
- Ping-pong buffering for pipelined layer-parallel operation
- Automatic ROM generation for weights and biases
- Configurable bit widths and fixed-point precision
- Modular design extensible to various layer types
Note: Mathematical functions used in delay calculations:
ceil(n) := the smallest integer greater than or equal to n
floor(n) := the greatest integer less than or equal to n
source/
├── README.md # This file
├── nn_pkg.vhd # Common package definitions
│
├── Layer Implementations (1D):
├── layer.vhd # Dense layer
├── conv_layer.vhd # 1D convolutional layer
├── thresh_layer.vhd # Thresholding layer
├── argmax_layer.vhd # Argmax layer
│
├── Processing Units:
├── neuron.vhd # Dense neuron
├── conv_neuron.vhd # Convolutional neuron
├── threshold.vhd # Threshold activation
├── multithreshold.vhd # Multi-threshold unit
├── argmax.vhd # Argmax unit
│
├── Memory & Buffering:
├── ping_pong_buffer.vhd # Dual buffer for pipelining
├── conv_ping_pong_buffer.vhd
├── block_ram.vhd # Generic RAM
├── block_ram_amd.vhd # AMD/Xilinx specific
├── block_ram_intel.vhd # Intel specific
├── vector_bram.vhd # Vector-based RAM
│
├── Data Management:
├── data_fetcher.vhd # Data/weight fetcher
├── data_fetcher_thresh.vhd
├── data_writer.vhd # Output writer
├── padding_interface.vhd # Zero-padding for convolution
├── writing_interface.vhd # Write control
│
├── Utility Components:
├── adder_tree.vhd # Parallel adder tree
├── c_s_tree.vhd # Compare & select tree
├── flatten.vhd # Flatten layer
├── relu.vhd # ReLU activation
├── softmax.vhd # Softmax (experimental)
├── exp_lut.vhd # Exponential LUT
│
├── Network Top-Level:
├── network.vhd # Network composition
├── network_pipe.vhd # Pipelined network
├── nn_controller.vhd # Network controller
├── aximm_network.vhd # AXI memory-mapped wrapper
├── input_provider.vhd # Input interface
└── output_fifo.vhd # Output buffering
Common generics used across the layer implementations:

| Generic | Description |
|---|---|
| G_DATA_WIDTH | Number of samples in one time series / feature map width |
| G_BIT_WIDTH | Bit width of one sample |
| G_PRECISION | Fixed-point precision for multiplication result slicing |
| G_WEIGHT_WIDTH | Bit width of weights (dense layers) |
| G_KERNEL_SIZES | Kernel length for convolution |
| G_LAYER_SIZES | Number of input nodes for each layer |
| G_SPECIAL_LAYERS | Placeholder for intermediate special layers (flatten, etc.) |
| G_ACTIVATIONS | Activation type per layer (RELU or linear); currently only linear is fully tested |
| G_LAYERS_TYPE | Layer type: 'd' = dense, 't' = thresholding, 'c' = convolutional |
| G_PATHS | Path array for data files (mainly thresholding layers) |
| G_INPUT_PREC | Bit width of each layer input |
| G_OUTPUT_PREC | Bit width of each layer output |
| G_PARALLEL_ELEMENTS | Number of parallel elements (thresholding layers) |
| G_SIGNED_NUMBERS | Determines signed or unsigned interpretation |
The layer architecture is modular and extensible, supporting dense, 1D convolutional, and thresholding layers.
The dense layer implements a fully connected layer with parallel input processing.
Architecture of a dense layer.

Note: The ROM containing weights and biases is kept separate in the actual design to enable automatic generation.
The ping_pong_buffer contains two RAM clusters, allowing the previous layer to write into one while the current layer reads from the other. This enables pipelined operation across the entire network.
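The ping-pong scheme can be modeled in a few lines of Python. This is an illustrative sketch only; the class and method names are hypothetical and do not reflect the VHDL entity's port interface.

```python
# Minimal Python model of the ping-pong buffering scheme: the previous layer
# writes one RAM cluster while the current layer reads the other.
class PingPongBuffer:
    def __init__(self, depth):
        self.rams = [[0] * depth, [0] * depth]
        self.write_sel = 0  # index of the RAM currently being written

    def write(self, addr, value):
        self.rams[self.write_sel][addr] = value

    def read(self, addr):
        # The read side always uses the RAM *not* being written.
        return self.rams[1 - self.write_sel][addr]

    def swap(self):
        # Called when both layers finish their pass: the roles are exchanged.
        self.write_sel = 1 - self.write_sel


buf = PingPongBuffer(depth=4)
buf.write(0, 42)          # previous layer writes into RAM 0
assert buf.read(0) == 0   # current layer still reads old data from RAM 1
buf.swap()
assert buf.read(0) == 42  # after the swap, the new data becomes readable
```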
Memory organization for inputs i1 = [i11, i12, i13] and i2 = [i21, i22, i23] with G_DATA_WIDTH = 3:

             inputs
             ---->
          +-----+-----+
          | i11 | i21 |
  data_   +-----+-----+
  width   | i12 | i22 |
    |     +-----+-----+
    v     | i13 | i23 |
          +-----+-----+
Each column represents one time series, stored in one RAM. All RAMs share the same address, so reading produces one row of the memory map.
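The row-wise read described above can be sketched in Python; the sample values and the `read_row` helper are illustrative stand-ins, not part of the design.

```python
# Sketch of the memory map above: each time series occupies one column (one
# RAM), and all RAMs share the same read address, so one read yields a row.
G_DATA_WIDTH = 3
i1 = [11, 12, 13]   # stands in for [i11, i12, i13]
i2 = [21, 22, 23]   # stands in for [i21, i22, i23]
rams = [i1, i2]     # one list per RAM / column

def read_row(addr):
    return [ram[addr] for ram in rams]

assert read_row(1) == [12, 22]   # the second sample of every series at once
```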
The data_fetcher retrieves data from the ping-pong buffer's read RAM and weights/biases from ROM using two counters:
- sample_count: sweeps the input samples of a time series
- out_count: sweeps the outputs
These counters serve as memory addresses and are shared with data_writer.
G_DATA_WIDTH = 2, G_INPUT_NUMBER = 1, G_NODE_NUMBER = 3.
The data fetcher accesses entire RAM rows. For sample_count_o = 1 in the memory map above, it fetches [i12, i22]. This repeats for each output.
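The interaction of the two counters can be sketched as a nested sweep, matching the example parameters above (G_DATA_WIDTH = 2, G_NODE_NUMBER = 3); the list comprehension is purely illustrative.

```python
# Hypothetical sketch of the data_fetcher counters: out_count sweeps the
# outputs and, for each output, sample_count sweeps all samples of the series.
G_DATA_WIDTH = 2
G_NODE_NUMBER = 3

addresses = [(out_count, sample_count)
             for out_count in range(G_NODE_NUMBER)
             for sample_count in range(G_DATA_WIDTH)]

# Every sample row is fetched once per output, e.g. for the first output:
assert addresses[:2] == [(0, 0), (0, 1)]
assert len(addresses) == G_NODE_NUMBER * G_DATA_WIDTH
```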
The neuron processes all inputs in parallel for each output individually:
G_INPUT_NUMBER = 5.
The neuron receives samples from all inputs in parallel, multiplies them with weights, and sums the results using an adder tree. This architecture uses one neuron per layer, processing inputs for each output sequentially with corresponding weights and biases.
Delay calculation:
Let i = number of inputs, ceil() = ceiling function, log2() = base-2 logarithm:
d = 1 + 1 + ceil(log2(i)) + 1 = 3 + ceil(log2(i))
| | | |
| | | +-- output register
| | +-- adder tree delay
| +-- product register
+-- input register
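The annotated sum above can be checked with a small Python sketch (the function name is illustrative):

```python
import math

def dense_neuron_delay(i):
    """d = 3 + ceil(log2(i)): input reg + product reg + adder tree + output reg."""
    return 3 + math.ceil(math.log2(i))

assert dense_neuron_delay(5) == 6   # G_INPUT_NUMBER = 5 -> adder tree depth 3
```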
The data_writer writes processed data to the next layer's ping-pong buffer. It receives counter information from data_fetcher and delays it to synchronize with neuron processing delay.
The convolutional layer extends the dense architecture with padding and modified neuron processing.
Architecture of a convolutional layer.

Zero-padding prevents size reduction during convolution. For vector v = [v1, v2, v3, v4] and kernel k = [k1, k2, k3, k4, k5]:
1. step: XX XX v1 v2 v3 v4 XX XX
         k1 k2 k3 k4 k5
2. step: XX XX v1 v2 v3 v4 XX XX
            k1 k2 k3 k4 k5
3. step: XX XX v1 v2 v3 v4 XX XX
               k1 k2 k3 k4 k5
4. step: XX XX v1 v2 v3 v4 XX XX
                  k1 k2 k3 k4 k5
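The sliding steps above can be sketched in Python; `pad` and `conv1d_same` are hypothetical helpers showing that floor(k/2) zeros on each side keep the output length equal to the input length.

```python
# Zero-padding sketch: floor(k/2) zeros ("XX" above) on each side of the
# vector, so a "same"-size 1D convolution produces len(v) outputs.
def pad(v, k):
    zeros = [0] * (k // 2)
    return zeros + v + zeros

def conv1d_same(v, kernel):
    p = pad(v, len(kernel))
    return [sum(p[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(v))]

v = [1, 2, 3, 4]
assert pad(v, 5) == [0, 0, 1, 2, 3, 4, 0, 0]
assert len(conv1d_same(v, [1, 0, 0, 0, 0])) == len(v)
```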
The padding_interface corrects invalid read addresses and provides zero-padding:
G_DATA_WIDTH = 4 and G_KERNEL_SIZE = 5.
The convolutional neuron uses a shift register (length G_KERNEL_SIZE) as input:
G_KERNEL_SIZE = 3 and G_INPUT_NUMBER = 4.
Delay calculation:
Let i = number of inputs, k = kernel size:
d = 1 + 1 + floor(k/2) + ceil(log2(k)) + ceil(log2(i)) + 1
  = 3 + floor(k/2) + ceil(log2(k)) + ceil(log2(i))
Note: The delay formula uses floor(k/2) for the shift register, which works across multiple kernel sizes but may need verification.
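As with the dense neuron, the convolutional delay formula can be checked numerically (function name illustrative):

```python
import math

def conv_neuron_delay(i, k):
    """d = 3 + floor(k/2) + ceil(log2(k)) + ceil(log2(i))."""
    return 3 + k // 2 + math.ceil(math.log2(k)) + math.ceil(math.log2(i))

assert conv_neuron_delay(4, 3) == 8  # G_INPUT_NUMBER = 4, G_KERNEL_SIZE = 3
```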
The writing_interface enables writing only when data is in a valid region:
G_DATA_WIDTH = 4, G_KERNEL_SIZE = 5, G_NODE_NUMBER = 2.
The argmax layer selects the maximal output value from the previous layer.
Nodes consist of multiplexers and comparators selecting the greater of two inputs:
A Compare & Select node.

The argmax block uses the Ping Pong Buffer and Data Fetcher for input, with a Compare & Select Tree and an index MUX for processing:

Architecture of the argmax block in the argmax layer.

The Flatten layer modifies the w_addr and w_en signals to change the memory write locations in the next layer.
Note: Multiple Flatten layers in one network may not be currently supported.
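The compare-and-select reduction used by the argmax layer can be sketched in software; `cs_node` and `cs_tree` are illustrative names, not the VHDL entities.

```python
# Software sketch of the Compare & Select Tree: each node keeps the larger of
# two (value, index) pairs, so the tree's root yields the argmax index.
def cs_node(a, b):
    return a if a[0] >= b[0] else b

def cs_tree(values):
    pairs = [(v, i) for i, v in enumerate(values)]
    while len(pairs) > 1:
        nxt = [cs_node(pairs[j], pairs[j + 1])
               for j in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:          # an odd element passes through unchanged
            nxt.append(pairs[-1])
        pairs = nxt
    return pairs[0][1]              # index of the maximum value

assert cs_tree([3, 9, 4, 1]) == 1
```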
| Generic | Description |
|---|---|
| G_DATA_WIDTH | Number of samples in one input |
| G_BIT_WIDTH | Bit width of one sample |
| G_PRECISION | Fixed-point precision |
| G_KERNEL_SIZES | Kernel sizes for each layer |
| G_LAYER_SIZES | Input and output sizes for each layer |
The network combines layers using generate statements. ROMs are automatically generated by scripts.
The nn_controller coordinates layer processing and synchronizes ping-pong buffer wram signals:
Each layer's fin signal triggers the next layer's start signal.
Note: Advanced control logic may be needed for pipelines where multiple layers work concurrently.
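The fin/start chaining can be modeled as a simple sequential hand-off; this toy sketch (hypothetical names) ignores the concurrent-pipeline case the note above mentions.

```python
# Toy model of the nn_controller hand-off: each layer runs after receiving its
# start signal, and its fin signal becomes the next layer's start signal.
def run_network(layers, x):
    for layer in layers:
        x = layer(x)   # layer runs to completion; fin triggers the next start
    return x

double = lambda v: [2 * s for s in v]
assert run_network([double, double], [1, 2]) == [4, 8]
```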
Layer delays determine network throughput. Let i = inputs, o = outputs, d = data width, k = kernel size.
The pipelined architecture adds o * d - 1 to the computation delay:
d_dense = 1 + 1 + 3 + ceil(log2(i)) + 1 + o * d - 1
= 5 + ceil(log2(i)) + o * d
d_conv = 1 + 1 + 3 + ceil(log2(i)) + ceil(log2(k)) + 1 + o * (d + k - 1) - 1
= 5 + ceil(log2(i)) + ceil(log2(k)) + o * (d + k - 1)
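The two throughput formulas can be verified with a short Python sketch (function names illustrative):

```python
import math

def dense_layer_delay(i, o, d):
    """d_dense = 5 + ceil(log2(i)) + o * d."""
    return 5 + math.ceil(math.log2(i)) + o * d

def conv_layer_delay(i, o, d, k):
    """d_conv = 5 + ceil(log2(i)) + ceil(log2(k)) + o * (d + k - 1)."""
    return (5 + math.ceil(math.log2(i)) + math.ceil(math.log2(k))
            + o * (d + k - 1))

assert dense_layer_delay(i=4, o=3, d=8) == 31
assert conv_layer_delay(i=4, o=3, d=8, k=5) == 46
```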
Note: Neuron delay adjustment (removal of ceil(k)) may be related to shift register timing.
ROMs for layer weights and biases are generated by NetworkGenerator.generate_roms():
Naming convention: wb_rom_<n>.vhd where <n> is the layer number.
Note: Generation deletes existing files matching regex wb_rom_[0-9]+.
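The naming convention and the deletion regex can be illustrated in Python; `rom_name` is a hypothetical helper, not part of NetworkGenerator.

```python
import re

# wb_rom_<n>.vhd naming and the cleanup pattern wb_rom_[0-9]+ from the note.
rom_pattern = re.compile(r"wb_rom_[0-9]+")

def rom_name(layer_number):
    return f"wb_rom_{layer_number}.vhd"

assert rom_name(2) == "wb_rom_2.vhd"
assert rom_pattern.match(rom_name(2)) is not None    # would be deleted
assert rom_pattern.match("other_rom_1.vhd") is None  # would be kept
```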
Within this repository:
- ../simulation/README.md - Testbenches and verification environment
- ../simulation/scripts/README.md - Test script structure
Top-level documentation:
- ../README.md - Project overview and quick start guide