This directory contains synthesizable VHDL implementations of neural network layers and components for FPGA-based hardware acceleration.
- Overview
- Directory Structure
- Generics Reference
- Layer Architectures
- Network Architecture
- Delay Calculations
- ROM Generation
- Related Documentation
The VHDL source implements a modular, pipelined neural network architecture with the following capabilities:
Supported layer types:
- Dense (fully connected) layers with configurable neurons
- 1D Convolutional layers with kernel-based processing
- Thresholding layers for quantized activations
- Argmax layers for classification output
- Flatten layers for dimension reduction
Key architectural features:
- Ping-pong buffering for pipelined layer-parallel operation
- Automatic ROM generation for weights and biases
- Configurable bit widths and fixed-point precision
- Modular design extensible to various layer types
Note: Mathematical functions used in delay calculations:
ceil(n) := the smallest integer greater than or equal to n
floor(n) := the greatest integer less than or equal to n
source/
├── README.md # This file
├── nn_pkg.vhd # Common package definitions
│
├── Layer Implementations (1D):
├── layer.vhd # Dense layer
├── conv_layer.vhd # 1D convolutional layer
├── thresh_layer.vhd # Thresholding layer
├── argmax_layer.vhd # Argmax layer
│
├── Processing Units:
├── neuron.vhd # Dense neuron
├── conv_neuron.vhd # Convolutional neuron
├── threshold.vhd # Threshold activation
├── multithreshold.vhd # Multi-threshold unit
├── argmax.vhd # Argmax unit
│
├── Memory & Buffering:
├── ping_pong_buffer.vhd # Dual buffer for pipelining
├── conv_ping_pong_buffer.vhd
├── block_ram.vhd # Generic RAM
├── block_ram_amd.vhd # AMD/Xilinx specific
├── block_ram_intel.vhd # Intel specific
├── vector_bram.vhd # Vector-based RAM
│
├── Data Management:
├── data_fetcher.vhd # Data/weight fetcher
├── data_fetcher_thresh.vhd
├── data_writer.vhd # Output writer
├── padding_interface.vhd # Zero-padding for convolution
├── writing_interface.vhd # Write control
│
├── Utility Components:
├── adder_tree.vhd # Parallel adder tree
├── c_s_tree.vhd # Compare & select tree
├── flatten.vhd # Flatten layer
├── relu.vhd # ReLU activation
├── softmax.vhd # Softmax (experimental)
├── exp_lut.vhd # Exponential LUT
│
├── Network Top-Level:
├── network.vhd # Network composition
├── network_pipe.vhd # Pipelined network
├── nn_controller.vhd # Network controller
├── aximm_network.vhd # AXI memory-mapped wrapper
├── input_provider.vhd # Input interface
└── output_fifo.vhd # Output buffering
Common generics used across the layer implementations:

| Generic | Description |
|---|---|
| G_DATA_WIDTH | Number of samples in one time series / feature map width |
| G_BIT_WIDTH | Bit width of one sample |
| G_PRECISION | Fixed-point precision for multiplication result slicing |
| G_WEIGHT_WIDTH | Bit width of weights (dense layers) |
| G_KERNEL_SIZES | Kernel length for convolution |
| G_LAYER_SIZES | Number of input nodes for each layer |
| G_SPECIAL_LAYERS | Placeholder for intermediate special layers (flatten, etc.) |
| G_ACTIVATIONS | Activation type per layer (RELU or linear); currently only linear is fully tested |
| G_LAYERS_TYPE | Layer type: 'd' = dense, 't' = thresholding, 'c' = convolutional |
| G_PATHS | Path array for data files (mainly thresholding layers) |
| G_INPUT_PREC | Bit width of each layer input |
| G_OUTPUT_PREC | Bit width of each layer output |
| G_PARALLEL_ELEMENTS | Number of parallel elements (thresholding layers) |
| G_SIGNED_NUMBERS | Determines signed or unsigned interpretation |
The layer architecture is modular and extensible, supporting dense, 1D convolutional, and thresholding layers.
The dense layer implements a fully connected layer with parallel input processing.
Architecture of a dense layer.

Note: The ROM containing weights and biases is kept separate in the actual design to enable automatic generation.
The ping_pong_buffer contains two RAM clusters, allowing the previous layer to write into one while the current layer reads from the other. This enables pipelined operation across the entire network.
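The ping-pong scheme can be modeled in a few lines of Python. This is an illustrative sketch only; the class and method names are hypothetical and do not reflect the VHDL entity's port interface.

```python
# Minimal Python model of the ping-pong buffering scheme: the previous layer
# writes one RAM cluster while the current layer reads the other.
class PingPongBuffer:
    def __init__(self, depth):
        self.rams = [[0] * depth, [0] * depth]
        self.write_sel = 0  # index of the RAM currently being written

    def write(self, addr, value):
        self.rams[self.write_sel][addr] = value

    def read(self, addr):
        # The read side always uses the RAM *not* being written.
        return self.rams[1 - self.write_sel][addr]

    def swap(self):
        # Called when both layers finish their pass: the roles are exchanged.
        self.write_sel = 1 - self.write_sel


buf = PingPongBuffer(depth=4)
buf.write(0, 42)          # previous layer writes into RAM 0
assert buf.read(0) == 0   # current layer still reads old data from RAM 1
buf.swap()
assert buf.read(0) == 42  # after the swap, the new data becomes readable
```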
Memory organization for inputs i1 = [i11, i12, i13] and i2 = [i21, i22, i23] with G_DATA_WIDTH = 3:

             inputs
             ---->
          +-----+-----+
          | i11 | i21 |
  data_   +-----+-----+
  width   | i12 | i22 |
    |     +-----+-----+
    v     | i13 | i23 |
          +-----+-----+
Each column represents one time series, stored in one RAM. All RAMs share the same address, so reading produces one row of the memory map.
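The row-wise read described above can be sketched in Python; the sample values and the `read_row` helper are illustrative stand-ins, not part of the design.

```python
# Sketch of the memory map above: each time series occupies one column (one
# RAM), and all RAMs share the same read address, so one read yields a row.
G_DATA_WIDTH = 3
i1 = [11, 12, 13]   # stands in for [i11, i12, i13]
i2 = [21, 22, 23]   # stands in for [i21, i22, i23]
rams = [i1, i2]     # one list per RAM / column

def read_row(addr):
    return [ram[addr] for ram in rams]

assert read_row(1) == [12, 22]   # the second sample of every series at once
```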
The data_fetcher retrieves data from the ping-pong buffer's read RAM and weights/biases from ROM using two counters:
- sample_count: sweeps the input samples of a time series
- out_count: sweeps the outputs
These counters serve as memory addresses and are shared with data_writer.
G_DATA_WIDTH = 2, G_INPUT_NUMBER = 1, G_NODE_NUMBER = 3.
The data fetcher accesses entire RAM rows. For sample_count_o = 1 in the memory map above, it fetches [i12, i22]. This repeats for each output.
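The interaction of the two counters can be sketched as a nested sweep, matching the example parameters above (G_DATA_WIDTH = 2, G_NODE_NUMBER = 3); the list comprehension is purely illustrative.

```python
# Hypothetical sketch of the data_fetcher counters: out_count sweeps the
# outputs and, for each output, sample_count sweeps all samples of the series.
G_DATA_WIDTH = 2
G_NODE_NUMBER = 3

addresses = [(out_count, sample_count)
             for out_count in range(G_NODE_NUMBER)
             for sample_count in range(G_DATA_WIDTH)]

# Every sample row is fetched once per output, e.g. for the first output:
assert addresses[:2] == [(0, 0), (0, 1)]
assert len(addresses) == G_NODE_NUMBER * G_DATA_WIDTH
```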
The neuron processes all inputs in parallel for each output individually:
G_INPUT_NUMBER = 5.
The neuron receives samples from all inputs in parallel, multiplies them with weights, and sums the results using an adder tree. This architecture uses one neuron per layer, processing inputs for each output sequentially with corresponding weights and biases.
Delay calculation:
Let i = number of inputs, ceil() = ceiling function, log2() = base-2 logarithm:
d = 1 + 1 + ceil(log2(i)) + 1 = 3 + ceil(log2(i))
| | | |
| | | +-- output register
| | +-- adder tree delay
| +-- product register
+-- input register
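The annotated sum above can be checked with a small Python sketch (the function name is illustrative):

```python
import math

def dense_neuron_delay(i):
    """d = 3 + ceil(log2(i)): input reg + product reg + adder tree + output reg."""
    return 3 + math.ceil(math.log2(i))

assert dense_neuron_delay(5) == 6   # G_INPUT_NUMBER = 5 -> adder tree depth 3
```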
The data_writer writes processed data to the next layer's ping-pong buffer. It receives counter information from data_fetcher and delays it to synchronize with neuron processing delay.
The convolutional layer extends the dense architecture with padding and modified neuron processing.
Architecture of a convolutional layer.

Zero-padding prevents size reduction during convolution. For vector v = [v1, v2, v3, v4] and kernel k = [k1, k2, k3, k4, k5]:
1. step: XX XX v1 v2 v3 v4 XX XX
         k1 k2 k3 k4 k5
2. step: XX XX v1 v2 v3 v4 XX XX
            k1 k2 k3 k4 k5
3. step: XX XX v1 v2 v3 v4 XX XX
               k1 k2 k3 k4 k5
4. step: XX XX v1 v2 v3 v4 XX XX
                  k1 k2 k3 k4 k5
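The sliding steps above can be sketched in Python; `pad` and `conv1d_same` are hypothetical helpers showing that floor(k/2) zeros on each side keep the output length equal to the input length.

```python
# Zero-padding sketch: floor(k/2) zeros ("XX" above) on each side of the
# vector, so a "same"-size 1D convolution produces len(v) outputs.
def pad(v, k):
    zeros = [0] * (k // 2)
    return zeros + v + zeros

def conv1d_same(v, kernel):
    p = pad(v, len(kernel))
    return [sum(p[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(v))]

v = [1, 2, 3, 4]
assert pad(v, 5) == [0, 0, 1, 2, 3, 4, 0, 0]
assert len(conv1d_same(v, [1, 0, 0, 0, 0])) == len(v)
```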
The padding_interface corrects invalid read addresses and provides zero-padding:
G_DATA_WIDTH = 4 and G_KERNEL_SIZE = 5.
The convolutional neuron uses a shift register (length G_KERNEL_SIZE) as input:
G_KERNEL_SIZE = 3 and G_INPUT_NUMBER = 4.
Delay calculation:
Let i = number of inputs, k = kernel size:
d = 1 + 1 + floor(k/2) + ceil(log2(k)) + ceil(log2(i)) + 1
  = 3 + floor(k/2) + ceil(log2(k)) + ceil(log2(i))
Note: The delay formula uses floor(k/2) for the shift register, which works across multiple kernel sizes but may need verification.
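As with the dense neuron, the convolutional delay formula can be checked numerically (function name illustrative):

```python
import math

def conv_neuron_delay(i, k):
    """d = 3 + floor(k/2) + ceil(log2(k)) + ceil(log2(i))."""
    return 3 + k // 2 + math.ceil(math.log2(k)) + math.ceil(math.log2(i))

assert conv_neuron_delay(4, 3) == 8  # G_INPUT_NUMBER = 4, G_KERNEL_SIZE = 3
```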
The writing_interface enables writing only when data is in a valid region:
G_DATA_WIDTH = 4, G_KERNEL_SIZE = 5, G_NODE_NUMBER = 2.
The argmax layer selects the maximal output value from the previous layer.
Nodes consist of multiplexers and comparators selecting the greater of two inputs:
A Compare & Select node.

The argmax block uses the Ping Pong Buffer and Data Fetcher for input, with a Compare & Select Tree and an index MUX for processing:

Architecture of the argmax block in the argmax layer.

The Flatten layer modifies the w_addr and w_en signals to change the memory write locations in the next layer.
Note: Multiple Flatten layers in one network may not be currently supported.
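The compare-and-select reduction used by the argmax layer can be sketched in software; `cs_node` and `cs_tree` are illustrative names, not the VHDL entities.

```python
# Software sketch of the Compare & Select Tree: each node keeps the larger of
# two (value, index) pairs, so the tree's root yields the argmax index.
def cs_node(a, b):
    return a if a[0] >= b[0] else b

def cs_tree(values):
    pairs = [(v, i) for i, v in enumerate(values)]
    while len(pairs) > 1:
        nxt = [cs_node(pairs[j], pairs[j + 1])
               for j in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:          # an odd element passes through unchanged
            nxt.append(pairs[-1])
        pairs = nxt
    return pairs[0][1]              # index of the maximum value

assert cs_tree([3, 9, 4, 1]) == 1
```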
| Generic | Description |
|---|---|
| G_DATA_WIDTH | Number of samples in one input |
| G_BIT_WIDTH | Bit width of one sample |
| G_PRECISION | Fixed-point precision |
| G_KERNEL_SIZES | Kernel sizes for each layer |
| G_LAYER_SIZES | Input and output sizes for each layer |
The network combines layers using generate statements. ROMs are automatically generated by scripts.
The nn_controller coordinates layer processing and synchronizes ping-pong buffer wram signals:
Each layer's fin signal triggers the next layer's start signal.
Note: Advanced control logic may be needed for pipelines where multiple layers work concurrently.
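The fin/start chaining can be modeled as a simple sequential hand-off; this toy sketch (hypothetical names) ignores the concurrent-pipeline case the note above mentions.

```python
# Toy model of the nn_controller hand-off: each layer runs after receiving its
# start signal, and its fin signal becomes the next layer's start signal.
def run_network(layers, x):
    for layer in layers:
        x = layer(x)   # layer runs to completion; fin triggers the next start
    return x

double = lambda v: [2 * s for s in v]
assert run_network([double, double], [1, 2]) == [4, 8]
```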
Layer delays determine network throughput. Let i = inputs, o = outputs, d = data width, k = kernel size.
The pipelined architecture adds o * d - 1 to the computation delay:
d_dense = 1 + 1 + 3 + ceil(log2(i)) + 1 + o * d - 1
= 5 + ceil(log2(i)) + o * d
d_conv = 1 + 1 + 3 + ceil(log2(i)) + ceil(log2(k)) + 1 + o * (d + k - 1) - 1
= 5 + ceil(log2(i)) + ceil(log2(k)) + o * (d + k - 1)
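The two throughput formulas can be verified with a short Python sketch (function names illustrative):

```python
import math

def dense_layer_delay(i, o, d):
    """d_dense = 5 + ceil(log2(i)) + o * d."""
    return 5 + math.ceil(math.log2(i)) + o * d

def conv_layer_delay(i, o, d, k):
    """d_conv = 5 + ceil(log2(i)) + ceil(log2(k)) + o * (d + k - 1)."""
    return (5 + math.ceil(math.log2(i)) + math.ceil(math.log2(k))
            + o * (d + k - 1))

assert dense_layer_delay(i=4, o=3, d=8) == 31
assert conv_layer_delay(i=4, o=3, d=8, k=5) == 46
```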
Note: Neuron delay adjustment (removal of ceil(k)) may be related to shift register timing.
ROMs for layer weights and biases are generated by NetworkGenerator.generate_roms():
Naming convention: wb_rom_<n>.vhd where <n> is the layer number.
Note: Generation deletes existing files matching regex wb_rom_[0-9]+.
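The naming convention and the deletion regex can be illustrated in Python; `rom_name` is a hypothetical helper, not part of NetworkGenerator.

```python
import re

# wb_rom_<n>.vhd naming and the cleanup pattern wb_rom_[0-9]+ from the note.
rom_pattern = re.compile(r"wb_rom_[0-9]+")

def rom_name(layer_number):
    return f"wb_rom_{layer_number}.vhd"

assert rom_name(2) == "wb_rom_2.vhd"
assert rom_pattern.match(rom_name(2)) is not None    # would be deleted
assert rom_pattern.match("other_rom_1.vhd") is None  # would be kept
```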
Within this repository:
- ../simulation/README.md - Testbenches and verification environment
- ../simulation/scripts/README.md - Test script structure
Top-level documentation:
- ../README.md - Project overview and quick start guide