Hardware accelerated graphics (2D GPU) - graphics server system

Introduction

This project aims to provide a graphics server system based on hardware accelerated graphics, and an easy way to develop the graphics primitives. The accelerated versions run faster while using the exact same C code as the software version, thanks to automatic translation (transpiling) to Verilog code.

As an example, let's see how to use and develop drawing primitives for solid rectangles and ellipses.

Ellipse case (see ellipse_fill32.cc file):

MODULE ellipse_fill32(
  bus_master(bus),
  const uint16& x0,
  const uint16& x1,
  const uint16& y0,
  const uint16& y1,
  const uint32& rgba, //color
  const uint32& base, //pixel offset
  const int16& xstride, //normally 1, but can run backwards
  const int16& ystride //bytes to skip for next line (usually the framebuffer width * 4 bytes)
  )

The bus argument is automatically handled both in software and in hardware by provided macros.
The rectangle primitive follows the same function signature.
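To make the role of each parameter concrete, here is a minimal software sketch of an ellipse fill with a similar parameter list. It is not the project's implementation: the name `ellipse_fill32_model` is hypothetical, a plain `uint32_t *fb` pointer stands in for the bus argument, and `base`, `xstride` and `ystride` are all assumed to be counted in 32-bit pixels rather than bytes. The inside test is the integer form of the ellipse inequality x²·h² + y²·w² < w²·h².

```c
#include <stdint.h>

/* Hypothetical software model of an ellipse fill (not the project's code).
 * fb      : framebuffer, stands in for the memory bus
 * x0..y1  : bounding box, x1 and y1 exclusive
 * base    : index of the first pixel of the box (assumed unit: pixels)
 * xstride : pixels to advance per column, ystride: pixels per row
 * Assumes box sides below 32768 pixels so the products fit in int64_t. */
void ellipse_fill32_model(uint32_t *fb,
                          uint16_t x0, uint16_t x1,
                          uint16_t y0, uint16_t y1,
                          uint32_t rgba, uint32_t base,
                          int16_t xstride, int16_t ystride)
{
    int64_t w = (int64_t)x1 - x0;   /* box width in pixels  */
    int64_t h = (int64_t)y1 - y0;   /* box height in pixels */
    int64_t wh = w * w * h * h;     /* right-hand side of the inequality */

    for (int64_t j = 0; j < h; j++) {
        int64_t dy = 2 * j + 1 - h; /* doubled distance from the center row */
        int64_t yw = dy * dy * w * w;
        int64_t adr = (int64_t)base + j * ystride;

        for (int64_t i = 0; i < w; i++) {
            int64_t dx = 2 * i + 1 - w; /* doubled distance from the center column */
            if (dx * dx * h * h + yw < wh)
                fb[adr] = rgba;         /* pixel center lies inside the ellipse */
            adr += xstride;
        }
    }
}
```

With a 4x4 framebuffer, `ellipse_fill32_model(fb, 0, 4, 0, 4, color, 0, 1, 4)` colors every pixel except the four corners.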

Software implementation

You can call the function directly after compiling it with a normal C compiler:

ellipse_fill32(BUSMASTER_ARG, x0, x1, y0, y1, rgba, base, xstride, ystride);

The BUSMASTER_ARG macro is an automatic argument, defined in a provided header file. It's useful for the implementation of the hardware accelerated primitive, as explained in the next section.

The following image is produced by the simulator by calling the software implementation of the primitives 1000 times, using random coordinates:

Hardware implementation

To target a hardware implementation on an FPGA, a Verilog file is automatically generated from the corresponding C file containing the drawing primitive algorithm, using the following external tools: the CflexHDL and Silice transpilers (see the "c2v" target in Makefile.common for the invocation command details). Since this project aims to be transpiler-agnostic, support for the PipelineC transpiler is planned for a future version.

See a portion of the generated Verilog code, where part of the C expressions and interactions with the memory bus can be readily appreciated:

4: begin
    if (_q_x<_q_rw) begin
        _t_xx = _q_x*_q_x;
        _t_xh = (_t_xx)*(_q_hh);
        if (_t_xh+_q_yw<_q_wh) begin
            _d_bus_dat_w = in_rgba;
            _d_bus_we = 1;
            _d_bus_stb = 1;
            _d_bus_cyc = 1;
            if (!((_d_bus_stb&&_d_bus_we)&&!(_d_bus_stb&&in_bus_ack&&_d_bus_we))) begin
                _d_bus_stb = 0;
                _d_bus_adr = (_q_bus_adr+(in_xstride));
                _d_x = _q_x+1;
            end
        end else begin
            _d_bus_adr = (_q_bus_adr+(in_xstride));
            _d_x = _q_x+1;
        end
        _d__idx_fsm0 = 4;
    end else begin
        _d__idx_fsm0 = 5;
    end
end

After generating a System on Chip (SoC) for the target FPGA using the LiteX framework and a provided script, which includes the generated Verilog files, the accelerator can be called as follows:

regs->x0 = x0;
regs->x1 = x1;
regs->y0 = y0;
regs->y1 = y1;
regs->base = VIDEO_FRAMEBUFFER_BASE + y0*FRAME_PITCH + x0*sizeof(rgba);
regs->xstride = SDRAM_BUS_BITS/8;
regs->ystride = FRAME_PITCH;
regs->rgba = rgba;

regs->run = 1; //start
while(!regs->done); //wait until done

As seen, you first set memory mapped registers with the desired values, then you start the core, then wait until the done flag is set.

Each accelerator core is mapped starting at a fixed address (by default 0x80000000 for the first accelerator, 0x80000800 for the second and so on, as provided by the corresponding macros). A C structure with the layout of the registers is conveniently provided too (each register is 32-bit aligned, even if smaller).
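As a sketch of what such a register map can look like, the snippet below defines a structure with one 32-bit slot per register, computes the base address of accelerator `n` from the default addresses quoted above, and wraps the start/poll sequence with a bounded wait. The type name, macro names, field order and the `max_polls` timeout are all assumptions made for illustration; the header shipped with the project is the authoritative layout.

```c
#include <stdint.h>

/* Assumed layout: every register occupies a 32-bit slot, even if the
 * value is narrower. Field order here is hypothetical. */
typedef struct {
    volatile uint32_t run;
    volatile uint32_t done;
    volatile uint32_t x0, x1, y0, y1;
    volatile uint32_t rgba;
    volatile uint32_t base;
    volatile uint32_t xstride, ystride;
} accel_regs_sketch_t;

/* Default addresses quoted in the text; macro names are hypothetical. */
#define ACCEL_BASE_ADDR 0x80000000u
#define ACCEL_SPACING   0x800u

/* Address of the n-th accelerator core (n = 0 is the first one). */
static inline uint32_t accel_addr(unsigned n)
{
    return ACCEL_BASE_ADDR + n * ACCEL_SPACING;
}

/* Start the core and poll the done flag at most max_polls times.
 * Returns 0 when done was seen, -1 if the poll budget ran out. */
static inline int accel_run_wait(accel_regs_sketch_t *regs, uint32_t max_polls)
{
    regs->run = 1;
    while (max_polls--) {
        if (regs->done)
            return 0;
    }
    return -1;
}
```

On the target, the structure would be placed over the hardware registers with a cast such as `(accel_regs_sketch_t *)(uintptr_t)accel_addr(0)`; on a host machine the same functions can be exercised against an ordinary variable of that type.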

The resulting execution in hardware is as follows:

Testing equivalence of software and hardware

Both images look the same, but how can we be sure that the generated Verilog behaves the same as the software implementation? The solution is to run both the accelerated implementation and the compiled software implementation on the same SoC, and then compare the results. The test program does exactly that, reporting how many pixels are in error, if any:

passed test

You can see that the accelerator is about 3X faster, while it also frees the CPU for other tasks.

In case of any discrepancy, non-matching pixels are marked in red (this was generated by inducing a coordinate error in the software implementation) and the number of pixels in error is reported.

failed test

Console output would be in this case:

Pixel errors: 320 (screen should have no red pixels)

==========================================
*** TESTS FAILED ***
==========================================
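The comparison described above can be sketched as a single pass over two framebuffers, one rendered by the accelerator and one by the software implementation. This is an illustration rather than the project's test program: the function name, the in-place marking of the hardware buffer and the `ERROR_COLOR` encoding of red are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 32-bit encoding of the red marker; the real pixel format may differ. */
#define ERROR_COLOR 0x00FF0000u

/* Compare the hardware-rendered buffer against the software reference.
 * Mismatching pixels are overwritten with ERROR_COLOR so they show up
 * on screen; the return value is the number of pixels in error. */
size_t mark_pixel_errors(uint32_t *hw, const uint32_t *sw, size_t npixels)
{
    size_t errors = 0;
    for (size_t i = 0; i < npixels; i++) {
        if (hw[i] != sw[i]) {
            hw[i] = ERROR_COLOR;
            errors++;
        }
    }
    return errors;
}
```

A return value of zero corresponds to the passing case; any other value matches the "Pixel errors" count printed on the console.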

Prototype application and simulator

The acceleration is readily appreciated on the following video, where the software implementation is run prior to the hardware one. After that, a clock demo application is shown, using a combination of drawn rectangles and ellipses.

The code for that demo application can be compiled with the main simulator code on the host machine (e.g. Linux) to see results in the simulator window. This eases testing during development since the program compiles in a few seconds; it can then be run on the hardware platform to check that it produces the same results. In the provided video, you can see that the hardware matches the simulator.

Accelerator performance

Accelerator cores are connected directly to the dynamic RAM of the respective boards (acting similarly to a DMA) and use caching to achieve high-speed access. That way, the accelerated cores are about 7X faster than the software implementation, as the current tests report (line drawing core on an ECP5 device):

Start software rendering
elapsed 9057225 us, ops/s: 828153 (2 FPS @640x480)
Switch to hardware rendering
elapsed 1229585 us, ops/s: 6100246 (19 FPS @640x480)
Just waiting a bit to evaluate image...
Pixel errors: 0 (screen should have no red pixels)

==========================================
TESTS PASSED
==========================================
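The figures in this log can be cross-checked with a little arithmetic. The sketch below assumes that one "op" corresponds to one pixel written, an interpretation the log does not state but which reproduces the reported frame rates when ops/s is divided by the 640x480 pixels of a frame and truncated. The function names are hypothetical.

```c
#include <stdint.h>

/* Frames per second, assuming one op per pixel and truncating division. */
unsigned fps_from_ops(uint64_t ops_per_s, unsigned width, unsigned height)
{
    return (unsigned)(ops_per_s / ((uint64_t)width * height));
}

/* Ratio of software to hardware elapsed time. */
double speedup(uint64_t sw_elapsed_us, uint64_t hw_elapsed_us)
{
    return (double)sw_elapsed_us / (double)hw_elapsed_us;
}
```

With the values above, `fps_from_ops(828153, 640, 480)` gives 2, `fps_from_ops(6100246, 640, 480)` gives 19, and `speedup(9057225, 1229585)` is a little under 7.4, in line with the "about 7X" figure.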

The following video shows the corresponding images on the display (first the software renderer, then the accelerated renderer).

Supported boards

Currently the project supports the following boards:
Lambdaconcept ECPIX-5 with a Lattice ECP5 device. It currently uses an open source toolchain for building the bitstream. The board name used in the project is lambdaconcept_ecpix5.

Digilent Arty A7-35T with an AMD Artix-7 device. The board name used in the project is digilent_arty.

Build instructions

Just run:

make BOARD=lambdaconcept_ecpix5 run upload

The run target runs the simulator, and the upload target builds the bitstream and uploads it to the FPGA board.

Micropython support, with graphics

The FPGA board is now capable of running a graphics-enabled micropython port that can drive a PC monitor. Up to 8 bits per color channel are supported.

Example: accel_basic.py

Build & run instructions for micropython:

Generate and upload the bitstream (use your own serial port location if different):

make BOARD=digilent_arty digilent_arty SERIAL_PORT=/dev/ttyUSB1
cd micropython/ports/litex
make
litex_term.py --kernel build/firmware.bin /dev/ttyUSB1
cd -
openFPGALoader -b arty build/digilent_arty/gateware/digilent_arty.bit

Then the .py file can be uploaded using the standard mpremote command:

mpremote connect /dev/ttyUSB1 run demos/micropython/accel_basic.py

This generates the following picture:

To run the clock demo, use accel_clock.py (in the same folder).

Micropython support on the CPU board

The same repository used for the port of micropython to the FPGA platform was used to target the project's CPU board (F133A SoC by Allwinner Tech).

Supported features are:

  • REPL (Python prompt) over UART
  • Minimal umachine support
  • Time functions (except RTC)
  • Video Framebuffer
  • Dynamic memory allocation (using the internal DDR2 RAM)

The repository of the extended micropython is linked as a git submodule on the main repo (see micropython@51dfc37b30 at the project's main repo). The sub repository directs to the micropython repo.

The sources for the f133 port are under the following folder:

A basic example to test micropython on the CPU is to use the REPL mode:

Under the f133 folder, there's a micropython script that shows how to access the video framebuffer: test/test_video.py

Execution of this produced the following result:

Note the compatibility of the micropython code with the one for the FPGA platform.

Build instructions:

cd ports/f133
make

The make command will build the firmware and upload it to the board. To upload new firmware, you have to first push the reset button.

New CPU-based board

A brand-new board was designed using a graphics-capable CPU:

It's capable of running the clock demo, using the ellipse and rectangle fill graphics primitives.

This produces the following image:

Build instructions:

   cd target-cpu/f133-bare/
   make

The make command will build the firmware and upload it to the board. To upload new firmware, you have to first push the reset button.

This milestone is conclusive proof that this framework is capable of running the drawing primitives as software or hardware, since the same code runs on the CPU as software and in the FPGA as a hardware core, producing matching visual results.

The same port of micropython to the FPGA platform was used to target the project's CPU board (F133A SoC by Allwinner Tech)

Software JPEG image decompression

A new function to decompress JPEG files was implemented. It is based on the C model of a Verilog decompressor, so in that form it is useful for testing things in software before moving them to hardware. The original code was changed to avoid dynamic memory allocation, which makes it easier to run in the bare metal environment; see target-cpu/f133-bare for sources.

A JPEG file is embedded in the firmware image by means of a direct include of the raw file data (.incbin directive in the rawdata.S assembler source).

The example image size is 33.8 KB when compressed and 921.6 KB when decompressed (27:1 ratio). Software decompression takes 195 ms (5 FPS) and is expected to reach 30 FPS with the planned video decoder accelerator (Verilog).
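These numbers can be reproduced with a short calculation. The sketch assumes a 640x480 image stored at 3 bytes per pixel, which is an inference from the 921.6 KB figure (640 x 480 x 3 = 921600 bytes) rather than something stated in the text; the helper names are hypothetical.

```c
#include <stdint.h>

/* Size in bytes of an uncompressed image. */
uint32_t raw_image_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Compression ratio as raw size divided by compressed size. */
double compression_ratio(double raw_kb, double compressed_kb)
{
    return raw_kb / compressed_kb;
}

/* Frames per second achievable when each frame takes frame_ms milliseconds. */
double fps_from_ms(double frame_ms)
{
    return 1000.0 / frame_ms;
}
```

Here `compression_ratio(921.6, 33.8)` is about 27.3 and `fps_from_ms(195.0)` is about 5.1. Reaching 30 FPS means a budget of roughly 33.3 ms per frame, so the planned accelerator would need to be close to 6 times faster than the current software path.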

Working example in the bare metal environment:

The decompression algorithm is in the c_model_jpeg_test.cpp source.

About NLnet Foundation

This project is funded through the NGI0 Entrust Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101069594.