
Technology



Making Parallelism Work



While modern VLSI technology allows hundreds of cores to be integrated on a single chip, the main challenges are managing data bandwidth and making the performance accessible to programmers. Addressing these issues from the ground up, stream processing has emerged as today's most widely used massively parallel technique, and is commonly used in GPUs.

SPI has extended stream processing to target embedded media and signal processing, with specific attention to low cost, power efficiency and programming simplicity.

SPI's architecture is based on a standard RISC processor that offloads performance-critical tasks to a data-parallel unit designed to efficiently exploit the levels of parallelism found in multimedia and signal processing applications.

Conventional multicore designs that rely on coherent caches are poorly suited to the high computation rates and low data reuse characteristic of these applications. In contrast, the stream processor architecture uses a sophisticated C compiler to manage DMA so that only the data needed is transferred, well in advance of computation.
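The idea of moving data ahead of the computation that needs it can be sketched in plain C as double buffering. This is only an illustration, not SPI's compiler output: `dma_fetch`, `sum_double_buffered` and the `BLOCK` size are hypothetical names and values, and `memcpy` stands in for a DMA engine that on real hardware would run in the background.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64 /* hypothetical local-buffer size, in elements */

/* Stand-in for a DMA transfer from external to local memory (assumption:
   a real DMA engine would perform this copy concurrently with compute). */
static void dma_fetch(int32_t *local, const int32_t *ext, size_t count) {
    if (count > 0)
        memcpy(local, ext, count * sizeof *local);
}

/* Sums n elements of "external" memory using two local buffers:
   while one buffer is being consumed, the next block is already fetched. */
int64_t sum_double_buffered(const int32_t *ext, size_t n) {
    assert(ext != NULL || n == 0);

    int32_t buf[2][BLOCK];
    int64_t total = 0;
    int cur = 0;

    /* Prime the pipeline with the first block. */
    size_t cur_count = n < BLOCK ? n : BLOCK;
    dma_fetch(buf[cur], ext, cur_count);
    size_t fetched = cur_count;

    while (cur_count > 0) {
        /* Issue the fetch for the next block before computing on this one. */
        size_t remaining = n - fetched;
        size_t next_count = remaining < BLOCK ? remaining : BLOCK;
        dma_fetch(buf[cur ^ 1], ext + fetched, next_count);
        fetched += next_count;

        /* Compute only on data that is already in local memory. */
        for (size_t i = 0; i < cur_count; i++)
            total += buf[cur][i];

        cur ^= 1;
        cur_count = next_count;
    }
    return total;
}
```

Because the access pattern is known ahead of time, the transfer for the next block never depends on a cache miss; in this sketch that decision is hand-written, whereas the text above attributes it to the compiler.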

This efficient data management results in an order-of-magnitude improvement in compute performance and power efficiency without increasing cost or development effort.

Implementation

SPI's Storm-1 Series includes:

  • System CPU : Runs user apps, main OS, and drivers
  • DSP Subsystem Coprocessor : DSP MIPS runs main application threads, Data Parallel Unit runs compute-intensive kernels
  • I/O Subsystem : Gigabit Ethernet, PCI, video ports, etc.


Storm-1 SP16 Block Diagram



The main threads of the DSP application run on the DSP MIPS, which makes function calls to the Data Parallel Unit to accelerate compute-intensive kernels.

A kernel processes streams of data records in parallel across lanes. The Storm-1 Series has up to 16 lanes; each lane is a VLIW core with five 32-bit ALUs that include MAC support. Kernels and streams are DMA'd in the background from external memory to the local memory of each lane, based on compile-time allocation and real-time dependencies.
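The way a kernel is applied across lanes can be modelled in ordinary C. The sketch below is an emulation under stated assumptions, not Storm-1 code: `mac_kernel` and `run_kernel` are hypothetical names, and the inner loop stands in for work that the hardware lanes would perform simultaneously.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* the text states up to 16 lanes */

/* Hypothetical per-record kernel body: one multiply-accumulate. */
static int32_t mac_kernel(int32_t a, int32_t b, int32_t acc) {
    return acc + a * b;
}

/* Applies the same kernel to every record of the input streams.
   Assumption for simplicity: n is a multiple of LANES. */
void run_kernel(const int32_t *a, const int32_t *b, const int32_t *c,
                int32_t *out, size_t n) {
    assert(n % LANES == 0);

    /* Outer loop: one step in time, covering LANES consecutive records. */
    for (size_t base = 0; base < n; base += LANES) {
        /* Inner loop: emulates the lanes; each handles one record
           and executes the identical kernel code. */
        for (size_t lane = 0; lane < LANES; lane++) {
            size_t i = base + lane;
            out[i] = mac_kernel(a[i], b[i], c[i]);
        }
    }
}
```

Every lane runs the same instructions on a different record, which is why the programmer can reason about the kernel as a single-threaded function.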

The lanes can exchange data every clock cycle, while conditional stream constructs allow applications with complex control code and non-trivial dependencies to be implemented efficiently.
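The text does not spell out how conditional streams work on Storm-1. One common reading, from the general stream-processing literature, is a conditional output: each lane decides independently whether to emit a record, and the emitted records are packed densely into the output stream. The following C emulation illustrates that reading only; `conditional_write` and the `keep` flags are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16

/* Writes in[i] to out only where keep[i] is nonzero, preserving order.
   Returns the number of records written.
   Assumption for simplicity: n is a multiple of LANES. */
size_t conditional_write(const int32_t *in, const uint8_t *keep,
                         size_t n, int32_t *out) {
    assert(n % LANES == 0);

    size_t written = 0;
    for (size_t base = 0; base < n; base += LANES) {
        /* Each lane learns its output slot from a running count of the
           lanes before it that also want to write (inter-lane exchange). */
        size_t slot[LANES];
        size_t count = 0;
        for (size_t lane = 0; lane < LANES; lane++) {
            slot[lane] = count;
            if (keep[base + lane])
                count++;
        }

        /* Only lanes whose condition holds write, each to its own slot. */
        for (size_t lane = 0; lane < LANES; lane++) {
            if (keep[base + lane])
                out[written + slot[lane]] = in[base + lane];
        }
        written += count;
    }
    return written;
}
```

The data-dependent branch is turned into a data-routing step, so all lanes keep executing the same instruction sequence even when only some of them produce output.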

See the white paper for more details.

Stream Programming Model

The Storm-1 family is programmed using Stream™, SPI’s single-threaded programming and execution model. Stream uses standard ANSI C syntax with a few additional keywords to give a programmer predictable, transparent and easy-to-use access to all of the performance available in a stream processor.

The Stream Programming Model (SPM) is a unique API that addresses the primary challenges of programming a massively parallel embedded system with heterogeneous cores. Load balancing, data movement and synchronization are all greatly simplified using SPM while maintaining excellent efficiency and performance.

SPM also provides a simple single-threaded programming model for utilizing the massive parallelism of the Data Parallel Unit (DPU). Using only C with intrinsics (for DSP operations that have no C equivalent), programmers have a productive and predictable flow for optimizing code. In addition, because the Storm-1 device has no caches, code can be optimized and then set aside without concern that changes elsewhere in the system will affect its performance.
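Saturating arithmetic is a typical example of a DSP operation with no direct C operator. The sketch below emulates one in portable C to show how an intrinsic reads in ordinary code; `sat_add16` and `mix_samples` are hypothetical names and are not taken from SPI's intrinsic set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable emulation of a 16-bit saturating add. A DSP would typically
   expose this as a single-instruction intrinsic (name assumed here). */
static int16_t sat_add16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b; /* widen so the sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX; /* clamp instead of wrapping */
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Mixes two sample streams element by element, clipping at full scale. */
void mix_samples(const int16_t *x, const int16_t *y, int16_t *out, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL && out != NULL));
    for (size_t i = 0; i < n; i++)
        out[i] = sat_add16(x[i], y[i]);
}
```

Written this way, the calling code stays plain C, and only the body of the helper would change when it is replaced by a real intrinsic.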



Please also refer to the section on Tools.

 

Further Reading

IEEE Journal of Solid-State Circuits: A 512 GOPS Stream Processor for Signal, Image, and Video Processing
(Requires IEEE membership login)