Making Parallelism Work
With improvements in processor clock frequency slowing to a crawl, the industry agrees that parallel processing is the only viable way forward. While modern VLSI technology can fit hundreds of cores on a chip, bandwidth efficiency and easy-to-use development tools are critical to achieving real-world performance. Addressing these challenges from the ground up, SPI's Stream Processor™ Architecture harnesses massive parallelism in a standard, single-flow C programming model.
A compact silicon footprint and ultra-high bandwidth efficiency enable a more than tenfold improvement in performance, power and cost (GOPS/Watt/mm²) compared to traditional processor and DSP architectures.
The key concept is massive parallelism fed by a memory hierarchy in which data movement and execution are decoupled and managed by the C compiler to reduce global bandwidth needs.
A data-parallel single instruction flow provides inherent load balancing, and explicit, 'cache-less' data movement delivers the predictability needed for efficient compilation using a straightforward C programming methodology.
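The idea of explicit, decoupled data movement can be sketched in plain C. This is a sequential model under stated assumptions, not SPI's compiler output: `TILE`, `load_tile`, and `sum_squares_tiled` are hypothetical names, and `memcpy` stands in for a background DMA transfer. Two local buffers alternate so that one tile is being computed on while the next is loaded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 64  /* hypothetical local-memory tile size, in elements */

/* Stand-in for a DMA transfer: copy up to TILE elements from "global"
 * memory into a local buffer and return how many were copied. */
static size_t load_tile(int *local, const int *global, size_t offset, size_t n)
{
    size_t count = (n - offset < TILE) ? (n - offset) : TILE;
    memcpy(local, global + offset, count * sizeof *local);
    return count;
}

/* Sum the squares of n elements, touching global memory only through
 * explicit tile loads. Two buffers alternate: one is computed on while
 * the other receives the next tile. */
long sum_squares_tiled(const int *global, size_t n)
{
    int buf[2][TILE];
    long total = 0;
    size_t offset = 0;
    size_t count;
    int cur = 0;

    assert(global != NULL || n == 0);
    if (n == 0)
        return 0;

    count = load_tile(buf[cur], global, offset, n);  /* prime first buffer */
    while (count > 0) {
        size_t next_offset = offset + count;
        size_t next_count = 0;

        /* On real hardware this load would overlap the compute below;
         * in this sequential model it simply runs first. */
        if (next_offset < n)
            next_count = load_tile(buf[1 - cur], global, next_offset, n);

        for (size_t i = 0; i < count; i++)
            total += (long)buf[cur][i] * buf[cur][i];

        offset = next_offset;
        count = next_count;
        cur = 1 - cur;
    }
    return total;
}
```

Because every access to global memory goes through `load_tile`, the traffic is known before the loop runs, which is the kind of predictability a compiler can schedule around.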
Implementation
SPI's stream processor SoCs include:
- System MIPS: Runs user apps, main OS, and drivers
- DSP Coprocessor: DSP MIPS runs main application threads, Data Parallel Unit runs compute-intensive kernels

Figure: Storm-1 SP16 Block Diagram
The DSP application threads run on the DSP MIPS, which makes kernel function calls to the Data Parallel Unit (DPU), where kernels process data records (streams) in parallel.
A kernel is executed in lock-step across lanes, each a VLIW core that operates on a batch of stream data stored in local memory. Kernels and streams are DMA'd in the background from external memory by the DPU Dispatcher based on dependencies set up by the compiler.
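Lock-step execution across lanes can be modeled in plain C. The sketch below is a sequential simulation, not StreamC: `NUM_LANES`, `kernel_op`, and `run_kernel_lockstep` are hypothetical names, and the lane count of 16 is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LANES 16  /* assumed lane count, for illustration only */

/* Per-record work: the same operation runs on every lane. */
static int kernel_op(int x)
{
    return 2 * x + 1;
}

/* Apply kernel_op to n records. The outer loop walks batches of
 * NUM_LANES records; the inner loop models the lanes, which would all
 * execute the same instruction in the same cycle on real hardware. */
void run_kernel_lockstep(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t base = 0; base < n; base += NUM_LANES) {
        for (size_t lane = 0; lane < NUM_LANES; lane++) {
            size_t idx = base + lane;
            if (idx < n)  /* lanes past the end of the stream stay idle */
                out[idx] = kernel_op(in[idx]);
        }
    }
}
```

Since every lane performs identical work on its own record, no lane waits on another, which is one way to read the load-balancing claim.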
The lanes can exchange data every clock cycle, and conditional streams allow applications with complex control code and non-trivial dependencies to be implemented efficiently.
See the white paper for more details.
Programming model
The Stream Programming Model defines Components with Pipelines of Kernels, which allows the tools to automate scheduling and relieves the programmer of having to deal with memory or synchronization.
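A pipeline of kernels can be approximated in plain C as an ordered list of functions, each mapping an input stream to an output stream. This is only a sketch of the concept: `kernel_fn`, `k_offset`, `k_scale`, `run_pipeline`, and `MAX_RECORDS` are hypothetical names, and the runner below is a stand-in for whatever scheduling the actual tools perform.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RECORDS 256  /* hypothetical upper bound on stream length */

/* A kernel maps an input stream of n records to an output stream. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

void k_offset(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = in[i] + 3;
}

void k_scale(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = in[i] * 2;
}

/* Run the kernels in order. The runner owns the intermediate buffers,
 * so the caller only lists the stages. Returns 0 on success, -1 if the
 * stream is too long or the pipeline is empty. */
int run_pipeline(const kernel_fn *stages, size_t num_stages,
                 const int *in, int *out, size_t n)
{
    int scratch[2][MAX_RECORDS];
    const int *src = in;
    int which = 0;

    assert(stages != NULL && in != NULL && out != NULL);
    if (n > MAX_RECORDS || num_stages == 0)
        return -1;

    for (size_t s = 0; s < num_stages; s++) {
        /* The last stage writes straight to the caller's output. */
        int *dst = (s + 1 == num_stages) ? out : scratch[which];
        stages[s](src, dst, n);
        src = dst;
        which = 1 - which;
    }
    return 0;
}
```

A caller would pass something like `{ k_offset, k_scale }` as `stages`; buffer allocation and stage ordering stay inside `run_pipeline`, so the application code never handles intermediate memory itself.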
Any ANSI C application can be compiled and run as is; kernels can then be mapped to the DPU step by step for performance optimization. Because the code remains in C, algorithmic changes can be introduced quickly without unraveling assembly or RTL, and source-tree management becomes trivial.
For Kernels, a few StreamC™ keywords define the stream, and intrinsics are available for DSP operations that have no standard C equivalent, such as the dot product.
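To see why a dot product benefits from an intrinsic, consider what portable C requires. The function below is a plain C reference, not a StreamC intrinsic; `dot_product_ref` is a hypothetical name, and the 16-bit sample width with a 32-bit accumulator is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference dot product over 16-bit samples with a 32-bit accumulator.
 * Standard C has no single operation for this, so it is written as an
 * explicit multiply-accumulate loop. The caller is assumed to keep n
 * small enough that the sum fits in 32 bits. */
int32_t dot_product_ref(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

An intrinsic would let the programmer request this multiply-accumulate directly, rather than relying on the compiler to recognize the loop pattern.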
For multiple threads and processes, conventional OS-level task switching is used and SPI provides an application framework and plug-in DSP libraries.

Please also refer to the section on Tools.
Scalability
The architecture's bandwidth efficiency allows it to directly leverage higher transistor densities. The ratio of energy consumption and latency between global and local memory access increases exponentially, further emphasizing the necessity of exploiting locality to conserve data bandwidth.
The number of lanes can be expanded and thread-level parallelism can be added at the top level to provide scaling beyond 100 TeraOPs in a single affordable, energy-efficient chip.
An important aspect of scalability is software compatibility. Traditional multi-core devices normally require a rewrite of the code to make use of additional cores or extra cache memory. The Stream Processor Architecture and programming model are core-agnostic; the same program can run on any number of lanes after recompilation.
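The lane-agnostic property can be illustrated with a small simulation in plain C. The names `sum_over_lanes` and `MAX_LANES` are hypothetical, and the lane count is passed at run time here only so that several values can be tried in one build; on the real toolchain it would be fixed at compile time, as the text says.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 64  /* hypothetical cap for this simulation */

/* Sum n records using `lanes` simulated lanes. Record i goes to lane
 * i % lanes; each lane keeps a private partial sum, and the partials
 * are combined at the end. The source is the same for any lane count. */
long sum_over_lanes(const int *in, size_t n, size_t lanes)
{
    long partial[MAX_LANES] = {0};
    long total = 0;

    assert(in != NULL || n == 0);
    assert(lanes >= 1 && lanes <= MAX_LANES);

    for (size_t i = 0; i < n; i++)
        partial[i % lanes] += in[i];

    for (size_t l = 0; l < lanes; l++)
        total += partial[l];
    return total;
}
```

Calling `sum_over_lanes` with 1, 8, or 64 lanes returns the same total for the same input; only the distribution of work changes, which is the sense in which the program itself does not need rewriting.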