

Stream Processing Frequently Asked Questions




General

Q: How do SPI's stream processors compare with traditional DSPs?

A: SPI's stream processor is highly parallel and keeps thousands of operations in flight every cycle. A traditional DSP relies on large caches to mitigate memory latency and burdens the programmer with managing DMA, whereas SPI uses a compiler-managed, distributed memory hierarchy with background data movement and execution scheduling.

Q: Is there a simple analogy?

A: Imagine doing a plumbing project ... you start and see that you need a pipe
... so you drive to the store ...
come back with the pipe and discover you need a fitting
... so you drive to the store ...
come back with the fitting
... and discover you need solder ... (etc.)

This is very inefficient, and it is analogous to how conventional caches work in a traditional DSP or processor. Because main memory is hundreds of processor cycles away, it is critical to be smart about when to drive across town to the plumbing store. For a parallel processor, each trip could mean tens of thousands of lost operations! So, what to do? You make a shopping list! Instead of fetching data from main memory at the moment you discover it is needed, you determine what you need up front (the programming model + SPI compiler) and use a DMA engine that gets it for you just in time (the DPU dispatcher).
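The shopping-list idea can be sketched in plain C as double buffering: the next block is requested before the current one is processed. This is only an illustration under stated assumptions. The names fetch_block, sum_double_buffered, and BLOCK are hypothetical, memcpy stands in for a DMA transfer, and in sequential C the copy and the computation do not actually overlap. In SPI's flow the compiler and dispatcher arrange this automatically, as described above.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 4  /* records per on-chip buffer; illustrative size only */

/* Stand-in for a DMA transfer from "main memory" into an on-chip buffer. */
static void fetch_block(int *dst, const int *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
}

/* Sum the first (n / BLOCK) * BLOCK values of mem using two ping-pong
   buffers. The fetch of block b+1 is issued before block b is processed,
   so hardware with a real DMA engine could overlap transfer and compute.
   Any remainder shorter than one block is ignored in this sketch. */
long sum_double_buffered(const int *mem, size_t n) {
    int buf[2][BLOCK];
    size_t blocks = n / BLOCK;
    long total = 0;

    if (blocks == 0)
        return 0;

    fetch_block(buf[0], mem, BLOCK);          /* prime the first buffer */
    for (size_t b = 0; b < blocks; b++) {
        if (b + 1 < blocks)                   /* request the next block early */
            fetch_block(buf[(b + 1) % 2], mem + (b + 1) * BLOCK, BLOCK);
        for (size_t i = 0; i < BLOCK; i++)    /* compute on data already on chip */
            total += buf[b % 2][i];
    }
    return total;
}
```

The point of the structure is that the inner loop never waits on a fetch it has just discovered it needs; every request was placed one block ahead.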

Q: Are there other products on the market today based on stream processing?

A: Yes. Originally developed at MIT and Stanford by Professor Bill Dally and his team, stream processing is used in graphics processors (GP-GPUs), game console chips (IBM Cell), and high-performance computing (HPC). SPI is the first company to extend stream processing to low-cost, low-power DSP applications.

Q: Is stream processing suitable for any task?

A: SPI's stream processing targets compute-intensive applications that exhibit usable data parallelism and locality of reference, i.e., general signal and media processing.

Q: Why is it more power-efficient?

A: Power consumption in a processor is dominated by data movement to/from external memory and by instruction decode. In stream processing, every bit of data moved to/from memory is scheduled, and the energy to decode instructions is amortized across the lanes. Furthermore, an unusually large set of dedicated register files is connected to each ALU to reduce wire lengths and avoid the shared-register conflicts common in other architectures.

Q: What makes it cost-effective?

A: The cost-efficiency comes from easier programming, data parallelism, and an efficient memory hierarchy. In traditional architectures, cache memory and control logic dominate chip area (cost); in a stream processor, the processing elements (ALUs) dominate. Consequently, SPI's stream processors are able to deliver more compute power per mm².

Q: Why such high performance?

A: The high performance is a result of combining instruction-level parallelism (VLIW) and data parallelism (SIMD) with a high-bandwidth distributed memory hierarchy that keeps thousands of operations in flight at GHz-level speeds. The architecture can sustain more than 10 Tb/s of ALU bandwidth in a cache-miss-free environment to realize high utilization.


Programming


Q: Is it really just C code...? How do I optimize my code?

A: The tool flow is based on ANSI C. Performance-critical kernel functions are tagged with SPI keywords and operate on streams of data. Programmers do what they're good at: algorithmic optimization. The compiler does what it's good at: all of the bookkeeping, such as managing on-chip instruction and data memories, scheduling transfers of streams between DRAM and on-chip memories, and VLIW scheduling.

The profiling tools provide good visibility into stream-level transfers and VLIW scheduling for kernels. Typical optimization steps involve identifying the kernels and using intrinsics for low-level bit-swizzling and arithmetic.

Q: Can I take C code, compile and run it without any modifications?

A: Yes. Unmodified code runs on the DSP MIPS. To take advantage of the DPU, you then define kernel functions and streams, a process that can be done step by step. The kernel functions are also programmed in C, but they operate on local streams of data (e.g., no global pointers). Kernels may also use intrinsics for specific DSP functions, such as a dot product, that have no standard C equivalent.
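As a point of reference, this is the kind of portable C loop that a dot-product intrinsic would stand in for. It is a minimal sketch: the function name dot_product is hypothetical, and SPI's actual intrinsic names and kernel keywords are not given in this FAQ, so none are shown. The function touches only the buffers passed to it, in the spirit of a kernel that works on local data rather than global pointers.

```c
#include <stddef.h>

/* Portable dot product over two input buffers of n 16-bit samples.
   Products are accumulated in a wider type to limit overflow. */
long dot_product(const short *a, const short *b, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)a[i] * b[i];
    return acc;
}
```

For example, the buffers {1, 2, 3} and {4, 5, 6} give 1*4 + 2*5 + 3*6 = 32.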

Q: How can I make use of my existing DSP code?

A: You can port existing C code and use program structures similar to those on a conventional DSP. Porting from a traditional DSP to SPI may be simpler than porting between two traditional DSPs, because no hardware-dependent assembly or cache/DMA programming is involved.

Q: What is a stream?

A: A stream is a data record of finite length, like a normal C buffer. It is defined by a base pointer and an access pattern. In video and image processing, it is often a line of macroblocks or a swath of data from a larger external array, such as a video frame or scanned image.

Streams 'live' in the Lane Register Files (LRF). The compiler allocates LRF space based on stream lifetimes, supported by runtime DMA. In Storm-1, the maximum stream size is 128 KBytes.
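To make "a base pointer and an access pattern" concrete, here is a hedged sketch in plain C. The stream_desc struct and the stream_gather function are hypothetical and are not SPI's API. They model a strided pattern that pulls a swath out of a larger array into a contiguous local buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stream descriptor: a base pointer plus a strided pattern. */
typedef struct {
    const unsigned char *base;  /* first byte of the stream in the external array */
    size_t record_bytes;        /* bytes per record */
    size_t stride_bytes;        /* distance between consecutive record starts */
    size_t count;               /* number of records in the stream */
} stream_desc;

/* Gather the described records into a contiguous local buffer.
   Returns the number of bytes copied, or 0 if dst is too small. */
size_t stream_gather(const stream_desc *s, unsigned char *dst, size_t dst_bytes) {
    size_t total = s->record_bytes * s->count;
    if (total > dst_bytes)
        return 0;
    for (size_t r = 0; r < s->count; r++)
        memcpy(dst + r * s->record_bytes,
               s->base + r * s->stride_bytes,
               s->record_bytes);
    return total;
}
```

With a 4x4 byte frame, a descriptor with base = frame + 1, record_bytes = 2, stride_bytes = 4, and count = 4 describes a two-pixel-wide vertical swath starting at column 1.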

Q: What is a kernel function?

A: A kernel function is a performance-critical function that processes and produces streams. Typical kernel functions include filters, transforms, and motion estimation. Several kernel functions typically form a pipeline, keeping streams local without touching external memory.
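A pipeline of kernels can be pictured with ordinary C functions that hand a small local buffer from one stage to the next. This is a sketch under assumptions: kernel_scale, kernel_clamp, pipeline_strip, and STRIP_MAX are hypothetical names, and a stack array stands in for on-chip stream storage.

```c
#include <stddef.h>

#define STRIP_MAX 64  /* assumed capacity of the local intermediate buffer */

/* Kernel 1: multiply every sample by a gain. */
static void kernel_scale(const int *in, int *out, size_t n, int gain) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain;
}

/* Kernel 2: clamp every sample to the range [lo, hi]. */
static void kernel_clamp(const int *in, int *out, size_t n, int lo, int hi) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] < lo ? lo : (in[i] > hi ? hi : in[i]);
}

/* Run both kernels on one strip of at most STRIP_MAX samples. The
   intermediate result stays in tmp and is never written back to the
   caller's (external) memory. Returns the number of samples processed. */
size_t pipeline_strip(const int *in, int *out, size_t n,
                      int gain, int lo, int hi) {
    int tmp[STRIP_MAX];
    if (n > STRIP_MAX)
        n = STRIP_MAX;
    kernel_scale(in, tmp, n, gain);
    kernel_clamp(tmp, out, n, lo, hi);
    return n;
}
```

Only the input and the final output cross the function boundary; the intermediate stream lives and dies inside the pipeline.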

Q: How do I move streams to/from the lanes?

A: The compiler and hardware automate the process based on each stream's definition and dependencies. Streams are loaded in the background, concurrently with execution, for minimal overhead. In the standard case, the first data record of the stream is moved to Lane 0, the next record to Lane 1, and so on. Overlapping, redundant data is loaded only once, and the stream definition can express more sophisticated access patterns, such as strides and indexing, that the hardware DMA supports directly.
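The standard record-to-lane mapping can be sketched as a round-robin distribution. Two things here are assumptions: the lane count (NUM_LANES is an arbitrary illustrative value, not a Storm-1 figure) and the wrap back to Lane 0 after the last lane, which the answer above implies but does not state. The distribute function is hypothetical.

```c
#include <stddef.h>

#define NUM_LANES 4       /* assumed lane count, for illustration only */
#define MAX_PER_LANE 16   /* assumed per-lane capacity in records */

/* Distribute n records across lanes: record 0 to lane 0, record 1 to
   lane 1, and so on, wrapping around after the last lane (assumption).
   counts[l] receives the number of records placed in lane l.
   Returns 0 on success, -1 if the records do not fit. */
int distribute(const int *stream, size_t n,
               int lanes[NUM_LANES][MAX_PER_LANE],
               size_t counts[NUM_LANES]) {
    if (n > (size_t)NUM_LANES * MAX_PER_LANE)
        return -1;
    for (size_t l = 0; l < NUM_LANES; l++)
        counts[l] = 0;
    for (size_t i = 0; i < n; i++) {
        size_t lane = i % NUM_LANES;
        lanes[lane][counts[lane]++] = stream[i];
    }
    return 0;
}
```

With six records and four lanes, lanes 0 and 1 each hold two records and lanes 2 and 3 hold one each.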

Q: What about running multiple applications at the same time?

A: Multitasking is supported at the thread level by time-multiplexing, in the same way as on a single-core device.


Hardware architecture

Q: How is it different from a traditional multi-core processor?

A: SPI's stream processors combine instruction-level parallelism (VLIW) and data parallelism (SIMD). One program runs at a time across all lanes ("cores"), rather than different programs/threads on each core. For video, image, and signal processing, data-parallel execution is more efficient than thread-parallel execution due to implicit load balancing and synchronization. Die size and power are also considerably lower at the same level of performance, because there is a single instruction sequencer and instruction memory.

Q: Is a stream processor a reconfigurable type processor?

A: No. A stream processor is a conventional processor that runs a software program. There is no "reconfiguration time" or proprietary hardware-centric programming involved.

Q: Does the hardware use specific acceleration blocks?

A: No. The raw performance and an efficient instruction set make it possible to avoid fixed-function blocks that otherwise might degrade the flexibility and value of a DSP.


Applications

Q: I can see that this architecture can work well for simple data-parallel algorithms, but what about applications with more complex control code, conditionals and data dependencies?

A: The SPI architecture is designed to run complex algorithms. As an example, an HD H.264 codec has been ported, and imaging applications with serial dependencies, such as error diffusion, map well to the architecture.
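To show what a serial dependency looks like, here is a simplified one-dimensional error diffusion in plain C. It is a sketch only: the function name diffuse_row and the threshold of 128 are assumptions, real error-diffusion filters also spread error to neighboring rows, and this FAQ does not describe how SPI maps such code onto the lanes.

```c
#include <stddef.h>

/* Binarize one row of 8-bit pixels to 0 or 255. The quantization error
   of each pixel is carried into the next one, so pixel i cannot be
   finished before pixel i-1: that is the serial dependency. */
void diffuse_row(const unsigned char *in, unsigned char *out, size_t n) {
    int err = 0;
    for (size_t i = 0; i < n; i++) {
        int v = in[i] + err;                      /* add error from the left */
        out[i] = (v >= 128) ? 255 : 0;            /* quantize to black or white */
        err = v - out[i];                         /* error passed to the right */
    }
}
```

In this one-dimensional variant each row depends only on itself, so separate rows could be processed independently even though the pixels within a row cannot.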

Q: What libraries are available?

A: SPI and ecosystem partners provide a growing and comprehensive set of kernels and complete applications for general DSP functions, video, and imaging. Examples of functions are FIR, DCT, FFT, convolutions, MIMO receiver, H.264, MPEG-4, MJPEG and audio codecs.