Parallel Computing Architectures and Systems Explained

Posted by Anonymous and classified in Computers

Vector Processors: Characteristics

A vector processor is a processor designed to operate on entire arrays (vectors) of data rather than on one element at a time. It is most efficient when the same operation is applied repeatedly across a large data set.

Key Characteristics

  • Vector Registers: Capable of holding a complete vector of operands; supports simultaneous operations on all stored elements.
  • Vectorized and Pipelined Functional Units: Perform the same operation (e.g., addition) on all elements of a vector or between pairs of elements from two vectors in a SIMD fashion.
  • Vector Instructions: Operate on entire vectors, not just scalars. One instruction handles an entire block of elements, improving performance by reducing instruction count.
  • Interleaved Memory: Uses multiple independent memory banks to allow fast loading and storing of vector elements by minimizing memory access delay.
  • Strided Memory Access & Scatter/Gather: Strided access allows accessing elements at regular intervals (e.g., every 4th item). Scatter/Gather enables reading or writing elements at irregular memory locations using dedicated hardware support.
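As a rough illustration of the last item, the sketch below models memory as a flat Python list and spells out, element by element, what a strided load, a gather, and a scatter do. The function names (`strided_load`, `gather`, `scatter`) are invented for this example; real vector hardware issues these accesses as single instructions rather than Python loops.

```python
def strided_load(memory, base, stride, count):
    """Read `count` elements starting at `base`, stepping by `stride`."""
    return [memory[base + i * stride] for i in range(count)]


def gather(memory, indices):
    """Read elements from arbitrary (irregular) positions into a dense vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Write a dense vector back to arbitrary positions in memory."""
    for i, v in zip(indices, values):
        memory[i] = v


memory = list(range(100, 116))          # 16 words: 100, 101, ..., 115
every_fourth = strided_load(memory, base=0, stride=4, count=4)
picked = gather(memory, [7, 2, 13])
scatter(memory, [7, 2, 13], [0, 0, 0])  # zero the same irregular positions
```

Here `every_fourth` holds every 4th word starting at address 0, and `picked` holds the three words at the irregular addresses 7, 2, and 13.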

Distributed Memory Systems

Distributed-memory systems are computer architectures in which each processor or node has its own local memory, and processors communicate by passing messages over a network.
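The message-passing style can be sketched on a single machine, assuming threads stand in for nodes and `queue.Queue` objects stand in for the network. This is only a simulation: the names `node` and `run_cluster` are hypothetical, and the `results` dictionary is a convenience for returning the answer, not part of the model. Each simulated node works only on its own chunk and shares data solely by sending an explicit message.

```python
import queue
import threading


def node(rank, local_data, inboxes, results):
    """One simulated node: it touches only its own data and its own inbox."""
    partial = sum(local_data)            # work on local memory only
    if rank != 0:
        inboxes[0].put((rank, partial))  # explicit send to node 0
    else:
        total = partial
        for _ in range(len(inboxes) - 1):
            _sender, value = inboxes[0].get()  # explicit (blocking) receive
            total += value
        results["total"] = total


def run_cluster(chunks):
    """Start one simulated node per chunk and return the global sum."""
    inboxes = [queue.Queue() for _ in chunks]
    results = {}
    threads = [
        threading.Thread(target=node, args=(rank, chunk, inboxes, results))
        for rank, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["total"]


total = run_cluster([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
```

On a real cluster the same pattern would typically be written with a message-passing library, with the send and receive crossing the interconnect instead of an in-process queue.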

System Components

  • Cluster: The most common form of distributed-memory system, made up of multiple commodity systems (like standard PCs) connected via a commodity interconnection network, such as Ethernet. Each computer is known as a node.
  • Nodes: Computational units in the system. In modern systems, each node is often itself a shared-memory system (e.g., a multicore processor).
  • Hybrid Systems: These combine distributed-memory architecture at the cluster level with shared-memory systems within each node, enhancing performance and scalability.
  • Grid Computing: Connects geographically distributed computers into a single system. Grids are typically heterogeneous, meaning nodes may use different hardware, software, and operating systems.

Advantages and Disadvantages

  • Advantages: Scales well to large parallel workloads, capacity can be increased by adding nodes, and the failure of one node need not bring down the whole system.
  • Disadvantages: Programming is more complex because communication must be coded explicitly, network latency and bandwidth limit performance, and processes require careful synchronization.

NUMA Multicore Systems

In a Non-Uniform Memory Access (NUMA) system, each processor (or group of cores) is directly attached to its own block of local memory. A core reaches its local memory faster than remote memory, because an access to any other block must travel across the interconnect.
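The cost of remote accesses can be estimated with a simple weighted average. The sketch below assumes just two latency classes (local and remote), and the nanosecond figures are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Expected memory access time when a fraction of accesses stay local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, not measurements of any real machine.
LOCAL_NS = 100.0
REMOTE_NS = 200.0

mostly_local = average_access_time(0.9, LOCAL_NS, REMOTE_NS)   # about 110 ns
mostly_remote = average_access_time(0.1, LOCAL_NS, REMOTE_NS)  # about 190 ns
```

Under these assumed numbers, keeping 90% of accesses local gives an average close to the local latency, which is why NUMA-aware programs try to place data near the cores that use it.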

Advantages

  • Faster access to local memory.
  • More scalable than UMA.
  • Potential to use larger memory spaces.

Uniform Memory Access (UMA)

In UMA systems, all cores access memory with equal latency and bandwidth, sharing a common memory through a single interconnect. UMA systems are easier to program, but the shared interconnect can become a bottleneck as the number of cores increases.

Cache Coherence Problem

In a shared-memory multicore system, each core has its own private cache. When the same memory location is stored in multiple caches and one core updates it, other cores may continue to use the old value, leading to incorrect results.
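The problem can be reproduced with a toy model. The `Core` class below is a hypothetical stand-in for a core with a private cache and no coherence protocol at all; it assumes a write-through policy, so main memory is always up to date, yet a stale read still occurs.

```python
class Core:
    """A core with a private cache that is never told about remote writes."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory (a dict: address -> value)
        self.cache = {}        # private cache (address -> cached value)

    def read(self, addr):
        if addr not in self.cache:          # miss: fetch from main memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]             # hit: served from the private copy

    def write(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value           # write-through to main memory


memory = {"x": 2}
core0, core1 = Core(memory), Core(memory)

core0.read("x")              # both cores now hold x = 2 in their caches
core1.read("x")
core0.write("x", 7)          # core 0 updates x; main memory now holds 7
stale = core1.read("x")      # core 1 still hits its old cached copy
```

After the write, `memory["x"]` is 7 but `stale` is still 2, because nothing told core 1 that its cached copy was out of date.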

Approaches to Cache Coherence

  • Snooping Cache Coherence: All caches monitor (snoop) the shared interconnect. When a core updates a cache line, it broadcasts a message, and the other cores invalidate their copies. Because every update requires a broadcast, this approach does not scale to large systems.
  • Directory-Based Cache Coherence: A directory records which cores hold a copy of each cache line. Only the cores that hold the line are informed of an update, which avoids unnecessary broadcasts and lets this approach scale to large systems.
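The difference between the two approaches can be made concrete by counting how many caches must be notified on a write. The sketch below is a toy write-invalidate model, not a real protocol: the class `CoherentSystem`, the helper `run`, and the `notifications` counter are all assumptions made for illustration.

```python
class CoherentSystem:
    """Toy write-invalidate model that counts invalidation notifications."""

    def __init__(self, num_cores, use_directory):
        self.memory = {}
        self.caches = [dict() for _ in range(num_cores)]
        self.use_directory = use_directory
        self.sharers = {}        # directory: address -> set of core ids
        self.notifications = 0   # caches notified of a write so far

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            cache[addr] = self.memory.get(addr, 0)
            self.sharers.setdefault(addr, set()).add(core)
        return cache[addr]

    def write(self, core, addr, value):
        others = [c for c in range(len(self.caches)) if c != core]
        if self.use_directory:
            # Notify only the cores the directory lists as holding the line.
            targets = [c for c in others if c in self.sharers.get(addr, set())]
        else:
            # Snooping: every other cache sees the broadcast.
            targets = others
        self.notifications += len(targets)
        for c in targets:
            self.caches[c].pop(addr, None)   # invalidate any stale copy
        self.sharers[addr] = {core}
        self.caches[core][addr] = value
        self.memory[addr] = value


def run(use_directory, num_cores=16):
    system = CoherentSystem(num_cores, use_directory)
    system.read(0, "x")
    system.read(1, "x")
    system.write(0, "x", 7)          # only core 1 holds another copy
    return system.notifications, system.read(1, "x")


snoop_notified, snoop_value = run(use_directory=False)
directory_notified, directory_value = run(use_directory=True)
```

In this 16-core toy, both variants let core 1 read the new value, but the snooping variant notifies all 15 other caches while the directory variant notifies only the one cache that actually holds the line.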

Parallel Programming

Parallel programming is a computing paradigm in which a problem is divided into subtasks that are carried out simultaneously.

Types of Parallelism

  • Data Parallelism: Performing the same operation on different pieces of data simultaneously (e.g., applying a filter to every pixel in an image).
  • Task Parallelism: Executing different tasks or functions at the same time (e.g., a computer playing music while downloading a file).
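The two styles can be contrasted in a few lines using Python's standard `concurrent.futures` module. The functions `brighten`, `count_words`, and `checksum` are hypothetical stand-ins for real work. In standard CPython builds, threads do not speed up pure-Python computation like this, so the sketch shows only the structure of each style, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel, amount=40):
    """Same operation applied to every pixel (data parallelism)."""
    return min(pixel + amount, 255)


def count_words(text):
    """One of two unrelated tasks (task parallelism)."""
    return len(text.split())


def checksum(data):
    """The other unrelated task."""
    return sum(data) % 256


pixels = [0, 100, 200, 250]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: one function, many data items.
    brightened = list(pool.map(brighten, pixels))

    # Task parallelism: different functions submitted side by side.
    words_future = pool.submit(count_words, "parallel programs divide work")
    sum_future = pool.submit(checksum, [10, 20, 300])
    word_count = words_future.result()
    check = sum_future.result()
```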

Classifications of Parallel Computers

Flynn’s taxonomy categorizes systems based on the number of instruction and data streams.

SIMD Systems

Single Instruction, Multiple Data architectures apply a single instruction simultaneously to multiple data items. They are highly effective for data-parallel operations but can struggle with conditional logic, which forces some datapaths to idle.
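The cost of conditional logic can be seen in a small simulation, assuming the common approach of running both branches over all lanes with a mask that switches lanes on or off. The function `simd_if_else` and its idle-slot counter are illustrative inventions, and the Python loops stand in for work that hardware would do across all lanes at once.

```python
def simd_if_else(values, predicate, then_op, else_op):
    """Apply then_op where predicate holds and else_op elsewhere, lane by lane.

    Returns the results plus the number of idle lane-slots across both passes.
    """
    mask = [predicate(v) for v in values]
    result = list(values)
    idle = 0

    # Pass 1: the "then" instruction goes to all lanes; masked-off lanes idle.
    for lane, active in enumerate(mask):
        if active:
            result[lane] = then_op(values[lane])
        else:
            idle += 1

    # Pass 2: the "else" instruction; now the other lanes idle.
    for lane, active in enumerate(mask):
        if not active:
            result[lane] = else_op(values[lane])
        else:
            idle += 1

    return result, idle


data = [3, -1, 4, -1, 5, -9, 2, 6]
result, idle_slots = simd_if_else(
    data,
    predicate=lambda v: v > 0,
    then_op=lambda v: v * 2,
    else_op=lambda v: 0,
)
```

With 8 lanes and two passes there are 16 lane-slots in total, and in this model exactly 8 of them are idle, since every lane sits out one of the two branches.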

MIMD Systems

Multiple Instruction, Multiple Data systems allow multiple processors to execute different instructions on different data sets simultaneously. They feature asynchronous execution, no global clock, and high scalability.

MIMD Architectures

  • Shared-Memory MIMD: All processors share a common memory space. Requires synchronization mechanisms like semaphores or locks.
  • Distributed-Memory MIMD: Each processor has its own local memory and communicates via message passing.

Interconnection Networks

The interconnection network connects processors and memory. Performance depends heavily on the efficiency of this network.

Shared-Memory Interconnects

  • Bus: Low cost and flexible, but becomes a bottleneck in large systems.
  • Crossbar: Allows simultaneous access to different memory modules; much faster than a bus but expensive.

Distributed-Memory Interconnects

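  • Direct Interconnects: Each switch is attached directly to a processor-memory pair. Examples include ring, toroidal mesh, and hypercube topologies.
  • Indirect Interconnects: Switches need not be attached directly to a processor. Examples include crossbar and omega networks.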
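The direct topologies named above differ in which nodes each node is wired to. The sketch below computes neighbor sets, assuming nodes are numbered from 0 (or by row and column for the mesh); the function names are made up for this example.

```python
def ring_neighbors(node, size):
    """Neighbors of `node` in a ring of `size` nodes."""
    return sorted({(node - 1) % size, (node + 1) % size})


def torus_neighbors(row, col, rows, cols):
    """Neighbors of (row, col) in a rows x cols toroidal mesh (wraparound links)."""
    return sorted({
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    })


def hypercube_neighbors(node, dimension):
    """Neighbors of `node` in a hypercube with 2**dimension nodes.

    Two nodes are linked exactly when their binary labels differ in one bit.
    """
    return sorted(node ^ (1 << bit) for bit in range(dimension))


ring = ring_neighbors(0, 8)
torus = torus_neighbors(0, 0, 3, 3)
cube = hypercube_neighbors(5, 3)     # node 5 is 101 in binary
```

A ring node has 2 links and a toroidal mesh node has 4 regardless of system size, while a hypercube node has one link per dimension, so its link count grows with the logarithm of the number of nodes.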

Coordinating Processes and Threads

Parallel programs require coordination to be both correct and efficient. This involves load balancing (distributing work evenly across processes or threads) and minimizing communication. In shared-memory systems, threads communicate through shared variables and use synchronization mechanisms such as mutexes, semaphores, or monitors to prevent race conditions.
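These ideas can be combined in one short sketch using Python's standard `threading` module. The helper names `block_ranges` and `parallel_sum` are assumptions made for this example. Work is split into near-equal blocks (load balancing), each thread accumulates privately (minimal communication), and a mutex guards the single update to the shared total (no race condition).

```python
import threading


def block_ranges(total_items, num_workers):
    """Split range(total_items) into near-equal contiguous blocks."""
    base, extra = divmod(total_items, num_workers)
    ranges, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def parallel_sum(values, num_workers=4):
    """Sum `values` with several threads updating one shared total."""
    total = 0
    lock = threading.Lock()

    def worker(start, stop):
        nonlocal total
        partial = sum(values[start:stop])   # private work, no coordination
        with lock:                          # critical section: one thread at a time
            total += partial

    threads = [
        threading.Thread(target=worker, args=block)
        for block in block_ranges(len(values), num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


blocks = block_ranges(10, 4)
answer = parallel_sum(list(range(1, 101)))
```

Each thread takes the lock only once, after finishing its own block, rather than once per element; this keeps time spent waiting on the mutex small.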

Comparison Table

Feature       | Shared-Memory Systems  | Distributed-Memory Systems
Memory        | Single shared space    | Local memory per processor
Communication | Implicit (shared data) | Explicit (message passing)
Programming   | Easier                 | More complex
Scalability   | Poor                   | High
