Introduction
Historically, PC processors were serial processors - they computed one operation at a time - and improvements in performance came largely from increasing the rate at which the processor could perform operations. Over a period of less than 20 years (1983-2002), processor clock frequencies climbed from 5MHz to more than 3GHz.
Multi-core Systems
In 2005, dual-core processors were introduced by both AMD and Intel. More recently, Intel has continued this trend by introducing quad-core (4 core) processors in late 2006. These multi-core processors contain multiple execution cores on the same chip, each of which can independently perform operations, thereby introducing a new level of parallelization to desktop processors. To take advantage of this new computing capability, software must be able to run computations in parallel. For example, the FDTD Solutions parallel engine can split a single simulation into multiple independent computation threads that execute concurrently on multi-core systems.
While current dual/quad-core processors double/quadruple the computation throughput of the processor, some parts of the processor's memory subsystem are shared among the cores (see Figure 1). Because these memory resources are shared, the memory bandwidth - the rate at which RAM can be accessed - may not increase in step with the processing capability. As application performance is typically a function of both computation capability and memory performance, the overall increase in application performance is not likely to scale linearly with the number of cores in the multi-core processor. Although the performance increase with multi-core processors may be less than the number of cores would suggest, there is still a significant performance gain.
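The scaling argument above can be illustrated with a small Amdahl-style model. This is only a sketch under stated assumptions, not a description of how FDTD Solutions behaves: the function name and the `memory_bound_fraction` parameter are hypothetical, and the model assumes that the share of run time spent waiting on the shared memory subsystem does not shrink as cores are added.

```python
def estimated_speedup(n_cores, memory_bound_fraction):
    """Amdahl-style estimate of speedup on a multi-core processor.

    memory_bound_fraction is a hypothetical parameter: the share of the
    single-core run time spent waiting on the shared memory subsystem,
    which this model assumes does not shrink as cores are added.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    if not 0.0 <= memory_bound_fraction <= 1.0:
        raise ValueError("memory_bound_fraction must be between 0 and 1")
    compute_fraction = 1.0 - memory_bound_fraction
    # Only the compute-bound share of the run time is divided among the cores.
    scaled_time = compute_fraction / n_cores + memory_bound_fraction
    return 1.0 / scaled_time


if __name__ == "__main__":
    for cores in (1, 2, 4):
        print(cores, round(estimated_speedup(cores, 0.2), 2))
```

With an assumed memory-bound share of 0.2, the model gives a speedup of about 1.67 on two cores and 2.5 on four: a clear gain, but less than the core count.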
While the memory architecture for multi-core processors may limit total memory bandwidth, it does facilitate very fast sharing of memory data. This is important since some algorithms, including the FDTD algorithm, require a small amount of shared memory access when implemented in parallel. On a multi-core processor system, this communication can be done very quickly, since each core has direct access to the same memory.
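To make the idea of "a small amount of shared memory access" concrete, here is a minimal sketch of a one-dimensional grid update split across threads. It is not the FDTD algorithm or the FDTD Solutions engine: the three-point averaging stencil and the function names are illustrative assumptions. Each thread updates its own block of cells and only needs to read one neighbouring cell on each side of its block from the shared array. Python threads will not show a real speedup here because of the interpreter lock; the point is the access pattern.

```python
import threading


def stencil_step_serial(old):
    """Reference update: each interior cell mixes with its two neighbours."""
    new = old[:]  # end cells are kept fixed
    for i in range(1, len(old) - 1):
        new[i] = 0.5 * old[i] + 0.25 * (old[i - 1] + old[i + 1])
    return new


def stencil_step_threaded(old, n_threads):
    """Same update, with the interior cells divided among n_threads threads."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    new = old[:]  # end cells are kept fixed
    interior = max(len(old) - 2, 0)
    chunk = -(-interior // n_threads)  # ceiling division

    def work(start, stop):
        # Reads old[start - 1] and old[stop], which belong to neighbouring
        # blocks; this is the only data shared between threads.
        for i in range(start, stop):
            new[i] = 0.5 * old[i] + 0.25 * (old[i - 1] + old[i + 1])

    threads = []
    for t in range(n_threads):
        start = 1 + t * chunk
        stop = min(start + chunk, len(old) - 1)
        if start >= stop:
            break
        thread = threading.Thread(target=work, args=(start, stop))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return new
```

Because every thread writes only to its own block of `new` and reads only from `old`, no locking is needed within a step; the threads just have to finish before the next step begins.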
Multi-core fast facts
Multiprocessor Systems
Another type of parallel system that is widely available is the multiprocessor computer, which has multiple processors attached to the same motherboard. Multiprocessor systems come in many flavors, but can be classified as one of two main types: Symmetric Multiprocessor (SMP) and Non-Uniform Memory Access (NUMA). SMP systems are very much like the multi-core processors discussed in the previous section. These systems share a single memory controller/system bus, and have performance characteristics similar to multi-core systems. In this section we will focus on NUMA multiprocessor systems, since they have some properties that make them distinct from a multi-core system.
Non-uniform memory access means that a given processor in a multiprocessor system accesses different parts of the system memory at different speeds. Systems based on the AMD Opteron processor implement this type of architecture. In these systems, each processor has local memory which it accesses at full speed, but each processor may also access the memory of other processors at a reduced speed (see Figure 3). Because each processor has its own local memory, memory access is parallelized in addition to the computation ability. As a result, this type of system has greater memory bandwidth available to each processor, resulting in better scalability of parallel computation.
The caveat to the increased memory bandwidth in the NUMA system is that communication between processors must take place at a reduced speed. On the AMD Opteron based system, this occurs over a HyperTransport bus. While this bus can still be very fast, it can lead to slightly less efficient communication between processes as compared to the dual-core system.
The total memory capacity of the multiprocessor system also benefits from the parallelized memory architecture because each processor can handle approximately the same amount of memory as a single processor system. Current multiprocessor systems can handle 32-64GB of memory.
Figure 4 summarizes the performance characteristics of the multiprocessor system. These systems rank high for memory bandwidth due to the parallelized memory architecture. While the memory capacity is greater than that of the multi-core systems, it is rated moderate because it is still limited by the hardware. The communication efficiency is rated moderate to reflect the overhead in communication between processors.
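The trade-off between local and remote memory access can be sketched with a simple bandwidth model. The function names, parameters and figures below are hypothetical and chosen only for illustration; they are not measurements of any Opteron system. The model assumes that a fraction of a processor's memory traffic goes to another processor's memory at a lower rate, and compares the result with an SMP system in which one shared bus is divided evenly among the processors.

```python
def numa_bandwidth_per_processor(local_bw, remote_bw, remote_fraction):
    """Effective bandwidth seen by one processor in a NUMA system.

    local_bw and remote_bw are in the same (arbitrary) units, e.g. GB/s.
    remote_fraction is the assumed share of traffic that goes to memory
    attached to another processor.
    """
    if local_bw <= 0 or remote_bw <= 0:
        raise ValueError("bandwidths must be positive")
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    # Average time needed to move one unit of data.
    time_per_unit = (1.0 - remote_fraction) / local_bw + remote_fraction / remote_bw
    return 1.0 / time_per_unit


def smp_bandwidth_per_processor(shared_bw, n_processors):
    """Bandwidth per processor when one shared bus is divided evenly."""
    if shared_bw <= 0 or n_processors < 1:
        raise ValueError("shared_bw must be positive and n_processors >= 1")
    return shared_bw / n_processors


if __name__ == "__main__":
    # Illustrative figures only.
    print(numa_bandwidth_per_processor(10.0, 5.0, 0.1))
    print(smp_bandwidth_per_processor(10.0, 4))
```

In this model the NUMA figure stays close to the local rate as long as most accesses are local, whereas the SMP figure falls as processors are added.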
Multiprocessor fast facts
Clusters
The final parallel system that we will discuss is the cluster. A cluster is composed of a number of PCs, workstations or servers connected together via a network. While clusters come in many forms, we will discuss them in a general sense, simply referring to each computer in the cluster as a node. In a cluster, each node is a complete PC with its own processor(s) and memory. Like the NUMA multiprocessor system, the cluster parallelizes memory access as well as computation capability. However, in contrast to the multiprocessor system, the total number of nodes is not limited by hardware, and therefore the cluster provides a system that can scale to very large amounts of memory capacity while retaining high memory bandwidth (see Figure 5).
The penalty that is paid for the large amount of computation capability and memory bandwidth in a cluster is reduced communication efficiency. Communication between processors takes place over a network layer, which is likely to be slower than communication through shared memory or across a HyperTransport bus. To address this limitation, there are various high-speed network interconnects available for clusters that can greatly improve communication efficiency. The FDTD algorithm is well suited for clusters since, for many typical problems, excellent performance can be obtained using an appropriate network communication layer. The degree to which communication efficiency affects the performance of FDTD Solutions depends on the speed of the network hardware as well as the problem size.
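The dependence on network speed and problem size can be illustrated with a rough cost model. This is a sketch under stated assumptions rather than a description of how FDTD Solutions divides a simulation: it assumes a cubic grid split into equal slabs along one axis, that each node exchanges two boundary faces per time step, and that communication does not overlap with computation. The function names and the per-cell time parameters are hypothetical.

```python
def slab_comm_to_compute_ratio(n_cells_per_side, n_nodes):
    """Boundary cells exchanged per cell updated, for one node and one step."""
    if n_cells_per_side < 1 or n_nodes < 1:
        raise ValueError("n_cells_per_side and n_nodes must be at least 1")
    cells_per_node = n_cells_per_side ** 3 / n_nodes
    # A single node has no neighbours, so nothing is exchanged.
    boundary_cells = 0 if n_nodes == 1 else 2 * n_cells_per_side ** 2
    return boundary_cells / cells_per_node


def estimated_parallel_efficiency(n_cells_per_side, n_nodes,
                                  cell_update_time, cell_transfer_time):
    """Fraction of each step spent computing rather than communicating.

    cell_update_time and cell_transfer_time are assumed costs per cell,
    in the same arbitrary time unit.
    """
    ratio = slab_comm_to_compute_ratio(n_cells_per_side, n_nodes)
    compute = cell_update_time
    communicate = ratio * cell_transfer_time
    return compute / (compute + communicate)


if __name__ == "__main__":
    for side in (100, 200, 400):
        print(side, slab_comm_to_compute_ratio(side, 4))
```

For this decomposition the ratio works out to 2 x (number of nodes) / (cells per side), so doubling the grid size halves the relative communication cost, and a faster interconnect (a smaller `cell_transfer_time`) raises the estimated efficiency.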
Cluster fast facts
Choosing the Right System for FDTD Solutions
As FDTD Solutions supports a large range of hardware options, choosing the optimum hardware for your application will depend on the problem size, your budget and your performance goals. We provide the following guidelines to help you choose the best system for your application:
Beyond the above considerations, the following additional recommendations should be followed when purchasing hardware to obtain good performance: