Technology Trends: Parallel Processing and FDTD Solutions
Introduction
Historically, PC processors were serial processors: they computed one operation at a time, and
improvements in performance were largely obtained by increasing the rate at which the processor could
perform operations. Over a period of less than 20 years (1983-2002), processor clock frequencies climbed
from 5MHz to more than 3GHz.
In more recent years, processor vendors have encountered challenges in increasing clock speeds beyond
3GHz and have begun to focus on parallel computation in order to further improve PC performance. The
idea of parallel computation is quite simple: perform multiple operations at the same time to increase
the computation throughput. While the concept is simple, in practice parallel computing requires
specialized software and algorithms engineered for this purpose. Fortunately, many algorithms such as
the FDTD algorithm can be carried out as a parallel computation that can exploit increases in computer
processor capability for years to come. While a variety of parallel computer systems are in common
use today, this paper examines three that can boost the performance of your FDTD Solutions
simulations.
Multi-core Systems
In 2005, dual-core processors were introduced by both AMD and Intel. Intel continued
this trend by introducing quad-core (4-core) processors in late 2006. These multi-core processors contain
multiple execution cores on the same chip, each of which can independently perform operations, thereby introducing a new
level of parallelization to desktop processors. To take advantage of this new computing
capability, the software itself must be able to perform its computations in parallel. For example, the
FDTD Solutions parallel engine has the ability to split a single simulation into multiple independent
computation threads that execute concurrently on multi-core systems.
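As a rough illustration of how a grid-based update can be split into independent pieces, the following Python sketch divides a one-dimensional FDTD-style update among several worker threads. It is a minimal sketch under stated assumptions, not the FDTD Solutions engine: the names split_range, update_e, update_h and run, the coefficient c and the 1D field layout are all hypothetical and chosen only for illustration. Each worker updates its own contiguous range of cells, and all workers finish one half-step before the next begins.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, near-equal sub-ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for p in range(parts):
        hi = lo + base + (1 if p < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def update_e(e, h, lo, hi, c):
    """Update E in cells [lo, hi); reads H only, so ranges are independent."""
    for i in range(lo, hi):
        e[i] += c * (h[i] - h[i - 1])


def update_h(e, h, lo, hi, c):
    """Update H in cells [lo, hi); reads E only, so ranges are independent."""
    for i in range(lo, hi):
        h[i] += c * (e[i + 1] - e[i])


def run(n_cells, n_steps, n_workers, c=0.5):
    """Run a toy 1D leapfrog update, one sub-range of cells per worker."""
    e = [0.0] * n_cells
    h = [0.0] * n_cells
    e[n_cells // 2] = 1.0  # hypothetical initial pulse
    e_chunks = split_range(1, n_cells, n_workers)      # e[0] stays fixed
    h_chunks = split_range(0, n_cells - 1, n_workers)  # h[-1] stays fixed
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_steps):
            # list(...) waits for every worker: a barrier between half-steps.
            list(pool.map(lambda r: update_e(e, h, r[0], r[1], c), e_chunks))
            list(pool.map(lambda r: update_h(e, h, r[0], r[1], c), h_chunks))
    return e, h
```

Because each cell is updated from the same neighbouring values regardless of which worker handles it, run(64, 20, 1) and run(64, 20, 4) return identical fields. In standard CPython builds with the global interpreter lock, threads running pure-Python loops do not execute simultaneously, so this sketch shows the structure of the decomposition rather than an actual speed-up.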
While current dual/quad-core processors double/quadruple the computation throughput of the processor,
some parts of the processor's memory subsystem are shared among the cores (see Figure 1). Due to the
sharing of the memory resources, the memory bandwidth - the rate at which RAM can be accessed -
may not increase in step with the processing capability. As application performance is typically a
function of computation capability and memory performance, the overall increase in application performance
is not likely to scale linearly with the number of cores within the multi-core processor.
Although the performance gain from a multi-core processor may be smaller than the number of cores
would suggest, it is still significant.
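The effect can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the run time of a job is set by whichever is slower, computation or memory traffic, and that only the computation part is divided among the cores while the memory part is shared. The function name and the numbers are hypothetical and are not measurements of any processor.

```python
def estimated_speedup(cores, compute_seconds, memory_seconds):
    """Speed-up over one core when only the compute part scales with cores.

    Assumption: run time = max(compute time, memory time), with the
    memory time unchanged because all cores share one memory bus.
    """
    one_core = max(compute_seconds, memory_seconds)
    many_cores = max(compute_seconds / cores, memory_seconds)
    return one_core / many_cores


if __name__ == "__main__":
    # Hypothetical job: 1.0 s of computation, 0.4 s of memory traffic.
    for cores in (1, 2, 4):
        print(cores, "core(s):", round(estimated_speedup(cores, 1.0, 0.4), 2))
```

With these made-up numbers, two cores give a speed-up of 2 but four cores give only 2.5, because the shared memory traffic becomes the limit.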

Figure 1. A representative diagram of a dual-core system, showing the memory shared between
both of the cores within the processor.
While the memory architecture for multi-core processors may limit total memory bandwidth, it does facilitate
very fast sharing of memory data. This is important since some algorithms, including the FDTD algorithm,
require a small amount of shared memory access when implemented in parallel. On a multi-core processor
system, this communication can be done very quickly, since each core has direct access to the same memory.
The final characteristic that we will examine for the multi-core processor PC is memory capacity, which
is currently limited to the 8-16GB of memory that fits on a motherboard. This capacity
will ultimately limit the size of problem you can simulate with FDTD Solutions.
Figure 2 summarizes the characteristics of the multi-core processor to facilitate comparison of some
important performance metrics with the other parallel computer systems discussed below.

Figure 2. The performance profile of multi-core computing systems.
Multi-core fast facts
- As of Dec 2006, all available dual-core processors share a single memory controller/bus; however, there are variations in
the memory cache systems that can impact performance.
- In late 2006, Intel introduced quad-core processors. Quad-core Xeon processors support a dual
independent system bus which doubles memory bandwidth.
- Total memory bandwidth is determined by the RAM and front side bus (FSB) speeds.
Multiprocessor Systems
Another type of parallel system that is widely available is the multiprocessor computer, which has multiple
processors attached to the same motherboard. Multiprocessor systems come in many flavors, but can be
classified as one of two main types: Symmetric Multiprocessor (SMP) and Non-Uniform Memory Access (NUMA).
SMP systems are very much like the multi-core processors discussed in the previous section. These systems
share a single memory controller/system bus, and have performance characteristics similar to multi-core
systems. In this section we will focus on NUMA multiprocessor systems, since they have some properties
that make them distinct from a multi-core system.
Non-uniform memory access means that a given processor in a multiprocessor system accesses different parts
of the system memory at different speeds. Systems based on the AMD Opteron processor implement this type
of architecture.
In these systems, each processor has local memory which it accesses at full speed, but each processor may
also access the memory of other processors at a reduced speed (see Figure 3). The fact that each processor has its own
local memory means that the memory access has been parallelized in addition to the computation ability. This
important point means that this type of system has greater memory bandwidth available to each processor,
resulting in better scalability of parallel computation.

Figure 3. A representative quad-processor system, showing the parallelized memory
access for each processor.
The caveat to the increased memory bandwidth in the NUMA system is that communication between processors
must take place at a reduced speed. On AMD Opteron-based systems, this occurs over a HyperTransport
bus. While this bus is still very fast, it can lead to slightly less efficient communication between
processes than on a dual-core system.
The total memory capacity on the multiprocessor system also benefits from the parallelized memory
architecture because each processor can handle approximately the same amount of memory as a single
processor system. Current multiprocessor systems can handle 32-64GB of memory.
Figure 4 summarizes the performance characteristics of the multiprocessor system. These systems rank
high for memory bandwidth due to the parallelized memory architecture. Memory capacity, although
greater than that of multi-core systems, is rated moderate because it is still limited by
the hardware. Communication efficiency is also rated moderate, reflecting the overhead in communication between processors.

Figure 4. The performance profile of multiprocessor computing systems.
Multiprocessor fast facts
- Many multiprocessor systems contain multi-core processors.
- All systems based on the AMD Opteron processor are based on the NUMA architecture.
- Some systems based on the Intel Xeon processor are based on the NUMA architecture.
- Currently available multiprocessor systems support 2, 4 and 8 processors.
Clusters
The final parallel system that we will discuss is the cluster. A cluster is composed of a number of PCs,
workstations or servers connected together via a network. While clusters come in many forms, we will discuss
them in a general sense simply referring to each computer in the cluster as a node.
In a cluster, each node is a complete PC with its own processor(s) and memory. Like the NUMA multiprocessor
system, the cluster parallelizes memory access as well as computation capability. However, in contrast to
the multiprocessor system, the number of total nodes is not limited by hardware, and therefore the cluster
provides a system that can scale to very large amounts of memory capacity while retaining high memory
bandwidth (see Figure 5).

Figure 5. A representative cluster with scalable memory and computation capabilities.
The penalty that is paid for the large amount of computation capability and memory bandwidth in a cluster
is that the communication efficiency is reduced. The communication between processors is done via a
network layer which is likely to be slower than communication through shared memory or across a
HyperTransport bus. To address this limitation, various high-speed network interconnects
are available for clusters that can greatly improve communication efficiency. The FDTD algorithm is
well suited for clusters since, for many typical problems, excellent performance can be obtained using an
appropriate network communication layer. The degree to which communication efficiency affects the
performance of FDTD Solutions is dependent on the speed of the network hardware as well
as the problem size.
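The dependence on problem size and network speed can be sketched with a simple cost model. The Python sketch below assumes each node holds a cubic block of cells, that computation per time step grows with the number of cells in the block (side cubed), and that communication grows with the number of cells on its six faces (side squared), plus a fixed latency. All names and parameter values are hypothetical, and the model ignores any overlap of communication with computation.

```python
def parallel_efficiency(cells_per_side, seconds_per_cell, bytes_per_face_cell,
                        network_bytes_per_second, latency_seconds):
    """Fraction of each time step spent computing rather than communicating.

    Assumptions: a cubic block of cells per node, six neighbouring blocks,
    and no overlap between communication and computation.
    """
    n = cells_per_side
    compute = n ** 3 * seconds_per_cell
    face_bytes = 6 * n ** 2 * bytes_per_face_cell
    communicate = latency_seconds + face_bytes / network_bytes_per_second
    return compute / (compute + communicate)


if __name__ == "__main__":
    # Hypothetical hardware figures, for illustration only.
    for side in (50, 100, 200):
        for bandwidth in (1e8, 1e9):
            eff = parallel_efficiency(side, 1e-8, 24, bandwidth, 1e-4)
            print(f"side={side:4d} bandwidth={bandwidth:.0e} B/s "
                  f"efficiency={eff:.2f}")
```

In this model the ratio of communication to computation falls roughly as one over the block side length, so larger problems per node and faster interconnects both raise the efficiency.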
Perhaps one of the greatest strengths of the cluster is its enormous memory capacity. The total amount
of memory is limited only by what is practical for your application and budget. As
testimony to this, many clusters have been built that contain hundreds of GB, or even TB, of memory.
Given the flexibility that this scalability provides, clusters may be
the only practical choice for extremely large FDTD problems.
Figure 6 below summarizes the performance characteristics of clustered computing systems. Both memory
bandwidth and memory capacity rank high due to the scalability of a cluster. The communication
efficiency ranks low to reflect the increased challenges in communication between nodes.

Figure 6. The performance profile of cluster computing systems.
Cluster fast facts
- FDTD Solutions has been run on clusters with more than 1000 processors.
- Many clusters use multiprocessor and multi-core systems as the nodes.
- FDTD Solutions supports a large number of network interconnect technologies commonly used in
clusters, such as Ethernet, Gigabit Ethernet, Myrinet, InfiniBand, InfiniPath, and Quadrics.
Choosing the Right System for FDTD Solutions
As FDTD Solutions supports a large range of hardware options, choosing the optimum hardware for your
application will depend on the problem size, your budget and your performance goals. We provide the
following guidelines to help you choose the best system for your application:
- Determine the maximum amount of memory your typical design problem requires. This can be determined by
setting up some test problems in the FDTD Solutions graphical layout editor and by using the
'Check memory requirements' capability in the 'Simulate' menu. Figure 7 can be
used as a guideline to determine which types of systems are appropriate.
- Once you have determined which systems are candidates for your application, you should evaluate which
ones are required to meet your performance needs. In general, the performance of the systems ordered
from lowest to highest is multi-core, multiprocessor and cluster. You may be able to eliminate an option
based on whether performance is an issue for you.
- The final choice will probably come down to a budget decision. Because FDTD Solutions runs on
scalable parallel computers, it is always possible to trade off performance against price.
For parallel systems, you will likely want to purchase the largest number of nodes/processors that fit
within your budget.

Figure 7. Choosing the right parallel system to run FDTD Solutions simulations depends on the memory
requirements and therefore the problem size of your application, and the performance required.
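The memory-based first step of this selection can be written down as a simple lookup. The sketch below uses the upper memory capacities quoted earlier in this paper (16GB for a multi-core PC and 64GB for a multiprocessor system) as thresholds; these thresholds are an assumption standing in for Figure 7, and the function name is hypothetical.

```python
# Upper memory capacities quoted in this paper (GB); treated here as
# assumed thresholds, not as values read from Figure 7.
MULTI_CORE_MAX_GB = 16
MULTIPROCESSOR_MAX_GB = 64


def candidate_systems(required_memory_gb):
    """Return the system types whose memory capacity covers the requirement."""
    if required_memory_gb <= 0:
        raise ValueError("required memory must be positive")
    candidates = []
    if required_memory_gb <= MULTI_CORE_MAX_GB:
        candidates.append("multi-core")
    if required_memory_gb <= MULTIPROCESSOR_MAX_GB:
        candidates.append("multiprocessor")
    candidates.append("cluster")  # capacity scales with the number of nodes
    return candidates
```

For example, candidate_systems(40) returns ['multiprocessor', 'cluster'], after which performance needs and budget narrow the choice.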
Beyond the above considerations, the following additional recommendations should be followed when
purchasing hardware to obtain good performance:
- Memory access speed is very important in modern microprocessors. Higher front side bus (FSB) speeds
translate into faster memory performance. For new systems, we recommend choosing a system with a
1GHz or greater FSB.
- Memory bandwidth is also determined by the bandwidth of the RAM modules. Because of this, choosing
higher bandwidth RAM modules is also recommended. For new systems, we recommend PC2-5300 or greater.
- Dual-core processors come at little added cost and improve performance. We recommend choosing
multi-core processors for new systems.
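The link between transfer rate and memory bandwidth in the first two recommendations is simple arithmetic: peak bandwidth is the number of transfers per second multiplied by the bus width in bytes. The sketch below assumes a 64-bit (8-byte) data path, which is typical of PC memory modules and front side buses of this generation, and treats the quoted 1GHz FSB figure as the effective transfer rate. The function name is hypothetical, and the results are theoretical peaks rather than measured throughput.

```python
def peak_bandwidth_mb_per_s(mega_transfers_per_second, bus_width_bits=64):
    """Theoretical peak bandwidth in MB/s for a given transfer rate."""
    return mega_transfers_per_second * bus_width_bits / 8


if __name__ == "__main__":
    # PC2-5300 modules use DDR2-667 memory: about 667 million transfers/s.
    print("PC2-5300 module:", peak_bandwidth_mb_per_s(667), "MB/s")
    # A front side bus running at 1000 million transfers/s (assumed).
    print("1GHz FSB:", peak_bandwidth_mb_per_s(1000), "MB/s")
```

The first figure is roughly 5300 MB/s, which is where the PC2-5300 module name comes from.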