Lumerical Solutions, Inc.

Technology Trends: Parallel Processing and FDTD Solutions



Introduction
Multi-core systems
Multiprocessor systems
Clusters
Choosing the right system for FDTD Solutions




Introduction

Historically, PC processors were serial processors that computed one operation at a time, and improvements in performance were largely obtained by increasing the rate at which the processor could perform operations. Over a period of less than 20 years (1983-2002), processor clock frequencies climbed from 5MHz to more than 3GHz.

In more recent years, processor vendors have encountered challenges in increasing clock speeds beyond 3GHz and have begun to focus on parallel computation to further improve PC performance. The idea of parallel computation is simple: perform multiple operations at the same time to increase computation throughput. In practice, however, parallel computing requires specialized software and algorithms engineered for this purpose. Fortunately, many algorithms, including the FDTD algorithm, can be carried out as a parallel computation that can exploit increases in computer processor capability for years to come. A variety of parallel computer systems are in common use today; this paper examines three that can boost the performance of your FDTD Solutions simulations.



Multi-core Systems

In 2005, dual-core processors were introduced by both AMD and Intel. Intel continued this trend by introducing quad-core (4 core) processors in late 2006. These multi-core processors contain multiple execution cores on the same chip, each of which can independently perform operations, thereby introducing a new level of parallelization to desktop processors. To take advantage of this computing capability, it is necessary to run software that can perform computations in parallel. For example, the FDTD Solutions parallel engine can split a single simulation into multiple independent computation threads that execute concurrently on multi-core systems.

While current dual-core and quad-core processors double or quadruple the computation throughput of the processor, some parts of the processor's memory subsystem are shared among the cores (see Figure 1). Because the memory resources are shared, the memory bandwidth (the rate at which RAM can be accessed) may not increase in step with the processing capability. As application performance is typically a function of both computation capability and memory performance, the overall increase in application performance is not likely to scale linearly with the number of cores. Although the gain may be smaller than the number of cores alone would suggest, it is still significant.


Figure 1. A representative diagram of a dual-core system, showing the memory shared between both of the cores within the processor.
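
The sub-linear scaling described above can be illustrated with a rough Amdahl's-law-style estimate. This is only a sketch under a strong assumption: the arithmetic part of the run time divides evenly among the cores, while the part spent waiting on the shared memory bus does not speed up at all. The function name and the 70% compute fraction are hypothetical and are not measurements of FDTD Solutions.

```python
def estimated_speedup(cores, compute_fraction):
    """Estimate speedup over one core when only arithmetic scales.

    compute_fraction is the (hypothetical) share of single-core run time
    spent on arithmetic. The remainder is memory traffic over a shared
    bus and is assumed not to get any faster with more cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    memory_fraction = 1.0 - compute_fraction
    return 1.0 / (compute_fraction / cores + memory_fraction)


if __name__ == "__main__":
    # Assumed workload: 70% arithmetic, 30% shared-bus memory traffic.
    for cores in (1, 2, 4):
        print(cores, "core(s):", round(estimated_speedup(cores, 0.7), 2))
```

With these assumed numbers, the estimate for four cores is well below 4x, which matches the qualitative point made above.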

While the memory architecture for multi-core processors may limit total memory bandwidth, it does facilitate very fast sharing of memory data. This is important since some algorithms, including the FDTD algorithm, require a small amount of shared memory access when implemented in parallel. On a multi-core processor system, this communication can be done very quickly, since each core has direct access to the same memory.
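
To make the "small amount of shared memory access" concrete, here is a minimal sketch of a one-dimensional FDTD update split into two subdomains. It is a textbook-style illustration in normalized units, not the FDTD Solutions parallel engine; all function names and parameters are hypothetical. Each time step, the two halves exchange just one field value in each direction at their common boundary, and the split result is compared with the unsplit one.

```python
import math


def step_serial(e, h, c=0.5):
    """One leapfrog step on the whole grid; e[0] and e[-1] stay fixed."""
    n = len(e)
    for i in range(n - 1):
        h[i] += c * (e[i + 1] - e[i])
    for i in range(1, n - 1):
        e[i] += c * (h[i] - h[i - 1])


def step_split(e_l, h_l, e_r, h_r, c=0.5):
    """The same step on two subdomains that share one boundary."""
    m = len(e_l)
    e_halo = e_r[0]  # exchange 1: right subdomain sends its first E value
    for i in range(m):
        e_next = e_l[i + 1] if i + 1 < m else e_halo
        h_l[i] += c * (e_next - e_l[i])
    for j in range(len(h_r)):
        h_r[j] += c * (e_r[j + 1] - e_r[j])
    h_halo = h_l[-1]  # exchange 2: left subdomain sends its last H value
    for i in range(1, m):
        e_l[i] += c * (h_l[i] - h_l[i - 1])
    for j in range(len(e_r) - 1):
        h_prev = h_r[j - 1] if j > 0 else h_halo
        e_r[j] += c * (h_r[j] - h_prev)


def run_demo(n=40, m=15, steps=60):
    """Run both versions from the same Gaussian pulse and return the E fields."""
    e = [math.exp(-((i - 10) / 3.0) ** 2) for i in range(n)]
    e[0] = e[-1] = 0.0
    h = [0.0] * (n - 1)
    e_l, h_l = e[:m], h[:m]   # left subdomain owns E[0..m-1], H[0..m-1]
    e_r, h_r = e[m:], h[m:]   # right subdomain owns the rest
    for _ in range(steps):
        step_serial(e, h)
        step_split(e_l, h_l, e_r, h_r)
    return e, e_l + e_r


if __name__ == "__main__":
    serial, split = run_demo()
    print("max difference:", max(abs(a - b) for a, b in zip(serial, split)))
```

In this sketch the two subdomains are updated one after the other in a single process so that the bookkeeping can be checked. On a multi-core system each subdomain would run on its own core, and the two exchanged values would simply be read from shared memory.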

The final characteristic that we will examine for the multi-core processor PC is the memory capacity. The memory capacity, which is currently limited to the 8-16GB of memory that fits on a motherboard, will ultimately limit the size of problem you can simulate with FDTD Solutions. Figure 2 summarizes the characteristics of the multi-core processor to facilitate comparison of some important performance metrics with the other parallel computer systems discussed below.


Figure 2. The performance profile of multi-core computing systems.

Multi-core fast facts
  • As of Dec 2006, all available dual-core processors share a single memory controller/bus; however, there are variations in the memory cache systems that can impact performance.
  • In late 2006, Intel introduced quad-core processors. Quad-core Xeon processors support a dual independent system bus which doubles memory bandwidth.
  • Total memory bandwidth is determined by the RAM and front side bus (FSB) speeds.



Multiprocessor Systems

Another type of parallel system that is widely available is the multiprocessor computer, which has multiple processors attached to the same motherboard. Multiprocessor systems come in many flavors, but can be classified as one of two main types: Symmetric Multiprocessor (SMP) and Non-Uniform Memory Access (NUMA). SMP systems are very much like the multi-core processors discussed in the previous section. These systems share a single memory controller/system bus, and have performance characteristics similar to multi-core systems. In this section we will focus on NUMA multiprocessor systems, since they have some properties that make them distinct from a multi-core system.

Non-uniform memory access means that a given processor in a multiprocessor system accesses different parts of the system memory at different speeds. Systems based on the AMD Opteron processor implement this type of architecture. In these systems, each processor has local memory which it accesses at full speed, but each processor may also access the memory of other processors at a reduced speed (see Figure 3). Because each processor has its own local memory, memory access is parallelized in addition to computation. As a result, this type of system has greater memory bandwidth available to each processor, resulting in better scalability of parallel computation.

Figure 3. A representative quad-processor system, showing the parallelized memory access for each processor.
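
As a rough illustration of why local memory matters on a NUMA system, the sketch below estimates the bandwidth one processor sees when some fraction of its accesses go to its own memory and the rest go to another processor's memory. The model (time-weighted averaging, identical processors, no contention on the links) and all numbers are assumptions chosen for illustration, not measured figures for any Opteron system.

```python
def effective_bandwidth(local_bw, remote_bw, local_fraction):
    """Bandwidth seen by one processor for a mix of local and remote accesses.

    Bandwidths are in the same (arbitrary) units, for example GB/s. The time
    to move a byte is averaged over the two kinds of access, so the result
    is a weighted harmonic mean of the two bandwidths.
    """
    if local_bw <= 0 or remote_bw <= 0:
        raise ValueError("bandwidths must be positive")
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    time_per_byte = local_fraction / local_bw + (1.0 - local_fraction) / remote_bw
    return 1.0 / time_per_byte


def aggregate_bandwidth(processors, local_bw, remote_bw, local_fraction):
    """Total bandwidth if every processor behaves identically (an assumption)."""
    return processors * effective_bandwidth(local_bw, remote_bw, local_fraction)


if __name__ == "__main__":
    # Hypothetical values: 6 units local, 3 units remote, 95% local accesses.
    for n in (1, 2, 4):
        print(n, "processor(s):", round(aggregate_bandwidth(n, 6.0, 3.0, 0.95), 2))
```

In this simplified model the aggregate bandwidth grows with the number of processors as long as most accesses stay local, which is the case for a parallel FDTD calculation in which each process works mainly on its own part of the grid.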

The caveat to the increased memory bandwidth in the NUMA system is that communication between processors must take place at a reduced speed. On AMD Opteron based systems, this occurs over a HyperTransport bus. While this bus can still be very fast, it can lead to slightly less efficient communication between processes as compared to the multi-core system.

The total memory capacity of the multiprocessor system also benefits from the parallelized memory architecture, because each processor can handle approximately the same amount of memory as a single processor system. Current multiprocessor systems can handle 32-64GB of memory.

Figure 4 summarizes the performance characteristics of the multiprocessor system. These systems rank high for memory bandwidth due to the parallelized memory architecture. The memory capacity is greater than that of the multi-core systems, but is ranked moderate because it is still limited by the hardware. The communication efficiency is also ranked moderate, reflecting the overhead in communication between processors.

Figure 4. The performance profile of multiprocessor computing systems.

Multiprocessor fast facts
  • Many multiprocessor systems contain multi-core processors.
  • All systems based on the AMD Opteron processor are based on the NUMA architecture.
  • Some systems based on the Intel Xeon processor are based on the NUMA architecture.
  • Currently available multiprocessor systems support 2, 4 and 8 processors.



Clusters

The final parallel system that we will discuss is the cluster. A cluster is composed of a number of PCs, workstations or servers connected together via a network. While clusters come in many forms, we will discuss them in a general sense simply referring to each computer in the cluster as a node.

In a cluster, each node is a complete PC with its own processor(s) and memory. Like the NUMA multiprocessor system, the cluster parallelizes memory access as well as computation capability. However, in contrast to the multiprocessor system, the number of total nodes is not limited by hardware, and therefore the cluster provides a system that can scale to very large amounts of memory capacity while retaining high memory bandwidth (see Figure 5).



Figure 5. A representative cluster with scalable memory and computation capabilities.

The penalty paid for the large amount of computation capability and memory bandwidth in a cluster is reduced communication efficiency. Communication between processors is done via a network layer, which is likely to be slower than communication through shared memory or across a HyperTransport bus. To address this limitation, various high-speed network interconnects are available for clusters that can greatly improve communication efficiency. The FDTD algorithm is well suited for clusters since, for many typical problems, excellent performance can be obtained using an appropriate network communication layer. The degree to which communication efficiency affects the performance of FDTD Solutions depends on the speed of the network hardware as well as the problem size.

Perhaps one of the greatest strengths of the cluster is its enormous memory capacity. The total amount of memory is limited only by what is practical for your desired application and budget. As testimony to this, many clusters have been built to date that contain hundreds of GB and even TB of memory capacity. Given the flexibility provided by their scalable computing power, clusters may be the only practical choice for extremely large FDTD problems.
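
The dependence of communication overhead on problem size, mentioned above, can be illustrated with a simple surface-to-volume estimate. The sketch assumes that each node holds a cubic block of grid cells, that it updates every cell in the block each time step, and that it exchanges one layer of cells across each of its six faces. These are illustrative assumptions about a generic domain decomposition, not a description of how FDTD Solutions partitions a simulation.

```python
def comm_to_compute_ratio(cells_per_side):
    """Cells exchanged per cell updated for a cubic block on one node.

    Assumes one layer of cells is exchanged on each of the six faces.
    """
    if cells_per_side < 1:
        raise ValueError("cells_per_side must be at least 1")
    volume = cells_per_side ** 3        # cells updated per time step
    surface = 6 * cells_per_side ** 2   # cells exchanged per time step
    return surface / volume


if __name__ == "__main__":
    for side in (25, 50, 100, 200):
        print(side, "cells per side:", comm_to_compute_ratio(side))
```

Under these assumptions the ratio falls as 6 divided by the block side length, so larger problems per node spend proportionally less time communicating, and a slower network matters less.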

Figure 6 below summarizes the performance characteristics of clustered computing systems. Both memory bandwidth and memory capacity rank high due to the scalability in a cluster. The communication efficiency ranks low to reflect the increased challenges in communications between nodes.

Figure 6. The performance profile of cluster computing systems.

Cluster fast facts
  • FDTD Solutions has been run on clusters with more than 1000 processors.
  • Many clusters use multiprocessor and multi-core systems as the nodes.
  • FDTD Solutions supports a large number of network interconnect technologies commonly used in clusters, such as Ethernet, Gigabit Ethernet, Myrinet, InfiniBand, InfiniPath, and Quadrics.



Choosing the Right System for FDTD Solutions

As FDTD Solutions supports a large range of hardware options, choosing the optimum hardware for your application will depend on the problem size, your budget and your performance goals. We provide the following guidelines to help you choose the best system for your application:
  • Determine the maximum amount of memory your typical design problem requires. This can be determined by setting up some test problems in the FDTD Solutions graphical layout editor and by using the 'Check memory requirements' capability in the 'Simulate' menu. Figure 7 can be used as a guideline to determine which types of systems are appropriate.
  • Once you have determined which systems are candidates for your application, you should evaluate which ones are required to meet your performance needs. In general, the performance of the systems ordered from lowest to highest is multi-core, multiprocessor and cluster. You may be able to eliminate an option based on whether performance is an issue for you.
  • The final choice will probably come down to a budget decision. Because FDTD Solutions runs on scalable parallel computers, it is always possible to trade off performance with price. For parallel systems, you will likely want to purchase the largest number of nodes/processors that fit within your budget.



Figure 7. Choosing the right parallel system to run FDTD Solutions simulations depends on the memory requirements and therefore the problem size of your application, and the performance required.
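
The first guideline can be sketched as a small helper that lists candidate systems for a given memory requirement. The thresholds are assumptions taken from the upper ends of the capacity ranges quoted earlier in this paper (16GB for multi-core systems, 64GB for multiprocessor systems); they are not read from Figure 7, and the function name is hypothetical.

```python
# Assumed capacity limits in GB, taken from the ranges quoted in the text.
MULTICORE_MAX_GB = 16
MULTIPROCESSOR_MAX_GB = 64


def candidate_systems(required_gb):
    """Return the system types whose memory capacity covers required_gb.

    The list is ordered as in the text: multi-core, multiprocessor, cluster.
    A cluster is always a candidate because its capacity scales with the
    number of nodes.
    """
    if required_gb <= 0:
        raise ValueError("required_gb must be positive")
    candidates = []
    if required_gb <= MULTICORE_MAX_GB:
        candidates.append("multi-core")
    if required_gb <= MULTIPROCESSOR_MAX_GB:
        candidates.append("multiprocessor")
    candidates.append("cluster")
    return candidates


if __name__ == "__main__":
    for gb in (4, 20, 200):
        print(gb, "GB:", candidate_systems(gb))
```

The memory figure passed in would come from the 'Check memory requirements' capability described above; performance needs and budget then narrow the list further.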

Beyond the above considerations, the following additional recommendations should be followed when purchasing hardware to obtain good performance:
  • Memory access speed is very important in modern microprocessors. Higher front side bus (FSB) speeds translate into faster memory performance. For new systems, we recommend choosing a system with a 1GHz or greater FSB.
  • Memory bandwidth is also determined by the bandwidth of the RAM modules. Because of this, choosing higher bandwidth RAM modules is also recommended. For new systems, we recommend PC2-5300 or greater.
  • Dual-core processors come at little added cost and improve performance. We recommend choosing multi-core processors for new systems.
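
The bandwidth recommendations above can be put into rough numbers. The sketch below computes theoretical peak bandwidth as transfers per second times bus width, and takes the smaller of the front side bus and RAM figures as the limit. The 8-byte (64-bit) bus width, the dual-channel option, and the simple "minimum of the two" model are assumptions for illustration; real sustained bandwidth is lower than these peak values.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bytes=8):
    """Theoretical peak bandwidth in MB/s for a bus or memory module."""
    return mega_transfers_per_s * bus_width_bytes


def system_bandwidth_mb_s(fsb_mt_s, ram_mt_s, channels=1):
    """Peak bandwidth limited by the slower of the FSB and the RAM.

    channels is the number of memory channels working in parallel
    (an assumed configuration parameter).
    """
    fsb = peak_bandwidth_mb_s(fsb_mt_s)
    ram = channels * peak_bandwidth_mb_s(ram_mt_s)
    return min(fsb, ram)


if __name__ == "__main__":
    ddr2_667 = 2000 / 3  # PC2-5300 modules run at about 667 million transfers/s
    print("PC2-5300 module:", round(peak_bandwidth_mb_s(ddr2_667)), "MB/s")
    print("1GHz FSB:", peak_bandwidth_mb_s(1000), "MB/s")
    print("Single channel system:", round(system_bandwidth_mb_s(1000, ddr2_667)), "MB/s")
    print("Dual channel system:", round(system_bandwidth_mb_s(1000, ddr2_667, channels=2)), "MB/s")
```

In this simplified picture a single PC2-5300 channel cannot fill a 1GHz front side bus, while two channels can, which is why both the FSB speed and the RAM bandwidth appear in the recommendations.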


  Copyright 2003-2008 Lumerical Solutions, Inc.
  All rights reserved.