Lumerical Solutions, Inc.
August 21, 2009

Technology Trends: FDTD Solutions 500%+ faster via Intel's Nehalem microarchitecture

Improvements to memory bandwidth yield impressive gains in calculation speed for memory intensive applications like FDTD Solutions


Intel's Nehalem processor technology, now readily available through all major IT vendors, contains major advancements that lead to substantial performance gains for memory intensive applications like FDTD Solutions. In this article, we show the results of performance testing on a range of systems based on the Nehalem technology to demonstrate that, while increases in processor clock rates are slowing, dramatic performance gains can be achieved through parallelism. Relatively inexpensive computing systems, like the Xeon 5500 series HPC workstation tested, offer speed improvements of more than 500% over 12 to 18 month-old workstations. Unlike other classes of high performance hardware, these systems are widely compatible with other design and engineering applications and offer great value for those seeking faster computation speeds in FDTD Solutions and other applications.

 


Background: What is Nehalem?

Nehalem is the codename that Intel uses to refer to their latest microarchitecture. They have deployed this technology into the desktop market under the product name Core i7. The same microarchitecture has also been used in the more powerful Xeon 3500 and 5500 series workstation/server processors. The Xeon editions of the processors are tweaked to allow higher memory speeds and dual processor configurations.

Earlier this year we published an article that compared the Core i7 processor to the previous generation Intel microarchitecture. To ensure consistency, the current study uses the same benchmark tests that were featured in the previous article. As shown in Figure 1, these tests involve real application benchmarks of a CMOS image sensor and a solar cell with FDTD Solutions running in parallel mode. Such realistic application benchmarks are an important aspect of any credible performance study, because the performance of an FDTD solver will vary depending on the size and complexity of the model.

The results of the previous study showed a dramatic 3x performance improvement for processors released in the same calendar year, as shown in Figure 2 below. In the current work, we have expanded the computer test systems to include workstation and HPC server grade Nehalem processors for those seeking even larger performance gains.


Figure 1. Benchmark test simulations included a 5GB CMOS image sensor simulation (top) and a 400MB plasmonic solar cell simulation (bottom).


Figure 2. FDTD Solutions relative computation speed versus processor release date for E3110 and Core i7. The Core i7 has a computation speed around three times higher than the E3110 system.

The Systems

The test systems are designed to cover three distinct categories of processors: the desktop processors (Core i7); the workstation processors (Xeon 3500 series); and the performance workstation and HPC server processors (Xeon 5500 series). The most notable difference between the Core i7 and Xeon processors is the maximum RAM speed. The Xeons can run the RAM at speeds up to 1333MHz, whereas the maximum speed for the Core i7 processor is 1066MHz. All processors use three channels of DDR3 RAM, but the higher RAM clock rate gives the Xeon processors an extra boost in memory bandwidth. The most notable difference between the Xeon 3500 series and Xeon 5500 series is that the 5500 series can be used in a two processor configuration while the 3500 series cannot.

We have also included systems running both Windows Vista and RedHat Enterprise Linux 5. Where we have used Linux systems, we have ensured that a system with identical hardware running Windows is also included to facilitate a direct comparison between the operating systems. The complete system specifications are given in Table 1.

Table 1. The computer systems that were used to test the performance of FDTD Solutions. The short system names defined in the top row of the table help associate the systems with the performance results that follow.

System Name: i7 Windows | i7 Linux | W3570 Windows | W3570 Linux | X5550
Processor: Intel Core i7 920 | Intel Xeon W3570 | Intel Xeon X5550
Number of processors: 1 | 2
Motherboard: Gigabyte GA-EX58-UD3R | Intel WX58BP | Intel S5520SC
Memory: Kingston ValueRAM DDR3 1333MHz; 6GB, Unbuffered, Non-ECC (KVR1333D3N9K3/6G) | 12GB, Unbuffered ECC (KVR1333D3E9SK3/6G)
Video: ASUS EN8400GS
Case: Antec Sonata III 500 | Intel SC5600BASE
Storage: 250GB SATA 3Gb/s hard drive
OS: Windows Vista Business x64 | RedHat Enterprise Desktop x86_64 | Windows Vista Business x64 | RedHat Enterprise Desktop x86_64 | Windows Vista Business x64
Approximate price (USD): $1,100 | $1,000 | $2,000 | $1,900 | $3,700

Performance Results

Figure 3 below shows the results of running our CMOS image sensor and solar cell benchmarks on the various test systems. As a reference, we have included the results from our last article for the Xeon E3110, which represents typical performance from Q1-2008. All results have been normalized to the E3110 system. In all cases below, the results represent the performance achieved running FDTD Solutions in parallel mode using all of the logical processors available to the operating system.

Figure 3. Relative performance of the test systems as measured with the CMOS image sensor and plasmonic solar cell test cases.
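
The normalization used for these results can be sketched as follows. This is a minimal illustration, assuming that performance is taken as the inverse of the wall-clock run time; the system names other than E3110 and all run times below are hypothetical placeholders, not measured values from this study.

```python
def relative_speed(reference_seconds, system_seconds):
    """Speed of a system relative to the reference system (reference = 1.0).

    Assumes performance is the inverse of wall-clock run time.
    """
    if reference_seconds <= 0 or system_seconds <= 0:
        raise ValueError("run times must be positive")
    return reference_seconds / system_seconds


# Hypothetical run times in seconds, for illustration only.
example_times = {"E3110": 600.0, "system A": 200.0, "system B": 100.0}

reference = example_times["E3110"]
normalized = {
    name: relative_speed(reference, seconds)
    for name, seconds in example_times.items()
}

for name, speed in normalized.items():
    print(f"{name}: {speed:.1f}x")
```

With this convention the reference system always scores 1.0, and a system that finishes the same benchmark in one third of the time scores 3.0.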

» FDTD Solutions scales well on Xeon X5550 system

The results show that the dual processor Xeon X5550 system is approximately 2x the speed of the single processor systems. This scalability delivers a tremendous amount of computational performance in a single system. The ability to scale application performance nearly ideally is another major advancement of the Nehalem microarchitecture. Previous generations of multi-processor Intel Xeon systems did not exhibit such ideal scaling when running FDTD Solutions.

» Little difference between Windows and Linux running FDTD Solutions

The results show very similar performance between systems with identical hardware and different operating systems. Where there is a difference, the Linux systems seem to have the edge, but any observed difference is minor.

» FDTD Solutions running on Xeon W3570 about 15% faster than Core i7

The Xeon W3570 shows a modest 15% improvement over the Core i7. As noted earlier, the main difference between these two processors is the speed at which the RAM operates. The 15% performance gain results from the 25% higher RAM speed of the Xeon W3570 (1333MHz versus 1066MHz). The W3570 processor also runs at a higher clock speed, but this seems to make a negligible difference in this test.

The Importance of Parallel Processing

All of the results presented so far have measured the performance of FDTD Solutions running in parallel mode. With processor vendors focused on delivering performance gains through increased numbers of cores and processors in a system, it is necessary to have parallelized software to realize these gains. This is a dramatic shift from the last several decades where the main source of performance gains was realized through increasing clock rates. In the past five years we have seen clock rates peak (see Figure 4) as parallelism becomes more available.

With increasing numbers of cores and processors in systems, properly architected parallelized software is required to realize the potential of this hardware. To illustrate this point, we also measured the performance of FDTD Solutions running as a single process on a single processor of each of our test systems. We used these results to come up with a Parallel Performance Gain factor, which is simply the ratio of the parallel performance to the single processor performance of FDTD Solutions running on a particular system. The parallel performance gain results are shown in Figure 5.

Figure 4. Clock speeds of released Intel processors from 1990 through to the present. Clock rates peaked in early 2004.

Source: http://www.intel.com/pressroom/kits/quickreffam.htm

Figure 5. Performance improvement obtained on each test system by running FDTD Solutions in parallel. While the E3110 system offers a modest 13-18% improvement, the Core i7 systems offer much larger performance gains by running in parallel.
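
The Parallel Performance Gain factor defined above can be written as a short calculation. This is a sketch under the assumption that performance is the inverse of run time, so the ratio of parallel to single-process performance equals the single-process run time divided by the parallel run time. The function name and the run times are hypothetical and chosen only to reproduce gains of the size discussed in this article.

```python
def parallel_performance_gain(single_process_seconds, parallel_seconds):
    """Ratio of parallel performance to single-process performance.

    Assumes performance is the inverse of run time, so the gain is
    single-process run time divided by parallel run time.
    """
    if single_process_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("run times must be positive")
    return single_process_seconds / parallel_seconds


# Hypothetical run times in seconds on the same machine.
modest_gain = parallel_performance_gain(115.0, 100.0)  # about 1.15, i.e. a 15% improvement
large_gain = parallel_performance_gain(500.0, 100.0)   # 5.0, i.e. a 5x improvement

print(f"modest gain: {modest_gain:.2f}x, large gain: {large_gain:.2f}x")
```

A gain of 1.0 means the parallel mode provides no benefit on that system, and both run times must come from the same hardware for the ratio to be meaningful.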

» Nearly 5x improvement using parallel FDTD Solutions

For the Xeon X5550 system, we can see just how important parallelism is. By simply using the parallel mode of FDTD Solutions, a nearly 5x improvement in performance is realized.

» Trend is that parallel performance gain is increasing

The results for the Xeon E3110 system from early 2008 are typical of early dual core processors. At that time, there was only a marginal performance gain from running parallel software on a single system. In 2009, we now see gains of 2x to 5x on a single system just by using parallel software.

Conclusions

The advancements contained in Intel's Nehalem processors offer tremendous benefits for memory intensive applications like FDTD Solutions. Benchmarking of real applications that are representative of the types of problems current users are interested in indicates that six-fold speed improvements are readily available over workstations that are only 12-18 months old. These benefits are available to FDTD Solutions users regardless of the operating system used, from a wide array of hardware vendors, and at prices similar to those for older systems.


Appendix: Nehalem microarchitecture explained

So what has Intel done that has enabled these advancements? The main changes made in the Nehalem microarchitecture include:
  1. Switching to DDR3 type RAM modules,
  2. Adding a third RAM channel, and
  3. Integrating the memory controller into the processor.
The impact of these 3 changes is explained in more detail in the following sections.

DDR3

DDR3 is the next generation of the Double Data Rate SDRAM technology. It offers more memory bandwidth per module than its predecessor, DDR2. The revisions of DDR3 that the initial Nehalem processors support provide peak transfer rates from 8533MB/s to 10667MB/s. The peak transfer rate for DDR2 modules used in most systems is 6400MB/s, so DDR3 results in a 33% to 67% gain in memory bandwidth.
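
The peak transfer rates quoted above can be reproduced with a short calculation. This sketch assumes a 64-bit (8 byte) wide memory module and nominal transfer rates of 800, 1066.67 and 1333.33 million transfers per second for DDR2-800, DDR3-1066 and DDR3-1333; the function names are illustrative only.

```python
BYTES_PER_TRANSFER = 8  # assumption: 64-bit wide memory module


def peak_rate_mb_s(mega_transfers_per_s):
    """Theoretical peak transfer rate of one module in MB/s."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER


def gain_percent(new_rate, old_rate):
    """Percentage gain of new_rate over old_rate."""
    return (new_rate / old_rate - 1.0) * 100.0


ddr2_800 = peak_rate_mb_s(800)          # 6400 MB/s
ddr3_1066 = peak_rate_mb_s(3200 / 3)    # about 8533 MB/s
ddr3_1333 = peak_rate_mb_s(4000 / 3)    # about 10667 MB/s

print(f"DDR3-1066 vs DDR2-800: {gain_percent(ddr3_1066, ddr2_800):.1f}% gain")
print(f"DDR3-1333 vs DDR2-800: {gain_percent(ddr3_1333, ddr2_800):.1f}% gain")
```

These are theoretical peak figures per module. The bandwidth an application actually sees is lower and depends on its access pattern.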

Three memory channels

The standard for most systems for the past number of years has been dual channel RAM. In this configuration paired modules of DDR or DDR2 RAM are used to effectively double the available memory bandwidth. Nehalem has extended this idea to a 3 (triple) channel RAM configuration. In this configuration, 3 DDR3 modules are used to achieve 3x the memory bandwidth that a single module provides. This represents another gain of 50% over previous generations of processors.
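
The effect of the channel count can be combined with the per-module rates from the DDR3 section in a small sketch. It assumes that peak bandwidth scales ideally with the number of channels, which is the best case; the comparison against a dual channel DDR2 system at 6400MB/s per module is an illustration built from the figures in this appendix, not a measured result.

```python
def aggregate_bandwidth_mb_s(per_module_mb_s, channels):
    """Theoretical peak bandwidth, assuming ideal scaling with channel count."""
    return per_module_mb_s * channels


# Per-module peak rates in MB/s, as quoted in the DDR3 section.
dual_channel_ddr2 = aggregate_bandwidth_mb_s(6400, 2)
triple_channel_ddr3_low = aggregate_bandwidth_mb_s(8533, 3)
triple_channel_ddr3_high = aggregate_bandwidth_mb_s(10667, 3)

# Gain from the third channel alone, at a fixed per-module rate.
channel_gain_percent = (3 / 2 - 1) * 100

print(f"third channel alone: {channel_gain_percent:.0f}% gain")
print(f"triple DDR3 vs dual DDR2: "
      f"{triple_channel_ddr3_low / dual_channel_ddr2:.2f}x to "
      f"{triple_channel_ddr3_high / dual_channel_ddr2:.2f}x")
```

Under these assumptions, the move to DDR3 and the third channel together give roughly 2x to 2.5x the theoretical peak bandwidth of a dual channel DDR2 system.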

Integrated memory controller

Nehalem integrates the memory controller, which was a separate chip on the motherboard in previous microarchitectures, onto each processor. This provides a definite advantage when scaling to multiple processor configurations. In this case, each processor has its own memory controller and can have its own bank of memory. This means that if you have 2 processors, you also have 2 memory controllers and 2 banks of RAM, so both the processing power and memory bandwidth scale similarly. To make these 2 processors work as a single system, Nehalem adds the QuickPath Interconnect (QPI), a high speed link that allows each processor to communicate with the other and access its memory. A schematic of this architecture is shown in Figure 6.

Figure 6. The dual processor Intel Xeon 5500 series architecture contains 2 quad-core processors, each with its own bank of RAM.

The multi-processor Nehalem architecture falls under the category of Non-Uniform Memory Access (NUMA). This means that the speed at which a processor can access memory depends on the location of that memory. Memory attached to the processor itself is fastest to access, but that processor can also access the memory of the other processor at a somewhat slower rate. The NUMA architecture did not originate with Intel. AMD has been using a very similar architecture with their Opteron processor for several years. The AMD Opteron based systems support up to 8 processors. AMD uses a HyperTransport (HT) link to communicate between processors, which is analogous to Intel's QPI.

The NUMA architecture used in Nehalem helps FDTD Solutions scale nearly ideally by increasing memory bandwidth and processing power simultaneously. Previous generations of multi-processor systems by Intel used a Symmetric Multiprocessor Architecture (SMP). In this configuration, all processors access a single memory bank through a common memory controller hub, as shown in Figure 7. That memory controller hub often tends to be a performance bottleneck for memory intensive applications such as FDTD Solutions.

Figure 7. The Symmetric Multiprocessor Architecture (SMP) results in performance bottlenecks owing to the shared memory controller hub.
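
The difference between the two architectures can be illustrated with a toy model of peak memory bandwidth. This is a deliberately simplified sketch: it assumes that each NUMA processor contributes the full bandwidth of its own memory bank, it ignores the slower remote accesses over QPI, and it treats the shared hub in the SMP case as a fixed limit. The hub bandwidth used below is a hypothetical value, not a specification of any particular chipset.

```python
def numa_peak_bandwidth(per_processor_mb_s, processors):
    """NUMA: each processor has its own controller and memory bank."""
    return per_processor_mb_s * processors


def smp_peak_bandwidth(hub_mb_s, processors):
    """SMP: all processors share one memory controller hub."""
    if processors < 1:
        raise ValueError("need at least one processor")
    return hub_mb_s  # does not grow with the processor count


PER_PROCESSOR_MB_S = 3 * 10667  # three DDR3 channels at 10667 MB/s each
HYPOTHETICAL_HUB_MB_S = 12800   # assumed shared hub limit, for illustration

numa_scaling = (numa_peak_bandwidth(PER_PROCESSOR_MB_S, 2)
                / numa_peak_bandwidth(PER_PROCESSOR_MB_S, 1))
smp_scaling = (smp_peak_bandwidth(HYPOTHETICAL_HUB_MB_S, 2)
               / smp_peak_bandwidth(HYPOTHETICAL_HUB_MB_S, 1))

print(f"bandwidth scaling from 1 to 2 processors: "
      f"NUMA {numa_scaling:.1f}x, SMP {smp_scaling:.1f}x")
```

In this model a memory-bound application can approach 2x on two NUMA processors, while on the SMP system the second processor adds compute capacity but no additional memory bandwidth.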

Note: Nehalem, Xeon, Core i7, Opteron, Windows Vista, and RedHat Enterprise Linux are registered trademarks of their respective companies.


  Copyright 2003-2010 Lumerical Solutions, Inc.
  All rights reserved.