Improvements to memory bandwidth yield impressive gains in calculation speed for memory intensive applications like FDTD Solutions
Intel's Nehalem processor technology, now readily available through all major IT vendors, contains major advancements that lead to substantial performance gains for memory intensive applications like FDTD Solutions. In this article, we show the results of performance testing on a range of systems based on the Nehalem technology to demonstrate that, while increases in processor clock rates are slowing, dramatic performance gains will be achieved through parallelism. Relatively inexpensive computing systems, like the Xeon 5500 series HPC workstation tested, offer speed improvements of more than 500% over 12 to 18 month-old workstations. Unlike other classes of high performance hardware, these systems are widely compatible with other design and engineering applications and offer great value for those seeking faster computation speeds in FDTD Solutions and other applications.
Background: What is Nehalem?
Nehalem is the codename that Intel uses to refer to its latest microarchitecture. Intel has deployed this technology into the desktop market under the product name Core i7. The same microarchitecture is also used in the more powerful Xeon 3500 and 5500 series workstation/server processors. The Xeon editions of the processors are tweaked to allow higher memory speeds and dual processor configurations.
The results of the previous study showed a dramatic 3x performance improvement for processors released in the same calendar year, as shown in Figure 2 below. In the current work, we have expanded the set of test systems to include workstation and HPC server grade Nehalem processors for those seeking even larger performance gains.
The Systems
The test systems are designed to cover three distinct categories of processors: desktop processors (Core i7); workstation processors (Xeon 3500 series); and performance workstation and HPC server processors (Xeon 5500 series). The most notable difference between the Core i7 and Xeon processors is the maximum RAM speed. The Xeons can run the RAM at speeds up to 1333MHz, whereas the maximum speed for the Core i7 processor is 1066MHz. All processors use three channels of DDR3 RAM; however, the higher memory clock rate of the Xeon processors gives them an extra boost in memory bandwidth. The most notable difference between the Xeon 3500 series and Xeon 5500 series is that the 5500 series can be used in a two processor configuration while the 3500 series cannot.
Figure 3 below shows the results of running our CMOS image sensor and solar cell benchmarks on the various test systems. As a reference, we have included the results for the Xeon E3110 from our last article, which represent typical performance from Q1-2008. All results have been normalized to the E3110 system. In all cases below, the results represent the performance achieved running FDTD Solutions in parallel mode using all of the logical processors available to the operating system.
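To make the normalization concrete, the following is a minimal sketch of how results can be expressed relative to a reference system. It assumes that performance is taken as the inverse of simulation run time, which the article does not state explicitly. The function name, the system names, and the run times are hypothetical placeholders for illustration only and are not measured results.

```python
def normalize_to_reference(run_times_s, reference):
    """Return relative performance of each system versus the reference.

    Assumes performance is inversely proportional to run time, so a
    system that finishes in half the reference time scores 2.0.
    """
    ref_time = run_times_s[reference]
    if ref_time <= 0 or any(t <= 0 for t in run_times_s.values()):
        raise ValueError("run times must be positive")
    return {name: ref_time / t for name, t in run_times_s.items()}


# Hypothetical wall-clock times in seconds (NOT benchmark data).
example_times_s = {
    "reference_system": 600.0,
    "system_a": 200.0,
    "system_b": 100.0,
}

relative = normalize_to_reference(example_times_s, "reference_system")
for name, score in relative.items():
    print(f"{name}: {score:.1f}x the reference performance")
```

With this convention the reference system always scores 1.0, so every other value can be read directly as a speed-up factor.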
» FDTD Solutions scales well on Xeon X5550 system
The Importance of Parallel Processing
All of the results presented so far have measured the performance of FDTD Solutions running in parallel mode. With processor vendors focused on delivering performance gains through increased numbers of cores and processors in a system, it is necessary to have parallelized software to realize these gains. This is a dramatic shift from the last several decades, when the main source of performance gains was increasing clock rates. In the past five years we have seen clock rates peak (see Figure 4) as parallelism becomes more available.
With increasing numbers of cores and processors in systems, properly architected parallelized software is required to realize the potential of this hardware. To illustrate this point, we also measured the performance of FDTD Solutions running as a single process on a single processor of each of our test systems. We used these results to calculate a Parallel Performance Gain factor, which is simply the ratio of the parallel performance to the single processor performance of FDTD Solutions running on a particular system. The parallel performance gain results are shown in Figure 5.
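The Parallel Performance Gain factor defined above can be sketched in a few lines. This example assumes, as a simplification not stated in the article, that performance is the inverse of run time, so the ratio of performances becomes the single-process time divided by the parallel time. The function name and the timing values are hypothetical and do not come from the benchmarks.

```python
def parallel_performance_gain(single_time_s, parallel_time_s):
    """Ratio of parallel performance to single-process performance.

    With performance defined as 1 / run time, this reduces to
    single_time_s / parallel_time_s.
    """
    if single_time_s <= 0 or parallel_time_s <= 0:
        raise ValueError("run times must be positive")
    return single_time_s / parallel_time_s


# Hypothetical timings in seconds for one system (NOT measured data).
example_single_s = 800.0
example_parallel_s = 100.0

gain = parallel_performance_gain(example_single_s, example_parallel_s)
print(f"Parallel Performance Gain: {gain:.1f}x")
```

A gain close to the number of physical cores would indicate that the software scales well on that system; a gain near 1 would indicate that the extra cores are not being used effectively.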
Conclusions
The advancements contained in Intel's Nehalem processor offer tremendous benefits for memory intensive applications like FDTD Solutions. Benchmarking of real applications that are representative of the types of problems current users are interested in indicates that six-fold speed improvements are readily available over workstations that are only 12-18 months old. The benefits available to FDTD Solutions users with systems incorporating Intel's new processor architecture are generally available regardless of the operating system used, from a wide array of hardware vendors, at prices similar to those for older systems.
Appendix: Nehalem microarchitecture explained
So what has Intel done that has enabled these advancements? The main changes made in the Nehalem microarchitecture include:
DDR3
DDR3 is the next generation of the Double Data Rate SDRAM technology. It offers more memory bandwidth per module than its predecessor, DDR2. The revisions of DDR3 that the initial Nehalem processors support provide peak transfer rates from 8533MB/s to 10667MB/s. The peak transfer rate for DDR2 modules used in most systems is 6400MB/s, so DDR3 results in a 33% to 67% gain in memory bandwidth.
Three memory channels
The standard for most systems for the past number of years has been dual channel RAM. In this configuration, paired modules of DDR or DDR2 RAM are used to effectively double the available memory bandwidth. Nehalem has extended this idea to a 3 (triple) channel RAM configuration. In this configuration, 3 DDR3 modules are used to achieve 3x the memory bandwidth that a single module provides. This represents another gain of 50% over previous generations of processors.
Integrated memory controller
Nehalem integrates the memory controller, which was a separate chip on the motherboard in previous microarchitectures, onto each processor. This provides a definite advantage when scaling to multiple processor configurations.
In this case, each processor has its own memory controller and can have its own bank of memory. This means that if you have 2 processors, you also have 2 memory controllers and 2 banks of RAM, so both the processing power and memory bandwidth scale similarly. To make these 2 processors work as a single system, Nehalem adds the QuickPath Interconnect (QPI), a high speed link that allows each processor to communicate with the other and to access the other processor's memory. A schematic of this architecture is shown in Figure 6.
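The bandwidth figures quoted in this appendix can be combined with some simple arithmetic. The sketch below multiplies the per-module peak transfer rate by the number of memory channels and the number of processors, which gives a theoretical peak only; it assumes ideal scaling and ignores QPI traffic and real-world memory efficiency. The function and constant names are hypothetical.

```python
def peak_bandwidth_mb_s(per_module_mb_s, channels, processors=1):
    """Theoretical peak memory bandwidth in MB/s (ideal scaling assumed)."""
    return per_module_mb_s * channels * processors


# Per-module peak transfer rates quoted in the text, in MB/s.
DDR2_MB_S = 6400
DDR3_LOW_MB_S = 8533
DDR3_HIGH_MB_S = 10667

dual_channel_ddr2 = peak_bandwidth_mb_s(DDR2_MB_S, channels=2)
triple_channel_low = peak_bandwidth_mb_s(DDR3_LOW_MB_S, channels=3)
triple_channel_high = peak_bandwidth_mb_s(DDR3_HIGH_MB_S, channels=3)
two_processor_high = peak_bandwidth_mb_s(DDR3_HIGH_MB_S, channels=3, processors=2)

print(f"Dual channel DDR2:            {dual_channel_ddr2} MB/s")
print(f"Triple channel DDR3 (low):    {triple_channel_low} MB/s "
      f"({triple_channel_low / dual_channel_ddr2:.2f}x)")
print(f"Triple channel DDR3 (high):   {triple_channel_high} MB/s "
      f"({triple_channel_high / dual_channel_ddr2:.2f}x)")
print(f"Two processors, DDR3 (high):  {two_processor_high} MB/s "
      f"({two_processor_high / dual_channel_ddr2:.2f}x)")
```

Under these idealized assumptions, the faster modules and the third channel together give roughly 2.0x to 2.5x the peak bandwidth of a dual channel DDR2 system per processor, and a second processor with its own memory bank doubles that again.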
The multi-processor Nehalem architecture falls under the category of Non-Uniform Memory Access (NUMA). This means that the speed at which a processor can access memory depends on where that memory is located. Memory attached to the processor itself is fastest to access, but the processor can also access the memory of the other processor at a somewhat slower rate. The NUMA architecture is not unique to Intel. AMD has been using a very similar architecture with its Opteron processors for several years. AMD Opteron based systems support up to 8 processors. AMD uses a HyperTransport (HT) link to communicate between processors, which is analogous to Intel's QPI.
Note: Nehalem, Xeon, Core i7, Opteron, Windows Vista, and RedHat Enterprise Linux are registered trademarks of their respective companies.