Friday, December 6, 2002
Christmas Wish: A New Bus For My XScale
Posted by Andy Sjostrom in "HARDWARE" @ 03:15 AM
As we head into the Holiday Season, I enjoy seeing more Pocket PC models based on the XScale processor making their entry into the market. But the new models remind me of the XScale discussions we had this summer, and I can't help revisiting...
...Jason's post "XScale and the Pocket PC – what’s going on?" in which Ed Suwanjindar, from the Microsoft Mobile Devices group, responded to Jason's questions, and Chris De Herrera's article "Improving the Speed of XScale" in which Chris lists some recommendations on what Microsoft, Intel and Hardware Manufacturers should change.
I still felt I needed more details to understand what is really going on, and turned to Sven Myhre, CEO of Amazing Games. Sven is an extremely talented coder, artist, and 3D modeller, and I asked him the bottom line question: "Why is the XScale at 400 MHz sometimes faster, sometimes slower, and often exactly the same as a 206 MHz StrongARM CPU?" Read on to find out why I wish not for a faster processor or an optimized operating system, but a new bus!
Andy Sjostrom: "Why is the XScale at 400 MHz sometimes faster, sometimes slower, and often exactly the same as a 206 MHz StrongARM CPU?"
Sven Myhre: "In theory, a 400 MHz XScale will always be faster than a 206 MHz StrongARM. In real-life however, CPU performance depends on a lot more than raw MHz. The CPU needs to do something useful with all its raw speed – and that means we need to feed it with code instructions and data to process, and we need to make sure the result of all its processing is stored. This is where the memory bus comes into the picture.
XScale and StrongARM Pocket PC designs use a 16-bit memory bus. Both CPU families are 32-bit RISC processors that use 32-bit code instructions as well as (usually) 32-bit chunks of data. Just to feed the CPU with enough code instructions to keep it running at full speed, we really need a memory bus running at twice the speed of the CPU, since the bus needs to transfer two 16-bit chunks to feed the CPU one 32-bit code instruction. And since we want the CPU to live a meaningful life, we also need the memory bus to transfer some data back and forth between the CPU and memory. For most applications, increasing the memory bus speed by another 25% should pretty much cover normal data traffic.
So, to keep our CPU running at full speed, our beloved Pocket PC should have a bus that runs at least 2.5 times as fast as the CPU. A 400 MHz XScale should have a 1000 MHz bus, a 300 MHz XScale should have a 750 MHz bus, a 200 MHz XScale should have a 500 MHz bus and a 206 MHz StrongARM should have a 515 MHz bus..."
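To put numbers on that rule of thumb, here is a minimal C sketch of the arithmetic Sven describes: two 16-bit bus transfers per 32-bit instruction fetch, plus roughly 25% extra bus traffic for data. The figures are illustrative only.

/* A minimal sketch of the bus-speed arithmetic above. Assumptions are
   taken from the explanation: a 16-bit bus needs two transfers per
   32-bit instruction fetch, plus roughly 25% extra traffic for data. */
#include <stdio.h>

int main(void)
{
    const double cpu_mhz[] = { 400.0, 300.0, 200.0, 206.0 };
    const double fetch_factor = 2.0;   /* two 16-bit transfers per 32-bit instruction */
    const double data_factor  = 1.25;  /* ~25% extra bus traffic for data */
    int i;

    for (i = 0; i < 4; i++)
        printf("%3.0f MHz CPU wants a %4.0f MHz bus\n",
               cpu_mhz[i], cpu_mhz[i] * fetch_factor * data_factor);
    return 0;
}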
Reality Is Different
Sven continues: "In reality, the ratio between CPU speed and memory bus speed is the opposite of what we just described. XScale Pocket PCs running at 400 MHz, 300 MHz and 200 MHz all use a 100 MHz bus, and the 206 MHz StrongARM uses a 103 MHz bus.
I guess you just spotted the main bottleneck, and the reason why XScale devices running at 400 MHz, 300 MHz and 200 MHz get almost identical benchmark results in tests that involve shuffling memory around – typical applications are graphics and multimedia. (Note: some Pocket PCs incorporate graphics accelerators that can confuse this picture a little.) And since the StrongARM uses a bus that is 3% faster than the one used with the XScale, we also have a logical explanation for why StrongARM-based Pocket PCs are sometimes slightly faster than XScale-based Pocket PCs in some tests."
Now It Gets Complicated
"Hardware designers knew they had to come up with a way to feed the CPU all the code instructions and data it needs, faster than the slow memory bus can provide them. So they added a cache to the CPU. The StrongARM has 8 Kb of code cache and 8 Kb of data cache, while the XScale use a 32 Kb code cache and a 32 Kb data cache. Whenever the CPU tries to load a code instruction or a chunk of data, it will first search its cache to see if it is already loaded. If it finds it in the cache, it can access the code instruction or data chunk at full speed. Bang! Your 400 MHz XScale roars, and will chew up code instructions at a blazing speed – 400 millions of them per second.
But what happens if the information the CPU is looking for is not found in the cache? This brings us to the flipside of the cache – it becomes a double-edged sword that turns around and hits you hard if you don't pay attention as a coder. Since the CPU needs to search the cache very quickly, the cache is organized into what we call cache lines. The cache line is in fact the smallest unit that can be read from memory under normal conditions. On both XScale and StrongARM, the cache line happens to be 16 words (in the ARM architecture, a word equals 32 bits, or 4 bytes). So a cache line is 64 bytes, and even if the CPU just needs to access a single byte, it still has to read 64 bytes from memory to fill an entire cache line before returning with that single byte."
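To see what those cache sizes mean in practice, here is a small C sketch using the numbers Sven quotes (64-byte lines, 8 KB vs. 32 KB of data cache). The resulting line counts come back in the example below.

/* Cache geometry from the numbers above: 64-byte lines (16 ARM words),
   an 8 KB data cache on the StrongARM and a 32 KB data cache on the XScale. */
#include <stdio.h>

#define LINE_BYTES 64

int main(void)
{
    int strongarm_dcache = 8 * 1024;
    int xscale_dcache = 32 * 1024;

    printf("StrongARM data cache: %d lines\n", strongarm_dcache / LINE_BYTES); /* 128 */
    printf("XScale data cache:    %d lines\n", xscale_dcache / LINE_BYTES);    /* 512 */
    return 0;
}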
A Real World Example
"Joe Coder decides to make the worlds best PIM. He needs to store records (or structures) of all his contacts – and Joe Coder is popular so he has 1000 contacts. For each contact he needs to store their first name, surname and phone number, so he sets aside 64 bytes to store each contact. Then he wants to sort them by their surname and present them nicely on the screen. For each contact he probably just needs to read the first few letters in their surname in order to sort them correctly.
The problem is that even if Joe Coder just reads a few bytes from each contact record, the CPU will read 64 bytes from memory into the cache every time he accesses a new surname. And if Joe Coder was a lazy coder, he might not have bothered to check that each record was aligned on a 64-byte boundary – so a surname might actually span two cache lines, meaning the CPU will read 128 bytes for every access to a new surname. But even if we assume he did his homework and aligned the memory correctly, a StrongARM will have used up all of its data cache after reading just 128 surnames (8192 bytes / 64 bytes = 128 cache lines). An XScale would be able to fit 512 surnames (32768 bytes / 64 bytes = 512 cache lines) before it had to start writing over previously read cache lines. But Joe Coder needed to read through the entire list of 1000 contacts before starting over again – so neither the StrongARM nor the XScale would be able to use its cache to its advantage.
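A quick C sketch of the kind of record layout Joe Coder ends up with. The exact fields are an assumption; the only thing given above is that each contact occupies 64 bytes, one full cache line.

/* Joe Coder's cache-hostile layout. The exact field sizes are an
   assumption; all we know is that each record occupies 64 bytes. */
#include <stdlib.h>
#include <string.h>

struct Contact {
    char first_name[24];
    char surname[24];
    char phone[16];        /* 24 + 24 + 16 = 64 bytes: one cache line per record */
};

static int by_surname(const void *a, const void *b)
{
    /* Only a few bytes of each record are compared, but on a cold cache
       every "new" record still costs a full 64-byte cache-line fill. */
    return strcmp(((const struct Contact *)a)->surname,
                  ((const struct Contact *)b)->surname);
}

void sort_contacts(struct Contact *contacts, size_t count)
{
    qsort(contacts, count, sizeof(struct Contact), by_surname);
}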
All Joe Coder wanted was to read 4 bytes from each surname, for a total of 4000 bytes. But the CPU ended up transferring a total of 64000 bytes from memory to the cache. Since the 16-bit bus moves 2 bytes per bus cycle, and each bus cycle costs two CPU cycles on the StrongARM and four on a 400 MHz XScale, the 206 MHz StrongARM would have spent 64000 cycles waiting, while the 400 MHz XScale would have spent 128000 cycles waiting. In wall-clock time, the deciding factor was the 103 MHz vs. 100 MHz bus, and the StrongARM would have been slightly faster.
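And a minimal sketch of the arithmetic behind those wait-cycle figures, using the bus width and clock speeds from above. The microsecond totals are derived here, not quoted, but they show why the StrongARM's 103 MHz bus edges out the XScale's 100 MHz bus.

/* The wait-cycle arithmetic behind the figures above, under the
   assumptions already given: a 16-bit bus (2 bytes per bus cycle),
   a 100 MHz bus on the 400 MHz XScale and a 103 MHz bus on the
   206 MHz StrongARM. The microsecond figures are derived here. */
#include <stdio.h>

int main(void)
{
    long bytes = 64000;              /* 1000 records, one 64-byte line fill each */
    long bus_cycles = bytes / 2;     /* the 16-bit bus moves 2 bytes per cycle */

    printf("StrongARM: %.0f CPU cycles waiting (%.1f microseconds)\n",
           bus_cycles * (206.0 / 103.0), bus_cycles / 103.0);   /*  64000, ~310.7 us */
    printf("XScale:    %.0f CPU cycles waiting (%.1f microseconds)\n",
           bus_cycles * (400.0 / 100.0), bus_cycles / 100.0);   /* 128000,  320.0 us */
    return 0;
}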
Joe Coder made the cache design work against him. He forgot that a cycle is a terrible thing to waste. If Joe Coder had been clever, he might have reorganized his data structures. By storing all the surnames in a separate list, he could have made the cache work for him instead. Let us say he decides 16 bytes are enough for a good surname, so 4 surnames fit sequentially in one cache line (64 bytes). He would still pay the penalty of waiting for a cache line to fill when he reads the first surname in it, but surnames 2, 3 and 4 would already be present in the cache, and he could read them at full speed. So this time around, the CPU ends up transferring just 16000 bytes in total. And if Joe Coder was lucky enough to own a 400 MHz XScale, they would all still be present in his cache when he finished – so he could go over them again, and this time they could all be accessed at full speed. Poor Joe Coder, however, owns a StrongARM – so he still could not fit everything in the cache, and a second run through them would take the same amount of time.
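Here is what the reorganised layout could look like in C: a separate, tightly packed array of 16-byte surname keys. A real program would also keep an index mapping each key back to its full record; that bookkeeping is omitted from this sketch.

/* The reorganised, cache-friendly layout: surnames kept in their own
   tightly packed array, 16 bytes each, so four share one cache line. */
#include <stdlib.h>
#include <string.h>

#define NUM_CONTACTS 1000
#define SURNAME_LEN  16               /* 4 surnames per 64-byte cache line */

static char surname_keys[NUM_CONTACTS][SURNAME_LEN];

static int by_key(const void *a, const void *b)
{
    return strncmp((const char *)a, (const char *)b, SURNAME_LEN);
}

void sort_keys(void)
{
    /* 1000 * 16 bytes = 16000 bytes = 250 cache lines: this fits in the
       XScale's 512-line data cache but not in the StrongARM's 128 lines. */
    qsort(surname_keys, NUM_CONTACTS, SURNAME_LEN, by_key);
}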
Joe Coder is faced with such dilemmas every day, and the decisions he makes have a huge impact on how your Pocket PC performs. Maybe Joe Coder decides that an inefficient memory layout is the best way to go, since the code might be easier to read and maintain – or because it has to be compatible with other versions of the software that run on other platforms with other hardware constraints."
Bottom Line
"The main problem with slow XScales has nothing to do with XScale (which are based upon ARM v5) “emulating” StrongARM code (which is ARM v4) no more than you would say a Pentium 4 “emulates” a Pentium 3 when running Windows XP.
And it is NOT a question of simply “optimizing” Windows CE for XScale. Of course that might give you a few percent faster code, but it is not worth the trouble of going through the entire Windows CE source code to check where structures or access patterns could be reorganised to make better use of the 32 KB data cache on the XScale. We would probably end up with a highly unstable version of Windows CE where no one knew the full implications of all the changes they had made.
Unless we get a faster and/or wider memory bus, we can increase the internal speed of the CPU to the speed of light (and it would probably be blazingly fast at calculating prime numbers or something), but our real-world applications would not really see the difference. As for purchase decisions – it is very much up to what you want your Pocket PC to do.
If you spend most of your time doing stuff that involves shuffling lots of memory around (typical uses are graphics, multimedia, music and some games), you might find that a 300 MHz XScale gives you just as much bang for the buck as a 400 MHz one. But please note that this will change from application to application. Sometimes you can blame Joe Coder, but at other times the datasets are just too big to fit in any cache."
The Horizon
"The most exiting news with the launch of the XScale family was an extension called Wireless MMX, which lets the code perform operations commonly used in multimedia processing on several data units simultaneously. Right now there are few (if any) tools available to the developer community to take advantage of this extension. But Intel’s upcoming C/C++ compiler (currently in beta) for XScale includes functionality to access of Wireless MMX from high-level C/C++ code without resorting to assembler."