The SSE instruction set can be a very useful tool in developing high performance applications. SSE, or Streaming SIMD Extensions, is particularly helpful when you need to perform the same instructions over and over again on different pieces of data. SSE vectors are 128-bits wide, and allow you to perform calculations for 4 different floating point numbers at the same time. SSE can also be configured to work on 2, 64-bit floating point numbers concurrently, 4, 32-bit integers, or even 16, 8-bit chars. By properly utilizing the SSE instruction set, you gain access to a large amount of parallel computing power that would otherwise go untapped. In some cases, using SSE optimization can provide just as much speedup as going from a single core to a quad core processor.
Example for this tutorial
For this tutorial, we are going to write a simple program that will calculate Y = sqrt(x)/x, for values of x ranging from 1 -> 64000. This is a fairly typical scenario for any application that would need to plot a graph. For this tutorial, we are going to be using compiler intrinsics, which only seem to be available on Microsoft Visual Studio. Generally, I encourage cross-platform capability, but Microsoft’s compiler intrinsics are simply the easiest and overall best way to write applications utilizing SSE instructions. Of course, there are similar SSE intrinsics which can be used for other processors, so this is just a good place to start.
Here you can download file SSScheduler. 2shared gives you an excellent opportunity to store your files here and share them with others. Join our community just now to flow with the file SSScheduler and make our shared file collection even more complete and exciting. SSScheduler.exe file is infected with malicious adware, Trojans, spyware and malware. The associated entries of SSScheduler.exe are corrupted in the registry. SSScheduler.exe files are deleted by another program by mistake. The software that uses SSScheduler.exe is not downloaded completely and installed correctly.
Starting the project
Create a new project with Visual Studio. A console project will work just fine. Be sure to include xmmintrin.h. It is important to note that this header file may be named differently depending on which compiler y ou are using.
Before you scan, your McAfee Security Scan Plus software automatically checks your software to ensure it's up to date. By default, McAfee automatically downloads and installs updates on your computer to keep it protected. A Channel for 3d modelling tutorials.
#include <xmmintrin.h> // Need this for SSE compiler intrinsics
Alignment
For this tutorial, we will be calculating the values of Y with SSE and storing them into a floating point array. It is of the utmost importance that the array be created and aligned to a 16-byte boundary. This will insure that the values can be loaded into, and stored from the SSE hardware as efficiently as possible. Failure to properly align data being used with SSE instructions will result in a huge performance hit. Since large datasets are generally required in order to justify using SSE, we’ll be dynamically allocating our array so it doesn’t reside on the stack.
float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16); // align to 16-byte for SSE
For this tutorial, we won’t worry about storing the actual values of X, since they could easily be derived at a later time if needed. So instead of using an array to store X, we’ll be using a special vector variable which is explicitly meant to be used for SSE instructions.
__m128 x;
We have declared variable ‘x’ as a 128 bit vector variable. We haven’t told the compiler that we’re going to be using it as 4 floating point numbers because it doesn’t matter at this point. Before we continue, here is a brief look at what we’re trying to accomplish:
Set X to <1,2,3,4>
For (int i=0; I < total/4; i++)
{
Y[i] = sqrt(x)/x; // Y[i] is also a vector of 4 floats
X = X+4; // add 4 to each element
}
To modify the values of the SSE variables, we’ll need to use special intrinsic functions which can be easily looked up on the MSDN library. To initialize X to contain the values <1,2,3,4> we can use the _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f). We’ll get to why these are in the reverse order later.
Performing computation with SSE vectors
The computation we will be performing is sqrt(x)/x. The best way to perform the square root is to use an intrinsic function. Using special hardware, four square roots will be calculated simultaneously, which will provide a great speedup.
__m128 xSqrt = _mm_sqrt_ps(x);
Now that we have a variable holding our square roots, we need to divide by X. There are two says to do this. The first way is to simply use a divide intrinsic. This will provide the most accurate result, so I usually recommend you take this approach.
pResultSSE[i] = _mm_div_ps(xSqrt, x);
As it turns out, division is relatively slow. A faster approach would be to take the reciprocal of X using the _mm_rcp_ps intrinsic, then multiply the reciprocal of X to the square root using the _mm_mul_ps intrinsic.
__m128 xRecip = _mm_rcp_ps(x);
pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt);
How regular arrays interact with SSE vectors
Ssscheduler
Previously, we discussed how the result array had to be aligned to the 16-byte boundary. When we declared pResult, we declared it as a floating point array. In order to store SSE results into this array, we will need to use an intermediate pointer.
__m128 *pResultSSE = (__m128*) pResult;
By using a pointer of type __m128 take on the same address as the floating point array, we effectively created a way for the SSE results to be stored directly into our floating point array.
BEWARE! Remember earlier, when we decided to initialize the X vector with <4,3,2,1> which is backwards from what we might expect? Well, as it turns out when using this approach, to store SSE results into a regular floating point array, the 4 floating point numbers will actually get reversed. For example, if the result of _mm_mul_ps is <2,4,6,8>, the value of pResult[0] will be 8, pResult[1] will be 6, and so on. So be very careful when storing results using pointers.
Incrementing each value in a vector by 4
SSE numbers typically do not play well with regular numbers. Any interaction between an SSE vector and a normal 32-bit floating point number can be an expensive performance hit. For this reason, we’re going to create another __m128 variable, and set the value to <4,4,4,4>. That way, inside the for loop, we will simply add the two __m128 vectors together in order to increment them.
x = _mm_add_ps(x, xDelta); // Advance x to the next set of numbers
Results
Below are results which were obtained by testing this program using the reciprocal-multiply method, the division method, and the division method without SSE instructions. As you can see, the difference in run time is dramatic. Please keep in mind that the division has more accurate results than the reciprocal-multiply method.
Compatibility
Virtually all x86 processors made within the past 10 years fully support SSE. Still, it is worth noting that if you are running your program on a processor which does not support SSE, it will crash. There are ways to check whether or not a computer supports SSE during program runtime. However, because it is exceptionally rare to find an x86 processor without SSE support, I won’t go into further details on how to do this.
Conclusion
That just about wraps it up for this tutorial. There are a few key points you should keep in mind when deciding to use the SSE instruction set. Often, algorithmic changes to a program can make the most impact. You should only start writing code for SSE if and only if improving the main program algorithm is not an option. Writing code that uses SSE is more challenging, but the possible rewards in speedup to your program are potentially great. In this case, the benefit of using SSE would be greater than the benefit of using two or even four processors without SSE!
For a slightly more practical example of how to use SSE, feel free to check out the intro to image processing with SSE article.
Schedule Tutorial
Full Source Code Listing
Scheduling Tutorials
Download SSE source code here.