User Manual

Table of Contents

Contents .......... 3
Tables .......... 11
Figures .......... 13
Revision History .......... 15

Chapter 1  Introduction .......... 17
  1.1 Intended Audience .......... 17
  1.2 Getting Started Quickly .......... 17
  1.3 Using This Guide .......... 18
  1.4 Important New Terms .......... 20
  1.5 Key Optimizations .......... 22

Chapter 2  C and C++ Source-Level Optimizations .......... 23
  2.1 Declarations of Floating-Point Values .......... 25
  2.2 Using Arrays and Pointers .......... 26
  2.3 Unrolling Small Loops .......... 29
  2.4 Expression Order in Compound Branch Conditions .......... 30
  2.5 Long Logical Expressions in If Statements .......... 32
  2.6 Arrange Boolean Operands for Quick Expression Evaluation .......... 33
  2.7 Dynamic Memory Allocation Consideration .......... 35
  2.8 Unnecessary Store-to-Load Dependencies .......... 36
  2.9 Matching Store and Load Size .......... 38
  2.10 SWITCH and Noncontiguous Case Expressions .......... 41
  2.11 Arranging Cases by Probability of Occurrence .......... 44
  2.12 Use of Function Prototypes .......... 45
  2.13 Use of const Type Qualifier .......... 46
  2.14 Generic Loop Hoisting .......... 47
  2.15 Local Static Functions .......... 50
  2.16 Explicit Parallelism in Code .......... 51
  2.17 Extracting Common Subexpressions .......... 53
  2.18 Sorting and Padding C and C++ Structures .......... 55
  2.19 Sorting Local Variables .......... 57
  2.20 Replacing Integer Division with Multiplication .......... 59
  2.21 Frequently Dereferenced Pointer Arguments .......... 60
  2.22 Array Indices .......... 62
  2.23 32-Bit Integral Data Types .......... 63
  2.24 Sign of Integer Operands .......... 64
  2.25 Accelerating Floating-Point Division and Square Root .......... 66
  2.26 Fast Floating-Point-to-Integer Conversion .......... 68
  2.27 Speeding Up Branches Based on Comparisons Between Floats .......... 70
  2.28 Improving Performance in Linux Libraries .......... 73

Chapter 3  General 64-Bit Optimizations .......... 75
  3.1 64-Bit Registers and Integer Arithmetic .......... 76
  3.2 64-Bit Arithmetic and Large-Integer Multiplication .......... 78
  3.3 128-Bit Media Instructions and Floating-Point Operations .......... 83
  3.4 32-Bit Legacy GPRs and Small Unsigned Integers .......... 84

Chapter 4  Instruction-Decoding Optimizations .......... 87
  4.1 DirectPath Instructions .......... 88
  4.2 Load-Execute Instructions .......... 89
  4.3 Branch Targets in Program Hot Spots .......... 92
  4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction .......... 93
  4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes .......... 94
  4.6 Short Instruction Encodings .......... 96
  4.7 Partial-Register Reads and Writes .......... 97
  4.8 Using LEAVE for Function Epilogues .......... 99
  4.9 Alternatives to SHLD Instruction .......... 101
  4.10 8-Bit Sign-Extended Immediate Values .......... 103
  4.11 8-Bit Sign-Extended Displacements .......... 104
  4.12 Code Padding with Operand-Size Override and NOP .......... 105

Chapter 5  Cache and Memory Optimizations .......... 107
  5.1 Memory-Size Mismatches .......... 108
  5.2 Natural Alignment of Data Objects .......... 111
  5.3 Cache-Coherent Nonuniform Memory Access (ccNUMA) .......... 112
  5.4 Multiprocessor Considerations .......... 115
  5.5 Store-to-Load Forwarding Restrictions .......... 116
  5.6 Prefetch Instructions .......... 120
  5.7 Streaming-Store/Non-Temporal Instructions .......... 128
  5.8 Write-Combining .......... 129
  5.9 L1 Data Cache Bank Conflicts .......... 130
  5.10 Placing Code and Data in the Same 64-Byte Cache Line .......... 132
  5.11 Sorting and Padding C and C++ Structures .......... 133
  5.12 Sorting Local Variables .......... 135
  5.13 Memory Copy .......... 136
  5.14 Stack Considerations .......... 138
  5.15 Cache Issues When Writing Instruction Bytes to Memory .......... 139
  5.16 Interleave Loads and Stores .......... 140

Chapter 6  Branch Optimizations .......... 141
  6.1 Density of Branches .......... 142
  6.2 Two-Byte Near-Return RET Instruction .......... 144
  6.3 Branches That Depend on Random Data .......... 146
  6.4 Pairing CALL and RETURN .......... 148
  6.5 Recursive Functions .......... 149
  6.6 Nonzero Code-Segment Base Values .......... 151
  6.7 Replacing Branches with Computation .......... 152
  6.8 The LOOP Instruction .......... 157
  6.9 Far Control-Transfer Instructions .......... 158

Chapter 7  Scheduling Optimizations .......... 159
  7.1 Instruction Scheduling by Latency .......... 160
  7.2 Loop Unrolling .......... 161
  7.3 Inline Functions .......... 165
  7.4 Address-Generation Interlocks .......... 167
  7.5 MOVZX and MOVSX .......... 169
  7.6 Pointer Arithmetic in Loops .......... 170
  7.7 Pushing Memory Data Directly onto the Stack .......... 173

Chapter 8  Integer Optimizations .......... 175
  8.1 Replacing Division with Multiplication .......... 176
  8.2 Alternative Code for Multiplying by a Constant .......... 180
  8.3 Repeated String Instructions .......... 183
  8.4 Using XOR to Clear Integer Registers .......... 185
  8.5 Efficient 64-Bit Integer Arithmetic in 32-Bit Mode .......... 186
  8.6 Efficient Implementation of Population-Count Function in 32-Bit Mode .......... 195
  8.7 Efficient Binary-to-ASCII Decimal Conversion .......... 197
  8.8 Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants .......... 202
  8.9 Optimizing Integer Division .......... 208

Chapter 9  Optimizing with SIMD Instructions .......... 209
  9.1 Ensure All Packed Floating-Point Data Are Aligned .......... 211
  9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory .......... 212
  9.3 Use MOVLPx/MOVHPx Instructions for Unaligned Data Access .......... 214
  9.4 Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS .......... 215
  9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency .......... 216
  9.6 Avoid Moving Data Directly Between General-Purpose and MMX™ Registers .......... 222
  9.7 Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode .......... 223
  9.8 Passing Data Between MMX™ and 3DNow!™ Instructions .......... 224
  9.9 Storing Floating-Point Data in MMX™ Registers .......... 225
  9.10 EMMS and FEMMS Usage .......... 226
  9.11 Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots .......... 227
  9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions .......... 231
  9.13 Clearing MMX™ and XMM Registers with XOR Instructions .......... 232
  9.14 Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions .......... 233
  9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions .......... 234
  9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions .......... 237
  9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines .......... 246

Chapter 10  x87 Floating-Point Optimizations .......... 253
  10.1 Using Multiplication Rather Than Division .......... 254
  10.2 Achieving Two Floating-Point Operations per Clock Cycle .......... 255
  10.3 Floating-Point Compare Instructions .......... 260
  10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs .......... 261
  10.5 Floating-Point Subexpression Elimination .......... 262
  10.6 Accumulating Precision-Sensitive Quantities in x87 Registers .......... 263
  10.7 Avoiding Extended-Precision Data .......... 264

Appendix A  Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors .......... 265
  A.1 Key Microarchitecture Features .......... 266
  A.2 Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors .......... 267
  A.3 Superscalar Processor .......... 267
  A.4 Processor Block Diagram .......... 267
  A.5 L1 Instruction Cache .......... 268
  A.6 Branch-Prediction Table .......... 269
  A.7 Fetch-Decode Unit .......... 270
  A.8 Instruction Control Unit .......... 270
  A.9 Translation-Lookaside Buffer .......... 270
  A.10 L1 Data Cache .......... 271
  A.11 Integer Scheduler .......... 272
  A.12 Integer Execution Unit .......... 272
  A.13 Floating-Point Scheduler .......... 273
  A.14 Floating-Point Execution Unit .......... 274
  A.15 Load-Store Unit .......... 274
  A.16 L2 Cache .......... 275
  A.17 Write-Combining .......... 276
  A.18 Buses for AMD Athlon™ 64 and AMD Opteron™ Processors .......... 276
  A.19 Integrated Memory Controller .......... 276
  A.20 HyperTransport™ Technology Interface .......... 276

Appendix B  Implementation of Write-Combining .......... 279
  B.1 Write-Combining Definitions and Abbreviations .......... 279
  B.2 Programming Details .......... 280
  B.3 Write-Combining Operations .......... 280
  B.4 Sending Write-Buffer Data to the System .......... 282
  B.5 Write-Combining Optimization on Revision D and E AMD Athlon™ 64 and AMD Opteron™ Processors .......... 282

Appendix C  Instruction Latencies .......... 285
  C.1 Understanding Instruction Entries .......... 286
  C.2 Integer Instructions .......... 289
  C.3 MMX™ Technology Instructions .......... 319
  C.4 x87 Floating-Point Instructions .......... 323
  C.5 3DNow!™ Technology Instructions .......... 330
  C.6 3DNow!™ Technology Extensions .......... 332
  C.7 SSE Instructions .......... 333
  C.8 SSE2 Instructions .......... 342
  C.9 SSE3 Instructions .......... 358

Appendix D  AGP Considerations .......... 361
  D.1 Fast-Write Optimizations .......... 361
  D.2 Fast-Write Optimizations for Graphics-Engine Programming .......... 362
  D.3 Fast-Write Optimizations for Video-Memory Copies .......... 365
  D.4 Memory Optimizations .......... 367
  D.5 Memory Optimizations for Graphics-Engine Programming Using the DMA Model .......... 368
  D.6 Optimizations for Texture-Map Copies to AGP Memory .......... 369
  D.7 Optimizations for Vertex-Geometry Copies to AGP Memory .......... 369

Appendix E  SSE and SSE2 Optimizations .......... 371
  E.1 Half-Register Operations .......... 372
  E.2 Zeroing Out an XMM Register .......... 373
  E.3 Reuse of Dead Registers .......... 375
  E.4 Moving Data Between XMM Registers and GPRs .......... 376
  E.5 Saving and Restoring Registers of Unknown Format .......... 377
  E.6 SSE and SSE2 Copy Loops .......... 378
  E.7 Explicit Load Instructions .......... 379
  E.8 Data Conversion .......... 380

Index .......... 383

Size: 3.31 MB | Pages: 384 | Language: English