User Manual

Table of Contents

Contents .......... 3
Tables .......... 11
Figures .......... 13
Revision History .......... 15

Chapter 1  Introduction .......... 17
  1.1 Intended Audience .......... 17
  1.2 Getting Started Quickly .......... 17
  1.3 Using This Guide .......... 18
  1.4 Important New Terms .......... 20
  1.5 Key Optimizations .......... 22

Chapter 2  C and C++ Source-Level Optimizations .......... 23
  2.1 Declarations of Floating-Point Values .......... 25
  2.2 Using Arrays and Pointers .......... 26
  2.3 Unrolling Small Loops .......... 29
  2.4 Expression Order in Compound Branch Conditions .......... 30
  2.5 Long Logical Expressions in If Statements .......... 32
  2.6 Arrange Boolean Operands for Quick Expression Evaluation .......... 33
  2.7 Dynamic Memory Allocation Consideration .......... 35
  2.8 Unnecessary Store-to-Load Dependencies .......... 36
  2.9 Matching Store and Load Size .......... 38
  2.10 SWITCH and Noncontiguous Case Expressions .......... 41
  2.11 Arranging Cases by Probability of Occurrence .......... 44
  2.12 Use of Function Prototypes .......... 45
  2.13 Use of const Type Qualifier .......... 46
  2.14 Generic Loop Hoisting .......... 47
  2.15 Local Static Functions .......... 50
  2.16 Explicit Parallelism in Code .......... 51
  2.17 Extracting Common Subexpressions .......... 53
  2.18 Sorting and Padding C and C++ Structures .......... 55
  2.19 Sorting Local Variables .......... 57
  2.20 Replacing Integer Division with Multiplication .......... 59
  2.21 Frequently Dereferenced Pointer Arguments .......... 60
  2.22 Array Indices .......... 62
  2.23 32-Bit Integral Data Types .......... 63
  2.24 Sign of Integer Operands .......... 64
  2.25 Accelerating Floating-Point Division and Square Root .......... 66
  2.26 Fast Floating-Point-to-Integer Conversion .......... 68
  2.27 Speeding Up Branches Based on Comparisons Between Floats .......... 70
  2.28 Improving Performance in Linux Libraries .......... 73

Chapter 3  General 64-Bit Optimizations .......... 75
  3.1 64-Bit Registers and Integer Arithmetic .......... 76
  3.2 64-Bit Arithmetic and Large-Integer Multiplication .......... 78
  3.3 128-Bit Media Instructions and Floating-Point Operations .......... 83
  3.4 32-Bit Legacy GPRs and Small Unsigned Integers .......... 84

Chapter 4  Instruction-Decoding Optimizations .......... 87
  4.1 DirectPath Instructions .......... 88
  4.2 Load-Execute Instructions .......... 89
  4.3 Branch Targets in Program Hot Spots .......... 92
  4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction .......... 93
  4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes .......... 94
  4.6 Short Instruction Encodings .......... 96
  4.7 Partial-Register Reads and Writes .......... 97
  4.8 Using LEAVE for Function Epilogues .......... 99
  4.9 Alternatives to SHLD Instruction .......... 101
  4.10 8-Bit Sign-Extended Immediate Values .......... 103
  4.11 8-Bit Sign-Extended Displacements .......... 104
  4.12 Code Padding with Operand-Size Override and NOP .......... 105

Chapter 5  Cache and Memory Optimizations .......... 107
  5.1 Memory-Size Mismatches .......... 108
  5.2 Natural Alignment of Data Objects .......... 111
  5.3 Cache-Coherent Nonuniform Memory Access (ccNUMA) .......... 112
  5.4 Multiprocessor Considerations .......... 115
  5.5 Store-to-Load Forwarding Restrictions .......... 116
  5.6 Prefetch Instructions .......... 120
  5.7 Streaming-Store/Non-Temporal Instructions .......... 128
  5.8 Write-Combining .......... 129
  5.9 L1 Data Cache Bank Conflicts .......... 130
  5.10 Placing Code and Data in the Same 64-Byte Cache Line .......... 132
  5.11 Sorting and Padding C and C++ Structures .......... 133
  5.12 Sorting Local Variables .......... 135
  5.13 Memory Copy .......... 136
  5.14 Stack Considerations .......... 138
  5.15 Cache Issues When Writing Instruction Bytes to Memory .......... 139
  5.16 Interleave Loads and Stores .......... 140

Chapter 6  Branch Optimizations .......... 141
  6.1 Density of Branches .......... 142
  6.2 Two-Byte Near-Return RET Instruction .......... 144
  6.3 Branches That Depend on Random Data .......... 146
  6.4 Pairing CALL and RETURN .......... 148
  6.5 Recursive Functions .......... 149
  6.6 Nonzero Code-Segment Base Values .......... 151
  6.7 Replacing Branches with Computation .......... 152
  6.8 The LOOP Instruction .......... 157
  6.9 Far Control-Transfer Instructions .......... 158

Chapter 7  Scheduling Optimizations .......... 159
  7.1 Instruction Scheduling by Latency .......... 160
  7.2 Loop Unrolling .......... 161
  7.3 Inline Functions .......... 165
  7.4 Address-Generation Interlocks .......... 167
  7.5 MOVZX and MOVSX .......... 169
  7.6 Pointer Arithmetic in Loops .......... 170
  7.7 Pushing Memory Data Directly onto the Stack .......... 173

Chapter 8  Integer Optimizations .......... 175
  8.1 Replacing Division with Multiplication .......... 176
  8.2 Alternative Code for Multiplying by a Constant .......... 180
  8.3 Repeated String Instructions .......... 183
  8.4 Using XOR to Clear Integer Registers .......... 185
  8.5 Efficient 64-Bit Integer Arithmetic in 32-Bit Mode .......... 186
  8.6 Efficient Implementation of Population-Count Function in 32-Bit Mode .......... 195
  8.7 Efficient Binary-to-ASCII Decimal Conversion .......... 197
  8.8 Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants .......... 202
  8.9 Optimizing Integer Division .......... 208

Chapter 9  Optimizing with SIMD Instructions .......... 209
  9.1 Ensure All Packed Floating-Point Data Are Aligned .......... 211
  9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory .......... 212
  9.3 Use MOVLPx/MOVHPx Instructions for Unaligned Data Access .......... 214
  9.4 Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS .......... 215
  9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency .......... 216
  9.6 Avoid Moving Data Directly Between General-Purpose and MMX™ Registers .......... 222
  9.7 Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode .......... 223
  9.8 Passing Data Between MMX™ and 3DNow!™ Instructions .......... 224
  9.9 Storing Floating-Point Data in MMX™ Registers .......... 225
  9.10 EMMS and FEMMS Usage .......... 226
  9.11 Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots .......... 227
  9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions .......... 231
  9.13 Clearing MMX™ and XMM Registers with XOR Instructions .......... 232
  9.14 Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions .......... 233
  9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions .......... 234
  9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions .......... 237
  9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines .......... 246

Chapter 10  x87 Floating-Point Optimizations .......... 253
  10.1 Using Multiplication Rather Than Division .......... 254
  10.2 Achieving Two Floating-Point Operations per Clock Cycle .......... 255
  10.3 Floating-Point Compare Instructions .......... 260
  10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs .......... 261
  10.5 Floating-Point Subexpression Elimination .......... 262
  10.6 Accumulating Precision-Sensitive Quantities in x87 Registers .......... 263
  10.7 Avoiding Extended-Precision Data .......... 264

Appendix A  Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors .......... 265
  A.1 Key Microarchitecture Features .......... 266
  A.2 Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors .......... 267
  A.3 Superscalar Processor .......... 267
  A.4 Processor Block Diagram .......... 267
  A.5 L1 Instruction Cache .......... 268
  A.6 Branch-Prediction Table .......... 269
  A.7 Fetch-Decode Unit .......... 270
  A.8 Instruction Control Unit .......... 270
  A.9 Translation-Lookaside Buffer .......... 270
  A.10 L1 Data Cache .......... 271
  A.11 Integer Scheduler .......... 272
  A.12 Integer Execution Unit .......... 272
  A.13 Floating-Point Scheduler .......... 273
  A.14 Floating-Point Execution Unit .......... 274
  A.15 Load-Store Unit .......... 274
  A.16 L2 Cache .......... 275
  A.17 Write-Combining .......... 276
  A.18 Buses for AMD Athlon™ 64 and AMD Opteron™ Processors .......... 276
  A.19 Integrated Memory Controller .......... 276
  A.20 HyperTransport™ Technology Interface .......... 276

Appendix B  Implementation of Write-Combining .......... 279
  B.1 Write-Combining Definitions and Abbreviations .......... 279
  B.2 Programming Details .......... 280
  B.3 Write-Combining Operations .......... 280
  B.4 Sending Write-Buffer Data to the System .......... 282
  B.5 Write-Combining Optimization on Revision D and E AMD Athlon™ 64 and AMD Opteron™ Processors .......... 282

Appendix C  Instruction Latencies .......... 285
  C.1 Understanding Instruction Entries .......... 286
  C.2 Integer Instructions .......... 289
  C.3 MMX™ Technology Instructions .......... 319
  C.4 x87 Floating-Point Instructions .......... 323
  C.5 3DNow!™ Technology Instructions .......... 330
  C.6 3DNow!™ Technology Extensions .......... 332
  C.7 SSE Instructions .......... 333
  C.8 SSE2 Instructions .......... 342
  C.9 SSE3 Instructions .......... 358

Appendix D  AGP Considerations .......... 361
  D.1 Fast-Write Optimizations .......... 361
  D.2 Fast-Write Optimizations for Graphics-Engine Programming .......... 362
  D.3 Fast-Write Optimizations for Video-Memory Copies .......... 365
  D.4 Memory Optimizations .......... 367
  D.5 Memory Optimizations for Graphics-Engine Programming Using the DMA Model .......... 368
  D.6 Optimizations for Texture-Map Copies to AGP Memory .......... 369
  D.7 Optimizations for Vertex-Geometry Copies to AGP Memory .......... 369

Appendix E  SSE and SSE2 Optimizations .......... 371
  E.1 Half-Register Operations .......... 372
  E.2 Zeroing Out an XMM Register .......... 373
  E.3 Reuse of Dead Registers .......... 375
  E.4 Moving Data Between XMM Registers and GPRs .......... 376
  E.5 Saving and Restoring Registers of Unknown Format .......... 377
  E.6 SSE and SSE2 Copy Loops .......... 378
  E.7 Explicit Load Instructions .......... 379
  E.8 Data Conversion .......... 380

Index .......... 383

Size: 3.31 MB | Pages: 384 | Language: English