In every programming language there are sets of opcodes that are recommended over others. I've tried to list them here, in order of speed.
- Bitwise
- Integer Addition / Subtraction
- Integer Multiplication / Division
- Comparison
- Control flow
- Float Addition / Subtraction
- Float Multiplication / Division
Where you need high-performance code, C++ can be hand-optimized in assembly to use SIMD instructions or more efficient control flow, data types, etc. So I'm trying to understand whether the data type (int32 / float32 / float64) or the operation used (*, +, &) affects performance at the CPU level.
- Is a single multiply slower on the CPU than an addition?
- In MCU theory you learn that the speed of an opcode is determined by the number of CPU cycles it takes to execute. So does that mean a multiply takes 4 cycles and an add takes 2?
- Exactly what are the speed characteristics of the basic math and control flow opcodes?
- If two opcodes take the same number of cycles to execute, can they be used interchangeably without any performance gain / loss?
- Any other technical details you can share regarding x86 CPU performance are appreciated
Answer
Agner Fog's optimization guides are excellent. He has guides, tables of instruction timings, and docs on the microarchitecture of all recent x86 CPU designs (going back as far as Intel Pentium). See also some other resources linked from https://stackoverflow.com/tags/x86/info
Just for fun, I'll answer some of the questions (numbers from recent Intel CPUs). Choice of ops is not the major factor in optimizing code (unless you can avoid division.)
Is a single multiply slower on the CPU than an addition?
Yes (unless it's by a power of 2). (3-4x the latency, with only one per clock throughput on Intel.) Don't go far out of your way to avoid it, though, since it's as fast as 2 or 3 adds.
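If you want to see the latency difference directly, here's a minimal sketch of the kind of dependent-chain microbenchmark I mean (hypothetical function names; the empty asm barrier is GNU syntax, so GCC/Clang only):

```cpp
#include <cstdint>

// Hypothetical latency microbenchmark sketch. Each iteration depends on the
// previous result, so the loop runs at the operation's *latency*; the loop
// counter and branch overlap with it and cost roughly nothing extra.
static inline std::uint64_t barrier(std::uint64_t x) {
    asm volatile("" : "+r"(x));   // stop the compiler from collapsing the chain
    return x;
}

std::uint64_t add_chain(std::uint64_t x, std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i)
        x = barrier(x + 12345);   // dependent adds: ~1 cycle per iteration
    return x;
}

std::uint64_t mul_chain(std::uint64_t x, std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i)
        x = barrier(x * 12345);   // dependent imul: ~3 cycles per iteration on recent Intel
    return x;
}
```

With each result feeding the next operation, mul_chain should run roughly 3x slower per iteration than add_chain on recent Intel; with independent operations you'd be measuring throughput instead, and the gap mostly disappears.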
Exactly what are the speed characteristics of the basic math and control flow opcodes?
See Agner Fog's instruction tables and microarchitecture guide if you want to know exactly :P. Be careful with conditional jumps. Unconditional jumps (like function calls) have some small overhead, but not much.
If two opcodes take the same number of cycles to execute, then both can be used interchangeably without any performance gain / loss?
Nope, they might compete for the same execution port as something else, or they might not. It depends on what other dependency chains the CPU can be working on in parallel. (In practice, there's not usually any useful decision to be made. It occasionally comes up that you could use a vector shift or a vector shuffle, which run on different ports on Intel CPUs. But shift-by-bytes of the whole register (PSLLDQ etc.) runs in the shuffle unit.)
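As a rough sketch of that shift-vs-shuffle distinction (SSE2 intrinsics; the port assignments described in the comments are the Intel behaviour mentioned above, so check the instruction tables for your particular uarch):

```cpp
#include <immintrin.h>

// These two do different things: the first shifts bits within each 64-bit
// element (psllq, a vector-shift uop on a shift/ALU port), the second shifts
// the whole 128-bit register by bytes (pslldq, which runs in the shuffle unit
// and so competes with other shuffles for that port).
__m128i shift_bits_within_elements(__m128i v) {
    return _mm_slli_epi64(v, 8);      // psllq: bits don't cross the 64-bit element boundary
}

__m128i shift_whole_register_by_bytes(__m128i v) {
    return _mm_slli_si128(v, 1);      // pslldq: byte-granularity shift of the full register
}
```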
Any other technical details you can share regarding x86 CPU performance are appreciated
Agner Fog's microarch docs describe the pipelines of Intel and AMD CPUs in enough detail to work out exactly how many cycles a loop should take per iteration, and whether the bottleneck is uop throughput, a dependency chain, or contention for one execution port. See some of my answers on StackOverflow, like this one or this one.
Also, http://www.realworldtech.com/haswell-cpu/ (and similar for earlier designs) is fun reading if you like CPU design.
Here's your list, sorted for a Haswell CPU, based on my best guestimates. This isn't really a useful way of thinking about things for anything but tuning an asm loop, though. Cache / branch-prediction effects usually dominate, so write your code to have good patterns. Numbers are very hand-wavey, and try to account for high latency, even if throughput is not an issue, or for generating more uops which clog up the pipe for other things to happen in parallel. Esp. the cache / branch numbers are very made-up. Latency matters for loop-carried dependencies, throughput matters when each iteration is independent.
TL:DR these numbers are made-up based on what I'm picturing for a "typical" use-case, as far as tradeoffs between latency, execution-port bottlenecks, and front-end throughput (or stalls for things like branch misses). Please don't use these numbers for any kind of serious perf analysis.
- 0.5 to 1 Bitwise / Integer Addition / Subtraction / shift and rotate (compile-time const count) / vector versions of all of these (1 to 4 per cycle throughput, 1 cycle latency)
- 1 vector min, max, compare-equal, compare-greater (to create a mask)
- 1.5 vector shuffles. Haswell and newer only have one shuffle port, and it seems to me it's common to need a lot of shuffling if you need any, so I'm weighting it slightly higher to encourage thinking about using fewer shuffles. They're not free, esp. if you need a pshufb control mask from memory.
- 1.5 load / store (L1 cache hit. throughput better than latency)
- 1.75 Integer Multiplication (3c latency/one per 1c tput on Intel, 4c lat on AMD and only one per 2c tput). Small constants are even cheaper using LEA and/or ADD/SUB/shift. But of course compile-time constants are always good, and can often optimize into other things. (And multiply in a loop can often be strength-reduced by the compiler to tmp += 7 in a loop instead of tmp = i*7; there's a sketch of this below the list.)
- 1.75 some 256b vector shuffle (extra latency on insns that can move data between 128b lanes of an AVX vector). (Or 3 to 7 on Ryzen where lane crossing shuffles need many more uops)
- 2 fp add/sub (and vector versions of the same) (1 or 2 per cycle throughput, 3 to 5 cycle latency). Can be slow if you bottleneck on latency, e.g. summing an array with only 1 sum variable (see the multiple-accumulator sketch further down). (I could weight this and fp mul as low as 1 or as high as 5 depending on use-case.)
- 2 vector fp mul or FMA. (x*y + z is as cheap as either a mul or an add if you compile with FMA support enabled).
- 2 inserting/extracting general-purpose registers into vector elements (_mm_insert_epi8, etc.)
- 2.25 vector int mul (16-bit elements or pmaddubsw doing 8*8 -> 16-bit). Cheaper on Skylake, with better throughput than scalar mul
- 2.25 shift / rotate by variable count (2c latency, one per 2c throughput on Intel, faster on AMD or with BMI2)
- 2.5 Comparison without branching (y = x ? a : b, or y = x >= 0) (test / setcc or cmov)
- 3 int -> float conversion
- 3 perfectly predicted Control flow (predicted branch, call, return).
- 4 vector int mul (32-bit elements) (2 uops, 10c latency on Haswell)
- 4 integer division or % by a compile-time constant (non-power of 2).
- 7 vector horizontal ops (e.g. PHADD adding values within a vector)
- 11 (vector) FP Division (10-13c latency, one per 7c throughput or worse). (Can be cheap if used rarely, but throughput is 6 to 40x worse than FP mul)
- 13? Control Flow (poorly-predicted branch, maybe 75% predictable)
- 13 int division (yes really, it's slower than FP division, and can't vectorize). (note that compilers divide by a constant using mul/shift/add with a magic constant, and div/mod by powers of 2 is very cheap.)
- 16 (vector)FP sqrt
- 25? load (L3 cache hit). (cache-miss stores are cheaper than loads.)
- 50? FP trig / exp / log. If you need a lot of exp/log and don't need full accuracy, you can trade accuracy for speed with a shorter polynomial and/or a table. You can also SIMD vectorize.
- 50-80? always-mispredicted branch, costing 15-20 cycles
- 200-400? load/store (cache miss)
- 3000??? read page from file (OS disk cache hit) (making up numbers here)
- 20000??? disk read page (OS disk-cache miss, fast SSD) (totally made-up number)
I totally made this up based on guesswork. If something looks wrong, it's either because I was thinking of a different use-case, or an editing error.
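To show what the strength-reduction mentioned in the integer-multiply bullet looks like, here's a sketch (hypothetical function names; compilers typically do this transformation themselves at -O2):

```cpp
#include <cstddef>

// Strength-reduction sketch: replace a per-iteration multiply with an add that
// carries the running product. Only to show the idea; compilers usually do
// this for simple induction variables on their own.
void scale_indices_mul(int *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<int>(i) * 7;   // a multiply every iteration
}

void scale_indices_add(int *out, std::size_t n) {
    int tmp = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = tmp;                       // same values as i*7 ...
        tmp += 7;                           // ... but carried by an add
    }
}
```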
The relative cost of things on AMD CPUs will be similar, except they have faster integer shifters when the shift-count is variable. AMD Bulldozer-family CPUs are of course slower on most code, for a variety of reasons. (Ryzen is pretty good at a lot of stuff).
Keep in mind that it's really impossible to boil things down to a one-dimensional cost. Other than cache-misses and branch mispredicts, the bottleneck in a block of code can be latency, total uop throughput (frontend), or throughput of a specific port (execution port).
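For example, here's the latency-bound array sum from the fp add/sub bullet, and the usual fix of using multiple accumulators so the adds can overlap (a sketch; compilers can do this reassociation themselves if you allow it, e.g. with -ffast-math):

```cpp
#include <cstddef>

// Latency vs. throughput sketch: one accumulator serializes on FP-add latency
// (~3-5 cycles per element); several independent accumulators let the adds
// overlap, so throughput (1-2 adds per cycle) becomes the limit instead.
double sum_one_acc(const double *a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];                   // each add waits for the previous one
    return sum;
}

double sum_four_acc(const double *a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {       // four independent dependency chains
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)                 // leftover elements
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```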
A "slow" operation like FP division can be very cheap if the surrounding code keeps the CPU busy with other work. (vector FP div or sqrt are 1 uop each, they just have bad latency and throughput. They only block the divide unit, not the whole execution port that it's on. Integer div is several uops.) So if you only have one FP divide for every ~20 mul and add, and there's other work for the CPU to do (e.g. an independent loop iteration), then the "cost" of the FP div could be about the same as an FP mul. This is probably the best example of something that's low throughput when it's all you're doing, but mixes very well with other code (when latency isn't a factor), because of low total uops.
Note that integer division is not nearly as friendly to surrounding code: On Haswell, it's 9 uops, with one per 8-11c throughput, and 22-29c latency. (64bit division is much slower, even on Skylake.) So the latency and throughput numbers are somewhat similar to FP div, but FP div is only one uop.
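A related sketch: if you're dividing a whole array by the same value, hoisting one divide out of the loop and multiplying by the reciprocal leaves only one expensive op. (Multiplying by the reciprocal isn't bit-identical to dividing each element, which is why compilers only do this for you under -ffast-math or similar.)

```cpp
#include <cstddef>

// FP-division cost sketch: one divide per element vs. one divide up front.
void divide_each(double *x, std::size_t n, double d) {
    for (std::size_t i = 0; i < n; ++i)
        x[i] /= d;                     // a real divide every iteration
}

void divide_each_recip(double *x, std::size_t n, double d) {
    const double inv = 1.0 / d;        // single divide hoisted out of the loop
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= inv;                   // multiplies: much better throughput
}
```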
For examples of analysing a short sequence of insns for throughput, latency, and total uops, see some of my SO answers:
IDK if other people write SO answers including this kind of analysis. I have a much easier time finding my own, because I know I go into this detail often, and I can remember what I've written.