
14-22 Vol. 1

PROGRAMMING WITH AVX, FMA AND AVX2

14.5.1 FMA Instruction Operand Order and Arithmetic Behavior

FMA instruction mnemonics are defined explicitly with an ordered sequence of three digits, e.g., VFMADD132PD. The value of 
each digit refers to the ordering of the three source operands as defined by the instruction encoding specification:

‘1’: The first source operand (also the destination operand) in the syntactical order listed in this specification.

‘2’: The second source operand in the syntactical order. This is a YMM/XMM register, encoded using VEX prefix.

‘3’: The third source operand in the syntactical order. The first and third operands are encoded following ModR/M 
encoding rules. 

The ordering of each digit within the mnemonic refers to the floating-point data listed on the right-hand side of the 
arithmetic equation of each FMA operation (see Table 14-17):

The first position in the three digits of an FMA mnemonic refers to the operand position of the first FP data 
expressed in the arithmetic equation of the FMA operation, the multiplicand.

The second position in the three digits of an FMA mnemonic refers to the operand position of the second FP data 
expressed in the arithmetic equation of the FMA operation, the multiplier.

The third position in the three digits of an FMA mnemonic refers to the operand position of the FP data being 
added to or subtracted from the multiplication result. 
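The digit rules above can be sketched in a few lines of Python. This is only an illustration, not part of the instruction set definition: the helper names fma_operand_roles and describe are hypothetical, and the sign of the product and of the addend is left open because it depends on the instruction family rather than on the digits.

```python
import re

def fma_operand_roles(mnemonic: str) -> dict:
    """Map the three digits of an FMA mnemonic to arithmetic roles.

    Hypothetical helper: the digit in position 1 names the multiplicand,
    position 2 the multiplier, position 3 the value added or subtracted.
    """
    match = re.search(r"(\d)(\d)(\d)", mnemonic)
    if match is None:
        raise ValueError(f"no three-digit operand order in {mnemonic!r}")
    multiplicand, multiplier, addend = (int(d) for d in match.groups())
    if sorted((multiplicand, multiplier, addend)) != [1, 2, 3]:
        raise ValueError(f"digits must be a permutation of 1, 2, 3: {mnemonic!r}")
    return {"multiplicand": multiplicand, "multiplier": multiplier, "addend": addend}

def describe(mnemonic: str) -> str:
    """Render the operand pattern; operand 1 is always the destination."""
    roles = fma_operand_roles(mnemonic)
    return (f"op1 <- (op{roles['multiplicand']} * op{roles['multiplier']})"
            f" +/- op{roles['addend']}")

if __name__ == "__main__":
    for name in ("VFMADD132PD", "VFMADD213PD", "VFMADD231PD"):
        print(name, describe(name))
```

For VFMADD132PD this reads as: multiply the first and third operands, then add the second, and write the result to the first operand.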

Note that a non-numerical (NaN) result of an FMA operation does not follow the mathematically defined commutative 
property between the multiplicand and the multiplier values (see Table 14-17). Consequently, software tools (such 
as an assembler) may support a complementary set of FMA mnemonics for each FMA instruction for ease of 
programming, to take advantage of the mathematical property of commutative multiplication. For example, an 
assembler may optionally support the complementary mnemonic “VFMADD312PD” in addition to the true 
mnemonic “VFMADD132PD”. The assembler will generate the same instruction opcode sequence corresponding to 
VFMADD132PD. The processor executes VFMADD132PD and reports any NaN conditions based on the definition of 
VFMADD132PD. Similarly, if the complementary mnemonic VFMADD123PD is supported by an assembler at source 
level, it must generate the opcode sequence corresponding to VFMADD213PD; the complementary mnemonic 
VFMADD321PD must produce the opcode sequence defined by VFMADD231PD. In the absence of FMA operations 
reporting a NaN result, the numerical results of using either mnemonic with an assembler supporting both 
mnemonics will match the behavior defined in Table 14-17. Support for the complementary FMA mnemonics by 
software tools is optional. 
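As a rough sketch of how a tool might fold a complementary mnemonic onto its true form, the Python below swaps the multiplicand and multiplier digits. The function name canonical_fma_mnemonic and the TRUE_ORDERS set are assumptions made for illustration; no particular assembler is claimed to work this way.

```python
import re

# The three digit orders that have their own opcode encodings.
TRUE_ORDERS = {"132", "213", "231"}

def canonical_fma_mnemonic(mnemonic: str) -> str:
    """Return the true mnemonic for a true or complementary FMA mnemonic.

    Hypothetical helper: a complementary form differs from its true form
    only by swapping the first two digits (multiplicand and multiplier).
    """
    match = re.fullmatch(r"([A-Z]+)(\d{3})([A-Z]+)", mnemonic.upper())
    if match is None:
        raise ValueError(f"not an FMA-style mnemonic: {mnemonic!r}")
    prefix, order, suffix = match.groups()
    if order in TRUE_ORDERS:
        return prefix + order + suffix
    swapped = order[1] + order[0] + order[2]
    if swapped in TRUE_ORDERS:
        return prefix + swapped + suffix
    raise ValueError(f"unsupported operand order: {order}")

if __name__ == "__main__":
    for name in ("VFMADD312PD", "VFMADD123PD", "VFMADD321PD"):
        print(name, "->", canonical_fma_mnemonic(name))
```

Because the processor only ever sees the true encoding, any NaN handling follows the true mnemonic, not the complementary spelling used in the source.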

14.5.2 Fused-Multiply-ADD (FMA) Numeric Behavior

FMA instructions can perform fused-multiply-add operations (including fused-multiply-subtract, and other varieties) 
on packed and scalar data elements in the instruction operands. Separate FMA instructions are provided to 
handle different types of arithmetic operations on the three source operands.
FMA instruction syntax is defined using three source operands; the first source operand is updated based on the 
result of the arithmetic operations on the data elements of the 128-bit or 256-bit operands, i.e., the first source operand 
is also the destination operand.
The arithmetic FMA operation performed in an FMA instruction takes one of several forms, r=(x*y)+z, r=(x*y)-z, 
r=-(x*y)+z, or r=-(x*y)-z. Packed FMA instructions can perform eight single-precision FMA operations or four 
double-precision FMA operations with 256-bit vectors. 
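The four arithmetic forms and the lane counts can be modeled in Python as a minimal sketch. It assumes finite inputs and round-to-nearest-even, and it ignores NaNs, infinities, and signed zeros; the names fused and lanes are hypothetical. Exact rational arithmetic stands in for the hardware so that the result is rounded only once, which is what distinguishes a fused operation from a separate multiply and add.

```python
from fractions import Fraction

def fused(x, y, z, negate_product=False, subtract_addend=False):
    """Compute +/-(x*y) +/- z with a single rounding at the end.

    Hypothetical model: Fraction keeps the product and sum exact, and
    float() performs the one final rounding. Finite inputs only.
    """
    product = Fraction(x) * Fraction(y)
    if negate_product:
        product = -product
    addend = -Fraction(z) if subtract_addend else Fraction(z)
    return float(product + addend)

def lanes(vector_bits, element_bits):
    """Number of data elements a packed instruction processes."""
    return vector_bits // element_bits

# x*y is exactly 1 - 2**-104, which a separate multiply rounds to 1.0.
x = 1.0 + 2.0**-52
y = 1.0 - 2.0**-52
z = -1.0
unfused_result = x * y + z       # two roundings
fused_result = fused(x, y, z)    # one rounding

if __name__ == "__main__":
    print("unfused:", unfused_result)
    print("fused:  ", fused_result)
    print("lanes:  ", lanes(256, 32), lanes(256, 64))
```

The flags select among r=(x*y)+z, r=(x*y)-z, r=-(x*y)+z, and r=-(x*y)-z; the example inputs are chosen so that the intermediate rounding of the unfused version discards the entire result.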
Scalar FMA instructions perform only one arithmetic operation, on the low-order data element. The content of the 
rest of the data elements in the lower 128 bits of the destination operand is preserved. The upper 128 bits of the 
destination operand are filled with zero. 
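The scalar destination update described above can be pictured with a short Python sketch. It assumes a 256-bit register modeled as a list of elements with index 0 as the low-order element; the function name scalar_fma_writeback is hypothetical.

```python
def scalar_fma_writeback(dest, fma_value, element_bits):
    """Model the destination register after a scalar FMA instruction.

    dest: list of 256 // element_bits elements, index 0 is low-order.
    Only element 0 receives the FMA result; the other elements of the
    lower 128 bits are kept, and the upper 128 bits are zeroed.
    """
    lanes_256 = 256 // element_bits
    lanes_128 = 128 // element_bits
    if len(dest) != lanes_256:
        raise ValueError(f"expected {lanes_256} elements, got {len(dest)}")
    preserved = list(dest[1:lanes_128])
    zeroed = [0.0] * (lanes_256 - lanes_128)
    return [fma_value] + preserved + zeroed

if __name__ == "__main__":
    # Double-precision case: 4 elements per 256-bit register.
    print(scalar_fma_writeback([1.0, 2.0, 3.0, 4.0], 9.0, 64))
    # Single-precision case: 8 elements per 256-bit register.
    print(scalar_fma_writeback([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 9.0, 32))
```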

Table 14-15.  FMA Instructions

Instruction                                                    Description
VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD xmm0, xmm1, xmm2/m64    Fused Negative Multiply-Subtract of Scalar Double-Precision Floating-Point Values
VFNMSUB132SS/VFNMSUB213SS/VFNMSUB231SS xmm0, xmm1, xmm2/m32    Fused Negative Multiply-Subtract of Scalar Single-Precision Floating-Point Values