14-10 Vol. 1
PROGRAMMING WITH AVX, FMA AND AVX2
AVX introduces 18 new data processing instructions that operate on 256-bit vectors, Table 14-4. These new primi-
tives cover the following operations:
•
Non-unit-strided fetching of SIMD data. AVX provides several flexible SIMD floating-point data fetching
primitives:
— broadcast of single or multiple data elements into a 256-bit destination,
— masked move primitives to load or store SIMD data elements conditionally,
•
Intra-register manipulation of SIMD data elements. AVX provides several flexible SIMD floating-point data
manipulation primitives:
— insert/extract multiple SIMD floating-point data elements to/from 256-bit SIMD registers
— permute primitives to facilitate efficient manipulation of floating-point data elements in 256-bit SIMD
registers
•
Branch handling. AVX provides several primitives to enable handling of branches in SIMD programming:
— new variable blend instructions supports four-operand syntax with non-destructive source syntax. This is
more flexible than the equivalent SSE4 instruction syntax which uses the XMM0 register as the implied
mask for blend selection.
— Packed TEST instructions for floating-point data.
yes
yes
UNPCKHPD, UNPCKHPS, UNPCKLPD
yes
yes
BLENDPS, BLENDPD
yes
yes
SHUFPD, SHUFPS, UNPCKLPS
yes
yes
BLENDVPS, BLENDVPD
yes
yes
PTEST, MOVMSKPD, MOVMSKPS
yes
yes
XORPS, XORPD, ORPS, ORPD
yes
yes
ANDNPD, ANDNPS, ANDPD, ANDPS
Table 14-4. 256-bit AVX Instruction Enhancement
Instruction
Description
VBROADCASTF128 ymm1, m128
Broadcast 128-bit floating-point values in mem to low and high 128-bits in ymm1.
VBROADCASTSD ymm1, m64
Broadcast double-precision floating-point element in mem to four locations in ymm1.
VBROADCASTSS ymm1, m32
Broadcast single-precision floating-point element in mem to eight locations in ymm1.
VEXTRACTF128 xmm1/m128, ymm2,
imm8
Extracts 128-bits of packed floating-point values from ymm2 and store results in
xmm1/mem.
VINSERTF128 ymm1, ymm2,
xmm3/m128, imm8
Insert 128-bits of packed floating-point values from xmm3/mem and the remaining val-
ues from ymm2 into ymm1
VMASKMOVPS ymm1, ymm2, m256
Load packed single-precision values from mem using mask in ymm2 and store in ymm1
VMASKMOVPD ymm1, ymm2, m256
Load packed double-precision values from mem using mask in ymm2 and store in ymm1
VMASKMOVPS m256, ymm1, ymm2
Store packed single-precision values from ymm2 mask in ymm1
VMASKMOVPD m256, ymm1, ymm2
Store packed double-precision values from ymm2 using mask in ymm1
VPERMILPD ymm1, ymm2, ymm3/m256
Permute Double-Precision Floating-Point values in ymm2 using controls from xmm3/mem
and store result in ymm1
Table 14-3. Promoted 256-bit and 128-bit Data Movement AVX Instructions
VEX.256 Encoding
VEX.128 Encoding
Legacy Instruction Mnemonic