PROGRAMMING WITH AVX, FMA AND AVX2

AVX introduces 18 new data processing instructions that operate on 256-bit vectors (see Table 14-4). These new primitives cover the following operations:

• Non-unit-strided fetching of SIMD data. AVX provides several flexible SIMD floating-point data fetching primitives:
— broadcast of single or multiple data elements into a 256-bit destination,
— masked move primitives to load or store SIMD data elements conditionally.

• Intra-register manipulation of SIMD data elements. AVX provides several flexible SIMD floating-point data manipulation primitives:
— insert/extract multiple SIMD floating-point data elements to/from 256-bit SIMD registers,
— permute primitives to facilitate efficient manipulation of floating-point data elements in 256-bit SIMD registers.

• Branch handling. AVX provides several primitives to enable handling of branches in SIMD programming:
— New variable blend instructions support a four-operand syntax with non-destructive source operands. This is more flexible than the equivalent SSE4 instruction syntax, which uses the XMM0 register as the implied mask for blend selection.
— Packed TEST instructions for floating-point data.
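The branch-handling primitives above can be illustrated with a small sketch in C intrinsics. It assumes a GCC- or Clang-compatible compiler (for the `target("avx")` attribute) and an AVX-capable processor. The function names `select_scalar` and `select_avx` are hypothetical and chosen only for this example. The per-element branch is replaced by a compare that produces a mask, a packed test (VTESTPS) that skips work when no lane takes the branch, and a variable blend (VBLENDVPS) that merges both outcomes.

```c
#include <assert.h>
#include <immintrin.h>

/* Scalar reference: out[i] = a[i] > b[i] ? a[i] - b[i] : a[i] + b[i] */
static void select_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (a[i] > b[i]) ? a[i] - b[i] : a[i] + b[i];
}

/* Branch-free AVX version (hypothetical helper); n must be a multiple of 8. */
__attribute__((target("avx")))
static void select_avx(const float *a, const float *b, float *out, int n)
{
    assert(n >= 0 && n % 8 == 0);
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        /* Each lane becomes all-ones where a > b, all-zeros otherwise. */
        __m256 mask = _mm256_cmp_ps(va, vb, _CMP_GT_OQ);
        __m256 sum = _mm256_add_ps(va, vb);

        /* VTESTPS: returns 1 when no lane of the mask has its sign bit set,
           so the "a - b" path is not needed for this group of eight. */
        if (_mm256_testz_ps(mask, mask)) {
            _mm256_storeu_ps(out + i, sum);
            continue;
        }

        __m256 diff = _mm256_sub_ps(va, vb);
        /* VBLENDVPS: take diff where the mask sign bit is set, else sum.
           The mask is an ordinary register operand, not an implied XMM0. */
        _mm256_storeu_ps(out + i, _mm256_blendv_ps(sum, diff, mask));
    }
}
```

Calling `select_avx(a, b, out, n)` on arrays whose length is a multiple of eight should give the same results as `select_scalar`. A production version would also need a remainder loop for other lengths.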

Table 14-3. Promoted 256-bit and 128-bit Data Movement AVX Instructions

VEX.256 Encoding   VEX.128 Encoding   Legacy Instruction Mnemonic
yes                yes                UNPCKHPD, UNPCKHPS, UNPCKLPD
yes                yes                BLENDPS, BLENDPD
yes                yes                SHUFPD, SHUFPS, UNPCKLPS
yes                yes                BLENDVPS, BLENDVPD
yes                yes                PTEST, MOVMSKPD, MOVMSKPS
yes                yes                XORPS, XORPD, ORPS, ORPD
yes                yes                ANDNPD, ANDNPS, ANDPD, ANDPS

Table 14-4. 256-bit AVX Instruction Enhancement

Instruction                               Description
VBROADCASTF128 ymm1, m128                 Broadcast 128-bit floating-point values in mem to low and high 128 bits in ymm1.
VBROADCASTSD ymm1, m64                    Broadcast double-precision floating-point element in mem to four locations in ymm1.
VBROADCASTSS ymm1, m32                    Broadcast single-precision floating-point element in mem to eight locations in ymm1.
VEXTRACTF128 xmm1/m128, ymm2, imm8        Extract 128 bits of packed floating-point values from ymm2 and store results in xmm1/mem.
VINSERTF128 ymm1, ymm2, xmm3/m128, imm8   Insert 128 bits of packed floating-point values from xmm3/mem and the remaining values from ymm2 into ymm1.
VMASKMOVPS ymm1, ymm2, m256               Load packed single-precision values from mem using mask in ymm2 and store in ymm1.
VMASKMOVPD ymm1, ymm2, m256               Load packed double-precision values from mem using mask in ymm2 and store in ymm1.
VMASKMOVPS m256, ymm1, ymm2               Store packed single-precision values from ymm2 using mask in ymm1.
VMASKMOVPD m256, ymm1, ymm2               Store packed double-precision values from ymm2 using mask in ymm1.
VPERMILPD ymm1, ymm2, ymm3/m256           Permute double-precision floating-point values in ymm2 using controls from ymm3/mem and store result in ymm1.
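Several of the Table 14-4 primitives can be sketched with C intrinsics as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes a GCC- or Clang-compatible compiler (for the `target("avx")` attribute) and an AVX-capable processor, and the helper names `scale_tail` and `hsum4` are hypothetical. The first helper combines VBROADCASTSS with the load and store forms of VMASKMOVPS to process a partial vector. The second uses the variable-control form of VPERMILPD and VEXTRACTF128 to sum four doubles.

```c
#include <assert.h>
#include <immintrin.h>

/* Scale the first n floats (0 <= n <= 8) of data by *factor without
   reading or writing any element at index n or beyond. */
__attribute__((target("avx")))
static void scale_tail(float *data, int n, const float *factor)
{
    assert(n >= 0 && n <= 8);

    /* Lane i is active when i < n. VMASKMOVPS inspects only the most
       significant bit of each 32-bit mask element. */
    int m[8];
    for (int i = 0; i < 8; i++)
        m[i] = (i < n) ? -1 : 0;
    __m256i mask = _mm256_loadu_si256((const __m256i *)m);

    __m256 f = _mm256_broadcast_ss(factor);     /* VBROADCASTSS ymm, m32 */
    __m256 v = _mm256_maskload_ps(data, mask);  /* masked load; inactive lanes read as 0 */
    _mm256_maskstore_ps(data, mask, _mm256_mul_ps(v, f)); /* masked store */
}

/* Horizontal sum of four doubles. */
__attribute__((target("avx")))
static double hsum4(const double *p)
{
    __m256d v = _mm256_loadu_pd(p);

    /* VPERMILPD with a register control: bit 1 of each 64-bit control
       element selects the low (0) or high (1) double of the same 128-bit
       lane. The values 2, 0, 2, 0 (elements 0..3) swap the doubles in each lane. */
    __m256i ctrl = _mm256_set_epi64x(0, 2, 0, 2);
    __m256d swapped = _mm256_permutevar_pd(v, ctrl);

    __m256d pairs = _mm256_add_pd(v, swapped);      /* each lane holds its pair sum */
    __m128d lo = _mm256_castpd256_pd128(pairs);     /* low 128 bits, no instruction needed */
    __m128d hi = _mm256_extractf128_pd(pairs, 1);   /* VEXTRACTF128 with imm8 = 1 */
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi));
}
```

For example, `scale_tail(d, 3, &k)` with `k = 2.0f` doubles only `d[0]` through `d[2]`. This is the typical use of masked moves for the final partial iteration of a vectorized loop.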