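Chapter 8 Using BasicMathFunctions (Part 1)
This installment begins our study of ARM's official DSP library, starting with the basic math functions. It covers four operations: absolute value, addition, dot product, and multiplication.
8.1 Absolute Value (Vector Absolute Value)
8.2 Addition (Vector Addition)
8.3 Dot Product (Vector Dot Product)
8.4 Multiplication (Vector Multiplication)
8.1 Absolute Value (Vector Absolute Value)
These functions compute the element-wise absolute value of a vector:
pDst[n] = abs(pSrc[n]), 0 <= n < blockSize.
Note that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.
8.1.1 arm_abs_f32
This function computes the absolute value of 32-bit floating-point numbers. The annotated source code follows: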
- /**
- * @brief Floating-point vector absolute value. (1)
- * @param[in] *pSrc points to the input buffer
- * @param[out] *pDst points to the output buffer
- * @param[in] blockSize number of samples in each vector
- * @return none.
- */
-
- void arm_abs_f32( (2)
- float32_t * pSrc,
- float32_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
-
- #ifndef ARM_MATH_CM0_FAMILY (3)
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- float32_t in1, in2, in3, in4; /* temporary variables */
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u; (4)
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Calculate absolute and then store the results in the destination buffer. */
- /* read sample from source */
- in1 = *pSrc;
- in2 = *(pSrc + 1);
- in3 = *(pSrc + 2);
-
- /* find absolute value */
- in1 = fabsf(in1); (5)
-
- /* read sample from source */
- in4 = *(pSrc + 3);
-
- /* find absolute value */
- in2 = fabsf(in2);
-
- /* store result to destination */
- *pDst = in1;
-
- /* find absolute value */
- in3 = fabsf(in3);
-
- /* find absolute value */
- in4 = fabsf(in4);
-
- /* store result to destination */
- *(pDst + 1) = in2;
-
- /* store result to destination */
- *(pDst + 2) = in3;
-
- /* store result to destination */
- *(pDst + 3) = in4;
-
-
- /* Update source pointer to process next samples */ (6)
- pSrc += 4u;
-
- /* Update destination pointer to process next samples */
- pDst += 4u;
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- #else (7)
-
- /* Run the below code for Cortex-M0 */
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
- while(blkCnt > 0u) (8)
- {
- /* C = |A| */
- /* Calculate absolute and then store the results in the destination buffer. */
- *pDst++ = fabsf(*pSrc++);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
- }
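1. A brief introduction to the conventions shared by the functions in the DSP library. They are not repeated in later sections.
(1) Essentially all of the functions are reentrant.
(2) Most functions operate on a block of data. arm_abs_f32, for example, computes the absolute values of a whole array. If you only need the absolute value of a few numbers, the library function offers no real advantage.
(3) The library functions generally support CM0, CM3 and CM4 (the latest DSP library has also added CM7 support).
(4) Data is generally processed four samples at a time; any leftover samples (fewer than four) are handled separately.
(5) Most functions come in four variants: f32, Q31, Q15 and Q7.
2. Function parameters: the function takes an input array and computes the absolute value of every element.
3. This part of the code is for the CM3 and CM4 cores.
4. Shifting blockSize right by two bits (dividing by 4) gives the number of complete groups of four samples.
5. fabsf: this function is not implemented with the DSP instructions of the Cortex-M4F. It is an ordinary C library function (declared in math.h) that ships with MDK. (On a core with an FPU the compiler may still reduce the call to a single VABS.F32 instruction.)
6. Advance to the next group of data.
7. This part of the code is for CM0.
8. This loop handles the leftover samples (fewer than four), or all of the samples on a CM0 core.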
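The four-at-a-time pattern described in the notes above can be sketched in portable C. This is a minimal sketch rather than the library code: the names abs_f32_unrolled and abs_one are made up for illustration, and a plain comparison stands in for fabsf so that the snippet builds without the math library.

```c
#include <stddef.h>

/* Hypothetical stand-in for fabsf (assumption: sign of -0.0f is not of interest). */
static float abs_one(float x)
{
    return (x < 0.0f) ? -x : x;
}

/* Hypothetical helper, not part of CMSIS-DSP: absolute value of n floats.
 * Complete groups of four are handled in the first loop, the remaining
 * 0 to 3 samples in the second loop. src and dst may be the same buffer. */
void abs_f32_unrolled(const float *src, float *dst, size_t n)
{
    size_t groups = n >> 2;          /* n / 4: number of complete groups */

    while (groups > 0u) {
        dst[0] = abs_one(src[0]);
        dst[1] = abs_one(src[1]);
        dst[2] = abs_one(src[2]);
        dst[3] = abs_one(src[3]);
        src += 4;                    /* move on to the next group */
        dst += 4;
        groups--;
    }

    size_t rest = n % 4u;            /* leftover samples */
    while (rest > 0u) {
        *dst++ = abs_one(*src++);
        rest--;
    }
}
```

With n = 6 the first loop runs once and the second loop handles the last two samples. On a CM0 build the library skips the first loop and lets the second loop process everything.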
8.1.2 arm_abs_q31
This function computes the absolute value of 32-bit fixed-point (Q31) numbers. The annotated source code follows:
- /**
- * @brief Q31 vector absolute value.
- * @param[in] *pSrc points to the input buffer
- * @param[out] *pDst points to the output buffer
- * @param[in] blockSize number of samples in each vector
- * @return none.
- *
- * <b>Scaling and Overflow Behavior:</b> (1)
- * \par
- * The function uses saturating arithmetic.
- * The Q31 value -1 (0x80000000) will be saturated to the maximum allowable positive value 0x7FFFFFFF.
- */
-
- void arm_abs_q31(
- q31_t * pSrc,
- q31_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
- q31_t in; /* Input value */
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- q31_t in1, in2, in3, in4;
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Calculate absolute of input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
- in1 = *pSrc++;
- in2 = *pSrc++;
- in3 = *pSrc++;
- in4 = *pSrc++;
-
- *pDst++ = (in1 > 0) ? in1 : (q31_t)__QSUB(0, in1); (2)
- *pDst++ = (in2 > 0) ? in2 : (q31_t)__QSUB(0, in2);
- *pDst++ = (in3 > 0) ? in3 : (q31_t)__QSUB(0, in3);
- *pDst++ = (in4 > 0) ? in4 : (q31_t)__QSUB(0, in4);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Calculate absolute value of the input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
- in = *pSrc++;
- *pDst++ = (in > 0) ? in : ((in == INT32_MIN) ? INT32_MAX : -in);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- }
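1. This function uses saturating arithmetic, as do many of the functions that follow. For background on saturating arithmetic, see section 4.3.6, "Assembly Language: Saturating Operations", in the Chinese edition of The Definitive Guide to the ARM Cortex-M3.
For Q31 data, saturation turns 0x80000000 into 0x7FFFFFFF. This value is a special case worth remembering: it is the most negative Q31 number, its negation cannot be represented, and so the result is clamped to the largest positive value.
2. The key point here is __QSUB, which performs a saturating subtraction. On Cortex-M4 it corresponds to the QSUB instruction of the DSP extension; Cortex-M3 does not have this instruction, so there the library falls back to an equivalent C implementation.
For example, __QSUB(0, in1) computes 0 - in1 with saturation and returns the result. __QSUB performs a 32-bit saturating subtraction; __QSUB16 and __QSUB8 perform saturating subtraction on packed 16-bit and 8-bit values respectively.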
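To make the special case concrete, here is a minimal portable model of what a 32-bit saturating subtraction computes and how the absolute value is built on top of it. The names qsub32_model and abs_q31_model are hypothetical; this is a sketch of the behavior, not the intrinsic itself.

```c
#include <stdint.h>

/* Portable model of a 32-bit saturating subtraction (the behavior of __QSUB).
 * Hypothetical name, for illustration only. */
int32_t qsub32_model(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a - (int64_t)b;   /* exact result in 64 bits */

    if (r > INT32_MAX) return INT32_MAX;   /* clamp instead of wrapping */
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}

/* Saturating absolute value of one Q31 sample, same shape as the library line
 * (in > 0) ? in : __QSUB(0, in). */
int32_t abs_q31_model(int32_t in)
{
    return (in > 0) ? in : qsub32_model(0, in);
}
```

For in = INT32_MIN (0x80000000) the exact result 0 - in is 2147483648, one more than INT32_MAX, so the model returns 0x7FFFFFFF, which matches the documented behavior.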
8.1.3 arm_abs_q15
This function computes the absolute value of 16-bit fixed-point (Q15) numbers. The annotated source code follows:
- /**
- * @brief Q15 vector absolute value.
- * @param[in] *pSrc points to the input buffer
- * @param[out] *pDst points to the output buffer
- * @param[in] blockSize number of samples in each vector
- * @return none.
- *
- * <b>Scaling and Overflow Behavior:</b>
- * \par
- * The function uses saturating arithmetic.
- * The Q15 value -1 (0x8000) will be saturated to the maximum allowable positive value 0x7FFF. (1)
- */
-
- void arm_abs_q15(
- q15_t * pSrc,
- q15_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
-
- #ifndef ARM_MATH_CM0_FAMILY
- __SIMD32_TYPE *simd; (2)
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
-
- q15_t in1; /* Input value1 */
- q15_t in2; /* Input value2 */
-
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- simd = __SIMD32_CONST(pDst); (3)
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Read two inputs */
- in1 = *pSrc++;
- in2 = *pSrc++;
-
-
- /* Store the Absolute result in the destination buffer by packing the two values, in a single cycle */
- #ifndef ARM_MATH_BIG_ENDIAN
- *simd++ =
- __PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), (4)
- ((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
-
- #else
-
-
- *simd++ =
- __PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
- ((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
-
- #endif /* #ifndef ARM_MATH_BIG_ENDIAN */
-
- in1 = *pSrc++;
- in2 = *pSrc++;
-
-
- #ifndef ARM_MATH_BIG_ENDIAN
-
- *simd++ =
- __PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)),
- ((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
-
- #else
-
-
- *simd++ =
- __PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
- ((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
-
- #endif /* #ifndef ARM_MATH_BIG_ENDIAN */
-
- /* Decrement the loop counter */
- blkCnt--;
- }
- pDst = (q15_t *)simd;
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Read the input */
- in1 = *pSrc++;
-
- /* Calculate absolute value of input and then store the result in the destination buffer. */
- *pDst++ = (in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
- q15_t in; /* Temporary input variable */
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Read the input */
- in = *pSrc++;
-
- /* Calculate absolute value of input and then store the result in the destination buffer. */
- *pDst++ = (in > 0) ? in : ((in == (q15_t) 0x8000) ? 0x7fff : -in);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
- }
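1. For Q15 data, saturation turns 0x8000 into 0x7FFF.
2. __SIMD32_TYPE is defined in arm_math.h as follows:
#define __SIMD32_TYPE int32_t __packed
SIMD is the single instruction, multiple data technique covered in the previous installment. Put simply, __SIMD32_TYPE is an int32_t used to hold two 16-bit values at once. The __packed qualifier tells the compiler that the object may not be naturally aligned, so a q15_t buffer that is only 2-byte aligned can still be accessed one 32-bit word at a time.
3. The macro __SIMD32_CONST is defined as follows:
#define __SIMD32_CONST(addr) ((__SIMD32_TYPE *)(addr))
4. __PKHBT is defined in core_cm4_simd.h as follows:
#define __PKHBT(ARG1,ARG2,ARG3) ( ((((uint32_t)(ARG1)) ) & 0x0000FFFFUL) | \
((((uint32_t)(ARG2)) << (ARG3)) & 0xFFFF0000UL) )
This macro packs two 16-bit values into one 32-bit word: ARG1 fills the low halfword, and ARG2, shifted left by ARG3, fills the high halfword. One point deserves special mention: PKHBT is also a SIMD instruction supported by the CM4 core, and MDK recognizes the C expression in the macro above and emits the corresponding PKHBT instruction.
__QSUB16 performs saturating subtraction on 16-bit data.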
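The packing step is easy to check on a PC. The sketch below writes the same expression as the __PKHBT macro as an ordinary function; pkhbt_model and pack_two_q15 are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Same expression as the __PKHBT macro, written as a function.
 * Hypothetical name, for illustration only. */
uint32_t pkhbt_model(uint32_t low, uint32_t high, unsigned shift)
{
    return (low & 0x0000FFFFu) | ((high << shift) & 0xFFFF0000u);
}

/* Pack two q15 samples into one word: the first sample goes into the low
 * halfword, the second into the high halfword. This is the order used by
 * the library when ARM_MATH_BIG_ENDIAN is not defined. */
uint32_t pack_two_q15(int16_t first, int16_t second)
{
    return pkhbt_model((uint32_t)(uint16_t)first,
                       (uint32_t)(uint16_t)second, 16u);
}
```

Packing 0x1234 and 0x5678 gives 0x56781234. In the big-endian branch of arm_abs_q15 the two arguments are simply swapped so that the samples still land in memory in their original order.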
8.1.4 arm_abs_q7
This function computes the absolute value of 8-bit fixed-point (Q7) numbers. The annotated source code follows:
- /**
- * @brief Q7 vector absolute value.
- * @param[in] *pSrc points to the input buffer
- * @param[out] *pDst points to the output buffer
- * @param[in] blockSize number of samples in each vector
- * @return none.
- *
- * \par Conditions for optimum performance
- * Input and output buffers should be aligned by 32-bit
- *
- *
- * <b>Scaling and Overflow Behavior:</b> (1)
- * \par
- * The function uses saturating arithmetic.
- * The Q7 value -1 (0x80) will be saturated to the maximum allowable positive value 0x7F.
- */
-
- void arm_abs_q7(
- q7_t * pSrc,
- q7_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
- q7_t in; /* Input value1 */
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- q31_t in1, in2, in3, in4; /* temporary input variables */
- q31_t out1, out2, out3, out4; /* temporary output variables */
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Read inputs */
- in1 = (q31_t) * pSrc;
- in2 = (q31_t) * (pSrc + 1);
- in3 = (q31_t) * (pSrc + 2);
-
- /* find absolute value */
- out1 = (in1 > 0) ? in1 : (q31_t)__QSUB8(0, in1); (2)
-
- /* read input */
- in4 = (q31_t) * (pSrc + 3);
-
- /* find absolute value */
- out2 = (in2 > 0) ? in2 : (q31_t)__QSUB8(0, in2);
-
- /* store result to destination */
- *pDst = (q7_t) out1;
-
- /* find absolute value */
- out3 = (in3 > 0) ? in3 : (q31_t)__QSUB8(0, in3);
-
- /* find absolute value */
- out4 = (in4 > 0) ? in4 : (q31_t)__QSUB8(0, in4);
-
- /* store result to destination */
- *(pDst + 1) = (q7_t) out2;
-
- /* store result to destination */
- *(pDst + 2) = (q7_t) out3;
-
- /* store result to destination */
- *(pDst + 3) = (q7_t) out4;
-
- /* update pointers to process next samples */
- pSrc += 4u;
- pDst += 4u;
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
- #else
-
- /* Run the below code for Cortex-M0 */
- blkCnt = blockSize;
-
- #endif // #define ARM_MATH_CM0_FAMILY
-
- while(blkCnt > 0u)
- {
- /* C = |A| */
- /* Read the input */
- in = *pSrc++;
-
- /* Store the Absolute result in the destination buffer */
- *pDst++ = (in > 0) ? in : ((in == (q7_t) 0x80) ? 0x7f : -in);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
- }
1. Because of saturation, the absolute value of 0x80 becomes 0x7F.
2. __QSUB8 performs saturating subtraction on 8-bit data.
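8.1.5 Worked Example
Objective:
1. Compute absolute values for the four data types.
Procedure:
1. Press key K1; the results are printed over the serial port.
Result:
View the printed output in the PC terminal program SecureCRT (included on the V5 CD). The output looks as follows:
Program design:
Steps (1) to (4) compute the absolute value in each of the four formats. Only a single value is processed here; as an exercise, try computing the absolute values of a whole array.
8.2 Addition (Vector Addition)
These functions compute the element-wise sum of two vectors:
pDst[n] = pSrcA[n] + pSrcB[n], 0 <= n < blockSize.
8.2.1 arm_add_f32
This function computes the sum of 32-bit floating-point numbers. The annotated source code follows: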
- /**
- * @brief Floating-point vector addition.
- * @param[in] *pSrcA points to the first input vector
- * @param[in] *pSrcB points to the second input vector
- * @param[out] *pDst points to the output vector
- * @param[in] blockSize number of samples in each vector
- * @return none.
- */
-
- void arm_add_f32(
- float32_t * pSrcA,
- float32_t * pSrcB,
- float32_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- float32_t inA1, inA2, inA3, inA4; /* temporary input variables */
- float32_t inB1, inB2, inB3, inB4; /* temporary input variables */
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
-
- /* read four inputs from sourceA and four inputs from sourceB */
- inA1 = *pSrcA;
- inB1 = *pSrcB;
- inA2 = *(pSrcA + 1);
- inB2 = *(pSrcB + 1);
- inA3 = *(pSrcA + 2);
- inB3 = *(pSrcB + 2);
- inA4 = *(pSrcA + 3);
- inB4 = *(pSrcB + 3);
-
- /* C = A + B */ (1)
- /* add and store result to destination */
- *pDst = inA1 + inB1;
- *(pDst + 1) = inA2 + inB2;
- *(pDst + 2) = inA3 + inB3;
- *(pDst + 3) = inA4 + inB4;
-
- /* update pointers to process next samples */
- pSrcA += 4u;
- pSrcB += 4u;
- pDst += 4u;
-
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- *pDst++ = (*pSrcA++) + (*pSrcB++);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
- }
1. This code is straightforward: it simply adds the two inputs element by element.
8.2.2 arm_add_q31
This function computes the sum of 32-bit fixed-point (Q31) numbers. The annotated source code follows:
- /**
- * @brief Q31 vector addition.
- * @param[in] *pSrcA points to the first input vector
- * @param[in] *pSrcB points to the second input vector
- * @param[out] *pDst points to the output vector
- * @param[in] blockSize number of samples in each vector
- * @return none.
- *
- * <b>Scaling and Overflow Behavior:</b> (1)
- * \par
- * The function uses saturating arithmetic.
- * Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] will be saturated.
- */
-
- void arm_add_q31(
- q31_t * pSrcA,
- q31_t * pSrcB,
- q31_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- q31_t inA1, inA2, inA3, inA4;
- q31_t inB1, inB2, inB3, inB4;
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- inA1 = *pSrcA++;
- inA2 = *pSrcA++;
- inB1 = *pSrcB++;
- inB2 = *pSrcB++;
-
- inA3 = *pSrcA++;
- inA4 = *pSrcA++;
- inB3 = *pSrcB++;
- inB4 = *pSrcB++;
-
- *pDst++ = __QADD(inA1, inB1); (2)
- *pDst++ = __QADD(inA2, inB2);
- *pDst++ = __QADD(inA3, inB3);
- *pDst++ = __QADD(inA4, inB4);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- *pDst++ = __QADD(*pSrcA++, *pSrcB++);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
-
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- *pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrcA++ + *pSrcB++); (3)
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
- }
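1. This function also uses saturating arithmetic. The output range is [0x80000000, 0x7FFFFFFF]; results outside this range are saturated.
2. __QADD performs a 32-bit saturating addition.
3. The function clip_q63_to_q31 is defined in arm_math.h:
static __INLINE q31_t clip_q63_to_q31(
q63_t x)
{
return ((q31_t) (x >> 32) != ((q31_t) x >> 31)) ?
((0x7FFFFFFF ^ ((q31_t) (x >> 63)))) : (q31_t) x;
}
This function saturates a 64-bit value to the Q31 range. If the upper 32 bits are not simply the sign extension of the lower 32 bits, the value does not fit, and the function returns 0x7FFFFFFF for a positive x or 0x80000000 for a negative x.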
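The bit tricks in clip_q63_to_q31 can be hard to read at first. As a cross-check, here is a comparison-based version that should give the same results, together with the widen, add, clip sequence used in the Cortex-M0 branch. Both function names are hypothetical and the code is a sketch, not the library implementation.

```c
#include <stdint.h>

/* Comparison-based equivalent of clip_q63_to_q31.
 * Hypothetical name, for illustration only. */
int32_t clip_q63_to_q31_model(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;   /* 0x7FFFFFFF */
    if (x < INT32_MIN) return INT32_MIN;   /* 0x80000000 */
    return (int32_t)x;
}

/* Saturating Q31 addition as in the Cortex-M0 branch:
 * widen to 64 bits, add exactly, then clip back to 32 bits. */
int32_t add_q31_model(int32_t a, int32_t b)
{
    return clip_q63_to_q31_model((int64_t)a + (int64_t)b);
}
```

In Q31 terms 0x40000000 is 0.5, so adding it to itself would give 1.0, which Q31 cannot represent; the result is clamped to 0x7FFFFFFF, the largest value just below 1.0.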
8.2.3 arm_add_q15
This function computes the sum of 16-bit fixed-point (Q15) numbers. The annotated source code follows:
- /**
- * @brief Q15 vector addition.
- * @param[in] *pSrcA points to the first input vector
- * @param[in] *pSrcB points to the second input vector
- * @param[out] *pDst points to the output vector
- * @param[in] blockSize number of samples in each vector
- * @return none.
- *
- * <b>Scaling and Overflow Behavior:</b> (1)
- * \par
- * The function uses saturating arithmetic.
- * Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
- */
-
- void arm_add_q15(
- q15_t * pSrcA,
- q15_t * pSrcB,
- q15_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- q31_t inA1, inA2, inB1, inB2;
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = A + B */ (2)
- /* Add and then store the results in the destination buffer. */
- inA1 = *__SIMD32(pSrcA)++;
- inA2 = *__SIMD32(pSrcA)++;
- inB1 = *__SIMD32(pSrcB)++;
- inB2 = *__SIMD32(pSrcB)++;
-
- *__SIMD32(pDst)++ = __QADD16(inA1, inB1);
- *__SIMD32(pDst)++ = __QADD16(inA2, inB2);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- *pDst++ = (q15_t) __QADD16(*pSrcA++, *pSrcB++);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
-
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- *pDst++ = (q15_t) __SSAT(((q31_t) * pSrcA++ + *pSrcB++), 16); (3)
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
-
- }
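1. This function also uses saturating arithmetic. The output range is [0x8000, 0x7FFF]; results outside this range are saturated.
2. The statement inA1 = *__SIMD32(pSrcA)++ reads two adjacent 16-bit values into the 32-bit variable inA1 with a single 32-bit load. __QADD16 then adds both 16-bit lanes at once, with saturation.
3. __SSAT saturates a signed value to a given bit width; here it saturates the result to 16 bits. It is a saturation operation rather than a SIMD one.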
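The two operations in these notes can be modeled in portable C. The sketch below assumes nothing beyond standard C: ssat_model, lane16 and qadd16_model are hypothetical names, and the code only mirrors the documented behavior (saturate to a bit width; add two packed 16-bit lanes independently).

```c
#include <stdint.h>

/* Saturate v to a signed value of the given width (1 to 32 bits),
 * the behavior of __SSAT. Hypothetical name. */
int32_t ssat_model(int32_t v, unsigned bits)
{
    int32_t max = (int32_t)((UINT32_C(1) << (bits - 1u)) - 1u);
    int32_t min = -max - 1;

    if (v > max) return max;
    if (v < min) return min;
    return v;
}

/* Sign-extend the 16-bit lane found at bit offset shift (0 or 16). */
static int32_t lane16(uint32_t word, unsigned shift)
{
    int32_t v = (int32_t)((word >> shift) & 0xFFFFu);
    return (v > 0x7FFF) ? v - 0x10000 : v;
}

/* Two independent saturating 16-bit additions on packed words,
 * the behavior of __QADD16. Hypothetical name. */
uint32_t qadd16_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (uint32_t)ssat_model(lane16(a, 0u) + lane16(b, 0u), 16u) & 0xFFFFu;
    uint32_t hi = (uint32_t)ssat_model(lane16(a, 16u) + lane16(b, 16u), 16u) & 0xFFFFu;

    return (hi << 16) | lo;
}
```

Note that the lanes do not influence each other: in qadd16_model(0x7FFF0001, 0x00010002) the high lane saturates to 0x7FFF while the low lane is simply 1 + 2 = 3.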
8.2.4 arm_add_q7
This function computes the sum of 8-bit fixed-point (Q7) numbers. The annotated source code follows:
- /**
- * @brief Q7 vector addition.
- * @param[in] *pSrcA points to the first input vector
- * @param[in] *pSrcB points to the second input vector
- * @param[out] *pDst points to the output vector
- * @param[in] blockSize number of samples in each vector
- * @return none.
- *
- * <b>Scaling and Overflow Behavior:</b> (1)
- * \par
- * The function uses saturating arithmetic.
- * Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
- */
-
- void arm_add_q7(
- q7_t * pSrcA,
- q7_t * pSrcB,
- q7_t * pDst,
- uint32_t blockSize)
- {
- uint32_t blkCnt; /* loop counter */
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
-
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */ (2)
- *__SIMD32(pDst)++ = __QADD8(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- *pDst++ = (q7_t) __SSAT(*pSrcA++ + *pSrcB++, 8);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
-
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- while(blkCnt > 0u)
- {
- /* C = A + B */
- /* Add and then store the results in the destination buffer. */
- *pDst++ = (q7_t) __SSAT((q15_t) * pSrcA++ + *pSrcB++, 8);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
-
- }
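1. This function also uses saturating arithmetic. The output range is [0x80, 0x7F]; results outside this range are saturated.
2. Here a single SIMD instruction (__QADD8) adds four pairs of 8-bit values at once.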
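As a way to check expected results on a PC, the whole Q7 addition can be modeled without any intrinsics. This is a sketch under the assumption that only the numerical result matters: add_q7_model and sat8 are hypothetical names, and the inner four-lane loop stands in for what a single __QADD8 does on the Cortex-M4.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a wider intermediate result to the Q7 range [-128, 127]. */
static int8_t sat8(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Hypothetical reference model of a saturating Q7 vector addition. */
void add_q7_model(const int8_t *a, const int8_t *b, int8_t *dst, size_t n)
{
    size_t groups = n >> 2;              /* complete groups of four samples */

    while (groups > 0u) {
        /* On Cortex-M4 these four lane additions are one __QADD8. */
        for (unsigned lane = 0; lane < 4u; lane++) {
            dst[lane] = sat8((int32_t)a[lane] + (int32_t)b[lane]);
        }
        a += 4;
        b += 4;
        dst += 4;
        groups--;
    }

    size_t rest = n % 4u;                /* leftover samples, one at a time */
    while (rest > 0u) {
        *dst++ = sat8((int32_t)*a++ + (int32_t)*b++);
        rest--;
    }
}
```

For example, 100 + 100 saturates to 127 and (-100) + (-100) saturates to -128, while 1 + 2 is returned unchanged as 3.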
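8.2.5 Worked Example
Objective:
1. Compute sums for the four data types.
Procedure:
1. Press key K2; the results are printed over the serial port.
Result:
View the printed output in the PC terminal program SecureCRT (included on the V5 CD). The output looks as follows:
Program design:
8.3 Dot Product (Vector Dot Product)
These functions compute the dot product of two vectors:
sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]
8.3.1 arm_dot_prod_f32
This function computes the dot product of 32-bit floating-point vectors. The annotated source code follows: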
- /**
- * @defgroup dot_prod Vector Dot Product
- *
- * Computes the dot product of two vectors.
- * The vectors are multiplied element-by-element and then summed.
- *
- * <pre>
- * sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]
- * </pre>
- *
- * There are separate functions for floating-point, Q7, Q15, and Q31 data types.
- */
-
- /**
- * @addtogroup dot_prod
- * @{
- */
-
- /**
- * @brief Dot product of floating-point vectors.
- * @param[in] *pSrcA points to the first input vector
- * @param[in] *pSrcB points to the second input vector
- * @param[in] blockSize number of samples in each vector
- * @param[out] *result output result returned here
- * @return none.
- */
-
-
- void arm_dot_prod_f32(
- float32_t * pSrcA,
- float32_t * pSrcB,
- uint32_t blockSize,
- float32_t * result)
- {
- float32_t sum = 0.0f; /* Temporary result storage */ (1)
- uint32_t blkCnt; /* loop counter */
-
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
- /* Calculate dot product and then store the result in a temporary buffer */
- sum += (*pSrcA++) * (*pSrcB++); (2)
- sum += (*pSrcA++) * (*pSrcB++);
- sum += (*pSrcA++) * (*pSrcB++);
- sum += (*pSrcA++) * (*pSrcB++);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
-
- while(blkCnt > 0u)
- {
- /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
- /* Calculate dot product and then store the result in a temporary buffer. */
- sum += (*pSrcA++) * (*pSrcB++);
-
- /* Decrement the loop counter */
- blkCnt--;
- }
- /* Store the result back in the destination buffer */
- *result = sum;
- }
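1. The FPU on the CM4 is single precision only, so floating-point constants used to initialize float32_t values should carry the f suffix. Without the suffix the constant has type double.
2. Statements of the form sum += (*pSrcA++) * (*pSrcB++) can be compiled to the floating-point MAC (multiply-accumulate) instruction, which shortens execution time.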
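For checking results on a PC, the dot product can be written as a plain loop. This is a minimal reference sketch, not the library routine; dot_prod_f32_ref is a hypothetical name, and unlike arm_dot_prod_f32 it returns the sum instead of writing it through a result pointer.

```c
#include <stddef.h>

/* Hypothetical reference: single-precision dot product of two vectors. */
float dot_prod_f32_ref(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;                /* f suffix: single-precision constant */

    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];          /* one multiply-accumulate per sample */
    }
    return sum;
}
```

Because floating-point addition is not associative, a reference like this can differ from an optimized routine in the last bits when the summation order differs; with the library code shown above the order is the same, sample by sample.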
8.3.2 arm_dot_prod_q31
This function computes the dot product of 32-bit fixed-point (Q31) vectors. The annotated source code follows:
- /**
- * @brief Dot product of Q31 vectors.
- * @param[in] *pSrcA points to the first input vector
- * @param[in] *pSrcB points to the second input vector
- * @param[in] blockSize number of samples in each vector
- * @param[out] *result output result returned here
- * @return none.
- *
- * <b>Scaling and Overflow Behavior:</b> (1)
- * \par
- * The intermediate multiplications are in 1.31 x 1.31 = 2.62 format and these
- * are truncated to 2.48 format by discarding the lower 14 bits.
- * The 2.48 result is then added without saturation to a 64-bit accumulator in 16.48 format.
- * There are 15 guard bits in the accumulator and there is no risk of overflow as long as
- * the length of the vectors is less than 2^16 elements.
- * The return result is in 16.48 format.
- */
-
- void arm_dot_prod_q31(
- q31_t * pSrcA,
- q31_t * pSrcB,
- uint32_t blockSize,
- q63_t * result)
- {
- q63_t sum = 0; /* Temporary result storage */
- uint32_t blkCnt; /* loop counter */
-
-
- #ifndef ARM_MATH_CM0_FAMILY
-
- /* Run the below code for Cortex-M4 and Cortex-M3 */
- q31_t inA1, inA2, inA3, inA4;
- q31_t inB1, inB2, inB3, inB4;
-
- /*loop Unrolling */
- blkCnt = blockSize >> 2u;
-
- /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
- ** a second loop below computes the remaining 1 to 3 samples. */
- while(blkCnt > 0u)
- {
- /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
- /* Calculate dot product and then store the result in a temporary buffer. */
- inA1 = *pSrcA++;
- inA2 = *pSrcA++;
- inA3 = *pSrcA++;
- inA4 = *pSrcA++;
- inB1 = *pSrcB++;
- inB2 = *pSrcB++;
- inB3 = *pSrcB++;
- inB4 = *pSrcB++;
-
- sum += ((q63_t) inA1 * inB1) >> 14u; (2)
- sum += ((q63_t) inA2 * inB2) >> 14u;
- sum += ((q63_t) inA3 * inB3) >> 14u;
- sum += ((q63_t) inA4 * inB4) >> 14u;
-
- /* Decrement the loop counter */
- blkCnt--;
- }
-
- /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
- ** No loop unrolling is used. */
- blkCnt = blockSize % 0x4u;
-
- #else
-
- /* Run the below code for Cortex-M0 */
-
- /* Initialize blkCnt with number of samples */
- blkCnt = blockSize;
-
- #endif /* #ifndef ARM_MATH_CM0_FAMILY */
-
-
- whil