HKG15-408: ARM v8-A NEON optimization

Name: HKG15-408: ARM v8-A NEON optimization
Uploaded: 2015-02-18T16:02:58.000Z
Duration: 1 h 24 min 3 s
Description: ARM v8-A NEON optimization (Zhongwei/Phil Wang). With FFT optimization as an example, the following topics are discussed: a) Performance boost using ARM v8-A NEON; b) NEON-optimization workflow for Ne10; c) Some tips with examples from Ne10 FFT and Android libraries; d) Performance comparison between assembly and intrinsics.

Session Start

The session begins with an introduction to the topic, ARMv8-A NEON optimization, with Ne10 FFT optimization as an example. Quick questions are asked about the use of NEON and intrinsics, followed by a presentation of the agenda for this session.

Benchmark Data for ARMv8-A with and without NEON Optimization

Benchmark data is presented to compare ARMv8-A performance with and without NEON optimization.

The value of NEON optimization is discussed based on the benchmark results.

The process of locating hotspots and optimizing them using NEON is explained.

Useful tips for NEON optimization are shared, along with examples from the Ne10 FFT and Android libraries.

The performance of assembly and intrinsics is compared to help in decision-making.

Performance Ratio of FFT

A performance ratio graph is shown, indicating the performance boost achieved through NEON optimization.

The graph demonstrates that NEON optimization can result in a performance improvement ranging from 70% to over 100%.

Comparison with C Implementation

Questions are raised regarding vectorization and whether any C implementation was compared against.

It is clarified that there is a C implementation, but only the FFT routines themselves were compared; allocation time was not included.
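Timing only the transform, with setup kept outside the measured region, can be sketched as follows. This is a minimal sketch, not the benchmark used in the session: `time_kernel`, `kernel_fn`, and `demo_kernel` are hypothetical names, and `demo_kernel` merely stands in for an FFT call on buffers that were allocated beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one call runs the routine under test once. */
typedef void (*kernel_fn)(void *ctx);

/* Stand-in for an FFT call; it only counts how often it was invoked. */
void demo_kernel(void *ctx)
{
    int *calls = (int *)ctx;
    ++*calls;
}

/* Average CPU seconds per call. Allocation and setup are the caller's job
 * and happen before this function, so they are not part of the result. */
double time_kernel(kernel_fn fn, void *ctx, int iterations)
{
    assert(fn != NULL && iterations > 0);
    fn(ctx);                      /* warm-up call, not timed */
    clock_t start = clock();
    for (int i = 0; i < iterations; ++i)
        fn(ctx);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / iterations;
}
```

A caller would allocate buffers and configuration first, then pass a small wrapper around the transform as `fn`, so that both the C and the NEON versions are measured over the same work.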

Optimization Steps for Compact FFT

The steps taken to optimize compact FFT are explained.

The innermost loop is optimized first; hotspots are then identified where the reorder routines are not NEON-friendly.
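As an illustration of what vectorizing an innermost loop can look like, here is a minimal sketch of a butterfly-style add/subtract over two float arrays. It is not the Ne10 code: `butterfly_f32` is a hypothetical name, the arrays are assumed to have a length that is a multiple of four, and a scalar fallback is included so the sketch also builds where NEON is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* In place: a[i] <- a[i] + b[i] and b[i] <- a[i] - b[i] (using the old a[i]).
 * Assumption for this sketch: n is a multiple of 4. */
void butterfly_f32(float *a, float *b, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);   /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);   /* load 4 floats from b */
        vst1q_f32(a + i, vaddq_f32(va, vb)); /* 4 sums at once */
        vst1q_f32(b + i, vsubq_f32(va, vb)); /* 4 differences at once */
    }
#else
    for (size_t i = 0; i < n; ++i) {         /* scalar fallback */
        float sum = a[i] + b[i];
        float diff = a[i] - b[i];
        a[i] = sum;
        b[i] = diff;
    }
#endif
}
```

Each NEON iteration handles four elements with one add and one subtract, which is why contiguous, regularly ordered data matters so much for this kind of loop.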

Adjusting Algorithm for Petri Order

The algorithm is adjusted for Petri order to make it more compatible with NEON optimization.

Homepage of Ne10

The homepage link for the Ne10 library is provided.

New Optimization Step - Realized Way

A new step in the optimization process is introduced, focusing on the realized way of performing FFT.

Performance Improvement through Parallel Operations

The parallel nature of arrays in the realized way is highlighted, leading to optimized operations.

A 20% performance improvement is achieved through this optimization step.

Challenges with Petri Order

The Petri order in V3 order is found to be incompatible with NEON optimization.

The algorithm is adjusted to separate the elements and store them in well-designed positions for better compatibility.

Clearing Doubts

Doubts regarding the optimization steps are addressed and clarified.

Overall, this section focuses on the benchmark data, performance comparison, and optimization steps for ARMv8-A NEON optimization. It also highlights the challenges faced with Petri order and how adjustments were made to improve compatibility.

FFT Function Optimization

This section discusses the optimization of the FFT function and its impact on performance.

Optimizing the FFT Function

The FFT function has been optimized by an additional 20%.

The hotspots in the function were identified and adjusted for better performance.

Useful tips for NEON optimization in libraries were included in the process.

Experience can help in choosing between different implementations.

General-Purpose Algorithm Performance Comparison

This section compares the performance of a general-purpose algorithm with NEON optimization.

Performance Comparison

The horizontal axis represents the FFT size; the plotted value is the performance ratio of the NEON version to the C version.

The NEON-optimized version is 70% faster than the C version, as shown by the figure.

X64 Wrapper Performance

This section discusses the performance of an x64 wrapper.

The figure shows that there is a significant improvement in performance compared to previous versions.

Optimization Question

This section addresses a question regarding optimization.

There is a question about the improvement of more than 100%.

It is mentioned that the library includes a C implementation to compare against, and that allocation time is handled separately from the comparison.

FFT Routine Optimization

This section focuses on optimizing FFT routines.

Optimizing FFT Routines

Performing an FFT involves more than just one or two function calls.

Twiddle factors need to be allocated beforehand, and only the FFT routine itself needs to be timed.

The inclusion of twiddles and proper ordering can improve performance.

FFT Optimization Steps

This section discusses the steps involved in optimizing FFT routines.

The FFT contains nested loops and a reorder routine.

Two major optimizations are performed to identify hotspots and adjust the algorithm accordingly.

By optimizing these operations, a 20% improvement in performance is achieved.

Reordering Operations for Performance Improvement

This section explains how reordering operations can improve performance.

A figure illustrates the difference between normal and optimized ways of performing operations.

By optimizing the organization of operations, time is saved and parallelization is achieved.

Algorithm Optimization for NEON

This section focuses on algorithm optimization for NEON.

The first step involves adjusting the algorithm to optimize specific subroutines.

By optimizing these subroutines, significant performance improvements can be achieved.

Element Adjustment in Algorithm

This section discusses element adjustment in the algorithm for better performance.

Adjustments are made to elements by loading them together and rearranging their positions.

By properly designing the position of elements, better performance can be obtained.
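One way to picture "loading elements together and rearranging their positions" is NEON's de-interleaving load. The sketch below is an illustration under assumptions, not the rearrangement used in the talk: it multiplies interleaved complex numbers (re, im, re, im, ...) by a second interleaved array, using `vld2q_f32` to split real and imaginary parts into separate registers. The function name `cmul_interleaved_f32` is hypothetical, the element count is assumed to be a multiple of four, and a scalar fallback keeps the sketch buildable without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out[k] = x[k] * w[k] for n_complex interleaved complex floats.
 * Assumption for this sketch: n_complex is a multiple of 4. */
void cmul_interleaved_f32(float *out, const float *x, const float *w,
                          size_t n_complex)
{
    assert(n_complex % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n_complex; i += 4) {
        /* val[0] receives the 4 real parts, val[1] the 4 imaginary parts. */
        float32x4x2_t vx = vld2q_f32(x + 2 * i);
        float32x4x2_t vw = vld2q_f32(w + 2 * i);
        float32x4x2_t vr;
        /* re = xr*wr - xi*wi */
        vr.val[0] = vmlsq_f32(vmulq_f32(vx.val[0], vw.val[0]),
                              vx.val[1], vw.val[1]);
        /* im = xr*wi + xi*wr */
        vr.val[1] = vmlaq_f32(vmulq_f32(vx.val[0], vw.val[1]),
                              vx.val[1], vw.val[0]);
        vst2q_f32(out + 2 * i, vr);  /* re-interleave while storing */
    }
#else
    for (size_t i = 0; i < n_complex; ++i) {  /* scalar fallback */
        float xr = x[2 * i], xi = x[2 * i + 1];
        float wr = w[2 * i], wi = w[2 * i + 1];
        out[2 * i] = xr * wr - xi * wi;
        out[2 * i + 1] = xr * wi + xi * wr;
    }
#endif
}
```

Because the load itself places the real and imaginary parts in separate registers, the arithmetic needs no extra shuffle instructions, which is the kind of benefit a well-chosen element layout is meant to provide.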

Further Algorithm Optimization

This section explores additional optimization techniques for algorithms.

Additional optimizations involve adjusting from 0 to 4 and then from 2 to 0.

Well-designed positioning of elements can lead to improved performance.

Useful Tips for NEON Optimization Libraries

This section provides useful tips for optimizing libraries with NEON.

Three useful tips are shared for NEON intrinsics in Android libraries.

Register usage during optimization and floating-point arithmetic are discussed.

Performance Comparison with C's Libm

This section compares performance with C's Libm.

A figure shows the performance comparison using 20 small operations.

It is mentioned that scalar code processes one element at a time, while NEON performs two or four operations simultaneously.

The larger number of 128-bit NEON registers on the ARMv8 AArch64 architecture can improve performance.

Portability and Compiler Optimization

This section discusses portability and compiler optimization.

The use of NEON intrinsics ensures portability between various Arm compilers.

The resulting optimization may differ depending on the compiler used.

Intrinsic support allows the use of NEON instructions for improved performance.

Compiler Optimizations and Intrinsic Functions

This section explores compiler optimizations and intrinsic functions.

Compilers such as GCC can automatically optimize the variables used in the code.

The use of intrinsic functions can improve performance.

ARMv8.2 supports these optimizations, but they may not be portable to other architectures.

CPU Compliance and Vector Instructions

This section discusses CPU compliance and vector instructions.

Strict compliance with IEEE standards is not guaranteed by all CPUs by default.

Some compilers have optimizations that can generate vector instructions automatically.

The use of NEON intrinsics and vector instructions can improve performance.
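A plain C loop of the kind a compiler may vectorize on its own is sketched below. The function name `saxpy` and the flags mentioned are illustrative and not taken from the talk. With GCC, `-O3` turns on the loop vectorizer; for 32-bit ARM targets the GCC documentation notes that floating-point loops are only auto-vectorized with NEON when `-funsafe-math-optimizations` is also given, because ARMv7 NEON arithmetic is not fully IEEE 754 compliant. Whether a given loop is actually vectorized depends on the compiler version and target.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] <- alpha * x[i] + y[i].
 * `restrict` promises that x and y do not overlap, which removes one
 * common obstacle to automatic vectorization. */
void saxpy(float alpha, const float *restrict x, float *restrict y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

Comparing the generated assembly for such a loop against a hand-written intrinsics version is one practical way to judge whether manual NEON work is worth the effort for a particular hotspot.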

NEON Intrinsic Support and Portability

This section highlights the support for NEON intrinsics and portability.

NEON intrinsics provide support for various operations, including addition.

Code that uses NEON intrinsics is portable between different Arm compilers.
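For instance, element-wise addition with the `vaddq_f32` intrinsic can be sketched as follows. The function name `add_f32` is hypothetical; the scalar loop handles leftover elements and also serves as the whole implementation when the code is built without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Main loop: four additions per iteration. */
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#endif
    /* Tail (or full loop when NEON is not available). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The same source builds for both 32-bit ARM with NEON and AArch64, with register allocation and instruction scheduling left to the compiler.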

Conclusion

This section concludes the discussion on optimization techniques.

The importance of understanding compiler optimizations and utilizing intrinsic functions is emphasized.

Proper algorithm design, reordering operations, and utilizing neon intrinsics can lead to significant performance improvements.

New Section

The upper half of the q0 register is discussed.