Why don't we use artificial intelligence everywhere?

Thomas Faingnaert

It is hard to imagine today's world without artificial intelligence (AI). AI is already applied in many domains: think of medicine, image recognition, Netflix's recommendation system, and the Google search engine. Yet we are still far from using AI everywhere. Why is that, and how can we change it?

The landscape of artificial intelligence

First of all, it is important to understand that artificial intelligence comes in different forms. Artificial general intelligence can learn any task that a human can. This form of AI is a popular subject in science fiction, but it is not yet ready for practical use. Today's AI systems each target one specific application, and are therefore called narrow artificial intelligence. As you can imagine, Netflix's recommendation system and image recognition require different approaches. Each new application of artificial intelligence therefore requires research into, and experimentation with, new techniques.

Often, machine learning (ML) is used, a subfield of artificial intelligence. ML studies techniques that let computers solve problems without being explicitly programmed to do so. It lets computers learn automatically from data by recognising patterns in that data and then generalising them. An example: give your computer a series of photos of birds and fish, and indicate for each photo whether it shows a bird or a fish. The ML algorithm then learns what the characteristics of birds and fish are, and learns to recognise them in photos it has never seen before. This learning step requires a lot of computation and can therefore take a long time. Only at the end of it do researchers know whether the algorithm makes good predictions. To research new ML techniques, it is therefore important that this learning step runs as quickly as possible.
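
To make the idea of learning from labelled examples concrete, here is a minimal sketch in Python. It is not the kind of algorithm used for real photos: it assumes each animal has already been reduced to two made-up numeric features (wing span and fraction of time spent in water), and it uses a simple nearest-centroid rule. The function names `train` and `predict` and all the data are hypothetical.

```python
from math import dist


def train(examples):
    """Learn one centroid (average feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid lies closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical features: (wing span in cm, fraction of time spent in water).
training_data = [
    ((25.0, 0.0), "bird"),
    ((40.0, 0.1), "bird"),
    ((0.0, 1.0), "fish"),
    ((0.0, 0.9), "fish"),
]

model = train(training_data)          # the "learning step"
print(predict(model, (30.0, 0.05)))   # an animal the model has never seen
print(predict(model, (1.0, 0.95)))
```

Here the learning step is trivially cheap. With real photos and realistic models, the same step involves vast numbers of arithmetic operations, which is why its speed matters so much to researchers.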

The problem with current solutions

Researchers in ML often use standard libraries. These libraries contain collections of code that can be reused. To meet the high demand for speed, they contain manually optimised implementations of the most commonly used ML techniques. The problem is that these techniques do not perform equally well for all applications. Researchers therefore want to experiment with new techniques that these libraries do not support. They then have no choice but to write their own implementation from scratch. Not only does this take up a lot of valuable time, it is also particularly difficult to reach the same speed as the standard libraries.

Towards an alternative

This lack of flexibility in existing libraries considerably hampers scientific progress in ML. Thomas Faingnaert, then an engineering student (burgerlijk ingenieur) at Ghent University (UGent), tackled the problem. He developed a framework that makes it possible to efficiently perform a wide range of computations that are common in ML.

The framework focuses on variants of matrix multiplication, the computation at the core of ML. Two requirements were important in its design. On the one hand, the framework had to be flexible enough for researchers to adapt it as they see fit. On the other hand, the computations had to run fast enough for researchers to experiment quickly with new ideas.

To obtain the necessary flexibility, the framework splits matrix multiplication into a set of independent puzzle pieces. Each puzzle piece corresponds to one specific aspect of the matrix multiplication, such as the way the matrices are stored in computer memory. A researcher who wants to perform a variant of matrix multiplication can either reuse one of the predefined components or provide their own implementation. This is already a big step forward compared with starting over from scratch.
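
The puzzle-piece idea can be illustrated with a small Python sketch. This is not the framework's actual interface: the names `RowMajor`, `ColMajor`, `matmul` and `relu` are hypothetical, and the code runs on the CPU with no optimisation at all. It only shows how a memory layout and a per-element operation can be swapped independently of the multiplication loop itself.

```python
class RowMajor:
    """Element (i, j) of a rows x cols matrix is stored at index i * cols + j."""

    @staticmethod
    def index(i, j, rows, cols):
        return i * cols + j


class ColMajor:
    """Element (i, j) of a rows x cols matrix is stored at index j * rows + i."""

    @staticmethod
    def index(i, j, rows, cols):
        return j * rows + i


def relu(x):
    """Example of a custom per-element operation applied to each result."""
    return max(x, 0.0)


def matmul(a, b, m, k, n, layout_a=RowMajor, layout_b=RowMajor,
           elementwise=lambda x: x):
    """Compute elementwise(A @ B) for flat lists a (m x k) and b (k x n).

    The result is returned as a flat, row-major list of length m * n.
    """
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            total = 0.0
            for p in range(k):
                total += (a[layout_a.index(i, p, m, k)]
                          * b[layout_b.index(p, j, k, n)])
            c[i * n + j] = elementwise(total)
    return c


a_row = [1.0, 2.0, 3.0, 4.0]    # [[1, 2], [3, 4]] stored row by row
a_col = [1.0, 3.0, 2.0, 4.0]    # the same matrix stored column by column
b_row = [5.0, -6.0, -7.0, 8.0]  # [[5, -6], [-7, 8]] stored row by row

print(matmul(a_row, b_row, 2, 2, 2))
print(matmul(a_col, b_row, 2, 2, 2, layout_a=ColMajor))
print(matmul(a_row, b_row, 2, 2, 2, elementwise=relu))
```

Supporting a new storage format in this sketch means writing one small class with an `index` method, while the multiplication loop stays untouched. That is the kind of reuse the puzzle-piece design aims for.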

Besides flexibility, the speed of the matrix multiplication matters too. The framework therefore targets graphics processors (GPUs), which can perform this kind of highly parallel computation many times faster than traditional processors (CPUs). In addition, the framework uses NVIDIA's Tensor Cores, processing units designed specifically to multiply matrices very quickly. Using these Tensor Cores to their full potential is complex and requires a great deal of research. For example, memory transfers must be carefully coordinated so that the Tensor Cores do not have to wait for data to be loaded. Fortunately, the framework shields users from this complexity, so that they can focus on designing new ML techniques.
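
One common way to keep fast hardware supplied with data is to split a large matrix multiplication into small tiles, so that each tile is loaded once and then reused many times. The Python sketch below shows only that decomposition. It is an illustration under stated assumptions, not the framework's code: the function name `tiled_matmul` and the tile size are hypothetical, and nothing here runs on a GPU or uses Tensor Cores.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), both lists of lists, tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile row of the result
        for j0 in range(0, n, tile):      # tile column of the result
            for p0 in range(0, k, tile):  # step along the shared dimension
                # On a GPU, the two input tiles for this step would be copied
                # into fast on-chip memory here, ideally while the previous
                # step is still being computed.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        for p in range(p0, min(p0 + tile, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [1, 1]]
print(tiled_matmul(a, b))
```

The result is identical to that of an ordinary matrix multiplication. Only the order of the work changes, and on real hardware it is that order which determines whether the compute units sit idle waiting for memory.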

And the end result? The flexibility and speed of the framework were evaluated for three variants of matrix multiplication. In terms of speed, it achieves results similar to those of existing libraries for the operations they support. On top of that, the framework is much more flexible: it also supports operations that are either impossible with the existing libraries or cannot be performed without a loss of speed.

What now?

Will we apply ML in every domain overnight? Not quite, but researchers can use the framework to carry out their research more efficiently. The framework and a number of predefined puzzle pieces have been released under an open-source licence, and can be studied and adapted by anyone. Now it is up to ML research groups to develop innovative techniques for new applications.

University or college

Ghent University

Thesis year

2020

Supervisor(s)

Bjorn De Sutter

Keywords

GPU, matrix multiplication, programming, flexible framework, machine learning