UNIVERSITY OF VAASA FACULTY OF TECHNOLOGY SOFTWARE ENGINEERING Abdi-Hakim Yasin Ararse COMPARISON OF THE PERFORMANCE OF 3G SECURITY ALGORITHMS IN THE NAS LAYER Master’s thesis for the degree of Master of Science in Technology submitted for inspection, Vaasa, (15th March 2013). Supervisor D.Sc. (Econ.) Jouni Lampinen Instructor M.Sc. (Tech.) Jani Kokkonen 2 ACKNOWLEDGEMENTS First I would like to express my deepest appreciation to Professor Jouni Lampinen and Jani Kokkonen for their guidance and valuable advice throughout this Master’s thesis work. I am also thankful to Lab. Engineer Jari Kallio for helping and providing the tools required for the test setup. Thanks to my family for their endless encouragement. Special thanks to my loving wife Shamsa Mohamed, and my beloved children Nawar and Muhammed. Without their mental support this thesis work would not have been completed on time. Abdi-Hakim Yasin Ararse Helsinki, Finland, 15th of March, 2013 3 TABLE OF CONTENTS ACKNOWLEDGEMENTS .......................................................................................................... 2 TABLE OF CONTENTS.............................................................................................................. 3 CONCEPTS .................................................................................................................................. 7 LIST OF SYMBOLS .................................................................................................................... 7 ABBREVIATIONS....................................................................................................................... 8 TABLE OF FIGURES ................................................................................................................ 11 LIST OF TABLE ........................................................................................................................ 12 ABSTRACT................................................................................................................................ 13 TIIVISTELMÄ ........................................................................................................................... 14 1. INTRODUCTION................................................................................................................... 15 2. UMTS...................................................................................................................................... 17 2.1 Introduction...................................................................................................................... 17 2.2 The Evolution of Mobile Telephony Systems.................................................................. 17 2.3 UMTS Network Architecture........................................................................................... 18 2.4 UMTS Radio Network ..................................................................................................... 20 2.4.1 Security Architecture UMTS ................................................................................... 21 2.4.2 Network Access Security to UMTS......................................................................... 22 2.4.2.1 User Identity Confidentiality ...................................................................... 22 2.4.2.2 Mutual Entity Authentication ..................................................................... 23 2.4.2.3 Data Integrity .............................................................................................. 23 2.4.3 Cryptographic Algorithms for UMTS...................................................................... 24 2.4.3.1 Confidentiality Algorithm .......................................................................... 25 2.4.3.2 Integrity Algorithm ..................................................................................... 28 2.5 KASUMI.......................................................................................................................... 31 2.5.1 Description of KASUMI.......................................................................................... 32 4 2.5.2 KASUMI Components and Encryption Function .................................................... 33 2.5.2.1 ƒ Functions.................................................................................................. 34 2.5.2.2 Function FL ................................................................................................ 34 2.5.2.3 Function FO ................................................................................................ 35 2.5.2.4 Function FI.................................................................................................. 35 2.5.2.5 S-boxes ....................................................................................................... 36 2.5.3 Key Schedule ........................................................................................................... 36 2.5.4 KASUMI Operations ............................................................................................... 37 3. THEORETICAL ANALYSIS OF CRYPTOGRAPHIC PERFORMANCE.......................... 38 3.1 Influential Variables for Cryptographic Performance Measurements ............................. 39 3.1.1 Application or Protocol Stack Software Overheads................................................. 39 3.1.2 Software Overheads ................................................................................................. 40 3.1.3 Bus Bandwidth......................................................................................................... 40 3.1.4 Data Size .................................................................................................................. 40 3.1.5 Iteration .................................................................................................................... 40 4. EXPERIMENTAL SET-UP AND TOOLS ............................................................................ 41 4.1 OCTEON Hardware Architecture.................................................................................... 41 4.1.1 Cavium Networks MIPS (cnMIPS).......................................................................... 42 4.1.1.1 OCTEON Product Overview ...................................................................... 42 4.1.1.2 OCTEON Architecture Overview .............................................................. 42 4.1.1.3 Other Units and Functionalities .................................................................. 46 4.2. OCTEON Software ......................................................................................................... 46 4.2.1 Simple Executive API .............................................................................................. 47 4.2.2 Runtime Environment Choices for cnMIPS Cores .................................................. 49 4.2.2.1 Simple Executive ........................................................................................ 50 4.2.2.2 Application Configurations......................................................................... 51 5 4.2.3 Other Software Issue................................................................................................ 51 4.2.3.1 Application Entry Point and Start up Code ................................................ 51 4.2.3.2 Booting SE-S Application .......................................................................... 51 4.2.4 Software Architecture .............................................................................................. 52 4.2.4.1 Control-plane versus Data-plane applications ............................................ 53 4.2.4.2 Event-driven Loop (polling) versus Interrupt-driven loop ......................... 53 4.2.4.3 Packet Processing Steps in OCTEON ........................................................ 54 4.2.4.4 Using Work Groups in Packet Processing .................................................. 55 4.2.4.5 Passing Work from one Core to another Core ............................................ 56 4.2.4.6 Pipelined versus Run-to-Completing Software Architecture ..................... 58 4.3 Software under the Test ................................................................................................... 59 5. TEST EXECUTION AND RESULT COLLECTION............................................................ 60 5.1 Test Scenarios Selection .................................................................................................. 60 5.2 Running the Test Cases.................................................................................................... 61 5.3 Hardware-based KASUMI Acceleration Case Simulation .............................................. 62 5.3.1 Case A: KASUMI HW Encryption and Decryption of packets of 40 bytes ............ 62 5.3.2 Case B: KASUMI HW Encryption and Decryption of packets of 88 bytes ............ 63 5.3.3 Case C: KASUMI HW Encryption and Decryption of packets of 500 bytes .......... 64 5.3.4 Case D: KASUMI HW Encryption /Decryption of packets of 1000 bytes.............. 64 5.4 Software-based KASUMI Acceleration Case Simulation ............................................... 65 5.4.1 Case A: KASUMI SW Encryption and Decryption on 40 bytes packets ................ 65 5.4.2 Case B: KASUMI SW Enc and Dec on 88 bytes packets........................................ 66 5.4.3 Case C: KASUMI SW Encryption and Decryption on 500 bytes packets............... 67 5.4.4 Case D: KASUMI SW Encryption and Decryption on 1000 bytes packets ............ 67 5.5 Key Generation/Expansion Effects on Test Cases........................................................... 68 5.6 Multi-core Test Cases ...................................................................................................... 68 6 5.6.1 Shorter Data Size for Multi-core Test Cases ........................................................... 68 5.6.1.1 KASUMI HW Encryption and Decryption ................................................ 68 5.6.1.2 KASUMI SW Encryption and Decryption ................................................. 69 5.6.2 Higher Data Size for Multi-core Test Cases ............................................................ 70 5.6.2.1 KASUMI HW Encryption and Decryption ................................................ 70 5.6.2.2 KASUMI SW Encryption and Decryption ................................................. 71 6. RESULT COMPARISONS AND DETAIL ANALYSIS...................................................... 72 6.1 Comparisons between KASUMI HW and KASUMI SW ............................................... 72 6.1.1 Lower Data Rate with One Core.............................................................................. 72 6.1.1.1 KASUMI HW/SW Enc & Dec Comparisons with 40 bytes....................... 72 6.1.1.2 KASUMI HW/SW Enc and Dec Comparisons with 40 bytes ................... 73 6.1.1.3 KASUMI SW Enc/Dec vs. only SW Enc/SW Dec for 40 bytes ................ 74 6.1.1.4 KASUMI HW Enc/Dec vs. only HW Enc & HW Dec on 40 bytes ........... 74 6.1.1.5 KASUMI HW and SW Enc/Dec for 1000 bytes ........................................ 75 6.1.1.6 KASUMI HW Enc/Dec vs. only Enc and Dec for 1000 bytes ................... 76 6.1.1.7 KASUMI SW Enc/Dec vs. only SW Enc & SW Dec on 1000 bytes ......... 77 6.1.2 Higher Data Rate with Multi-core ........................................................................... 77 6.1.2.1 Comparison between KASUMI HW/SW Enc/Dec with 16K .................... 77 6.1.2.2 Comparison between KASUMI HW Enc/Dec with 24K ........................... 78 6.1.2.3 Comparison between KASUMI HW Enc/Dec with 64K ........................... 79 7. CONCLUSIONS AND FUTURE WORK ............................................................................. 81 REFRENCES .............................................................................................................................. 83 APPENDIX 1. Testing Environment .......................................................................................... 87 APPENDIX 2. More test case scenarios ..................................................................................... 91 7 CONCEPTS Cellular Concept: the cellular concept is a system-level idea which calls for a single high-power transmitter (large cell) to be replaced by many low-power transmitters (small cells), each providing coverage to only a small portion of the service area. Domain: the highest-level group of physical entities. Reference points are defined between domains. Stratum: the grouping of protocols related to one aspect of the services provided by one or several domains. Stream Cipher: the stream concept is that Plaintext data are added bit by bit to random- looking mask data that are generated by the Cipher Key (CK) and a few other parameters. Processor: this word is used in this thesis to denote the entire chip with all of the different functional hardware blocks, including all the cores on the chip. A Cryptographic Accelerator: is a device that performs processor-intensive decrypting/encrypting while freeing the host CPU to perform other tasks. Message authentication code: Is a short piece of information used to authenticate a message and to provide integrity and authenticity assurance on the message. LIST OF SYMBOLS = The assignment operator.  The bitwise exclusive -OR operation. || The concatenation of the two operands. 8 ABBREVIATIONS 1G 1st Generation 2G 2nd Generations 3G 3rd Generations 3GPP 3rd Generations Partnership Project cnMIPS core networks standardized MIPS AKA Authentication and Key Agreement AN Access Network AMPS Advanced Mobile Phone Service CBC Cipher Block Chaining CFB Cipher Feedback CK Cipher Key CN Core Network CPU Central Processing Unit CDMA Code Division Multiple Access CS Circuit Switched DES Data Encryption Standard GPRS General Packet Radio Service GSM Global System for Mobile Communication IETF Internet Engineering Task Force IK Integrity Key I/O Input and Output HE Home Environment HW Hardware HSPA High Speed Downlink Packet Access 9 HSPA+ Evolved High Speed Packet Access HSUPA High Speed Uplink Packet Access K Key KM Key Modifier OFB Output Feedback PDA Personal Digital Cellular PS Packet Switched PS Padded String LTE Long-Term Evolution MAC Message Authentication Code MIPS Microprocessor without Interlocked Pipeline Stages MM Mobility Management MME Mobility Management Entity NAS Non-Access Stratum NIST National Institute of Standard and Technology Node B UMTS Base Station RAN Radio Access Network RANAP Radio Access Network Application Protocol RNC Radio Network Controller RRC Radio Resource Control SAGE Security Algorithms Group of Experts SN Serving Network SoC System-On-Chip SW Software TDMA Time Division Multiple Access 10 TR Technical Report TS Technical Specification UE User Equipment UEA UMTS Encryption Algorithm UIA UMTS Integrity Algorithm UMTS Universal Mobile Telecommunication System USA United State of America USIM Universal Subscriber Identity Module UTRAN UMTS Terrestrial Radio Access Network WCDMA Wideband Code Division Multiple Access 11 TABLE OF FIGURES FIGURE 1. UMTS SYSTEM..........................................................................................................................19 FIGURE 2. UTRAN, ITS NETWORKS AND INTERFACES ................................................................................20 FIGURE 3. OVERVIEW OF SECURITY ARCHITECTURE...................................................................................22 FIGURE 4. CIPHERING USER AND SIGNALING DATA TRANSMITTED OVER THE RADIO ACCESS LINK. ............26 FIGURE 5. THE Ƒ8 STREAM CIPHER MODE ...................................................................................................26 FIGURE 6. DERIVATION OF MAC-I AND/OR XMAC-I O .............................................................................30 FIGURE 7. THE Ƒ9 INTEGRITY FUNCTION.....................................................................................................31 FIGURE 8. (A) KASUMI; (B) FOI FUNCTION; (C) FII,J FUNCTION; AND (D) FLI FUNTION............................32 FIGURE 9. KASUMI OPERATIONS ..............................................................................................................37 FIGURE 10. CAVIUM NETWORKS MIPS ARCHITECTURE (CNMIPS) ...........................................................44 FIGURE 11. COMMUNICATION BETWEEN CORE AND ITS CO-PROCESSOR .....................................................45 FIGURE 12. HIGH LEVEL TEST SYSTEM ARCHITECTURE ..............................................................................47 FIGURE 12. SIMPLE EXECUTIVE API ..........................................................................................................48 FIGURE 13. SIMPLE EXECUTIVE API RUNTIME ENVIRONMENTS .................................................................49 FIGURE 14. SIMPLE EXECUTIVE STAND-ALONE APPLICATION (SE-S) ........................................................50 FIGURE 15. SIMPLE EXECUTIVE CALLS FROM KERNEL MODE.....................................................................50 FIGURE 16. SIMPLE EXECUTIVE USER-MODE (SE-UM) APPLICATION .......................................................50 FIGURE 17. 8-CORE SIMPLE SYSTEM RUNNING A SE-S APPLICATION ..........................................................51 FIGURE 18. SIMPLE EXECUTIVE STAND-ALONE MODE (SE-S) APPLICATION..............................................52 FIGURE 19. DOWNLOADING AND BOOTING SE-S APPLICATIONS ...............................................................52 FIGURE 20. SE-S USED BOTH CONTROL PLANE AND DATA PLANE .............................................................53 FIGURE 21. CORE MAY ACCEPT WORK FROM ANY AND ALL GROUPS ..........................................................56 FIGURE 22. KASUMI HW ENC AND DEC ON 40 AND 88 BYTES PACKETS ..................................................63 FIGURE 23. KASUMI HW ENC AND DEC ON 500 AND 1000 BYTE PACKETS ..............................................65 FIGURE 24. KASUMI SW ENC AND D EC ON 40 AND 88 BYTES PACKETS ..................................................66 FIGURE 25. MULTICORE FOR KASUMI HW ENC AND DEC ON 64 BYTE PACKETS ....................................68 FIGURE 26. REFERENCE TIME FOR KASUMI HW ENC AND DEC ON 64 BYTE PACKETS............................69 FIGURE 27. MULTICORE FOR KASUMI SW ENC AND DEC ON 64 BYTE PACKETS ......................................69 FIGURE 28. REFERENCE TIME FOR KASUMI SW ENC AND DEC ON 64 BYTE PACKETS .............................70 FIGURE 29. MULTICORE FOR KASUMI HW ENC AND DEC ON 512 BYTE PACKETS ...................................70 FIGURE 30. REFERENCE TIME FOR KASUMI HW ENC/DEC ON 512 BYTE PACKETS...................................71 FIGURE 31. MULTICORE FOR KASUMI SW ENC/DEC ON 512 BYTE PACKETS ...........................................71 FIGURE 32. REFERENCE TIME FOR KASUMI SW ENC/DEC ON 512 BYTE PACKETS..................................71 FIGURE 33. NUMBER OF CYLES FOR HW/SW ENC/DEC ON 40 BYTE PACKETS ...........................................72 FIGURE 34. NUMBER OF CYLES PER BIT ON 40 BYTE PACKETS ....................................................................73 FIGURE 35. COMPARISON BETWEEN KASUMI SW IMPLEMENTATIONS .....................................................74 FIGURE 36. COMPARISON BETWEEN KASUMI HW IMPLEMENTATIONS ....................................................75 FIGURE 37. NUMBER OF CYLES FOR HW/SW ENC/DEC ON 1000 BYTE PACKETS .......................................76 FIGURE 38. COMPARISON BETWEEN KASUMI HW IMPLEMENTATIONS ....................................................76 FIGURE 39. COMPARISON BETWEEN KASUMI SW IMPLEMENTATIONS .....................................................77 FIGURE 40. NUMBER OF CYLES FOR HW/SW ENC/DEC ON 512 BYTE PACKETS WITH 16K.........................78 FIGURE 41. KASUMI HW ENC/DEC ON 512 BYTE WITH DATA BLOCKS OF 24K ........................................79 FIGURE 42. KASUMI HW ENC/DEC ON 512 BYTE DATA PACKET OF 64K .................................................80 12 LIST OF TABLE TABLE 1. CONSTANTS CJ.............................................................................................................................36 TABLE 2. SUBKEYS (KL, KO, AND KI)........................................................................................................36 TABLE 4. KASUMI HW ENC AND DEC ON 40 BYTE PACKETS ....................................................................62 TABLE 5.KASUMI HW ENC AND DEC ON 88 BYTE PACKETS .....................................................................63 TABLE 6. KASIMI HW ENC AND DEC ON 500 BYTES PACKETS ..................................................................64 TABLE 8. KASUMI SW ENC AND DEC ON 40 BYTE PACKETS .....................................................................65 TABLE 9. KASUMI SW ENC AND DEC ON 88 BYTE PACKETS .....................................................................66 TABLE 10. KASUMI SW BASED IMPLEMENTATION ON 500 BYTE PACKETS................................................67 TABLE 11. KASUMI SW ENC AND DEC ON 1000 BYTE PACKETS ...............................................................67 TABLE 12. NUMBER OF CYCLES PER BIT ON 40 BYTE PACKETS ...................................................................73 TABLE 13. KASUMI SW ENC/DEC VS. ONLY ENC OR DEC ON 40 BYTES ..................................................74 TABLE 14. KASUMI HW ENC/DEC VS. ENC OR DEC ON 40 BYTES ............................................................75 TABLE 15. KASUMI HW/SW ENC AND DEC ON 512 BYTES WITH 16K DATA BLOCKS...............................78 TABLE 16. KASUMI HW ENC/DEC ON 512 BYTE WITH 24K DATA BLOCKS ...............................................79 TABLE 17. KASUMI HW ENC/DEC ON 512 BYTE WITH 64K DATA BLOCKS ...............................................80 TABLE A1. KASUMI HW ENCRYPTION ON 40 BYTE PACKETS ...................................................................91 TABLE A2. KASUMI HW ENCRYPTION ON 88 BYTE PACKETS ...................................................................91 TABLE A3. KASUMI HW ENCRYPTION ON 500 BYTES PACKETS ...............................................................92 TABLE A4. KASUMI HW ENCRYPTION ON 1000 BYTE PACKETS ...............................................................93 TABLE A5. KASUMI HW DECRYPTION ON 40 BYTE PACKETS ...................................................................93 TABLE A6. KASUMI HW DECRYPTION ON 88 BYTE PACKETS....................................................................94 TABLE A7. KASUMI HW DECRYPTION ON 500 BYTES PACKETS................................................................95 TABLE A8. KASUMI HW DECRYPTION ON 1000 BYTE PACKETS................................................................95 TABLE A9. KASUMI SW ENCRYPTION ON 40 BYTE PACKETS....................................................................96 TABLE A10. KASUMI SW ENCRYPTION ON 88 BYTES PACKETS ................................................................97 TABLE A11. KASUMI SW ENCRYPTION ON 500 BYTES PACKETS ..............................................................98 TABLE A12. KASUMI SW ENCRYPTION ON 1000 BYTES PACKETS ............................................................99 TABLE A13. KASUMI SW DECRYPTION ON 40 BYTES PACKETS ................................................................99 TABLE A14. KASUMI SW DECRYPTION ON 88 BYTES PACKETS ..............................................................100 TABLE A15. KASUMI SW DECRYPTION ON 500 BYTE PACKETS ..............................................................100 TABLE A16. KASUMI SW DECRYPTION ON 1000 BYTES PACKETS ..........................................................101 13 UNIVERSITY OF VAASA Faculty of technology Author: Abdi-Hakim Yasin Ararse Topic of the Thesis: Comparison of the performance of 3G security algorithms in the NAS layer Supervisor: Professor Jouni Lampinen Instructor: M.Sc.(Tech). Jani Kokkonen Degree: Master of Science in Technology Major of Subject: Software Engineering Year of Entering the University: 2000 Year of Completing the Thesis: 2013 Pages: 101 Key words: KASUMI, 3G, UMTS security, Multi-Core ABSTRACT Cryptographic functionality implementation approaches have evolved over time, first, for running security software on a general-purpose processor, second, employing a separate security co-processor ,and third, using built-in hardware acceleration for security that is a part of a multi-core CPU system. The aim of this study is to do performance tests in order to examine the boost provided by accelerating KASUMI cryptographic functions on a multi-core Cavium OCTEON processor over the same non-accelerating cryptographic algorithm implemented in software. Analysis of the results shows that the KASUMI SW implementation is much slower than the KASUMI HW-based implementation and this difference increases gradually as the packet size is doubled. In detailed comparisons between the encryption and decryption functions, the result indicates that at a lower data rate, neither of the KASUMI implementations shows much difference between encryption or decryption processing, regardless of the increase in the number of data packets that are being processed. When all the 16 cores of the OCTEAN processor are populated, as the number of core increases, the number of processing cycles decreases accordingly. Another observation was that when the number of cores in use exceeds 5 cores, it doesn’t make much difference to the number of decrease of processing cycles. This work illustrates the potential that up to sixteen cnMIPS cores integrated into a single-chip OCTEON processor provides for HW- and SW-based KASUMI implementations. 14 VAASAN YLIOPISTO Teknillinen tiedekunta Tekija: Abdi-Hakim Yasin Ararse Diplomityön nimi: Comparison of the performance of 3G security algorithms in the NAS layer Valvojan nimi: Professor Jouni Lampinen Ohjaajan nimi: M.Sc.(Tech) Jani Kokkonen Tutkinto: Diplomi-insinööri Koulutusohjelma: Tietotekniikka Suunta: Ohjelmistotekniikka Opintojen aloitusvuosi: 2000 Diplomityön valmistusmisvuosi: 2013 Sivumäärä: 101 Key words: KASUMI, 3G, UMTS-Tietoturva, moniydin TIIVISTELMÄ Salaustekniikoiden toteutustavat ovat kehittyneet ajan myötä, kun ensiksi yleiskäyttöisillä suorittimilla ajettavista tietoturvaohjelmista on siirrytty käyttämään erillistä apusuoritinta ja kolmanneksi on siirrytty käyttämään sisäänrakennettua laitteistokiihdytystä, joka on osana keskusyksikön moniydinsuoritinta tietoturvan takaamiseksi. Tämän tutkimuksen tavoitteena on tehdä suorituskykytestejä, joilla tutkitaan suorituskyvyn parannusta, joka saavutetaan KASUMI-salausalgoritmin kiihdytyksellä moniytimisellä Cavium OCTEON suorittimella suhteessa ohjelmistopohjaiseen, kiihdyttämättömään salausalgoritmiin. Tulosten analyysi osoittaa, että KASUMI-algoritmin ohjelmistopohjainen toteutus on paljon hitaampi kuin laitteistopohjainen toteutus, ja että tämä ero kasvaa asteittain, kun paketin koko kaksinkertaistuu. Yksityiskohtaisissa vertailuissa salauksen koodauksen ja koodauksen purkamisen välillä osoittavat, ettei alemmalla datanopeudella kummassakaan KASUMI- toteutuksessa salauksen tai sen purkamisen käsittelyssä ole suuria eroja riippumatta lisääntyvän käsiteltävien datapakettien koosta. Kun kaikki kuusitoista OCTEON-suorittimen ydintä on otettu käyttöön, ja kun ydinten määrä kasvaa, suoritusvaiheiden määrä vähenee vastaavasti. Toinen havainto oli, että kun käytössä olevien ytimien määrä ylittää viisi ydintä, ei sillä ole vaikutusta suoritusvaiheiden vähenemiseen. Tämä tutkimus osoittaa ne mahdollisuudet, jotka jopa 16 cnMIPS ydintä integroituna yhden piirin OCTEON-suorittimeen tarjoaa ohjelmisto- ja laitteistopohjaisissa toteutuksissa. 15 1. INTRODUCTION This chapter introduces the motivation for comparing KASUMI hardware accelerated cryptography with the software-based KASUMI cryptography implementation, which has become the most used option in a system-in-chip embedded system like the OCTEON processor that is studied in this thesis. Afterwards, an overview of the organization of the thesis will be given. The Universal Mobile Telecommunication System (UMTS) is a so-called third- generation (3G) mobile radio system and it is a successor to second-generation (2G) systems such as the Global System for Mobile Communications (GSM) and General Packet Radio Service (GPRS). The current UMTS system is in Release 8 and is called Long-Term Evolution (LTE), which represents a flat architecture solution. According to the 3GPP security specification (3GPP TS 33.102, 2006), the security architecture is made up of a set of security features and security mechanisms that meets one or several security requirements. The requirements for UMTS confidentiality and integrity algorithms are specified by 3GPP in the technical specification document (3GPP TS 33.105, 2007). Within the security architecture of the 3GPP system, there are two mandatory security algorithms, namely, ƒ8 and ƒ9 or the UMTS Encryption Algorithm (UEA1) and UMTS Integrity Algorithm (UIA1), respectively. The core functionalities of these algorithms are based on the KASUMI stream cipher. For the confidentiality algorithm, the UMTS Encryption Algorithm (UEA1) is a stream cipher and it is used to encrypt and decrypt blocks of data under a confidentiality key (CK). The algorithm uses KASUMI in the form of Output Feedback (OFB) mode as a key stream generator. For the integrity algorithm, the UMTS Integrity Algorithm (UIA1) computes the Message Authentication Code (MAC) of a given input message under an integrity key (IK) and imposes no limitation on the length of the input message. The approach that has been adopted uses KASUMI as it is also used by the confidentiality algorithm (UEA1) in the form of Cipher Block Chaining (CBC-MAC) mode. Cryptography is the art and science of encrypting and decrypting data in order to make it impossible for outside parties to access the data. Cryptographic functionality implementation approaches have evolved over time, first, with software running on general-purpose processors, second, employing a separate custom security co-processor, and third, using built-in hardware acceleration for security that is a part of a multi-core CPU system. 16 This thesis examines the use of built-in hardware acceleration for cryptographic functions in silicon, or, to give it another name, cryptographic functions in SoC. The SoC system consists of both a hardware component and a software component, which paves the way for the boundary between hardware and software being shrunk. Security algorithms are generally dependent on intensive computation and they require many bit- manipulating operations in order to transform back and forth between plaintext and ciphertext. Software running on a general-purpose processor is often inefficient in performing such operations since the many instructions needed to implement cryptographic operations consume valuable CPU resources and thus affect the performance of the system. Hardware acceleration provides a better system performance implementation than the software. This thesis work focuses on verifying the cryptographic performance boost that is provided by a Cavium OCTEON processor as an accelerator of KASUMI encryptions and decryptions via the software-based KASUMI implementation. KASUMI encryption and decryption are applied in the UMTS Non-Access Stratum (NAS) layer for ciphering radio link access which is specified in the 3GPP release 7 specifications. The structure of this thesis is as follows. Chapter 2 presents the existing literature work in the field under study. Chapter 3 discusses the basics of performance analysis and how it is applied in this thesis. Chapter 4 develops the relevant test scenarios and test environment setups. Chapter 5 discusses the test execution and presents the test results. Results and analysis of the measurements can be found in Chapter 6, followed by overall conclusions in Chapter 7. 17 2. UMTS 2.1 Introduction This chapter will first provide an insight into the history of mobile telephony systems, from the first generation of mobile systems to the third generation. After taking a closer look at the evolution of mobile systems, the second subchapter will highlight the basic principle of mobile security used in the third-generation system. At the end, a detailed description of the KASUMI cipher and its operations will be given. 2.2 The Evolution of Mobile Telephony Systems The history of mobile telephony goes back to experiments with radio telephony in the USA in the 1920s (Agar 2005). There are three different generations as far as mobile telephony is concerned. The first mobile radio systems, the so-called first generation (1G) networks, were based on an analogue radio path but used digital switching technology. These mobile systems offered basic services for their users and the emphasis was on speech and service-related matter. The other characteristics of these early mobile systems were that networks were mainly national efforts and very often they were specified after the networks were established. For this reason, the 1G networks were incompatible with each other. Later, the first real cellular systems were implemented, such as the analogue Advanced Mobile Phone Service (AMPS) system in the USA. For the first time, frequencies were reused, resulting in the interference inherent to cellular networks. Old analogue systems have a common limitation in terms of wide area coverage and frequency spectrum reuse in the systems as a result of the use of only one single radio transmitter (Walke, Seidenberg &Althoff 2003). In May 1972 Bell Labs introduced and patented a cellular concept that laid the foundations for the second- and third-generation mobile radio systems. The cellular concept is simple: instead of a single base station providing coverage of as large an area as possible, each base station should only cover a small area. Further elaboration of the cellular concept is provided in (Macdonald 1979) and (Rappaport 2002). As the need for mobile communication increased, the need for a more global mobile communication system also increased. The international specification bodies started to specify the 2G; the so-called Second Generation (2G) network uses digital channels, resulting in more efficient use of the spectrum. Two main systems that established themselves in the USA – the Time Division Multiple Access (TDMA) systems IS-54 18 and IS-136 – are based on a time slot structure and are similar to the Global System for Mobile Communications (GSM) (Althoff 2003) . Japan also developed its own standard: Personal Digital Cellular (PDC), which also uses the TDMA technology (3 time slots, 25-MHz channel bandwidth) and operates at 800 MHz and 1500 MHz. The best-known Second Generation system is one that originated in Europe; the Global System for Mobile (GSM) Communications was designed in the late ’80s by the state- owned national telecommunication companies and harmonised for use throughout Europe. GSM also employs the TDMA technology and uses 8 time slots on a 200-kHz wide carrier frequency. GSM900 has a total of 124 frequency channels and GSM1800 has as many as 374. The so-called Third-Generation (3G) mobile radio systems are based on the open interfaces of its successor, GSM. The third generation, 3G, is expected to complete the globalisation process of mobile communication. UMTS (Universal Mobile Telecommunication System) is one 3G implementation. The current UMTS network has been upgraded to High-Speed Downlink Packet Access (HSPA) in order to increase the data rate and capacity for downlink packets and has been introduced as a 3GPP release 5 features, and High-Speed Uplink Packet Access (HSUPA) is introduced in 3GPP release 6 in order to boost uplink performance in a UMTS network. The combination of HSDPA and HSUPA is often referred to as HSPA. However, even with these improvements and the introduction of HSPA, the evolution of UMTS has not reached its end. In a 3GPP release 7, HSPA+ has been introduced, with a significant enhancement and improvement to the performance of HSPA-based radio networks in terms of spectrum efficiency, peak data rate, and latency. 2.3 UMTS Network Architecture There are different ways to visualise a UMTS network, depending on which angle you look from. One angle to look from is the functions of the network in terms of how the traffic is handled. Another approach is to study the functions of the network elements. In this thesis work, the network will be looked at from the point of view of both the physical and functional viewpoints. The physical aspects are modeled using the domain concept and the functional aspects are modeled using the strata concept (3GPP TS 23.110, 2007). Figure 1 (below) illustrates the basic domains in a UMTS system. 19 User Equipment Domain Access Network Domain Core Network Domain Infrastructure Domain Cu Mobile Equipment Domain USIM Domain Home Network Domain Transit Network Domain Uu Iu [Zu] [Yu] Serving Network Domain Figure 1. UMTS System Cu Reference point between USIM and UE Iu Reference point between Access and Serving Network domains Uu Reference point between User Equipment and Infrastructure domains, UMTS radio interface [Yu] Reference point between Serving and Transit Network domains [Zu] Reference point between Serving and Home Network domains According to (3GPP TS 23.101, 2007), the basic UMTS architectural is split into a User Equipment domain and an Infrastructure domain. The main interest of this thesis work lies in the Infrastructure domain, which could be further split into an Access Network domain and a Core Network domain. From functionality point of view, UMTS network infrastructure is logically divided into Core Network (CN) and Access Network (AN) subsystems as specified in (3GPP TS 23.101, 2007) and (3GPP TS 23.110, 2007) respectively. Each subsystem can be further divided into separate technologies. For example, the RAN (Radio Access Network) is compromised of different air interface technologies, such as GERAN (GSM, EDGE, and Radio Access Network), UTRAN (UMTS Terrestrial Radio Access Network), and future solutions such as WLAN and 4G. The core network is today clearly divided into two domains:  the Circuit-Switched (CS) domain and  the Packet-Switched (PS) domain. A more detailed description of the UMTS architecture is provided in the 3GPP Network Architecture document (3GPP TS 23.002, 2007). The interest of this thesis work lies 20 mainly in two interfaces, namely Iu and Iur. More information on UMTS interfaces can be found in specifications (3GPP TR 23.930, 1999) and (3GPP TS 25.420, 2007). 2.4 UMTS Radio Network The UMTS Terrestrial Radio Access Network (UTRAN) consists of a set of Radio Network Subsystems (RNS) connected to the Core Network (CN) through the Iu interface. A RNS consists of a Radio Network Controller and or more Node Bs. A Node B is connected to the RNC through the Iub interface (3GPP TS 25.401, 2002). The main task of the UTRAN is to create and maintain Radio Access Bearers (RAB) for communication between the UMTS User Equipment (UE) and core network (CN). With RAB the CN elements are given an illusion about a fixed communication path to the UE, thus releasing them from the need to take care of radio communication aspects. There is two open interfaces: Uu and Iu, which are used by UTRAN to talk with the UE and CN. Since there are packet switched and circuit switched domains for different services, Iu-interface was separated into the Iu-CS and Iu-PS interfaces. Detail description of UTRAN protocol architecture is specified in (3GPP TS 25.401, 2002) document. The UTRAN architecture is shown following figure. Figure 2. UTRAN, its networks and interfaces The focus of this thesis work is on security parts of Non-Access Stratum (NAS) signaling between UE and CN which passes through UTRAN network. The overall 21 protocol architecture of radio interface is layered into three protocol layers, namely, the physical layer (L1), the data link layer (L2) and network layer (L3) as described in (3GPP TS 25.301, 2007) . L3 provides Uu Stratum services and functions as whole and detail description of L3 protocol, Radio Resource Control (RRC) is given in (3GPP TS 25.331 2006). 2.4.1 Security Architecture UMTS UMTS security builds on the security of GSM, inheriting the proven GSM security features. This maximizes the backward compatibility between GSM /UMTS and UMTS/LTE. UMTS also provides a solution to the weaknesses of GSM security and adds security features for new 3G radio access networks and services. According to specifications, the security architecture is made up of a set of security features and security mechanisms (3GPP TS 33.102, 2006). A security feature is a service capability that meets one or several security requirements that are defined in (3GPP TS 21.133, 2002) and implement the security objectives and principles described in (3GPP TS 33.120, 2001). A Security mechanism is an element that is used to realize a security feature and all the security features and security mechanism taken together to forms the security architecture (3GPP TS 33.102, 2006). UMTS consists of five security feature groups: I) Network Access Security provides users with secure access to UMTS services and protect against attacks on the radio access link. II) Network Domain Security protects against attacks on the wireline network and allows nodes in the provider domain to exchange signaling data securely. III) User Domain Security provides secure access to mobile stations. IV) Application Domain Security allows the secure exchange of messages between applications in the user and in the provider domain. V) Visibility and configurability of security allows the user to observe whether a security feature is currently in operation and if certain services depend on this security feature Figure 3 shows the way security features are grouped together into five different sets of features, each one facing a specific threat and accomplishing certain security objectives. 22 Figure 3. Overview of security architecture The focus of this thesis work is on network access security (group I) and other UMTS security area are outside the scope of this thesis work. 2.4.2 Network Access Security to UMTS Network Access Security features enables users to securely access services provided by the UMTS network. It consist a set of security features that provide users with secure access to UMTS services in which particular protect against on the radio access link. Network access security features can be further classified into the following categories, namely, user identity confidentiality, entity authentication and confidentiality, data integrity and mobile equipment identification as specified in (3GPP TS 33.102, 2006). 2.4.2.1 User Identity Confidentiality The main objectives of the user identity confidentiality features are to provide mechanisms which prevent intruders from eavesdropping on the radio access link. It also makes sure that the user location confidentiality is secured, so the presence or the arrival of the user in a certain area cannot be determined by eavesdropping on the radio access. It also provides user un-traceability mechanism, so that an intruder cannot deduce whether different services are delivered to the same user by eavesdropping on the radio access link. To achieve these objectives, the user is normally identified by a temporary identity by which he is known by the visited serving network, or by an encrypted permanently identity. To avoid user traceability, which may lead to the compromise of the user identity confidentiality, the user should not be identified for a long period by means of the same temporary or encrypted identity. To achieve these security features, in addition it is required that any signaling or user data that might reveal the user’s identity is ciphered on the radio access link. 23 2.4.2.2 Mutual Entity Authentication There are three entities involved in the authentication mechanism of the UMTS system, Home Environment (HE), Serving Network (SN) and the User Entity (UE), more specifically Universal Subscriber Identity Module (USIM). Authentication and Key Agreement (AKA) mechanism accomplishes this mutual authentication of the user and the network using a symmetric key (K) and derives the new cipher and integrity keys. Detail description of Authentication and Key Agreement (AKA) mechanisms are discussed in (Arkko & Haverinen 2006), (Kambourakis, Rouskas & Gritzalis 2004) and (Grecas, Maniatis & Vernieris 2003). Once the user and the network have authenticated each other, they may begin secure communication. Encryption and decryption take place in the UE and in RNC on the network side, which means that Cipher Key (CK) has been transferred from the core network (CN) to the Radio Access Network (RAN). This is done in a specific Radio Access Network Application Protocol (RANAP) message, called the security mode command. After the RNC has obtained the CK, it can switch encryption on the sending a Radio Resource Control (RRC) security code command to the UE. The UMTS encryption mechanism is based on a steam cipher concept [see chapter concepts]. The core of encryption mechanism is the mask generation algorithm, which is denoted as function f8. The specification is available (3GPP TS 35.201, 2007) and it is based on a novel block cipher called KASUMI in Universal Mobile Telecommunication System (UMTS) and uses SNOW 3G in Long-Term Evolution (LTE). 2.4.2.3 Data Integrity The purpose of integrity data protection is to provide a mechanism that the User Equipment (UE) and Serving Network (SN) can securely negotiate the integrity algorithm that they shall use subsequently in order to authenticate individual control messages. It also provides a mechanism that the User Equipment (UE) and Serving Network (SN) agree on an integrity key that they may use subsequently. Lastly, data integrity feature provide a mechanism that receiving entity (UE or SN) is able to verify that signaling data has not been modified in an authorized way since it was sent by the sending entity (UE or SN) and that data origin of the signaling data received is indeed the one claimed. According to (Niemi & Nyberg 2006), this mechanism is important, since separate authentication procedures only give assurance of the identities of the communicating parties at the time of the authentication. A Message Authentication Code (MAC) function is applied to each individual signaling message at the RRC layer of UMTS Terrestrial Radio Access Network (UTRAN) protocol stack (3GPP TS 25.331 2006). The integrity protection of signaling messages, between UE and RNC starts during the security mode set-up as soon as the integrity key and integrity protection algorithm is known. Integrity protection is based on a message authentication code concept [see in concept chapter] which is a one-way function controlled by the secret Integrity Key 24 (IK). The function is devoted by f9 and its output is MAC-I a 32 bit, random-looking bit string. The algorithm for integrity protection is based on the same core function as encryption. This thesis work focus mainly on mechanisms used for confidentiality and data integrity security features, so mobile equipment identification and entity authentication related mechanisms are outside the scope of this thesis work. 2.4.3 Cryptographic Algorithms for UMTS This section will highlight the cryptographic algorithms used in UMTS and explain their functionalities in detail. Further, this section presents the 3GPP UMTS security algorithm specifications and further descriptions of UMTS security functionalities provided by security experts, namely Valtteri Niemi and Kaisa Nyberg in their book: UMTS Security. In cryptography, a block cipher is a symmetric key cipher which operates on fixed- length group of bits. The block cipher transforms a plaintext block of fixed lengths into a sequence of ciphertext blocks of equal length under the control of a secret key K. The operation of transforming a plaintext block into a ciphertext block is called encryption, and the operation of transforming a ciphertext block back to plaintext block is called decryption. A block cipher applies the encryption algorithms and key to an entire block of plaintext to obtain the ciphertext. To encrypt messages longer than the block size or in another word in order to provide confidentiality for messages of arbitrary length, then a mode of operation is used. The four most widely known modes of operation were originally standardized by National Institute of Standard and Technology (NIST) for use with the Data Encryption Standard (DES) algorithm (NIST 1981). A stream cipher is a cryptographic algorithm for encrypting plaintext similarly as to block ciphers but the plaintext is partitioned to sequence of blocks. The main difference between them is that in the block cipher, the current plaintext block is not taken as data input for cryptographic transformation. Instead, a string of bits, often called a “keystram” block, is generated independently of the current plaintext block and then combined with the plaintext using a simple operation, most commonly the XOR operation. Due to this functional difference, stream ciphers typically operate on plaintext blocks of shorten length, which can be just 1 bit or 1 byte. 25 The requirement for UMTS confidentiality and integrity algorithms are specified by 3GPP in the technical specification document (3GPP TS 33.105, 2007). Within the security architecture of the 3GPP system, there are standardized algorithms versions for UMTS confidentiality algorithm UEA1 and integrity algorithm UIA1 There are two versions for UMTS security algorithms: 1.0 and 1.1. Version 1.1 must be used. In following sub-sections confidentiality and Integrity algorithms will be elaborated more as well as their structure and functionalities. 2.4.3.1 Confidentiality Algorithm For data confidentiality of user data and signaling data, UMTS uses a cryptographic function called (ƒ8). Confidentiality algorithm (UEA1) is a stream cipher is used to encrypt and decrypt blocks of data under a confidentiality key (CK). The block of data may be between 1 and 20, 000 bits long. The algorithm use KASUMI in a form of Output Feedback (OFB) mode as a key stream generator in UMTS and uses. The detail description of f8 algorithm is specified in (3GPP TS 35.201, 2007). The 3GPP ƒ8 stream cipher mode is not a standard stream cipher mode of operation of a block cipher. Examples of such standard modes are counter mode and OFB mode (NIST 1981). A counter mode keystream generator makes us of the generator that is updated for each new block and is taken as part of the input to the generator function. The ƒ8 stream cipher mode can be seen as a combination of these two standard modes and makes use of prewhitening of feedback data (Niemi & Nyberg 2006). In UMTS, The ƒ8 algorithm makes use of the KASUMI key-dependent function, which operates on 64-bit data blocks and produces 64-bit blocks under control of a 128-bit key K. In LTE, ƒ8 uses SNOW 3G as a keystream generator, which generates a sequence of 32-bit words under the control of a 128-bit key and a 128-bit initialization variable. The input parameters to ƒ8 are the Cipher Key (CK), the time-dependent input (COUNT-C), the bearer identity (BEARER), the direction of the transmission (DIRECTION) and the length (LENGTH) of the plaintext. The Cipher Key (CK) is renewed at every authentication process. COUNT-C, BEREAR and DIRECTION can be considered as initialization parameters as they are renewed for each keystream block. The time-dependent input COUNT-C also sent in cleartext and used as a synchronization parameter for the synchronous stream cipher. The input parameter LENGTH only affects the length of the keystream, not the actual bits in it. The exact structure of these parameters and further description of ƒ8 algorithm operation it is specified in (3GPP TS 33.102, 2006) and in (3GPP TS 35.201, 2007). 26 Following figure illustrate the use of ƒ8 function to encrypt plaintext by applying a keystream using a bitwise XOR operation. The plaintext may be recovered by generating the same keysteam using the same input parameters and applying it to the ciphertext using a bitwise XOR operation. Figure 4. Ciphering user and signaling data transmitted over the radio access link. The ƒ8 algorithm makes use of two 64-bit registers as described in (Niemi & Nyberg 2006): the static register A and the counter BLKCNT. Register A is initialized using the 64 bit initialization value: IV = COUNT ||BEARER||DIRECTION||0…0 obtained as the concatenation of the 32- bit COUNT, 5-bit BEARER, 1-bit DIRECTION value and a string of 26 zero bits. The counter BLKCNT is set to 0. Figure 5. The ƒ8 stream cipher mode 27 The ƒ8 algorithm makes use of a Key Modifier (KM) constant that is equal to the octet 0 x 55 = 01010101 repeated 16 times. First, a single operation of KASUMI is applied to register A, using a modified version of the CK to compute the prewhiteninig value: W = KASUMICK  KM(IV) (2.1) which is stored in register A. Once the keystream generator has been initialized in this manner, it is ready to be used to generate keystream bits. The plaintext/ciphertext to be encrypted/decrypted consists of LENGTH bits, where LENGTH varies between 1 to 20,000 with granularity of 1 bit, while the keystream generator produces keystream bits in multiples of 64 bits. Between 0 and 63 of the least significant bits are discarded from the last block depending of the total number of bits required by LENGTH. The number of required keystream bits is denoted by BLOCKS, whose value is determined by the value of the LENGTH parameter as follows: the value of LENGTH is divided by 64 and the result is rounded up to the nearest integer. The keystream blocks are denoted as KSB1, KSB2 ,…,KSBblocks. Set KSBo=0 and let n be an integer with 1 ≤ n ≤ BLOCKS, such that n = BLKCNT +1, and set: KSBn = KASUMICK( W  (n-1) KSB n-1) (2.2) Individual bits KS[0],KS[1],…….KS[LENGTH-1] of the keystream are extracted in turn from KSB1 to KSBblocks, with the most significant bit extracted first, by applying the following operation. For n = 1, …., BLOCKS and for each integer i, with 0 ≤ i ≤ 63, set : KS[((n-1) * 64) - i] = KSBn[i] (2.3) Encryption/decryption operations are identical and are carried out by the bitwise XOR of the input data using the generated keystream. The main task with the design of ƒ8 was to make a steam cipher out of the block cipher KASUMI and there are several standard ways for this task, namely, cipher feedback (CFB), courter mode or output feedback (OFB) mode. For the function ƒ8 a combination of () and counter mode used for protection for KASUMI against chosen plaintext attack and protection against collision attacks (Günther, Howard & Niemi 2002). To prevent chosen plaintext attacks the initialization vector (IV) is encrypted with different key CK´ = CK+KM, where KM is a key modifier. This initial encryption is 28 also protection against collision attacks collision attacks as an attacker cannot freely choose the value which is XOR-ed with the block counter (Günther, Howard & Niemi 2002). 2.4.3.2 Integrity Algorithm As specified in (3GPP TS 33.102, 2006), some Radio Resource Control (RRC), Mobility Management (MM) and Call Control (MM) signaling information elements are considered sensitive and must be integrity protected. Cryptographic integrity function (ƒ9) shall be applied on certain signaling information elements transmitted between User Equipment (UE) and Serving Network (SN). For this task, UMTS Integrity Algorithm (UIA) has been specified and should be implemented in UE and in the RNC. The UMTS Integrity Algorithm (UIA) shall be used with an Integrity Key (IK) to compute a message authentication code for a given message. At least the following signaling elements sent by the UE to the RNC should be protected:  The UE capabilities, including authentication mechanism, ciphering algorithm and message authentication function capabilities.  The security mode accept/reject message.  The called party number in mobile originated call.  Periodic message authentication messages.  Various location updates, e.g. cell updates and URA updates As for RNC concerned, at least following signaling message sent by RNC to the UE should be protected:  The security mode command, including whether ciphering is enabled or not and the ciphering and integrity algorithm to be used.  Periodic message authentication message 29 A cryptographic function (ƒ9) is used to protect data integrity and authenticate the data origin of signaling data at the RRC layer. A cryptographic message authentication algorithm generates a fixed length message authentication code (MAC) from a message of arbitrary length, under the control of the secret parameter key and a set of initialization values. The sender and receiver generate the MAC using same function. 3GPP integrity algorithm ƒ9 computes 32-bit Message Authentication Code (MAC) of a given input message under integrity key (IK) and imposes no limitation on the input message length. The approach adopted uses KASUMI as used by the confidentiality algorithm ƒ8 in form of Cipher Block Chaining (CBC-MAC) mode. The input parameter to the integrity algorithm are the Integrity Key (IK), a time- and frame-dependent input (COUNT-I), a random value generated by the network side (FRESH), the direction bit (DIRECTION) and the signaling message (MESSAGE). The IK is cryptographic key that is newly generated at each authentication process. The COUNT-I, FRESH, and DIRECTION parameters can be considered as a set of initialization parameters that are renewed for each message to be authenticated. Based on these inputs parameters the user computes message authentication code for data integrity (MAC-I) using the UMTS Integrity Algorithm (UIA) ƒ9. MAC-I is then appended to the message when sent over the radio access link. The receiver computes XMAC-I on the message received in the same way as the sender computed MAC-I on the message sent and verifies the data integrity of the message by comparing it the received MAC-I. The input parameter COUNT protects against replay during a connection. It is a value incremented at both sides of the radio access link every 10ms layer 1 frame. Its initial value is sent by the user to the network at connection set-up. The user stores the last used COUNT value from the previous connection and increments it by one. In this way the user is assured that no COUNT value is re-used (by the network) with the same integrity key. The input parameter FRESH protects against replay of signaling message by the user. At connection set-up the network generates a random value FRESH and sent it to the user. The value FRESH is subsequently used by both the network and user throughout the during of a single connection. This mechanism assures the network that the user is not replaying and old MAC-Is. Following figure is depicted derivation of MAC-I and/or XMAC-I on a signaling message: 30 Figure 6. Derivation of MAC-I and/or XMAC-I o The 3GPP standard ƒ9 function (Niemi & Nyberg 2006) makes use of two 64-bit registers A and B. The initial value for both registers is set equal to 0: A=0 and B = 0. The function also makes use of a constant value for a 128-bit Key Modifier (KM) that is equal to 16 repetitions of the octet 0xAA = 10101010. The inputs of the ƒ9 function are as described above paragraphs, the value of all inputs are concatenated and then a single “1” bit is appended to this string, followed by between 0 and 63 “0” bits, so that the total length of the resulting string is an integer multiple of 64 bits. This string is called Padded String (PS) then: PS = PSo||PS1||PS2|| . .|| PS BLOCKS -1 (2.4) This Padded String (PS) is the data input to the Message Authentication Code (MAC) algorithm. It would also be possible to interpret the first block PS0 of PS as the initial value, since PSo = COUNT || FRESH, and this initial value would be different for each message. For each integer n with 0 ≤ n ≤ BLOCKS-1 the following operations are performed: A = KASUMIIK(A PSn) (2.5 B = B  A (2.6) Finally a further application of KASUMI is performed using a modified form of the Integrity Key (IK), as follows: B = KASUMIIK  KM(B) (2.7) 31 The output form KASUMI has 64 bits: MAC-I comprises the leftmost 32 bits of the result and the rightmost 32 bits are discarded. Following figure show ƒ9 integrity function Figure 7. The ƒ9 integrity function UIA negotiations, Integrity key lifetime, integrity protection procedures and local authentications are described in (3GPP TS 33.102, 2006) document. 2.5 KASUMI This subchapter gives detail description of the KASUMI cipher. Most text of this subchapter is extracted from 3GPP specifications (3GPP TS 35.201, 2007) and (3GPP TS 35.202) handling of KASUMI and (Niemi & Nyberg 2006). The KASUMI block cipher is used by the confidentiality algorithm (ƒ8), and the integrity algorithm (ƒ9) developed by Security Algorithms Group of Experts (SAGE) Technical Forum (TF) of 3GPP. As mentioned in subchapter 2.4, these algorithms were designed for specific use in the context of UMTS. They were designed for specific block cipher algorithm in mind, which was chosen as a starting point for what was going to be the kernel algorithm, a modified version of MISTY (Mitsure Matsui, 1997). In Parallel with the development of the ƒ8 and ƒ9 modes of operation, adjustments were also made to block cipher algorithm MISTY1 (Mitsure Matsui, 1997) and the final version of the block cipher algorithm is known as KASUMI—kasumi is Japanese for “hazy, dim blurred” (Niemi & Nyberg 2006). 32 2.5.1 Description of KASUMI KASUMI (3GPP TS 33.102, 2006), is a 64-bit block cipher that has a key size of 128bits. KASUMI was designed as modification of MISTY1 (Mitsuru Matsui, 1997), optimized for implementation in hardware. Therefore, the most of the components of KASUMI are similar to the respective components of MISTY1 (Mitsuru Matsui, 1997). KASUMI is a Feistel (3GPP TS 35.202, 2007) cipher with eight rounds. It operates on a 64-bit data block and uses a 128-bit key. The round function (or ƒi() f-function) used in the ith round of the Feistel cipher is denoted by ƒi. The f-function has a 32-bit input and a 32-bit output. Each f-function of KASUMI is composed of two functions an FL- function and an FO-function. An FO-function is defined as a network that makes use of three applications of an FI-function. An FI-function has a 16-bit input and a 16-bit output. Each FI-function comprises a network that makes use of two applications of a function S9 and two applications of a function S7. The functions S7 and S9 are also called “S-boxes of KASUMI”. In this manner KASUMI has similar three- layer nested structured of MISTY1 (Mitsuru Matsui, 1997) . In this manner KASUMI decomposes into a number of subfuntions (FL, FO and FI) that are used in conjunction with associated subkeys (KL, KO and KI). The outmost Feistel network comprises eight rounds, which are called in the specification outer rounds and numbered using index i, i = 1,2, ...., 8. Figure 8. (a) KASUMI; (b) FOi function; (c) FIi,j function; and (d) FLi funtion 33 The FL-functions and FO-functions used at each round of the Feistel network are numbered accordingly (i.e FLi, and FOi are functions used at the ith round of the outer network). Function FLi is used in conjunction with subkey KLi, and function FOi is used conjunction with two subkeys: KOi and Ki. The network formed by the eight FO-functions are called the inner networks and each one has three rounds indexed by j, j=1, 2, 3. Each round of an inner network makes use of a KO-key and an FI-function, the latter is used in conjunction with a KI-key. Consider the ith inner network FOi. The KO-key, FI-function and the KI-key used at the jth round of FOi are denoted as KOi,j FIi,j KIi,j respectively. In addition, the KI-key KIi,j splits into two halves KIi,j 1 and KIi,j 2 . 2.5.2 KASUMI Components and Encryption Function In ƒ8 and ƒ9 mode operation, the kernel function is only computed in one direction (Niemi & Nyberg, 2006). So even if the kernel function is a block cipher, the decryption transformation is never used. The purpose of the 3GPP is only the encryption function of KASUMI and that only has been defined. KASUMI operates on a 64-bit input (INPUT) using a 128-key (K) to produce 64-bit output (OUTPUT) as follows. INPUT is divided into two 32-bit strings L0 and R0, where: INPUT = L0 || R0 (2.8) Then for each integer i with 1 ≤ i ≤ 8 the operation on the ith round of KASUMI is defined as : Ri = Li-1, Li = Ri-1  ƒi(Li-1, RKi ) (2.9) This constitute the ith round function of KASAUMI, where Li-1 || Ri-1 is the input data block, Li || Ri is the output data bock and RKi is the ith round key, defined as a triplet of subkeys (KLi, KOi, KIi). Subkeys are derived from the key K using the key-scheduling algorithm. The output data block (OUTPUT) is defined as: OUTPUT = L8 || R8 (2.10) which is the data block offered at the end of the eighth round. In the specification of ƒ8 and ƒ9 this transformation is also denoted as OUTPUT = KASUMIK [INPUT] (2.11) 34 Components of KASUMI and their functionalities have been defined in (3GPP TS 35.202, 2007) and it is highlighted here shortly. 2.5.2.1 ƒ Functions According to (Niemi & Nyberg 2006) each f-function ƒi takes a 32-bit input I and returns a 32-bit output O under the control of round key RKi , where the round key comprises the triplet ((KLi, KOi, KIi). The f-function ƒi itself is constructed from two subfunctions: an FL-function FLi and an FO-function FOi with associated subkeys KLi (used with FLi) and subkeys KOi, and KIi (used with FOi). The f-function ƒi has two different forms depending on whether it is an even round or an odd round. For odd rounds I = 1, 3, 5 and 7.The f-function ƒi is defined as: ƒi( I, RKi) = FOi (FLi (I, KLi,), KOi, KLi ) (2.12) And for even rounds i = 2,4,6 and 8, The f-function ƒi is defined as: ƒi(I, RKi) = FLi (FOi (I, KOi, KIi), KLi (2.13) i.e for odd rounds first the FL-function and then the FO-function is applied to the round data, while for even rounds the order of the functions is changed. 2.5.2.2 Function FL The input function of FLi comprises a 32-bit data input I and a 32-bit subkey KLi. The subkey is split into two 16-bit subkeys, KLi,1 and KLi,2 , where : KLi = KLi,1 || KLi,2 (2.14) The input data I is split into two 16-bit halves, L and R, where I = L ||R. The FL- funtions make use of the following simple operations: ROL (D) the left circular rotations of a data block D by one bit D1 D2 the bitwise OR operation of two data blocks D1 and D2 D1 D2 the bitwise AND operation of two data blocks D1 and D2 Then the 32-bit output value of the FL-function is defined as L || R || where: L= L  ROL(R´ KLi,2 ) (2.15) R= R  ROL( L KLi,1) (2.16) 35 2.5.2.3 Function FO The input to function FOi comprises a 32-bit data input I and two sets of subkeys: a 48- bit, KOi and 48-bit KIi.The 32-bit data input is split into two halves, L0 and R0.where I = L0 || R0, while the 48-bit are subdivided into three 16-bit subkeys, where: KOi = KOi,1 || KOi,2 || KOi,3 and KIi. = KIi,1|| KIi,2 || KIi,3 (2.17) For each integer j with 1 ≤ j ≤ 3 the operation of the jth round of the function FOi is defined as: Rj = FIi, j(Lj-1 KOi j, KIi j )Rj-1 (2.18) L j = Rj-1 (2.19) Output from the FOi function is defined as the 32-bit data block L3 || R3. 2.5.2.4 Function FI The FI-function is depicted in Figure 8. The thick and thin lines in this diagram are used to emphasize the difference between the 9-bit and 7-bit data paths, respectively. An FI-function FIi, j takes a 16-bit data input I and 16-bit subkey KIi j. The input I is split into two unequal components, a 9-bit left half L0 and a 7-bit right half R0, where I = L0 || R0, . Similarly, the key KIi, j is split into a 7-bit component KIi,j,1 and a 9-bit component KIi j,2, where and KIi,j = KIi, j,1|| KIi, j, 2 . Each FI-function FIi, j uses two S- boxes: S7 which maps a 7-bit input to a 7-bit output and S9 which maps a 9-bit input to a 9-bit output. The FI-funcions also uses two additional functions, which are designated by ZE and TR. These simple functions are defined as following: ZE (D) takes a 7-bit data string D and converts it to a 9-bit data string by appending two zero bits to the most significant end of D. TR (D) takes a 9-bit data string D and converts it to a 7-bit value by discarding the two most significant bits of D. The function FIi, j is defined by the following series of operations: L1 = R0 R1 = S9[L0] ZE (R0) L2 = R1 KIi, j, 2 R2 = S7[L1] TR (R1) KIi,j,1 L3 = R2 R3 = S9[L2]  ZE (R2) L4 = S7 [L3]  TR(R3) R4 = R3 (2.20) The output of the FIi, function is the 16-bit data block L4 || R4, 36 2.5.2.5 S-boxes The two S-boxes (S7 and S9) have been designed so that they may be easily implemented in combinational logic or by a look-up table. Both forms are given for each S-box. The input x comprises either seven or nine bits with a corresponding number of bits in the output y. Therefore: x = x8 || x7 || x6 || x5 || x4 || x3 || x2 || x1 || x0 || and (2.21) y = y8 || y7 || y6 || y5 || y4 || y3 || y2 || y1 || y0 || (2.22) Where the x8, y8 and x7, y7 bits only apply to S9 and the x0, y0 bits are the least significant bits. Gate logical operations of S7 and S9 are described in detail (3GPP TS 35.202, 2007). 2.5.3 Key Schedule KASUMI has a 128-bit key K. Each round of KASUMI uses 128 bits of key that are derived from K. Before the round keys can be calculated two arrays of 16-bit values Kj and K΄j( j = 1, …,8) are derived in following manner. The first array K1, K2, ….K8 is derived by subdivision of K into eight 16-bit sub-blocks such that: K = K1 || K2 || K3 || K4 || K5 || K6 || K7 || K8 (2.23) The second array K`1, K`2, …,K`8 is derived from the first array by adding an array of 16-bit constants Cj as follows: K`j =K`j  Cj (2.24)Where the constants Cj are given in following table 1: Table 1. Constants Cj Then the subkeys (KL, KO and KI) are derived as defined by the following Table 2. Table 2. Subkeys (KL, KO, and KI) 37 2.5.4 KASUMI Operations Since KASUMI is a Feistel cipher with eight rounds, it operates in recursive structure manner in three round functions, namely, FO(), FI() and FL().Both functions of FO() and FI() has a subcomponents which are also have a Feistel-like structure. KASUMI encrypt 64 bits blocks by using a Feistel network of eight rounds. In each round, it alternate between the left and right halves of the state. In first level, one half enters the out round function FO() for processing. In second level the FO() function´s output is then XOR’ed to the opposite half of the state and then operation continues with next round. As explained earlier, FO() function itself consists of three round of inner round function FI() and it happens in similar Feister network recursive manner with half the block size. In third level of recursion FI() itself consists of four rounds of nonlinear S- Box transformations arranged in a Feiste structure. Following figure illustrate the KASUMI operations detail structures. Figure 9. KASUMI operations 38 3. THEORETICAL ANALYSIS OF CRYPTOGRAPHIC PERFORMANCE In this chapter, thesis will highlight the benchmarking process of analyzing cryptographic performance in order to understand the characteristics of the multi-core processor being measured. This chapter also provides a short description of the analysis of the potential variables that affect the cryptographic performance of a multi-core processor when measurements are made on the driver level and explains how these variables manifest themselves in the measured performances of the processor that is used. This chapter also introduces a theoretical hypothesis about the implementations of the KASUMI algorithm in both software and hardware. In this way, the predicted can be compared against measured results. Cryptographic algorithms are used in modern communication systems and have become very important components for ensuring data security. Traditionally, cryptographic algorithms were implemented either by software on a general-purpose processor or incorporated dedicated security hardware in the form of a specialized processor with cryptographic features. In this latter option, the cryptographic functions took place in an off-chip accelerator approach, which means adding another chip or daughter chip and writing low-level code for accessing data between the CPU and this external dedicated chip. Since cryptographic algorithms are computationally demanding, software-based implementations are very slow or their power consumptions are high, and thus, hardware-based implementations have been preferred, but even this hardware-based cryptographic implementations option also had its drawbacks; for instance, in order to have access to CPU functionalities, low-level code has to be written for transferring the data forth and back between the CPUs, system bus, memory, and other subsystems, which could have a huge effect on the overall performance of the systems. The present embedded system development is moving on to the next step of cryptographic algorithm implementations. Hardware-based implementations are moved on-chip or System-on-Chip (SoC), which means that mainstream CPUs incorporate cryptographic function hardware acceleration. These hardware accelerations were developed to offload the heavy processing load from the core CPU. These new SoC systems consist of an embedded CPU and an on-chip bus, a DMA controller, memories, and other subsystem modules, such as dedicated crypto coprocessors used for the accelerations of the cryptographic algorithms. The performance measurements are made for cryptographic algorithms on different levels, on the driver level, the application level, or the protocol stack level. This thesis makes 39 the performance measurements on the driver level. According to (Freescale Semiconductor 2008), there are variables which may influence the results of the measurements. As the demand for faster processing has increased over the years, the idea of multi-core has been born; in this two or more cores are combined into a single Integrated Circuit (IC) package and in this way, the multi-core technology takes advantage of parallel processing rather than increasing the frequency of the clock rate to achieve higher performance. Verifying the cryptographic performance claims of multi-core vendors can be a difficult task since there can be a difference between the theoretical cryptographic performance of the processor and its performance in a given application. Some of the results of the performance are inherited from its architecture, while others are due to the performance of the application itself, the protocol stacks of the software, or the underlying OS. According to (Waters & Stammberg 2009), there are many cryptographic acceleration implementations but there are two common cryptographic acceleration implementation architectures, namely flow-through and look-aside, which most accelerator implementations are based on. The following variables that have been identified in above mentioned reference have an influence on the performance measurements of the cryptographic implementations and would explain shortly as following. 3.1 Influential Variables for Cryptographic Performance Measurements In this subchapter, 4 variables that could affect the performance measurements of different cryptographic-based implementations are discussed. Although a detailed description of how each parameter affects the results lies outside the scope of this thesis work, they are defined here to raise awareness of their presence in the measurements. 3.1.1 Application or Protocol Stack Software Overheads Application and protocol stack overheads are the instructions a processor must execute in order to determine what sort of crypto processing is required. Application/protocol stack overheads are typically the greatest source of performance degradation and often the most shocking to users as they transition from a non-secure version of a given protocol to the secured versions. The reason for this degradation is that security protocols are stateful, so as the cryptographic keys that are used to encrypt and/or decrypt or authenticate the data have lifetimes, which can be measured in terms of the number of bytes encrypted,the number 40 of seconds since the key was first used, the number of seconds since the key was last used, or all of the above. Security protocols also add a header field to packets, allowing the packet to be forwarded to the other end of the security tunnel without revealing information about the parties to the communication. 3.1.2 Software Overheads Crypto API, application stacks, and driver overhead affect the interface between crypto hardware and application software stacks. Efficient interaction between OS and crypto accelerators is crucial in order to minimize their effects. With small packet sizes, software overheads are more evident with a large number of packets processed, but as the packet size increases, all software overheads become less important and the row performance of the look-aside accelerator becomes more critical. 3.1.3 Bus Bandwidth The amount of memory bandwidth consumed during look-aside crypto processing is significantly greater than what is consumed during a plaintext operation. Whether the data to be cryptographically processed are a large file or a small packet, the data must originally be moved from the network or peripheral interface to the system memory, moved from the system memory to the accelerator and back, and then moved from the system memory to a network or peripheral interfaces. In addition to this data movement, the security context, such as the crypto key, must also be fetched. There is also additional memory bandwidth consumption associated with additional look-ups, instruction fetches, or architecturally specific reads or writes. Crypto performance can be constrained if there is not enough bus bandwidth to keep the look-aside accelerator fully utilized. The bus bottleneck could be between the SOC and its system memory. 3.1.4 Data Size Is the test encrypting of a small or large chunk of data matter? The smaller the size of the chunk of data, the more the results will be influenced by the memory latency of the accelerator, the descriptor size, and other "non-data" context it must fetch to perform the operation, and the accuracy of the timer. 3.1.5 Iteration A good way to include accelerator DMA overheads and memory latency in a small amount of data while reducing timer resolution as a variable is to construct the test in such a way that the timer starts before iteration 1 and stops after the nth iteration, where n is a fairly large number. 41 4. EXPERIMENTAL SET-UP AND TOOLS This chapter will describe both the OCTEON HW and SW architectures, the setting up of the test environment, and the tools used; the software under test is also described briefly in this chapter. The experimental tests that are studied in this thesis were performed at the Nokia Siemens Networks laboratory in Espoo, in Finland. KASUMI cipher implementations have been tested in several other studies (Jääskeläinen 2003; Tomás Balderas-Contretras 2004; Tomás, Balderas-Contretras, and René, Cumplido 2005 and H.Kim, Choi, M.Kim and H.Ryu 2002). But none of those involved Cavium OCTEON Multicore processors. The OCTEON Multicore processors family contains a wide range of different product options in which some of the chips will contain a hardware accelerator for security algorithms like KASUMI. According to (Cavium Networks, 2009), a Cavium HW-based security accelerator should provide a substantial performance boost compared to a purely SW-based KASUMI implementation. The intention of the experimental KASUMI tests in this thesis is to verify those claims by comparing SW- and HW-based KASUMI implementations to each other. 4.1 OCTEON Hardware Architecture This section describes the hardware architecture that was used for running the experimental tests. The first subchapter first gives a high-level overview of the OCTEON processor, while the next subchapter describes its architecture in greater detail. This thesis does not intend to describe each and every fact about the OCTEON but just cover it on a high level and touch only on those issues that are germane to this study. For in-depth descriptions of the OCTEON hardware, refer to the OCTEON User’s Manual (Cavium Networks, 2009f). Many embedded applications require a higher performance response time for real-time systems. This demand makes it difficult for a single CPU to satisfy this ever-increasing hunger for high-performance applications, so the use of multiple CPUs has become very popular among embedded system manufacturers. A multi-CPU system is not the same as a multi-core system; the multi-core processor is a processing system that comprises two or more cores (CPUs) which are integrated in a single chip. Historically, more processing power was achieved by increasing the speed of the processor, but current multi-core processors could deliver better performance by forming multiple logical units into a core one single chip. 42 4.1.1 Cavium Networks MIPS (cnMIPS) MIPS (Microprocessor without Inter locked Pipeline Stages) is a RISC (Reduced Instruction Set Computer) processor architecture developed by MIPS Computer System Inc. The early MIPS architecture was 32-bit implementation and the latest MIPS64 is based on 64-bit wide registers and data path implementation. The Cavium Networks OCTEON product family architecture is based on officially licensed MIPS Inc. (MIPS Technology 2001:7-15) technology and uses the multi-core MIPS64 v2 processor instruction set supporting both 32-bit and 64-bit processing. Cavium Networks has added some custom instructions to accelerate common networking operations, such as a bit test branch instructions, security and packet processing instructions or bit-field insert/extract forming the cnMIPS core. These processors are software-compatible processors, with one to sixteen cnMIPS cores on a single chip and OCTEON II Internet application Processor (IAP) scale from 1 to 32 cores according to (Cavium Networks, 2009).The cnMIPS cores is Cavium Networks´s custom implementation of the MIPS64 release 2 instruction set. 4.1.1.1 OCTEON Product Overview The OCTEON product family includes OCTEON, OCTEON Plus and OCTEON II multi-core MIPS64 processors. All OCTEON processors are software compatible and supported by industry-standard software tool-chains and operating systems. All OCTEON products share the same architecture and more detail description of OCTEON products are shown in Cavium Networks Product Table (Cavium Networks, 2009c). In this thesis work, OCTEON Plus CN58XX product family is used for running on the experimental tests that are used for performance tests on KASUMI cipher implementations. This processor has a sixteen cnMIPS cores at up to 750MHz speed on a single chip. The CN58XX processor is provides security hardware acceleration which implements the latest set of algorithms including KASUMI cipher algorithms. 4.1.1.2 OCTEON Architecture Overview According to (Cavium Networks, 2009d) All OCTEON products share the same OCTEON architecture. Individual OCTEON products vary and scale based on the number of integrated cnMIPS cores, the frequency that the cores and the internal interconnects run at, the type and the number of I/O interfaces integrated, and the type and the number of application acceleration hardware units integrated. 43 According to (Cavium Networks, 2009e) key OCTEON features which most OCTEON models provide are:  Up to 32 cores, up to 1.5 GHz: The OCTEON family of multicore processors supports up to 32 cnMIPS with a speed from 300MHz to 1.5GHz.  Hardware Acceleration Units: Multiple hardware acceleration units are integrated into each processor. These acceleration units include; Packet- managment accelerators, Security acceleratiors, Application accelerators and specialized accelerators.  Dedicated DMA Engines: Dedicated DMA Engines are provided for each hardware unit which accesses memory.  High-Speed Interconnects: The hardware units and the cores ae connected by high-speed interconnects. These interconnects run at the same frequency as the cores. Each interconnect is a collection of multiple buses with extensive pipelining and sophisticated hardware arbitration logic.  Industry-Standard Tools chains and Operating Systems: Industry-standard tool chains (GCC, GDB) and operating systems (including SMP Linux) has been modified to utilize the OCTEON processor´s multiple cores, hardware accelerators and Cavium Networks –specific instructions.  Packet Management Acceleration: Packet receive/transmit is automated by software-configurable packet management accelerators.  TCP/UDP acceleration  Per-Core Security Hardware Acceleration. Common security algorithms including KASUMI are accelerated by optional per-core Security Engines. The architectural definition of the CN58XX product line provides a good perspective for illustrating the OCTEON architecture. Following figure is OCTEON Plus CN58XX block diagram. 44 Figure 10. Cavium Networks MIPS Architecture (cnMIPS) Although every units of OCTEON hardware is essential to work of this thesis work, in this subchapter, will be highlighted only some of the hardware –acceleration units that their functionalities are very essential to this thesis work. As described earlier, Hardware-Acceleration units consists of packet-management accelerators, security accelerators, application accelerators and specialized accelerators. Packet management accelerators and security accelerators has huge effect on work and the experimental test results analysis of this thesis work, so their functionalities would be described them in detail following paragraphs. Packet Management Accelerators in its part consists of Schedule/Synchronization and Order (SSO) unit that manages packet scheduling and ordering, Free Pool Allocator (FPA) unit that handles pools of free buffers including Packet Data buffers, and Input Packet Data (IPD) unit which manages packet input. Packet Input Processor (PIP) unit which works closely with IPD a handles with input packets, Packet Output Unit (PKO) manages packet output. PIP (Packet Input Packet) and IPD (Input Packet Data) units work together to receive the packet and perform early processing on it. Each packet is represented as “work” for the cores to do and is represented by a Work Queue Entry (WQE) data structure which 45 contains the tag type, tag value QoS value, group and pointer to Packet Data Buffer. These units are responsible for many layer-2 through layer-7 processing requirements such as exception checking and TCP/UDP checksum verification. Schedule/Synchronization and Oder (SSO) is responsible for scheduling the work to cores and also maintaining the packet ingress order. Cores may request work from SSO either asynchronously (the core continue to do other work while the instruction completes) or synchronously (the core waits for instruction to complete). In another word, SSO unit provides Scheduling, Synchronization and Ordering services. Packet Output (PKO) in its turn is responsible for packet transmission. When the packet processing is complete, the core notifies the PKO that the packet is ready for transmission and PKO manages the transmission priority. Security accelerators consists Random Number Generator (RNC) unit, Key Unit which provides and manages secure on chip memory which can be used to store a hardware key and can be reset using an external pin. Per-Core Security or Security Engine, this is special coprocessor used for accelerating security applications and hash generation and there is one Security-Coprocessor per core. Once the core issue an instruction to the coprocessor, the core can continue to do other work while the coprocessor complete the instruction, or the core can wait for the coprocessor to complete the task. Following figure illustrate the communication between core and its co-processor. Figure 11. Communication between core and its co-processor More detail descriptions of these units and how they functions is handled in (Cavium Networks, 2009e). 46 4.1.1.3 Other Units and Functionalities According to (Cavium Networks 2009c, 2009d) there are also other import units beside those been described in above paragraphs, such as On-Chip Interconnects which joins the different integrated units together. The OCTEON on-chip interconnect architecture involves a network of Coherent Memory Buses (CMB), and another network of interconnects connecting the I/O sub-system (IOB) together. Both of these networks are implemented based on multiple split-transaction and highly pipelined buses. These two networks, CMB, and IOB, are connected together through a high performance I/O bridge. The CMB network connects all the cnMIPS cores, L2 cache controller, memory controller, and I/O bridge together while the I/O sub-system (IOB) includes the controllers for the various I/O interfaces, all the application specific offload and acceleration engines, and the I/O bridge. The OCTEON architecture offers as well a cache hierarchy which includes split instruction and data L1 caches on each cnMIPS core, and a large L2 cache, up to 2MB, that is shared by all cnMIPS cores and the I/O subsystem. The OCTEON architecture includes also, a large variety of application specific offload and acceleration engines. Some of these acceleration features are implemented as individual functional units that are shared among all the cnMIPS cores, while some of these acceleration features are integrated into each of the cnMIPS cores. The decision of stand-alone functional units against integration into cnMIPS core is optimized for each of the application specific acceleration features in the OCTEON architecture. 4.2. OCTEON Software This subchapter highlights the OCTEON processor´s software overview specially software architecture and other software utilities that is used to access the hardware services. The OCTEON Software is a Software Development Kit parts that is intended to provide the tool, drivers Application Protocol Interface (APIs), Libraries and other utilities that is needed to access the underlying HW chipset solution. It also plays crucial role for providing services to higher level applications that run on the OCTEON processors. Following figure shows high level overview of software suits that this test system uses. 47 Figure 12. High level test system architecture The KASUMI Crypto software that is to be tested runs on OCTEON Simple Executive to leverage OCTEON per-core security acceleration to achieve maximum performance possible. 4.2.1 Simple Executive API In this thesis work KASUMI test are performed by using Octeon Simple Executive architecture instead of Linux based application. The reason for this choice is the services that normal Operating Systems provide to applications such as interrupts, context switches, kernel address space copy in/copy out with system calls causes some overhead to pay. The Simple Executive provides a Hardware Abstraction Layer (HAL) in the form of an Application Programming Interface (API) to the underlying hardware units. This API is a very thin layer of simple functions which access the Central Process Unit (CPU) registers. It also provides some convenience routines for block initialization. The API can be used from both kernel and user mode. In another word, Cavium OCTEON Simple Executive (SE) means that the executable SW will run directly on top of OCTEON HW without Operating System (OS) e.g Linux. Following figure is show the role of the Simple Executive API: 48 Figure 12. Simple Executive API The Simple Executive API is used to access the hardware units:  Basic units; FPA, IPD, PIP, SSO and PKO  Intermediate units: FAU and TIM  Advanced units: LLM, ZIP, RNG, DFA, KEY, CIU, etc. Simple Executive API also includes functions and macros for:  System memory allocation (bootmem)  Synchronization between cores  spinlocks  Reader-Writer locks  Atomic set, add, compare and store operations  Barrier functions Simple Executive functions and macros may be used either to create stand-alone Simple Executive application, or may be called from drivers or applications running on an operating system kernel such as Linux. For instance, after the Linux kernel is booted, a 49 Cavium Networks Ethernet driver may be started. This driver uses the Simple Executive API to configure the OCTEON hardware. Simple Executive User-Mode applications may also be started from Linux. Both 32-bit and 64-bit modes are supported, although 64-bit mode should be used whenever possible due to better overall performance. Following figure shows using Simple Executive API from different Runtime Environment: Figure 13. Simple Executive API runtime environments 4.2.2 Runtime Environment Choices for cnMIPS Cores There are several choices for runtime environment. The tree supplied by Cavium Networks is Simple Executive stand-alone mode, Linux and the hardware simulator: 1. When running Simple Executive on multiple cores, the same FLF file is usually run on all of the cores. These cores are all started from one load command. 2. When running Linux on multiple cores (SMP), there is one kernel running, not one kernel per core. Linux applications are schedule to run on different cores. 3. The third runtime environment supplied by Cavium Networks is the Hardware Simulator. The simulator is useful when actual hardware is not available and it also very useful for performance tuning. Performance tuning is most easily done using the tool Viewzilla. This tool analyzes the output of the simulator, so making sure the code will run on the simulator as well as on actual hardware is recommended for performance critical application. 50 In addition to the three runtime environments supplied by Cavium Networks that is described above, several open-source and proprietary operating systems are available. In this thesis work, the Simple Executive running as Linux Stand-alone (SE-S) application is used, so thesis considerate to explain in detail how it works. 4.2.2.1 Simple Executive Simple Executive provides an API to the hardware units. Simple Executive may run Stand-alone (SE-S) application, or as user-mode (SE-U) application on an operating system such as Linux, in another word, Linux kernel and applications may both make Simple Executive API calls. When Simple Executive calls are made from Linux user space, the process is referred to as a Simple Executive User-Mode application (SE-UM). For instance, when run as a user-mode application, different application start code ( main() ) is called and there other minor porting items to consider. The following figure shows a representation of a core running Simple Executive in Stand-alone (SE-S) mode. One core runs a Simple Executive Stand Alone (SE-S) Application. Figure 14. Simple Executive Stand-alone Application (SE-S) Simple Executive calls may be made from kernel mode. For example, the Cavium Networks Ethernet driver, which runs on Linux, makes Simple Executive calls: Figure 15. Simple Executive calls from kernel Mode The following figure shows a representation of a core running Simple Executive as a User-mode application. Figure 16. Simple Executive User-Mode (SE-UM) application SE-S Linux Driver SE-UM 51 To use all available memory, SE-UM applications should be compiled for 64-bit mode. 32-bit mode is sometimes used, but can only access a limited amount of physical memory. SE-S is very fast compared to SE-UM. There no context switches, and all the memory is mapped for fast access. To get maximum performance from OCTEON: Whenever possible, design the application to use a 64-bit Simple Executive application. 4.2.2.2 Application Configurations There are number of possible way to do the multi-core configuration, but one of the easiest configurations to implement is Simple Executive Stand-alone application configurations. Following figure shows example of 8-core simple system running a Simple Executive Stand-alone application uses 1 ELF file, 8 instance of SE-S. Figure 17. 8-core simple system running a SE-S application 4.2.3 Other Software Issue In this subchapter, thesis will go through some other steps and issue that are related to complying and running the software under the test. 4.2.3.1 Application Entry Point and Start up Code An application may be compiled as either SE-S or SE-UM application without modification. The code executed when application is started is not the same: when SE-S is the built target. The makefile $OCTEON_ROOT/executive/cvmx.mk is responsible for making this change. 4.2.3.2 Booting SE-S Application When the application is executed and SE-S object is compiled, then to boot Simple Executive application bootoct bootloader command is used. This command is used to boot application on one or more cores. Following figure shows cores running Simple Executive Stand-alone in a Load Set. 52 Figure 18. Simple Executive Stand-alone Mode (SE-S) application Following figure is shown in detail how booting SE-S applications with Bootoct command works: Figure 19. Downloading and Booting SE-S Applications 4.2.4 Software Architecture In most software design paradigm that intended to run embedded systems follows the separation of control and data planes in order to simplify the software architecture design. In cnMIPS software helps too to separate two basic types of processing: normal packet processing (fast path), and exception processing (slow path). Depending on the 53 number of cores available, different configuration of cores devoted to either fast path or slow path processing can be used to optimize throughput. 4.2.4.1 Control-plane versus Data-plane applications Typical telecom networking SE-S application functions are dived into two categories: control-plane (slow path), and data-path (fast path). In OCTEON processor the control- plane usually handles exceptions while the data-plane handles normal packet processing. Software application used in this thesis work falls into fast path category. According to (MIPS Technology 2001) SE-S applications may be used for both control- plane and data-plane. SE-S applications provide the lowest overhead and highest potential for scaling. Next best solution (a typical solution) is SE-UM for control plane and SE-S for data-plane. Following figure illustrate SE-S application divisions and only one core does the initialization routine. Figure 20. SE-S used both