UNIVERSITY OF VAASA

FACULTY OF TECHNOLOGY

SOFTWARE ENGINEERING

Abdi-Hakim Yasin Ararse

COMPARISON OF THE PERFORMANCE OF 3G SECURITY
ALGORITHMS IN THE NAS LAYER

Master’s thesis for the degree of Master of Science in Technology
submitted for inspection, Vaasa, (15th March 2013).

Supervisor D.Sc. (Econ.) Jouni Lampinen

Instructor M.Sc. (Tech.) Jani Kokkonen


2

ACKNOWLEDGEMENTS

First I would like to express my deepest appreciation to Professor
Jouni Lampinen and Jani Kokkonen for their guidance and valuable
advice throughout this Master’s thesis work.

I am also thankful to Lab. Engineer Jari Kallio for helping and
providing the tools required for the test setup.

Thanks to my family for their endless encouragement. Special thanks
to my loving wife Shamsa Mohamed, and my beloved children Nawar
and Muhammed. Without their mental support this thesis work would
not have been completed on time.

Abdi-Hakim Yasin Ararse

Helsinki, Finland, 15th of March, 2013


3

TABLE OF CONTENTS

ACKNOWLEDGEMENTS .......................................................................................................... 2

TABLE OF CONTENTS.............................................................................................................. 3

CONCEPTS .................................................................................................................................. 7

LIST OF SYMBOLS .................................................................................................................... 7

ABBREVIATIONS....................................................................................................................... 8

TABLE OF FIGURES ................................................................................................................ 11

LIST OF TABLE ........................................................................................................................ 12

ABSTRACT................................................................................................................................ 13

TIIVISTELMÄ ........................................................................................................................... 14

1. INTRODUCTION................................................................................................................... 15

2. UMTS...................................................................................................................................... 17

2.1 Introduction...................................................................................................................... 17

2.2 The Evolution of Mobile Telephony Systems.................................................................. 17

2.3 UMTS Network Architecture........................................................................................... 18

2.4 UMTS Radio Network ..................................................................................................... 20

2.4.1 Security Architecture UMTS ................................................................................... 21

2.4.2 Network Access Security to UMTS......................................................................... 22

2.4.2.1 User Identity Confidentiality ...................................................................... 22

2.4.2.2 Mutual Entity Authentication ..................................................................... 23

2.4.2.3 Data Integrity .............................................................................................. 23

2.4.3 Cryptographic Algorithms for UMTS...................................................................... 24

2.4.3.1 Confidentiality Algorithm .......................................................................... 25

2.4.3.2 Integrity Algorithm ..................................................................................... 28

2.5 KASUMI.......................................................................................................................... 31

2.5.1 Description of KASUMI.......................................................................................... 32


4

2.5.2 KASUMI Components and Encryption Function .................................................... 33

2.5.2.1 ƒ Functions.................................................................................................. 34

2.5.2.2 Function FL ................................................................................................ 34

2.5.2.3 Function FO ................................................................................................ 35

2.5.2.4 Function FI.................................................................................................. 35

2.5.2.5 S-boxes ....................................................................................................... 36

2.5.3 Key Schedule ........................................................................................................... 36

2.5.4 KASUMI Operations ............................................................................................... 37

3. THEORETICAL ANALYSIS OF CRYPTOGRAPHIC PERFORMANCE.......................... 38

3.1 Influential Variables for Cryptographic Performance Measurements ............................. 39

3.1.1 Application or Protocol Stack Software Overheads................................................. 39

3.1.2 Software Overheads ................................................................................................. 40

3.1.3 Bus Bandwidth......................................................................................................... 40

3.1.4 Data Size .................................................................................................................. 40

3.1.5 Iteration .................................................................................................................... 40

4. EXPERIMENTAL SET-UP AND TOOLS ............................................................................ 41

4.1 OCTEON Hardware Architecture.................................................................................... 41

4.1.1 Cavium Networks MIPS (cnMIPS).......................................................................... 42

4.1.1.1 OCTEON Product Overview ...................................................................... 42

4.1.1.2 OCTEON Architecture Overview .............................................................. 42

4.1.1.3 Other Units and Functionalities .................................................................. 46

4.2. OCTEON Software ......................................................................................................... 46

4.2.1 Simple Executive API .............................................................................................. 47

4.2.2 Runtime Environment Choices for cnMIPS Cores .................................................. 49

4.2.2.1 Simple Executive ........................................................................................ 50

4.2.2.2 Application Configurations......................................................................... 51


5

4.2.3 Other Software Issue................................................................................................ 51

4.2.3.1 Application Entry Point and Start up Code ................................................ 51

4.2.3.2 Booting SE-S Application .......................................................................... 51

4.2.4 Software Architecture .............................................................................................. 52

4.2.4.1 Control-plane versus Data-plane applications ............................................ 53

4.2.4.2 Event-driven Loop (polling) versus Interrupt-driven loop ......................... 53

4.2.4.3 Packet Processing Steps in OCTEON ........................................................ 54

4.2.4.4 Using Work Groups in Packet Processing .................................................. 55

4.2.4.5 Passing Work from one Core to another Core ............................................ 56

4.2.4.6 Pipelined versus Run-to-Completing Software Architecture ..................... 58

4.3 Software under the Test ................................................................................................... 59

5. TEST EXECUTION AND RESULT COLLECTION............................................................ 60

5.1 Test Scenarios Selection .................................................................................................. 60

5.2 Running the Test Cases.................................................................................................... 61

5.3 Hardware-based KASUMI Acceleration Case Simulation .............................................. 62

5.3.1 Case A: KASUMI HW Encryption and Decryption of packets of 40 bytes ............ 62

5.3.2 Case B: KASUMI HW Encryption and Decryption of packets of 88 bytes ............ 63

5.3.3 Case C: KASUMI HW Encryption and Decryption of packets of 500 bytes .......... 64

5.3.4 Case D: KASUMI HW Encryption /Decryption of packets of 1000 bytes.............. 64

5.4 Software-based KASUMI Acceleration Case Simulation ............................................... 65

5.4.1 Case A: KASUMI SW Encryption and Decryption on 40 bytes packets ................ 65

5.4.2 Case B: KASUMI SW Enc and Dec on 88 bytes packets........................................ 66

5.4.3 Case C: KASUMI SW Encryption and Decryption on 500 bytes packets............... 67

5.4.4 Case D: KASUMI SW Encryption and Decryption on 1000 bytes packets ............ 67

5.5 Key Generation/Expansion Effects on Test Cases........................................................... 68

5.6 Multi-core Test Cases ...................................................................................................... 68


6

5.6.1 Shorter Data Size for Multi-core Test Cases ........................................................... 68

5.6.1.1 KASUMI HW Encryption and Decryption ................................................ 68

5.6.1.2 KASUMI SW Encryption and Decryption ................................................. 69

5.6.2 Higher Data Size for Multi-core Test Cases ............................................................ 70

5.6.2.1 KASUMI HW Encryption and Decryption ................................................ 70

5.6.2.2 KASUMI SW Encryption and Decryption ................................................. 71

6.  RESULT COMPARISONS AND DETAIL ANALYSIS...................................................... 72

6.1 Comparisons between KASUMI HW and KASUMI SW ............................................... 72

6.1.1 Lower Data Rate with One Core.............................................................................. 72

6.1.1.1 KASUMI HW/SW Enc & Dec Comparisons with 40 bytes....................... 72

6.1.1.2 KASUMI HW/SW Enc and Dec Comparisons with 40 bytes ................... 73

6.1.1.3 KASUMI SW Enc/Dec vs. only SW Enc/SW Dec for 40 bytes ................ 74

6.1.1.4 KASUMI HW Enc/Dec vs. only HW Enc & HW Dec on 40 bytes ........... 74

6.1.1.5 KASUMI HW and SW Enc/Dec for 1000 bytes ........................................ 75

6.1.1.6 KASUMI HW Enc/Dec vs. only Enc and Dec for 1000 bytes ................... 76

6.1.1.7 KASUMI SW Enc/Dec vs. only SW Enc & SW Dec on 1000 bytes ......... 77

6.1.2 Higher Data Rate with Multi-core ........................................................................... 77

6.1.2.1 Comparison between KASUMI HW/SW Enc/Dec with 16K .................... 77

6.1.2.2 Comparison between KASUMI HW Enc/Dec with 24K ........................... 78

6.1.2.3 Comparison between KASUMI HW Enc/Dec with 64K ........................... 79

7. CONCLUSIONS AND FUTURE WORK ............................................................................. 81

REFRENCES .............................................................................................................................. 83

APPENDIX 1. Testing Environment .......................................................................................... 87

APPENDIX 2. More test case scenarios ..................................................................................... 91


7

CONCEPTS

Cellular Concept: the cellular concept is a system-level idea which calls for
a single high-power transmitter (large cell) to be replaced by many low-power
transmitters (small cells), each providing coverage to only a small portion of the service
area.

Domain: the highest-level group of physical entities. Reference points are defined
between domains.

Stratum: the grouping of protocols related to one aspect of the services provided by one
or several domains.

Stream Cipher: the stream concept is that Plaintext data are added bit by bit to random-
looking mask data that are generated by the Cipher Key (CK) and a few other
parameters.

Processor: this word is used in this thesis to denote the entire chip with all of the
different functional hardware blocks, including all the cores on the chip.

A Cryptographic Accelerator: is a device that performs processor-intensive
decrypting/encrypting while freeing the host CPU to perform other tasks.

Message authentication code: Is a short piece of information used to authenticate a
message and to provide integrity and authenticity assurance on the message.

LIST OF SYMBOLS

= The assignment operator.

 The bitwise exclusive -OR operation.

|| The concatenation of the two operands.


8

ABBREVIATIONS

1G 1st Generation

2G 2nd Generations

3G 3rd Generations

3GPP 3rd Generations Partnership Project

cnMIPS core networks standardized MIPS

AKA Authentication and Key Agreement

AN Access Network

AMPS Advanced Mobile Phone Service

CBC Cipher Block Chaining

CFB Cipher Feedback

CK Cipher Key

CN Core Network

CPU Central Processing Unit

CDMA Code Division Multiple Access

CS Circuit Switched

DES Data Encryption Standard

GPRS General Packet Radio Service

GSM Global System for Mobile Communication

IETF Internet Engineering Task Force

IK Integrity Key

I/O Input and Output

HE Home Environment

HW Hardware

HSPA High Speed Downlink Packet Access


9

HSPA+ Evolved High Speed Packet Access

HSUPA High Speed Uplink Packet Access

K Key

KM Key Modifier

OFB Output Feedback

PDA Personal Digital Cellular

PS Packet Switched

PS Padded String

LTE Long-Term Evolution

MAC Message Authentication Code

MIPS Microprocessor without Interlocked Pipeline Stages

MM Mobility Management

MME Mobility Management Entity

NAS Non-Access Stratum

NIST National Institute of Standard and Technology

Node B UMTS Base Station

RAN Radio Access Network

RANAP Radio Access Network Application Protocol

RNC Radio Network Controller

RRC Radio Resource Control

SAGE Security Algorithms Group of Experts

SN Serving Network

SoC System-On-Chip

SW Software

TDMA Time Division Multiple Access


10

TR Technical Report

TS Technical Specification

UE User Equipment

UEA UMTS Encryption Algorithm

UIA UMTS Integrity Algorithm

UMTS Universal Mobile Telecommunication System

USA United State of America

USIM Universal Subscriber Identity Module

UTRAN UMTS Terrestrial Radio Access Network

WCDMA Wideband Code Division Multiple Access


11

TABLE OF FIGURES

FIGURE 1. UMTS SYSTEM..........................................................................................................................19
FIGURE 2. UTRAN, ITS NETWORKS AND INTERFACES ................................................................................20
FIGURE 3. OVERVIEW OF SECURITY ARCHITECTURE...................................................................................22
FIGURE 4. CIPHERING USER AND SIGNALING DATA TRANSMITTED OVER THE RADIO ACCESS LINK. ............26
FIGURE 5. THE Ƒ8 STREAM CIPHER MODE ...................................................................................................26
FIGURE 6. DERIVATION OF MAC-I AND/OR XMAC-I O .............................................................................30
FIGURE 7. THE Ƒ9 INTEGRITY FUNCTION.....................................................................................................31
FIGURE 8. (A) KASUMI; (B) FOI FUNCTION; (C) FII,J FUNCTION; AND (D) FLI FUNTION............................32
FIGURE 9. KASUMI OPERATIONS ..............................................................................................................37
FIGURE 10. CAVIUM NETWORKS MIPS ARCHITECTURE (CNMIPS) ...........................................................44
FIGURE 11. COMMUNICATION BETWEEN CORE AND ITS CO-PROCESSOR .....................................................45
FIGURE 12. HIGH LEVEL TEST SYSTEM ARCHITECTURE ..............................................................................47
FIGURE 12. SIMPLE EXECUTIVE API ..........................................................................................................48
FIGURE 13. SIMPLE EXECUTIVE API RUNTIME ENVIRONMENTS .................................................................49
FIGURE 14. SIMPLE EXECUTIVE STAND-ALONE APPLICATION (SE-S) ........................................................50
FIGURE 15. SIMPLE EXECUTIVE CALLS FROM KERNEL MODE.....................................................................50
FIGURE 16. SIMPLE EXECUTIVE USER-MODE (SE-UM) APPLICATION .......................................................50
FIGURE 17. 8-CORE SIMPLE SYSTEM RUNNING A SE-S APPLICATION ..........................................................51
FIGURE 18. SIMPLE EXECUTIVE STAND-ALONE MODE (SE-S) APPLICATION..............................................52
FIGURE 19. DOWNLOADING AND BOOTING SE-S APPLICATIONS ...............................................................52
FIGURE 20. SE-S USED BOTH CONTROL PLANE AND DATA PLANE .............................................................53
FIGURE 21. CORE MAY ACCEPT WORK FROM ANY AND ALL GROUPS ..........................................................56
FIGURE 22. KASUMI HW ENC AND DEC ON 40 AND 88 BYTES PACKETS ..................................................63
FIGURE 23. KASUMI HW ENC AND DEC ON 500 AND 1000 BYTE PACKETS ..............................................65
FIGURE 24. KASUMI SW ENC AND D EC ON 40 AND 88 BYTES PACKETS ..................................................66
FIGURE 25. MULTICORE FOR KASUMI HW ENC AND DEC ON 64 BYTE PACKETS ....................................68
FIGURE 26. REFERENCE TIME FOR KASUMI HW ENC AND DEC ON 64 BYTE PACKETS............................69
FIGURE 27. MULTICORE FOR KASUMI SW ENC AND DEC ON 64 BYTE PACKETS ......................................69
FIGURE 28. REFERENCE TIME FOR KASUMI SW ENC AND DEC ON 64 BYTE PACKETS .............................70
FIGURE 29. MULTICORE FOR KASUMI HW ENC AND DEC ON 512 BYTE PACKETS ...................................70
FIGURE 30. REFERENCE TIME FOR KASUMI HW ENC/DEC ON 512 BYTE PACKETS...................................71
FIGURE 31. MULTICORE FOR KASUMI SW ENC/DEC ON 512 BYTE PACKETS ...........................................71
FIGURE 32. REFERENCE TIME FOR KASUMI SW ENC/DEC ON 512 BYTE PACKETS..................................71
FIGURE 33. NUMBER OF CYLES FOR HW/SW ENC/DEC ON 40 BYTE PACKETS ...........................................72
FIGURE 34. NUMBER OF CYLES PER BIT ON 40 BYTE PACKETS ....................................................................73
FIGURE 35. COMPARISON BETWEEN KASUMI SW IMPLEMENTATIONS .....................................................74
FIGURE 36. COMPARISON BETWEEN KASUMI HW IMPLEMENTATIONS ....................................................75
FIGURE 37. NUMBER OF CYLES FOR HW/SW ENC/DEC ON 1000 BYTE PACKETS .......................................76
FIGURE 38. COMPARISON BETWEEN KASUMI HW IMPLEMENTATIONS ....................................................76
FIGURE 39. COMPARISON BETWEEN KASUMI SW IMPLEMENTATIONS .....................................................77
FIGURE 40. NUMBER OF CYLES FOR HW/SW ENC/DEC ON 512 BYTE PACKETS WITH 16K.........................78
FIGURE 41. KASUMI HW ENC/DEC ON 512 BYTE WITH DATA BLOCKS OF 24K ........................................79
FIGURE 42. KASUMI HW ENC/DEC ON 512 BYTE DATA PACKET OF 64K .................................................80


12

LIST OF TABLE

TABLE 1. CONSTANTS CJ.............................................................................................................................36
TABLE 2. SUBKEYS (KL, KO, AND KI)........................................................................................................36
TABLE 4. KASUMI HW ENC AND DEC ON 40 BYTE PACKETS ....................................................................62
TABLE 5.KASUMI HW ENC AND DEC ON 88 BYTE PACKETS .....................................................................63
TABLE 6. KASIMI HW ENC AND DEC ON 500 BYTES PACKETS ..................................................................64
TABLE 8. KASUMI SW ENC AND DEC ON 40 BYTE PACKETS .....................................................................65
TABLE 9. KASUMI SW ENC AND DEC ON 88 BYTE PACKETS .....................................................................66
TABLE 10. KASUMI SW BASED IMPLEMENTATION ON 500 BYTE PACKETS................................................67
TABLE 11. KASUMI SW ENC AND DEC ON 1000 BYTE PACKETS ...............................................................67
TABLE 12. NUMBER OF CYCLES PER BIT ON 40 BYTE PACKETS ...................................................................73
TABLE 13. KASUMI SW ENC/DEC VS. ONLY ENC OR DEC ON 40 BYTES ..................................................74
TABLE 14. KASUMI HW ENC/DEC VS. ENC OR DEC ON 40 BYTES ............................................................75
TABLE 15. KASUMI HW/SW ENC AND DEC ON 512 BYTES WITH 16K DATA BLOCKS...............................78
TABLE 16. KASUMI HW ENC/DEC ON 512 BYTE WITH 24K DATA BLOCKS ...............................................79
TABLE 17. KASUMI HW ENC/DEC ON 512 BYTE WITH 64K DATA BLOCKS ...............................................80
TABLE A1. KASUMI HW ENCRYPTION ON 40 BYTE PACKETS ...................................................................91
TABLE A2. KASUMI HW ENCRYPTION ON 88 BYTE PACKETS ...................................................................91
TABLE A3. KASUMI HW ENCRYPTION ON 500 BYTES PACKETS ...............................................................92
TABLE A4. KASUMI HW ENCRYPTION ON 1000 BYTE PACKETS ...............................................................93
TABLE A5. KASUMI HW DECRYPTION ON 40 BYTE PACKETS ...................................................................93
TABLE A6. KASUMI HW DECRYPTION ON 88 BYTE PACKETS....................................................................94
TABLE A7. KASUMI HW DECRYPTION ON 500 BYTES PACKETS................................................................95
TABLE A8. KASUMI HW DECRYPTION ON 1000 BYTE PACKETS................................................................95
TABLE A9. KASUMI SW ENCRYPTION ON 40 BYTE PACKETS....................................................................96
TABLE A10. KASUMI SW ENCRYPTION ON 88 BYTES PACKETS ................................................................97
TABLE A11. KASUMI SW ENCRYPTION ON 500 BYTES PACKETS ..............................................................98
TABLE A12. KASUMI SW ENCRYPTION ON 1000 BYTES PACKETS ............................................................99
TABLE A13. KASUMI SW DECRYPTION ON 40 BYTES PACKETS ................................................................99
TABLE A14. KASUMI SW DECRYPTION ON 88 BYTES PACKETS ..............................................................100
TABLE A15. KASUMI SW DECRYPTION ON 500 BYTE PACKETS ..............................................................100
TABLE A16. KASUMI SW DECRYPTION ON 1000 BYTES PACKETS ..........................................................101


13

UNIVERSITY OF VAASA
Faculty of technology
Author: Abdi-Hakim Yasin Ararse
Topic of the Thesis: Comparison of the performance of 3G security

algorithms in the NAS layer
Supervisor: Professor Jouni Lampinen
Instructor: M.Sc.(Tech). Jani Kokkonen
Degree: Master of Science in Technology
Major of Subject: Software Engineering
Year of Entering the University: 2000
Year of Completing the Thesis: 2013 Pages: 101
Key words: KASUMI, 3G, UMTS security, Multi-Core

ABSTRACT

Cryptographic functionality implementation approaches have evolved over time, first,
for running security software on a general-purpose processor, second, employing a
separate security co-processor ,and third, using built-in hardware acceleration for
security that is a part of a multi-core CPU system. The aim of this study is to do
performance tests in order to examine the boost provided by accelerating KASUMI
cryptographic functions on a multi-core Cavium OCTEON processor over the same
non-accelerating cryptographic algorithm implemented in software.

Analysis of the results shows that the KASUMI SW implementation is much slower
than the KASUMI HW-based implementation and this difference increases gradually as
the packet size is doubled. In detailed comparisons between the encryption and
decryption functions, the result indicates that at a lower data rate, neither of the
KASUMI implementations shows much difference between encryption or decryption
processing, regardless of the increase in the number of data packets that are being
processed.

When all the 16 cores of the OCTEAN processor are populated, as the number of core
increases, the number of processing cycles decreases accordingly. Another observation
was that when the number of cores in use exceeds 5 cores, it doesn’t make much
difference to the number of decrease of processing cycles.

This work illustrates the potential that up to sixteen cnMIPS cores integrated into a
single-chip OCTEON processor provides for HW- and SW-based KASUMI
implementations.


14

VAASAN YLIOPISTO
Teknillinen tiedekunta
Tekija: Abdi-Hakim Yasin Ararse
Diplomityön nimi: Comparison of the performance of 3G security

algorithms in the NAS layer
Valvojan nimi: Professor Jouni Lampinen
Ohjaajan nimi: M.Sc.(Tech) Jani Kokkonen
Tutkinto: Diplomi-insinööri
Koulutusohjelma: Tietotekniikka
Suunta: Ohjelmistotekniikka
Opintojen aloitusvuosi: 2000
Diplomityön valmistusmisvuosi: 2013 Sivumäärä: 101
Key words: KASUMI, 3G, UMTS-Tietoturva, moniydin

TIIVISTELMÄ

Salaustekniikoiden toteutustavat ovat kehittyneet ajan myötä, kun ensiksi
yleiskäyttöisillä suorittimilla ajettavista tietoturvaohjelmista on siirrytty käyttämään
erillistä apusuoritinta ja kolmanneksi on siirrytty käyttämään sisäänrakennettua
laitteistokiihdytystä, joka on osana keskusyksikön moniydinsuoritinta tietoturvan
takaamiseksi. Tämän tutkimuksen tavoitteena on tehdä suorituskykytestejä, joilla
tutkitaan suorituskyvyn parannusta, joka saavutetaan KASUMI-salausalgoritmin
kiihdytyksellä moniytimisellä Cavium OCTEON suorittimella suhteessa
ohjelmistopohjaiseen, kiihdyttämättömään salausalgoritmiin.

Tulosten analyysi osoittaa, että KASUMI-algoritmin ohjelmistopohjainen toteutus on
paljon hitaampi kuin laitteistopohjainen toteutus, ja että tämä ero kasvaa asteittain, kun
paketin koko kaksinkertaistuu. Yksityiskohtaisissa vertailuissa salauksen koodauksen ja
koodauksen purkamisen välillä osoittavat, ettei alemmalla datanopeudella
kummassakaan KASUMI- toteutuksessa salauksen tai sen purkamisen käsittelyssä ole
suuria eroja riippumatta lisääntyvän käsiteltävien datapakettien koosta.

Kun kaikki kuusitoista OCTEON-suorittimen ydintä on otettu käyttöön, ja kun ydinten
määrä kasvaa, suoritusvaiheiden määrä vähenee vastaavasti. Toinen havainto oli, että
kun käytössä olevien ytimien määrä ylittää viisi ydintä, ei sillä ole vaikutusta
suoritusvaiheiden vähenemiseen.

Tämä tutkimus osoittaa ne mahdollisuudet, jotka jopa 16 cnMIPS ydintä integroituna
yhden piirin OCTEON-suorittimeen tarjoaa ohjelmisto- ja laitteistopohjaisissa
toteutuksissa.


15

1. INTRODUCTION

This chapter introduces the motivation for comparing KASUMI hardware accelerated
cryptography with the software-based KASUMI cryptography implementation, which
has become the most used option in a system-in-chip embedded system like the
OCTEON processor that is studied in this thesis. Afterwards, an overview of the
organization of the thesis will be given.

The Universal Mobile Telecommunication System (UMTS) is a so-called third-
generation (3G) mobile radio system and it is a successor to second-generation (2G)
systems such as the Global System for Mobile Communications (GSM) and General
Packet Radio Service (GPRS). The current UMTS system is in Release 8 and is called
Long-Term Evolution (LTE), which represents a flat architecture solution.

According to the 3GPP security specification (3GPP TS 33.102, 2006), the security
architecture is made up of a set of security features and security mechanisms that meets
one or several security requirements. The requirements for UMTS confidentiality and
integrity algorithms are specified by 3GPP in the technical specification document
(3GPP TS 33.105, 2007). Within the security architecture of the 3GPP system, there are
two mandatory security algorithms, namely, ƒ8 and ƒ9 or the UMTS Encryption
Algorithm (UEA1) and UMTS Integrity Algorithm (UIA1), respectively. The core
functionalities of these algorithms are based on the KASUMI stream cipher.

For the confidentiality algorithm, the UMTS Encryption Algorithm (UEA1) is a stream
cipher and it is used to encrypt and decrypt blocks of data under a confidentiality key
(CK).  The algorithm uses KASUMI in the form of Output Feedback (OFB) mode as a
key stream generator. For the integrity algorithm, the UMTS Integrity Algorithm
(UIA1) computes the Message Authentication Code (MAC) of a given input message
under an integrity key (IK) and imposes no limitation on the length of the input
message. The approach that has been adopted uses KASUMI as it is also used by the
confidentiality algorithm (UEA1) in the form of Cipher Block Chaining (CBC-MAC)
mode.

Cryptography is the art and science of encrypting and decrypting data in order to make
it impossible for outside parties to access the data. Cryptographic functionality
implementation approaches have evolved over time, first, with software running on
general-purpose processors, second, employing a separate custom security co-processor,
and third, using built-in hardware acceleration for security that is a part of a multi-core
CPU system.


16

This thesis examines the use of built-in hardware acceleration for cryptographic
functions in silicon, or, to give it another name, cryptographic functions in SoC. The
SoC system consists of both a hardware component and a software component, which
paves the way for the boundary between hardware and software being shrunk. Security
algorithms are generally dependent on intensive computation and they require many bit-
manipulating operations in order to transform back and forth between plaintext and
ciphertext.

Software running on a general-purpose processor is often inefficient in performing such
operations since the many instructions needed to implement cryptographic operations
consume valuable CPU resources and thus affect the performance of the system.
Hardware acceleration provides a better system performance implementation than the
software.

This thesis work focuses on verifying the cryptographic performance boost that is
provided by a Cavium OCTEON processor as an accelerator of KASUMI encryptions
and decryptions via the software-based KASUMI implementation. KASUMI encryption
and decryption are applied in the UMTS Non-Access Stratum (NAS) layer for ciphering
radio link access which is specified in the 3GPP release 7 specifications.

The structure of this thesis is as follows. Chapter 2 presents the existing literature work
in the field under study. Chapter 3 discusses the basics of performance analysis and how
it is applied in this thesis. Chapter 4 develops the relevant test scenarios and test
environment setups. Chapter 5 discusses the test execution and presents the test results.
Results and analysis of the measurements can be found in Chapter 6, followed by
overall conclusions in Chapter 7.


17

2. UMTS

2.1 Introduction

This chapter will first provide an insight into the history of mobile telephony systems,
from the first generation of mobile systems to the third generation. After taking a closer
look at the evolution of mobile systems, the second subchapter will highlight the basic
principle of mobile security used in the third-generation system. At the end, a detailed
description of the KASUMI cipher and its operations will be given.

2.2 The Evolution of Mobile Telephony Systems

The history of mobile telephony goes back to experiments with radio telephony in the
USA in the 1920s (Agar 2005). There are three different generations as far as mobile
telephony is concerned. The first mobile radio systems, the so-called first generation
(1G) networks, were based on an analogue radio path but used digital switching
technology. These mobile systems offered basic services for their users and the
emphasis was on speech and service-related matter. The other characteristics of these
early mobile systems were that networks were mainly national efforts and very often
they were specified after the networks were established. For this reason, the 1G
networks were incompatible with each other.

Later, the first real cellular systems were implemented, such as the analogue Advanced
Mobile Phone Service (AMPS) system in the USA. For the first time, frequencies were
reused, resulting in the interference inherent to cellular networks. Old analogue systems
have a common limitation in terms of wide area coverage and frequency spectrum reuse
in the systems as a result of the use of only one single radio transmitter (Walke,
Seidenberg &Althoff 2003).

In May 1972 Bell Labs introduced and patented a cellular concept that laid the
foundations for the second- and third-generation mobile radio systems. The cellular
concept is simple: instead of a single base station providing coverage of as large an area
as possible, each base station should only cover a small area. Further elaboration of the
cellular concept is provided in (Macdonald 1979) and (Rappaport 2002).

As the need for mobile communication increased, the need for a more global mobile
communication system also increased. The international specification bodies started to
specify the 2G; the so-called Second Generation (2G) network uses digital channels,
resulting in more efficient use of the spectrum.  Two main systems that established
themselves in the USA – the Time Division Multiple Access (TDMA) systems IS-54


18

and IS-136 – are based on a time slot structure and are  similar to the Global System for
Mobile Communications (GSM) (Althoff 2003) .

Japan also developed its own standard: Personal Digital Cellular (PDC), which also uses
the TDMA technology (3 time slots, 25-MHz channel bandwidth) and operates at 800
MHz and 1500 MHz.

The best-known Second Generation system is one that originated in Europe; the Global
System for Mobile (GSM) Communications was designed in the late ’80s by the state-
owned national telecommunication companies and harmonised for use throughout
Europe. GSM also employs the TDMA technology and uses 8 time slots on a 200-kHz
wide carrier frequency. GSM900 has a total of 124 frequency channels and GSM1800
has as many as 374.

The so-called Third-Generation (3G) mobile radio systems are based on the open
interfaces of its successor, GSM. The third generation, 3G, is expected to complete the
globalisation process of mobile communication. UMTS (Universal Mobile
Telecommunication System) is one 3G implementation. The current UMTS network
has been upgraded to High-Speed Downlink Packet Access (HSPA) in order to increase
the data rate and capacity for downlink packets and has been introduced as a 3GPP
release 5 features, and High-Speed Uplink Packet Access (HSUPA) is introduced in
3GPP release 6 in order to boost uplink performance in a UMTS network.

The combination of HSDPA and HSUPA is often referred to as HSPA. However, even
with these improvements and the introduction of HSPA, the evolution of UMTS has not
reached its end. In a 3GPP release 7, HSPA+ has been introduced, with a significant
enhancement and improvement to the performance of HSPA-based radio networks in
terms of spectrum efficiency, peak data rate, and latency.

2.3 UMTS Network Architecture

There are different ways to visualise a UMTS network, depending on which angle you
look from. One angle to look from is the functions of the network in terms of how the
traffic is handled. Another approach is to study the functions of the network elements.
In this thesis work, the network will be looked at from the point of view of both the
physical and functional viewpoints.

The physical aspects are modeled using the domain concept and the functional aspects
are modeled using the strata concept (3GPP TS 23.110, 2007). Figure 1 (below)
illustrates the basic domains in a UMTS system.


19

User Equipment
Domain

Access
Network
Domain

Core
Network
Domain

Infrastructure
Domain

Cu

Mobile
Equipment
Domain

USIM
Domain

Home
Network
Domain

Transit
Network
Domain

Uu Iu

[Zu]

[Yu]

Serving
Network
Domain

Figure 1. UMTS System

Cu Reference point between USIM and UE
Iu Reference point between Access and Serving Network domains
Uu Reference point between User Equipment and Infrastructure domains,

UMTS radio interface
[Yu] Reference point between Serving and Transit Network domains
[Zu] Reference point between Serving and Home Network domains

According to (3GPP TS 23.101, 2007), the basic UMTS architectural is split into a User
Equipment domain and an Infrastructure domain. The main interest of this thesis work
lies in the Infrastructure domain, which could be further split into an Access Network
domain and a Core Network domain. From functionality point of view, UMTS network
infrastructure is logically divided into Core Network (CN) and Access Network (AN)
subsystems as specified in (3GPP TS 23.101, 2007) and (3GPP TS 23.110, 2007)
respectively.

Each subsystem can be further divided into separate technologies. For example, the
RAN (Radio Access Network) is compromised of different air interface technologies,
such as GERAN (GSM, EDGE, and Radio Access Network), UTRAN (UMTS
Terrestrial Radio Access Network), and future solutions such as WLAN and 4G.

The core network is today clearly divided into two domains:
 the Circuit-Switched (CS) domain and
 the Packet-Switched (PS) domain.

A more detailed description of the UMTS architecture is provided in the 3GPP Network
Architecture document (3GPP TS 23.002, 2007). The interest of this thesis work lies


20

mainly in two interfaces, namely Iu and Iur. More information on UMTS interfaces can
be found in specifications (3GPP TR 23.930, 1999) and (3GPP TS 25.420, 2007).

2.4 UMTS Radio Network

The UMTS Terrestrial Radio Access Network (UTRAN) consists of a set of Radio
Network Subsystems (RNS) connected to the Core Network (CN) through the Iu
interface. A RNS consists of a Radio Network Controller and or more Node Bs. A Node
B is connected to the RNC through the Iub interface (3GPP TS 25.401, 2002).

The main task of the UTRAN is to create and maintain Radio Access Bearers (RAB) for
communication between the UMTS User Equipment (UE) and core network (CN).
With RAB the CN elements are given an illusion about a fixed communication path to
the UE, thus releasing them from the need to take care of radio communication aspects.
There is two open interfaces: Uu and Iu, which are used by UTRAN to talk with the UE
and CN. Since there are packet switched and circuit switched domains for different
services, Iu-interface was separated into the Iu-CS and Iu-PS interfaces. Detail
description of UTRAN protocol architecture is specified in (3GPP TS 25.401, 2002)
document.

The UTRAN architecture is shown following figure.

Figure 2. UTRAN, its networks and interfaces

The focus of this thesis work is on security parts of Non-Access Stratum (NAS)
signaling between UE and CN which passes through UTRAN network. The overall


21

protocol architecture of radio interface is layered into three protocol layers, namely, the
physical layer (L1), the data link layer (L2) and network layer (L3) as described in
(3GPP TS 25.301, 2007) .  L3 provides Uu Stratum services and functions as whole and
detail description of L3 protocol, Radio Resource Control (RRC) is given in (3GPP TS
25.331 2006).

2.4.1 Security Architecture UMTS

UMTS security builds on the security of GSM, inheriting the proven GSM security
features. This maximizes the backward compatibility between GSM /UMTS and
UMTS/LTE. UMTS also provides a solution to the weaknesses of GSM security and
adds security features for new 3G radio access networks and services.

According to specifications, the security architecture is made up of a set of security
features and security mechanisms (3GPP TS 33.102, 2006). A security feature is a
service capability that meets one or several security requirements that are defined in
(3GPP TS 21.133, 2002) and implement the security objectives and principles described
in (3GPP TS 33.120, 2001).

A Security mechanism is an element that is used to realize a security feature and all the
security features and security mechanism taken together to forms the security
architecture (3GPP TS 33.102, 2006).

UMTS consists of five security feature groups:

I) Network Access Security provides users with secure access to UMTS services and
protect against attacks on the radio access link.
II) Network Domain Security protects against attacks on the wireline network and
allows nodes in the provider domain to exchange signaling data securely.
III) User Domain Security provides secure access to mobile stations.
IV) Application Domain Security allows the secure exchange of messages between
applications in the user and in the provider domain.
V) Visibility and configurability of security allows the user to observe whether a
security feature is currently in operation and if certain services depend on this security
feature

Figure 3 shows the way security features are grouped together into five different sets of
features, each one facing a specific threat and accomplishing certain security objectives.


22

Figure 3. Overview of security architecture

The focus of this thesis work is on network access security (group I) and other UMTS
security area are outside the scope of this thesis work.

2.4.2 Network Access Security to UMTS

Network Access Security features enables users to securely access services provided by
the UMTS network. It consist a set of security features that provide users with secure
access to UMTS services in which particular protect against on the radio access link.
Network access security features can be further classified into the following categories,
namely, user identity confidentiality, entity authentication and confidentiality, data
integrity and mobile equipment identification as specified in (3GPP TS 33.102, 2006).

2.4.2.1 User Identity Confidentiality

The main objectives of the user identity confidentiality features are to provide
mechanisms which prevent intruders from eavesdropping on the radio access link. It
also makes sure that the user location confidentiality is secured, so the presence or the
arrival of the user in a certain area cannot be determined by eavesdropping on the radio
access. It also provides user un-traceability mechanism, so that an intruder cannot
deduce whether different services are delivered to the same user by eavesdropping on
the radio access link.

To achieve these objectives, the user is normally identified by a temporary identity by
which he is known by the visited serving network, or by an encrypted permanently
identity. To avoid user traceability, which may lead to the compromise of the user
identity confidentiality, the user should not be identified for a long period by means of
the same temporary or encrypted identity. To achieve these security features, in addition
it is required that any signaling or user data that might reveal the user’s identity is
ciphered on the radio access link.


23

2.4.2.2 Mutual Entity Authentication

There are three entities involved in the authentication mechanism of the UMTS system,
Home Environment (HE), Serving Network (SN) and the User Entity (UE), more
specifically Universal Subscriber Identity Module (USIM). Authentication and Key
Agreement (AKA) mechanism accomplishes this mutual authentication of the user and
the network using a symmetric key (K) and derives the new cipher and integrity keys.
Detail description of Authentication and Key Agreement (AKA) mechanisms are
discussed in (Arkko &  Haverinen 2006), (Kambourakis, Rouskas & Gritzalis 2004) and
(Grecas, Maniatis & Vernieris 2003).

Once the user and the network have authenticated each other, they may begin secure
communication. Encryption and decryption take place in the UE and in RNC on the
network side, which means that Cipher Key (CK) has been transferred from the core
network (CN) to the Radio Access Network (RAN). This is done in a specific Radio
Access Network Application Protocol (RANAP) message, called the security mode
command. After the RNC has obtained the CK, it can switch encryption on the sending
a Radio Resource Control (RRC) security code command to the UE. The UMTS
encryption mechanism is based on a steam cipher concept [see chapter concepts]. The
core of encryption mechanism is the mask generation algorithm, which is denoted as
function f8. The specification is available (3GPP TS 35.201, 2007) and it is based on a
novel block cipher called KASUMI in Universal Mobile Telecommunication System
(UMTS) and uses SNOW 3G in Long-Term Evolution (LTE).

2.4.2.3 Data Integrity

The purpose of integrity data protection is to provide a mechanism that the User
Equipment (UE) and Serving Network (SN) can securely negotiate the integrity
algorithm that they shall use subsequently in order to authenticate individual control
messages.  It also provides a mechanism that the User Equipment (UE) and Serving
Network (SN) agree on an integrity key that they may use subsequently. Lastly, data
integrity feature provide a mechanism that receiving entity (UE or SN) is able to verify
that signaling data has not been modified in an authorized way since it was sent by the
sending entity (UE or SN) and that data origin of the signaling data received is indeed
the one claimed.

According to (Niemi & Nyberg 2006), this mechanism is important, since separate
authentication procedures only give assurance of the identities of the communicating
parties at the time of the authentication. A Message Authentication Code (MAC)
function is applied to each individual signaling message at the RRC layer of UMTS
Terrestrial Radio Access Network (UTRAN) protocol stack (3GPP TS 25.331 2006).

The integrity protection of signaling messages, between UE and RNC starts during the
security mode set-up as soon as the integrity key and integrity protection algorithm is
known.  Integrity protection is based on a message authentication code concept [see in
concept chapter] which is a one-way function controlled by the secret Integrity Key


24

(IK). The function is devoted by f9 and its output is MAC-I a 32 bit, random-looking bit
string. The algorithm for integrity protection is based on the same core function as
encryption.

This thesis work focus mainly on mechanisms used for confidentiality and data integrity
security features, so mobile equipment identification and entity authentication related
mechanisms are outside the scope of this thesis work.

2.4.3 Cryptographic Algorithms for UMTS

This section will highlight the cryptographic algorithms used in UMTS and explain
their functionalities in detail. Further, this section presents the 3GPP UMTS security
algorithm specifications and further descriptions of UMTS security functionalities
provided by security experts, namely Valtteri Niemi and Kaisa Nyberg in their book:
UMTS Security.

In cryptography, a block cipher is a symmetric key cipher which operates on fixed-
length group of bits. The block cipher transforms a plaintext block of fixed lengths into
a sequence of ciphertext blocks of equal length under the control of a secret key K. The
operation of transforming a plaintext block into a ciphertext block is called encryption,
and the operation of transforming a ciphertext block back to plaintext block is called
decryption. A block cipher applies the encryption algorithms and key to an entire block
of plaintext to obtain the ciphertext.

To encrypt messages longer than the block size or in another word in order to provide
confidentiality for messages of arbitrary length, then a mode of operation is used. The
four most widely known modes of operation were originally standardized by National
Institute of Standard and Technology (NIST) for use with the Data Encryption Standard
(DES) algorithm (NIST 1981).

A stream cipher is a cryptographic algorithm for encrypting plaintext similarly as to
block ciphers but the plaintext is partitioned to sequence of blocks. The main difference
between them is that in the block cipher, the current plaintext block is not taken as data
input for cryptographic transformation. Instead, a string of bits, often called a
“keystram” block, is generated independently of the current plaintext block and then
combined with the plaintext using a simple operation, most commonly the XOR
operation. Due to this functional difference, stream ciphers typically operate on
plaintext blocks of shorten length, which can be just 1 bit or 1 byte.


25

The requirement for UMTS confidentiality and integrity algorithms are specified by
3GPP in the technical specification document (3GPP TS 33.105, 2007). Within the
security architecture of the 3GPP system, there are standardized algorithms versions for
UMTS confidentiality algorithm UEA1 and integrity algorithm UIA1 There are two
versions for UMTS security algorithms: 1.0 and 1.1. Version 1.1 must be used.

In following sub-sections confidentiality and Integrity algorithms will be elaborated
more as well as their structure and functionalities.

2.4.3.1 Confidentiality Algorithm

For data confidentiality of user data and signaling data, UMTS uses a cryptographic
function called (ƒ8). Confidentiality algorithm (UEA1) is a stream cipher is used to
encrypt and decrypt blocks of data under a confidentiality key (CK). The block of data
may be between 1 and 20, 000 bits long. The algorithm use KASUMI in a form of
Output Feedback (OFB) mode as a key stream generator in UMTS and uses. The detail
description of f8 algorithm is specified in (3GPP TS 35.201, 2007).

The 3GPP ƒ8 stream cipher mode is not a standard stream cipher mode of operation of a
block cipher. Examples of such standard modes are counter mode and OFB mode
(NIST 1981). A counter mode keystream generator makes us of the generator that is
updated for each new block and is taken as part of the input to the generator function.
The ƒ8 stream cipher mode can be seen as a combination of these two standard modes
and makes use of prewhitening of feedback data (Niemi & Nyberg 2006).

In UMTS, The ƒ8 algorithm makes use of the KASUMI key-dependent function, which
operates on 64-bit data blocks and produces 64-bit blocks under control of a 128-bit key
K. In LTE, ƒ8 uses SNOW 3G as a keystream generator, which generates a sequence of
32-bit words under the control of a 128-bit key and a 128-bit initialization variable. The
input parameters to ƒ8 are the Cipher Key (CK), the time-dependent input (COUNT-C),
the bearer identity (BEARER), the direction of the transmission (DIRECTION) and the
length (LENGTH) of the plaintext.

The Cipher Key (CK) is renewed at every authentication process. COUNT-C, BEREAR
and DIRECTION can be considered as initialization parameters as they are renewed for
each keystream block. The time-dependent input COUNT-C also sent in cleartext and
used as a synchronization parameter for the synchronous stream cipher. The input
parameter LENGTH only affects the length of the keystream, not the actual bits in it.
The exact structure of these parameters and further description of ƒ8 algorithm
operation it is specified in (3GPP TS 33.102, 2006) and in (3GPP TS 35.201, 2007).


26

Following figure illustrate the use of ƒ8 function to encrypt plaintext by applying a
keystream using a bitwise XOR operation. The plaintext may be recovered by
generating the same keysteam using the same input parameters and applying it to the
ciphertext using a bitwise XOR operation.

Figure 4. Ciphering user and signaling data transmitted over the radio access link.

The ƒ8 algorithm makes use of two 64-bit registers as described in (Niemi & Nyberg
2006): the static register A and the counter BLKCNT. Register A is initialized using the
64 bit initialization value:

IV = COUNT ||BEARER||DIRECTION||0…0 obtained as the concatenation of the 32-
bit COUNT, 5-bit BEARER, 1-bit DIRECTION value and a string of 26 zero bits. The
counter BLKCNT is set to 0.

Figure 5. The ƒ8 stream cipher mode


27

The ƒ8 algorithm makes use of a Key Modifier (KM) constant that is equal to the octet
0 x 55 = 01010101 repeated 16 times. First, a single operation of KASUMI is applied to
register A, using a modified version of the CK to compute the prewhiteninig value:

W = KASUMICK  KM(IV) (2.1)

which is stored in register A. Once the keystream generator has been initialized in this
manner, it is ready to be used to generate keystream bits. The plaintext/ciphertext to be
encrypted/decrypted consists of LENGTH bits, where LENGTH varies between 1 to
20,000 with granularity of 1 bit, while the keystream generator produces keystream bits
in multiples of 64 bits. Between 0 and 63 of the least significant bits are discarded from
the last block depending of the total number of bits required by LENGTH.

The number of required keystream bits is denoted by BLOCKS, whose value is
determined by the value of the LENGTH parameter as follows: the value of LENGTH is
divided by 64 and the result is rounded up to the nearest integer. The keystream blocks
are denoted as KSB1, KSB2 ,…,KSBblocks. Set KSBo=0 and let n be an integer with 1
≤ n ≤ BLOCKS, such that n = BLKCNT +1, and set:

KSBn = KASUMICK( W  (n-1) KSB n-1) (2.2)

Individual bits KS[0],KS[1],…….KS[LENGTH-1] of the keystream are extracted in
turn from KSB1 to KSBblocks, with the most significant bit extracted first, by applying
the following operation. For n = 1, …., BLOCKS and for each integer i, with 0 ≤ i ≤ 63,
set :

KS[((n-1) * 64) - i] = KSBn[i] (2.3)

Encryption/decryption operations are identical and are carried out by the bitwise XOR
of the input data using the generated keystream.

The main task with the design of ƒ8 was to make a steam cipher out of the block cipher
KASUMI and there are several standard ways for this task, namely, cipher feedback
(CFB), courter mode or output feedback (OFB) mode. For the function ƒ8 a
combination of () and counter mode used for protection for KASUMI against chosen
plaintext attack and protection against collision attacks (Günther, Howard & Niemi
2002).

To prevent chosen plaintext attacks the initialization vector (IV) is encrypted with
different key CK´ = CK+KM, where KM is a key modifier. This initial encryption is


28

also protection against collision attacks collision attacks as an attacker cannot freely
choose the value which is XOR-ed with the block counter (Günther, Howard & Niemi
2002).

2.4.3.2 Integrity Algorithm

As specified in (3GPP TS 33.102, 2006), some Radio Resource Control (RRC),
Mobility Management (MM) and Call Control (MM) signaling information elements
are considered sensitive and must be integrity protected. Cryptographic integrity
function (ƒ9) shall be applied on certain signaling information elements transmitted
between User Equipment (UE) and Serving Network (SN). For this task, UMTS
Integrity Algorithm (UIA) has been specified and should be implemented in UE and in
the RNC.

The UMTS Integrity Algorithm (UIA) shall be used with an Integrity Key (IK) to
compute a message authentication code for a given message. At least the following
signaling elements sent by the UE to the RNC should be protected:

 The UE capabilities, including authentication mechanism,
ciphering algorithm and message authentication function
capabilities.

 The security mode accept/reject message.

 The called party number in mobile originated call.

 Periodic message authentication messages.

 Various location updates, e.g. cell updates and URA updates

As for RNC concerned, at least following signaling message sent by RNC to the UE
should be protected:

 The security mode command, including whether ciphering is
enabled or not and the ciphering and integrity algorithm to be used.

 Periodic message authentication message


29

A cryptographic function (ƒ9) is used to protect data integrity and authenticate the data
origin of signaling data at the RRC layer. A cryptographic message authentication
algorithm generates a fixed length message authentication code (MAC) from a message
of arbitrary length, under the control of the secret parameter key and a set of
initialization values. The sender and receiver generate the MAC using same function.

3GPP integrity algorithm ƒ9 computes 32-bit Message Authentication Code (MAC) of a
given input message under integrity key (IK) and imposes no limitation on the input
message length. The approach adopted uses KASUMI as used by the confidentiality
algorithm ƒ8 in form of Cipher Block Chaining (CBC-MAC) mode.

The input parameter to the integrity algorithm are the Integrity Key (IK), a time- and
frame-dependent input (COUNT-I), a random value generated by the network side
(FRESH), the direction bit (DIRECTION) and the signaling message (MESSAGE). The
IK is cryptographic key that is newly generated at each authentication process. The
COUNT-I, FRESH, and DIRECTION parameters can be considered as a set of
initialization parameters that are renewed for each message to be authenticated. Based
on these inputs parameters the user computes message authentication code for data
integrity (MAC-I) using the UMTS Integrity Algorithm (UIA) ƒ9.

MAC-I is then appended to the message when sent over the radio access link. The
receiver computes XMAC-I on the message received in the same way as the sender
computed MAC-I on the message sent and verifies the data integrity of the message by
comparing it the received MAC-I. The input parameter COUNT protects against replay
during a connection. It is a value incremented at both sides of the radio access link
every 10ms layer 1 frame. Its initial value is sent by the user to the network at
connection set-up. The user stores the last used COUNT value from the previous
connection and increments it by one. In this way the user is assured that no COUNT
value is re-used (by the network) with the same integrity key.

The input parameter FRESH protects against replay of signaling message by the user.
At connection set-up the network generates a random value FRESH and sent it to the
user. The value FRESH is subsequently used by both the network and user throughout
the during of a single connection. This mechanism assures the network that the user is
not replaying and old MAC-Is.

Following figure is depicted derivation of MAC-I and/or XMAC-I on a signaling
message:


30

Figure 6. Derivation of MAC-I and/or XMAC-I o

The 3GPP standard ƒ9 function (Niemi & Nyberg 2006) makes use of two 64-bit
registers A and B. The initial value for both registers is set equal to 0: A=0 and B = 0.
The function also makes use of a constant value for a 128-bit Key Modifier (KM) that is
equal to 16 repetitions of the octet 0xAA = 10101010. The inputs of the ƒ9 function are
as described above paragraphs, the value of all inputs are concatenated and then a single
“1” bit is appended to this string, followed by between 0 and 63 “0” bits, so that the
total length of the resulting string is an integer multiple of 64 bits. This string is called
Padded String (PS) then:

PS = PSo||PS1||PS2|| . .|| PS BLOCKS -1 (2.4)

This Padded String (PS) is the data input to the Message Authentication Code (MAC)
algorithm. It would also be possible to interpret the first block PS0 of PS as the initial
value, since PSo = COUNT || FRESH, and this initial value would be different for each
message. For each integer n with 0 ≤ n ≤ BLOCKS-1 the following operations are
performed:

A = KASUMIIK(A PSn) (2.5

B = B  A (2.6)

Finally a further application of KASUMI is performed using a modified form of the
Integrity Key (IK), as follows:

B = KASUMIIK  KM(B) (2.7)


31

The output form KASUMI has 64 bits: MAC-I comprises the leftmost 32 bits of the
result and the rightmost 32 bits are discarded.

Following figure show ƒ9 integrity function

Figure 7. The ƒ9 integrity function

UIA negotiations, Integrity key lifetime, integrity protection procedures and local
authentications are described in (3GPP TS 33.102, 2006) document.

2.5 KASUMI

This subchapter gives detail description of the KASUMI cipher. Most text of this
subchapter is extracted from 3GPP specifications (3GPP TS 35.201, 2007) and (3GPP
TS 35.202) handling of KASUMI and (Niemi & Nyberg 2006).

The KASUMI block cipher is used by the confidentiality algorithm (ƒ8), and the
integrity algorithm (ƒ9) developed by Security Algorithms Group of Experts (SAGE)
Technical Forum (TF)  of 3GPP. As mentioned in subchapter 2.4, these algorithms were
designed for specific use in the context of UMTS. They were designed for specific
block cipher algorithm in mind, which was chosen as a starting point for what was
going to be the kernel algorithm, a modified version of MISTY (Mitsure Matsui, 1997).
In Parallel with the development of the ƒ8 and ƒ9 modes of operation, adjustments were
also made to block cipher algorithm MISTY1 (Mitsure Matsui, 1997) and the final
version of the block cipher algorithm is known as KASUMI—kasumi is Japanese for
“hazy, dim blurred” (Niemi & Nyberg 2006).


32

2.5.1 Description of KASUMI

KASUMI (3GPP TS 33.102, 2006), is a 64-bit block cipher that has a key size of
128bits. KASUMI was designed as modification of MISTY1 (Mitsuru Matsui, 1997),
optimized for implementation in hardware. Therefore, the most of the components of
KASUMI are similar to the respective components of MISTY1 (Mitsuru Matsui, 1997).

KASUMI is a Feistel (3GPP TS 35.202, 2007) cipher with eight rounds. It operates on a
64-bit data block and uses a 128-bit key. The round function (or ƒi() f-function) used  in
the ith  round of the Feistel cipher is denoted by ƒi. The f-function has a 32-bit input and
a 32-bit output. Each f-function of KASUMI is composed of two functions an FL-
function and an FO-function. An FO-function is defined as a network that makes use of
three applications of an FI-function.

An FI-function has a 16-bit input and a 16-bit output. Each FI-function comprises a
network that makes use of two applications of a function S9 and two applications of a
function S7. The functions S7 and S9 are also called “S-boxes of KASUMI”.  In this
manner KASUMI has similar three- layer nested structured of MISTY1 (Mitsuru
Matsui, 1997) . In this manner KASUMI decomposes into a number of subfuntions
(FL, FO and FI) that are used in conjunction with associated subkeys (KL, KO and KI).
The outmost Feistel network comprises eight rounds, which are called in the
specification outer rounds and numbered using index i, i = 1,2, ...., 8.

Figure 8. (a) KASUMI; (b) FOi function; (c) FIi,j function; and (d) FLi funtion


33

The FL-functions and FO-functions used at each round of the Feistel network are
numbered accordingly (i.e FLi, and FOi are functions used at the ith round of the outer
network). Function FLi is used in conjunction with subkey KLi, and function FOi is
used conjunction with two subkeys: KOi and Ki.

The network formed by the eight FO-functions are called the inner networks and each
one has three rounds indexed by j, j=1, 2, 3. Each round of an inner network makes use
of a KO-key and an FI-function, the latter is used in conjunction with a KI-key.
Consider the ith inner network FOi. The KO-key, FI-function and the KI-key used at the
jth round of FOi are denoted as KOi,j FIi,j KIi,j respectively. In addition, the KI-key KIi,j
splits into two halves KIi,j 1 and KIi,j 2 .

2.5.2 KASUMI Components and Encryption Function

In ƒ8 and ƒ9 mode operation, the kernel function is only computed in one direction
(Niemi & Nyberg, 2006). So even if the kernel function is a block cipher, the
decryption transformation is never used. The purpose of the 3GPP is only the
encryption function of KASUMI and that only has been defined. KASUMI operates on
a 64-bit input (INPUT) using a 128-key (K) to produce 64-bit output (OUTPUT) as
follows. INPUT is divided into two 32-bit strings L0 and R0, where:

INPUT = L0 || R0 (2.8)

Then for each integer i with 1 ≤ i  ≤ 8 the  operation on the ith round of KASUMI is
defined as :

Ri = Li-1,  Li = Ri-1  ƒi(Li-1, RKi ) (2.9)

This constitute the ith round function of KASAUMI, where Li-1 || Ri-1 is the input data
block, Li || Ri is the output data bock and RKi is the ith round key, defined as a triplet of
subkeys (KLi, KOi, KIi). Subkeys are derived from the key K using the key-scheduling
algorithm.

The output data block (OUTPUT) is defined as:

OUTPUT = L8 || R8 (2.10)

which is the data block offered at the end of the eighth round. In the specification of ƒ8
and ƒ9 this transformation is also denoted as

OUTPUT = KASUMIK [INPUT] (2.11)


34

Components of KASUMI and their functionalities have been defined in (3GPP TS
35.202, 2007) and it is highlighted here shortly.

2.5.2.1 ƒ Functions

According to (Niemi & Nyberg 2006) each f-function ƒi takes a 32-bit input I and
returns a 32-bit output O under the control of round key RKi , where the round key
comprises the triplet ((KLi, KOi, KIi). The f-function ƒi itself is constructed from two
subfunctions: an FL-function FLi and an FO-function FOi with associated subkeys KLi

(used with FLi) and subkeys KOi, and KIi (used with FOi).

The f-function ƒi has two different forms depending on whether it is an even round or an
odd round. For odd rounds I = 1, 3, 5 and 7.The f-function ƒi is defined as:

ƒi( I, RKi) = FOi (FLi (I, KLi,), KOi, KLi ) (2.12)

And for even rounds i = 2,4,6 and 8, The f-function ƒi is defined as:

ƒi(I, RKi) = FLi (FOi (I, KOi, KIi),  KLi (2.13)

i.e for odd rounds first the FL-function and then the FO-function is applied to the round
data, while for even rounds the order of the functions is changed.

2.5.2.2 Function FL

The input function of FLi comprises a 32-bit data input I and a 32-bit subkey KLi. The
subkey is split into two 16-bit subkeys, KLi,1 and KLi,2 , where :

KLi = KLi,1  || KLi,2 (2.14)

The input data I is split into two 16-bit halves, L and R, where I = L ||R. The FL-
funtions make use of the following simple operations:

ROL (D) the left circular rotations of a data block D by one bit

D1 D2 the bitwise OR operation of two data blocks D1 and D2

D1 D2 the bitwise AND operation of two data blocks D1 and D2

Then the 32-bit output value of the FL-function is defined as L || R || where:

L= L  ROL(R´ KLi,2 ) (2.15)

R= R  ROL( L KLi,1) (2.16)


35

2.5.2.3 Function FO

The input to function FOi comprises a 32-bit data input I and two sets of subkeys: a 48-
bit, KOi and 48-bit KIi.The 32-bit data input is split into two halves, L0 and R0.where I
= L0 || R0, while the 48-bit are subdivided into three 16-bit subkeys, where:

KOi = KOi,1 || KOi,2 || KOi,3 and KIi. = KIi,1|| KIi,2 || KIi,3 (2.17)

For each integer j with 1 ≤ j ≤ 3 the operation of the jth round of the function FOi is
defined as:

Rj = FIi, j(Lj-1 KOi j, KIi j )Rj-1 (2.18)

L j = Rj-1 (2.19)

Output from the FOi function is defined as the 32-bit data block L3 || R3.

2.5.2.4 Function FI

The FI-function is depicted in Figure 8. The thick and thin lines in this diagram are
used to emphasize the difference between the 9-bit and 7-bit data paths, respectively.

An FI-function FIi, j takes a 16-bit data input I and 16-bit subkey KIi j. The input I is
split into two unequal components, a 9-bit left half L0 and a 7-bit right half R0, where I
= L0 || R0, . Similarly, the key KIi, j is split into a 7-bit component KIi,j,1 and  a 9-bit
component KIi j,2, where and KIi,j = KIi, j,1|| KIi, j, 2 . Each FI-function FIi, j uses two S-
boxes: S7 which maps a 7-bit input to a 7-bit output and S9 which maps a 9-bit input to
a 9-bit output. The FI-funcions also uses two additional functions, which are designated
by ZE and TR. These simple functions are defined as following:

ZE (D) takes a 7-bit data string D and converts it to a 9-bit data string by appending two
zero bits to the most significant end of D. TR (D) takes a 9-bit data string D and
converts it to a 7-bit value by discarding the two most significant bits of D.

The function FIi, j is defined by the following series of operations:

L1 = R0 R1 = S9[L0] ZE (R0)

L2 = R1 KIi, j, 2 R2 = S7[L1] TR (R1) KIi,j,1

L3 = R2 R3 = S9[L2]  ZE (R2)

L4 = S7 [L3]  TR(R3) R4 = R3 (2.20)

The output of the FIi, function is the 16-bit data block L4 || R4,


36

2.5.2.5 S-boxes
The two S-boxes (S7 and S9) have been designed so that they may be easily
implemented in combinational logic or by a look-up table. Both forms are given for
each S-box. The input x comprises either seven or nine bits with a corresponding
number of bits in the output y. Therefore:

x = x8 || x7 || x6 || x5 || x4 || x3 || x2 || x1 || x0 || and (2.21)

y = y8 || y7 || y6 || y5 || y4 || y3 || y2 || y1 || y0 || (2.22)

Where the x8, y8 and x7, y7 bits only apply to S9 and the x0, y0 bits are the least
significant bits. Gate logical operations of S7 and S9 are described in detail (3GPP TS
35.202, 2007).

2.5.3 Key Schedule

KASUMI has a 128-bit key K. Each round of KASUMI uses 128 bits of key that are
derived from K. Before the round keys can be calculated two arrays of 16-bit values Kj

and K΄j( j = 1, …,8) are derived in following manner. The first array K1, K2, ….K8 is
derived by subdivision of K into eight 16-bit sub-blocks such that:

K = K1 || K2 || K3 || K4 || K5 || K6 || K7 || K8 (2.23)

The second array K`1, K`2, …,K`8 is derived from the first array by adding an array of
16-bit constants Cj as follows:

K`j  =K`j  Cj (2.24)Where the constants Cj are given in following table 1:
Table 1. Constants Cj

Then the subkeys (KL, KO and KI) are derived as defined by the following Table 2.

Table 2. Subkeys (KL, KO, and KI)


37

2.5.4 KASUMI Operations

Since KASUMI is a Feistel cipher with eight rounds, it operates in recursive structure
manner in three round functions, namely, FO(), FI() and FL().Both functions of FO()
and FI() has a subcomponents which are also have a Feistel-like structure. KASUMI
encrypt 64 bits blocks by using a Feistel network of eight rounds. In each round, it
alternate between the left and right halves of the state. In first level, one half enters the
out round function FO() for processing. In second level the FO() function´s output is
then XOR’ed to the opposite half of the state and then operation continues with next
round. As explained earlier, FO() function itself consists of three round of inner round
function FI() and it happens in similar Feister network recursive manner with half the
block size. In third level of recursion FI() itself consists of four rounds of nonlinear S-
Box transformations arranged in a Feiste structure. Following figure illustrate the
KASUMI operations detail structures.

Figure 9. KASUMI operations


38

3. THEORETICAL ANALYSIS OF CRYPTOGRAPHIC PERFORMANCE

In this chapter, thesis will highlight the benchmarking process of analyzing
cryptographic performance in order to understand the characteristics of the multi-core
processor being measured. This chapter also provides a short description of the analysis
of the potential variables that affect the cryptographic performance of a multi-core
processor when measurements are made on the driver level and explains how these
variables manifest themselves in the measured performances of the processor that is
used. This chapter also introduces a theoretical hypothesis about the implementations of
the KASUMI algorithm in both software and hardware. In this way, the predicted can
be compared against measured results.

Cryptographic algorithms are used in modern communication systems and have
become very important components for ensuring data security. Traditionally,
cryptographic algorithms were implemented either by software on a general-purpose
processor or incorporated dedicated security hardware in the form of a specialized
processor with cryptographic features. In this latter option, the cryptographic functions
took place in an off-chip accelerator approach, which means adding another chip or
daughter chip and writing low-level code for accessing data between the CPU and this
external dedicated chip.

Since cryptographic algorithms are computationally demanding, software-based
implementations are very slow or their power consumptions are high, and thus,
hardware-based implementations have been preferred, but even this hardware-based
cryptographic implementations option also had its drawbacks; for instance, in order to
have access to CPU functionalities, low-level code has to be written for transferring the
data forth and back between the CPUs, system bus, memory, and other subsystems,
which could have a huge effect on the overall performance of the systems.

The present embedded system development is moving on to the next step of
cryptographic algorithm implementations. Hardware-based implementations are moved
on-chip or System-on-Chip (SoC), which means that mainstream CPUs incorporate
cryptographic function hardware acceleration. These hardware accelerations were
developed to offload the heavy processing load from the core CPU.

These new SoC systems consist of an embedded CPU and an on-chip bus, a DMA
controller, memories, and other subsystem modules, such as dedicated crypto
coprocessors used for the accelerations of the cryptographic algorithms. The
performance measurements are made for cryptographic algorithms on different levels,
on the driver level, the application level, or the protocol stack level. This thesis makes


39

the performance measurements on the driver level. According to (Freescale
Semiconductor 2008), there are variables which may influence the results of the
measurements.

As the demand for faster processing has increased over the years, the idea of multi-core
has been born; in this two or more cores are combined into a single Integrated Circuit
(IC) package and in this way, the multi-core technology takes advantage of parallel
processing rather than increasing the frequency of the clock rate to achieve higher
performance.

Verifying the cryptographic performance claims of multi-core vendors can be a difficult
task since there can be a difference between the theoretical cryptographic performance
of the processor and its performance in a given application. Some of the results of the
performance are inherited from its architecture, while others are due to the performance
of the application itself, the protocol stacks of the software, or the underlying OS.

According to (Waters & Stammberg 2009), there are many cryptographic acceleration
implementations but there are two common cryptographic acceleration implementation
architectures, namely flow-through and look-aside, which most accelerator
implementations are based on. The following variables that have been identified in
above mentioned reference have an influence on the performance measurements of the
cryptographic implementations and would explain shortly as following.

3.1 Influential Variables for Cryptographic Performance Measurements

In this subchapter, 4 variables that could affect the performance measurements of
different cryptographic-based implementations are discussed. Although a detailed
description of how each parameter affects the results lies outside the scope of this thesis
work, they are defined here to raise awareness of their presence in the measurements.

3.1.1 Application or Protocol Stack Software Overheads

Application and protocol stack overheads are the instructions a processor must execute
in order to determine what sort of crypto processing is required. Application/protocol
stack overheads are typically the greatest source of performance degradation and often
the most shocking to users as they transition from a non-secure version of a given
protocol to the secured versions.

The reason for this degradation is that security protocols are stateful, so as the
cryptographic keys that are used to encrypt and/or decrypt or authenticate the data have
lifetimes, which can be measured in terms of the number of bytes encrypted,the number


40

of seconds since the key was first used, the number of seconds since the key was last
used, or all of the above. Security protocols also add a header field to packets, allowing
the packet to be forwarded to the other end of the security tunnel without revealing
information about the parties to the communication.

3.1.2 Software Overheads

Crypto API, application stacks, and driver overhead affect the interface between crypto
hardware and application software stacks.  Efficient interaction between OS and crypto
accelerators is crucial in order to minimize their effects. With small packet sizes,
software overheads are more evident with a large number of packets processed, but as
the packet size increases, all software overheads become less important and the row
performance of the look-aside accelerator becomes more critical.

3.1.3 Bus Bandwidth

The amount of memory bandwidth consumed during look-aside crypto processing is
significantly greater than what is consumed during a plaintext operation. Whether the
data to be cryptographically processed are a large file or a small packet, the data must
originally be moved from the network or peripheral interface to the system memory,
moved from the system memory to the accelerator and back, and then moved from the
system memory to a network or peripheral interfaces.

In addition to this data movement, the security context, such as the crypto key, must
also be fetched. There is also additional memory bandwidth consumption associated
with additional look-ups, instruction fetches, or architecturally specific reads or writes.
Crypto performance can be constrained if there is not enough bus bandwidth to keep the
look-aside accelerator fully utilized. The bus bottleneck could be between the SOC and
its system memory.

3.1.4 Data Size

Is the test encrypting of a small or large chunk of data matter? The smaller the size of
the chunk of data, the more the results will be influenced by the memory latency of the
accelerator, the descriptor size, and other "non-data" context it must fetch to perform the
operation, and the accuracy of the timer.

3.1.5 Iteration
A good way to include accelerator DMA overheads and memory latency in a small
amount of data while reducing timer resolution as a variable is to construct the test in
such a way that the timer starts before iteration 1 and stops after the nth iteration, where
n is a fairly large number.


41

4. EXPERIMENTAL SET-UP AND TOOLS

This chapter will describe both the OCTEON HW and SW architectures, the setting up
of the test environment, and the tools used; the software under test is also described
briefly in this chapter.  The experimental tests that are studied in this thesis were
performed at the Nokia Siemens Networks laboratory in Espoo, in Finland.

KASUMI cipher implementations have been tested in several other studies
(Jääskeläinen 2003; Tomás Balderas-Contretras 2004; Tomás, Balderas-Contretras, and
René, Cumplido 2005 and H.Kim, Choi, M.Kim and H.Ryu 2002). But none of those
involved Cavium OCTEON Multicore processors. The OCTEON Multicore processors
family contains a wide range of different product options in which some of the chips
will contain a hardware accelerator for security algorithms like KASUMI.  According to
(Cavium Networks, 2009), a Cavium HW-based security accelerator should provide a
substantial performance boost compared to a purely SW-based KASUMI
implementation. The intention of the experimental KASUMI tests in this thesis is to
verify those claims by comparing SW- and HW-based KASUMI implementations to
each other.

4.1 OCTEON Hardware Architecture

This section describes the hardware architecture that was used for running the
experimental tests. The first subchapter first gives a high-level overview of the
OCTEON processor, while the next subchapter describes its architecture in greater
detail. This thesis does not intend to describe each and every fact about the OCTEON
but just cover it on a high level and touch only on those issues that are germane to this
study. For in-depth descriptions of the OCTEON hardware, refer to the OCTEON
User’s Manual (Cavium Networks, 2009f).

Many embedded applications require a higher performance response time for real-time
systems. This demand makes it difficult for a single CPU to satisfy this ever-increasing
hunger for high-performance applications, so the use of multiple CPUs has become very
popular among embedded system manufacturers.

A multi-CPU system is not the same as a multi-core system; the multi-core processor is
a processing system that comprises two or more cores (CPUs) which are integrated in a
single chip. Historically, more processing power was achieved by increasing the speed
of the processor, but current multi-core processors could deliver better performance by
forming multiple logical units into a core one single chip.


42

4.1.1 Cavium Networks MIPS (cnMIPS)

MIPS (Microprocessor without Inter locked Pipeline Stages) is a RISC (Reduced
Instruction Set Computer) processor architecture developed by MIPS Computer System
Inc. The early MIPS architecture was 32-bit implementation and the latest MIPS64 is
based on 64-bit wide registers and data path implementation.

The Cavium Networks OCTEON product family architecture is based on officially
licensed MIPS Inc. (MIPS Technology 2001:7-15) technology and uses the multi-core
MIPS64 v2 processor instruction set supporting both 32-bit and 64-bit processing.
Cavium Networks has added some custom instructions to accelerate common
networking operations, such as a bit test branch instructions, security and packet
processing instructions or bit-field insert/extract forming the cnMIPS core.

These processors are software-compatible processors, with one to sixteen cnMIPS cores
on a single chip and OCTEON II Internet application Processor (IAP) scale from 1 to 32
cores according to (Cavium Networks, 2009).The cnMIPS cores is Cavium Networks´s
custom implementation of the MIPS64 release 2 instruction set.

4.1.1.1 OCTEON Product Overview

The OCTEON product family includes OCTEON, OCTEON Plus and OCTEON II
multi-core MIPS64 processors. All OCTEON processors are software compatible and
supported by industry-standard software tool-chains and operating systems. All
OCTEON products share the same architecture and more detail description of OCTEON
products are shown in Cavium Networks Product Table (Cavium Networks, 2009c).

In this thesis work, OCTEON Plus CN58XX   product family is used for running on the
experimental tests that are used for performance tests on KASUMI cipher
implementations. This processor has a sixteen cnMIPS cores at up to 750MHz speed on
a single chip. The CN58XX processor is provides security hardware acceleration which
implements the latest set of algorithms including KASUMI cipher algorithms.

4.1.1.2 OCTEON Architecture Overview

According to (Cavium Networks, 2009d) All OCTEON products share the same
OCTEON architecture. Individual OCTEON products vary and scale based on the
number of integrated cnMIPS cores, the frequency that the cores and the internal
interconnects run at, the type and the number of I/O interfaces integrated, and the type
and the number of application acceleration hardware units integrated.


43

According to (Cavium Networks, 2009e) key OCTEON features which most OCTEON
models provide are:

 Up to 32 cores, up to 1.5 GHz: The OCTEON family of multicore processors
supports up to 32 cnMIPS with a speed from 300MHz to 1.5GHz.

 Hardware Acceleration Units: Multiple hardware acceleration units are
integrated into each processor. These acceleration units include; Packet-
managment accelerators, Security acceleratiors, Application accelerators and
specialized accelerators.

 Dedicated DMA Engines: Dedicated DMA Engines are provided for each
hardware unit which accesses memory.

 High-Speed Interconnects: The hardware units and the cores ae connected by
high-speed interconnects. These interconnects run at the same frequency as the
cores. Each interconnect is a collection of multiple buses with extensive
pipelining and sophisticated hardware arbitration logic.

 Industry-Standard Tools chains and Operating Systems: Industry-standard tool
chains (GCC, GDB) and operating systems (including SMP Linux) has been
modified to utilize the OCTEON processor´s multiple cores, hardware
accelerators and Cavium Networks –specific instructions.

 Packet Management Acceleration: Packet receive/transmit is automated by
software-configurable packet management accelerators.

 TCP/UDP acceleration

 Per-Core Security Hardware Acceleration. Common security algorithms
including KASUMI are accelerated by optional per-core Security Engines.

The architectural definition of the CN58XX product line provides a good perspective
for illustrating the OCTEON architecture. Following figure is OCTEON Plus CN58XX
block diagram.


44

Figure 10. Cavium Networks MIPS Architecture (cnMIPS)

Although every units of OCTEON hardware is essential to work of this thesis work, in
this subchapter, will be highlighted only some of the hardware –acceleration units that
their functionalities are very essential to this thesis work.

As described earlier, Hardware-Acceleration units consists of packet-management
accelerators, security accelerators, application accelerators and specialized accelerators.
Packet management accelerators and security accelerators has huge effect on work and
the experimental test results analysis of this thesis work, so their functionalities would
be described them in detail following paragraphs.

Packet Management Accelerators in its part consists of Schedule/Synchronization and
Order (SSO) unit that manages packet scheduling and ordering, Free Pool Allocator
(FPA) unit that handles pools of free buffers including Packet Data buffers, and Input
Packet Data (IPD) unit which manages packet input. Packet Input Processor (PIP) unit
which works closely with IPD a handles with input packets, Packet Output Unit (PKO)
manages packet output.

PIP (Packet Input Packet) and IPD (Input Packet Data) units work together to receive
the packet and perform early processing on it. Each packet is represented as “work” for
the cores to do and is represented by a Work Queue Entry (WQE) data structure which


45

contains the tag type, tag value QoS value, group and pointer to Packet Data Buffer.
These units are responsible for many layer-2 through layer-7 processing requirements
such as exception checking and TCP/UDP checksum verification.

Schedule/Synchronization and Oder (SSO) is responsible for scheduling the work to
cores and also maintaining the packet ingress order. Cores may request work from SSO
either asynchronously (the core continue to do other work while the instruction
completes) or synchronously (the core waits for instruction to complete). In another
word, SSO unit provides Scheduling, Synchronization and Ordering services.

Packet Output (PKO) in its turn is responsible for packet transmission. When the packet
processing is complete, the core notifies the PKO that the packet is ready for
transmission and PKO manages the transmission priority.

Security accelerators consists Random Number Generator (RNC) unit, Key Unit which
provides and manages secure on chip memory which can be used to store a hardware
key and can be reset using an external pin. Per-Core Security or Security Engine, this is
special coprocessor used for accelerating security applications and hash generation and
there is one Security-Coprocessor per core.

Once the core issue an instruction to the coprocessor, the core can continue to do other
work while the coprocessor complete the instruction, or the core can wait for the
coprocessor to complete the task. Following figure illustrate the communication
between core and its co-processor.

Figure 11. Communication between core and its co-processor

More detail descriptions of these units and how they functions is handled in (Cavium
Networks, 2009e).


46

4.1.1.3 Other Units and Functionalities

According to (Cavium Networks 2009c, 2009d) there are also other import units beside
those been described in above paragraphs, such as On-Chip Interconnects which joins
the different integrated units together. The OCTEON on-chip interconnect architecture
involves a network of Coherent Memory Buses (CMB), and another network of
interconnects connecting the I/O sub-system (IOB) together. Both of these networks are
implemented based on multiple split-transaction and highly pipelined buses. These two
networks, CMB, and IOB, are connected together through a high performance I/O
bridge.

The CMB network connects all the cnMIPS cores, L2 cache controller, memory
controller, and I/O bridge together while the I/O sub-system (IOB) includes the
controllers for the various I/O interfaces, all the application specific offload and
acceleration engines, and the I/O bridge. The OCTEON architecture offers as well a
cache hierarchy which includes split instruction and data L1 caches on each cnMIPS
core, and a large L2 cache, up to 2MB, that is shared by all cnMIPS cores and the I/O
subsystem.

The OCTEON architecture includes also, a large variety of application specific offload
and acceleration engines. Some of these acceleration features are implemented as
individual functional units that are shared among all the cnMIPS cores, while some of
these acceleration features are integrated into each of the cnMIPS cores. The decision of
stand-alone functional units against integration into cnMIPS core is optimized for each
of the application specific acceleration features in the OCTEON architecture.

4.2. OCTEON Software

This subchapter highlights the OCTEON processor´s software overview specially
software architecture and other software utilities that is used to access the hardware
services. The OCTEON Software is a Software Development Kit parts that is intended
to provide the tool, drivers Application Protocol Interface (APIs), Libraries and other
utilities that is needed to access the underlying HW chipset solution.

It also plays crucial role for providing services to higher level applications that run on
the OCTEON processors. Following figure shows high level overview of software suits
that this test system uses.


47

Figure 12. High level test system architecture

The KASUMI Crypto software that is to be tested runs on OCTEON Simple Executive
to leverage OCTEON per-core security acceleration to achieve maximum performance
possible.

4.2.1 Simple Executive API

In this thesis work KASUMI test are performed by using Octeon Simple Executive
architecture instead of Linux based application.  The reason for this choice is the
services that normal Operating Systems  provide to applications such as interrupts,
context switches, kernel address space copy in/copy out with system calls causes some
overhead to pay. The Simple Executive provides a Hardware Abstraction Layer (HAL)
in the form of an Application Programming Interface (API) to the underlying hardware
units. This API is a very thin layer of simple functions which access the Central Process
Unit (CPU) registers. It also provides some convenience routines for block
initialization. The API can be used from both kernel and user mode.

In another word, Cavium OCTEON Simple Executive (SE) means that the executable
SW will run directly on top of OCTEON HW without Operating System (OS) e.g
Linux.

Following figure is show the role of the Simple Executive API:


48

Figure 12. Simple Executive API

The Simple Executive API is used to access the hardware units:

 Basic units; FPA, IPD, PIP, SSO and PKO

 Intermediate units: FAU and TIM

 Advanced units: LLM, ZIP, RNG, DFA, KEY, CIU, etc.

Simple Executive API also includes functions and macros for:

 System memory allocation (bootmem)

 Synchronization between cores

 spinlocks

 Reader-Writer locks

 Atomic set, add, compare and store operations

 Barrier functions

Simple Executive functions and macros may be used either to create stand-alone Simple
Executive application, or may be called from drivers or applications running on an
operating system kernel such as Linux. For instance, after the Linux kernel is booted, a


49

Cavium Networks Ethernet driver may be started. This driver uses the Simple Executive
API to configure the OCTEON hardware. Simple Executive User-Mode applications
may also be started from Linux. Both 32-bit and 64-bit modes are supported, although
64-bit mode should be used whenever possible due to better overall performance.

Following figure shows using Simple Executive API from different Runtime
Environment:

Figure 13. Simple Executive API runtime environments

4.2.2 Runtime Environment Choices for cnMIPS Cores

There are several choices for runtime environment. The tree supplied by Cavium
Networks is Simple Executive stand-alone mode, Linux and the hardware simulator:

1. When running Simple Executive on multiple cores, the same FLF file is usually
run on all of the cores. These cores are all started from one load command.

2. When running Linux on multiple cores (SMP), there is one kernel running, not
one kernel per core. Linux applications are schedule to run on different cores.

3. The third runtime environment supplied by Cavium Networks is the Hardware
Simulator. The simulator is useful when actual hardware is not available and it
also very useful for performance tuning. Performance tuning is most easily done
using the tool Viewzilla. This tool analyzes the output of the simulator, so
making sure the code will run on the simulator as well as on actual hardware is
recommended for performance critical application.


50

In addition to the three runtime environments supplied by Cavium Networks that is
described above, several open-source and proprietary operating systems are available. In
this thesis work, the Simple Executive running as Linux Stand-alone (SE-S) application
is used, so thesis considerate to explain in detail how it works.

4.2.2.1 Simple Executive

Simple Executive provides an API to the hardware units. Simple Executive may run
Stand-alone (SE-S) application, or as user-mode (SE-U) application on an operating
system such as Linux, in another word, Linux kernel and applications may both make
Simple Executive API calls. When Simple Executive calls are made from Linux user
space, the process is referred to as a Simple Executive User-Mode application (SE-UM).

For instance, when run as a user-mode application, different application start code (
main() ) is called and there other minor porting items to consider. The following figure
shows a representation of a core running Simple Executive in Stand-alone (SE-S) mode.
One core runs a Simple Executive Stand Alone (SE-S) Application.

Figure 14. Simple Executive Stand-alone Application (SE-S)

Simple Executive calls may be made from kernel mode. For example, the Cavium
Networks Ethernet driver, which runs on Linux, makes Simple Executive calls:

Figure 15. Simple Executive calls from kernel Mode

The following figure shows a representation of a core running Simple Executive as a
User-mode application.

Figure 16. Simple Executive User-Mode (SE-UM) application

SE-S

Linux

Driver

SE-UM


51

To use all available memory, SE-UM applications should be compiled for 64-bit mode.
32-bit mode is sometimes used, but can only access a limited amount of physical
memory. SE-S is very fast compared to SE-UM. There no context switches, and all the
memory is mapped for fast access. To get maximum performance from OCTEON:
Whenever possible, design the application to use a 64-bit Simple Executive application.

4.2.2.2 Application Configurations

There are number of possible way to do the multi-core configuration, but one of the
easiest configurations to implement is Simple Executive Stand-alone application
configurations. Following figure shows example of 8-core simple system running a
Simple Executive Stand-alone application uses 1 ELF file, 8 instance of SE-S.

Figure 17. 8-core simple system running a SE-S application

4.2.3 Other Software Issue

In this subchapter, thesis will go through some other steps and issue that are related to
complying and running the software under the test.

4.2.3.1 Application Entry Point and Start up Code

An application may be compiled as either SE-S or SE-UM application without
modification. The code executed when application is started is not the same: when SE-S
is the built target. The makefile $OCTEON_ROOT/executive/cvmx.mk is responsible for
making this change.

4.2.3.2 Booting SE-S Application

When the application is executed and SE-S object is compiled, then to boot Simple
Executive application bootoct bootloader command is used. This command is used to
boot application on one or more cores. Following figure shows cores running Simple
Executive Stand-alone in a Load Set.


52

Figure 18. Simple Executive Stand-alone Mode (SE-S) application

Following figure is shown in detail how booting SE-S applications with Bootoct
command works:

Figure 19. Downloading and Booting SE-S Applications

4.2.4 Software Architecture

In most software design paradigm that intended to run embedded systems follows the
separation of control and data planes in order to simplify the software architecture
design. In cnMIPS software helps too to separate two basic types of processing: normal
packet processing (fast path), and exception processing (slow path). Depending on the


53

number of cores available, different configuration of cores devoted to either fast path or
slow path processing can be used to optimize throughput.

4.2.4.1 Control-plane versus Data-plane applications

Typical telecom networking SE-S application functions are dived into two categories:
control-plane (slow path), and data-path (fast path). In OCTEON processor the control-
plane usually handles exceptions while the data-plane handles normal packet
processing. Software application used in this thesis work falls into fast path category.

According to (MIPS Technology 2001) SE-S applications may be used for both control-
plane and data-plane. SE-S applications provide the lowest overhead and highest
potential for scaling. Next best solution (a typical solution) is SE-UM for control plane
and SE-S for data-plane.

Following figure illustrate SE-S application divisions and only one core does the
initialization routine.

Figure 20. SE-S used both