# ARM assembly language

Assembly language programming on ARM microprocessors with examples of working code.

## Monday, 12 October 2020

## Friday, 7 March 2014

### Triangle numbers

The sequence of triangle numbers is generated by adding the natural numbers. So the 7^{th} triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. The first ten terms would be:

1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...

Let us list the factors of the first seven triangle numbers:

1: 13: 1,36: 1,2,3,610: 1,2,5,1015: 1,3,5,1521: 1,3,7,2128: 1,2,4,7,14,28

We can see that 28 is the first triangle number to have over five divisors.

What is the value of the first triangle number to have over five hundred divisors?

.syntax unified .align 2 .section .rodata resstring: .asciz "%d\n" .equ last, 250 @ we count up to the square root - so half of the divisors will @ be necessarily less than the square root and half above. icount .req r4 @ it is possible to write FORTRAN in any language :-) jcount .req r5 tmp .req r6 num .req r7 .text .align 2 .global get_num_divisors .type get_num_divisors, %function @ we use jcount to hold current divisor and icount to hold the number of divisors seen @ num is passed as r0 and the number of divisors is returned. get_num_divisors: stmfd sp!, {r4-r7, lr} mov num, r0 mov icount, 0 mov jcount, 1 jstart: mov r0, num mov r1, jcount bl divide teq r1, 0 beq factor b nextj factor: add icount, icount, 1 nextj: add jcount, jcount, 1 mul tmp, jcount, jcount cmp num, tmp bgt jstart mov r0, icount ldmfd sp!, {r4-r7, pc} .align 2 .global main .type main, %function main: stmfd sp!, {r4-r7, lr} mov num, 0 mov icount, 0 mov jcount, 1 loop: cmp num, #last bge printme add icount, icount, jcount mov r0, icount bl get_num_divisors mov num, r0 add jcount, jcount, 1 b loop printme: mov r1, icount ldr r0, =resstring @ store address of start of string to r0 bl printf mov r0, 0 ldmfd sp!, {r4-r7, pc} mov r7, 1 @ set r7 to 1 - the syscall for exit swi 0 @ then invoke the syscall from linux

.syntax unified # divide takes value in r0, divisor in r1 and returns dividend in r0 and modulus in r1 .global divide .type divide, %function divide: stmfd sp!, {lr} # see http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf bl __aeabi_uidivmod ldmfd sp!, {pc}

divide.s

### Back on track...

## Thursday, 22 August 2013

### Greatest product of adjacent numbers in a grid

A note to any readers : This will be the last post for a month.

In the 2020 grid below, four numbers along a diagonal line have been marked in red.

08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08

49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00

81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65

52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91

22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80

24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50

32 98 81 28 64 23 67 10 **26** 38 40 67 59 54 70 66 18 38 64 70

67 26 20 68 02 62 12 20 95 **63** 94 39 63 08 40 91 66 49 94 21

24 55 58 05 66 73 99 26 97 17 **78** 78 96 83 14 88 34 89 63 72

21 36 23 09 75 00 76 44 20 45 35 **14** 00 61 33 97 34 31 33 95

78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92

16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57

86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58

19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40

04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66

88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69

04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36

20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16

20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54

01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48

The product of these numbers is 26 63 78 14 = 1788696.

What is the greatest product of four adjacent numbers in the same direction (up, down, left, right, or diagonally) in the 2020 grid?

The solution to this problem is quite interesting because it demonstrates the use of a macro. A macro is a construct available within *as* which permits the same sort of abstraction as a procedure|subroutine|function would in a high-level language.

Inside the macro, the parameter *a* is accessed via *\a* and the construct *\@* expands out to a unique value each time the macro is invoked; this allows the use in labels which are distinct for each macro invocation.

.syntax unified .align 2 .section .rodata buffer: .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 .byte 0, 0, 0, 8, 2, 22, 97, 38, 15, 0, 40, 0, 75, 4, 5, 7, 78, 52, 12, 50, 77, 91, 8, 0, 0, 0 .byte 0, 0, 0, 49, 49, 99, 40, 17, 81, 18, 57, 60, 87, 17, 40, 98, 43, 69, 48, 4, 56, 62, 0, 0, 0, 0 .byte 0, 0, 0, 81, 49, 31, 73, 55, 79, 14, 29, 93, 71, 40, 67, 53, 88, 30, 3, 49, 13, 36, 65, 0, 0, 0 .byte 0, 0, 0, 52, 70, 95, 23, 4, 60, 11, 42, 69, 24, 68, 56, 1, 32, 56, 71, 37, 2, 36, 91, 0, 0, 0 .byte 0, 0, 0, 22, 31, 16, 71, 51, 67, 63, 89, 41, 92, 36, 54, 22, 40, 40, 28, 66, 33, 13, 80, 0, 0, 0 .byte 0, 0, 0, 24, 47, 32, 60, 99, 3, 45, 2, 44, 75, 33, 53, 78, 36, 84, 20, 35, 17, 12, 50, 0, 0, 0 .byte 0, 0, 0, 32, 98, 81, 28, 64, 23, 67, 10, 26, 38, 40, 67, 59, 54, 70, 66, 18, 38, 64, 70, 0, 0, 0 .byte 0, 0, 0, 67, 26, 20, 68, 2, 62, 12, 20, 95, 63, 94, 39, 63, 8, 40, 91, 66, 49, 94, 21, 0, 0, 0 .byte 0, 0, 0, 24, 55, 58, 5, 66, 73, 99, 26, 97, 17, 78, 78, 96, 83, 14, 88, 34, 89, 63, 72, 0, 0, 0 .byte 0, 0, 0, 21, 36, 23, 9, 75, 0, 76, 44, 20, 45, 35, 14, 0, 61, 33, 97, 34, 31, 33, 95, 0, 0, 0 .byte 0, 0, 0, 78, 17, 53, 28, 22, 75, 31, 67, 15, 94, 3, 80, 4, 62, 16, 14, 9, 53, 56, 92, 0, 0, 0 .byte 0, 0, 0, 16, 39, 5, 42, 96, 35, 31, 47, 55, 58, 88, 24, 0, 17, 54, 24, 36, 29, 85, 57, 0, 0, 0 .byte 0, 0, 0, 86, 56, 0, 48, 35, 71, 89, 7, 5, 44, 44, 37, 44, 60, 21, 58, 51, 54, 17, 58, 0, 0, 0 .byte 0, 0, 0, 19, 80, 81, 68, 5, 94, 47, 69, 28, 73, 92, 13, 86, 52, 17, 77, 4, 89, 55, 40, 0, 0, 0 .byte 0, 0, 0, 4, 52, 8, 83, 97, 35, 99, 16, 7, 97, 57, 32, 16, 26, 26, 79, 33, 27, 98, 66, 0, 0, 0 .byte 0, 0, 0, 88, 36, 68, 87, 57, 62, 20, 72, 3, 46, 33, 67, 46, 55, 12, 32, 63, 93, 53, 69, 0, 0, 0 .byte 0, 0, 0, 4, 42, 16, 73, 38, 25, 39, 11, 24, 94, 72, 18, 8, 46, 29, 32, 40, 62, 76, 36, 0, 0, 0 .byte 0, 0, 0, 20, 69, 36, 41, 72, 30, 23, 88, 34, 62, 99, 69, 82, 67, 59, 85, 74, 4, 36, 16, 0, 0, 0 .byte 0, 0, 0, 20, 73, 35, 29, 78, 31, 90, 1, 74, 31, 49, 71, 48, 86, 81, 16, 23, 57, 5, 54, 0, 0, 0 .byte 0, 0, 0, 1, 70, 54, 71, 83, 51, 54, 69, 16, 92, 33, 48, 61, 43, 52, 1, 89, 19, 67, 48, 0, 0, 0 .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 resstring: .asciz "%d\n" .equ ocolumns, 26 .equ icolumns, 20 .equ colstart, 23 .equ border, 3 .equ offset, 2 .equ points, 4 .equ north, -ocolumns .equ north_east, -ocolumns+1 .equ east, 1 .equ south_east, ocolumns+1 .equ south, ocolumns .equ south_west, ocolumns - 1 .equ west, -1 .equ north_west, -ocolumns-1 .equ point_set, 1 tmp .req r2 pointv .req r3 data_ptr .req r4 icount .req r5 jcount .req r6 kcount .req r7 maxv .req r8 dpstore .req r9 .macro direction a mov data_ptr, dpstore mov kcount, points mov pointv, point_set kloop\@: ldrb tmp, [data_ptr], \a mul pointv, pointv, tmp subs kcount, kcount, 1 bne kloop\@ cmp maxv, pointv movlt maxv, pointv @ set maxv to max of maxv and pointv .endm .text .align 2 .global main .type main, %function main: stmfd sp!, {r4-r9, lr} mov icount, icolumns iloop: mov jcount, icolumns jloop: mov tmp, ocolumns add data_ptr, icount, offset mul tmp, tmp, data_ptr add tmp, tmp, jcount add tmp, tmp, offset ldr data_ptr, =buffer add dpstore, data_ptr, tmp @ data[i][j] direction north direction north_east direction east direction south_east direction south direction south_west direction west direction north_west subs jcount, jcount, 1 bne jloop subs icount, icount, 1 bne iloop printme: mov r1, maxv ldr r0, =resstring @ store address of start of string to r0 bl printf mov r0, 0 ldmfd sp!, {r4-r9, pc} mov r7, 1 @ set r7 to 1 - the syscall for exit swi 0 @ then invoke the syscall from linux

## Thursday, 15 August 2013

### Summing prime numbers

The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17.

Find the sum of all the primes below two million.

.syntax unified .equ word,4 .equ limit,2000000 .equ numprimes4,595732 @ sufficient to hold all the bytes in all the elements of the vector we use to hold the primes sum_hi .req r4 sum_lo .req r5 numprimes .req r6 primes_ptr .req r7 number .req r8 limit .req r9 .section .bss .lcomm primes_vector,numprimes4 @ allocate space for vector of primes .section .rodata llustring: .asciz "%llu\n" .text .align 8 .global main .type main, %function main: stmfd sp!, {r4-r9, fp, lr} ldr primes_ptr, =primes_vector mov numprimes, 1 mov number, 2 str number, [primes_ptr] @ store 2 to primes vector ldr limit, =limit mov sum_hi, 0 @ initialise sum mov sum_lo, 2 mov number, 3 @ start with 3 as first odd prime loop: cmp number, limit bge printme @ leave if we have exceeded limit mov r0, number ldr r1, =primes_vector mov r2, numprimes bl prime_vector teq r0, 1 @ check for primality of number bne nexti str number, [primes_ptr, numprimes, lsl 2] @ if prime, add to vector add numprimes, numprimes, 1 @ increment prime count adds sum_lo, sum_lo, number @ and add to sum adc sum_hi, sum_hi, 0 nexti: add number, number, 2 @ increment number by 2 b loop printme: mov r2, sum_lo mov r3, sum_hi ldr r0, =llustring @ store address of start of string to r0 bl printf mov r0, 0 ldmfd sp!, {r4-r9, fp, pc} mov r7, 1 @ set r7 to 1 - the syscall for exit swi 0 @ then invoke the syscall from linux

.syntax unified .equ word, 4 # this subroutine returns 1 if the passed number is prime; 0 if not # # inputs # r0 - integer to test # r1 - pointer to vector of prime integers smaller than r0 # r2 - length of vector passed in r1 # # outputs # r0 - prime boolean number .req r4 vptr .req r5 tmp .req r6 squared .req r7 vsize .req r8 .global prime_vector .type prime_vector, %function .text prime_vector: stmfd sp!, {r4-r8, lr} mov number, r0 mov vptr, r1 mov vsize, r2 nexti: ldr tmp, [vptr], word @ get vector element to tmp mul squared, tmp, tmp @ and square it cmp squared, number @ compare the square to the input number movgt r0, 1 @ store 1 to r0 if square is greater (i.e. number is primes) bgt last @ and jump out mov r0, number mov r1, tmp bl divide @ divide number by tmp teq r1, 0 @ and check if it is a divisor moveq r0, 0 @ if so, not prime, so set r0 to 0 beq last @ and leave subs vsize, vsize, 1 @ decrement prime counter bgt nexti @ and try next vector element mov r0, 1 last: ldmfd sp!, {r4-r8, pc}

prime_vector.s

.syntax unified # divide takes value in r0, divisor in r1 and returns dividend in r0 and modulus in r1 .global divide .type divide, %function divide: stmfd sp!, {lr} # see http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf bl __aeabi_uidivmod ldmfd sp!, {pc}

divide.s

## Sunday, 28 July 2013

### Generating processor-specific code from a source file

There are many ARM cores as described at Wikipedia. The processors get more complex as they evolve and occasionally new instructions are introduced. One instance is the *rev* instruction which will reverse the bytes in a word e.g. for converting big-endian numbers to little-endian and vice versa.

In my Sheeva Plug there is a version 5TE processor (which does not support this instruction); my Raspberry Pi has a version 6 processor (which does).

I would like to keep my code in one source file, and dynamically assemble the optimum code for the processor upon which it was built.

We can use the macro assembler of the GNU assembler to achieve this.

.if ARCH >= 6 rev r0, r0 .else and r2, r0, #0xff000000 @ load the top 2 bytes to r2 and r3, r0, #0x00ff0000 @ load the next 2 bytes to r3 and r4, r0, #0x0000ff00 @ load the next 2 bytes to r4 and r5, r0, #0x000000ff @ load the bottom 2 bytes to r5 mov r0, r2, lsr #24 @ shift r2 bytes to bottom and store to r0 orr r0, r3, lsr #8 @ or the remaining shifted data orr r0, r4, lsl #8 orr r0, r5, lsl #24 .endif

We can interrogate the machine at build time by looking inside */proc/cpuinfo* via

*cat /proc/cpuinfo | grep ^Processor* which gives :

- On A SheevaPlug :
*Processor : Feroceon 88FR131 rev 1 (v5l)* - On A Raspberry Pi :
*Processor : ARMv6-compatible processor rev 7 (v6l)*

And then we pull the appropriate number within the Makefile and then we can build via *make endian* using the following *Makefile*

AS := /usr/bin/as LD := /usr/bin/ld ASOPTS := -gstabs ARCH = $(shell sed -ne 's/^Processor\s*:.*(v\([0-9]\)l)/\1/p' /proc/cpuinfo) endian: endian.s echo ARCH is $(ARCH) $(AS) $(ASOPTS) --defsym ARCH=$(ARCH) -march=armv$(ARCH) -o $@.o $^ $(LD) -o $@ $@.o

## Tuesday, 23 July 2013

### Pythagorean triplets

A Pythagorean triplet is a set of three natural numbers, `a` `b` `c`, for which,

`a`

^{2}+

`b`

^{2}=

`c`

^{2}

For example, 3^{2} + 4^{2} = 9 + 16 = 25 = 5^{2}.

There exists exactly one Pythagorean triplet for which `a` + `b` + `c` = 1000.

Find the product `abc`.

.syntax unified .equ limit,1000 .align 4 icount .req r4 @ use of i, j and k as loop indices is historic :-) jcount .req r5 @ we use icount for a, jcount for b and kcount for c kcount .req r6 tmp .req r8 jksum .req r9 .section .rodata resstring: .asciz "%d\n" .text .global main .type main, %function main: stmfd sp!, {r4-r9, lr} ldr icount, =limit istart: subs jcount, icount, 1 @ i > j so start j at i-1 beq nexti jstart: subs kcount, jcount, 1 @ j > k too beq nextj add tmp, icount, jcount kstart: add tmp, tmp, kcount cmp tmp, #limit bne nextk mul jksum, kcount, kcount @ we re-use the register later as jksum (this is ksquared here) @ we could also alias ksquared to r9 - but then we have to avoid @ collision with jksum mul tmp, jcount, jcount @ could alias jsquared to r8 (or just re-use tmp) as here add jksum, jksum, tmp mul tmp, icount, icount @ and also could alias isquared 9but choose to reuse tmp) cmp jksum, tmp @ this is the important comparison - if i*i == (j*j) + (k*k) beq printme @ then leave since there is only one solution nextk: add tmp, icount, jcount subs kcount, kcount, 1 bne kstart nextj: subs jcount, jcount, 1 bne jstart nexti: subs icount, icount, 1 bne istart printme: mul tmp, icount, jcount mul tmp, tmp, kcount mov r1, tmp ldr r0, =resstring bl printf mov r0, 0 ldmfd sp!, {r4-r9, pc} mov r7, 1 swi 0