c - Fastest way to work with unaligned data on a word-aligned processor?
I'm doing a project on an ARM Cortex-M0, which does not support unaligned (by 4 bytes) access, and I'm trying to optimize the speed of operations on unaligned data.
I'm storing Bluetooth Low Energy access addresses (48-bit) as 6-byte arrays in some packed structs acting as packet buffers. Because of the packing, the BLE addresses do not necessarily start at a word-aligned address, and I'm running into some complications when optimizing my access functions for these addresses.
The first, and most obvious, approach is a for loop operating on each byte in the array individually. Checking if two addresses are the same could for instance be done like this:
    uint8_t ble_adv_addr_is_equal(uint8_t* addr1, uint8_t* addr2)
    {
        /* Compare the six address bytes one at a time. */
        for (uint32_t i = 0; i < 6; ++i)
        {
            if (addr1[i] != addr2[i])
                return 0;
        }
        return 1;
    }

I'm doing a lot of comparisons in my project, and I wanted to see if I could squeeze some more speed out of this function. I realised that for aligned addresses, I could cast them to uint64_t and compare with 48-bit masks applied, i.e.
    (*(uint64_t *)&addr1[0] & 0xffffffffffff) == (*(uint64_t *)&addr2[0] & 0xffffffffffff)

Similar operations could be done for writing, and it works well for aligned versions. However, since my addresses aren't always word-aligned (or even half-word aligned), I would have to do some extra tricks to make this work.
First off, I came up with this unoptimized nightmare of a compiler macro:
    #define addr_aligned(_addr) (uint64_t)(((*((uint64_t*)(((uint32_t)_addr) & ~0x03)) >> (8*(((uint32_t)_addr) & 0x03))) & 0x000000ffffffff)\
     | (((*((uint64_t*)(((uint32_t)_addr+4) & ~0x03))) << (32-8*(((uint32_t)_addr) & 0x03)))) & 0x00ffff00000000)

It basically shifts the entire address to start at the previous word-aligned memory position, regardless of offset. For instance:
        0       1       2       3
    |-------|-------|-------|-------|
    |.......|.......|.......|<addr0>|
    |<addr1>|<addr2>|<addr3>|<addr4>|
    |<addr5>|.......|.......|.......|

becomes
        0       1       2       3
    |-------|-------|-------|-------|
    |<addr0>|<addr1>|<addr2>|<addr3>|
    |<addr4>|<addr5>|.......|.......|
    |.......|.......|.......|.......|

and I can safely do a 64-bit comparison of two addresses, regardless of their actual alignment:
    addr_aligned(addr1) == addr_aligned(addr2)

Neat! But this operation takes 71 lines of assembly when compiled with ARM-MDK, compared to 53 when doing the comparison in a simple loop (I'm just going to disregard the additional time spent in the branch instructions here), and ~30 when unrolled. Also, it doesn't work for writes, as the alignment only happens in the registers, not in memory. Unaligning again would require a similar operation, and the whole approach generally seems to suck.
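For reference, here is a rough sketch of what an unrolled byte-wise comparison could look like. The function name is made up for illustration, and this is not the code that was measured above. It folds the six byte comparisons into one branch-free expression, so it should work at any alignment, but the instruction count will depend on the compiler and its settings.

```c
#include <stdint.h>

/* Hypothetical name, for illustration only.
 * XOR yields 0 for equal bytes; OR-ing the six results together gives 0
 * only when every byte pair matches. Only byte loads are used, so the
 * pointers may have any alignment. */
static inline uint8_t ble_adv_addr_is_equal_unrolled(const uint8_t *addr1,
                                                     const uint8_t *addr2)
{
    return (uint8_t)(((addr1[0] ^ addr2[0]) |
                      (addr1[1] ^ addr2[1]) |
                      (addr1[2] ^ addr2[2]) |
                      (addr1[3] ^ addr2[3]) |
                      (addr1[4] ^ addr2[4]) |
                      (addr1[5] ^ addr2[5])) == 0);
}
```

Whether this beats a loop with an early return depends on how often the addresses differ in their first byte, so it is worth checking against the generated assembly.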
Is an unrolled for-loop working on each byte individually really the fastest solution for cases like this? Does anyone have experience with similar scenarios, and feel like sharing some of their wizardry here?
Update
OK, because your data has no alignment whatsoever, you need to either read all the data in, byte by byte, into properly aligned buffers and then do really fast 64-bit compares, or, if you won't be using the data after the compares, just read in the data as bytes and do 6 compares, in which case calling memcmp() might be a better option.
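A minimal sketch of both options, assuming nothing about the pointers' alignment. The helper names are hypothetical and not from the original post. The first gathers the six bytes into an aligned uint64_t so that later comparisons are plain 64-bit compares; the second just defers to memcmp().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy six bytes from any address into an aligned
 * 64-bit value. memcpy makes no alignment assumption about its source,
 * and the two bytes it does not write stay zero. */
static inline uint64_t ble_addr_load48(const uint8_t *addr)
{
    uint64_t value = 0;
    memcpy(&value, addr, 6);
    return value;
}

/* Hypothetical helper: one-off comparison straight from the packet buffers. */
static inline int ble_addr_equal_memcmp(const uint8_t *addr1, const uint8_t *addr2)
{
    return memcmp(addr1, addr2, 6) == 0;
}
```

The first form only pays off if each loaded value is compared several times, for example when checking one incoming address against a list of cached ones.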
For data that is at least 16-bit aligned:
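    /* Treat each 6-byte address as three 16-bit half-words. */
    u16 *src1 = (u16 *)addr1;
    u16 *src2 = (u16 *)addr2;

    for (int i = 0; i < 3; ++i)
    {
        if (src1[i] != src2[i])
            return 0;
    }
    return 1;

This will be twice as fast as byte comparisons and might be the best you can reasonably do, as long as your data is at least 2-byte aligned. I'd also expect the compiler to remove the for loop completely and just use conditionally executed if statements instead.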
Trying for 32-bit aligned reads will not be faster unless you can guarantee that source 1 and source 2 are similarly aligned, i.e. (addr1 & 0x03) == (addr2 & 0x03). If that is the case, then you can read in a 32-bit value and then a 16-bit value (or vice versa, depending on the starting alignment) and remove one more compare.
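As a rough sketch of that idea, the hypothetical helper below (not from the original post) checks the low two address bits at run time and only takes the wide-load path when both pointers share the same 2-byte-aligned offset. It assumes the toolchain tolerates reading a byte buffer through uint16_t and uint32_t pointers, which is the same assumption the u16 cast makes.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Fast paths need both pointers at the same offset (0 or 2) within a word;
 * any other combination falls back to plain byte compares. */
static int ble_addr_equal_same_align(const uint8_t *addr1, const uint8_t *addr2)
{
    uintptr_t off1 = (uintptr_t)addr1 & 0x03u;
    uintptr_t off2 = (uintptr_t)addr2 & 0x03u;

    if (off1 == off2 && off1 == 0u) {
        /* Word aligned: one 32-bit compare, then one 16-bit compare. */
        return *(const uint32_t *)addr1 == *(const uint32_t *)addr2 &&
               *(const uint16_t *)(addr1 + 4) == *(const uint16_t *)(addr2 + 4);
    }
    if (off1 == off2 && off1 == 2u) {
        /* Half-word aligned: 16-bit first, which leaves addr + 2 word aligned. */
        return *(const uint16_t *)addr1 == *(const uint16_t *)addr2 &&
               *(const uint32_t *)(addr1 + 2) == *(const uint32_t *)(addr2 + 2);
    }
    /* Odd or mismatched alignment: compare byte by byte. */
    for (int i = 0; i < 6; ++i) {
        if (addr1[i] != addr2[i])
            return 0;
    }
    return 1;
}
```

The run-time alignment test costs a few instructions of its own, so this is only likely to help if the packed struct layout makes the fast paths the common case.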
As 16-bit is your shared base, you can start there, and the compiler should generate nice LDRH-type opcodes.