Annotation of sys/arch/mvme88k/stand/sboot/oc_cksum.S, Revision 1.1
1.1 ! nbrk 1: | $OpenBSD: oc_cksum.S,v 1.3 2006/05/16 22:52:26 miod Exp $
! 2:
! 3: | Copyright (c) 1988 Regents of the University of California.
! 4: | All rights reserved.
! 5: |
! 6: | Redistribution and use in source and binary forms, with or without
! 7: | modification, are permitted provided that the following conditions
! 8: | are met:
! 9: | 1. Redistributions of source code must retain the above copyright
! 10: | notice, this list of conditions and the following disclaimer.
! 11: | 2. Redistributions in binary form must reproduce the above copyright
! 12: | notice, this list of conditions and the following disclaimer in the
! 13: | documentation and/or other materials provided with the distribution.
! 14: | 3. All advertising materials mentioning features or use of this software
! 15: | must display the following acknowledgement:
! 16: | This product includes software developed by the University of
! 17: | California, Berkeley and its contributors.
! 18: | 4. Neither the name of the University nor the names of its contributors
! 19: | may be used to endorse or promote products derived from this software
! 20: | without specific prior written permission.
! 21: |
! 22: | THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
! 23: | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
! 24: | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
! 25: | ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
! 26: | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
! 27: | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
! 28: | OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
! 29: | HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
! 30: | LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
! 31: | OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
! 32: | SUCH DAMAGE.
! 33: |
! 34: | @(#)oc_cksum.s 7.2 (Berkeley) 11/3/90
! 35: |
! 36: |
! 37: | oc_cksum: ones complement 16 bit checksum for MC68020.
! 38: |
! 39: | oc_cksum (buffer, count, strtval)
! 40: |
! 41: | Do a 16 bit ones complement sum of 'count' bytes from 'buffer'.
! 42: | 'strtval' is the starting value of the sum (usually zero).
! 43: |
! 44: | It simplifies life in in_cksum if strtval can be >= 2^16.
! 45: | This routine will work as long as strtval is < 2^31.
! 46: |
! 47: | Performance
! 48: | -----------
! 49: | This routine is intended for MC 68020s but should also work
! 50: | for 68030s. It (deliberately) does not worry about the alignment
! 51: | of the buffer so will only work on a 68010 if the buffer is
! 52: | aligned on an even address. (Also, a routine written to use
! 53: | 68010 "loop mode" would almost certainly be faster than this
! 54: | code on a 68010).
! 55: |
! 56: | We do not worry about alignment because this routine is frequently
! 57: | called with small counts: 20 bytes for IP header checksums and 40
! 58: | bytes for TCP ack checksums. For these small counts, testing for
! 59: | bad alignment adds ~10% to the per-call cost. Since, by the nature
! 60: | of the kernel allocator, the data we are called with is almost
! 61: | always longword aligned, there is no benefit to this added cost
! 62: | and we are better off letting the loop take a big performance hit
! 63: | in the rare cases where we are handed an unaligned buffer.
! 64: |
! 65: | Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
! 66: | tested on random data on four different types of processors (see
! 67: | list below -- 64 was the largest unrolling because anything more
! 68: | overflows the 68020 Icache). On all the processors, the
! 69: | throughput asymptote was located between 8 and 16 (closer to 8).
! 70: | However, 16 was substantially better than 8 for small counts.
! 71: | (It is clear why this happens for a count of 40: unroll-8 pays a
! 72: | loop branch cost and unroll-16 does not. But the tests also showed
! 73: | that 16 was better than 8 for a count of 20. It is not obvious to
! 74: | me why.) So, since 16 was good for both large and small counts,
! 75: | the loop below is unrolled 16 times.
! 76: |
! 77: | The processors tested and their average time to checksum 1024 bytes
! 78: | of random data were:
! 79: | Sun 3/50 (15MHz) 190 us/KB
! 80: | Sun 3/180 (16.6MHz) 175 us/KB
! 81: | Sun 3/60 (20MHz) 134 us/KB
! 82: | Sun 3/280 (25MHz) 95 us/KB
! 83: |
! 84: | The cost of calling this routine was typically 10% of the per-
! 85: | kilobyte cost. E.g., checksumming zero bytes on a 3/60 cost 9us
! 86: | and each additional byte cost 125ns. With the high fixed cost,
! 87: | it would clearly be a gain to "inline" this routine -- the
! 88: | subroutine call adds 400% overhead to an IP header checksum.
! 89: | However, in absolute terms, inlining would only gain 10us per
! 90: | packet -- a 1% effect for a 1ms ethernet packet. This is not
! 91: | enough gain to be worth the effort.
! 92:
! 93: #include <machine/asm.h>
! 94:
! 95: .text
! 96:
! 97: .text; .even; .globl _oc_cksum; _oc_cksum:
! 98: movl sp@(4),a0 | get buffer ptr
! 99: movl sp@(8),d1 | get byte count
! 100: movl sp@(12),d0 | get starting value
! 101: movl d2,sp@- | free a reg
! 102:
! 103: | test for possible 1, 2 or 3 bytes of excess at end
! 104: | of buffer. The usual case is no excess (the usual
! 105: | case is header checksums) so we give that the faster
! 106: | 'not taken' leg of the compare. (We do the excess
! 107: | first because we are about to trash the low order
! 108: | bits of the count in d1.)
! 109:
! 110: btst #0,d1
! 111: jne L5 | if one or three bytes excess
! 112: btst #1,d1
! 113: jne L7 | if two bytes excess
! 114: L1:
! 115: movl d1,d2
! 116: lsrl #6,d1 | make cnt into # of 64 byte chunks
! 117: andl #0x3c,d2 | then find fractions of a chunk
! 118: negl d2
! 119: andb #0xf,cc | clear X
! 120: jmp pc@(L3-.-2:b,d2)
! 121: L2:
! 122: movl a0@+,d2
! 123: addxl d2,d0
! 124: movl a0@+,d2
! 125: addxl d2,d0
! 126: movl a0@+,d2
! 127: addxl d2,d0
! 128: movl a0@+,d2
! 129: addxl d2,d0
! 130: movl a0@+,d2
! 131: addxl d2,d0
! 132: movl a0@+,d2
! 133: addxl d2,d0
! 134: movl a0@+,d2
! 135: addxl d2,d0
! 136: movl a0@+,d2
! 137: addxl d2,d0
! 138: movl a0@+,d2
! 139: addxl d2,d0
! 140: movl a0@+,d2
! 141: addxl d2,d0
! 142: movl a0@+,d2
! 143: addxl d2,d0
! 144: movl a0@+,d2
! 145: addxl d2,d0
! 146: movl a0@+,d2
! 147: addxl d2,d0
! 148: movl a0@+,d2
! 149: addxl d2,d0
! 150: movl a0@+,d2
! 151: addxl d2,d0
! 152: movl a0@+,d2
! 153: addxl d2,d0
! 154: L3:
! 155: dbra d1,L2 | (NB- dbra does not affect X)
! 156:
! 157: movl d0,d1 | fold 32 bit sum to 16 bits
! 158: swap d1 | (NB- swap does not affect X)
! 159: addxw d1,d0
! 160: jcc L4
! 161: addw #1,d0
! 162: L4:
! 163: andl #0xffff,d0
! 164: movl sp@+,d2
! 165: rts
! 166:
! 167: L5: | deal with 1 or 3 excess bytes at the end of the buffer.
! 168: btst #1,d1
! 169: jeq L6 | if 1 excess
! 170:
! 171: | 3 bytes excess
! 172: clrl d2
! 173: movw a0@(-3,d1:l),d2 | add in last full word then drop
! 174: addl d2,d0 | through to pick up last byte
! 175:
! 176: L6: | 1 byte excess
! 177: clrl d2
! 178: movb a0@(-1,d1:l),d2
! 179: lsll #8,d2
! 180: addl d2,d0
! 181: jra L1
! 182:
! 183: L7: | 2 bytes excess
! 184: clrl d2
! 185: movw a0@(-2,d1:l),d2
! 186: addl d2,d0
! 187: jra L1
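
For readers who do not follow 68020 assembly, the arithmetic performed by
oc_cksum (buffer, count, strtval) can be sketched in portable C. This is a
minimal reference sketch, not part of the OpenBSD source: the name
oc_cksum_ref and its C signature are hypothetical, it reads the buffer as
big-endian 16 bit words (the byte order the 68020 sees), and it pads an odd
trailing byte with a zero low byte as the L6 path above does. It makes no
attempt to reproduce the unrolled loop or its performance, and it is not
guaranteed to pick the same representation of ones complement zero (0 versus
0xffff) as the assembly in every case.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical portable counterpart of oc_cksum(buffer, count, strtval).
 * Returns the folded 16 bit ones complement sum (not complemented).
 */
uint32_t oc_cksum_ref(const uint8_t *buf, size_t count, uint32_t strtval)
{
    uint64_t sum = strtval;   /* wide accumulator, so no carry is lost */
    size_t i;

    /* Add each full big-endian 16 bit word. */
    for (i = 0; i + 1 < count; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

    /* An odd trailing byte counts as the high byte of a final word. */
    if (i < count)
        sum += (uint32_t)buf[i] << 8;

    /* Fold the carries back in until the sum fits in 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint32_t)sum;
}
```

The value stored in an IP header checksum field is the ones complement of
this sum taken over the header with the checksum field zeroed, i.e.
(~oc_cksum_ref(hdr, 20, 0)) & 0xffff. Summing a header whose checksum field
is already filled in correctly yields 0xffff.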