
Annotation of sys/arch/mvme68k/stand/sboot/oc_cksum.S, Revision 1.1.1.1

1.1       nbrk        1: |      $OpenBSD: oc_cksum.S,v 1.4 2003/06/04 16:36:14 deraadt Exp $
                      2:
                      3: | Copyright (c) 1988 Regents of the University of California.
                      4: | All rights reserved.
                      5: |
                      6: | Redistribution and use in source and binary forms, with or without
                      7: | modification, are permitted provided that the following conditions
                      8: | are met:
                      9: | 1. Redistributions of source code must retain the above copyright
                     10: |    notice, this list of conditions and the following disclaimer.
                     11: | 2. Redistributions in binary form must reproduce the above copyright
                     12: |    notice, this list of conditions and the following disclaimer in the
                     13: |    documentation and/or other materials provided with the distribution.
                     14: | 3. Neither the name of the University nor the names of its contributors
                     15: |    may be used to endorse or promote products derived from this software
                     16: |    without specific prior written permission.
                     17: |
                     18: | THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
                     19: | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
                     20: | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
                     21: | ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
                     22: | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
                     23: | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
                     24: | OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
                     25: | HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
                     26: | LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
                     27: | OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
                     28: | SUCH DAMAGE.
                     29: |
                     30: |      @(#)oc_cksum.s  7.2 (Berkeley) 11/3/90
                     31: |
                     32: |
                     33: | oc_cksum: ones complement 16 bit checksum for MC68020.
                     34: |
                     35: | oc_cksum (buffer, count, strtval)
                     36: |
                     37: | Do a 16 bit ones complement sum of 'count' bytes from 'buffer'.
                     38: | 'strtval' is the starting value of the sum (usually zero).
                     39: |
                     40: | It simplifies life in in_cksum if strtval can be >= 2^16.
                     41: | This routine will work as long as strtval is < 2^31.
                     42: |
                     43: | Performance
                     44: | -----------
                     45: | This routine is intended for MC 68020s but should also work
                     46: | for 68030s.  It (deliberately) does not worry about the alignment
                     47: | of the buffer so will only work on a 68010 if the buffer is
                     48: | aligned on an even address.  (Also, a routine written to use
                     49: | 68010 "loop mode" would almost certainly be faster than this
                     50: | code on a 68010).
                     51: |
                     52: | We do not worry about alignment because this routine is frequently
                     53: | called with small counts: 20 bytes for IP header checksums and 40
                     54: | bytes for TCP ack checksums.  For these small counts, testing for
                     55: | bad alignment adds ~10% to the per-call cost.  Since, by the nature
                     56: | of the kernel allocator, the data we are called with is almost
                     57: | always longword aligned, there is no benefit to this added cost
                     58: | and we are better off letting the loop take a big performance hit
                     59: | in the rare cases where we are handed an unaligned buffer.
                     60: |
                     61: | Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
                     62: | tested on random data on four different types of processors (see
                     63: | list below -- 64 was the largest unrolling because anything more
                     64: | overflows the 68020 Icache).  On all the processors, the
                     65: | throughput asymptote was located between 8 and 16 (closer to 8).
                     66: | However, 16 was substantially better than 8 for small counts.
                     67: | (It is clear why this happens for a count of 40: unroll-8 pays a
                     68: | loop branch cost and unroll-16 does not.  But the tests also showed
                     69: | that 16 was better than 8 for a count of 20.  It is not obvious to
                     70: | me why.)  So, since 16 was good for both large and small counts,
                     71: | the loop below is unrolled 16 times.
                     72: |
                     73: | The processors tested and their average time to checksum 1024 bytes
                     74: | of random data were:
                     75: |      Sun 3/50 (15MHz)        190 us/KB
                     76: |      Sun 3/180 (16.6MHz)     175 us/KB
                     77: |      Sun 3/60 (20MHz)        134 us/KB
                     78: |      Sun 3/280 (25MHz)        95 us/KB
                     79: |
                     80: | The cost of calling this routine was typically 10% of the per-
                     81: | kilobyte cost.  E.g., checksumming zero bytes on a 3/60 cost 9us
                     82: | and each additional byte cost 125ns.  With the high fixed cost,
                     83: | it would clearly be a gain to "inline" this routine -- the
                     84: | subroutine call adds 400% overhead to an IP header checksum.
                     85: | However, in absolute terms, inlining would only gain 10us per
                     86: | packet -- a 1% effect for a 1ms ethernet packet.  This is not
                     87: | enough gain to be worth the effort.
                     88:
                     89: #include <machine/asm.h>
                     90:
                     91:        .text
                     92:
                     93:        .text; .even; .globl _oc_cksum; _oc_cksum:
                     94:        movl    sp@(4),a0       | get buffer ptr
                     95:        movl    sp@(8),d1       | get byte count
                     96:        movl    sp@(12),d0      | get starting value
                     97:        movl    d2,sp@-         | free a reg
                     98:
                     99:        | test for possible 1, 2 or 3 bytes of excess at end
                    100:        | of buffer.  The usual case is no excess (the usual
                    101:        | case is header checksums) so we give that the faster
                    102:        | 'not taken' leg of the compare.  (We do the excess
                    103:        | first because we are about to trash the low order
                    104:        | bits of the count in d1.)
                    105:
                    106:        btst    #0,d1
                    107:        jne     L5              | if one or three bytes excess
                    108:        btst    #1,d1
                    109:        jne     L7              | if two bytes excess
                    110: L1:
                    111:        movl    d1,d2
                    112:        lsrl    #6,d1           | make cnt into # of 64 byte chunks
                    113:        andl    #0x3c,d2        | then find fractions of a chunk
                    114:        negl    d2              | negate: we jump backward from L3
                    115:        andb    #0xf,cc         | clear X
                    116:        jmp     pc@(L3-.-2:b,d2) | enter loop part-way (4 code bytes per longword)
                    117: L2:
                    118:        movl    a0@+,d2
                    119:        addxl   d2,d0
                    120:        movl    a0@+,d2
                    121:        addxl   d2,d0
                    122:        movl    a0@+,d2
                    123:        addxl   d2,d0
                    124:        movl    a0@+,d2
                    125:        addxl   d2,d0
                    126:        movl    a0@+,d2
                    127:        addxl   d2,d0
                    128:        movl    a0@+,d2
                    129:        addxl   d2,d0
                    130:        movl    a0@+,d2
                    131:        addxl   d2,d0
                    132:        movl    a0@+,d2
                    133:        addxl   d2,d0
                    134:        movl    a0@+,d2
                    135:        addxl   d2,d0
                    136:        movl    a0@+,d2
                    137:        addxl   d2,d0
                    138:        movl    a0@+,d2
                    139:        addxl   d2,d0
                    140:        movl    a0@+,d2
                    141:        addxl   d2,d0
                    142:        movl    a0@+,d2
                    143:        addxl   d2,d0
                    144:        movl    a0@+,d2
                    145:        addxl   d2,d0
                    146:        movl    a0@+,d2
                    147:        addxl   d2,d0
                    148:        movl    a0@+,d2
                    149:        addxl   d2,d0
                    150: L3:
                    151:        dbra    d1,L2           | (NB- dbra does not affect X)
                    152:
                    153:        movl    d0,d1           | fold 32 bit sum to 16 bits
                    154:        swap    d1              | (NB- swap does not affect X)
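                    155:        addxw   d1,d0           | add high word, low word and pending X
                    156:        jcc     L4
                    157:        addw    #1,d0           | end-around carry from the fold
                    158: L4:
                    159:        andl    #0xffff,d0      | return the 16 bit sum only
                    160:        movl    sp@+,d2         | restore d2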
                    161:        rts
                    162:
                    163: L5:    | deal with 1 or 3 excess bytes at the end of the buffer.
                    164:        btst    #1,d1
                    165:        jeq     L6              | if 1 excess
                    166:
                    167:        | 3 bytes excess
                    168:        clrl    d2
                    169:        movw    a0@(-3,d1:l),d2 | add in last full word then drop
                    170:        addl    d2,d0           |  through to pick up last byte
                    171:
                    172: L6:    | 1 byte excess
                    173:        clrl    d2
                    174:        movb    a0@(-1,d1:l),d2
                    175:        lsll    #8,d2
                    176:        addl    d2,d0
                    177:        jra     L1
                    178:
                    179: L7:    | 2 bytes excess
                    180:        clrl    d2
                    181:        movw    a0@(-2,d1:l),d2
                    182:        addl    d2,d0
                    183:        jra     L1
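
For readers who do not read 68k assembly, the arithmetic described in the header comment can be sketched in portable C. This is a minimal illustration, not part of the original file: the name oc_cksum_ref, its types, and the 64-bit accumulator are assumptions made for clarity, and the sketch has not been checked bit-for-bit against the assembly in every corner case. It assumes big-endian word order, as on the 68k, and like oc_cksum it returns the folded sum without complementing it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable counterpart of oc_cksum (name and types are
 * illustrative only).  Returns the 16 bit ones complement sum of
 * 'count' bytes at 'buf', starting from 'strtval'. */
uint16_t oc_cksum_ref(const uint8_t *buf, size_t count, uint32_t strtval)
{
    uint64_t sum = strtval;   /* wide accumulator, so no carry is lost */
    size_t i;

    /* Add the buffer as big-endian 16 bit words. */
    for (i = 0; i + 1 < count; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

    /* A trailing odd byte is the high half of a zero-padded word. */
    if (count & 1)
        sum += (uint32_t)buf[count - 1] << 8;

    /* Fold the carries back in (end-around carry) until 16 bits remain. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

Summing a 20 byte IP header whose checksum field is already filled in should give 0xffff. The assembly reaches the same kind of result by adding longwords with addxl and folding once at the end, which avoids a per-word fold inside the loop.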
