Annotation of sys/arch/mvme68k/stand/sboot/oc_cksum.S, Revision 1.1.1.1
1.1 nbrk 1: | $OpenBSD: oc_cksum.S,v 1.4 2003/06/04 16:36:14 deraadt Exp $
2:
3: | Copyright (c) 1988 Regents of the University of California.
4: | All rights reserved.
5: |
6: | Redistribution and use in source and binary forms, with or without
7: | modification, are permitted provided that the following conditions
8: | are met:
9: | 1. Redistributions of source code must retain the above copyright
10: | notice, this list of conditions and the following disclaimer.
11: | 2. Redistributions in binary form must reproduce the above copyright
12: | notice, this list of conditions and the following disclaimer in the
13: | documentation and/or other materials provided with the distribution.
14: | 3. Neither the name of the University nor the names of its contributors
15: | may be used to endorse or promote products derived from this software
16: | without specific prior written permission.
17: |
18: | THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
19: | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
20: | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
21: | ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
22: | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
23: | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
24: | OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
25: | HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
26: | LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
27: | OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
28: | SUCH DAMAGE.
29: |
30: | @(#)oc_cksum.s 7.2 (Berkeley) 11/3/90
31: |
32: |
33: | oc_cksum: ones complement 16 bit checksum for MC68020.
34: |
35: | oc_cksum (buffer, count, strtval)
36: |
37: | Do a 16 bit ones complement sum of 'count' bytes from 'buffer'.
38: | 'strtval' is the starting value of the sum (usually zero).
39: |
40: | It simplifies life in in_cksum if strtval can be >= 2^16.
41: | This routine will work as long as strtval is < 2^31.
42: |
43: | Performance
44: | -----------
45: | This routine is intended for MC 68020s but should also work
46: | for 68030s. It (deliberately) does not worry about the alignment
47: | of the buffer so will only work on a 68010 if the buffer is
48: | aligned on an even address. (Also, a routine written to use
49: | 68010 "loop mode" would almost certainly be faster than this
50: | code on a 68010).
51: |
52: | We do not worry about alignment because this routine is frequently
53: | called with small counts: 20 bytes for IP header checksums and 40
54: | bytes for TCP ack checksums. For these small counts, testing for
55: | bad alignment adds ~10% to the per-call cost. Since, by the nature
56: | of the kernel allocator, the data we are called with is almost
57: | always longword aligned, there is no benefit to this added cost
58: | and we are better off letting the loop take a big performance hit
59: | in the rare cases where we are handed an unaligned buffer.
60: |
61: | Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
62: | tested on random data on four different types of processors (see
63: | list below -- 64 was the largest unrolling because anything more
64: | overflows the 68020 Icache). On all the processors, the
65: | throughput asymptote was located between 8 and 16 (closer to 8).
66: | However, 16 was substantially better than 8 for small counts.
67: | (It is clear why this happens for a count of 40: unroll-8 pays a
68: | loop branch cost and unroll-16 does not. But the tests also showed
69: | that 16 was better than 8 for a count of 20. It is not obvious to
70: | me why.) So, since 16 was good for both large and small counts,
71: | the loop below is unrolled 16 times.
72: |
73: | The processors tested and their average time to checksum 1024 bytes
74: | of random data were:
75: | Sun 3/50 (15MHz) 190 us/KB
76: | Sun 3/180 (16.6MHz) 175 us/KB
77: | Sun 3/60 (20MHz) 134 us/KB
78: | Sun 3/280 (25MHz) 95 us/KB
79: |
80: | The cost of calling this routine was typically 10% of the per-
81: | kilobyte cost. E.g., checksumming zero bytes on a 3/60 cost 9us
82: | and each additional byte cost 125ns. With the high fixed cost,
83: | it would clearly be a gain to "inline" this routine -- the
84: | subroutine call adds 400% overhead to an IP header checksum.
85: | However, in absolute terms, inlining would only gain 10us per
86: | packet -- a 1% effect for a 1ms ethernet packet. This is not
87: | enough gain to be worth the effort.
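The overhead figures in the comment above can be checked with a little arithmetic. The sketch below assumes a simple linear cost model built from the two Sun 3/60 numbers quoted there (9 us for zero bytes, 125 ns per additional byte). The model and the helper name `cksum_cost_us` are illustrative assumptions and do not appear in the original source.

```c
#include <stddef.h>

/* Figures quoted in the comment above for a Sun 3/60 (20MHz). */
#define FIXED_COST_US 9.0    /* cost of checksumming zero bytes */
#define PER_BYTE_US   0.125  /* 125 ns for each additional byte */

/*
 * Hypothetical helper: estimated time, in microseconds, to checksum
 * n bytes, assuming cost grows linearly with the byte count.
 */
double cksum_cost_us(size_t n)
{
    return FIXED_COST_US + PER_BYTE_US * (double)n;
}
```

Under this model a 20-byte IP header costs 11.5 us, of which only 2.5 us is per-byte work, so the fixed cost is about 3.6 times the useful work, in line with the "400% overhead" remark. For 1024 bytes the model gives 137 us, close to the measured 134 us/KB.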
88:
89: #include <machine/asm.h>
90:
91: .text
92:
93: .text; .even; .globl _oc_cksum; _oc_cksum:
94: movl sp@(4),a0 | get buffer ptr
95: movl sp@(8),d1 | get byte count
96: movl sp@(12),d0 | get starting value
97: movl d2,sp@- | free a reg
98:
99: | test for possible 1, 2 or 3 bytes of excess at end
100: | of buffer. The usual case is no excess (the usual
101: | case is header checksums) so we give that the faster
102: | 'not taken' leg of the compare. (We do the excess
103: | first because we are about to trash the low order
104: | bits of the count in d1.)
105:
106: btst #0,d1
107: jne L5 | if one or three bytes excess
108: btst #1,d1
109: jne L7 | if two bytes excess
110: L1:
111: movl d1,d2
112: lsrl #6,d1 | make cnt into # of 64 byte chunks
113: andl #0x3c,d2 | then find fractions of a chunk
114: negl d2
115: andb #0xf,cc | clear X
116: jmp pc@(L3-.-2:b,d2)
117: L2:
118: movl a0@+,d2
119: addxl d2,d0
120: movl a0@+,d2
121: addxl d2,d0
122: movl a0@+,d2
123: addxl d2,d0
124: movl a0@+,d2
125: addxl d2,d0
126: movl a0@+,d2
127: addxl d2,d0
128: movl a0@+,d2
129: addxl d2,d0
130: movl a0@+,d2
131: addxl d2,d0
132: movl a0@+,d2
133: addxl d2,d0
134: movl a0@+,d2
135: addxl d2,d0
136: movl a0@+,d2
137: addxl d2,d0
138: movl a0@+,d2
139: addxl d2,d0
140: movl a0@+,d2
141: addxl d2,d0
142: movl a0@+,d2
143: addxl d2,d0
144: movl a0@+,d2
145: addxl d2,d0
146: movl a0@+,d2
147: addxl d2,d0
148: movl a0@+,d2
149: addxl d2,d0
150: L3:
151: dbra d1,L2 | (NB- dbra does not affect X)
152:
153: movl d0,d1 | fold 32 bit sum to 16 bits
154: swap d1 | (NB- swap does not affect X)
155: addxw d1,d0
156: jcc L4
157: addw #1,d0
158: L4:
159: andl #0xffff,d0
160: movl sp@+,d2
161: rts
162:
163: L5: | deal with 1 or 3 excess bytes at the end of the buffer.
164: btst #1,d1
165: jeq L6 | if 1 excess
166:
167: | 3 bytes excess
168: clrl d2
169: movw a0@(-3,d1:l),d2 | add in last full word then drop
170: addl d2,d0 | through to pick up last byte
171:
172: L6: | 1 byte excess
173: clrl d2
174: movb a0@(-1,d1:l),d2
175: lsll #8,d2
176: addl d2,d0
177: jra L1
178:
179: L7: | 2 bytes excess
180: clrl d2
181: movw a0@(-2,d1:l),d2
182: addl d2,d0
183: jra L1
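For comparison, here is a minimal C sketch of the same computation, useful as a reference when testing the assembly routine. It is not part of the original file: the name `oc_cksum_ref` and the 64-bit accumulator are assumptions made for illustration. It sums big-endian 16-bit words (the byte order the 68k code sees), treats a trailing odd byte as the high half of a final word as the L6 path does, and folds the carries back in rather than using the X flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference version of oc_cksum: 16-bit ones complement
 * sum of 'count' bytes from 'buf', starting from 'start'.
 * The result is the folded sum, not its complement.
 */
uint16_t oc_cksum_ref(const uint8_t *buf, size_t count, uint32_t start)
{
    uint64_t sum = start;   /* wide accumulator, so carries are never lost */
    size_t i = 0;

    assert(buf != NULL || count == 0);

    /* Add each full 16-bit word, high byte first. */
    for (; i + 1 < count; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

    /* A trailing odd byte counts as the high byte of a final word. */
    if (i < count)
        sum += (uint32_t)buf[i] << 8;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

As a usage example, the eight bytes `00 01 f2 03 f4 f5 f6 f7` with a starting value of zero sum to `0xddf2`. The assembly adds 32-bit longwords with `addxl` and folds once at the end; because 2^16 is congruent to 1 modulo 0xffff, both approaches are intended to yield the same 16-bit result.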