1 \documentclass[10pt,oneside]{article}
10 \usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref}
11 \renewcommand{\ttdefault}{cmtt}
12 \title{The FleetTwo Architecture Manual\\{\normalsize A Programmer's View of Fleet}}
16 \begin{thebibliography}{[GDG01]}
17 \bibitem[IES02]{ies02}
19 \href{http://research.cs.berkeley.edu/class/fleet/docs/people/ivan.e.sutherland/ies02-FLEET-A.Once.Instruction.Computer.pdf}{\it UCB-IES02: Fleet -- A One Instruction Computer}
20 \bibitem[IES14]{ies14}
22 \href{http://research.cs.berkeley.edu/class/fleet/docs/people/ivan.e.sutherland/ies14-Fleet.Definition.pdf}{\it UCB-IES14: The Fleet Definition}
23 \bibitem[IES44]{ies44}
25 \href{http://research.cs.berkeley.edu/class/fleet/docs/people/ivan.e.sutherland/ies44-Fleet.Definition.pdf}{\it UCB-IES44: The Fleet Definition}
26 \bibitem[MS68]{WheelOfReincarnation}
27 Myer, T. and Sutherland, Ivan, {\it On the Design of Display Processors}
28 Communications of the ACM, Vol 11, No. 6, June 1968 .
33 This manual attempts to detail all those aspects of Fleet which are
34 visible to a software programmer or compiler backend.
36 This document assumes that the reader is broadly familiar with the
37 Fleet architecture and the motivation behind it. For a far better
38 introduction to these topics, the reader is referred to \cite{ies02}
39 \cite{ies14} and \cite{ies44}. Please bear in mind that many of the
40 details in these documents have been changed, but they still serve as
41 the authoritative introduction to the architecture. After reviewing
42 these introductory papers, the reader is invited to digest this
43 architecture manual for the latest information on the details
44 necessary to write software and compilers for Fleet.
49 \section*{Introduction (from \cite{ies02})}
51 During the past two and a half years, the Fleet architecture has gone
52 through massive changes and improvements. In spite of this, however,
53 the fundamental idea behind Fleet remains unchanged. As Sutherland
54 writes in the original Berkeley Fleet description \cite{ies02},
57 When computers were new, logic and storage were expensive and wires
58 were relatively cheap. Early computer designers avoided using logic
59 wherever possible but were not greatly concerned with communication.
60 They made design choices consistent with the costs of the day. My
61 favorite example is the jump instruction. Early designers put jump
62 instructions at the end of each block of code to avoid the expense of
63 storing the address of the next block while executing the present one.
65 Today's chip fabrication methods invert the older cost balance. In
66 today's integrated circuits logic and memory are now almost free but
67 communication costs dominate chip area, energy consumption and delay.
68 In spite of these changes in the stuff from which we make computers,
69 vestiges of the past remain in many of today's common microprocessor
70 designs. For example, jump instructions still appear at the end of
71 each basic block of code in spite of the need to pre-fetch the next
72 block while executing this one. It seems better today to store a
73 pointer to the next block and the length of the current block early in
76 Instead of following the path of history, I'd like to listen carefully
77 to what modern chip structures have to teach about how one might build
78 a modern computer. I see three major lessons. First, simplicity can
79 reduce cost. Second, moving data will consume most of the time,
80 energy and chip area of a modern computer. And third, the low cost of
81 logic makes concurrency available if we can figure out how to use it.
85 \item {\bf Simplicity}: The Fleet architecture seeks simplicity by treating
86 all processing devices alike...
88 \item {\bf Communication}: The Fleet architecture seeks to control the
89 cost of communication by putting it under direct programmer
90 control. Thus Fleet avoids instructions like {\sc ADD} or {\sc STORE} that
91 include concealed communication to and from a register file...
93 \item {\bf Concurrency}: The Fleet architecture assumes concurrency
96 \item {\bf Asynchrony}: This provides enormous flexibility for
97 implementers to improve the performance or reduce the cost of
98 the processing devices in a Fleet system. -- Synchronous
99 implementations both of ships and of the switch fabric are
100 possible provided they use validity or occupancy bits to achieve
101 the arbitrary delays required at sources and destinations...
108 \section*{The Programmer's View of The Ship-Fabric Interface}
110 The role of the Fleet switch fabric is to carry {\it packets} from
111 {\it sources} to {\it destinations}.
113 A packet consists of a {\it path} and a {\it payload}. The path is
114 some string of bits which tells the switch fabric how to route from
115 the desired source to the desired destination. The distinction
116 between a {\it path} and a {\it destination} is extremely important!
117 The payload is either a {\it token} or a {\it data item}. If the
118 payload is a data item, it includes one machine word of data.
120 The diagram below represents a {\it programmer's} conceptual view of
121 the interface between ships and the switch fabric. Actual
122 implementation circuitry may differ substantially. Sources and
123 destinations which can send and recieve only tokens -- not data items
124 -- are drawn as dashed lines.
128 \epsfig{file=ports,width=4in}
132 The term {\it port} refers to each interface to the ship, the {\it
133 valve} connecting it to the switch fabric, and the corresponding
134 sources and destinations on the switch fabric.
136 Each valve consists of a {\it data latch}, which is as wide as a
137 single machine word and a {\it pump}, which is a circular fifo of
138 instruction-width latches. The contents of the instruction fifo
139 control the data latch, as future chapters will explain.
141 Note that the pump in each valve has a destination of its own, and
142 that there is no buffering fifo guarding this destination. Note that
143 all other destinations are guarded by a buffering fifo; the size of
144 this fifo is exposed to the software programmer so she can ensure that
145 deadlock does not occur.
149 \section*{Data Formats}
151 \subsection*{Path (12 bits)}
153 These bits appear physically within the switch fabric, and have
154 ``address bit timing.'' The {\tt T} bit is the ``tokenhood'' bit; if
155 set, this packet represents a token and it does not cause the switch
156 fabric data latches to fire.
159 \begin{bytefield}{49}
160 \bitheader[b]{37,47,48}\\
167 \subsection*{Data Word In Memory (37 bits)}
169 A word of data is 37 bits wide.
172 \begin{bytefield}{49}
173 \bitheader[b]{0,36}\\
175 \bitbox{37}{Data Word}
179 \subsection*{Packet In Flight (49 bits)}
182 \begin{bytefield}{49}
183 \bitheader[b]{0,36,37,47,48}\\
186 \bitbox{37}{Data Word}
190 \subsection*{Instruction In Memory (37 bits)}
192 An instruction must be no wider than a memory word. The next section
193 explains the bits in greater detail. The {\tt Instruction Path} is
194 the path which the instruction itself will take in order to arrive at
195 the desired pump destination. The {\it Data/Token Path} is placed on
196 any packet inserted into the switch fabric as a result of executing
200 \begin{bytefield}{49}
201 \bitheader[b]{0,10,11,17,18-26,36}\\
203 \bitbox{11}{Instruction Path}
212 \bitbox{11}{Data/Token Path}
217 \subsection*{Instruction Packet In Flight (49 bits)}
219 Note that Fleet simply copies the {\tt Instruction Path} field, bit
220 for bit, into the packet {\tt Path} field in order to ``dispatch'' an
224 \begin{bytefield}{49}
225 \bitheader[b]{0,10,11,17,18-25,37,47,48}\\
227 \bitbox{11}{Instruction Path}
228 \bitbox{11}{Instruction Path}
237 \bitbox{11}{Data/Token Path}
242 \setlength{\bitwidth}{5mm}
245 \section*{Instruction Formats}
246 \begin{wrapfigure}{R}{2in}
247 \epsfig{file=locations,width=2in}
250 Within the pump's instruction fifo, there are two points of particular
251 interest. The first point is the {\it insertion point}, which is the
252 point in the ring where new instructions enter from the switch fabric.
253 The other point of interest is the {\it execution point}, which is the
254 point in the ring where instructions other than {\tt kill} and {\tt
255 unclog} are performed.
257 The programmer may assume that the insertion point is the immediate
258 successor of the execution point.
264 \begin{bytefield}{26}
265 \bitheader[b]{0,6,7,20-25}\\
274 When a {\tt kill} instruction reaches the insertion point, it will
275 wait there for a non-{\tt clog} instruction to reach the execution
276 point. When this occurs, the instruction at the execution point is
277 retired and the count field of the {\tt kill} instruction is
278 decremented. If the {\tt kill} instruction's count was {\tt 0} before
279 decrementing, the {\tt kill} instruction is retired.
285 \begin{bytefield}{26}
286 \bitheader[b]{0,20-25}\\
294 When a {\tt clog} instruction reaches the execution point, no more
295 instructions will be executed until an {\tt unclog} is performed.
301 \begin{bytefield}{26}
302 \bitheader[b]{0,20-25}\\
310 When an {\tt unclog} instruction reaches the insertion point, it will wait
311 there until a {\tt clog} instruction is at the execution point. When
312 this occurs, both instructions retire. {\it Note:} this means
313 that if an {\tt unclog} instruction enters the instruction fifo
314 without a {\tt clog} ahead of it, the pump's insertion point will become
315 irretrievably blocked.
321 \begin{bytefield}{26}
322 \bitheader[b]{0,6,7,23-25}\\
328 When a literal instruction reaches the execution point, its {\tt
329 Literal} field is sign-extended to a full word length and captured
334 {\bf Normal Instruction (at Execution Point)}
337 \begin{bytefield}{26}
338 \bitheader[b]{0,6,7,16-25}\\
354 \item [\tt Ti] ({\bf Token Input}) wait for a token and accept
355 it\footnote{{\tt Ti}=1,{\tt Di}=1 is invalid on inbox.}
357 \item [\tt Di] ({\bf Data Input}) wait for a datum and accept it.
359 \item [\tt Dc] ({\bf Data Capture}) capture (latch) the accepted
360 datum. This bit is ignored if the incoming packet is
361 a token. \footnote{ Note that {\tt Di}=0,{\tt Dc}=1
362 is meaningless and therefore reserved for other
365 \item [\tt Do] ({\bf Data Output}) emit a datum.
367 \item [\tt To] ({\bf Token Output}) emit a token.\footnote{ {\tt To}=1,{\tt
368 Do}=1 have special meaning on an outbox.}
370 \item [\tt Ig] ({\bf Ignore {\tt To} Until Last Iteration}) ignore
371 the {\tt To} bit unless {\tt Count=0} \footnote{{\tt
372 To}=0,{\tt Ig}=1 is invalid}
374 \item [\tt Rq] ({\bf ReQueue}) if set, instructions having nonzero
375 count are ``Re-Queued'' rather than RePeated. See
376 {\tt Count} for more detail. \footnote{ An
377 instruction {\it in memory} may not have {\tt
378 Rq=1,Count=0} (use {\tt Rq=0,Count=0})}
380 \item [\tt Count] ({\bf Count}) {\it After} executing:
383 discard this instruction
385 if Count < MAX_COUNT {
389 put this instruction back into the instruction fifo
391 execute this instruction again
395 Note how a ``standing'' instruction is encoded as {\tt Count=1111111}
397 \item [\tt Dest] ({\bf Data/Token Destination})
398 Normally, this field is copied into the address portion of any
399 outgoing packet ({\tt Do} on an outbox or {\tt To}).
401 However, in the special case of an outbox, if {\tt Do=1,To=1}, then
402 the {\it most significant} {\tt 11} bits of the value in the {\it
403 data register} are used as a destination address instead.
405 %functionality eliminates the need for any sort of ``{\tt Execute}''
406 %ship, and lets a {\tt Fifo} ship act as an extension of the
407 %instruction queue in the pump.}
413 \section*{Fleet Assembly Language}
415 The Fleet Assembly language is designed to be a human-readable
416 depiction of the bits which comprise a Fleet program. The formal
417 syntax is given below:
419 \verbatiminput{fleet.g}
424 Inspired by JavaDoc, the Fleet software toolchain includes a tool
425 called {\tt FleetDoc}, which processes {\it ship description files}
426 (with the extension {\tt .ship}). These files contain sections for:
429 \item The name of the ship
430 \item A list of its ports
431 \item A prose description of the ship
432 \item A list of the constants associated with a ship
433 \item Java source code for a software simulation of the ship
434 \item Verilog source code for an FPGA simulation of the ship
435 \item A test case for the ship
436 \item A list of the contributors to all of the above
439 Keeping all of this information in a single file ensures that changes
440 to one item (such as the list of ports) are propagated to the other
441 items (such as the verilog code for simulation purposes).
443 Keeping this information in {\it machine-readable} format makes it
444 possible to automatically generate tools which depend on the precise
445 details of ship ports (such as the Fleet assembler) without the risk
446 inherent in manual synchronization of the spec with the
449 The latter half of this manual is generated automatically from the
450 contents of the {\tt .ship} files.
453 \subsection*{An Example FleetDoc File}
457 == Ports ===========================================================
460 constant lsbFirst: ....................................1
461 constant msbFirst: ....................................0
462 constant drop: .............................uuuuuu..
463 constant take: .......................uuuuuu........
466 == TeX ==============================================================
467 The BitFifo internally stores some number of bits in a fifo structure.
468 Its capacity is guaranteed to be at least two full machine words or
472 == Fleeterpreter ====================================================
473 public void service() {
474 if (box_inOp.dataReadyForShip() && box_in.dataReadyForShip()) {
477 == FPGA ==============================================================
478 always @(posedge clk) begin
479 if (!in_r && in_a) in_a <= 0;
480 if (!inOp_r && inOp_a) inOp_a <= 0;
483 == Test ==============================================================
487 #ship bitfifo : BitFifo
489 bitfifo.outOp: literal BitFifo.outOp[take=37]; [*] deliver;
492 == Contributors =========================================================
493 Amir Kamil <kamil@cs.berkeley.edu>
494 Adam Megacz <megacz@cs.berkeley.edu>
500 \subsection*{Representing Code}
501 \subsection*{Representing Ships}
507 - sequencing guarantees
509 - behavior when token arrives at data port, vice versa
510 - overlapping codebags
514 \section*{Future Directions}
516 \subsection*{The Role of the Count Field}
518 Looking back on the design of the pump, several things are now
519 apparent which were not initially. In particular, it seems that it
520 could be useful to separate {\it loading the count register} from
521 other types of instructions. This would have a few advantages:
524 \item The size of the count field would not be a consideration in the
525 ``instruction budget'' of normal execution instructions
527 \item It would be possible to have finitely-repeating,
528 infinitely-requeueing instructions. Various implementation
529 details\footnote{in many implementations, the count field of an
530 instruction is destroyed as it counts down; another register and
531 mux would be needed to ``save'' this value so it could be
532 restored prior to being requeued} make it awkward to support
533 this unless the count register can be manipulated independently
534 of instruction execution.
537 With this in mind, we can refactor the ``essence of the pump'' into
538 the following actions, each of which may or may not be performed by a
539 particular instruction:
542 \item await and acknowledge a token
543 \item await and acknowledge a datum
544 \item load the awaited datum into the data latch
545 \item load the a literal into the data latch
546 \item load a value indicated by the instruction into the path latch
547 \item load the top 11 bits of the data latch into the path latch
548 \item treat the values in the path latch and data latch as a packet and transmit it
549 \item treat the value in the path latch as a token and transmit it
550 \item set the count register to a literal value
551 \item decrement the count register
552 \item requeue this instruction if {\tt count>0}
553 \item requeue this instruction unconditionally
554 \item repeat this instruction if {\tt count>0}
555 \item repeat this instruction unconditionally
558 The crucial change here is the decoupling of the act of {\it loading
559 the count register} from the act of {\it loading the next
560 instruction}. It also separates the act of loading the ``path
561 latch'' from the act of actually performing the transmission. This
562 latter feature makes it possible to load the data and destination
563 latches from two distinct data items, allowing ships to create,
564 handle, and consume {\it packets} in the form of a pair of data items.
566 At this point, it boils down to a question of instruction bit
567 budgeting. Currently we have a separate instruction form for
568 literals, so that the literal can use bits which are not otherwise
569 relevant to literal instructions (for example, {\tt Di}). Adding an
570 additional instruction form for loading the count register would have
574 \subsection*{For Just a Little Bit More...}
576 Compared to early revisions of the Fleet architecture, the valves of
577 FleetTwo are substantially more powerful. One might ask, why not add
578 additional power to the valves, perhaps up to the point of making them
579 tiny von Neumann machines in their own right? Indeed, doing so would
580 risk reliving the {\it Wheel of Reincarnation}
581 \cite{WheelOfReincarnation}.
583 I believe, however, that there is a fundamental distinction to be
584 made between devices whose control flow {\it can branch} and those
585 whose control flow {\it cannot branch}, and that this distinction is
586 founded in the geometry of VLSI, and indeed, Euclidean space.
587 Ultimately, the instruction stream for any component of a processor
588 must exist in some physical incarnation -- some geometric extent. If
589 the control flow of a program is linear, with no branching, then the
590 instruction stream has an optimal spatial mapping which is quite
591 efficient: it is simply laid out in FIFO fashion.
593 By contrast, however, a program whose instruction stream has
594 unrestricted branching is logically a {\it tree} structure. Quite
595 unfortunately, general trees have no efficient mapping onto
596 two-dimensional surfaces. By this, I mean that in every possible
597 mapping, there exists some pair of {\it logically adjacent}
598 instructions whose {\it physical distance} is proportional to the size
599 of the entire program, rather than some constant factor irrespective
600 of the program size. Therefore I conclude that the spatial mapping of
601 a program with unrestricted branching is in an entirely different
602 class than that of programs with linear control flow. I choose this
603 distinction as the basis for limiting our valves to linear control
604 flow sequences. A similar argument can be made for looping constructs
605 in which two or more loops may be nested within a single parent loop.
607 With this distinction in mind, it would not be objectionable for
608 future incarnations of Fleet to have {\it predicated} instructions at
609 the valves. This might take the form of:
612 \item Instructions to unconditionally set some ``state flags''
613 \item Instructions to conditionally set some ``state flags'' based on the contents of the data latch
614 \item Bits which cause an instruction to be ignored or retired based on the state flags
617 I feel that this addition would not risk falling down the slippery
618 slope -- instruction streams with predicated instructions (but not
619 unrestricted branching) preserve the crucial property that they have
620 an efficient spatial mapping. Any steps beyond this, however, cross a
621 fairly fundamental line.