A Compact 3-D VLSI Classifier Using Bagging Threshold Network Ensembles


IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 5, SEPTEMBER 2003

Amine Bermak, Member, IEEE, and Dominique Martinez

Abstract—A bagging ensemble consists of a set of classifiers trained independently and combined by a majority vote. Such a combination improves generalization performance but can require large amounts of memory and computation, a serious drawback for portable real-time pattern recognition applications. We report here a compact three-dimensional (3-D) multiprecision very large-scale integration (VLSI) implementation of a bagging ensemble. In our circuit, individual classifiers are decision trees implemented as threshold networks: one layer of threshold logic units (TLUs) followed by combinatorial logic functions. The hardware was fabricated using 0.7-μm CMOS technology and packaged using MCM-V micro-packaging technology. The 3-D chip implements up to 192 TLUs operating at a speed of up to 48 GCPPS in a volume of 2 × 2 × 0.7 cm³. The 3-D circuit features a high level of programmability and flexibility, offering the possibility of making efficient use of the hardware resources in order to reduce power consumption. Successful operation of the 3-D chip for various precisions and ensemble sizes is demonstrated through an electronic nose application.

Index Terms—Bagging, decision trees, threshold networks, very large-scale integration (VLSI), three-dimensional (3-D) packaging technology.

I. INTRODUCTION

COMBINING multiple classifiers (such as neural networks or decision trees) to build an ensemble is an advanced pattern recognition technique which has gained increasing attention within the machine learning community. Bagging and boosting are two popular methods proposed in order to create accurate ensembles (see the compilation of papers at http://www.boosting.org).
The two methods rely on resampling techniques to obtain different training sets for each of the individual classifiers. The resulting combined classifier is generally more robust and accurate than a single classifier trained on the original dataset. However, ensembles suffer from some shortcomings, as stated by Dietterich [1]: "While ensembles provide very accurate classifiers, there are problems that may limit their practical applications. One problem is that ensembles can require large amounts of memory to store and large amounts of computation to apply." Thus, this scheme can be put to efficient practical use only if good hardware implementation strategies are developed.

In this paper, we describe a proof-of-concept compact three-dimensional (3-D) chip that we believe can meet the computational requirements of bagging ensembles. In our chip, individual classifiers are decision trees implemented as threshold networks (binary neural networks having a layer of threshold logic units (TLUs) followed by combinatorial logic elements). The prototype combines silicon very large-scale integration (VLSI) circuits with compact 3-D packaging technology, whereby the computational power is increased by stacking VLSI chips vertically using a micropackaging technology referred to as multichip-module-vertical (MCM-V). Selective gas detection was used as a test-bed for the 3-D chip operating as a compact and low-power pattern recognition classifier for an electronic nose application. However, novel design features such as multiprecision and hardware reconfigurability were introduced in order to make the 3-D chip a general problem-solving system.

Manuscript received September 15, 2002. A. Bermak is with the Electrical and Electronics Engineering Department, Hong Kong University of Science and Technology, Kowloon, Hong Kong. D. Martinez is with LORIA, Vandoeuvre-Les-Nancy 54506, France. Digital Object Identifier 10.1109/TNN.2003.816362
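To make the memory and computation demands concrete, the resample-and-vote recipe of a bagging ensemble can be sketched in a few lines of Python. This is a minimal sketch, not the paper's software: the one-feature "stump" below stands in for a decision tree, and all names are ours.

```python
import random
from collections import Counter

class Stump:
    """One-feature threshold classifier standing in for a decision tree."""
    def fit(self, data):
        best = None
        for f in range(len(data[0][0])):          # candidate feature
            for x, _ in data:                      # candidate threshold
                for sign in (1, -1):               # orientation of the split
                    err = sum((sign * (xi[f] - x[f]) > 0) != (yi == 1)
                              for xi, yi in data)  # training errors of this split
                    if best is None or err < best[0]:
                        best = (err, f, x[f], sign)
        _, self.f, self.t, self.sign = best
        return self

    def predict(self, x):
        return int(self.sign * (x[self.f] - self.t) > 0)

def bagging_fit(data, n_members=10, seed=0):
    """Train each member on a bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    return [Stump().fit([rng.choice(data) for _ in data])
            for _ in range(n_members)]

def bagging_predict(ensemble, x):
    """Combine the members by a simple majority vote."""
    return Counter(m.predict(x) for m in ensemble).most_common(1)[0][0]
```

Each of the `n_members` classifiers must be stored and evaluated at recall time, which is exactly the memory and computation cost that motivates a dedicated hardware implementation.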
The 3-D chip can be configured to implement, with a programmable precision, any threshold network topology. To the best of our knowledge, this is the very first 3-D VLSI implementation of bagging ensembles. In Section II, the performance of bagging decision trees specified as threshold network ensembles is evaluated for different precision requirements. Section III describes the hardware architecture of the basic chip and its main features, including the multiprecision computation and the reconfigurability concept. Section IV details the VLSI implementation of the basic prototype, the multichip module, and the final 3-D packaged circuit. Section V presents the experimental results and the chip performance operating as a systolic processor as well as a bagging ensemble applied to odor discrimination for electronic nose applications.

II. ALGORITHMIC CONSIDERATIONS

A. Bagging Decision Trees

Bagging [2] is a popular and effective technique for improving classification performance by creating ensembles. Bagging uses random sampling with replacement from the original data set in order to obtain different training sets. Because the sampled data set has the same size as the original one, many of the original examples may be repeated while others may be left out. On average, 63% of the original data appears in the sampled training set [2]. An individual classifier is built on each training set by applying the same learning algorithm. The resulting classifiers are then combined by a simple majority vote. It is well known that bagging significantly improves classifiers that are unstable, in the sense that small perturbations in the training data may result in large changes in the generated classifier [2]. Empirical evaluations have shown that bagging improves decision trees

TABLE I
DATASETS USED IN THE EXPERIMENTS. THE FIRST THREE ARE TAKEN FROM THE UCI REPOSITORY. SEE SECTION V-D FOR EXPLANATIONS ON HOW THE ODOR DATASET WAS OBTAINED

Fig. 1. Equivalence between decision trees and threshold networks. (a) and (c) Two examples of a tree and a threshold network, respectively. (b) and (d) Their respective partitions of the input space. Each node of the tree corresponds to a separating hyperplane and the leaves correspond to a given class. Each class can be represented by a logical function that combines a set of nodes. In our example, it can be seen from the tree structure that class1 = a'c + a'c'd, class2 = ab' + a'c'd', and class3 = ab. Note, however, that the logical function for class 1 extracted by our program is class1 = a'c + a'd, which is better optimized.

or neural networks [3], [4] but does not improve the k-nearest neighbor algorithm [2]. For the nearest neighbor algorithm, a test case may change classification only if its nearest neighbor in the original dataset is not picked in at least half of the sampled training sets. The probability that this occurs gets very small as the number of classifiers within the ensemble gets larger [2]. A similar reasoning can be applied to support vector machines [5], [6], which are stable classifiers. Whether bagging decision trees are more accurate than bagging neural networks depends on the particular data set, but on average they have similar performance and there is no clear evidence to prefer one over the other [3], [4]. However, because decision trees are fast to build with standard procedures (like CART [7], C4.5 [8], or OC1 [9]) and can be interpreted as a series of rules, they are often used in bagging ensembles.

B. Decision Trees as Threshold Networks

Hardware implementation is made easier if one considers decision trees as threshold networks, as evidenced by the example shown in Fig. 1. The decision tree shown in Fig. 1(a) implements a classifier that discriminates between three classes (denoted 1, 2, and 3) shown in Fig. 1(b). Each node in the tree is a TLU implementing a linear discriminant, and each leaf is associated with a given class. Classifying an input pattern then reduces to a sequence of binary decisions, starting from the root node and ending when a leaf is reached. Each class can then be represented by a logical function that combines the binary decisions encountered at the nodes. A decision tree can thus be considered as a threshold network having a hidden layer of TLUs followed by one logical function per class. Note that the architecture of such threshold networks is similar to some binary neural networks that use a single hidden layer of threshold neurons followed by an XOR gate [10], [11] or by a combination of AND and OR gates [12]-[15]. Also, the equivalence between decision trees and binary neural networks or threshold networks was first noted by Sethi [16], [17]. Once a decision tree has been constructed, it is a simple matter to convert it into an equivalent threshold network by extracting one logical function per class from the tree structure. A logical function for a given class has a number of conjunctions equal to the number of leaves associated with this class (see Fig. 1(a) and the logical expressions reported in the figure caption). While it is not possible to reduce this number without losing the equivalence between the decision tree and the threshold network, it can be possible to simplify the conjunctions themselves. Any node that has a leaf of a given class as a child can be removed from the other conjunctions of the same class. For example, node c in Fig. 1(a) has a leaf of class 1. The conjunction associated with this leaf is a'c. It is easy to check that node c can be removed from the other conjunction of class 1.

TABLE II
LEAVE-ONE-OUT ACCURACY (IN %) FOR THE DIFFERENT DATASETS. DECISION TREES WERE TRAINED WITH OC1 AND TRANSFORMED INTO EQUIVALENT THRESHOLD NETWORKS. SEE TEXT FOR DETAILS ON THE PROCEDURE USED FOR CREATING INDIVIDUAL AND ENSEMBLES OF THRESHOLD NETWORKS. THE NUMBERS IN BRACKETS INDICATE THE TENFOLD CROSS-VALIDATION ACCURACY TAKEN FROM [3] AND [4] FOR SINGLE DECISION TREES OR BAGGING ENSEMBLES OF TEN DECISION TREES TRAINED WITH C4.5. FOR SVM, THE ERROR PENALTY PARAMETER C = 1000, POLYNOMIAL KERNELS WERE USED, AND THE DEGREE OF THE POLYNOMIAL WAS ADJUSTED SEPARATELY TO GET THE BEST PERFORMANCE ON EACH INDIVIDUAL DATASET

To evaluate the performance of bagging decision trees specified as threshold network ensembles, we performed discrimination experiments on the four datasets summarized in Table I. Previous work suggested that ensembles with ten members are adequate to improve the classification performance on these datasets [3], [4]. Thus, ensembles of ten decision trees were created using bagging and transformed into ensembles of ten threshold networks. CART [7], C4.5 [8], and OC1 [9] are perhaps the most popular tree-building algorithms. Here, OC1 was used because it seems to perform better than the others (smaller and more accurate decision trees) [9]. OC1 is a randomized algorithm that builds oblique decision trees by simulated annealing. The OC1 program available at ftp://ftp.cs.jhu.edu/pub/oc1 was modified to incorporate bagging. Moreover, an additional program was written in C in order to transform each decision tree into an equivalent threshold network by automatically extracting one optimized logical function per class from the tree structure. Because the datasets we used were small, generalization performance was estimated by a leave-one-out procedure. Table II reports the leave-one-out performance of bagging decision trees implemented as threshold network ensembles in comparison to the

one of single threshold networks and support vector machines (SVMs) [5], [6]. Our SVM program used for the comparisons was written in C; it uses a quadratic programming method originating from [18], implemented in [19], and available at http://www.isr.umd.edu/labs/cacse/fsqp/qld.c. For these datasets, threshold network ensembles were always more accurate than single threshold networks, which agrees with previous findings [3], [4]. Moreover, they outperformed SVMs on two datasets out of four.

C. What Is the Required Precision?

Threshold networks require only TLUs and combinatorial logic and are very suitable for VLSI implementation [20]. The threshold function is indeed easy to implement digitally, and this results in significant silicon area savings compared to the sigmoidal or radial basis functions used in multilayer perceptrons or RBF networks, which are implemented through area-consuming lookup tables. This simplification results in very compact arithmetic units and makes the prospect of building VLSI chips implementing bagging threshold networks particularly promising for real-time decisions. However, when implementing bagging threshold networks with hardware of limited precision, the sensitivity to weight perturbation may result in performance degradation. It is, therefore, very important to study carefully the precision requirements of the classification problem at hand. Weights and inputs are then coded with the minimum precision that does not affect the classification performance too much. We have evaluated the effect of weight precision on the performance of bagging decision trees for the datasets used above. After training, the weight vectors of each individual threshold network were normalized and uniformly quantized with b bits of precision by assigning 2^b uniform intervals over the weight range. Table III reports the performance of threshold network ensembles with weights quantized with 16, 8, and 4 bits.
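The normalize-then-quantize step can be sketched as follows. This is a minimal sketch: the rounding rule and the signed-code mapping are our assumptions, since the text only states that weight vectors are normalized and uniformly quantized to b bits.

```python
def quantize_weights(weights, bits):
    """Normalize a weight vector and quantize it uniformly to `bits` bits.

    Weights are scaled into [-1, 1], mapped to the nearest signed integer
    code, and mapped back to a real value so the quantized network can be
    evaluated in software.  The signed-code convention mirrors the
    two's-complement storage used on-chip.
    """
    m = max(abs(w) for w in weights) or 1.0
    top = 2 ** (bits - 1) - 1                 # largest positive code, e.g. 7 for 4 bits
    codes = [round(w / m * top) for w in weights]
    return [c / top for c in codes]
```

Fewer bits mean a coarser grid of representable weights, which is what produces the dataset-dependent degradation reported in Table III.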
In order to maintain acceptable performance, 16 bits of precision are sufficient for all the datasets. However, the required precision depends on the problem at hand (16 bits for hepatitis against only eight bits for the other datasets). Using a word length larger than the required precision (for example, 16 bits for ionosphere) results in inefficient usage of the hardware resources (slower processing and higher power consumption). As a consequence, we have chosen to implement threshold network ensembles in hardware with multiprecision. This is greatly beneficial for exploiting the VLSI chip on various problems with different precision requirements. It also permits efficient use of the available hardware resources and, hence, facilitates the implementation of reasonably sized bagging ensembles (see Section III).

TABLE III
LEAVE-ONE-OUT ACCURACY (IN %) OF ENSEMBLES OF TEN THRESHOLD NETWORKS WITH RESPECT TO THE WEIGHT PRECISION (IN NUMBER OF BITS)

III. HARDWARE CONSIDERATIONS

An important issue when implementing reasonably sized network ensembles for portable pattern recognition applications is to provide a high level of compactness together with low-power operation. To meet these challenges, we chose to implement our hardware using a low-cost 3-D packaging technology referred to as MCM-V [21]. It has been shown in the literature that 3-D packaging technology enhances most aspects of electronic systems, such as size, weight, speed, and yield, and reduces power consumption by as much as 30% [22]. In addition, 3-D technologies offer interesting options for solving the problem of neural-network connectivity [23]. Our VLSI design is divided into three parts: design, fabrication, and test of 1) the single-chip architecture; 2) the multichip module; and 3) the 3-D packaged system.

A. Basic Chip Architecture and Circuit Description

The basic building block chip is based on a two-dimensional (2-D) systolic array architecture.
This array consists of 4 × 4 processing elements (PEs), as shown in Fig. 2(a). The array can be configured, using the control signal cne, to perform either a weighted sum or a TLU operation (Out). Each PE includes a local configurable memory that stores either one 16-bit weight, two 8-bit weights, or four 4-bit weights. The 16-bit Xi bus is used to feed the inputs serially, from the least significant bit to the most significant one, with an arbitrary user-defined precision. The Si bus and its output counterpart are systolic input/output buses used to interface between basic VLSI chips when a higher number of PEs is needed. When a processor receives an input, it computes the product of the locally stored weight and the input. A compact, low-power multiprecision serial-parallel multiplier [24] is used to perform the multiplication within each processor. The output of the multiplier is then added to the partial sum received from the processor located on the left and transmitted to the processor located on the right. Input data are propagated vertically through flip-flops and, therefore, the computation that takes place in row i of the systolic array is repeated at row i+1 one clock cycle later. All results are collected on the right side of the array. A wider systolic array (more inputs) can be realized by bypassing the activation function and connecting the partial weighted sums across different chips through flip-flops. This is realized using the output multiplexer controlled by the control bit cne. The outputs may also be configured, using the internal control bit cne, to perform the threshold activation function and, hence, to realize a TLU. A ten-bit internal control register is used to configure the 4 × 4 array of PEs in terms of threshold network topology (TLU, matrix operation) and weight precision.
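Behaviorally, the computation of one array row, together with the sign-bit thresholding used for the activation function, reduces to the following. This is a software sketch of the behavior, not of the serial-parallel circuit; function names are ours.

```python
def row_weighted_sum(weights, inputs):
    """One row of PEs: each PE adds (local weight x shared input) to the
    partial sum arriving from its left neighbour and passes it rightwards."""
    partial = 0
    for w, x in zip(weights, inputs):
        partial = partial + w * x        # product plus left-neighbour partial sum
    return partial

def tlu_output(weighted_sum, wordlen=16):
    """Threshold activation: the sign bit (MSB in two's complement) of the
    weighted sum, i.e. the last bit produced by the serial-parallel multiplier."""
    return (weighted_sum >> (wordlen - 1)) & 1
```

Because each row's result emerges one clock cycle after the row above it, the outputs of all rows can be sampled on a single pin without any loss of throughput, as described next.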
The activation function is simply realized by detecting the sign bit of the weighted sum, which is the last bit generated by the serial-parallel processor (the most significant bit in two's complement). A sampling circuit is also used to detect the sign bit of each TLU and to multiplex the different TLU outputs in time, so that only one physical pin is used for the out signal of the entire array. This time-multiplexing scheme does not affect the overall speed of the system, since a systolic architecture is used and data are processed in a pipelined way. For example,

the sign bit for row 1 is obtained one clock cycle earlier than that of row 2 and, therefore, only one physical pin is required to sample the data of both rows. This has allowed us to reduce the number of physical pins and facilitates the interfacing of several chips within the MCM and the 3-D chip. Other methods, such as a bus-sharing technique, were also used to further reduce the number of physical pins and buses within the architecture. Since the system has different modes of operation, a single bus is assigned different tasks in the different modes.

Fig. 2. (a) Internal architecture of a basic VLSI chip. (b) PE building-block diagram. For NO/OP = 1, the PE output follows its arithmetic unit (OP); for NO/OP = 0, the bus is passed through unchanged (NO). FF stands for flip-flop.

Fig. 2(b) shows the internal block diagram of each PE within the systolic array. It can be noticed from this figure that the 4-bit bus is used to load the 16-bit internal weight register and also to provide the partial-sum inputs to either the configurable arithmetic unit or the two-input/one-output multiplexer, depending on the value of the control bit (NO/OP) stored in the internal flip-flop. When the bus is passed directly to the neighboring PE located on the right, no processing is performed within the PE; the processor is in a nonoperational (NO) mode. This mode provides three interesting features for the systolic array:

- Exhaustive testing of each individual PE within the array. This is achieved by programming the PE under test as OP and all other PEs as NO.
- Improved fault tolerance. With the NO mode, it is possible to isolate a defective PE within the systolic array and/or a defective chip within the MCM or the 3-D chip.
- Efficient loading of the weights using a single bus. Using the NO mode, it is possible to bypass already-loaded PEs and provide the bus to unloaded PEs.
This reduces the number of I/O pads of the VLSI chip, since all processors are loaded using a single bus. Moreover, the technique is also used to load the data into cascaded chips using a single bus. The weight-loading technique is explained in more detail in the next section.

B. Weights Loading Technique

Fig. 3. Technique used for loading weights. L corresponds to the loading mode (bus word latched into the internal weight register); R corresponds to the reset mode (S = 0); NO corresponds to the no-operation mode (bus passed through). D stands for a delay block.

Fig. 3 describes the timing sequence during the loading phase of the chip. To begin the loading process, the signal lw must be held high for at least half a cycle. When lw is high, the loading process begins by presenting on the bus the weights of the four PEs forming the left column of the systolic architecture. The lw signal is held by a delay module for four cycles before being passed to the four neighboring PEs located on the right side. At each clock cycle, four bits are stored in each processor. Since each processor contains a 16-bit weight, loading each column of four processors takes four clock cycles. While holding the lw signal, a PE loads data into its internal register. The PE is then automatically programmed as an NO processor. In this case, the configurable arithmetic unit of the processor is bypassed and the bus is automatically allocated to the adjacent PE located on the right side. The weights within a basic VLSI chip are, therefore, loaded using only one 16-bit bus. Loading proceeds in a similar manner for all columns of processing elements as the lw signal is transmitted to the right, as illustrated in Fig. 3. The same bus can be

used to load the weights of cascaded chips. This is achieved by connecting the loading signals at the output of one basic chip to the corresponding inputs of the next chip. The proposed weight-loading technique obviates the need for a high number of I/Os in the MCM and the 3-D VLSI circuit.

C. Multiprecision Processing

In order to implement multiprecision processing, the 16-bit arithmetic unit is built up from four 4-bit processors wired together using a set of multiplexers. Fig. 4 shows the architecture of the configurable arithmetic unit, in which each row consists of a single 4-bit processor. Eight multiplexers within each PE are used to change the hardware connections between two adjacent rows of cells in order to obtain a weight precision of 4, 8, or 16 bits. In the 16-bit configuration, the combined weights of the four 4-bit processors are considered as a single 16-bit weight and only the buses sin4 and sout4 of Fig. 4 are enabled; PE1 then processes the most significant four bits of the weight while PE4 processes the least significant ones. In another setting of the control bits, the precision is set to eight bits. The remaining control bits (c3, c4, ...) are used to configure the number of inputs and outputs of the multiprecision processor. For example, it is possible to have a precision of eight bits with either two inputs and one TLU, or one input and two TLUs.

D. Reconfigurability

Reconfigurability is defined as the ability of the hardware to be modified in order to fit the topology of the threshold network being implemented [25]. Reconfigurability needs to be addressed in order to make the relatively expensive hardware solution a general problem-solving system. A reconfigurable topology using the architecture is described in Fig. 5. The hardware can be configured to operate at three different weight precisions.
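The resource tradeoff behind this reconfigurability can be enumerated in a few lines. The constraint used here is our reconstruction: each chip provides 64 four-bit sub-processors (4 × 4 PEs with four sub-processors each), so inputs × TLUs × (precision/4) is conserved. The function name and the bounds on p and q are ours.

```python
def topologies(n_chips=1):
    """Enumerate topologies P^b(p x q): b-bit weights, p inputs, q TLUs.

    Assumed resource constraint: p * q * (b // 4) = 64 * n_chips, i.e. the
    four-bit sub-processors of the array are repartitioned between weight
    precision, inputs, and TLUs.
    """
    budget = 64 * n_chips
    configs = []
    for b in (4, 8, 16):                 # supported weight precisions
        units = b // 4                   # 4-bit sub-processors per stored weight
        for p in (4, 8, 16):             # inputs per TLU
            q, rem = divmod(budget, p * units)
            if rem == 0 and q in (4, 8, 16):
                configs.append((b, p, q))
    return configs
```

For a single chip this enumerates six configurations, three at 4-bit precision, two at 8-bit, and one at 16-bit, matching the topologies listed in Fig. 5.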
The input precision is arbitrarily selected by the user and, hence, Xi can take any word length. As can be seen from Fig. 5, different threshold network topologies can be configured depending on the selected precision. P^b(p × q) stands for a network topology with b bits of precision, p inputs, and q TLUs. For 4-bit precision, three configurations are possible, namely P^4(16 × 4), P^4(8 × 8), and P^4(4 × 16). For 8-bit precision, two configurations are possible, P^8(4 × 8) and P^8(8 × 4), and only one configuration is possible for 16-bit precision, P^16(4 × 4). The available resources of the circuit represent a tradeoff between the three parameters b, p, and q: a configuration with lower precision allows the number of inputs or TLUs to increase according to p × q × (b/4) = K, where K is a constant which depends on the number of chips interfaced (K = 64 for a single chip, K = 256 for four cascaded chips). Larger networks (more inputs per TLU) are obtained by bypassing the activation function and connecting the partial weighted sums across different chips. This is achieved using the output multiplexer controlled by the bit cne.

Fig. 4. Internal schematic of the configurable multiprecision arithmetic unit. 4bitMul denotes a four-bit serial-parallel multiplier in which the weights are stored in parallel while the inputs are fed serially from the LSB to the MSB. Mux1I2O is a one-input/two-output multiplexer, while Mux2I1O is a two-input/one-output multiplexer.

IV. VLSI IMPLEMENTATION, MCM DESIGN, AND 3-D PACKAGING

A. Modularity and System Expansibility

Before explaining the technological process involved in the design of the basic chip, the MCM, and the 3-D package, it is important to show how different chips are integrated in order to build a more powerful system. This is referred to as modularity and system expansibility and has been widely studied in the literature for neural-network hardware [26]. The circuit as shown in Fig.
2 can be expanded horizontally in order to realize a threshold network with more inputs, as well as vertically in order to increase the number of TLUs. Moreover, the circuit can be expanded in terms of the number of classifiers per network ensemble. Fig. 6 shows the interchip connectivity with an example of four basic chips. The figure illustrates the modularity and easy expansibility of the system, which requires no extra interfacing circuitry. It should be noted that special attention was paid to the design of the output buffers, particularly for the global and systolic control buses. Data communicated from one chip to another pass through flip-flops in a systolic manner. This avoids the accumulation of delays when chips are pipelined together.

B. Single Chip Implementation

We described in Section III the internal circuitry of the hardware architecture. The chip implements the recall operation of threshold networks as described in Section II and depicted in Fig. 1(c). It includes weight storage memory and switches for topology reconfiguration of the TLUs, together with a local memory which stores the topology of the threshold network (number of TLUs and inputs) and its computational precision. Before implementing the recall operation of any classification problem, the systolic architecture must first be configured in order to realize the required topology and precision. Both the topology and the precision are determined by the content of a 10-bit register. A 16-bit register is also used in each basic chip in order to store the NO/OP configuration

required for the 4 × 4 systolic array. The NO/OP register allows the hardware to be configured to a smaller topology. The NO mode also permits forcing a processor into a standby mode, which reduces power consumption when smaller network topologies are needed.

A setup of a given topology requires loading the control word into the internal control register. This is done through the bus Xi of Fig. 2 during the loading phase of the weights. Only two clock cycles are needed to load the control register. Each time the topology or the precision has to be changed, the content of the 26-bit register needs to be reloaded into the chip accordingly.

The chip was fabricated using standard-cell 0.7-μm CMOS technology. Fig. 7(a) shows a microphotograph of the fabricated chip, which occupies a silicon area of 13.70 mm². The functionality of the packaged basic chips was fully tested prior to wire bonding the dies on the MCM substrates. Test results are presented in Section V.

C. MCM and 3-D Packaging

The main objective behind the development of the MCM and the 3-D chip is to obtain a powerful, compact, low-power pattern recognition system using bagging threshold network ensembles.

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 5, SEPTEMBER 2003

Fig. 5. The different topologies of threshold networks implemented by the circuit shown in Fig. 3, depending on the selected precision. P_b(p × q) stands for a network topology with b bits of precision, p inputs, and q TLUs. For a four-bit precision, three configurations are possible, namely P_4(16 × 4), P_4(8 × 8), and P_4(4 × 16). For an eight-bit precision, two configurations are possible, P_8(4 × 8) and P_8(8 × 4), and only one configuration is possible for a 16-bit precision, P_16(4 × 4).

Fig. 6. Interchip connectivity illustrating the modularity and expansibility of the system.

Fig. 7. (a) Microphotograph of the VLSI chip. (b) Photograph of the MCM, including four VLSI chips and occupying an area of 2 × 2 cm².
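All six topologies listed in the caption of Fig. 5 satisfy a constant resource product, bits × inputs × TLUs = 256, for a single chip. A short sketch enumerating them; the bounds p, q ≥ 4 and powers of two are assumptions inferred from the listed configurations:

```python
def topologies(K=256, min_side=4):
    """Enumerate P_b(p x q) configurations with b * p * q = K,
    where p and q are powers of two no smaller than min_side."""
    configs = []
    for b in (4, 8, 16):
        pq = K // b                      # inputs x TLUs budget at precision b
        p = min_side
        while p <= pq // min_side:       # keep q = pq // p >= min_side
            configs.append((b, p, pq // p))
            p *= 2
    return configs

for b, p, q in topologies():
    print(f"P_{b}({p} x {q})")
# six configurations: three at 4-bit, two at 8-bit, one at 16-bit
```

With K = 1024 (four cascaded chips), the same routine yields the correspondingly larger topology menu.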
After designing the basic chip and successfully testing it, four dies were mounted on a single MCM. Each MCM therefore implements a fully configurable systolic array of 64 16-bit PEs (four 4 × 4 arrays). Fig. 7(b) shows a photograph of the MCM. It includes four VLSI chips mounted onto a flexible laminated substrate (film), and the dies are wire bonded to wiring pads on the substrate. The MCMs are designed such that the test pads are provided along the back side of the film. The 3-D packaging technology referred to as MCM-V [21] was used to realize the 3-D chip. After testing the MCMs, the selected ones are stacked one above the other and encapsulated in epoxy resin. The epoxy is then sawn together with the edge wires of the MCM. The block is then plated with layers of Cu/Ni/Au using standard electroplating techniques. A YAG laser is then used to

pattern the surface so that vertical wire tracks are formed on the cube [21]. Interconnections between the layers are realized on the sides of the module. External signals and the power supply of the 3-D chip are routed around the bottom side for interconnection to a standard package. This is realized by dedicating the first layer to a custom-made MCM that is laser soldered to the module. Each step in the fabrication of the 3-D module uses a standard and well-characterized technological process. As a consequence, the MCM-V technology is relatively low cost [21], [22]. Fig. 8(a) shows a photograph of the final 3-D chip and Fig. 8(b) shows its internal block diagram. As can be seen from Fig. 8(b), the module includes four substrate layers with four chips on each of the three top levels (12 chips in total). The bottom substrate is fully dedicated to routing the vertical connections to an external PGA package. The size of the final module is 2 × 2 × 0.7 cm³. This represents a volume reduction of at least 50% with respect to an advanced PCB implementation.

Fig. 8. (a) Photograph of the final 3-D chip, occupying a volume of (w × L × h) = (2 × 2 × 0.7) cm³. (b) Block diagram of the 3-D chip, including four levels of MCM. The three top levels correspond to the MCMs described above, while the bottom one is dedicated to routing the pins to a standard PGA package.

V. RESULTS AND CHIP PERFORMANCE

Several test modes were implemented in order to facilitate the test and debugging of the chips. The test procedure included the functional test of the single chip, the MCMs, and the 3-D chip. A fault characterization study was carried out prior to mounting the 3-D chip. After extensive functional tests, an experimental setup for selective gas detection and an electronic nose application was developed, which provided a test bed for the 3-D chip operating as a bagging threshold network ensemble classifier.

A.
Experimental Setup and Functional Test of the Single Chip

In order to verify the correct operation of the basic chip, two approaches were adopted. First, a Verilog simulator was used to generate test vectors from the schematic of the circuit. The test vectors were then inserted into the test program of a Tektronix LV500 digital tester. This allowed us not only to verify the correct operation of the chip for different topologies and precisions, but also to fully characterize its performance. Second, a PCB was designed connecting a basic chip to the parallel port of a PC. The board allows software control of the weights and reconfiguration of the threshold network topology.

TABLE IV: EXPERIMENTAL RESULTS OF THE PROCESSING TIME FOR THE DIFFERENT TOPOLOGIES AND PRECISIONS DESCRIBED IN FIG. 5. THE RESULTS WERE OBTAINED BY TESTING A LARGE NUMBER OF RECALL CYCLES AND CONSIDERING THE WORST-CASE RESULT. THE NUMBER OF CLOCK CYCLES REPORTED CORRESPONDS TO THE FULL CLASSIFICATION OF ONE INPUT PATTERN; SHARED PIPELINING CYCLES OF CONSECUTIVE INPUTS ARE NOT SUBTRACTED.

Both tests confirmed that the chip is fully operational at 20 MHz for all configurations of precision and TLU topology, with an average power consumption of 16 mW/MHz. The chip presents a loading time of less than 1 μs. This value corresponds to the time required to load the synaptic words (64 words of eight bits) and the control sequences. Table IV summarizes the performance of the chip as a function of the different topologies reported in Fig. 5. The number of clock cycles reported excludes those needed for the loading phase, which is done only once. The inputs were coded with the same number of bits as the synaptic weights: a P_b(p × q) topology assumes b bits for its inputs and b bits for its synaptic weights. The maximum frequency was obtained by considering a large number of different recall cycles and then taking the worst-case performance.
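The headline figures above (20 MHz, 16 mW/MHz) can be cross-checked with simple arithmetic; the sketch below treats the mW/MHz figure as a linear scaling, which is an assumption, and uses the fastest reported recall time as an example.

```python
def chip_figures(f_mhz=20.0, mw_per_mhz=16.0, recall_ns=550.0):
    """Back-of-envelope figures for the basic chip at frequency f_mhz:
    total power, recalls per second for the fastest topology, and
    energy per recall (shared pipelining cycles not subtracted)."""
    power_mw = mw_per_mhz * f_mhz                          # dynamic power scales with f
    recalls_per_s = 1e9 / recall_ns                        # one recall every recall_ns
    energy_uj = power_mw * 1e-3 * recall_ns * 1e-9 * 1e6   # mW x s -> microjoules
    return power_mw, recalls_per_s, energy_uj

p, r, e = chip_figures()
print(f"{p:.0f} mW, {r/1e6:.2f} M recalls/s, {e*1000:.0f} nJ/recall")
# -> 320 mW, 1.82 M recalls/s, 176 nJ/recall
```

These single-chip figures are consistent with the M-samples/s throughput numbers reported for the 3-D chip later in the section, once the 12-way chip parallelism is factored in.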
The processing time for a single recall operation (without including the loading time and without subtracting the shared pipelining cycles of consecutive input patterns) varies from 550 ns to 1400 ns, depending on the selected topology and precision. We can note from Table IV that, for a given precision, the number of clock cycles required (without subtracting the shared pipelining cycles) depends only on the number of TLUs and is, therefore, independent of the number of inputs. We can also note that the maximum frequency decreases as the number of inputs increases. This is explained by the fact that input data are pipelined vertically through D flip-flops; hence, a topology with a higher number of TLUs requires more clock cycles but does not affect the delay of a basic operation. Partial sums are, however, communicated directly to adjacent processors and, hence, no additional clock cycles are required for

a topology with a higher number of inputs, at the expense of an increased delay of the basic operation, which degrades the maximum clock frequency. It should be noted that fully pipelining the operations would improve the maximum clock frequency at the expense of an increased number of clock cycles and of the silicon area required for the pipelining flip-flops.

B. Functional Test of the MCM and Fault Characterization

A total of 50 basic chips were manufactured (44 dies and six packaged chips). After verifying the full functionality of the packaged chips, the remaining 44 dies were mounted on 11 MCMs, each containing four chips. Special care was taken in the design of the MCM to make the test and debugging of each chip within the MCM possible. This was achieved by designing special back-side contacts on the MCM such that they fit on a test socket. Access to interchip connections was also obtained using back-side MCM contacts designed to bring the test contacts of the MCM out to a standard DIL package. A procedure similar to the one used for the test of a single chip was then employed to test the MCMs. A schematic view of the MCM was designed and simulated using a Verilog simulator. Test vectors were then generated and automatically inserted into the test program of the Tektronix LV500 digital tester. Within each MCM, each chip was separately addressed and tested. Several tests were conducted.

Test 1 (wire bonding and MCM-level connectivity test): A preliminary microscopic check of the wire bonding was conducted before applying test vectors designed to target the connectivity of the MCM routing signals.

Test 2 (propagation of the control signals within the MCM): This test was designed to check the proper propagation of the systolic control signals from one chip to another.
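The benefit of the systolic flip-flop boundaries between chips can be illustrated with a toy timing model. All delay numbers below are made up for illustration; only the trend mirrors the text: combinational chaining stretches the critical path with every added stage, while registered (systolic) chaining keeps it constant at the cost of latency cycles.

```python
def max_freq_mhz(n_stages, stage_delay_ns, registered):
    """Toy timing model: chained combinationally, the critical path grows
    with n_stages; with a flip-flop (systolic) boundary per stage, the
    critical path stays one stage long, at the cost of extra latency."""
    critical_ns = stage_delay_ns if registered else n_stages * stage_delay_ns
    return 1e3 / critical_ns    # period in ns -> frequency in MHz

for n in (1, 2, 4):
    comb = max_freq_mhz(n, 25.0, registered=False)
    syst = max_freq_mhz(n, 25.0, registered=True)
    print(f"{n} chip(s): combinational {comb:5.1f} MHz, systolic {syst:5.1f} MHz")
```

This is exactly the tradeoff described above: the direct (combinational) partial-sum path saves clock cycles but lowers the maximum frequency as more inputs are chained, whereas the registered control and data buses keep the clock rate independent of the number of cascaded chips.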
Test 3 (synaptic weight loading test): Test vectors were applied to check the correct loading of the synaptic weights and the internal control registers.

Test 4 (NO/OP programming test): Test vectors were applied in order to check the successful programming of the NO/OP feature explained in Section III-A.

Test 5 (functional test): In this test, a Verilog simulator was used to generate test vectors covering functional tests of individual chips, groups of two chips, and four chips, as well as tests of the different configurations.

Fig. 9 summarizes the test results for the 11 MCMs (numbered MCM1 to MCM11). Of the 11 MCMs, six were found to be fully operational (54%). A faulty wire bond was detected for MCM1 and MCM2. MCM3 passed the connectivity test but failed all the remaining tests. MCM4 and MCM5 successfully passed all tests except the last, functional test. Even though MCM4 and MCM5 do not operate properly, their malfunction would not affect the correct operation of neighboring MCMs if they were mounted within the 3-D package. This is made possible by the NO mode, which considerably improves the fault tolerance of the final system. Indeed, two MCMs out of the five faulty ones (40%) do not present catastrophic faults, thanks to the NO feature of the chip.

Fig. 9. Fault characterization of the 11 MCMs with respect to the five tests described. P and F stand for pass and fail results, respectively. Dark shades represent samples with catastrophic faults. Light shades represent nonoperational samples whose defects can nevertheless be isolated using the NO mode. Unshaded entries are samples with no fault.

TABLE V: EXPERIMENTAL RESULTS OF THE PROCESSING TIME FOR DIFFERENT TOPOLOGIES AND PRECISIONS. THE RESULTS WERE OBTAINED BY TESTING A LARGE NUMBER OF RECALL CYCLES AND CONSIDERING THE WORST-CASE RESULT.

C. Functional Test of the 3-D Chip

The six operational MCMs were used to build two 3-D chips (three levels of four chips each).
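The yield figures quoted above follow directly from the pass/fail matrix of Fig. 9. A minimal tally, with the per-MCM pass/fail vectors transcribed from the text (tests 1-5 in order):

```python
# Pass/fail per MCM for tests 1-5, as described in the text (True = pass)
results = {
    "MCM1": [False] * 5,               # faulty wire bonding
    "MCM2": [False] * 5,               # faulty wire bonding
    "MCM3": [True] + [False] * 4,      # passed connectivity only
    "MCM4": [True] * 4 + [False],      # failed only the functional test
    "MCM5": [True] * 4 + [False],      # failed only the functional test
}
results.update({f"MCM{i}": [True] * 5 for i in range(6, 12)})  # MCM6-11 pass all

operational = [m for m, r in results.items() if all(r)]
faulty = [m for m in results if m not in operational]
# MCMs whose defects are isolable via the NO mode: faulty, but not
# catastrophically so (here: those that still pass tests 1-4)
recoverable = [m for m in faulty if all(results[m][:4])]

print(f"{len(operational)}/11 operational ({100 * len(operational) // 11}%)")
print(f"{len(recoverable)}/{len(faulty)} faulty MCMs recoverable via NO mode")
```

The tally reproduces the 54% MCM yield and the 40% of faulty MCMs whose defects the NO mode can isolate.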
The same procedure was repeated for the test of the 3-D chip. The test procedure of the MCMs and the 3-D chip was greatly facilitated by the NO mode, which allows PEs within the VLSI chips, VLSI chips within the MCM, and MCMs within the 3-D chip to be individually addressed and tested. The 3-D chip was successfully tested for all configurations of precision and TLU topology. Table V summarizes the performance of the 3-D chip as a function of some of the possible topologies and sizes of the network ensembles. The test configurations were set similarly to those reported

for the test of a single chip. The processing time for a single recall operation (without including the loading time and excluding the shared pipelining cycles) varies from 611 ns to 2111 ns, depending on the size of the ensemble, the selected topology of the threshold network, and the precision. We can note from Table V that the hardware resources of the 3-D chip impose a tradeoff between the size of the ensemble (first column of the table) and the topology of the threshold network; as for the single chip, the available resources are governed by a constant product of the ensemble and topology parameters. A complete recall operation is obtained in less than 1.2 μs for any topology with 4-bit or 8-bit precision, while it is obtained in less than 2.2 μs for any 16-bit precision. The 3-D chip presents a loading time of 10.8 μs. This value corresponds to the time required to load the synaptic words (768 words of eight bits) and the control sequences. It should be noted that, for a 4-bit precision, none of the reported topologies requires pipelining chips together, as evidenced by the third column of the table.

Fig. 10. Experimentally measured sequence for the loading acknowledgment signal Lw and the outputs of the different TLUs. Ch1 and Ch2 represent the clock signal and the loading acknowledgment signal; Ch3, Ch4, and Ch5 represent the TLU outputs from the first, second, and third chips, respectively. The TLU outputs are sequentially fed out at each rising edge of the clock.

Fig. 11. Experimental setup (left) and time response of the sensor array to a transient concentration of ethanol (right). The arrow indicates the time at which the concentration step is applied. The vertical dashed line indicates the measurement time corresponding to the steady-state sensor response.
This is explained by the fact that, for a 4-bit precision, each basic chip is able to cope with 16 inputs; therefore, more TLUs are obtained simply by using more basic chips, without communicating partial sums between chips. This is not the case for 8-bit and 16-bit precisions, where each basic chip can only handle eight inputs (8-bit precision) or four inputs (16-bit precision). For example, one of these configurations requires pipelining two basic chips, and the partial sums are passed from one chip to another through flip-flops. This results in additional clock cycles, in proportion to the number of pipelined chips. We can also note from Table V that the number of clock cycles required for one of the ensemble networks is exactly the same as that reported in Table IV for a single chip implementing the corresponding topology, while the frequency is slightly better in the latter case. This is explained by the fact that the ensemble network is obtained by operating 12 chips in parallel, each chip realizing one element of the ensemble. This results in exactly the same number of clock cycles, while the frequency is slightly reduced for the ensemble network due to the additional wiring delay from the output pad to the external PGA pin of the 3-D package. It can also be noticed from Table V that the performances of several configurations are identical. This is simply because each of these configurations is obtained by merely reorganizing the number of individual classifiers and the number of TLUs per classifier. Fig. 10 shows the experimental output from the chip, which corresponds to a full recall cycle for one of the topologies reported in Table V. A 16-bit 4 × 4 weight matrix was loaded into each chip within the 3-D prototype. A test vector was fed serially to the chip from the least significant bit to the most significant one. The weight and input values were chosen so that the outputs of two adjacent rows of the systolic array would have opposite signs and, hence, the TLU outputs (Out signal) would toggle from one output to the next. Fig.
10 shows the output waveforms, with Ch1, Ch2, Ch3, Ch4, and Ch5 representing the clock signal, the loading acknowledgment signal, and the TLU outputs from the first, second, and third chips, respectively. One chip within each MCM was selected for the purpose of this test. The first TLU output is obtained eight clock cycles after the load acknowledgment signal. The 192 TLU outputs are obtained in only 23 clock cycles using only 12 physical pins; each output pin sequentially delivers the results of 16 TLUs, as shown in Fig. 10. This corresponds to a very high level of parallelism realized with very limited physical outputs.

D. Test of the 3-D Chip as an Electronic Nose

It is well known that commercially available gas sensors lack selectivity and thus respond to a wide variety of odors. In this section, we report the performance of our 3-D chip combined with a gas sensor array, acting together as an electronic nose. An automated gas delivery experimental setup was developed for extracting volatile compounds at given concentrations from liquids (Fig. 11). It consists of two pumps, two mass flow controllers (MFCs), one bubbler, a gas chamber, and a data acquisition system. Ethanol or butanol vapors were injected into the gas chamber at a flow rate determined by the mass flow controllers. Knowing these flow rates and the saturated vapor pressure at

room temperature, the concentration of the injected gas was calculated. We used 16 different concentrations of ethanol, ranging from 1360 to 5165 ppm, and 15 different concentrations of butanol, ranging from 870 to 3050 ppm. We used sensor arrays composed of five commercial TGS Figaro gas sensors (TGS 2600, 2602, 2610, 2611, and 2620). The potential differences across the sensor resistances were measured using a voltage divider with 2.2 kΩ load resistors while keeping the heating voltage constant at 5 V. The sensors' output voltages were sampled at a rate of 10 Hz and quantized with an 11-bit analog-to-digital converter. A typical plot of these voltages versus time for a transient concentration of ethanol is shown on the right of Fig. 11. It shows that the sensor array reacts slowly to a transient gas concentration and takes several minutes to reach the stationary state. This time is a combination of the time needed to fill the chamber and the time needed for the sensors to respond. The steady-state value was recorded for each concentration of the two gases. Besides being nonselective, gas sensors also exhibit long-term drift and can be poisoned by a particular gas or an excessive concentration, so that they have to be replaced from time to time by new ones. However, this is not easy because gas sensors of the same type do not have exactly the same characteristics, due to poor control of the manufacturing process.

Fig. 12. Experimentally measured (a) training and (b) test performance on the odor dataset for an ensemble of ten 8-bit precision threshold networks (full line), ten 4-bit threshold networks (dashed line), and 20 4-bit threshold networks (dotted line). At 20 MHz, the circuit achieves a classification performance of 28 M samples/s, 11 M samples/s, and 4 M samples/s for four-, eight-, and 16-bit weight precision, respectively, with 11-bit input precision.
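The paper does not give the concentration formula. A common approach for a bubbler-based delivery system is sketched below, assuming the carrier leaves the bubbler fully saturated and using Antoine's equation for the saturated vapor pressure of ethanol (constants from standard tables); the flow-rate values in the example are illustrative assumptions, not the paper's settings.

```python
def ethanol_psat_mmhg(t_celsius):
    """Saturated vapor pressure of ethanol via Antoine's equation,
    log10(P[mmHg]) = A - B / (T + C), with commonly tabulated constants
    for ethanol (valid roughly from -57 to 80 C)."""
    A, B, C = 8.20417, 1642.89, 230.300
    return 10 ** (A - B / (t_celsius + C))

def concentration_ppm(f_bubbler_sccm, f_dilution_sccm,
                      t_celsius=25.0, p_total_mmhg=760.0):
    """Concentration after diluting a saturated carrier stream:
    mole fraction of vapor times the dilution ratio, in ppm."""
    x_sat = ethanol_psat_mmhg(t_celsius) / p_total_mmhg
    dilution = f_bubbler_sccm / (f_bubbler_sccm + f_dilution_sccm)
    return x_sat * dilution * 1e6

# Illustrative operating point: 100 sccm through the bubbler, 1000 sccm dilution
print(f"{concentration_ppm(100.0, 1000.0):.0f} ppm")
```

With these illustrative flows, the computed concentration falls in the same few-thousand-ppm range as the ethanol concentrations used in the experiments (1360 to 5165 ppm).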
To allow for such a sensor replacement within the array, the pattern recognition system has to be robust to some dispersion in the characteristics of the sensors. To accomplish this, we recorded the steady-state outputs of four sensor arrays, each composed of the five TGS Figaro gas sensors mentioned above, for the 31 different concentrations of ethanol and butanol. This yielded a total of 124 patterns. This odor dataset¹ was used in Section II for estimating the performance of threshold network ensembles trained with bagging. However, estimating the performance of the 3-D chip by leave-one-out is difficult, as it requires testing 124 different threshold network ensembles. Instead, we decided to randomly split the odor dataset into a training set (100 patterns) and a test set (24 patterns).

¹Available at: http://www.loria.fr/~dmartine/odor.txt

To implement the recall and the training test, the user first latches the synaptic weights and the control bits onto the Xi and companion input buses of the 3-D chip while presenting a load-in command. This enables the chip to systolically load the data into the different cascaded chips. The load-in signal propagates from PE to PE and from chip to chip, thus allowing each chip to load the weights and the control-bit configurations using the two input buses. An acknowledgment signal is received from the 3-D chip once it has successfully completed the data loading. The 11-bit input data obtained from the data acquisition system were then fed into the 3-D chip. The test and training performance were experimentally measured on the 3-D chip and compared with the performance obtained by simulation for four- and eight-bit precisions and for ensembles of ten and 20 threshold networks. Fig. 12(a) and (b) shows the results obtained for training and test, respectively.
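The bagging procedure evaluated here (bootstrap resampling plus majority vote, per Breiman [2]) can be sketched in a few lines. This toy version uses single-threshold stumps on synthetic data as a stand-in for the decision-tree-derived threshold networks and mimics the 100/24 split, so the dataset and the resulting accuracy are illustrative only.

```python
import random

def train_stump(data):
    """Fit a one-feature threshold unit (the simplest 'threshold network')
    by exhaustive search, minimizing training error."""
    best_err, best = 2.0, None
    n = len(data)
    for f in range(len(data[0][0])):
        for x0, _ in data:
            t = x0[f]
            for pol in (1, -1):
                err = sum((pol if x[f] > t else -pol) != y
                          for x, y in data) / n
                if err < best_err:
                    best_err, best = err, (f, t, pol)
    return best

def predict(stump, x):
    f, t, pol = stump
    return pol if x[f] > t else -pol

def bagging(train_set, n_classifiers, rng):
    """Bagging: each classifier is trained on a bootstrap resample
    (same size, drawn with replacement) of the training set."""
    return [train_stump([train_set[rng.randrange(len(train_set))]
                         for _ in train_set])
            for _ in range(n_classifiers)]

def vote(ensemble, x):
    return 1 if sum(predict(c, x) for c in ensemble) > 0 else -1

# Synthetic two-class data standing in for the 124-pattern odor dataset
rng = random.Random(0)
data = [([rng.gauss(m, 1.0), rng.gauss(m, 1.0)], y)
        for m, y in ((0.0, -1), (2.0, 1)) for _ in range(62)]
rng.shuffle(data)
train_set, test_set = data[:100], data[100:]   # 100 train / 24 test

ensemble = bagging(train_set, 10, rng)
acc = sum(vote(ensemble, x) == y for x, y in test_set) / len(test_set)
print(f"test accuracy: {acc:.2f}")
```

In the actual system, each ensemble member is a full decision tree converted to a threshold network with quantized weights, but the resample-then-vote structure is the same.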
A 98% accuracy was obtained on the training set for an ensemble of ten threshold networks with 8-bit weight precision [solid line of Fig. 12(a)], while the performance dropped to 82% for a 4-bit weight precision (dashed line). Using an ensemble of 20 threshold networks at 4-bit precision improves the training accuracy by 8%, resulting in a 90% accuracy (dotted line). A test performance of 96% was obtained for an ensemble of ten threshold networks with 8-bit weight precision [solid line of Fig. 12(b)], while the performance dropped to 84% for a 4-bit weight precision (dashed line). Using an ensemble of 20 threshold networks at 4-bit precision improves the test performance by 3%, resulting in an 87% accuracy (dotted line). The performance measurements obtained for both training and test match those obtained by simulation. However, both the training and test performance dropped at frequencies higher than 20 MHz. This was expected from the functional tests of Section V-C, where the measured maximum frequencies were found to be around 20 MHz. It should be noted, however, that the peak classifications-per-second performance achievable at the relatively low frequency of 20 MHz is very high. Indeed, 28 M samples/s, 11 M samples/s, and 4 M samples/s are achieved for 4-, 8-, and 16-bit weight precision, respectively, with an input precision

of 11 bits, thanks to the very high level of parallelism obtained in the 3-D chip (12 chips operating in parallel) and the relatively small number of clock cycles needed for a classification, due to the pipelining properties of the systolic array. Fig. 13(a) shows the classification time for the input patterns of the odor dataset. The dotted curve corresponds to the data measured from the chip at an 11-bit input accuracy, while the figures reported for 4-, 8-, and 16-bit input precision are deduced by experimentally measuring the maximum frequency and analytically deriving the classification time. We can note from Fig. 13(a) that the classification time is 0.7 μs for 4-bit weight and 11-bit input precision. This was obtained by pipelining the input patterns in our systolic architecture implementing 20 threshold networks (768 connections). This leads to a peak performance of 48 GCPPS. The power consumption, normalized per classification and as a function of the weight precision, is reported in Fig. 13(b).

Fig. 13. (a) Experimentally measured classification time as a function of the weight and input precisions for the odor dataset. (b) Experimentally measured power consumption as a function of the classification time for different weight precisions. The input data precision is 11 bits.

TABLE VI: PERFORMANCE COMPARISON OF OUR DESIGN WITH SOME NEURAL DIGITAL CIRCUITS REPORTED IN THE LITERATURE [27], [31]-[38]. IN THE TABLE, CPPS IS THE NUMBER OF CONNECTION PRIMITIVES PER SECOND, ENERGY/PC STANDS FOR THE ENERGY PER PRIMITIVE CONNECTION (ENERGY NORMALIZED WITH RESPECT TO BOTH THE NUMBER OF CONNECTIONS AND THE NUMBER OF BITS PER CONNECTION), CFG STANDS FOR CONFIGURABLE, AND NA STANDS FOR NOT AVAILABLE.
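The 48-GCPPS figure is consistent with the numbers above, assuming the CPPS definition of [30], in which connections per second are normalized by both the input and weight word lengths:

```python
def cpps(connections, classification_time_s, weight_bits, input_bits):
    """Connection primitives per second: connections per second
    multiplied by both word lengths (normalization per [30])."""
    cps = connections / classification_time_s   # connections per second
    return cps * weight_bits * input_bits

# 768 connections classified in 0.7 us, 4-bit weights, 11-bit inputs
peak = cpps(connections=768, classification_time_s=0.7e-6,
            weight_bits=4, input_bits=11)
print(f"{peak / 1e9:.0f} GCPPS")  # -> 48 GCPPS, matching the reported figure
```

The same routine applied to the slower 8- and 16-bit configurations reproduces the precision/throughput tradeoff visible in Fig. 13(a).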
It is clearly shown in this figure that the circuit can operate at very low power for four-bit precision (0.13 mW at 1 K classifications per second) and eight-bit precision (0.33 mW at 1 K classifications per second). The power consumption for 16-bit precision is around 0.85 mW at 1 K classifications per second, which is comparable to that reported for the VindAx neural-network processor [27]. Table VI further compares the performance of our circuit with the most well-known digital neural-network circuits reported in the literature. It should be noted, however, that comparing different neural-network hardware is often difficult and can be very tricky, as confirmed by several authors [28], [29]. In Table VI, we therefore briefly describe some selected digital neural-network circuits and report their performance using normalized figures of merit such as connection

primitives per second (CPPS) [30] and energy per primitive connection [28]. Even though our main objective is not to build a neural-network accelerator, Table VI shows that the 3-D chip has a computational speed on the order of 50% of those reported for advanced neural-network accelerators. The power consumption, however, is very competitive with that of VindAx, a very advanced digital neural circuit recently reported [27]. It can also be seen from Table VI that the 3-D chip proposed in this paper presents the advantage of being reconfigurable in terms of both precision and topology, offering the possibility of increasing the computational power at lower levels of precision. A lower level of precision also allows for low-power operation [see Fig. 13(b)], as unneeded resources are kept in a switched-off mode.

VI. DISCUSSION

In this paper, we have reported a 3-D circuit implementation of bagging ensembles, together with its experimental test results. In our circuit, individual classifiers within the ensemble are decision trees specified as threshold networks having a layer of TLUs followed by combinatorial logic elements. We have shown that such bagging threshold network ensembles are more accurate than single classifiers trained on the original dataset, and that the required weight precision depends on the application at hand. The proposed architecture supports variable-precision computation (4/8/16 bit), which makes it possible, for low-precision applications, either to reduce the power consumption or to increase the ensemble size and thereby improve the classification performance. Inefficient usage of the hardware resources is avoided since both the weight and input precisions are user-defined for the application at hand. The proposed architecture also supports a configurable network structure (number of networks per ensemble, number of TLUs and inputs per network).
In addition, the design includes other novel features such as its weight-loading technique and its modular expansibility. Statistical tests performed on the manufactured chips showed that the proposed NO/OP feature improves the fault tolerance of the system by as much as 40%. In order to meet the high computational requirements of threshold network ensembles, we developed a compact 3-D circuit which includes four layers of MCM integrating 192 16-bit PEs, with 768 digital synapses and up to 192 TLUs, in a module of size 2 × 2 × 0.7 cm³. This represents a size reduction of at least 50% with respect to a very advanced PCB implementation including the same number of chips. The power consumption of the 3-D chip depends on the selected precision and the required classification speed. The circuit consumes 0.13, 0.33, and 0.85 mW at 1 K classifications per second for 4-, 8-, and 16-bit precision, respectively. The very high level of compactness, together with the relatively low-power operation of the 3-D chip, makes it a very suitable candidate for portable and compact pattern recognition systems. Successful operation of the 3-D chip for various precisions and ensemble sizes was first demonstrated through extensive functional tests. Operation of the 3-D chip as compact pattern recognition hardware was also demonstrated through an electronic nose application. Experimental results suggest a peak classification performance of 28, 11, and 4 M samples/s for 4-bit, 8-bit, and 16-bit precision, respectively. A major issue in this work concerns on-chip learning. On-chip learning could have been implemented by means of additional circuitry, reducing, however, the space available on the chip for the threshold network ensemble. Whether on-chip learning is necessary or not depends on the application at hand. On-chip learning is attractive when continuous unsupervised training is needed or when the training time would require days of computing.
This is certainly not the case for bagging threshold networks, which only need to be trained once and are very fast to build. As an example, it takes an average of 4.6 s on a Pentium III PC running at 1 GHz to obtain a bagging ensemble of ten threshold networks trained on the odor dataset. This time includes the generation of the sampled data files, the training of the decision trees, and the extraction of the optimized logical functions needed for transforming each decision tree into an equivalent threshold network. The complete leave-one-out procedure on the odor dataset takes only 12 min. Another related issue that needs consideration concerns the limited precision of the hardware. It might be possible to compensate for the chosen weight precision of threshold network ensembles by using a boosting procedure. Boosting is similar to bagging except that examples incorrectly classified by previous classifiers are chosen more often for the sampled training set than examples that were correctly classified (for details, refer to http://www.boosting.org). If earlier classifiers are evaluated with quantized weights, boosting will attempt to focus the new classifier on the classification errors, whether or not they stem from the chosen weight precision. Work is ongoing to test this idea.

ACKNOWLEDGMENT

This work was initiated while the authors were at LAAS-CNRS, Toulouse. The authors would like to thank D. Hoeung for his help on extracting optimized logical functions from the tree structure, T. Doconto for wire bonding the MCMs, and 3D Plus Electronics and C. Val for manufacturing the 3-D block.

REFERENCES

[1] T. G. Dietterich, "Machine learning research: Four current directions," AI Mag., pp. 97-136, 1997.
[2] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[3] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study," J. Artificial Intell. Res., vol. 11, pp. 169-198, 1999.
[4] R. Maclin and D.

BERMAK AND MARTINEZ: A COMPACT 3-D VLSI CLASSIFIER 1109
Amine Bermak (M'99) received the M.Eng. and Ph.D. degrees in electronic engineering from Paul Sabatier University, Toulouse, France, in 1994 and 1998, respectively. He was part of the Microsystems and Microstructures Research Group at the French National Research Center LAAS-CNRS, where he developed a number of VLSI chips for artificial neural network classification and detection applications in a project funded by Motorola-Toulouse. He spent one year with the Advanced Computer Architecture Research Group, York University, York, U.K., where he worked on the VLSI implementation of CMM neural networks for vision applications in a project funded by British Aerospace. In 1998, he joined Edith Cowan University, Perth, Australia, as a Research Fellow with the Visual Information Processing Research Group, where he worked on the design of smart vision sensors with on-chip biologically inspired image processing. In January 2000, he became a Lecturer with the School of Engineering and Mathematics, Edith Cowan University, where he was promoted to Senior Lecturer in November 2001. He is currently an Assistant Professor with the Electrical and Electronic Engineering Department, Hong Kong University of Science and Technology, Hong Kong. His current research interests include VLSI circuits and systems, packaging technologies, CMOS image sensors, and the VLSI implementation of signal and image processing algorithms.

Dominique Martinez received the Ph.D. degree in electrical and electronic engineering from Paul Sabatier University, Toulouse, France, in 1992. He was a Postdoctoral Fellow with the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, and the VLSI Group, Harvard University, Cambridge, MA, in 1992 and 1994, respectively.
From 1993 to 1999, he was with LAAS-CNRS, Toulouse, where his research concerned machine learning (neural networks and support vector machines). In 2000, he joined LORIA, Nancy, France, where his research interests currently focus on biologically inspired neural networks for artificial olfaction (neuromorphic electronic noses).