Kevin Cuzner's Personal Blog

Electronics, Embedded Systems, and Software are my breakfast, lunch, and dinner.

The LED Wristwatch: A (more or less) completed project!

Around 2009 I saw an article written by Dr. Paul Pounds in which he detailed a pocketwatch he had designed that fit inside a standard pocketwatch case and used LEDs as the dial. While the article has since disappeared, the YouTube video remains, and the Wayback Machine has a cached version of the page. Anyway, the idea has stuck with me for a while, and so a year or so ago I decided that I wanted to build a wristwatch inspired by it.

Although the project started out as an AVR project, I decided after my escapades with the STM32 in August that I really wanted to make it an STM32 project, so around November I started making a new design that used the STM32L052C8 ARM Cortex-M0+ ultra-low power USB microcontroller. The basic concept of the design is to mock up an analog watch face using a ring of LEDs for the hours, minutes, and seconds. I found three full rings to be expecting a bit much if I wanted to keep this small, so I ended up using two rings: One for the hours and another for combined minutes and seconds (the second hand is recognized by the fact that it is "moving" perceptibly).

In this post I'm going to go over my general design, some things I was happy with, and some things that I wasn't happy with. I'll make some follow-up posts for the following topics:

The complete design files can be found here:


Quick-n-dirty data acquisition with a Teensy 3.1

The Problem

I am working on a project that involves a Li-Ion battery charger. I've never built one of these circuits before and I wanted to test the battery over its entire charge-discharge cycle to make sure it wasn't going to burst into flame because I set some resistor wrong. The battery itself is very tiny (100mAh, 2.5mm thick) and is going to be powering an extremely low-power circuit, hopefully over the course of many weeks between charges.


After about 2 days of taking meter measurements every 6 hours or so to see what the voltage level had dropped to, I decided to try to automate this process. I had my trusty Teensy 3.1 lying around, so I thought that it should be pretty simple to turn it into a simple data logger, measuring the voltage at a very slow rate (maybe 1 measurement per 5 seconds). Thus was born the EZDAQ.

All code for this project is located in the project repository.

Setting up the Teensy 3.1 ADC

I've never used the ADC before on the Teensy 3.1. I don't use the Teensy Cores HAL/Arduino Library because I find it more fun to twiddle the bits and write the makefiles myself. Of course, this also means that I don't always get a project working within 30 minutes.

The ADC on the Teensy 3.1 (or the Kinetis MK20DX256) is capable of doing 16-bit conversions at 400-ish ksps. It is also quite complex and can do conversions in many different ways. It is one of the larger and more configurable peripherals on the device, probably rivaled only by the USB module. The module does not come pre-calibrated and requires a calibration cycle to be performed before its accuracy will match that specified in the datasheet. My initialization code is as follows:

    //Enable ADC0 module

    //Set up conversion precision and clock speed for calibration
    ADC0_CFG1 = ADC_CFG1_MODE(0x1) | ADC_CFG1_ADIV(0x1) | ADC_CFG1_ADICLK(0x3); //12 bit conversion, adc async clock, div by 2 (<3MHz)
    ADC0_CFG2 = ADC_CFG2_ADACKEN_MASK; //enable async clock

    //Enable hardware averaging and set up for calibration
    while (ADC0_SC3 & ADC_SC3_CAL_MASK) { }
    if (ADC0_SC3 & ADC_SC3_CALF_MASK) //calibration failed. Quit now while we're ahead.
        return;
    temp = ADC0_CLP0 + ADC0_CLP1 + ADC0_CLP2 + ADC0_CLP3 + ADC0_CLP4 + ADC0_CLPS;
    temp /= 2;
    temp |= 0x1000;
    ADC0_PG = temp;
    temp = ADC0_CLM0 + ADC0_CLM1 + ADC0_CLM2 + ADC0_CLM3 + ADC0_CLM4 + ADC0_CLMS;
    temp /= 2;
    temp |= 0x1000;
    ADC0_MG = temp;

    //Set clock speed for measurements (no division)
    ADC0_CFG1 = ADC_CFG1_MODE(0x1) | ADC_CFG1_ADICLK(0x3); //12 bit conversion, adc async clock, no divide

Following the recommendations in the datasheet, I selected a clock that would bring the ADC clock speed down to <4MHz and turned on hardware averaging before starting the calibration. The calibration is initiated by setting a flag in ADC0_SC3 and when completed, the calibration results will be found in the several ADC0_CL* registers. I'm not 100% certain how this calibration works, but I believe what it is doing is computing some values which will trim some value in the SAR logic (probably something in the internal DAC) in order to shift the converted values into spec.

One thing to note is that I did not end up using the 16-bit conversion capability. I was a little rushed and was put off by the fact that I could not get it to use the full 0-65535 dynamic range of a 16-bit result variable. It was more like 0-10000. This made figuring out my "volts-per-value" value a little difficult. However, the 12-bit mode gave me 0-4095 with no problems whatsoever. Perhaps I'll read a little further and figure out what is wrong with the way I was doing the 16-bit conversions, but for now 12 bits is more than sufficient. I'm just measuring some voltages.
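To make the "volts-per-value" arithmetic concrete, here is a small Python sketch. It assumes a 3.3V reference (the same value the host software uses later in this post), and `counts_to_volts` is a made-up helper name rather than anything from the EZDAQ code.

```python
# Sketch: convert a raw ADC reading to volts for a given resolution.
# Assumes a 3.3 V reference; counts_to_volts is a hypothetical helper.
VREF = 3.3

def counts_to_volts(raw, bits=12, vref=VREF):
    """Scale a raw reading by vref / 2**bits (the volts-per-count value)."""
    return raw * vref / (1 << bits)

full_scale_12 = counts_to_volts(4095)   # just under the 3.3 V reference
half_scale_12 = counts_to_volts(2048)   # mid-scale reading in 12-bit mode
```

The same function would cover a 16-bit reading by passing `bits=16`, provided the converter really spans the full range.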

Since I planned to measure the voltages coming off a Li-Ion battery, I needed to make sure I could handle the range of 3.0V-4.2V. Most of this is outside the Teensy's ADC range (max is 3.3V), so I had to make myself a resistor divider attenuator (with a parallel capacitor for added stability). It might have been better to use some sort of active circuit, but this is supposed to be a quick and dirty DAQ. I'll talk a little more about handling issues spawning from the use of this resistor divider in the section about the host software.
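A quick way to sanity-check a divider like this is to compute the voltage that reaches the pin at full charge. The sketch below assumes an ideal two-resistor divider built from 1 meg resistors (the value mentioned later in this post); the function names are hypothetical and real resistors will deviate from nominal.

```python
# Sketch: check that a resistor divider keeps a Li-Ion cell inside the ADC range.
# Assumes ideal resistors; r_top/r_bottom are nominal values, not measured ones.
ADC_MAX_V = 3.3

def divider_ratio(r_top, r_bottom):
    """Fraction of the input voltage that appears at the ADC pin."""
    return r_bottom / (r_top + r_bottom)

def pin_voltage(v_in, r_top, r_bottom):
    """Voltage at the ADC pin for a given input voltage."""
    return v_in * divider_ratio(r_top, r_bottom)

ratio = divider_ratio(1e6, 1e6)            # two equal resistors
v_pin_full = pin_voltage(4.2, 1e6, 1e6)    # fully charged cell
fits = v_pin_full <= ADC_MAX_V             # True when the pin stays in range
```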

Quick and dirty USB device-side driver

For this project I used the device-side USB driver software that I wrote for another project. Since we are gathering data quite slowly, I figured that a simple control transfer should be enough to handle the requisite bandwidth.

    static uint8_t tx_buffer[256];

    /**
     * Endpoint 0 setup handler
     */
    static void usb_endp0_handle_setup(setup_t* packet)
    {
        const descriptor_entry_t* entry;
        const uint8_t* data = NULL;
        uint8_t data_length = 0;
        uint32_t size = 0;
        uint16_t *arryBuf = (uint16_t*)tx_buffer;
        uint8_t i = 0;

        switch(packet->wRequestAndType)
        {
        /* ...USB Protocol Stuff... */
        case 0x01c0: //get adc channel value (wIndex)
            *((uint16_t*)tx_buffer) = adc_get_value(packet->wIndex);
            data = tx_buffer;
            data_length = 2;
            break;
        default:
            goto stall;
        }

        //if we are sent here, we need to send some data
        send:
        /* ...Send Logic... */

        //if we make it here, we are not able to send data and have stalled
        stall:
        /* ...Stall logic... */
    }

I added a control request (0x01) which uses the wIndex (not to be confused with the cleaning product) value to select a channel to read. The host software can now issue a vendor control request 0x01, setting the wIndex value accordingly, and get the raw value last read from a particular analog channel. In order to keep things easy, I labeled the analog channels using the same format as the standard Teensy 3.1 layout. Thus, wIndex 0 corresponds to A0, wIndex 1 corresponds to A1, and so forth. The adc_get_value function reads the last read ADC value for a particular channel. Sampling is done by the ADC continuously, so the USB read doesn't initiate a conversion or anything like that. It just reads what happened on the channel during the most recent conversion.
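For readers wondering where the 0x01c0 in the switch statement comes from, the sketch below splits it apart. It assumes that `wRequestAndType` is simply the first two bytes of the setup packet read as one little-endian 16-bit value (bmRequestType in the low byte, bRequest in the high byte); the helper names are mine.

```python
# Sketch: split the combined 16-bit field into the two setup-packet bytes.
# Assumes bmRequestType is the low byte and bRequest is the high byte.
def split_request(w_request_and_type):
    bm_request_type = w_request_and_type & 0xFF
    b_request = (w_request_and_type >> 8) & 0xFF
    return bm_request_type, b_request

def describe(bm_request_type):
    """Decode the direction, type, and recipient fields of bmRequestType."""
    direction = "IN" if bm_request_type & 0x80 else "OUT"
    types = ["standard", "class", "vendor", "reserved"]
    req_type = types[(bm_request_type >> 5) & 0x3]
    recipient = bm_request_type & 0x1F   # 0 means the device itself
    return direction, req_type, recipient

bm, req = split_request(0x01C0)
fields = describe(bm)
```

Under that assumption, 0x01c0 is request 0x01 with a bmRequestType of 0xC0: a device-to-host vendor request addressed to the device, which is what the host software builds with `build_request_type`.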

Host software

Since libusb is easy to use with Python, via PyUSB, I decided to write the whole thing in Python. Originally I planned on some sort of fancy GUI until I realized that it would be far simpler just to output a CSV and use MATLAB or Excel to process the data. The software is simple enough that I can just put the entire thing here:

    #!/usr/bin/env python3

    # Python Host for EZDAQ
    # Kevin Cuzner
    #
    # Requires PyUSB

    import usb.core, usb.util
    import argparse, time, struct

    idVendor = 0x16c0
    idProduct = 0x05dc
    sManufacturer = ''
    sProduct = 'EZDAQ'

    VOLTS_PER = 3.3/4096 # 3.3V reference is being used

    def find_device():
        # Several devices may share this VID/PID, so match on the strings too
        for dev in usb.core.find(find_all=True, idVendor=idVendor, idProduct=idProduct):
            if usb.util.get_string(dev, dev.iManufacturer) == sManufacturer and \
                    usb.util.get_string(dev, dev.iProduct) == sProduct:
                return dev

    def get_value(dev, channel):
        # Vendor request 0x01: wIndex selects the analog channel
        rt = usb.util.build_request_type(usb.util.CTRL_IN, usb.util.CTRL_TYPE_VENDOR, usb.util.CTRL_RECIPIENT_DEVICE)
        raw_data = dev.ctrl_transfer(rt, 0x01, wIndex=channel, data_or_wLength=256)
        # The device replies with one little-endian 16-bit raw ADC value
        data = struct.unpack('<H', raw_data)
        return data[0] * VOLTS_PER

    def get_values(dev, channels):
        return [get_value(dev, ch) for ch in channels]

    def main():
        # Parse arguments
        parser = argparse.ArgumentParser(description='EZDAQ host software writing values to stdout in CSV format')
        parser.add_argument('-t', '--time', help='Set time between samples', type=float, default=0.5)
        parser.add_argument('-a', '--attenuation', help='Set channel attenuation level', type=float, nargs=2, default=[], action='append', metavar=('CHANNEL', 'ATTENUATION'))
        parser.add_argument('channels', help='Channel number to record', type=int, nargs='+', choices=range(0, 10))
        args = parser.parse_args()

        # Set up attenuation dictionary
        att = args.attenuation if len(args.attenuation) else [[ch, 1] for ch in args.channels]
        att = dict([(l[0], l[1]) for l in att])
        for ch in args.channels:
            if ch not in att:
                att[ch] = 1

        # Perform data logging
        dev = find_device()
        if dev is None:
            raise ValueError('No EZDAQ Found')
        dev.set_configuration()
        print(','.join(['Time']+['Channel ' + str(ch) for ch in args.channels]))
        while True:
            values = get_values(dev, args.channels)
            print(','.join([str(time.time())] + [str(v[1] * (1/att[v[0]])) for v in zip(args.channels, values)]))
            time.sleep(args.time)

    if __name__ == '__main__':
        main()

Basically, I just use the argparse module to take some command line inputs, find the device using PyUSB, and spit out the requested channel values in a CSV format to stdout every so often.

In addition to simply displaying the data, the program also processes the raw ADC values into some useful voltage values. I contemplated doing this on the device, but it was simpler to configure if I didn't have to reflash it every time I wanted to make an adjustment. One thing this lets me do is a sort of calibration using the "attenuation" values that I put into the host. The idea with these values is to compensate for a voltage divider in front of the analog input so that I can measure higher voltages, even though the Teensy 3.1 only supports voltages up to 3.3V.

For example, if I plugged my 50%-ish resistor divider on channel A0 into 3.3V, I would run the following command:

    $ ./ezdaq 0
    Time,Channel 0

We now have 1.799 for the "voltage" seen at the pin with an attenuation factor of 1. If we divide 1.799 by 3.3 we get 0.545 for our attenuation value. Now we run the following to get our newly calibrated value:

    $ ./ezdaq -a 0 0.545 0
    Time,Channel 0
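The arithmetic behind that calibration step can be sketched in a few lines of Python. The numbers are the ones quoted above (1.799V seen at the pin with 3.3V applied); the function names are hypothetical and only mirror what the host script does with `1/att`.

```python
# Sketch: derive an attenuation factor from a known input, then apply it.
# attenuation() and corrected() are illustrative names, not part of ezdaq.
def attenuation(measured_at_pin, known_input):
    """Fraction of the input voltage that actually reaches the pin."""
    return measured_at_pin / known_input

def corrected(measured_at_pin, att):
    """Recover the input voltage, as the host does with v * (1/att)."""
    return measured_at_pin / att

att = round(attenuation(1.799, 3.3), 3)   # rounded to three places, as typed on the command line
v_in = corrected(1.799, att)              # lands very close to the applied 3.3 V
```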

This process highlights an issue with using standard resistors. Unless the resistors are precision resistors, the values will not ever really match up very well. I used four 1 meg resistors to make two voltage dividers. One of them had about a 46% division and the other was close to 48%. Sure, those seem close, but in this circuit I needed to be accurate to at least 50mV. The difference between 46% and 48% is enough to throw this off. So, when trying to derive an input voltage from a reading taken through an imprecise voltage divider, some form of calibration is definitely needed.
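To put a number on that, here is a rough sketch of the error you would get by assuming the wrong division ratio. It uses the 46% and 48% figures from above and a 4.2V input (the top of the Li-Ion range); `reading_error` is a made-up helper.

```python
# Sketch: error introduced by assuming the wrong divider ratio.
# reading_error is a hypothetical helper; ratios come from the text above.
def reading_error(v_in, actual_ratio, assumed_ratio):
    """Difference between the computed and the true input voltage."""
    v_pin = v_in * actual_ratio          # what the ADC actually sees
    return v_pin / assumed_ratio - v_in  # what we would wrongly conclude

err = reading_error(4.2, 0.46, 0.48)     # divider is 46%, but 48% is assumed
too_big = abs(err) > 0.050               # compare against the 50 mV budget
```

With these inputs the computed voltage comes out roughly 175mV low, well outside a 50mV tolerance.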



After hooking everything up and getting everything to run, it was fairly simple for me to take some two-channel measurements:

    $ ./ezdaq -t 5 -a 0 0.465 -a 1 0.477 0 1 > ~/Projects/AVR/the-project/test/charge.csv

This will dump the output of my program into the charge.csv file (which is measuring the charge cycle on the battery). I will get samples every 5 seconds. Later, I can use this data to make sure my circuit is working properly and observe its behavior over long periods of time. While crude, this quick and dirty DAQ solution works quite well for my purposes.
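If you would rather stay in Python than open MATLAB or Excel, the CSV is easy to load with the standard library. This is only a sketch: the column names match the header the host script prints, but the sample rows below are invented placeholders, not measurements from my battery.

```python
# Sketch: load an ezdaq-style CSV log with the standard library.
# The sample text is made up for illustration; it is not real logged data.
import csv
import io

def load_log(text):
    """Parse CSV text into a list of {column name: float} dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(value) for name, value in row.items()} for row in reader]

sample = "Time,Channel 0,Channel 1\n0.0,4.10,4.05\n5.0,4.11,4.06\n"
rows = load_log(sample)
lowest = min(row["Channel 0"] for row in rows)
```

For a real log, read the file contents into a string first (or hand the open file to `csv.DictReader` directly).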

Dev boards? Where we're going we won't need dev boards...

A complete tutorial for using an STM32 without a dev board



About two years ago I started working with the Teensy 3.1 (which uses a Freescale Kinetis ARM-Cortex microcontroller) and I was super impressed with the ARM processor, both for its power and relative simplicity (it is not simple...it's just relatively simple for the amount of power you get for the cost IMO). Almost all my projects before that point had consisted of AVRs and PICs (I'm in the AVR camp now), but now ARM-based microcontrollers had become serious contenders for something that I could go to instead. I soon began working on a small development board project also involving some Freescale Kinetis microcontrollers since those are what I have become the most familiar with. Sadly, I have had little success since I have been trying to make a programmer myself (the official one is a minimum of $200). During the course of this project I came across a LOT of STM32 stuff and it seemed that it was actually quite easy to set up. Lots of the projects used the STM32 Discovery and similar dev boards, which are great tools and provide an easy introduction to ARM microcontrollers. However, my interest is more towards doing very bare metal development. Like soldering the chip to a board and hooking it up to a programmer. Who needs any of that dev board stuff? For some reason I just find doing embedded development without a development board absolutely fascinating. Some people might interpret doing things this way as a form of masochism. Perhaps I should start seeing a doctor...

Having seen how common the STM32 family was (both in dev boards and in commercial products) and noting that they were similarly priced to the Freescale Kinetis series, I looked into exactly what I would need to program these, saw that the stuff was cheap, and bought it. After receiving my parts and soldering everything up, I plugged everything into my computer and had a program running on the STM32 in a matter of hours. Contrast that to a year spent trying to program a Kinetis KL26 with only partial success.

This post is a complete step-by-step tutorial on getting an STM32 microcontroller up and running without using a single dev board (breakout boards don't count as dev boards for the purposes of this tutorial). I'm writing this because I could not find a standalone tutorial for doing this with an ARM microcontroller and I ended up having to piece it together bit by bit with a lot of googling. My objective is to walk you through the process of purchasing the appropriate parts, writing a small program, and uploading it to the microcontroller.

I make the following assumptions:

  • The reader is familiar with electronics and soldering.
  • The reader is familiar with flash-based microcontrollers in general (PIC, AVR, ARM, etc) and has programmed a few using a separate standalone programmer before.
  • The reader knows how to read a datasheet.
  • The reader knows C and is at least passingly familiar with the overall embedded build process of compilation-linking-flashing.
  • The reader knows about makefiles.
  • The reader is ridiculously excited about ARM microcontrollers and is strongly motivated to overlook any mistakes here and try this out for themselves (srsly tho...if you see a problem or have a suggestion, leave it in the comments. I really do appreciate feedback.)

All code, makefiles, and configuration stuff can be found in the project repository on GitHub.


You will require the following materials:

  • A computer running Linux. If you run Windows only, please don't be dissuaded. I'm just lazy and don't want to test this for Windows. It may require some finagling. Manufacturer support is actually better for Windows since they provide some interesting configuration and programming software that is Windows only...but who needs that stuff anyway?
  • An STLinkV2 clone from eBay. Here's one very similar to the one I bought. ~$3
  • Some STM32F103C8's from eBay. Try going with the TQFP-48 package. Why this microcontroller? Because for some reason it is all over the place on eBay. I suspect that the lot I bought (and all of the ones on eBay) is probably not authentic from ST. I hear that Chinese STM32 clones abound nowadays. I got 10 for $12.80.
  • A breakout board for a TQFP-48 with 0.5mm pitch. Yes, you will need to solder surface mount. I found mine for $1. I'm sure you can find one for a similar price.
  • 4x 0.1uF capacitors for decoupling. Mine are surface mount in the 0603 package. These will be soldered creatively to the breakout board between the power pins to provide some decoupling since we will probably have wires flying all over. I had mine lying around in a parts bin, left over from my development board project. Digikey is great for getting these, but I'm sure you could find them on eBay or Amazon as well.
  • Some dupont wires for connecting the programmer to the STM32. You will need at least 4. These are the ones that are sold for Arduinos. These came with my programmer, but you may have some in your parts box. They are dang cheap on Amazon.
  • Regular wires.
  • An LED and a resistor.

I was able to acquire all of these parts for less than $20. Now, I did have stuff like the capacitors, LED, resistor, and wires lying around in parts boxes, but those are quite cheap anyway.

Side note: Here is an excellent video by the EE guru Dave Jones on surface mount soldering if the prospect is less than palatable to you:

Step 1: Download the datasheets

Above we decided to use the STM32F103C8 ARM Cortex-M3 microcontroller in a TQFP-48 package. This microcontroller has so many peripherals it's no wonder it's the one all over eBay. I could see this microcontroller easily satisfying the requirements for all of my projects. Among other things it has:

  • 64K flash, 20K RAM
  • 72MHz capability with an internal PLL
  • USB
  • CAN
  • I2C & SPI
  • Lots of timers
  • Lots of PWM
  • Lots of GPIO

All this for ~$1.20/part no less! Of course, it's like $6 on Digikey, but for my purposes having an eBay-sourced part is just fine.

Ok, so when messing with any microcontroller we need to look at its datasheet to know where to plug stuff in. For almost all ARM microcontrollers there will be no fewer than two datasheet-like documents you will need: the part datasheet and the family reference manual. The datasheet contains information such as the specific pinouts and electrical characteristics, and the family reference manual contains the detailed information on how the microcontroller works (core and peripherals). These are both extremely important and will be indispensable for doing anything at all with one of these microcontrollers bare metal.

Find the STM32F103C8 datasheet and family reference manual here (datasheet is at the top of the page, reference manual is at the bottom). They are also found in the "ref" folder of the repository.

Step 2: Figure out where to solder and do it


After getting the datasheet we need to solder the microcontroller down to the breakout board so that we can start working with it on a standard breadboard. If you prefer to go build your own PCB and all that (I usually do actually) then do that instead of this. However, you will still need to know which pins to hook up.

On the pin diagram posted here you will find the highlighted pins of interest for hooking this thing up. We need the following pins at a minimum:

  • Shown in Red/Blue:  All power pins, VDD, VSS, AVDD, and AVSS. There are four pairs: 3 for the VDD/VSS and one AVDD/AVSS. The AVDD/AVSS pair is specifically used to power the analog/mixed signal circuitry and is separate to give us the opportunity to perform some additional filtering on those lines and remove supply noise induced by all the switching going on inside the microcontroller; an opportunity I won't take for now.
  • Shown in Yellow/Green: The SWD (Serial Wire Debug) pins. These are used to connect to the STLinkV2 programmer that you purchased earlier. These can be used for so much more than just programming (debugging complete with breakpoints, for a start), but for now we will just use it to talk to the flash on the microcontroller.
  • Shown in Cyan: Two fun GPIOs to blink our LEDs with. I chose PB0 and PB1. You could choose others if you would like, but just make sure that they are actually GPIOs and not something unexpected.

Below you will find a picture of my breakout board. I soldered a couple extra pins since I want to experiment with USB.


Very important: You may notice that I have some little tiny capacitors (0.1uF) soldered between the power pins (the one on the top is the most visible in the picture). You need to mount your capacitors between each pair of VDD/VSS pins (including AVDD/AVSS). How you do this is completely up to you, but it must be done and *they should be rather close to the microcontroller itself*. If you don't, it is entirely possible that when the microcontroller first turns on and powers up (specifically at the first falling edge of the internal clock cycle), the inductance created by the flying power wires we have will create a voltage spike that will either cause a malfunction or damage. I've broken microcontrollers by forgetting the decoupling caps and I'm not eager to do it again.

Step 3: Connect the breadboard and programmer


Don't do this with the programmer plugged in.

On the right you will see my STLinkV2 clone which I will use for this project. Barely visible is the pinout. We will need the following pins connected from the programmer onto our breadboard. These come off the header on the non-USB end of the programmer. Pinouts may vary. Double check your programmer!

  • 3.3V: We will be using the programmer to actually power the microcontroller since that is the simplest option. I believe this pin is Pin 7 on my header.
  • GND: Obviously we need the ground. On mine this was Pin 4.
  • SWDIO: This is the data for the SWD bus. Mine has this at Pin 2.
  • SWCLK: This is the clock for the SWD bus. Mine has this at Pin 6.

You may notice in the above picture that I have an IDC cable coming off my programmer rather than the dupont wires. I borrowed the cable from my AVR USBASP programmer since it was closer at hand than the dupont wires that came with the STLinkV2.

Next, we need to connect the following pins on the breadboard:

  • STM32 [A]VSS pins 8, 23, 35, and 47 connected to ground.
  • STM32 [A]VDD pins 9, 24, 36, and 48 connected to 3.3V.
  • STM32 pin 34 to SWDIO.
  • STM32 pin 37 to SWCLK.
  • STM32 PB0 pin 18 to a resistor connected to the anode of an LED. The cathode of the LED goes to ground. Pin 19 (PB1) can also be connected in a similar fashion if you should so choose.

Here is my breadboard setup:


Step 4: Download the STM32F1xx C headers


Since we are going to write a program, we need the headers. These are part of the STM32CubeF1 library found here.

Visit the page and download the STM32CubeF1 zip file. It will ask for an email address. If you really don't want to give them your email address, the necessary headers can be found in the project github repository.

Alternately, just clone the repository. You'll miss all the fun of poking around the zip file, but sometimes doing less work is better.

The STM32CubeF1 zip file contains several components which are designed to help people get started quickly when programming STM32s. This is one thing that ST definitely does better than Freescale. It was so difficult to find the headers for the Kinetis microcontrollers that I almost gave up at that point. Anyway, inside the zip file we are only interested in the following:

  • The contents of Drivers/CMSIS/Device/ST/STM32F1xx/Include. These headers contain the register definitions among other things which we will use in our program to reference the peripherals on the device.
  • Drivers/CMSIS/Device/ST/STM32F1xx/Source/Templates/gcc/startup_stm32f103xb.s. This contains the assembly code used to initialize the microcontroller immediately after reset. We could easily write this ourselves, but why reinvent the wheel?
  • Drivers/CMSIS/Device/ST/STM32F1xx/Source/Templates/system_stm32f1xx.c. This contains the common system startup routines referenced by the assembly file above.
  • Drivers/CMSIS/Device/ST/STM32F1xx/Source/Templates/gcc/linker/STM32F103XB_FLASH.ld. This is the linker script for the next model up of the microcontroller we have (we just have to change the "128K" to a "64K" near the beginning of the file in the MEMORY section (line 43 in my file) and we are good to go). This is used to tell the linker where to put all the parts of the program inside the microcontroller's flash and RAM. Mine had a "0" on every blank line. If you see this in yours, delete those "0"s. They will cause errors.
  • The contents of Drivers/CMSIS/Include. These are the core header files for the ARM Cortex-M3 and the definitions contained therein are used in all the other header files we reference.
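The two linker script fixes mentioned above (changing "128K" to "64K" and deleting the stray "0" lines) are easy to do by hand, but they can also be scripted. The sketch below is only an illustration: `patch_linker_script` is a made-up helper, and the sample text is a stand-in that assumes the flash region is declared on a line starting with "FLASH". Check your own copy of the file before trusting it.

```python
# Sketch: apply the two manual linker-script edits described above.
# The sample text is a stand-in; the layout of ST's real file is assumed.
def patch_linker_script(text):
    patched = []
    for line in text.splitlines():
        if line.strip() == "0":
            patched.append("")                 # stray "0" where a blank line belongs
            continue
        if line.lstrip().startswith("FLASH") and "128K" in line:
            line = line.replace("128K", "64K")  # the C8 part has 64K of flash
        patched.append(line)
    return "\n".join(patched) + "\n"

sample = (
    "MEMORY\n"
    "{\n"
    "0\n"
    "FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 128K\n"
    "RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 20K\n"
    "}\n"
)
patched = patch_linker_script(sample)
```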

I copied all the files referenced above to various places in my project structure so they could be compiled into the final program. Please visit the repository for the exact locations and such. My objective with this tutorial isn't really to talk too much about project structure, and so I think that's best left as an exercise for the reader.

Step 5: Install the required software

We need to be able to compile the program and flash the resulting binary file to the microcontroller. In order to do this, we will require the following programs to be installed:

  • The arm-none-eabi toolchain. I use Arch Linux and had to install "arm-none-eabi-gcc". On Ubuntu this is called "gcc-arm-none-eabi". This is the cross-compiler for the ARM Cortex cores. The "none" in the name means that there is no target operating system: the compiler builds for an environment where the program is the only thing running on the target processor. The "eabi" refers to the ARM embedded application binary interface (EABI), which defines things like calling conventions and data layout. The compiler still produces ELF files, but since there is no OS on the target to interpret that file format and load the program from it (as Linux would), we convert the ELF output into a raw binary before flashing.
  • arm-none-eabi binutils. In Arch the package is "arm-none-eabi-binutils". In Ubuntu this is "binutils-arm-none-eabi". This contains some utilities such as "objdump" and "objcopy" which we use to convert the output ELF format into the raw binary format we will use for flashing the microcontroller.
  • Make. We will be using a makefile, so obviously you will need make installed.
  • OpenOCD. I'm using 0.9.0, which I believe is available for both Arch and Ubuntu. This is the program that we will use to talk to the STLinkV2 which in turn talks to the microcontroller. While we are just going to use it to flash the microcontroller, it can be also used for debugging a program on the processor using gdb.

Once you have installed all of the above programs, you should be good to go for ARM development. As for an editor or IDE, I use vim. You can use whatever. It doesn't matter really.

Step 6: Write and compile the program

Ok, so we need to write a program for this microcontroller. We are going to simply toggle on and off a GPIO pin (PB0). After reset, the processor uses the internal RC oscillator as its system clock and so it runs at a reasonable 8MHz or so I believe. There are a few steps that we need to go through in order to actually write to the GPIO, however:

  1. Enable the clock to PORTB. Most ARM microcontrollers, the STM32 included, have a clock gating system that actually turns off the clock to pretty much all peripherals after system reset. This is a power saving measure as it allows parts of the microcontroller to remain dormant and not consume power until needed. So, we need to turn on the GPIO port before we can use it.
  2. Set PB0 to a push-pull output. This microcontroller has many different options for the pins including analog input, an open-drain output, a push-pull output, and an alternate function (usually the output of a peripheral such as a timer PWM). We don't want to run our LED open drain for now (though we certainly could), so we choose the push-pull output. Most microcontrollers have push-pull as the default method for driving their outputs.
  3. Toggle the output state on. Once we get to this point, it's success! We can control the GPIO by just flipping a bit in a register.
  4. Toggle the output state off. Just like the previous step.

Here is my super-simple main program that does all of the above:

    /**
     * STM32F103C8 Blink Demonstration
     *
     * Kevin Cuzner
     */

    #include "stm32f1xx.h"

    int main(void)
    {
        //Step 1: Enable the clock to PORT B
        RCC->APB2ENR |= RCC_APB2ENR_IOPBEN;

        //Step 2: Change PB0's mode to 0x3 (output) and cfg to 0x0 (push-pull)
        GPIOB->CRL &= ~(GPIO_CRL_MODE0 | GPIO_CRL_CNF0);
        GPIOB->CRL |= GPIO_CRL_MODE0;

        while (1)
        {
            //Step 3: Set PB0 high
            GPIOB->BSRR = GPIO_BSRR_BS0;
            for (uint16_t i = 0; i != 0xffff; i++) { }
            //Step 4: Reset PB0 low
            GPIOB->BSRR = GPIO_BSRR_BR0;
            for (uint16_t i = 0; i != 0xffff; i++) { }
        }

        return 0;
    }

If we turn to our trusty family reference manual, we will see that the clock gating functionality is located in the Reset and Clock Control (RCC) module (section 7 of the manual). The gates to the various peripherals are sorted by the exact data bus they are connected to and have appropriately named registers. The PORTB module is located on the APB2 bus, and so we use the RCC->APB2ENR to turn on the clock for port B (section 7.3.7 of the manual).

The GPIO block is documented in section 9. We first talk to the low control register (CRL) which controls pins 0-7 of the 16-pin port. There are 4 bits per pin which describe the configuration, grouped into two 2-bit (see how many "2" sounding words I had there?) sections: the Mode and the Configuration. The Mode selects input or output (and, for outputs, the maximum speed) and the Configuration handles the specifics of the particular mode. We have chosen the 50MHz-capable output mode (Mode is 0b11) with a general purpose push-pull configuration (Cfg is 0b00). I'm not fully sure what the 50MHz refers to yet, so I just went with 50MHz.
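The bit layout is easier to see with the arithmetic written out. The following Python sketch models the CRL value only; it assumes the register comes out of reset as 0x44444444 (every pin a floating input) and that each pin's 4-bit field holds Mode in the low two bits and Cfg in the high two. `set_pin_config` is a made-up name.

```python
# Sketch: compute the CRL value that makes PB0 a 50MHz push-pull output.
# Models the register as a plain integer; set_pin_config is hypothetical.
CRL_RESET = 0x44444444   # assumed reset value: all pins floating inputs

def set_pin_config(crl, pin, mode, cnf):
    """Replace the 4-bit field for one pin (0-7) with new Mode/Cfg bits."""
    shift = pin * 4
    crl &= ~(0xF << shift) & 0xFFFFFFFF      # clear the old Mode and Cfg bits
    crl |= ((cnf << 2) | mode) << shift      # Cfg sits above Mode in the field
    return crl

crl = set_pin_config(CRL_RESET, pin=0, mode=0b11, cnf=0b00)
```

Only the lowest nibble changes; the other seven pins keep their reset configuration.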

After talking to the CRL, we get to talk to the BSRR register. This register allows us to write a "1" to a bit in the register in order to either set or reset the pin's output value. We start by writing to the BS0 bit to set PB0 high and then writing to the BR0 bit to reset PB0 low. Pretty straightforward.
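As a rough model of what a BSRR write does, the sketch below treats the port's output state as a 16-bit integer. It assumes the usual layout for this register: the low 16 bits set pins, the high 16 bits reset them, and a set wins if both are written for the same pin. The names are illustrative only.

```python
# Sketch: model the effect of a BSRR write on a 16-bit output register.
# Assumes bits 0-15 set pins and bits 16-31 reset them, with set taking priority.
def write_bsrr(odr, value):
    set_bits = value & 0xFFFF
    reset_bits = (value >> 16) & 0xFFFF
    return ((odr & ~reset_bits) | set_bits) & 0xFFFF

BS0 = 1 << 0     # set bit for pin 0
BR0 = 1 << 16    # reset bit for pin 0

odr = 0x0000
odr_high = write_bsrr(odr, BS0)       # PB0 driven high
odr_low = write_bsrr(odr_high, BR0)   # PB0 driven low again
```

Writing zeros has no effect on the other pins, which is why the program can assign to the register directly instead of doing a read-modify-write.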

It's not a complicated program. Half the battle is knowing where all the pieces fit. The STM32CubeF1 zip file contains some examples which could prove quite revealing about the specifics of using the various peripherals on the device. In fact, it includes an entire hardware abstraction layer (HAL) which you could compile into your program if you wanted to. However, I have heard some bad things about it from a software engineering perspective (apparently it's badly written and quite ugly). I'm sure it works, though.

So, the next step is to compile the program. See the makefile in the repository. Basically what we are going to do is first compile the main source file, the assembly file we pulled in from the STM32Cube library, and the C file we pulled in from the STM32Cube library. We will then link them using the linker script from the STM32Cube and then dump the output into a binary file.

# Makefile for the STM32F103C8 blink program
#
# Kevin Cuzner

PROJECT = blink

# Project Structure
SRCDIR = src
COMDIR = common
BINDIR = bin
OBJDIR = obj
INCDIR = include

# Project target
CPU = cortex-m3

# Sources
SRC = $(wildcard $(SRCDIR)/*.c) $(wildcard $(COMDIR)/*.c)
ASM = $(wildcard $(SRCDIR)/*.s) $(wildcard $(COMDIR)/*.s)

# Include directories
INCLUDE  = -I$(INCDIR) -Icmsis

# Linker
LSCRIPT = STM32F103X8_FLASH.ld

# C Flags
GCFLAGS  = -Wall -fno-common -mthumb -mcpu=$(CPU) -DSTM32F103xB --specs=nosys.specs -g -Wa,-ahlms=$(addprefix $(OBJDIR)/,$(notdir $(<:.c=.lst)))
GCFLAGS += $(INCLUDE)
LDFLAGS += -T$(LSCRIPT) -mthumb -mcpu=$(CPU) --specs=nosys.specs
ASFLAGS += -mcpu=$(CPU)

# Flashing
OCDFLAGS = -f /usr/share/openocd/scripts/interface/stlink-v2.cfg \
                -f /usr/share/openocd/scripts/target/stm32f1x.cfg \
                -f openocd.cfg

# Tools
CC = arm-none-eabi-gcc
AS = arm-none-eabi-as
AR = arm-none-eabi-ar
LD = arm-none-eabi-ld
OBJCOPY = arm-none-eabi-objcopy
SIZE = arm-none-eabi-size
OBJDUMP = arm-none-eabi-objdump
OCD = openocd

RM = rm -rf

## Build process

OBJ := $(addprefix $(OBJDIR)/,$(notdir $(SRC:.c=.o)))
OBJ += $(addprefix $(OBJDIR)/,$(notdir $(ASM:.s=.o)))

all:: $(BINDIR)/$(PROJECT).bin

Build: $(BINDIR)/$(PROJECT).bin

install: $(BINDIR)/$(PROJECT).bin
	$(OCD) $(OCDFLAGS)

$(BINDIR)/$(PROJECT).hex: $(BINDIR)/$(PROJECT).elf
	$(OBJCOPY) -R .stack -O ihex $(BINDIR)/$(PROJECT).elf $(BINDIR)/$(PROJECT).hex

$(BINDIR)/$(PROJECT).bin: $(BINDIR)/$(PROJECT).elf
	$(OBJCOPY) -R .stack -O binary $(BINDIR)/$(PROJECT).elf $(BINDIR)/$(PROJECT).bin

$(BINDIR)/$(PROJECT).elf: $(OBJ)
	@mkdir -p $(dir $@)
	$(CC) $(OBJ) $(LDFLAGS) -o $(BINDIR)/$(PROJECT).elf
	$(OBJDUMP) -D $(BINDIR)/$(PROJECT).elf > $(BINDIR)/$(PROJECT).lst
	$(SIZE) $(BINDIR)/$(PROJECT).elf

# Print the compiler's predefined macros (target name reconstructed)
macros:
	$(CC) $(GCFLAGS) -dM -E - < /dev/null

cleanBuild: clean

clean:
	$(RM) $(BINDIR)
	$(RM) $(OBJDIR)

# Compilation
$(OBJDIR)/%.o: $(SRCDIR)/%.c
	@mkdir -p $(dir $@)
	$(CC) $(GCFLAGS) -c $< -o $@

$(OBJDIR)/%.o: $(SRCDIR)/%.s
	@mkdir -p $(dir $@)
	$(AS) $(ASFLAGS) -o $@ $<

$(OBJDIR)/%.o: $(COMDIR)/%.c
	@mkdir -p $(dir $@)
	$(CC) $(GCFLAGS) -c $< -o $@

$(OBJDIR)/%.o: $(COMDIR)/%.s
	@mkdir -p $(dir $@)
	$(AS) $(ASFLAGS) -o $@ $<

The result of this makefile is that it will create a file called "bin/blink.bin" which contains our compiled program. We can then flash this to our microcontroller using openocd.

Step 7: Flashing the program to the microcontroller

Source for this step:

This is the very last step. We get to do some openocd configuration. Firstly, we need to write a small configuration script that will tell openocd how to flash our program. Here it is:

# Configuration for flashing the blink program
init
reset halt
flash write_image erase bin/blink.bin 0x08000000
reset run
shutdown

Firstly, we init and halt the processor (reset halt). When the processor is first powered up, it is going to be running whatever program was previously flashed onto the microcontroller. We want to stop this execution before we overwrite the flash. Next we execute "flash write_image erase" which will first erase the flash memory (if needed) and then write our program to it. After writing the program, we then tell the processor to execute the program we just flashed (reset run) and we shut down openocd.

Now, openocd requires knowledge of a few things. It first needs to know what programmer to use. Next, it needs to know what device is attached to the programmer. Both of these requirements must be satisfied before we can run our script above. We know that we have an stlinkv2 for a programmer and an stm32f1xx attached on the other end. It turns out that openocd actually comes with configuration files for these. On my installation these are located at "/usr/share/openocd/scripts/interface/stlink-v2.cfg" and "/usr/share/openocd/scripts/target/stm32f1x.cfg", respectively. We can execute all three files (stlink, stm32f1, and our flashing routine (which I have named "openocd.cfg")) with openocd as follows:

openocd -f /usr/share/openocd/scripts/interface/stlink-v2.cfg \
                -f /usr/share/openocd/scripts/target/stm32f1x.cfg \
                -f openocd.cfg

So, small sidenote: If we left off the "shutdown" command, openocd would actually continue running in "daemon" mode, listening for connections to it. If you wanted to use gdb to interact with the program running on the microcontroller, that is what you would use to do it. You would tell gdb that there is a "remote target" at port 3333 (or something like that). Openocd will be listening at that port and so when gdb starts talking to it and trying to issue debug commands, openocd will translate those through the STLinkV2 and send back the translated responses from the microcontroller. Isn't that sick?

In the makefile earlier, I actually made this the "install" target, so running "sudo make install" will actually flash the microcontroller. Here is my output from that command for your reference:

kcuzner@kcuzner-laptop:~/Projects/ARM/stm32f103-blink$ sudo make install
arm-none-eabi-gcc -Wall -fno-common -mthumb -mcpu=cortex-m3 -DSTM32F103xB --specs=nosys.specs -g -Wa,-ahlms=obj/system_stm32f1xx.lst -Iinclude -Icmsis -c src/system_stm32f1xx.c -o obj/system_stm32f1xx.o
arm-none-eabi-gcc -Wall -fno-common -mthumb -mcpu=cortex-m3 -DSTM32F103xB --specs=nosys.specs -g -Wa,-ahlms=obj/main.lst -Iinclude -Icmsis -c src/main.c -o obj/main.o
arm-none-eabi-as -mcpu=cortex-m3 -o obj/startup_stm32f103x6.o src/startup_stm32f103x6.s
arm-none-eabi-gcc obj/system_stm32f1xx.o obj/main.o obj/startup_stm32f103x6.o -TSTM32F103X8_FLASH.ld -mthumb -mcpu=cortex-m3 --specs=nosys.specs  -o bin/blink.elf
arm-none-eabi-objdump -D bin/blink.elf > bin/blink.lst
arm-none-eabi-size bin/blink.elf
   text         data     bss     dec     hex filename
   1756         1092    1564    4412    113c bin/blink.elf
arm-none-eabi-objcopy -R .stack -O binary bin/blink.elf bin/blink.bin
openocd -f /usr/share/openocd/scripts/interface/stlink-v2.cfg -f /usr/share/openocd/scripts/target/stm32f1x.cfg -f openocd.cfg
Open On-Chip Debugger 0.9.0 (2016-04-27-23:18)
Licensed under GNU GPL v2
For bug reports, read
Info : auto-selecting first available session transport "hla_swd". To override use 'transport select <transport>'.
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
adapter speed: 1000 kHz
adapter_nsrst_delay: 100
none separate
Info : Unable to match requested speed 1000 kHz, using 950 kHz
Info : Unable to match requested speed 1000 kHz, using 950 kHz
Info : clock speed 950 kHz
Info : STLINK v2 JTAG v17 API v2 SWIM v4 VID 0x0483 PID 0x3748
Info : using stlink api v2
Info : Target voltage: 3.335870
Info : stm32f1x.cpu: hardware has 6 breakpoints, 4 watchpoints
target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000380 msp: 0x20004ffc
auto erase enabled
Info : device id = 0x20036410
Info : flash size = 64kbytes
target state: halted
target halted due to breakpoint, current mode: Thread
xPSR: 0x61000000 pc: 0x2000003a msp: 0x20004ffc
wrote 3072 bytes from file bin/blink.bin in 0.249272s (12.035 KiB/s)
shutdown command invoked

After doing that I saw the following awesomeness:


Wooo!!! The LED blinks! At this point, you have successfully flashed an ARM Cortex-M3 microcontroller with little more than a cheap programmer from eBay, a breakout board, and a few stray wires. Feel happy about yourself.


For me, this marks the end of one journey and the beginning of another. I can now feel free to experiment with ARM microcontrollers without having to worry about ruining a nice shiny development board. I can buy an obscenely powerful $1 STM32 microcontroller from eBay and put it into any project I want. If I were to try to do that with AVRs, I would be stuck with the ultra-low-end 8-pin ATTiny13A since that's about it for ~$1 AVR eBay offerings (don't worry...I've got plenty of ATMega328PB's...though they weren't $1). I sincerely hope that you found this tutorial useful and that it might serve as a springboard for doing your own dev board-free ARM development.

If you have any questions or comments (or want to let me know about any errors I may have made), let me know in the comments section here. I will try my best to help you out, although I can't always find the time to address every issue.

Writing a preemptive task scheduler for AVR

Wow it has been a while. Between school, work, and another project that I've been working on since last October (which, if ultimately successful, I will post here) I haven't had a lot of time to write about anything cool.

I wanted to share today something cool I wrote for my AVRs. Many of my recent AVR projects have become rather complex in that they usually are split into multiple parts in the software which interact with each other. One project in particular had the following components:

  • A state machine managing several PWM channels. It implemented behavior like pulsing, flashing, etc. It also provided methods for the other components to interact with it.
  • A state machine managing an NRF24L01+ radio module, again providing methods for components to interact with it.
  • A state machine managing several inputs, interpreting them and sending commands to the other two components

So, why did I use state machines rather than just implementing this whole thing in a giant loop with some interrupts mixed in? The answer is twofold:

  1. Spaghetti. Doing things in state machines avoided spaghetti code that would have otherwise made the system very difficult to modify. Each machine was isolated from the others both structurally and in software (static variables/methods, etc). The only way to interact with the state machines was to use my provided methods which lends itself to a clean interface between the different components of the program. Switching out the logic in one part of the program did not have any effect on the other components (unless method signatures were changed, of course).
  2. Speed. All of these state machines had a requirement of being able to respond quickly to events, whether they were input events from the user or an interrupt from a timer and whatnot. Interrupts make responding to things in a timely fashion easy, but they stop all other interrupts while running (unless doing nested interrupts, which is ...unwieldy... in an AVR) and so doing a lot of computation during an interrupt can slow down the other parts of the system. By organizing this into state machines, I could split up parts of computation into pieces that could execute fast and allow other parts to run in response to their interrupts quickly.

None of this, however, is particularly new. Everyone does state machines and they are comparatively easy to implement. In almost every project I have done there has been some form of state machine, whether it was just busy waiting for some flag somewhere and doing something after that flag was set, or doing something much more complex. What I wanted to show today was a different way of dealing with the issues of spaghetti and speed: Building a preempting task scheduler. These are considered a key and central component of Real-Time Operating Systems, so what you see in this article is the beginnings of a real-time kernel for an AVR microcontroller.

I should mention here that there is a certain level of ability assumed with AVRs in this post. I assume that the reader has a good knowledge of how programs work with the stack, the purpose and functioning of the general purpose registers, how function calls actually happen on the microcontroller, how interrupts work, the ability to read AVR assembly (don't worry most of this is in C...but there is some critical code written in assembly), and a general knowledge of the avr-gcc toolchain.


The definition of preemption in computing is being able to interrupt a task, without its cooperation, with the intention of executing it later, in order to run some other task.

Firstly we need to define a task: A task is a unit of execution; something that the program structure defines as a self-contained part of the program which accomplishes some purpose. In our case, a task is basically going to just be a method which never returns. The next thing to define is a scheduler. A scheduler is a software component which can decide from a list of tasks which task needs to run on the processor. The scheduler uses a dispatcher to actually change the code that is executing from the current task to another one. This is called saving and restoring the context of the task.

The cool thing about preemptive scheduling is that any task can be interrupted at any time and another task can start executing. Now, it could be asked, "Well, isn't that what interrupts do? What's the difference?". The difference is that an interrupt in a system using a preemptive scheduler can actually resume in a different place from where it started. Without a scheduler, when an interrupt ends the processor will send the code right back to where it was executing when the interrupt occurred. In contrast, the scheduler actually allows the program to have a little more control and move from one task to another in response to an interrupt or some other stimulus (yes, it doesn't even need to be an interrupt; it could be another task!).

In the state machine example above, I used state machines as a way to break up computations so that the other machines could run at predetermined points. I noticed that in most of these cases, this point was when waiting for some user input or an interrupt. Although I used interrupts extensively in the application, there were a lot of flags to be polled and this happened inside state machine tick functions. When an interrupt occurred and set a flag, it would need to wait for the "main" code to get around to executing the tick function for the particular state machine that listens to the flag before anything could happen. This introduces a lot of latency and jitter (differences in the amount of time it takes the system to respond to an interrupt from one moment to the next). Not good.
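To make the polling problem a little more concrete, here is a minimal sketch of the pattern described above. It is not code from the actual project: the names (button_flag, fake_button_isr, led_tick) are invented, and the "ISR" is an ordinary function so that the sketch can run on a PC.

```c
#include <stdbool.h>

/* Flag that an ISR would set. Hypothetical name for illustration only. */
static volatile bool button_flag = false;

typedef enum { LED_IDLE, LED_ON } LedState;
static LedState led_state = LED_IDLE;

/* Stands in for the interrupt handler: it only records that something
 * happened and leaves the real work to the state machine. */
void fake_button_isr(void)
{
    button_flag = true;
}

/* One step of the state machine. The flag is only noticed when the main
 * loop gets around to calling this function, which is where the latency
 * and jitter come from. */
LedState led_tick(void)
{
    switch (led_state) {
    case LED_IDLE:
        if (button_flag) {
            button_flag = false;
            led_state = LED_ON;
        }
        break;
    case LED_ON:
        led_state = LED_IDLE;
        break;
    }
    return led_state;
}
```

However long the other tick functions in the main loop take, nothing happens in response to the flag until led_tick is called again.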

Using tasks removes a lot of these latency problems since the interrupt can halt the current task (block) and begin executing another which was waiting on the interrupt to occur (unblock or resume). Once the higher priority task is blocked again (through a call to some function asking it to wait for some event), the scheduler will change back to the original task and things go on as usual. Using tasks also has the effect of making the code easier to read. While state machines are easy to write, they are not always easy to follow. A function is invoked over and over again and that requires more thought than simply reading a linear function. Tasks can be very linear since the state machine is embodied in calls which could possibly block the task.

A positive side effect of doing things this way (with a scheduler) is that we can now implement the familiar things such as semaphores and queues to communicate between our tasks in a fine grained manner. At their core these are simply methods that can manipulate the list of tasks and call the scheduler to decide which task to execute next.

Summary:  Using a preemptive scheduler can allow for lower latency and jitter between an interrupt occurring and some non-ISR code responding to it when compared to using several state machines. This means it will respond more consistently to interrupts occurring, though not necessarily in a more timely fashion. It also allows for fine-grained priority control with these responses.

Pros and Cons

Before continuing, I would like to point out some pros and cons that I see of writing a task scheduler lest we fall into the "golden hammer" antipattern. There are certainly more, but here is my list (feel free to add your own in the comments).

Some Pros

  • Can reduce the jitter (and possibly the latency) in responding to interrupts. This is of paramount importance in some embedded systems which will have problems if the system cannot respond in a predictable manner to external stimuli.
  • Can greatly simplify application code by using familiar constructs such as semaphores and queues. Compared to state machines, this code can be easier to read as it can be written very linearly (no switches, if's etc). This can reduce the initial bugs found in programs.
  • Can entirely remove the need for busy waits (loops polling a flag). A properly designed state machine shouldn't have these either, but it can take a large amount of effort to design these kinds of machines. They also can take up a lot of program space when space is at a premium (not always true).
  • Can reduce application code size. This is weak, but since the code can be made more linear with calls to the scheduler rather than returning all the time, there is no need for switch statements and ifs which can compile to some beastly assembly code.

Some Cons

  • Can add unnecessary complexity to the program in general. A task scheduler is no small thing and brings with it all of the issues seen in concurrent programming in general. However, these issues usually already exist when using interrupts and such.
  • Can be very hard to debug. I needed an emulator to get this code working correctly. Anything where we mess with the stack pointer or program counter is going to be a very precise exercise.
  • Can make the application itself hard to debug. Is it a problem with the scheduler? Or is it a problem with the program itself? It is an additional component to consider when debugging.
  • Adds additional program weight. My base implementation uses ~450 bytes of program memory. While quite tiny compared to many programs, this would be unacceptably high on a smaller AVR such as the ATTiny13A which only has 1K of program memory.

So...lots of those are contradictory. What is a pro can also be a con. Anyway, I'm just presenting this as something cool to do, not as the end all be all of ways to structure an embedded program. If you have a microcontroller that is performing a lot of tasks that need to be able to react reliably to an interrupt, this might be the way to go for you. However, if your microcontroller is just toggling some gpios and reacting to some timers, this might be overkill. It all depends on the application.


Mmmkay here's the fun part. At this point you may be asking, "How in the world can we make something that can interrupt during one function and resume into another?" I recently completed a course on Real-Time Operating Systems (RTOS) at my university which opened my eyes into how this can be done (we wrote one in that course...awesome!), so I promptly wrote one for the AVR. For those who come by here who have taken the same course at BYU, they will notice some distinct similarities since I went with what I knew. I've named it KOS, for "Kevin's Operating System", but this was just so I had an easy prefix for my types and function names. If you're going to implement your own based on this article, don't worry about naming it like mine (though a mention of this article somewhere would be cool).

Disclaimer: I have only started to scratch the surface of this stuff myself and I may have made some errors. I appreciate any insight anyone can give me into either suggestions for this or problems with my implementation. Just leave it in the comments :)

All of the code can be found here:

The focus of a scheduler/dispatcher system for tasks is manipulating the stack pointer and the stack itself. "Traditionally," programs written for microcontrollers have a single stack which starts at the end of RAM and grows toward lower addresses, and all code is executed on that stack. The concept here is that we still start out with that stack, but we actually execute the tasks on their own separate stacks. When we want to switch to a task, we point the AVR's stack pointer to the desired task's stack and start executing (it's the "start executing" part where things get fun).

First, let's take a look at the structure which represents a task:

typedef enum { TASK_READY, TASK_SEMAPHORE, TASK_QUEUE } KOS_TaskStatus;

typedef struct KOS_Task {
    void *sp;
    KOS_TaskStatus status;
    struct KOS_Task *next;
    void *status_pointer;
} KOS_Task;

The very first item in this struct is the task's saved stack pointer (sp). It is a void* because we don't normally access anything through it...we just make the SP register point to it when we want to execute the task.

The next item in the struct is a status enum. This is used by my primitive scheduler to determine if a task is "READY" to execute. If a task is ready to execute, then it is not waiting on anything (i.e. blocked) and it can be resumed at any time. In the case where the task is waiting on something like a semaphore, this status would be changed to SEMAPHORE. The semaphore posting code would then change the status back to READY once somebody posted to the semaphore. This is called "unblocking".

After the status comes the *next pointer. The tasks are arranged in a linked list because they have a priority attached to them. This priority determines which tasks get executed first. At the top of the linked list is the highest priority task and at the end of the list is the lowest priority task.

Finally, we have the *status_pointer. This is used by our functions which can unblock tasks to determine why tasks are blocked in the first place. We will see more about this when we make a primitive semaphore.
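To give a feel for how the status and status_pointer fields could work together, here is a rough sketch of blocking a task on some object and later unblocking it. This is only my illustration under assumptions, not the semaphore code from KOS: the functions sketch_block_on and sketch_unblock_one are hypothetical and the "semaphore" is just an address used as a tag.

```c
#include <stddef.h>

typedef enum { TASK_READY, TASK_SEMAPHORE, TASK_QUEUE } KOS_TaskStatus;

typedef struct KOS_Task {
    void *sp;
    KOS_TaskStatus status;
    struct KOS_Task *next;
    void *status_pointer;
} KOS_Task;

/* Hypothetical: mark `task` as blocked on the object at `sem`. */
void sketch_block_on(KOS_Task *task, void *sem)
{
    task->status = TASK_SEMAPHORE;
    task->status_pointer = sem;
}

/* Hypothetical: unblock the highest priority task waiting on `sem`.
 * The list is ordered by priority, so the first match is the one we want.
 * Returns the task made ready, or NULL if nobody was waiting. */
KOS_Task *sketch_unblock_one(KOS_Task *head, void *sem)
{
    for (KOS_Task *t = head; t != NULL; t = t->next) {
        if (t->status == TASK_SEMAPHORE && t->status_pointer == sem) {
            t->status = TASK_READY;
            t->status_pointer = NULL;
            return t;
        }
    }
    return NULL;
}
```

The status says what kind of thing a task is waiting on and the status_pointer says which one, so a post only wakes tasks that were waiting on that particular object.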

Ok, so for the basic task scheduling and dispatching functionality we are going to implement some functions (these are declared in a header):

typedef void (*KOS_TaskFn)(void);

extern KOS_Task *kos_current_task;

/**
 * Initializes the KOS kernel
 */
void kos_init(void);

/**
 * Creates a new task
 * Note: Not safe
 */
void kos_new_task(KOS_TaskFn task, void *sp);

/**
 * Puts KOS in ISR mode
 * Note: Not safe, assumes non-nested isrs
 */
void kos_isr_enter(void);

/**
 * Leaves ISR mode, possibly executing the dispatcher
 * Note: Not safe, assumes non-nested isrs
 */
void kos_isr_exit(void);

/**
 * Runs the kernel
 */
void kos_run(void);

/**
 * Runs the scheduler
 */
void kos_schedule(void);

/**
 * Dispatches the passed task, saving the context of the current task
 */
void kos_dispatch(KOS_Task *next);

As for source files, we will only have a single C file for the implementation, but there will be some inline assembly because we are going to have to fiddle with registers. Yay! I'll just go through the functions one by one and afterwards I'll go through my design decisions and how they affect things. This is not the only, nor the best, way to do this.

Implementation: kos_init and kos_new_task

Firstly, we have the kos_init and kos_new_task functions, which come with some baggage:

static KOS_Task tasks[KOS_MAX_TASKS + 1];
static uint8_t next_task = 0;
static KOS_Task *task_head;
KOS_Task *kos_current_task;

static uint8_t kos_idle_task_stack[KOS_IDLE_TASK_STACK];
static void kos_idle_task(void)
{
    while (1) { }
}

void kos_init(void)
{
    kos_new_task(&kos_idle_task, &kos_idle_task_stack[KOS_IDLE_TASK_STACK - 1]);
}

void kos_new_task(KOS_TaskFn task, void *sp)
{
    int8_t i;
    uint8_t *stack = sp;
    KOS_Task *tcb;

    //make space for pc, sreg, and 32 registers
    stack[0] = (uint16_t)task & 0xFF;
    stack[-1] = (uint16_t)task >> 8;
    for (i = -2; i > -34; i--)
    {
        stack[i] = 0;
    }
    stack[-34] = 0x80; //sreg, interrupts enabled

    //create the task structure
    tcb = &tasks[next_task++];
    tcb->sp = stack - 35;
    tcb->status = TASK_READY;

    //insert into the task list as the new highest priority task
    if (task_head)
    {
        tcb->next = task_head;
        task_head = tcb;
    }
    else
    {
        task_head = tcb;
    }
}

Here we have two concepts that are embodied. The first is the context. The context is the data pushed onto the stack that the dispatcher is going to use in order to restore the task before executing it. This is similar (identical even) to the procedure used with interrupt service routines, except that we store every single one of the 32 registers instead of just the ones that we use. The next concept is that of the idle task. As an optimization, there is a task which has the lowest priority and is never blocked. It is always ready to execute, so when all other tasks are blocked, it will run. This means that we don't have to deal with the case in the scheduler when there are no tasks to execute since there will always be a task.

The kos_init function performs only one operation: Add the idle task to the list of tasks to execute. Notice that there was some space allocated for the stack of the idle task. This stack must be at least as large as the entire context (35 bytes here) plus enough for any interrupts which may occur during program execution. I chose 48 bytes, but it could be as large as you want. Also take note of the pointer that we pass for the stack into kos_new_task: It is a pointer to the end of our array. This is because the AVR stack grows toward lower addresses: a push decrements the stack pointer and a pop increments it. If we passed the beginning of the array, the first push would make us point before the memory allocated to the stack, since array elements are laid out at increasing addresses.

The kos_new_task function is a little more complex. It performs two operations: setting up the initial context for the function and adding the Task structure to the linked list of tasks. The context needs to be set up initially because from the scheduler's perspective, the new task is simply an unblocked task that was blocked before. Therefore, it expects that some context is stored on that task's stack. Our context is ordered such that the PC (program counter) is first, the 32 registers are next, and the status register is last. Since the stack is last-in first-out, the SREG is popped first, then the 32 registers, and then the PC. We can see at the beginning of the function that we take the function pointer (they are usually 16 bits on most AVRs...the ones with lots of flash do it differently, so consult your datasheets) and set it up to be the program counter. It is arranged LSB-first, so the LSByte is "pushed" before the MSByte. The order here is very important and the reason why will become very apparent when we see the code for the dispatcher. After that, we put 32 0's onto the stack. These are the initial values for the registers and 0 seemed like a sensible value. The very last byte "pushed" is the status register. We set it to 0x80 so that the interrupt flag is set. This is a design decision to prevent problems with forgetting to enable interrupts for every task and having one task where we forgot to enable it prevent all interrupts from executing. Finally, the top of the stack (note the subtraction of 35 bytes from the stack pointer) is stored on the Task struct along with the initial task state. We add it to the task list as the head of the list, so the last task added is the task with the highest priority.
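The layout can be double checked off-target with a small sketch like the following. It mirrors the frame-building part of kos_new_task, but it is my own stand-in rather than the KOS code: sketch_init_frame and SKETCH_CONTEXT_SIZE are invented names, and the task address is passed in as a plain 16-bit number so that it runs on a PC where function pointers are wider.

```c
#include <stdint.h>

#define SKETCH_CONTEXT_SIZE 35  /* 2 bytes of PC + 32 registers + SREG */

/* Builds the initial context ending at `top` (the last byte of a stack
 * array) and returns the value to store as the task's saved stack pointer.
 * `entry` stands in for the 16-bit address of the task function. */
uint8_t *sketch_init_frame(uint8_t *top, uint16_t entry)
{
    int i;

    top[0]  = (uint8_t)(entry & 0xFF);  /* PC low byte, "pushed" first    */
    top[-1] = (uint8_t)(entry >> 8);    /* PC high byte                   */
    for (i = -2; i > -34; i--) {
        top[i] = 0;                     /* 32 general purpose registers   */
    }
    top[-34] = 0x80;                    /* SREG with the I bit set        */

    return top - SKETCH_CONTEXT_SIZE;   /* next free byte below the frame */
}
```

With a 48 byte stack array, the frame occupies indices 13 through 47 and the saved stack pointer ends up at index 12, which leaves 13 bytes of headroom before the start of the array.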

Implementation: kos_run and kos_schedule

Next we have the kos_run function:

void kos_run(void)
{
    kos_schedule();
}

Well that's simple: it just calls the scheduler. So, let's look at kos_schedule:

void kos_schedule(void)
{
    if (kos_isr_level)
        return;

    KOS_Task *task = task_head;
    while (task->status != TASK_READY)
        task = task->next;

    if (task != kos_current_task)
    {
        ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
        {
            kos_dispatch(task);
        }
    }
}

The very first thing to notice is the kos_isr_level reference. This solves a very specific problem that occurs with ISRs which I talk about in the next section. Other than that bit, however, this is also simple. Because our tasks in the linked list are ordered by priority, we can simply start at the top and move along the linked list until we locate the first task that is ready (unblocked). Once that task is found, we will call the dispatcher if the task we found is not the currently executing task.

The purpose of the ATOMIC_BLOCK is to ensure that interrupts are disabled when the dispatcher runs. Since the stack is going to be manipulated, the entire dispatcher is considered to be a critical section of code and must be run atomically. The ATOMIC_BLOCK will restore the interrupt status after kos_dispatch returns (which is after the task has been resumed).
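The priority walk in kos_schedule can be tried out on a PC with a sketch like this one. It only reproduces the search for the first ready task; the dispatching and the atomic block are left out, and sketch_pick_ready is a name I made up for the illustration.

```c
#include <stddef.h>

typedef enum { TASK_READY, TASK_SEMAPHORE, TASK_QUEUE } KOS_TaskStatus;

typedef struct KOS_Task {
    void *sp;
    KOS_TaskStatus status;
    struct KOS_Task *next;
    void *status_pointer;
} KOS_Task;

/* Hypothetical stand-in for the search inside kos_schedule. It relies on
 * the last task in the list (the idle task) always being ready, so the
 * loop never runs off the end of the list. */
KOS_Task *sketch_pick_ready(KOS_Task *head)
{
    KOS_Task *task = head;

    while (task->status != TASK_READY) {
        task = task->next;
    }
    return task;
}
```

When every real task is blocked the search falls through to the idle task, and as soon as a higher priority task becomes ready it is found first because it sits earlier in the list.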

Implementation: kos_isr_enter and kos_isr_exit

We are faced with a very particular problem when we want to call our scheduler inside of an interrupt. Let's imagine a scenario where we have two tasks, Task A and Task B (Task A has higher priority than Task B), in addition to the idle task. Task A waits on two semaphores (semaphores 1 and 2) that are signaled by an ISR. When Task A is running, it signals another semaphore that Task B waits on (semaphore 3). Here is what happens:

  1. The idle task is running because both Task A and Task B are waiting on semaphores.
  2. An interrupt occurs (note that it happens during the idle task) and the ISR begins executing immediately. An ISR can be thought of as a super high priority task since it will interrupt anything.
  3. The ISR posts to semaphore 1 which Task A is waiting on. The very next statement is going to be to signal semaphore 2 as well. However, this happens next:
  4. After signaling semaphore 1, the dispatcher runs and Task A begins to execute. Task A signals semaphore 3 which will cause Task B to run. Since Task A has a higher priority than B, however, Task B isn't executed yet. Task A goes on to wait on semaphore 2. This then causes Task B to be dispatched.
  5. Task B takes a really long time to run, but it finally ends. There are no more tasks on the ready list, so the idle task begins to run.
  6. The idle task resumes inside the ISR and posts to semaphore 2.
  7. Task A begins running again.

As straightforward as that may seem, that isn't the intended behavior. Imagine if a task with an even higher priority than A had the ISR occur while it was executing. The sequence above would be totally different because Task A wouldn't be dispatched after the 1st semaphore being posted (item #4). Let's see what happens:

  1. A high priority task is running. Both Task A and Task B are waiting on semaphores.
  2. An interrupt occurs (note that this time it happens during the high priority task) and the ISR begins executing immediately. An ISR can be thought of as a super high priority task since it will interrupt anything.
  3. The ISR posts to semaphore 1 which task A is waiting on.
  4. After signaling semaphore 1, the scheduler notices that the current task has a higher priority than Task A, so it does not dispatch.
  5. The ISR posts to semaphore 2.
  6. Same as #4. The ISR ends. Let's say that the high priority task blocks soon afterwards.
  7. Once the high priority task has blocked, Task A is executed. It posts to semaphore 3 and then waits on semaphore 2. Since semaphore 2 has already been posted, it continues right on through without a task switch to Task B. This is a major difference in the order of operations.
  8. After Task A finally blocks, Task B executes.

Because of the inconsistency and the fact that the ISR "priority" when viewed by the scheduler is determined by possibly random ISRs (making it non-deterministic), we need to fix this. The solution I went with was to make two methods: kos_isr_enter and kos_isr_exit. These should be called when an ISR begins and when an ISR ends to temporarily hold off calling the scheduler until the very end of the ISR. This has the effect of giving an ISR an apparently high priority since it will not switch to another task until it has completely finished. So, although the idle task may be running when the ISR occurs, while the ISR is running no context switches will occur until the very end. Here is some code:

static uint8_t kos_isr_level = 0;
void kos_isr_enter(void)
{
    kos_isr_level++;
}

void kos_isr_exit(void)
{
    kos_isr_level--;
    kos_schedule();
}

As seen in kos_schedule, we use the kos_isr_level variable to indicate to the scheduler whether we are in an ISR or not. When kos_isr_level finally returns to 0, the scheduler will actually perform scheduling when it is called at the end of kos_isr_exit. The second set of events described earlier will now happen every time, even if the idle task is interrupted.

These functions must be run with interrupts disabled since they don't use any sort of locking, but they should support nested interrupts so long as they are called at the point in the interrupt when interrupts have been disabled.
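
To see the deferral in isolation, here is a small host-side model in plain C (no AVR hardware involved). It is only a sketch: every `model_` name is a hypothetical stand-in, and it assumes that the scheduler simply returns early while the ISR level is nonzero, as described above. The ISR makes two posts, each of which would normally invoke the scheduler, yet scheduling only really happens once, at the exit call.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel state; this is not the real KOS code. */
uint8_t model_isr_level = 0;          /* nesting depth of running ISRs */
unsigned model_dispatch_count = 0;    /* how often scheduling really ran */

/* Models the scheduler: it does nothing while any ISR is still running. */
void model_schedule(void)
{
    if (model_isr_level != 0)
        return;
    model_dispatch_count++;
}

void model_isr_enter(void)
{
    model_isr_level++;
}

void model_isr_exit(void)
{
    model_isr_level--;
    model_schedule();
}

/* Models the ISR from the example: two semaphore posts, each of which ends
 * with a call to the scheduler. Returns how many times scheduling actually
 * ran because of this ISR. */
unsigned model_run_isr(void)
{
    unsigned before = model_dispatch_count;
    model_isr_enter();
    model_schedule();   /* post to semaphore 1: deferred */
    model_schedule();   /* post to semaphore 2: deferred */
    model_isr_exit();   /* level returns to 0, scheduling runs once */
    return model_dispatch_count - before;
}
```

Calling `model_run_isr()` from a `main` reports a single scheduling pass no matter which task was interrupted, which is the consistency this section is after.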

Implementation: kos_dispatch

The dispatcher is written basically entirely in inline assembly because it does the actual stack manipulation:

void kos_dispatch(KOS_Task *task)
{
    // the call to this function should push the return address into the stack.
    // we will now construct saving context. The entire context needs to be
    // saved because it is very possible that this could be called from within
    // an isr that doesn't use the call-used registers and therefore doesn't
    // save them.
    asm volatile (
            "push r31 \n\t"
            "push r30 \n\t"
            "push r29 \n\t"
            "push r28 \n\t"
            "push r27 \n\t"
            "push r26 \n\t"
            "push r25 \n\t"
            "push r24 \n\t"
            "push r23 \n\t"
            "push r22 \n\t"
            "push r21 \n\t"
            "push r20 \n\t"
            "push r19 \n\t"
            "push r18 \n\t"
            "push r17 \n\t"
            "push r16 \n\t"
            "push r15 \n\t"
            "push r14 \n\t"
            "push r13 \n\t"
            "push r12 \n\t"
            "push r11 \n\t"
            "push r10 \n\t"
            "push r9 \n\t"
            "push r8 \n\t"
            "push r7 \n\t"
            "push r6 \n\t"
            "push r5 \n\t"
            "push r4 \n\t"
            "push r3 \n\t"
            "push r2 \n\t"
            "push r1 \n\t"
            "push r0 \n\t"
            "in   r0, %[_SREG_] \n\t" //push sreg
            "push r0 \n\t"
            "lds  r26, kos_current_task \n\t"
            "lds  r27, kos_current_task+1 \n\t"
            "sbiw r26, 0 \n\t"
            "breq 1f \n\t" //null check, skip next section
            "in   r0, %[_SPL_] \n\t"
            "st   X+, r0 \n\t"
            "in   r0, %[_SPH_] \n\t"
            "st   X+, r0 \n\t"
            "1: \n\t" //begin dispatching
            "mov  r26, %A[_next_task_] \n\t"
            "mov  r27, %B[_next_task_] \n\t"
            "sts  kos_current_task, r26 \n\t" //set current task
            "sts  kos_current_task+1, r27 \n\t"
            "ld   r0, X+ \n\t" //load stack pointer
            "out  %[_SPL_], r0 \n\t"
            "ld   r0, X+ \n\t"
            "out  %[_SPH_], r0 \n\t"
            "pop  r31 \n\t" //status into r31: andi requires register above 15
            "bst  r31, %[_I_] \n\t" //we don't want to enable interrupts just yet, so store the interrupt status in T
            "bld  r31, %[_T_] \n\t" //T flag is on the call clobber list and tasks are only blocked as a result of a function call
            "andi r31, %[_nI_MASK_] \n\t" //I is now stored in T, so clear I
            "out  %[_SREG_], r31 \n\t"
            "pop  r0 \n\t"
            "pop  r1 \n\t"
            "pop  r2 \n\t"
            "pop  r3 \n\t"
            "pop  r4 \n\t"
            "pop  r5 \n\t"
            "pop  r6 \n\t"
            "pop  r7 \n\t"
            "pop  r8 \n\t"
            "pop  r9 \n\t"
            "pop  r10 \n\t"
            "pop  r11 \n\t"
            "pop  r12 \n\t"
            "pop  r13 \n\t"
            "pop  r14 \n\t"
            "pop  r15 \n\t"
            "pop  r16 \n\t"
            "pop  r17 \n\t"
            "pop  r18 \n\t"
            "pop  r19 \n\t"
            "pop  r20 \n\t"
            "pop  r21 \n\t"
            "pop  r22 \n\t"
            "pop  r23 \n\t"
            "pop  r24 \n\t"
            "pop  r25 \n\t"
            "pop  r26 \n\t"
            "pop  r27 \n\t"
            "pop  r28 \n\t"
            "pop  r29 \n\t"
            "pop  r30 \n\t"
            "pop  r31 \n\t"
            "brtc 2f \n\t" //if the T flag is clear, do the non-interrupt enable return
            "reti \n\t"
            "2: \n\t"
            "ret \n\t"
            "" ::
            [_SREG_] "i" _SFR_IO_ADDR(SREG),
            [_I_] "i" SREG_I,
            [_T_] "i" SREG_T,
            [_nI_MASK_] "i" (~(1 << SREG_I)),
            [_SPL_] "i" _SFR_IO_ADDR(SPL),
            [_SPH_] "i" _SFR_IO_ADDR(SPH),
            [_next_task_] "r" (task));
}

So, a lot is happening here. There are 4 basic steps: Save the current context, update the current task's stack pointer, change the stack pointer to the next task, and restore the next task's context.

Inline assembly has an interesting syntax in GCC. I don't believe it is fully portable into non-GCC compilers, so this makes the code depend more or less on GCC. Inline assembly works by way of placeholders (called Operands in the manual). At the very end of the assembly statement, we see a series of comma-separated statements which define these placeholders/operands and how the assembly is going to use registers and such. First off, we pass in the SREG, SPL, and SPH registers as type "i", which is a constant number known at compile-time. These are simply the IO addresses for these registers (found in avr/io.h if you follow the #include chain deep enough). The next couple parameters are also "i" and are simply bit numbers and masks.

The last parameter is the next task pointer passed in as an argument. This is the part where we see the reason why it is more convenient to do this in inline assembly rather than writing it up in an assembly file. While it is possible to look up how avr-gcc passes arguments to functions and discover that the arguments are stored in a certain order in certain registers, it is far simpler and less breakable to allow gcc to fill in the blanks for us. By stating that the _next_task_ placeholder is of type "r" (register), we force GCC to place that variable into some registers of its choosing. Now, if we were using some global variable or a static local, gcc would generate some code before our asm block placing those values into some registers. For this application, that could be quite bad since we depend on no (possibly stack-manipulating) code appearing between the function label and our asm block (more on this in the next paragraph). However, since arguments are passed by way of register, gcc will simply give us the registers by which they are passed in to the function. Since pointers are usually 16 bits on an 8-bit AVR (larger ones will have 3 bytes maybe...but I'm really not sure about this), it fits into two registers. We reference these in the inline assembly by way of "%A[_next_task_]" and "%B[_next_task_]" (note the A and B...these denote the LSB and MSB registers).

Storing the context is pretty straightforward: push all of the registers and push the status register. At this point you may ask, "What about the program counter? Didn't we have to push that earlier during kos_new_task?" When the function was called (using the CALL instruction), the return address was pushed onto the stack as a side-effect of that instruction. So, we don't need to push the program counter because it is already on there. This is also why it would be very bad if some code appeared before our asm block. It is likely that gcc would clear out some space on the stack and so we would end up with some junk between the return address on the stack and our first "push" instruction. This would mess up the task context frame, and we will see later in the code that this would prevent this function from dispatching the task correctly when it comes time for the task to be resumed.

Updating the stack pointer is slightly more tricky. Interrupts are disabled first because it would really suck if we got interrupted during this part (any time the stack pointer is manipulated is a critical section). We then get to dereference the kos_current_task variable which contains our current task. If we remember from above, the very first thing in the KOS_Task structure is the stack pointer, so if we dereference kos_current_task, we are left with the address at which to store the stack pointer. From there, it's as simple as loading the stack pointer into some registers and saving it through Indirect Register X (set by registers 26 and 27).

I should note here something about clearing the interrupt flag. Normally, we would want to check to see if interrupts were enabled beforehand so that we can know if we need to restore them. This code lacks an explicit check because the status register (with interrupts possibly enabled) has already been stored. Later, when the current task is restored, the SREG will be restored and thus interrupts will be turned back on if they need to be. Similarly, if the next task has interrupts enabled, they will be turned on in the same fashion.

After updating kos_current_task's stack pointer, we get to move the stack to the next task and set kos_current_task to point to the next task. This is essentially the reverse of the previous operation. Instead of writing to Indirect Register X (which points to the stack pointer of the task), we get to read from it. We also slip in a couple instructions to update the kos_current_task pointer so that it points to the next task. After we have changed the SPL and SPH registers to point to our new stack, the task passed into kos_dispatch is ready to be resumed.
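
The "stack pointer is the first member" trick can be modelled on a PC. The sketch below is an illustration only: `DemoTask`, `demo_current_task` and `demo_switch_stack` are hypothetical names, and the only thing taken from the text is that the saved stack pointer sits at offset 0 of the task structure, so the task pointer itself is the address to store to and load from.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task record. Only the "stack pointer first" rule comes from
 * the article; the other member is filler. */
typedef struct DemoTask {
    uint16_t sp;        /* saved stack pointer, must stay the first member */
    uint8_t status;
} DemoTask;

DemoTask *demo_current_task = NULL;

/* Mirrors the middle of the dispatcher: save the live stack pointer into the
 * current task (if there is one), make `next` current, and return the stack
 * pointer that should now be loaded. */
uint16_t demo_switch_stack(DemoTask *next, uint16_t live_sp)
{
    if (demo_current_task != NULL) {
        /* A pointer to the struct is also a pointer to its first member,
         * which is why the assembly can store through X directly. */
        uint16_t *slot = (uint16_t *)demo_current_task;
        *slot = live_sp;
    }
    demo_current_task = next;
    return *(uint16_t *)next;
}
```

The null check corresponds to the `breq 1f` in the assembly: on the very first dispatch there is no current task whose stack pointer needs saving.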

Resuming the next task's context is a little less straightforward than saving it. We need to prevent interrupts from occurring while we restore the context. The reason for this is to ensure that we don't end up storing more than one context on that task's stack (and thereby increase the minimum required stack size to prevent a stack overflow). The problem here is that when we restore the status register, interrupts could be enabled at that point, rather than at the end when the context is done being restored. So, we need to restore in three steps: Restore the status register without the interrupt flag, restore all other registers, and then restore the interrupt flag. This is done by transferring the interrupt flag in the status register into the T (transfer) bit in the status register (that's the "bst" and "bld" instructions), clearing the interrupt flag, and then later executing either the ret or reti instruction based on this flag. The side effect is that we trash the T bit. I am not sure I can actually do this. This is one part that is tricky: The avr-gcc manual states that the T flag is a scratchpad, just like r0, and doesn't need to be restored by called functions. My logic here is that since the only way for a task to become blocked is either it being executed initially or from a call to kos_dispatch, gcc sees the dispatch call as a normal function call and will not assume that the T flag will remain unchanged.

After dancing around with bits and restoring the modified SREG, we proceed to pop off the rest of the registers in the reverse order that they were stored at the beginning of the function. At the very end, we use a T flag branch instruction to determine which return instruction to use. "ret" will return normally without setting the interrupt flag and "reti" will set the interrupt flag.
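
The bit shuffling is easier to follow as ordinary C. This is a host-side sketch of what the bst/bld/andi sequence computes, not code from the project; the `demo_` names are made up, and the bit positions (I is bit 7 and T is bit 6 of SREG) are the standard AVR ones.

```c
#include <stdint.h>

#define DEMO_SREG_I 7   /* global interrupt enable bit of SREG */
#define DEMO_SREG_T 6   /* T (transfer) bit of SREG */

/* Computes the value written to SREG before the registers are popped:
 * the saved I bit is copied into T, then I is cleared. */
uint8_t demo_defer_interrupt_flag(uint8_t saved_sreg)
{
    uint8_t i_bit = (uint8_t)((saved_sreg >> DEMO_SREG_I) & 1u);  /* bst r31, I */
    uint8_t out = (uint8_t)(saved_sreg & ~(1u << DEMO_SREG_T));
    out |= (uint8_t)(i_bit << DEMO_SREG_T);                       /* bld r31, T */
    out &= (uint8_t)~(1u << DEMO_SREG_I);                         /* andi r31, ~I */
    return out;
}

/* Models the final branch: returns 1 when the dispatcher would leave through
 * reti (re-enabling interrupts) and 0 when it would use a plain ret. */
int demo_returns_with_reti(uint8_t written_sreg)
{
    return (int)((written_sreg >> DEMO_SREG_T) & 1u);
}
```

For example, a saved status of 0x83 (I set, two low flags set) becomes 0x43: interrupts stay off during the pops, and the set T bit selects `reti` at the end. Whatever the task previously had in T is lost, which is the trashing mentioned above.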

Implementation: Results by code size

So, at this point we have implemented a task scheduler and dispatcher. Here is how it weighs in with avr-size when compiled for an ATMega48A running just the idle task:

avr-size -C --mcu=atmega48a bin/kos.elf
AVR Memory Usage

Device: atmega48a

Program:     474 bytes (11.6% Full)
(.text + .data + .bootloader)

Data:        105 bytes (20.5% Full)
(.data + .bss + .noinit)

Not the best, but it's reasonable. The data usage could be brought down by reducing the maximum number of tasks. There are other RTOSes available for AVR which can compile smaller. We could do several optimizations, which I will discuss in the conclusion.

Example: A semaphore

So, we now have a task scheduler. The thing is, although capable of running multiple tasks, it is not possible for multiple tasks to actually run. Why? Because kos_dispatch is never called! We need something that causes the task to become blocked.

As a demonstration, I'm going to implement a simple semaphore. I won't go into huge detail since that isn't the point of this article (and it has been long enough), but here is the code:

Header contents:

typedef struct {
    int8_t value;
} KOS_Semaphore;

/**
 * Initializes a new semaphore
 */
KOS_Semaphore *kos_semaphore_init(int8_t value);

/**
 * Posts to a semaphore
 */
void kos_semaphore_post(KOS_Semaphore *sem);

/**
 * Pends from a semaphore
 */
void kos_semaphore_pend(KOS_Semaphore *sem);

Source contents:

static KOS_Semaphore semaphores[KOS_MAX_SEMAPHORES + 1];
static uint8_t next_semaphore = 0;

KOS_Semaphore *kos_semaphore_init(int8_t value)
{
    KOS_Semaphore *s = &semaphores[next_semaphore++];
    s->value = value;
    return s;
}

void kos_semaphore_post(KOS_Semaphore *semaphore)
{
    {
        KOS_Task *task;
        semaphore->value++;

        //allow one task to be resumed which is waiting on this semaphore
        task = task_head;
        while (task)
        {
            if (task->status == TASK_SEMAPHORE && task->status_pointer == semaphore)
                break; //this is the task to be restored
            task = task->next;
        }

        if (task) //no task may be waiting, so guard against a null pointer
            task->status = TASK_READY;
        kos_schedule();
    }
}

void kos_semaphore_pend(KOS_Semaphore *semaphore)
{
    {
        int8_t val = semaphore->value--; //val is value before decrement

        if (val <= 0)
        {
            //we need to wait on the semaphore
            kos_current_task->status_pointer = semaphore;
            kos_current_task->status = TASK_SEMAPHORE;

            kos_schedule();
        }
    }
}

So, our semaphore will cause a task to become blocked when kos_semaphore_pend is called (and the semaphore value was <= 0) and when kos_semaphore_post is called, the highest priority task that is blocked on the particular semaphore will be made ready.

Just so this makes sense, let's go through an example sequence of events:

  1. Task A is created. There are now two tasks on the task list: Task A and the idle task.
  2. Semaphore is initialized to 1 with kos_semaphore_init(1);
  3. Task A calls kos_semaphore_pend on the semaphore. The value is decremented, but it was >0 before the decrement, so the pend immediately returns.
  4. Task A calls kos_semaphore_pend again. This time, the kos_current_task (which points to Task A) state is set to blocked and the blocking data points to the semaphore. The scheduler is called and since Task A is now blocked, the idle task will be dispatched by kos_dispatch.
  5. The idle task runs and runs.
  6. Eventually, some interrupt could occur (like a timer or something). During the course of the ISR, kos_semaphore_post is called on the semaphore. Every call to kos_semaphore_post allows exactly one task to be resumed, so it goes through the list looking for the highest priority task which is blocked on the semaphore. Task A is resumed at the point immediately after the call to kos_dispatch in kos_schedule. kos_schedule returns after a couple instructions restoring the interrupt flag state and now Task A will run until it is blocked.
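
The counting in steps 2 through 6 can be traced with a tiny host-side model. This is a sketch under simplifying assumptions: there are no tasks and no scheduler here, the `Demo`/`demo_` names are hypothetical, and "blocking" is reduced to a return value so the value-before-decrement rule can be seen on its own.

```c
#include <stdint.h>

typedef struct {
    int8_t value;
} DemoSemaphore;

/* Returns 1 if the caller would block, 0 if the pend returns immediately.
 * As in the semaphore code, the decision uses the value before decrement. */
int demo_pend(DemoSemaphore *sem)
{
    int8_t before = sem->value--;
    return before <= 0;
}

/* A post just increments; in the real kernel it would also make one
 * waiting task ready and call the scheduler. */
void demo_post(DemoSemaphore *sem)
{
    sem->value++;
}
```

Starting from a value of 1, the first `demo_pend` returns 0 and leaves the value at 0, the second returns 1 and leaves it at -1, and a single `demo_post` brings it back to 0. A negative value is therefore a count of how many pends are still waiting.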

Here's a program that does just this:

/**
 * Main file for OS demo
 */

#include "kos.h"

#include <avr/io.h>
#include <avr/interrupt.h>

#include "avr_mcu_section.h" //these two lines are for simavr
AVR_MCU(F_CPU, "atmega48");

static KOS_Semaphore *sem;

static uint8_t val;

static uint8_t st[128];
void the_task(void)
{
    TCCR0B |= (1 << CS00);
    TIMSK0 |= (1 << TOIE0);
    while (1)
    {
        kos_semaphore_pend(sem);
        TCCR0B = 0;

        val++;
    }
}

int main(void)
{
    kos_init();

    sem = kos_semaphore_init(0);

    kos_new_task(&the_task, &st[127]);

    kos_run();

    return 0;
}

ISR(TIMER0_OVF_vect)
{
    kos_isr_enter();
    kos_semaphore_post(sem);
    kos_isr_exit();
}

Running this with avr-gdb and simavr we can see this in action. I placed breakpoints at the val++ line and the kos_semaphore_post line. Here's the output with me pressing Ctrl-C at the end once it got into and stayed in the infinite loop in the idle task:

(gdb) break main.c:27
Breakpoint 1 at 0x35a: file src/main.c, line 27.
(gdb) break main.c:47
Breakpoint 2 at 0x38a: file src/main.c, line 47.
(gdb) continue

Note: automatically using hardware breakpoints for read-only addresses.

Breakpoint 2, __vector_16 () at src/main.c:47
47       kos_semaphore_post(sem);
(gdb) continue

Breakpoint 2, __vector_16 () at src/main.c:47
47       kos_semaphore_post(sem);
(gdb) continue

Breakpoint 2, __vector_16 () at src/main.c:47
47       kos_semaphore_post(sem);
(gdb) continue

Breakpoint 1, the_task () at src/main.c:27
27           val++;
(gdb) continue

Breakpoint 1, the_task () at src/main.c:27
27           val++;
(gdb) continue

Breakpoint 1, the_task () at src/main.c:27
27           val++;
(gdb) continue

Program received signal SIGTRAP, Trace/breakpoint trap.
kos_idle_task () at src/kos.c:27
27   {

You may have noticed that the interrupt was called three times before we even got to val++. The reason for this is that timer0 is an 8-bit timer and I used no prescaler for its clock, so the overflow interrupt will happen every 256 cycles. Given that the dispatcher is nearly 100 instructions and the scheduler isn't exactly short either, the interrupt could easily be called three times before it manages to resume the task after it blocks (including the time it takes to block it).
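
The arithmetic behind that can be written down as a quick sketch. An 8-bit timer counting from 0 to 255 overflows once every 256 timer ticks, and each tick is one CPU cycle when no prescaler is used. The function names here are hypothetical, and the 800-cycle figure in the usage note is an illustrative guess, not a measurement of this scheduler.

```c
#include <stdint.h>

/* CPU cycles between overflow interrupts for an 8-bit timer (0..255). */
uint32_t demo_overflow_period(uint32_t prescaler)
{
    return prescaler * 256u;
}

/* Number of overflow interrupts that fire within `cycles` CPU cycles. */
uint32_t demo_overflows_in(uint32_t cycles, uint32_t prescaler)
{
    return cycles / demo_overflow_period(prescaler);
}
```

If blocking the task, scheduling, and dispatching took roughly 800 cycles in total, `demo_overflows_in(800, 1)` gives 3 overflows, which would match what the debugger shows. With a prescaler of 8 the period grows to 2048 cycles and the same work would fit between interrupts.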

A word on debugging

Before I finish up I want to mention a few things about debugging with avr-gdb. This project was the first time I had ever needed to use a simulator and debugger to even get the program to run. It would have been impossible to write this using an actual device since very little is revealed when operating the device. Here are a few things I learned:

  • avr-gdb is not perfect. For example, it is confused by the huge number of push statements at the beginning of kos_dispatch and will crash if stepped into that function (if it receives a break inside kos_dispatch that seems to work sometimes). This is due to avr-gdb attempting to decode the stack and finding that the frame size of the function is too big. It's weird and I didn't quite understand why that limitation was there, so I didn't really muck around with it. This made debugging the dispatcher super difficult.
  • Stack bugs are hard to find. I would recommend placing a watch on the top of your stack (the place where the variable actually points) and then setting that value to something unlikely like 0xAA. If you see this value modified, you know that there is a problem since you are about to exceed your stack size. I spent hours staring at a problem with that semaphore example above before I realized that the idle task stack had encroached on the semaphore variables. Even then, I was looking at something totally different and just noticed that the stack pointer was too small. As it turns out, my original stack size of 48 was too small. The dispatcher will always require at least 35 free bytes on the stack and any ISR that calls a function will require at least 17 bytes due to the way that functions are called in avr-gcc. 35 + 17 = 52, which is greater than 48. Not good.
  • Simavr is pretty good. It supports compiling a program that embeds simavr which can be used to emulate the hardware around the microcontroller rather than just the microcontroller itself. I didn't use this functionality for this project, but that is a seriously cool thing.
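
A close cousin of the 0xAA watch is "stack painting": fill the whole stack array with the marker value before the task starts and later count how much of it is still untouched. The sketch below is a host-side illustration of that variation with hypothetical helper names; it is not part of the KOS code, and it assumes a downward-growing stack as on AVR, so the lowest addresses are the last to be used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_STACK_FILL 0xAA   /* unlikely marker value, as suggested above */

/* Fills the whole stack area with the marker before the task is started. */
void demo_stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, DEMO_STACK_FILL, size);
}

/* Counts untouched marker bytes from the low end of the array. Because the
 * stack grows downward from the top of the array, this is the headroom that
 * was never used. Zero means the stack has (at least) filled completely. */
size_t demo_stack_unused(const uint8_t *stack, size_t size)
{
    size_t n = 0;
    while (n < size && stack[n] == DEMO_STACK_FILL)
        n++;
    return n;
}
```

With a 64-byte stack of which the top 52 bytes have been overwritten, `demo_stack_unused` reports 12 bytes of headroom; with a 48-byte stack and the same usage it would report 0, which is the situation described in the second bullet.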


Conclusion

This has been a long post, but it is a complicated topic. Writing something like this is actually considered writing an operating system (albeit just the kernel portion and a small one at that) and the debugging alone for just this post took me a while. One must have a good knowledge of how exactly the processor works. I found my knowledge lacking, actually, and I learned a lot about how the AVR works. The other thing is that things like concurrency and interrupts must be considered from the very beginning. They can't be an afterthought.

The scheduler and dispatcher I have described here are not perfect, nor are they the most efficient design. For one thing, my design uses a huge amount of RAM compared to other RTOS options. My scheduler and dispatcher are also inefficient, with the scheduler having O(N) complexity in the number of tasks. My structure does, however, allow for O(1) time when suspending a task (although I question the utility of this; it worked better with the 8086 scheduler I made for class than with the AVR). Another problem is that kos_dispatch will not work with avr-gdb if the program is stopped during this function (it has a hard time decoding the function prologue because of the large number of push instructions). I haven't found a solution to this problem and it certainly made debugging a little more difficult.

So, now that I've told you some of what's wrong with the above, here are two RTOS which can be used with the AVR and are well tested:

  • FemtoOS. This is an extremely tiny and highly configurable RTOS. The bare implementation needs only 270 bytes of flash and 10 bytes of RAM. Ridiculous! My only serious issue with it is that it is GPLv3 licensed and due to how the application is compiled, licensing can be troublesome unless you want to also be GPLv3.
  • FreeRTOS. Very popular RTOS that has all sorts of support for many processors (ARM, PPC, you name it). I've never used it myself, but it also seems to have networking support and stuff like that. The site says that it's "market leading."

Anyway, I hope that this article is useful and as usual, any suggestions and such can be left in the comments. As mentioned before, the code for this article can be found on github here:

Teensy 3.1 bare metal: Writing a USB driver

One of the things that has intrigued me for the past couple years is making embedded USB devices. It's an industry standard bus that just about any piece of computing hardware can connect with yet is complex enough that doing it yourself is a bit of a chore.

Traditionally I have used the work of others, mainly the V-USB driver for AVR, to get my devices connected. Lately I have been messing around more with the ARM processor on a Teensy 3.1 which has an integrated USB module. The last microcontrollers I used that had these were the PIC18F4550s that I used in my dot matrix project. Even with those, I used microchip's library and drivers.

Over the thanksgiving break I started cobbling together some software with the intent of writing a driver for the USB module in the Teensy myself. I started originally with my bare metal stuff, but I ended up going with something closer to Karl Lunt's solution. I configured code::blocks to use the arm-none-eabi compiler that I had installed and created a code blocks project for my code and used that to build it (with a post-compile event translating the generated elf file into a hex file).

This is a work in progress and the git repository will be updated as things progress since it's not a dedicated demonstration of the USB driver.

The github repository here will eventually be turned into a really, really rudimentary 500-800ksps oscilloscope.

The code:

The code for this post was taken from the following commit:

At the end of this post, I will have outlined all of the pieces needed to have a simple USB device setup that responds with a descriptor on endpoint 0.

USB Basics

I will actually not be talking about these here as I am most definitely no expert. However, I will point to the page that I found most helpful when writing this:

This site explained very clearly exactly what was going on with USB. Coupled with my previous knowledge, it was almost all I needed in terms of getting the protocol.

The Freescale K20 Family and their USB module

The one thing that I don't like about all of these great microcontrollers that come out with USB support is that all of them have their very own special USB module which doesn't work like anyone else's. Sure, there are similarities, but no two are exactly alike. Since I have a Teensy and the K20 family of microcontrollers seems to be relatively popular, I don't feel bad about writing such specific software.

There are two documents I found to be essential to writing this driver:

  1. The family manual. Getting a correct version for the MK20DX256VLH7 (the processor on the Teensy) can be a pain. PJRC comes to the rescue here: (note, the Teensies based on the MK20DX128VLH5 use a different manual)
  2. The Kinetis Peripheral Module Quick Reference: This specifies the initialization sequence and other things that will be needed for the module.

There are a few essential parts to understand about the USB module:

  • It needs a specific memory layout. Since it doesn't have any dedicated user-accessible memory, it requires that the user specify where things should be. There are specific valid locations for its Buffer Descriptor Table (more on that later) and the endpoint buffers. The last one bit me for several days until I figured it out.
  • It has several different clock inputs and all of them must be enabled. Identifying the different signals is the most difficult part. After that, it's not hard.
  • The module only handles the electrical aspect of things. It doesn't handle sending descriptors or anything like that. The only real things it handles are the signaling levels, responding to USB packets in a valid manner, and routing data into buffers by endpoint. Other than that, it's all user software.
  • The module can act as both a host (USB On-the-go (OTG)) and a device. We will be exclusively focusing on using it as a device here.

In writing this, I must confess that I looked quite a lot at the Teensyduino code along with the V-USB driver code (even though V-USB is for AVR and is pure software). Without these "references", this would have been a very difficult project. Much of the structure found in the last two parts of this document reflects the Teensyduino USB driver since they did it quite efficiently and I didn't spend a lot of time coming up with a "better" way to do it, given the scope of this project. I will likely make more changes as I customize it for my end use-case.

Part 1: The clocks

The K20 family of microcontrollers utilizes a miraculous hardware module which they call the "Multipurpose Clock Generator" (hereafter called the MCG). This is a module which basically allows the microcontroller to take any clock input between a few kilohertz and several megahertz and transform it into a higher frequency clock source that the microcontroller can actually use. This is how the Teensy can have a rated speed of 96MHz but only use a 16MHz crystal. The configuration that this project uses is the Phase Locked Loop (PLL) from the high speed crystal source. The exact setup of this configuration is done by the sysinit code.

The PLL operates by using a divider-multiplier setup where we give it a divisor to divide the input clock frequency by and then a multiplier to multiply that result by to give us the final clock speed. After that, it heads into the System Integration Module (SIM) which distributes the clock. Since the Teensy uses a 16MHz crystal and we need a 96MHz system clock (the reason will become apparent shortly), we set our divisor to 4 and our multiplier to 24 (see common.h). If the other type of Teensy 3 is being used (the one with the MK20DX128VLH5), the divisor would be 8 and the multiplier 36 to give us 72MHz.
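
As a quick check of those numbers, here is the divide-then-multiply arithmetic as a host-side sketch. The function name is hypothetical and this is plain arithmetic, not register-level MCG configuration.

```c
#include <stdint.h>

/* Output frequency of a divide-then-multiply PLL stage, in Hz. */
uint32_t demo_pll_output_hz(uint32_t crystal_hz, uint32_t divisor,
                            uint32_t multiplier)
{
    return (crystal_hz / divisor) * multiplier;
}
```

With a 16MHz crystal, a divisor of 4 and a multiplier of 24 give 4MHz * 24 = 96MHz, while a divisor of 8 and a multiplier of 36 give 2MHz * 36 = 72MHz.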

Every module on a K20 microcontroller has a gate on its clock. This saves power since there are many modules on the microcontroller that are not being used in any given application. Distributing the clock to each of these is expensive in terms of power and would be wasted if that module wasn't used. The SIM handles this gating in the SIM_SCGC* registers. Before using any module, its clock gate must be enabled. If this is not done, the microcontroller will "crash" and stop executing when it tries to talk to the module registers (I think a handler for this can be specified, but I'm not sure). I had this happen once or twice while messing with this. So, the first step is to "turn on" the USB module by setting the appropriate bit in SIM_SCGC4 (per the family manual mentioned above, page 252):


Now, the USB module is a bit different than the other modules. In addition to the module clock it needs a reference clock for USB. The USB module requires that this reference clock be at 48MHz. There are two sources for this clock: an internal source generated by the MCG/SIM or an external source from a pin. We will use the internal source:

SIM_SOPT2 |= SIM_SOPT2_USBSRC_MASK | SIM_SOPT2_PLLFLLSEL_MASK;
SIM_CLKDIV2 = SIM_CLKDIV2_USBDIV(1);

The first line here selects that the USB reference clock will come from an internal source. It also specifies that the internal source will be using the output from the PLL in the MCG (the other option is the FLL (frequency lock loop), which we are not using). The second line sets the divider needed to give us 48MHz from the PLL clock. Once again there are two values: The divider and the multiplier. The multiplier can only be 1 or 2 and the divider can be anywhere from 1 to 8. Since we have a 96MHz clock, we simply divide by 2 (the value passed is a 1 since 0 = "divide by 1", 1 = "divide by 2", etc). If we were using the 72MHz clock, we would first multiply by 2 before dividing by 3.
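
The two cases can be checked with a short host-side sketch. The function name is hypothetical; the arguments are the raw field values as described above, where a field value of 0 means "multiply by 1" or "divide by 1", so the effective factor is always the field value plus one.

```c
#include <stdint.h>

/* USB reference clock in Hz, given the PLL clock and the raw fractional
 * divider field values (multiplier = usbfrac + 1, divider = usbdiv + 1). */
uint32_t demo_usb_clock_hz(uint32_t pll_hz, uint32_t usbfrac, uint32_t usbdiv)
{
    /* 64-bit intermediate so the multiplication cannot overflow */
    return (uint32_t)(((uint64_t)pll_hz * (usbfrac + 1u)) / (usbdiv + 1u));
}
```

For the 96MHz clock, `usbfrac = 0` and `usbdiv = 1` give 96MHz * 1 / 2 = 48MHz. For the 72MHz clock, `usbfrac = 1` and `usbdiv = 2` give 72MHz * 2 / 3 = 48MHz.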

With that, the clock to the USB module has been activated and the module can now be initialized.

Part 2: The startup sequence

The Peripheral Module Quick Reference guide mentioned earlier contains a flowchart which outlines the exact sequence needed to initialize the USB module to act as a device. I don't know if I can copy it here (yay copyright!), but it can be found on page 134, figure 15-6. There is another flowchart specifying the initialization sequence for using the module as a host.

Our startup sequence goes as follows:

//1: Select clock source
SIM_SOPT2 |= SIM_SOPT2_USBSRC_MASK | SIM_SOPT2_PLLFLLSEL_MASK; //we use MCGPLLCLK divided by USB fractional divider
SIM_CLKDIV2 = SIM_CLKDIV2_USBDIV(1); //(USBFRAC + 1)/(USBDIV + 1) = (0 + 1)/(1 + 1) = 1/2 for 96MHz clock

//2: Gate USB clock

//3: Software USB module reset

//4: Set BDT base registers
USB0_BDTPAGE1 = ((uint32_t)table) >> 8;  //bits 15-9
USB0_BDTPAGE2 = ((uint32_t)table) >> 16; //bits 23-16
USB0_BDTPAGE3 = ((uint32_t)table) >> 24; //bits 31-24

//5: Clear all ISR flags and enable weak pull downs
USB0_ISTAT = 0xFF;
USB0_USBTRC0 |= 0x40; //a hint was given that this is an undocumented interrupt bit

//6: Enable USB reset interrupt

//7: Enable pull-up resistor on D+ (Full speed, 12Mbit/s)

The first two steps were covered in the last section. The next one is relatively straightforward: We ask the module to perform a "reset" on itself. This returns the module to its initial state, which allows us to configure it as needed. I don't know if the while loop is necessary since the manual says that the reset bit always reads low and it only says we must "wait two USB clock cycles". In any case, enough of a wait seems to be executed by the above code to allow it to reset properly.

The next section (4: Set BDT base registers) requires some explanation. Since the USB module doesn't have a dedicated memory block, we have to provide it. The BDT is the "Buffer Descriptor Table" and contains 16 * 4 entries that look like so:

typedef struct {
    uint32_t desc;
    void* addr;
} bdt_t;

"desc" is a descriptor for the buffer and "addr" is the address of the buffer. The exact bits of the "desc" are explained in the manual (p. 971, Table 41-4), but they basically specify ownership of the buffer (user program or USB module) and the USB token that generated the data in the buffer (if applicable).

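As a rough illustration of how the "desc" word can be put together and picked apart, here is a host-side sketch. The bit positions (OWN in bit 7, DATA0/1 in bit 6, DTS in bit 3, the token PID in bits 5-2 and the byte count in bits 25-16) are my reading of the table referenced above and should be treated as assumptions; the function names are hypothetical and are not necessarily how the BDT_DESC and BDT_PID macros used later are defined.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; verify against Table 41-4 in the manual */
#define DEMO_BDT_OWN   0x80u /* 1 = buffer currently belongs to the USB module */
#define DEMO_BDT_DATA1 0x40u /* 0 = DATA0, 1 = DATA1 */
#define DEMO_BDT_DTS   0x08u /* let the module check the data toggle */

/* Build a descriptor that hands a buffer of `count` bytes to the module */
static uint32_t demo_bdt_desc(uint32_t count, uint32_t data1)
{
    assert(count < 1024u); /* the byte count field is assumed to be 10 bits */
    return DEMO_BDT_OWN | DEMO_BDT_DTS | (data1 ? DEMO_BDT_DATA1 : 0u) | (count << 16);
}

/* After a token completes, the module writes the token PID into bits 5-2 */
static uint32_t demo_bdt_pid(uint32_t desc)
{
    return (desc >> 2) & 0x0Fu;
}

/* Number of bytes described by (or received into) the buffer */
static uint32_t demo_bdt_count(uint32_t desc)
{
    return (desc >> 16) & 0x3FFu;
}
```

For example, demo_bdt_desc(64, 0) describes a 64-byte buffer that the module owns and that expects DATA0.
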
Each entry in the BDT corresponds to one of 4 buffers in one of the 16 USB endpoints: The RX even, RX odd, TX even, and TX odd. The RX and TX are pretty self explanatory...the module needs somewhere to read the data it's going to send and somewhere to write the data it just received. The even and odd are a configuration that I have seen before in the PIC 18F4550 USB module: Ping-pong buffers. While one buffer is being sent/received by the module, the other can be in use by user code reading/writing (ping). When the user code is done with its buffers, it swaps buffers, giving the USB module control over the ones it was just using (pong). This allows seamless communication between the host and the device and minimizes the need for copying data between buffers. I have declared the BDT in my code as follows:

#define BDT_INDEX(endpoint, tx, odd) (((endpoint) << 2) | ((tx) << 1) | (odd))
__attribute__ ((section(".usbdescriptortable"), used))
static bdt_t table[(USB_N_ENDPOINTS + 1)*4]; //max endpoints is 15 + 1 control
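
To show how the index is laid out, here is the same calculation as a plain function that can be run on a PC. The values of RX, TX, EVEN and ODD are assumptions on my part (0 and 1 respectively), chosen so that they line up with the bit order in the BDT_INDEX macro.

```c
#include <assert.h>

/* Assumed values for the direction and buffer selectors */
enum { RX = 0, TX = 1 };
enum { EVEN = 0, ODD = 1 };

/* Same calculation as the BDT_INDEX macro, with range checking added */
static unsigned bdt_index(unsigned endpoint, unsigned tx, unsigned odd)
{
    assert(endpoint < 16 && tx < 2 && odd < 2);
    return (endpoint << 2) | (tx << 1) | odd;
}
```

Endpoint 0 therefore occupies entries 0 through 3 (RX even, RX odd, TX even, TX odd), endpoint 1 starts at entry 4, and the TX odd buffer of endpoint 15 is the last entry, number 63.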

One caveat of the BDT is that it must be aligned to a 512-byte boundary in memory. Our code above showed that only the upper 3 bytes of the 4-byte address of "table" are passed to the module. This is because the low bits of the address are basically the index along the table (the specification of this is found in section 41.4.3, page 970 of the manual). The #define directly above the declaration is a helper macro for referencing entries in the table for specific endpoints (this is used later in the interrupt). Now, accomplishing this boundary alignment requires some modification of the linker script. Before this, I had never had any need to modify a linker script. We basically need to create a special area of memory (in the above, it is called ".usbdescriptortable" and the attribute declaration tells the compiler to place that variable inside of it) which is aligned to a 512-byte boundary in RAM. I declared mine like so:

.usbdescriptortable (NOLOAD) : {
     . = ALIGN(512);
     *(.usbdescriptortable*)
} > sram

The position of this in the file is mildly important, so looking at the full linker script would probably be good. This particular declaration I more or less lifted from the Teensyduino linker script, with some changes to make it fit into my linker script.
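
For experimenting on a PC, the alignment requirement and the way the address is split across the three BDTPAGE registers can be sketched like this. Note that this is not what my driver does: it uses the GCC/Clang "aligned" attribute instead of the linker script section, and demo_table, is_bdt_aligned and the bdt_page functions are hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t desc;
    void *addr;
} bdt_t;

/* Assumption: compiler-provided alignment rather than a linker script section */
static bdt_t demo_table[16 * 4] __attribute__((aligned(512)));

/* A BDT base address is only usable if its low 9 bits are zero */
static int is_bdt_aligned(uintptr_t address)
{
    return (address & 0x1FFu) == 0;
}

/* Values that would be written to USB0_BDTPAGE1..3 for a 32-bit base address */
static uint8_t bdt_page1(uint32_t address)
{
    assert(is_bdt_aligned(address));
    return (uint8_t)(address >> 8); /* bits 15-9 (bit 8 is zero when aligned) */
}

static uint8_t bdt_page2(uint32_t address)
{
    return (uint8_t)(address >> 16); /* bits 23-16 */
}

static uint8_t bdt_page3(uint32_t address)
{
    return (uint8_t)(address >> 24); /* bits 31-24 */
}
```

A table placed at 0x1FFFE200, for instance, would be split into 0xE2, 0xFF and 0x1F, while 0x1FFFE100 would be rejected because it only sits on a 256-byte boundary.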

Steps 5-6 set up the interrupts. There is only one USB interrupt, but there are two registers of flags. We first reset all of the flags. Interestingly, to reset a flag we write back a '1' to the particular flag bit. This has the effect of being able to set a flag register to itself to reset all of the flags since a flag bit is '1' when it is triggered. After resetting the flags, we enable the interrupt in the NVIC (Nested Vectored Interrupt Controller). I won't discuss the NVIC much, but it is a fairly complex piece of hardware. It has support for lots and lots of interrupts (over 100) and separate priorities for each one. I don't have reliable code for setting interrupt priorities yet, but eventually I'll get around to messing with that. The "enable_irq()" call is a function that is provided in arm_cm4.c and all that it does is enable the interrupt specified by the passed vector number. These numbers are specified in the datasheet, but we have a #define specified in the mk20d7 header file (warning! 12000 lines ahead) which gives us the number.
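
The "write a 1 to clear" behavior is easy to get backwards, so here is a tiny software model of it. This only imitates how I understand the flag registers to behave; w1c_write and clear_all_pending are hypothetical names and nothing here touches real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a write-1-to-clear register: every bit written as 1 clears the
 * corresponding flag, every bit written as 0 leaves the flag alone. */
static uint8_t w1c_write(uint8_t current_flags, uint8_t written)
{
    return (uint8_t)(current_flags & (uint8_t)~written);
}

/* Writing the register back to itself clears exactly the flags that were set */
static uint8_t clear_all_pending(uint8_t current_flags)
{
    uint8_t after = w1c_write(current_flags, current_flags);
    assert(after == 0);
    return after;
}
```

This is also why writing 0xFF to the flag register during initialization is safe: bits that were not set are simply unaffected.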

The very last step in initialization is to set the internal pullup on D+. According to the USB specification, a pullup on D- specifies a low speed device (1.5Mbit/s) and a pullup on D+ specifies a full speed device (12Mbit/s). We want to use the higher speed grade. The Kinetis USB module does not support high speed (480Mbit/s) mode.

Part 3: The interrupt handler state machine

The USB protocol can be interpreted in the context of a state machine with each call to the interrupt being a "tick" in the machine. The interrupt handler must process all of the flags to determine what happened and where to go from there.

#define ENDP0_SIZE 64

/**
 * Endpoint 0 receive buffers (2x64 bytes)
 */
static uint8_t endp0_rx[2][ENDP0_SIZE];

//flags for endpoint 0 transmit buffers
static uint8_t endp0_odd, endp0_data = 0;

/**
 * Handler functions for when a token completes
 * TODO: Determine if this structure really will work for all kinds of handlers
 *
 * I hope this looks like a dynamic jump table to the compiler
 */
static void (*handlers[USB_N_ENDPOINTS + 2]) (uint8_t);

void USBOTG_IRQHandler(void)
{
    uint8_t status;
    uint8_t stat, endpoint;

    status = USB0_ISTAT;

    if (status & USB_ISTAT_USBRST_MASK)
    {
        //handle USB reset

        //initialize endpoint 0 ping-pong buffers
        endp0_odd = 0;
        table[BDT_INDEX(0, RX, EVEN)].desc = BDT_DESC(ENDP0_SIZE, 0);
        table[BDT_INDEX(0, RX, EVEN)].addr = endp0_rx[0];
        table[BDT_INDEX(0, RX, ODD)].desc = BDT_DESC(ENDP0_SIZE, 0);
        table[BDT_INDEX(0, RX, ODD)].addr = endp0_rx[1];
        table[BDT_INDEX(0, TX, EVEN)].desc = 0;
        table[BDT_INDEX(0, TX, ODD)].desc = 0;

        //initialize endpoint0 to 0x0d (41.5.23)
        //transmit, receive, and handshake

        //clear all interrupts...this is a reset
        USB0_ERRSTAT = 0xff;
        USB0_ISTAT = 0xff;

        //after reset, we are address 0, per USB spec
        USB0_ADDR = 0;

        //all necessary interrupts are now active
        USB0_ERREN = 0xFF;

        return;
    }
    if (status & USB_ISTAT_ERROR_MASK)
    {
        //handle error
    }
    if (status & USB_ISTAT_SOFTOK_MASK)
    {
        //handle start of frame token
    }
    if (status & USB_ISTAT_TOKDNE_MASK)
    {
        //handle completion of current token being processed
        stat = USB0_STAT;
        endpoint = stat >> 4;
        handlers[endpoint](stat);
    }
    if (status & USB_ISTAT_SLEEP_MASK)
    {
        //handle USB sleep
    }
    if (status & USB_ISTAT_STALL_MASK)
    {
        //handle usb stall
    }
}

The above code will be executed whenever the IRQ for the USB module fires. This function is set up in the crt0.S file, but with a weak reference, allowing us to override it easily by simply defining a function called USBOTG_IRQHandler. We then proceed to handle all of the USB interrupt flags. If we don't handle all of the flags, the interrupt will execute again, giving us the opportunity to fully process all of them.

Reading through the code, it should be obvious that I have not done much with many of the flags, including USB sleep, errors, and stall. For the purposes of this super simple driver, we really only care about USB resets and USB token decoding.

The very first interrupt that we care about, which fires when we connect the USB device to a host, is the Reset. The host performs this by bringing both data lines low for a certain period of time (read the USB basics stuff for more information). When this happens, we need to reset our USB state into its initial and ready state. We do a couple things in sequence:

  1. Initialize the buffers for endpoint 0. We set the RX buffers to point to some static variables we have defined which are simply uint8_t arrays of length "ENDP0_SIZE". The TX buffers are reset to null since nothing is going to be transmitted. One thing to note is that the ODDRST bit is flipped on in the USB0_CTL register. This is very important since it "synchronizes" the USB module with our code in terms of knowing whether the even or odd buffer should be used next for transmitting. When we do ODDRST, it sets the next buffer to be used to be the even buffer. We have a "user-space" flag (endp0_odd) which we reset at the same time so that we stay in sync with the buffer that the USB module is going to use.
  2. We enable endpoint 0. Specifically, we say that it can transmit, receive, and handshake. Enabled endpoints always handshake, but endpoints can either send, receive, or both. Endpoint 0 is specified as a reading and writing endpoint in the USB specification. All of the other endpoints are device-specific.
  3. We clear all of the interrupts. If this is a reset we obviously won't be doing much else.
  4. Set our USB address to 0. Each device on the USB bus gets an address between 0 and 127. Address 0 is reserved for devices that haven't been assigned an address yet (i.e. have been reset), so that becomes our address. We will receive an address later via a command sent to endpoint 0.
  5. Activate all necessary interrupts. In the previous part where we discussed the initialization sequence we only enabled the reset interrupt. After being reset, we get to enable all of the interrupts that we will need to be able to process USB events.

After a reset the USB module will begin decoding tokens. While there are a couple different types of tokens, the USB module has a single interrupt for all of them. When a token is decoded the module gives us information about what endpoint the token was for and what BDT entry should be used. This information is contained in the USB0_STAT register.

The exact method for processing these tokens is up to the individual developer. My choice for the moment was to make a dynamic jump table of sorts which stores 16 function pointers which will be called in order to process the tokens. Initially, these pointers point to dummy functions that do nothing. The code for the endpoint 0 handler will be discussed in the next section.

Our code here uses USB0_STAT to determine which endpoint the token was decoded for, finds the appropriate function pointer, and calls it with the value of USB0_STAT.
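
The decoding and dispatch described above can be tried out on a PC with the following sketch. The bit positions (endpoint in bits 7-4, TX in bit 3, ODD in bit 2) are my reading of the USB0_STAT layout and agree with the "stat >> 4" used in the interrupt handler; demo_handlers, recording_handler and last_endpoint_seen are hypothetical names, and the handler records the endpoint only so that the dispatch can be observed (the real dummy handlers do nothing).

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_N_ENDPOINTS 16

typedef void (*token_handler_t)(uint8_t stat);

/* Field extraction, assuming ENDP = bits 7-4, TX = bit 3, ODD = bit 2 */
static unsigned stat_endpoint(uint8_t stat) { return stat >> 4; }
static unsigned stat_tx(uint8_t stat)       { return (stat >> 3) & 1u; }
static unsigned stat_odd(uint8_t stat)      { return (stat >> 2) & 1u; }

/* For demonstration only: remembers which endpoint was dispatched last */
static unsigned last_endpoint_seen = 0xFF;

static void recording_handler(uint8_t stat)
{
    last_endpoint_seen = stat_endpoint(stat);
}

static token_handler_t demo_handlers[DEMO_N_ENDPOINTS];

/* Point every slot at a handler so that no slot is ever a null pointer */
static void demo_handlers_init(void)
{
    for (unsigned i = 0; i < DEMO_N_ENDPOINTS; i++)
        demo_handlers[i] = recording_handler;
}

/* Look up the handler for the endpoint encoded in stat and call it */
static void dispatch_token(uint8_t stat)
{
    unsigned endpoint = stat_endpoint(stat);
    assert(endpoint < DEMO_N_ENDPOINTS && demo_handlers[endpoint] != 0);
    demo_handlers[endpoint](stat);
}
```

A stat value of 0x2C, for example, decodes to endpoint 2, TX, odd buffer.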

Part 4: Token processing & descriptors

This is one part of the driver that isn't something that must be done a certain way, but however it is done, it must accomplish the task correctly. My super-simple driver processes this in two stages: Processing the token type and processing the token itself.

As mentioned in the previous section, I had a handler for each endpoint that would be called after a token was decoded. The handler for endpoint 0 is as follows:

#define PID_OUT   0x1
#define PID_IN    0x9
#define PID_SOF   0x5
#define PID_SETUP 0xd

typedef struct {
    union {
        struct {
            uint8_t bmRequestType;
            uint8_t bRequest;
        };
        uint16_t wRequestAndType;
    };
    uint16_t wValue;
    uint16_t wIndex;
    uint16_t wLength;
} setup_t;

/**
 * Endpoint 0 handler
 */
static void usb_endp0_handler(uint8_t stat)
{
    static setup_t last_setup;

    //determine which bdt we are looking at here
    bdt_t* bdt = &table[BDT_INDEX(0, (stat & USB_STAT_TX_MASK) >> USB_STAT_TX_SHIFT, (stat & USB_STAT_ODD_MASK) >> USB_STAT_ODD_SHIFT)];

    switch (BDT_PID(bdt->desc))
    {
    case PID_SETUP:
        //extract the setup token
        last_setup = *((setup_t*)(bdt->addr));

        //we are now done with the buffer
        bdt->desc = BDT_DESC(ENDP0_SIZE, 1);

        //clear any pending IN stuff
        table[BDT_INDEX(0, TX, EVEN)].desc = 0;
        table[BDT_INDEX(0, TX, ODD)].desc = 0;
        endp0_data = 1;

        //run the setup
        usb_endp0_handle_setup(&last_setup);

        //unfreeze this endpoint

        break;
    case PID_IN:
        if (last_setup.wRequestAndType == 0x0500)
        {
            USB0_ADDR = last_setup.wValue;
        }
        break;
    case PID_OUT:
        //nothing to do here..just give the buffer back
        bdt->desc = BDT_DESC(ENDP0_SIZE, 1);
        break;
    case PID_SOF:
        break;
    }
}

The very first step in handling a token is determining the buffer which contains the data for the token transmitted. This is done by the first statement which finds the appropriate address for the buffer in the table using the BDT_INDEX macro which simply implements the addressing form found in Figure 41-3 in the family manual.

After determining where the data received is located, we need to determine which token exactly was decoded. We only do things with four of the tokens. Right now, if a token comes through that we don't understand, we don't really do anything. My thought is that I should be initiating an endpoint stall, but I haven't seen anywhere that specifies what exactly I should do for an unrecognized token.

The main token that we care about with endpoint 0 is the SETUP token. The data attached to this token will be in the format described by setup_t, so the first step is that we dereference and cast the buffer into which the data was loaded into a setup_t. This token will be stored statically since we need to look at it again for tokens that follow, especially in the case of the IN token following the request to be assigned an address.

One part of processing a setup token that tripped me up for a while was what the next DATA state should be. The USB standard specifies that each data packet is marked either DATA0 or DATA1 and that the marking alternates from one packet to the next on an endpoint. This information is stored in a flag that the USB module will read from the first 4 bytes of the BDT (the "desc" field). Immediately following a SETUP token, the next DATA transmitted must be a DATA1.

After this, the setup function is run (more on that next) and as a final step, the USB module is "unfrozen". Whenever a token is being processed, the USB module "freezes" so that processing can occur. While I haven't yet read enough documentation on the subject, it seems to me that this is to give the user program some time to actually handle a token before the USB module decodes another one. I'm not sure what happens if the user program takes too long, but I imagine some error flag will go off.

The guts of handling a SETUP request are as follows:

typedef struct {
    uint8_t bLength;
    uint8_t bDescriptorType;
    uint16_t bcdUSB;
    uint8_t bDeviceClass;
    uint8_t bDeviceSubClass;
    uint8_t bDeviceProtocol;
    uint8_t bMaxPacketSize0;
    uint16_t idVendor;
    uint16_t idProduct;
    uint16_t bcdDevice;
    uint8_t iManufacturer;
    uint8_t iProduct;
    uint8_t iSerialNumber;
    uint8_t bNumConfigurations;
} dev_descriptor_t;

typedef struct {
    uint8_t bLength;
    uint8_t bDescriptorType;
    uint8_t bInterfaceNumber;
    uint8_t bAlternateSetting;
    uint8_t bNumEndpoints;
    uint8_t bInterfaceClass;
    uint8_t bInterfaceSubClass;
    uint8_t bInterfaceProtocol;
    uint8_t iInterface;
} int_descriptor_t;

typedef struct {
    uint8_t bLength;
    uint8_t bDescriptorType;
    uint16_t wTotalLength;
    uint8_t bNumInterfaces;
    uint8_t bConfigurationValue;
    uint8_t iConfiguration;
    uint8_t bmAttributes;
    uint8_t bMaxPower;
    int_descriptor_t interfaces[];
} cfg_descriptor_t;

typedef struct {
    uint16_t wValue;
    uint16_t wIndex;
    const void* addr;
    uint8_t length;
} descriptor_entry_t;

/**
 * Device descriptor
 * NOTE: This cannot be const because without additional attributes, it will
 * not be placed in a part of memory that the usb subsystem can access. I
 * have a suspicion that this location is somewhere in flash, but not copied
 * to RAM.
 */
static dev_descriptor_t dev_descriptor = {
    .bLength = 18,
    .bDescriptorType = 1,
    .bcdUSB = 0x0200,
    .bDeviceClass = 0xff,
    .bDeviceSubClass = 0x0,
    .bDeviceProtocol = 0x0,
    .bMaxPacketSize0 = ENDP0_SIZE,
    .idVendor = 0x16c0, //VOTI VID/PID for use with libusb
    .idProduct = 0x05dc,
    .bcdDevice = 0x0001,
    .iManufacturer = 0,
    .iProduct = 0,
    .iSerialNumber = 0,
    .bNumConfigurations = 1
};

/**
 * Configuration descriptor
 * NOTE: Same thing about const applies here
 */
static cfg_descriptor_t cfg_descriptor = {
    .bLength = 9,
    .bDescriptorType = 2,
    .wTotalLength = 18,
    .bNumInterfaces = 1,
    .bConfigurationValue = 1,
    .iConfiguration = 0,
    .bmAttributes = 0x80,
    .bMaxPower = 250,
    .interfaces = {
        {
            .bLength = 9,
            .bDescriptorType = 4,
            .bInterfaceNumber = 0,
            .bAlternateSetting = 0,
            .bNumEndpoints = 0,
            .bInterfaceClass = 0xff,
            .bInterfaceSubClass = 0x0,
            .bInterfaceProtocol = 0x0,
            .iInterface = 0
        }
    }
};

static const descriptor_entry_t descriptors[] = {
    { 0x0100, 0x0000, &dev_descriptor, sizeof(dev_descriptor) },
    { 0x0200, 0x0000, &cfg_descriptor, 18 },
    { 0x0000, 0x0000, NULL, 0 }
};

static void usb_endp0_transmit(const void* data, uint8_t length)
{
    table[BDT_INDEX(0, TX, endp0_odd)].addr = (void *)data;
    table[BDT_INDEX(0, TX, endp0_odd)].desc = BDT_DESC(length, endp0_data);
    //toggle the odd and data bits
    endp0_odd ^= 1;
    endp0_data ^= 1;
}

/**
 * Endpoint 0 setup handler
 */
static void usb_endp0_handle_setup(setup_t* packet)
{
    const descriptor_entry_t* entry;
    const uint8_t* data = NULL;
    uint8_t data_length = 0;

    switch(packet->wRequestAndType)
    {
    case 0x0500: //set address (wait for IN packet)
        break;
    case 0x0900: //set configuration
        //we only have one configuration at this time
        break;
    case 0x0680: //get descriptor
    case 0x0681:
        for (entry = descriptors; 1; entry++)
        {
            if (entry->addr == NULL)
                break;

            if (packet->wValue == entry->wValue && packet->wIndex == entry->wIndex)
            {
                //this is the descriptor to send
                data = entry->addr;
                data_length = entry->length;
                goto send;
            }
        }
        goto stall;
        break;
    default:
        goto stall;
    }

    //if we are sent here, we need to send some data
    send:
        if (data_length > packet->wLength)
            data_length = packet->wLength;
        usb_endp0_transmit(data, data_length);
        return;

    //if we make it here, we are not able to send data and have stalled
    stall:

This is the part that took me the longest once I managed to get the module talking. Handling of SETUP tokens on endpoint 0 must be done in a rather exact fashion and the slightest mistake gives some very cryptic errors.

This is a very very very minimalistic setup token handler and is not by any means complete. It does only what is necessary to get the computer to see the device and successfully read its descriptors. There is no functionality for actually doing things with the USB device. Most of the space is devoted to actually returning the various descriptors. In this example, the descriptor is for a device with a single configuration and a single interface which uses no additional endpoints. In a real device, this would almost certainly not be the case (unless one uses V-USB...this is how V-USB sets up their device if no other endpoints are compiled in).

The SETUP packet comes with a "request" and a "type". We process these as one word for simplicity. The above shows only the necessary commands to actually get this thing to connect to a Linux machine running the standard USB drivers that come with the kernel. I have not tested it on Windows and it may require some modification to work since it doesn't implement all of the necessary functionality. A description of the functionality follows:

  • Set address (0x0500): This is a very simple command. All it does is wait for the next IN token. Upon receipt of this token, the address is considered "committed" and the USB module is told of its new address (see the endpoint 0 handler function above (not the setup handler)).
  • Set configuration (0x0900): This command can be complex, but I have stripped it down for the purposes of this example. Normally, during this command the USB module would be set up with all the requisite BDT entries for the endpoints described by the selected configuration. Since we only have one possible configuration and it doesn't use any additional endpoints, we basically do nothing. Once I start adding other endpoints to this, all of the setup for those endpoints will go in here. This is the equivalent of the RESET handler for non-zero endpoints in terms of the operations that occur. If the Set Interface command was implemented, it would have similar functionality. More about this command can be read in the referenced USB basics website.
  • Get descriptor (0x0680, 0x0681): In reality, this is two commands: Get descriptor addressed to the device (bmRequestType 0x80) and get descriptor addressed to an interface (bmRequestType 0x81). However, due to the structure we have chosen in storing the descriptors, these two commands can be merged. This is the most complex part of this particular driver and is influenced heavily by the way things are done with the Teensyduino driver since I thought they had a very efficient pattern. Basically, it uses the wIndex and wValue to find a pointer to some data to return, whether that be the device descriptor, the configuration descriptor, a string, or something else. In our case, we have only the device descriptor and the configuration descriptor. Adding a string would be trivial, however, and the exact wIndex and wValue combination for that is described in the USB basics. For strings, the low byte of wValue matches with any of the several i* (iManufacturer, iProduct, etc) which may be specified, and wIndex carries the language ID.
  • default: When an unrecognized command is received, we enter a stall. This is basically the USB way of saying "uhh...I don't know what to do here" and requires the host to un-stall the endpoint before it can continue. From what I gather, there isn't really much the user code has to do other than declare that a stall has occurred. The USB module seems to take care of the rest of that.
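
To see where the magic numbers in the switch statement come from, here is a sketch that pulls the fields out of the 8 raw bytes of a SETUP packet. The function names and the two example packets are hypothetical; the byte order follows the USB convention of sending multi-byte fields low byte first, which is why bRequest ends up in the upper byte of the combined word.

```c
#include <assert.h>
#include <stdint.h>

/* Read a little-endian 16-bit value from a byte buffer */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* bmRequestType (byte 0) and bRequest (byte 1) combined into one word */
static uint16_t setup_request_word(const uint8_t raw[8])
{
    assert(raw != 0);
    return le16(&raw[0]);
}

static uint16_t setup_wvalue(const uint8_t raw[8])  { return le16(&raw[2]); }
static uint16_t setup_windex(const uint8_t raw[8])  { return le16(&raw[4]); }
static uint16_t setup_wlength(const uint8_t raw[8]) { return le16(&raw[6]); }

/* Example: "get descriptor" for the device descriptor, 18 bytes requested */
static const uint8_t demo_get_device_descriptor[8] = {
    0x80, 0x06, 0x00, 0x01, 0x00, 0x00, 0x12, 0x00
};

/* Example: "set address" assigning address 7 */
static const uint8_t demo_set_address[8] = {
    0x00, 0x05, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00
};
```

The first example decodes to request word 0x0680 with wValue 0x0100, which is exactly the first entry in the descriptor table above. Reading the bytes individually like this also avoids depending on how the compiler lays out setup_t.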

After handling a command and determining that it isn't a stall, the transmission is set up. At the moment, I only have transmission set up for a maximum of 64 bytes. In reality, this is limited by the wLength transmitted with the setup packet (note the if statement before the call to usb_endp0_transmit), but as far as I have seen this is generally the same as the length of the endpoint (I could be very wrong; watch out for that one). However, it would be fairly straightforward to allow it to transmit more bytes: Upon receipt of an IN token, just check if we have reached the end of what we are supposed to transmit. If not, point the next TX buffer to the correct starting point and subtract the endpoint size from the remaining length until we have transmitted all of the bytes. Although the endpoint size is 64 bytes, it is easy to transmit much more than that; it just takes multiple IN requests. The data length is given by the descriptors, so the host can determine when to stop sending IN requests.
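
The multi-packet idea could look something like the following sketch, which only does the bookkeeping and can be run on a PC. None of this exists in my driver yet: ep0_tx_state_t, ep0_tx_begin, ep0_tx_next_packet and ep0_packets_needed are hypothetical names, and in a real driver each chunk would be handed to something like usb_endp0_transmit when an IN token completes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EP0_SIZE 64u

typedef struct {
    const uint8_t *next; /* next byte that still has to be sent */
    size_t remaining;    /* number of bytes left to send */
} ep0_tx_state_t;

/* Start a transfer, never sending more than the host asked for (wLength) */
static void ep0_tx_begin(ep0_tx_state_t *s, const void *data, size_t length, size_t wLength)
{
    assert(s != NULL && data != NULL);
    s->next = data;
    s->remaining = (length > wLength) ? wLength : length;
}

/* Called once per IN token: yields the next chunk and its size (0 = done) */
static size_t ep0_tx_next_packet(ep0_tx_state_t *s, const uint8_t **chunk)
{
    size_t n = (s->remaining > DEMO_EP0_SIZE) ? DEMO_EP0_SIZE : s->remaining;
    *chunk = s->next;
    s->next += n;
    s->remaining -= n;
    return n;
}

/* Demonstration helper: counts how many packets a transfer would take */
static unsigned ep0_packets_needed(size_t length, size_t wLength)
{
    static const uint8_t payload[256];
    ep0_tx_state_t state;
    const uint8_t *chunk;
    unsigned packets = 0;

    assert(length <= sizeof payload);
    ep0_tx_begin(&state, payload, length, wLength);
    while (ep0_tx_next_packet(&state, &chunk) > 0)
        packets++;
    return packets;
}
```

A 150-byte descriptor requested with a wLength of 255 would go out as three packets (64, 64 and 22 bytes). One thing this sketch leaves out: as I understand the specification, a transfer that is shorter than wLength but ends on a full-size packet needs to be finished with a zero-length packet.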

During transmission, both the odd and data flags are toggled. This ensures that we are always using the correct TX buffer (even/odd) and the DATA flag transmitted is valid.

The descriptors are the one part that can't really be screwed up here. Screwing up the descriptors causes interesting errors when the host tries to communicate. I did not like how the "reference" usb drivers I looked at generally defined descriptors: They used a char array. This works very well for the case where there are a variable number of entries in the descriptor, but for my purposes I decided to use named structs so that I could match the values I had specified on my device to values I read from the host machine without resorting to counting bytes in the array. It's simply for easier reading and doesn't really give much more than that. It may even be more error prone because I am relying on the compiler packing the struct into memory in the correct order for transmission and in later versions I may end up using the char array method.
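
If the struct approach is kept, the reliance on the compiler's layout can at least be checked at compile time. The sketch below repeats the device descriptor fields under the hypothetical name demo_dev_descriptor_t and uses C11 static_assert; it is an idea for hardening the approach, not something the driver currently does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy of the device descriptor layout, renamed for this demonstration */
typedef struct {
    uint8_t bLength;
    uint8_t bDescriptorType;
    uint16_t bcdUSB;
    uint8_t bDeviceClass;
    uint8_t bDeviceSubClass;
    uint8_t bDeviceProtocol;
    uint8_t bMaxPacketSize0;
    uint16_t idVendor;
    uint16_t idProduct;
    uint16_t bcdDevice;
    uint8_t iManufacturer;
    uint8_t iProduct;
    uint8_t iSerialNumber;
    uint8_t bNumConfigurations;
} demo_dev_descriptor_t;

/* The build fails if the compiler inserts padding or moves a field */
static_assert(sizeof(demo_dev_descriptor_t) == 18, "device descriptor must be 18 bytes");
static_assert(offsetof(demo_dev_descriptor_t, bcdUSB) == 2, "bcdUSB must be at offset 2");
static_assert(offsetof(demo_dev_descriptor_t, idVendor) == 8, "idVendor must be at offset 8");
static_assert(offsetof(demo_dev_descriptor_t, bNumConfigurations) == 17, "bNumConfigurations must be last");
```

The checks pass here because every 16-bit field happens to fall on an even offset. The multi-byte fields also go out in the processor's byte order, which is fine on a little-endian part like this one since USB is little-endian as well.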

I won't delve into a long and drawn out description of what the USB descriptor has in it, but I will give a few points:

  • In Linux, the device descriptor is requested first and then the configuration descriptor after that. They are two separate commands, hence the two separate descriptor entries in my descriptor table.
  • The device descriptor must NOT be "const". For my compiler at least, this causes it to be placed into flash which, while a perfectly valid memory address that in general can be read, is inaccessible to the USB module. I spent a long time banging my head on this one saying "but it should work! why doesn't it work???" Moral of the story: Anything that is pointed to by a BDT entry (transmit buffers, receive buffers) must be located in main RAM, not in the flash. It must not be const.
  • A device must have at least one configuration. Linux, at least, didn't seem to like it very much when there were zero configurations and would put lots of errors into my log.
  • The configuration needs to have at least one interface. Specifying no interfaces caused the same problems as not specifying any configurations.
  • The configuration indices (bConfigurationValue) are 1-based and the interface indices (bInterfaceNumber) are zero based. I haven't fooled around with these enough to test the veracity of this claim fully, but it was the only configuration that I managed to get things working in.
  • The length values are very important. If these are not correct, the host will have some serious troubles reading the descriptors. I spent a while troubleshooting these. The main one to make sure of is the wTotalLength value in the configuration descriptor. Most of the others are pretty much always going to be the same.
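
Since wTotalLength was the value that bit me, here is how it can be worked out for simple cases. The function name cfg_total_length is hypothetical and the sketch only counts the standard descriptors (9 bytes for the configuration, 9 per interface, 7 per endpoint); class-specific descriptors, which some device classes add, are ignored.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CFG_DESC_LEN = 9,       /* configuration descriptor */
    INTERFACE_DESC_LEN = 9, /* each interface descriptor */
    ENDPOINT_DESC_LEN = 7   /* each standard endpoint descriptor */
};

/* Total number of bytes returned for a configuration descriptor request */
static uint16_t cfg_total_length(unsigned interfaces, unsigned endpoints)
{
    assert(interfaces >= 1); /* a configuration needs at least one interface */
    return (uint16_t)(CFG_DESC_LEN
                      + interfaces * INTERFACE_DESC_LEN
                      + endpoints * ENDPOINT_DESC_LEN);
}
```

For the configuration in this post (one interface, no extra endpoints) this gives the 18 used above; adding a pair of bulk endpoints to that interface would make it 32.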

Where to go from here

The driver I have implemented leaves much to be desired. This isn't meant to be a fully featured driver. Instead, it's meant to be something of an introduction to getting the USB module to work on the bare metal without the support of some external dependency. A few things that would definitely need to be implemented are:

  • The full set of commands for the endpoint 0 SETUP token processing
  • A more expansive configuration that allows for having some bulk endpoints for sending data. The 64-byte limitation of packet size for endpoint 0 can cause some issues when attempting to actually utilize the full 12Mbit/s bandwidth. The USB protocol does actually add overhead and the fewer times that a token has to be invoked, the better.
  • Strings in the configuration. Right now, the configuration is essentially "blank" because it uses a shared VID/PID and doesn't specify a manufacturer, product, or serial number. It would be rather hard to identify this device using libusb on a system with multiple devices using that VID/PID combination.
  • Real error handling. Right now, the interrupt basically ignores the errors. In a real application, these would need to be handled.
  • A better structure. I am not a real fan of how I have structured this, but my idea was to make it "expandable" without needing to recompile usb.c every time a change was made. It doesn't achieve that yet, but in future iterations I hope to have a relatively portable usb driver module that I can port to other projects without modification, placing the other device-specific things into another, minimalistic, file.


I can only hope that this discussion has been helpful. I spent a long time reading documentation, writing code, smashing my keyboard, and figuring things out and I would like to see that someone else could benefit from this. I hope as I learn more about using the modules on my Teensy that I will become more competent in understanding how many of the systems I rely on daily function.

The code I have included above isn't always complete, so I would definitely recommend actually reading the code in the repository referenced at the beginning of this article.

If there are any mistakes in the above, please let me know in the comments or shoot me an email.

A new server

So for the past couple months my server has been going on and off due to the fact that rackspace increased their retirements of swapping and such. I made the swap to Amazon EC2 today and so over the next couple weeks we'll see how this works out.

Extreme Attributed Metadata with Autofac


If you are anything like me, you love reflection in any programming language. For the last two years or so I have been writing code for work almost exclusively in C# and have found its reflection system to be a pleasure to use. It's simple, can be fast, and can do so much.

I recently started using Autofac at work to help achieve Inversion of Control within our projects. It has honestly been the most life changing C# library (sorry Autofac, jQuery and Knockout still take the cake for "life-changing in all languages") I have ever used and has changed the way I decompose problems when writing programs.

This article will cover some very interesting features of the Autofac Attributed Metadata module. It is a little lengthy, so I have here what will be covered:

  • What is Autofac?
  • Attributed Metadata: The Basics
  • The IMetadataProvider interface
  • IMetadataProvider: Making a set of objects
  • IMetadataProvider: Hierarchical Metadata

What is Autofac?

This post assumes that the reader is at least passingly familiar with Autofac. However, I will make a short introduction: Autofac allows you to "compose" your program structure by "registering" components and then "resolving" them at runtime. The idea is that you define an interface for some object that does "something" and create one or more classes that implement that interface, each accomplishing the "something" in their own way. Your parent class, which needs to have one of those objects for doing that "something" will ask the Autofac container to "resolve" the interface. Autofac will give back either one of your implementations or an IEnumerable of all of your implementations (depending on how you ask it to resolve). The "killer feature" of Autofac, IMO, is being able to use constructor arguments to recursively resolve the "dependencies" of an object. If you want an implementation of an interface passed into your object when it is resolved, just put the interface in the constructor arguments and when your object is resolved by Autofac, Autofac will resolve that interface for you and pass it in to your constructor. Now, this article isn't meant to introduce Autofac, so I would definitely recommend reading up on the subject.

Attributed Metadata: The Basics

One of my favorite features has been Attributed Metadata. Autofac allows Metadata to be included with objects when they are resolved. Metadata allows one to specify some static parameters that are associated with a particular implementation of something registered with the container. This Metadata is normally created during registration of the particular class and, without this module, must be done "manually". The Attributed Metadata module allows one to use custom attributes to specify the Metadata for the class rather than needing to specify it when the class is registered. This is an absurdly powerful feature which allows for doing some pretty interesting things.

For my example I will use an "extensible" letter formatting program that adds some text to the content of a "letter". I define the following interface:

interface ILetterFormatter
{
    string FormatLetter(string content);
}

This interface is for something that can "format" a letter in some way. For starters, I will define two implementations:

class ImpersonalLetterFormatter : ILetterFormatter
{
    public string FormatLetter(string content)
    {
        return "To Whom It May Concern:\n\n" + content;
    }
}

class PersonalLetterFormatter : ILetterFormatter
{
    public string FormatLetter(string content)
    {
        return "Dear Individual,\n\n" + content;
    }
}

Now, here is a simple program that will use these formatters:

class MainClass
{
    public static void Main (string[] args)
    {
        var builder = new ContainerBuilder();

        //register all ILetterFormatters in this assembly
        builder.RegisterAssemblyTypes(typeof(MainClass).Assembly)
            .Where(c => c.IsAssignableTo<ILetterFormatter>())
            .AsImplementedInterfaces();

        var container = builder.Build();

        using (var scope = container.BeginLifetimeScope())
        {
            //resolve all formatters
            IEnumerable<ILetterFormatter> formatters = scope.Resolve<IEnumerable<ILetterFormatter>>();

            //What do we do now??? So many formatters...which is which?
        }
    }
}

Ok, so we have run into a problem: We have a list of formatters, but we don't know which is which. There are a couple of different solutions:

  • Use the "is" test or do a "soft cast" using the "as" operator to a specific type. This is bad because it requires that the resolver know about the specific implementations of the interface (which is what we are trying to avoid)
  • Just choose one based on order. This is bad because the resolution order is just as guaranteed as reflection order in C#...which is not guaranteed at all. We can't be sure they will be resolved in the same order each time.
  • Use metadata at registration time and resolve it with metadata. The issue here is that if we used RegisterAssemblyTypes like above, it makes registration difficult. Also, once we get any sizable number of things registered with metadata, it becomes unmanageable IMO.
  • Use attributed metadata! Example follows...

We define another class:

[MetadataAttribute]
sealed class LetterFormatterAttribute : Attribute
{
    public string Name { get; private set; }

    public LetterFormatterAttribute(string name)
    {
        this.Name = name;
    }
}

Marking it with System.ComponentModel.Composition.MetadataAttributeAttribute (no, that's not a typo) will make the Attributed Metadata module place the public properties of the Attribute into the metadata dictionary that is associated with the class at registration time.

We mark the classes as follows:

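//the name strings passed to the attribute here are example values
[LetterFormatter("Impersonal")]
class ImpersonalLetterFormatter : ILetterFormatter
...

[LetterFormatter("Personal")]
class PersonalLetterFormatter : ILetterFormatter
...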

And then we change the builder to take into account the metadata by asking it to register the Autofac.Extras.Attributed.AttributedMetadataModule. This will cause the Attributed Metadata extensions to scan all of the registered types (past, present, and future) for MetadataAttribute-marked attributes and use the public properties as metadata:

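var builder = new ContainerBuilder();

builder.RegisterModule<AttributedMetadataModule>();

builder.RegisterAssemblyTypes(typeof(MainClass).Assembly)
    .Where(c => c.IsAssignableTo<ILetterFormatter>())
    .AsImplementedInterfaces();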

Now, when we resolve the ILetterFormatter classes, we can either use Autofac.Features.Metadata.Meta<TImplementation> or Autofac.Features.Metadata.Meta<TImplementation, TMetadata>. I'm a personal fan of the "strong" metadata, or the latter. It causes the metadata dictionary to be "forced" into a class rather than just directly accessing the metadata dictionary. This removes any uncertainty about types and such. So, I will create a class that will hold the metadata when the implementations are resolved:

class LetterMetadata
{
    public string Name { get; set; }
}

It is worthwhile to note that the individual properties must have a value in the metadata dictionary unless the DefaultValue attribute is applied to the property. For example, if I had an integer property called Foo, an exception would be thrown when metadata was resolved since I have no corresponding Foo metadata. However, if I put DefaultValue(6) on the Foo property, no exception would be thrown and Foo would be set to 6.

So, we now have the following inside our using statement that controls our scope in the main method:

//resolve all formatters
IEnumerable<Meta<ILetterFormatter, LetterMetadata>> formatters = scope.Resolve<IEnumerable<Meta<ILetterFormatter, LetterMetadata>>>();

//we will ask how the letter should be formatted
foreach (var formatter in formatters)
{
    Console.Write("- ");
    Console.WriteLine(formatter.Metadata.Name);
}

ILetterFormatter chosen = null;
while (chosen == null)
{
    Console.WriteLine("Choose a formatter:");
    string name = Console.ReadLine();
    chosen = formatters.Where(f => f.Metadata.Name == name).Select(f => f.Value).FirstOrDefault();

    if (chosen == null)
        Console.WriteLine(string.Format("Invalid formatter: {0}", name));
}

//just for kicks, we say the first argument is our letter, so we format it and output it to the console
Console.WriteLine(chosen.FormatLetter(args[0]));

The IMetadataProvider Interface

So, in the contrived example above, we were able to identify a class based solely on its metadata rather than doing type checking. What's more, we were able to define the metadata through attributes. However, this is old hat for Autofac. This feature has been around for a while.

When I was at work the other day, I needed to be able to handle putting sets of things into metadata (such as a list of strings). Autofac makes no prohibition on this in its metadata dictionary. The dictionary is of the type IDictionary<string, object>, so it can hold pretty much anything, including arbitrary objects. The problem is that the Attributed Metadata module had no way to do this easily. Attributes can only take certain types as constructor arguments and that seriously places a limit on what sort of things could be put into metadata via attributes easily.

I decided to remedy this and after submitting an idea for autofac via a pull request, having some discussion, changing the exact way to accomplish this goal, and fixing things up, my pull request was merged into autofac which resulted in a new feature: The IMetadataProvider interface. This interface provides a way for metadata attributes to control how exactly they produce metadata. By default, the attribute would just have its properties scanned. However, if the attribute implemented the IMetadataProvider interface, a method will be called to get the metadata dictionary rather than doing the property scan. When an IMetadataProvider is found, the GetMetadata(Type targetType) method will be called with the first argument set to the type that is being registered. This allows the IMetadataProvider the opportunity to know which class it is actually applied to; something normally not possible without explicitly passing the attribute a Type in a constructor argument.

To get an idea of what this would look like, here is a metadata attribute which implements this interface:

[MetadataAttribute]
class LetterFormatterAttribute : Attribute, IMetadataProvider
{
    public string Name { get; private set; }

    public LetterFormatterAttribute(string name)
    {
        this.Name = name;
    }

    #region IMetadataProvider implementation

    public IDictionary<string, object> GetMetadata(Type targetType)
    {
        return new Dictionary<string, object>()
        {
            { "Name", this.Name }
        };
    }

    #endregion
}

This metadata doesn't do much more than the original. It actually returns exactly what would be created via property scanning. However, this allows much more flexibility in how MetadataAttributes can provide metadata. They can scan the type for other attributes, create arbitrary objects, and many other fun things that I can't even think of.

IMetadataProvider: Making a set of objects

Perhaps the simplest application of this new IMetadataProvider is having the metadata contain a list of objects. Building on our last example, we saw that the "personal" letter formatter just said "Dear Individual" every time. What if we could change that so that there was some way to pass in some "properties" or "options" provided by the caller of the formatting function? We can do this using an IMetadataProvider. We make the following changes:

class FormatOptionValue
{
    public string Name { get; set; }
    public object Value { get; set; }
}

interface IFormatOption
{
    string Name { get; }
    string Description { get; }
}

interface IFormatOptionProvider
{
    IFormatOption GetOption();
}

interface ILetterFormatter
{
    string FormatLetter(string content, IEnumerable<FormatOptionValue> options);
}

[MetadataAttribute]
sealed class LetterFormatterAttribute : Attribute, IMetadataProvider
{
    public string Name { get; private set; }

    public LetterFormatterAttribute(string name)
    {
        this.Name = name;
    }

    public IDictionary<string, object> GetMetadata(Type targetType)
    {
        var options = targetType.GetCustomAttributes(typeof(IFormatOptionProvider), true)
            .Cast<IFormatOptionProvider>()
            .Select(p => p.GetOption())
            .ToList();

        return new Dictionary<string, object>()
        {
            { "Name", this.Name },
            { "Options", options }
        };
    }
}

//note the lack of the [MetadataAttribute] here. We don't want autofac to scan this for properties
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
sealed class StringOptionAttribute : Attribute, IFormatOptionProvider
{
    public string Name { get; private set; }

    public string Description { get; private set; }

    public StringOptionAttribute(string name, string description)
    {
        this.Name = name;
        this.Description = description;
    }

    public IFormatOption GetOption()
    {
        return new StringOption()
        {
            Name = this.Name,
            Description = this.Description
        };
    }
}

public class StringOption : IFormatOption
{
    public string Name { get; set; }

    public string Description { get; set; }

    //note that we could easily define other properties that
    //do not appear in the interface
}

class LetterMetadata
{
    public string Name { get; set; }

    public IEnumerable<IFormatOption> Options { get; set; }
}

Ok, so this is just a little bit more complicated. There are two changes to pay attention to: Firstly, the FormatLetter function now takes a list of FormatOptionValues. The second change is what enables the caller of FormatLetter to know which options to pass in. The LetterFormatterAttribute now scans the type in order to construct its metadata dictionary by looking for attributes that describe what options it needs. I feel like the usage of this is best illustrated by decorating our PersonalLetterFormatter for it to have some metadata describing the options that it requires:

[LetterFormatter("Personal")] //the name string is an example value
[StringOption(ToOptionName, "Name of the individual to address the letter to")]
class PersonalLetterFormatter : ILetterFormatter
{
    const string ToOptionName = "To";

    public string FormatLetter(string content, IEnumerable<FormatOptionValue> options)
    {
        var toName = options.Where(o => o.Name == ToOptionName).Select(o => o.Value).FirstOrDefault() as string;
        if (toName == null)
            throw new ArgumentException("The " + ToOptionName + " string option is required");

        return "Dear " + toName + ",\n\n" + content;
    }
}

When the metadata for the PersonalLetterFormatter is resolved, it will contain an IFormatOption which represents the To option. The resolver can attempt to cast the IFormatOption to a StringOption to find out what type it should pass in using the FormatOptionValue.

This can be extended quite easily for other IFormatOptionProviders and IFormatOption pairs, making for a very extensible way to easily declare metadata describing a set of options attached to a class.

IMetadataProvider: Hierarchical Metadata

The last example showed that the IMetadataProvider could be used to scan the class to provide metadata into a structure containing an IEnumerable of objects. It is a short leap to see that this could be used to create hierarchies of arbitrary objects.

For now, I won't provide a full example of how this could be done, but in the future I plan on having a gist or something showing arbitrary metadata hierarchy creation.


I probably use Metadata more than I should in Autofac. With the addition of the IMetadataProvider I feel like it's quite easy to define complex metadata and use it with Autofac's natural constructor injection system. Overall, the usage of metadata & reflection in my programs has made them quite a bit more flexible and extensible, and I feel like Autofac and its metadata system complement the built-in reflection system of C# quite well.

Teensy 3.1 Bare-Metal


A couple of weeks ago I saw a link on hackaday to an article by Karl Lunt about using the Teensy 3.1 without the Arduino IDE and building for the bare metal. I was very intrigued as the Arduino IDE was my only major beef with developing stuff for the Teensy 3.1 and I wanted to be able to do things without having to use the IDE. I read through the article and although it was geared towards windows, I decided to try to adapt it to my development style. There were a few things I wanted to do:

  • No additional code dependencies other than the teensyduino installation which I already had
  • Use local binaries for compilation, not the ones included with teensyduino (it just felt uncomfortable to use theirs)
  • Separation of src, obj, and bin directories
  • Mixture of c and cpp files in the src directory
  • Not needing to explicitly list the files in the src directory to compile
  • Selective inclusion of features from the teensyduino installation

I have very little experience writing more complex Makefiles. When I say "complex" I am referring to makefiles which have the src, obj, bin separation and pull in objects from multiple sources. While this may not seem complex to many people, it's something I have very little experience actually doing by hand (I would normally use a generator of some sort).

I'm writing this in the hope that those without mad Makefile skills, such as myself, can liberate themselves from the Arduino IDE when developing for an awesome platform like the Teensy 3.1.

All code for this example can be found here:


As my first order of business, I located the arm-none-eabi binaries for my linux distribution. These can also be found for Windows as noted in Karl Lunt's article. Random sidenote: I found this description of why arm-none-eabi is called arm-none-eabi. Very informative. Anyway, for those who run archlinux, the following packages are needed:

  • arm-none-eabi-gcc (contains the compilers)
  • arm-none-eabi-binutils (contains the linker, objdump, and other things for manipulating the binaries into hex files)
  • make (we are using a makefile...)

Hopefully this gives a bit of a hint on what packages may need to be installed on other systems. For Windows, the compiler is here and make can be found here or by googling around. I haven't tested any of this on Windows and would advocate using Linux for this, but it shouldn't be hard to modify the Makefile for Windows.

My Flow

For C and C++ development I have a particular flow that I like to follow. This is heavily influenced by my usage of Code::Blocks and Visual Studio. I like to have a src directory where I put all of my sources, an include directory where I put all of my headers, an obj directory for all the obj, d, & lst files, and a bin directory for my executable output. I've always had such a hard time with raw Makefiles because I could never quite get that directory structure working. I was never quite satisfied with my feeble Makefile attempts which ended up placing the object files in the root directory where the sources had to be. This Makefile represents the first time I was ever able to actually have a real bin, obj, src structure that works.

Compiling object files to obj & looking in src for source

A working description of this can be found in the Makefile in my github repository I mentioned earlier.

Makefiles work by defining a series of "targets" which have "dependencies". Every dependency can also be the name of a target and a target may have multiple ways of being resolved (this I never realized before). So, here are the parts of the Makefile which enable searching in src for both c and cpp and doing specific actions for each, compiling them into the obj directory:

# Project C & C++ files which are to be compiled
CPP_FILES = $(wildcard $(SRCDIR)/*.cpp)
C_FILES = $(wildcard $(SRCDIR)/*.c)

# Change project C & C++ files into object files
OBJ_FILES := $(addprefix $(OBJDIR)/,$(notdir $(CPP_FILES:.cpp=.o))) $(addprefix $(OBJDIR)/,$(notdir $(C_FILES:.c=.o)))

# Example build target
build: $(OUTPUTDIR)/$(PROJECT).elf

# Linker invocation
$(OUTPUTDIR)/$(PROJECT).elf: $(OBJ_FILES)
	@mkdir -p $(dir $@)
	$(CC) $(OBJ_FILES) $(LDFLAGS) -o $(OUTPUTDIR)/$(PROJECT).elf

# C file compilation for some object file
$(OBJDIR)/%.o : $(SRCDIR)/%.c
	@echo Compiling $<, writing to $@...
	@mkdir -p $(dir $@)
	$(CC) $(GCFLAGS) -c $< -o $@ > $(basename $@).lst

# C++ file compilation for some object file
$(OBJDIR)/%.o : $(SRCDIR)/%.cpp
	@mkdir -p $(dir $@)
	@echo Compiling $<, writing to $@...
	$(CC) $(GCFLAGS) -c $< -o $@

Each section above has a specific purpose and the order can be rather important. The first part uses $(wildcard ...) to pick up all of the C++ and C files. The CPP_FILES variable, for example, will become "src/file1.cpp src/file2.cpp src/etc.cpp" if we had "file1.cpp", "file2.cpp" and "etc.cpp" in the src directory. Similarly, the C_FILES would pick up any files in src with a c file extension. Next, the filenames are transformed into object filenames living in the obj directory. This is done by first changing the file extension of the files to .o using the $(CPP_FILES:.cpp=.o) or $(C_FILES:.c=.o) syntax. However, these files still look like they are in the src directory (e.g. src/file1.o) so the directory is next stripped off each file using $(notdir ...). Removing the directory doesn't allow for a nested src directory, but that wasn't one of our objectives here. At this point, the files are just names with no directories (e.g. file1.o) and so the last step is to change them to live in the obj directory using $(addprefix $(OBJDIR)/,...). This completes our transformation, populating OBJ_FILES to look like "obj/file1.o obj/file2.o" etc.

The next part is where we take that list of object files and use them as dependencies for a target. Targets are defined by <target name>: <dependency list> followed by a list of commands to execute after resolving the dependencies. IMPORTANT: The list of commands needs to be indented by a tab (\t) character. Spaces will not work (it will say something like "missing separator" with a line number). A target is anything that we pass into make. If no target is given, make builds the first target in the Makefile, which by convention is named 'all'. The "dependencies" are files which must be "up to date" before the target is run.

In our example, we use $(OBJ_FILES) as a dependency of "$(OUTPUTDIR)/$(PROJECT).elf" which is required as a dependency of "build". This tells make that when we run "make build", it needs to try to resolve the dependency of "bin/<project>.elf" which in turn needs to resolve "obj/file1.o", "obj/file2.o", and "obj/etc.o" (going from our example in the previous paragraph). This is where the next couple of targets come in. A target will only be executed if it can find some rule to resolve all of the dependencies. We will use "obj/file1.o" as an example here. There are 2 targets with that name, actually: "$(OBJDIR)/%.o: $(SRCDIR)/%.c" and "$(OBJDIR)/%.o: $(SRCDIR)/%.cpp". It would be good to note that the target names here are exactly the same even though the dependencies are different. Now, how does "$(OBJDIR)/%.o" match "obj/file1.o"? A Makefile does something called "pattern matching" when the % sign is used. It says "match something that looks like $(OBJDIR)/<some file>.o" which our "obj/file1.o" happens to match. The cool part is that once the target name is resolved using a %, the dependencies get to use % to substitute the exact same thing. Thus, our % here is "file1", so it follows that its dependency must be "$(SRCDIR)/file1.c". Now, our example used "file1.cpp", not "file1.c" and this is where defining multiple targets with the same names but different dependencies comes in. A target will only be executed if the dependencies can be resolved to either an actual file and/or another target. Our first target won't be a match since it says that the source file should be a C file. So, it goes to the next target that matches the name which has a dependency of "$(SRCDIR)/file1.cpp". This one matches, and so the commands following that target are executed.
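To make the % substitution a bit more concrete, here is a rough C sketch of the idea. It is only an illustration of the stem matching described above, not how make is actually implemented: the function names match_stem and expand_pattern are made up for this example, and real make has extra rules (directory handling, rule ordering, chained rules) that are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Match `target` against a pattern containing a '%'.
 * On success, copy the text matched by '%' (the "stem") into `stem`
 * and return 1; otherwise return 0. */
int match_stem(const char *pattern, const char *target,
               char *stem, size_t stem_size)
{
    const char *pct = strchr(pattern, '%');
    assert(pct != NULL); /* this sketch only handles pattern rules */

    size_t prefix_len = (size_t)(pct - pattern);
    const char *suffix = pct + 1;
    size_t suffix_len = strlen(suffix);
    size_t target_len = strlen(target);

    /* The stem has to be at least one character long. */
    if (target_len <= prefix_len + suffix_len)
        return 0;
    if (strncmp(target, pattern, prefix_len) != 0)
        return 0;
    if (strcmp(target + target_len - suffix_len, suffix) != 0)
        return 0;

    size_t stem_len = target_len - prefix_len - suffix_len;
    if (stem_len + 1 > stem_size)
        return 0;
    memcpy(stem, target + prefix_len, stem_len);
    stem[stem_len] = '\0';
    return 1;
}

/* Substitute `stem` for the '%' in `pattern`, writing the result to `out`.
 * Returns 1 on success, 0 if `out` is too small. */
int expand_pattern(const char *pattern, const char *stem,
                   char *out, size_t out_size)
{
    const char *pct = strchr(pattern, '%');
    assert(pct != NULL);

    size_t prefix_len = (size_t)(pct - pattern);
    size_t stem_len = strlen(stem);
    size_t suffix_len = strlen(pct + 1);
    if (prefix_len + stem_len + suffix_len + 1 > out_size)
        return 0;

    memcpy(out, pattern, prefix_len);
    memcpy(out + prefix_len, stem, stem_len);
    /* Copy the suffix together with its terminating '\0'. */
    memcpy(out + prefix_len + stem_len, pct + 1, suffix_len + 1);
    return 1;
}
```

Calling match_stem("obj/%.o", "obj/file1.o", ...) yields the stem "file1", and feeding that stem to expand_pattern("src/%.cpp", ...) produces "src/file1.cpp", which is the dependency make then goes looking for.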

When executing a target ("$(OBJDIR)/%.o: $(SRCDIR)/%.cpp" in our example), there are some special variables which are available for use. These are described here, but I will discuss two important ones that I used: $@ and $<. $@ is the name of the target (so, "obj/file.o" in our case) and $< is the name of the first dependency ("src/file.cpp" in our case). This lets us pass these arguments into the commands that we execute. Our Makefile will first create the obj directory by calling "mkdir -p $(dir $@)" which is translated into "mkdir -p obj/" since $(dir $@) will give us "obj/". Next, we actually compile the $< (which is translated to "src/file.cpp"), outputting it to $@ which is translated to "obj/file.o".

Outputting everything to bin

Compared to the pattern matching and multiple target definitions that we discussed above, this is comparatively simple. We simply get to prefix all of our "binary" output files with some directory which is set as $(OUTPUTDIR) in my Makefile. Here is an example:

all:: $(OUTPUTDIR)/$(PROJECT).hex $(OUTPUTDIR)/$(PROJECT).bin stats dump

$(OUTPUTDIR)/$(PROJECT).bin: $(OUTPUTDIR)/$(PROJECT).elf
	$(OBJCOPY) -O binary -j .text -j .data $(OUTPUTDIR)/$(PROJECT).elf $(OUTPUTDIR)/$(PROJECT).bin

$(OUTPUTDIR)/$(PROJECT).hex: $(OUTPUTDIR)/$(PROJECT).elf
	$(OBJCOPY) -R .stack -O ihex $(OUTPUTDIR)/$(PROJECT).elf $(OUTPUTDIR)/$(PROJECT).hex

#  Linker invocation
$(OUTPUTDIR)/$(PROJECT).elf: $(OBJ_FILES)
	@mkdir -p $(dir $@)
	$(CC) $(OBJ_FILES) $(LDFLAGS) -o $(OUTPUTDIR)/$(PROJECT).elf

We see here that any output that we are creating as a result of the compilation (.elf, .hex, .bin) is going to end up in $(OUTPUTDIR). Further, we see that our "all" target asks the Makefile to create both a bin file and a hex file along with two other targets called "stats" and "dump". These are just scripts that execute the "size" and "objdump" commands on our bin file.

Using Teensyduino without compiling everything

This was by far the most frustrating part to get working. Everything about the makefiles was readily available online, with some serious googling. However, getting things to actually compile was a little different story.

The thing that makes this complex is the fact that it seems the Teensyduino libraries were not designed to be used independently of each other. I will cover, in order, what steps I had to take in order to get this to work.

The most important file we need is called "mk20dx128.c". This sets up a lot of things relating to interrupts along with the Phase-Locked Loop (PLL) which controls the speed of the Teensy's processor. Without this configuration, we don't get interrupts and the processor runs at a pitiful 16MHz. The only problem is that "mk20dx128.c" references a few functions that are either part of the standard library and not used often (making them difficult to search for) or are defined in other files, increasing our dependency count.

My first mistake was explicitly using the linker to link all of my object files (wait...aren't we supposed to use the linker? Read on.). Since arm-none-eabi is not dependent on a specific architecture, it doesn't know which standard library (libc) to use. This results in an undefined reference to "__libc_init_array()", a function used during the initialization phase of a program which is not often invoked in code outside the standard library itself. mk20dx128.c uses this function in its custom startup code which prepares the processor for running our program. To solve this, I wanted to tell the linker that I was using a cortex-m4 cpu so that it would know which libc to include and thereby resolve the reference. However, this proved difficult to do when directly invoking the linker. Instead, I took a hint from the Makefile that comes with Teensyduino and used the following command to link the objects:

$(CC) $(OBJ_FILES) $(LDFLAGS) -o $(OUTPUTDIR)/$(PROJECT).elf

Which more or less translates to (using our example from earlier):

arm-none-eabi-gcc obj/file1.o obj/file2.o obj/etc.o obj/mk20dx128.o $(LDFLAGS) -o bin/$(PROJECT).elf

We would have thought that we should be using arm-none-eabi-ld instead of arm-none-eabi-gcc. However, by using arm-none-eabi-gcc I was able to pass the argument "-mcpu=cortex-m4" which then allowed GCC to instruct the linker which standard library to use. Wonderful, right? So all of our problems are solved? Not yet.

The next thing is that mk20dx128.c has a lot of external dependencies. It uses a function defined in pins_teensy.c which in turn requires functions defined in both analog.c and usb_dev.c which opens another can of worms. Ugh. I didn't want this many dependencies and I couldn't see a way to escape compiling nearly the entire Teensyduino library just to run my simple blinking program. Then, it dawned on me: I could use the same technique that mk20dx128.c uses to define its ISRs to "define" the functions that pins_teensy.c was calling that I didn't really want. So, I made a file called "shim.c" which contained the following:

void unused_void(void) { }

void usb_init(void) __attribute__ ((weak, alias("unused_void")));

I decided that I would include "yield.c" and "analog.c" since those weren't too big. This left just the usb stuff. The only function that was actually called from pins_teensy.c was "usb_init". What the above statement says to the compiler is "I am defining usb_init(void) here (which points to unused_void(void)) unless you find another definition of usb_init(void) somewhere". The "weak" attribute turns what would normally be a "strong" definition of usb_init into a "weak" one: it still satisfies references to usb_init at link time, but the linker will quietly discard it if a strong definition shows up in another object file. Sidenote: A program can have any number of weak definitions of a specific function/variable, but only one strong definition of that function/variable. The "alias" attribute allows us to say "when I say usb_init I really mean unused_void". The end result of this is that if nobody defines usb_init(void) anywhere, as would be the situation if I were to decide not to include usb_dev.c, any calls to usb_init(void) will actually call unused_void(void). However, if somebody did define usb_init(void), my definition of usb_init would be ignored in favor of using their definition. This lets me include usb support in the future if I wanted to. Isn't that cool? That fixed all of my reference issues and let me actually build the project.
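If you want to see the weak alias trick in action on a desktop machine before trying it on the Teensy, here is a small sketch. It assumes GCC (or a compatible compiler) targeting ELF, since the weak and alias attributes are compiler extensions. The counter and the init_and_count_stub_calls helper are made up for this demonstration so that the effect is observable; the real shim.c just has an empty function body.

```c
#include <assert.h>

/* Counts how often the do-nothing stub runs. This exists only so the
 * demo has something to observe; it is not part of the original shim. */
static int stub_calls = 0;

void unused_void(void) { stub_calls++; }

/* Weak alias: usb_init resolves to unused_void unless some other object
 * file supplies a strong definition of usb_init at link time. */
void usb_init(void) __attribute__ ((weak, alias("unused_void")));

/* Hypothetical stand-in for startup code that calls usb_init(). */
int init_and_count_stub_calls(void)
{
    usb_init();
    /* Holds only while the weak stub is the definition that got linked. */
    assert(stub_calls > 0);
    return stub_calls;
}
```

With only this file linked in, calling init_and_count_stub_calls() runs the stub. If a second source file with a regular (strong) void usb_init(void) were added to the build, the linker would pick that one instead and the stub would never run.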


Armed with my new Makefile and a better understanding of how the Teensy 3.1 works from a software perspective, I managed to compile and upload my "blinky" program which just blinks the onboard LED (pin 13) on and off every 1/4 second. The overall program size was 3% of the total space, which is much more reasonable compared to the 10-20% it was taking when compiled using the Arduino IDE.

Again, all files from this escapade can be found here:

First thoughts on the Teensy 3.1

Wow it has been a while; I have not written since August.

I entered a contest of sorts this past week which involves building an autonomous turret which uses an ultrasonic sensor to locate a target within 10 feet and fire at it with a tiny dart gun. The entire assembly is to be mounted on servos. This is something my University is doing as an extra-curricular for engineers and so when a friend of mine asked if I wanted to join forces with him and conquer, I readily agreed.

The most interesting part to me, by far, is the processor to be used. It is going to be a Teensy 3.1:

This board contains a Freescale ARM Cortex-M4 microcontroller along with a smaller non-user-programmable microcontroller for assistance in the USB bootloading process (the exact details of that interaction are mostly unknown to me at the moment). I have never used an ARM microcontroller before and never a microcontroller with as many peripherals as this one has. The datasheet is 1200 pages long and is not really even being very verbose in my opinion. It could easily be 3000 pages if they included the level of detail usually included in AVR and PIC datasheets (code examples, etc). The processor runs at 96MHz as well, making it the most powerful embedded computer I have used aside from my Raspberry Pi.

The Teensy 3.1 is Arduino-compliant and is designed that way. However, it can also be used without the Arduino software. I have not used an Arduino before since I rather enjoy using microcontrollers in a bare-bones fashion. However, it is becoming increasingly difficult for me to experiment with the latest in microcontroller developments using breadboards since more and more of the packages are surface mount only.

The Arduino IDE

Oh my goodness. Worst ever. Ok, not really, but I really have a hard time justifying using it other than the fact that it makes downloading to the Teensy really easy. This post isn't meant to be a review of the arduino IDE, but the editor could use some serious improvements IMHO:

  • Tab indentation level: Some of us would like to use something other than 2 spaces, thank you very much. We don't live in the 70's where horizontal space is at a premium and I prefer 4 spaces. Purely personal preference, but I feel like the option should be there
  • Ability to reload files: The inability to reload the files and the fact that it seems to compile from a cache rather than from the file itself makes the arduino IDE basically incompatible with git or any other source control system. This is a serious problem, in my opinion, and requires me to restart the editor frequently whenever I check out a different branch.
  • Real project files: I understand the aim for simplicity here, but when you have a chip with 256KB of flash on it, your program is not going to be 100 lines and fit into one file. At the moment, the editor just takes everything in the directory and compiles it by file extension. No subdirectories and every file will be displayed as a separate tab with no way to close it. I am in the habit of separating my source and not having the ability to structure my files how I please really makes me feel hampered. To make matters worse, the IDE saves the original sketch file (which is just a cpp file that will be run through their preprocessor) with its own special file extension (*.ino) which makes it look like it should be a project file, but in reality it is not.

There are a few things I do like, however. I do like their library of things that make working with this new and foreign processor rather easy. I also like that their build system is very cross-platform and easy to use.

First impression of the processor

I must first say that the level of work that has gone into the surrounding software (the header files, the teensy loader, etc) truly shows and makes it a good experience to use the Teensy, even if the Arduino IDE sucks. I tried a Makefile approach using Code::Blocks, but it was difficult for me to get it to compile cross-platform and I was afraid that I would accidentally overwrite some bootloader code that I hadn't known about. So, I ended up just going with the Arduino IDE for safety reasons.

The peripherals on this processor are many and it is hard at times to figure out basic functions, such as the GPIO. The manual for the peripherals is in the neighborhood of 60 chapters long, with each chapter describing a peripheral. So far, I have messed with just the GPIOs and pin interrupts, but I plan on moving on to the timer module very soon. This project likely won't require the DMA or the variety of onboard bus modules (CAN, I2C, SPI, USB, etc), but in the future I hope to have a Teensy of my own to experiment on. The sheer number of registers combined with the 32-bit width of everything is a totally new experience for me. Combine that with the fact that I don't have to worry as much about the overhead of using certain C constructs (struct and function pointers for example) and I am super duper excited about this processor. Tack on the stuff that PJRC created for using the Teensy, such as the nice header files and the overall compatibility with some Arduino libraries, and I have had an easier time getting this thing to work than with most of the other projects I have done.


Although the Teensy is for a specific contest project right now, at the price of $19.80 for the amount of power that it gives, I believe I will buy one for myself to mess around with. I am looking forward to getting more familiar with this processor and, although I resent the IDE I have to work with at the moment, I hope to find better compilation options that will let me leave the Arduino IDE behind.

Pop 'n Music controller...AVR style

Every time I do one of these bus emulation projects, I tell myself that the next time I do it I will use an oscilloscope or DLA. However, I never actually break down and just buy one. Once more, I have done a bus emulation project flying blind. This is the harrowing tale:

Code & Schematics (kicad):


A couple of days ago, I was asked to help do some soldering for a modification someone was trying to do to a PS1 controller. He informed me that it was for the game Pop 'n Music and that it required a special controller to be played properly. Apparently, official controllers can sell for $100 or more, so modifying an existing controller was the logical thing to do. After much work and pain, it was found that while modifying an existing controller was easy, it wasn't very robust and could easily fall apart and so I built one using an ATMega48 and some extra components I had lying around. The microcontroller emulates the PSX bus which is used to communicate between the controller and the playstation/computer. As my reference for the bus, I used the following two web pages:

The complete schematics and software can be found on my github.

The first attempt: Controller mod

The concept behind the controller mod was simple: Run wires from the existing button pads to some arcade-style buttons arranged in the pattern needed for the controller. It worked well at first, but after a little while we began to have problems:

  • The style of pad that he purchased had conductive rubber covering all of the copper for the button landings. In order to solder to this, it was necessary to scrape off the rubber. This introduced a tendency for partially unclean joints, giving rise to cold connections. While with much effort I was able to mitigate this issue (lots of scraping and cleaning), the next problem began to manifest itself.
  • The copper layout for each button pad was of a rather minimalist design. While some pads shown online had nice large areas for the button to contact, this particular controller had 50-100 mil lines arranged in a circular pattern rather than one huge land. While I imagine this is either economical or gives better contact, it sure made soldering wires onto it difficult. I would get the wire soldered, only to have it decide that it wanted to come off and take the pad with it later. This was partly due to bad planning on my part and using wire that wasn't flexible enough, but honestly, the pads were not designed to be soldered to.
  • With each pad that lifted, the space available for attaching wires to certain buttons got smaller and smaller. Some buttons were in the large land style and were very easy to solder to and the joints were strong (mainly the arrow pad buttons). The issue was with the start and select buttons (very narrow) and the X, square, triangle, and O buttons (100mil spiral thing mentioned earlier). Eventually, I was resorting to scraping the solder mask off and using 30 AWG wire-wrap wire to solder to the traces. It just got ridiculous and wasn't nearly strong enough to hold up as a game controller.
  • In order for the controller to be used with a real playstation, rather than an emulator, the Left, Right, and Down buttons had to be pressed at the same time to signify to the game that it was a Pop 'n Music controller. Emulators generally can't handle this sort of behavior when mapping the buttons around, so adding a switch was considered. However, any reliable switch (read: Nice shiny toggle switch) was around $3. Given the low cost aim of this project, it was becoming more economical to explore other options.

So, we began exploring other options. I found this site detailing emulation of these controllers using either 74xx logic or a microcontroller. It is a very good resource, and is mostly correct about the protocol. After looking at the 74xx logic solution and totaling up the cost, I noticed that my $1.75 microcontroller along with the required external components would actually come out to be cheaper than buying 4 chips and sockets for them. Even better, I already had a microcontroller and all the parts on hand, so there was no need for shipping. So, I began building and programming.

AVR PSX Bus Emulation: The Saga of the Software

PSX controllers communicate using a bus that has a clock, acknowledge, slave select, psx->controller (command) line, and controller->psx (data) line. Yes, this looks a lot like an SPI bus. In fact, it is more or less identical to an SPI Mode 3 bus with the master-in slave-out line driven open collector. I failed to notice this fact until later, much to my chagrin. Communication is accomplished using packets that have a start signal followed by a command and then a wait for a response from the controller. During the transaction, the controller declares its type, the number of words that it is going to send, and the actual controller state. I was emulating a standard digital controller, so I had to tell the host that my controller type was 0x41, which is digital with 1 word of data. Then, I had to send a 0x5A (start data response byte) and two bytes of button data.

My initial approach involved writing a routine in C that would handle pin changes on INT0 and INT1, which would be connected to the command and clock lines. However, I failed to anticipate that the bus would run somewhere in the neighborhood of 250kHz-500kHz. This caused some serious performance problems and I was unable to complete a transaction with the controller host. So, I decided to try writing the same routine in assembly to see if I could squeeze every possible drop of performance out of it. I managed to actually get it to complete a transaction this way, but without sending button data. To make matters worse, every once in a while it would miss a transaction, which was quite noticeable when I made an LED change state with every packet received. I eventually realized that making the microcontroller do so much between cycles of the clock line caused it to miss bits. So, I looked at the problem again.

I noticed that the ATMega48A had an SPI module and that the PSX bus looked similar to, but not exactly like, an SPI bus. However, running the SPI module in mode 3 with the data order reversed and the MISO pin driving the base of a transistor operating in an open-collector fashion got me communicating on the PSX bus on almost the first try. Even better, the only software change that had to be made was inverting the data byte so that the signal hitting the base of the transistor would cause the correct changes on the MISO line. So, I hooked up everything as follows:


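For reference, the digital-controller reply described above (0x41, then 0x5A, then two bytes of button data, each inverted before it reaches the transistor) can be sketched in plain C. This is only an illustration and not the firmware from my github: every name here is made up for the example, and the button bit positions and the "0 means pressed" convention are assumptions taken from commonly circulated descriptions of the protocol rather than anything stated in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values named in the post. */
enum { PSX_ID_DIGITAL = 0x41, PSX_DATA_START = 0x5A };

/* ASSUMED bit layout of the first button byte (1 = pressed in this mask). */
enum {
    PSX_BTN_SELECT = 1u << 0,
    PSX_BTN_START  = 1u << 3,
    PSX_BTN_UP     = 1u << 4,
    PSX_BTN_RIGHT  = 1u << 5,
    PSX_BTN_DOWN   = 1u << 6,
    PSX_BTN_LEFT   = 1u << 7
};

/* Left + Right + Down held together, the combination a real console
 * uses to recognize a Pop 'n Music controller. */
#define POPN_ID_MASK (PSX_BTN_LEFT | PSX_BTN_RIGHT | PSX_BTN_DOWN)

/* Fill out[] with the four reply bytes for a digital controller.
 * pressed_mask uses 1 = pressed; the wire format is ASSUMED active low. */
static size_t psx_build_reply(uint16_t pressed_mask, uint8_t out[4])
{
    assert(out != NULL);
    uint16_t wire = (uint16_t)~pressed_mask;
    out[0] = PSX_ID_DIGITAL;          /* controller type: digital, 1 word */
    out[1] = PSX_DATA_START;          /* start-of-data response byte */
    out[2] = (uint8_t)(wire & 0xFF);  /* first button byte */
    out[3] = (uint8_t)(wire >> 8);    /* second button byte */
    return 4;
}

/* The transistor on MISO inverts the signal, so each byte is inverted
 * once more in software before being handed to the SPI hardware. */
static uint8_t psx_invert_for_transistor(uint8_t b)
{
    return (uint8_t)~b;
}
```

On the real part, each byte returned by psx_invert_for_transistor() would be loaded into the SPI data register ahead of the next transfer. Keeping the packet building separate from the hardware access makes the logic easy to check on a PC, which is handy when you have no oscilloscope.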
After doing that, suddenly I got everything to work. It responded correctly to the computer when asked about its inputs and, after some optimization, stopped skipping packets due to taking too much time processing button inputs. It worked! Soon after getting the controller to talk to the computer, I discovered an error in the website I mentioned earlier that detailed the protocol. It stated that the control line would be left high during transmission of the button data. While it's a minor difference, I thought I might as well mention this site, which lists the commands correctly and was very helpful.

As I mentioned before, one problem that was encountered was that in order for the controller to be recognized as a Pop 'n Music controller by an actual playstation, the left, right, and down buttons must be pressed. However, it seems that the PSX->USB converter that we were using was unable to handle having those 3 pressed down at once. So, there needed to be a mode switch. The way I came up with for switching modes was to hold down both start and select at the same time for 3 seconds. After the delay, the modes would switch. The user interface for this is embodied in two LEDs. One LED is lit when it is in PSX mode and the other when it is in emulator mode. When both buttons are pressed, both LEDs light up until the one for the previous mode shuts off.

At first, I had the controller start up in the same mode every time, no matter what mode it was in before it was shut off. It soon became apparent that this wouldn't do, and so I looked into using the EEPROM to store the flag value I was using to keep the state of the controller. Strangely, it worked on the first try, so the controller now starts in whatever mode it was in when it was last shut off. My only fear is that switching the mode too much could degrade the EEPROM. However, the datasheet says that it is good for 100,000 erase/write cycles, so I imagine it would be quite a while before this happens and other parts of the controller will probably fail first (like the switches).
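The mode switch described above boils down to a small state machine. Here is a minimal sketch of the idea in portable C, not the actual firmware: it assumes the function is called once per millisecond, all of the names are hypothetical, and a plain variable stands in for the EEPROM byte so it can run on a PC. On the AVR that variable would be replaced with avr-libc's eeprom_read_byte() and eeprom_update_byte().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HOLD_TICKS 3000u  /* 3 seconds at an ASSUMED 1 ms tick */

typedef enum { MODE_PSX = 0, MODE_EMULATOR = 1 } pad_mode_t;

typedef struct {
    pad_mode_t mode;
    uint16_t held_ticks;  /* how long start+select have been held */
    bool latched;         /* true once this hold has already toggled */
} mode_switch_t;

/* Stand-in for one EEPROM byte; an erased cell reads 0xFF. */
static uint8_t fake_eeprom_cell = 0xFF;

/* Restore the saved mode at power-up; anything unexpected means PSX. */
static void mode_switch_init(mode_switch_t *s)
{
    assert(s != NULL);
    s->mode = (fake_eeprom_cell == MODE_EMULATOR) ? MODE_EMULATOR : MODE_PSX;
    s->held_ticks = 0;
    s->latched = false;
}

/* Call once per tick. Returns true on the tick where the mode flips. */
static bool mode_switch_tick(mode_switch_t *s, bool start, bool select)
{
    if (!(start && select)) {      /* released: re-arm for the next hold */
        s->held_ticks = 0;
        s->latched = false;
        return false;
    }
    if (s->latched)                /* still holding after a toggle */
        return false;
    if (++s->held_ticks < HOLD_TICKS)
        return false;

    s->mode = (s->mode == MODE_PSX) ? MODE_EMULATOR : MODE_PSX;
    fake_eeprom_cell = (uint8_t)s->mode;  /* one write per toggle */
    s->latched = true;
    return true;
}
```

The latch is what keeps the EEPROM writes down: holding the buttons for a full minute still costs only a single write, so the 100,000 cycle budget is spent once per deliberate mode change and never once per tick.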

On to the hardware!

I next began assembly. I went the route of perfboard with individual copper pads around each hole because that's what I have. Here are photos of the assembly, sadly taken on my cell phone because my camera is broken. Sorry for the bad quality...

0810131701.jpg 0810131746.jpg 0810131753.jpg 0810131809.jpg 0810131954.jpg 0811131258a.jpg 0812132143.jpg


So, with the controller in the box and everything assembled, it seems that all will be well. It doesn't seem to miss keypresses or freeze and is able to play the game without too many hiccups (the audio makes it difficult, but that's just an emulator tweaking issue). The best part about this project is that in terms of total work time, it probably took only about 16 hours. Considering that most of my projects take months to finish, this easily takes the cake as one of my quickest projects start to finish.