I. Introduction

If you read a bit about Adlib and the OPL1 or OPL2 chips, many manuals and texts tell you that you have to wait a number of cycles after a register-select write to the chip, and even more cycles after a register-value write. I built my Edlib music player for the C64 on that idea too.

When I read through the music player code for the MSX game Xak (which also uses an OPL chip: the Y8950, an OPL1, or the YM2413, the poor man's version of the OPL2), I noticed that the programmers waited only briefly after the register-select write to the chip and did not put any extra waiting loop after the register-value write. See below for the code. So I set out to find the minimum wait needed to use the OPL on the C64, leaving more cycles for other code.

Here's the MSX Z80 code:

CEA3:   push   af         		// push af 												*** change PSG value of register B to A
CEA4:   ld     a,b        		// store b in a
CEA5:   out    (#a0),a    		// select register a in PSG
CEA7:   pop    af         		// get af back 
CEA8:   out    (#a1),a    		// store a (value for register to change to)
CEAA:   ret     				// done, return to caller           
CEAB:   ld     a,b        		// switch register number in b to a to select it		*** read register value of PSG register B and store it in A
CEAC:   out    (#a0),a    		// select register a of PSG
CEAE:   in     a,(#a2)    		// read the value of the selected register through port A2
CEB0:   ret      				// done, return to caller          
CEB1:   push   bc         		// store BC												*** change MSX-MUSIC (OPLL) register B to value in A
CEB2:   push   af         		// store AF
CEB3:   ld     a,b        		// copy the register number from B into A
CEB4:   ld     bc,(#cec1) 		// port 7C is stored at $cec1 (OPLL register select)
CEB8:   out    (c),a      		// select OPLL register (now in A)
CEBA:   nop               		// wait 5 cycles (rather: t-states of 0.28 microseconds = ~ 1.4 microseconds)
CEBB:   inc    c          		// increase c (another 5 cycles = 1.4 microseconds)
CEBC:   pop    af         		// get af value back ( 11 T-states = 3.07 microseconds)
CEBD:   out    (c),a      		// set OPLL register to value now in A (14 t-states on MSX, 12 nominal;
                          		// the first 7 can be considered "waiting", adding 1.95 microseconds)
CEBF:   pop    bc         		// get bc back (11 t-states)
CEC0:   ret               		// return to caller

 

I realized that what the coders did there made sense: the player usually does other work before writing to an OPL register again, which causes a natural delay.

Reading through the Game Engine Black Book: Wolfenstein 3D v2.1, I came to the Adlib programming section. Apparently, manuals for the old PC Adlib cards first recommended sending data to the chips "as fast as possible"; the PC CPU at the time ran at 4.77 MHz and could not outpace the card. It was only when CPUs became faster that the architecture of the cards required the user to wait a bit (and then more) before sending the next data.

However, an MSX runs on a Zilog Z80 at 3.58 MHz, and the 6510 in a Commodore 64 runs at about 1 MHz. Both are clocked slower than the CPUs (4.77 MHz) in the PCs at the time of the Adlib sound card, so in theory we should be fine sending data to the chip "as fast as possible", as Adlib recommended back then. Therefore, we can probably get away with much shorter delays when writing to the chip (as the MSX code apparently did), or even none at all.

Of note, in the Adlib manual it was later written: 

Wait 3.3 microseconds for the address, then 23 microseconds for the data. 

That was based on the YM3812 Application Manual. The version of 1994.6 states on page 6 the following: 

When an address or data is written to an internal register of the OPLII the following wait period is required before the next operation is performed. The CPU generates the wait period shown in the table for the OPLII. Data integrity cannot be assured if this wait period is ignored.

In master clock cycles: Address write mode, 12. Data write mode, 84 cycles. 

The YM3812 dances to the tune of the master clock crystal that fires away at 3.58 MHz, which means the chip runs at 3.58 master clock cycles per microsecond (for comparison, a 1 MHz clock gives exactly 1 cycle per microsecond). Therefore, the CPU that controls the YM3812 is advised to wait 12/3.58 = ~3.3 microseconds after the address write, and 84/3.58 = ~23.5 microseconds after the data write. (A PAL C64 performs about 0.985 cycles per microsecond, so these waits correspond to roughly 3.3 and 23.1 CPU cycles.)
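To make that arithmetic easy to re-run for other clocks, here is a minimal Python sketch of the conversion. It is not part of any player code; the function names are made up for illustration, and the PAL CPU clock of 0.985 MHz is an assumed, rounded value.

```python
import math

MASTER_CLOCK_MHZ = 3.58    # YM3812 master clock, as quoted in the text
PAL_CPU_CLOCK_MHZ = 0.985  # assumption: PAL C64 CPU clock, rounded


def master_cycles_to_us(master_cycles, clock_mhz=MASTER_CLOCK_MHZ):
    """Convert YM3812 master clock cycles to microseconds."""
    return master_cycles / clock_mhz


def us_to_whole_cpu_cycles(microseconds, cpu_mhz=PAL_CPU_CLOCK_MHZ):
    """Smallest whole number of CPU cycles that covers the given time."""
    return math.ceil(microseconds * cpu_mhz)


address_wait_us = master_cycles_to_us(12)  # wait after an address write
data_wait_us = master_cycles_to_us(84)     # wait after a data write

if __name__ == "__main__":
    for label, us in (("address", address_wait_us), ("data", data_wait_us)):
        print(f"{label}: {us:.2f} us, "
              f"{us_to_whole_cpu_cycles(us)} whole PAL C64 cycles")
```

Since the CPU can only wait in whole cycles, the two waits round up to 4 and 24 PAL cycles respectively.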

Now let's take a look at what that means for the C64. 

II. Factory write limits applied to the C64

Address writes: 

After doing a STX $DF40 (register select, or address write), for example, we might need to wait about 3.3 microseconds for the address write to complete. Two NOP instructions (2 cycles each) would do. Then again, any other instruction following the STX $DF40 also takes cycles. Suppose Y holds the data to be written to the chip. Following the initial STX $DF40 directly with STY $DF50 (data write) already gives us 3 cycles of "waiting" before the data is written. This is because STY takes 4 cycles in total, and it is only in the fourth cycle that the data is written to address $DF50 and the IOL2 line goes low during ø2 (which drives chip select, CS, low and activates the YM3812). So either we take our chances and leave out the NOPs altogether, or we add one NOP at most.

If you still need to fetch the data value from some table in memory, that fetch alone already gives you enough waiting time, and no additional delay is needed before you store the value in the chip register!

Looking at the MSX Z80, each cycle (or T-state, as they are called) takes about 0.28 microseconds. In the code above, I show how many T-states pass before the data write reaches the OPL chip: I roughly calculate 1.4 + 1.4 + 3.07 + 1.95 = 7.82 microseconds! If we simply need to satisfy the 3.3 microseconds demanded by Yamaha, then the MSX code did not need the extra NOP at all.
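The tally can be checked with a few lines of Python. This is only a sketch: the T-state counts are the ones from the comments in the listing above, and the dictionary and function names are invented for this illustration.

```python
Z80_CLOCK_MHZ = 3.58  # MSX Z80 clock, as quoted in the text

# T-states between the register-select OUT and the data OUT,
# taken from the comments in the listing (MSX timing).
T_STATES = {
    "nop": 5,
    "inc c": 5,
    "pop af": 11,
    "out (c),a (first part)": 7,
}

# Yamaha's advised wait after an address write: 12 master clock cycles.
YAMAHA_ADDRESS_WAIT_US = 12 / 3.58


def gap_us(t_states, skip=()):
    """Microseconds spent on the listed instructions, minus any skipped ones."""
    total = sum(n for name, n in t_states.items() if name not in skip)
    return total / Z80_CLOCK_MHZ


full_gap = gap_us(T_STATES)                        # gap with the NOP
gap_without_nop = gap_us(T_STATES, skip=("nop",))  # gap if the NOP is dropped

if __name__ == "__main__":
    print(f"with NOP:    {full_gap:.2f} us")
    print(f"without NOP: {gap_without_nop:.2f} us")
    print(f"advised:     {YAMAHA_ADDRESS_WAIT_US:.2f} us")
```

With these counts the gap comes to about 7.8 microseconds, and about 6.4 microseconds without the NOP, which is still well above the advised 3.3.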

Data writes:

On the C64 we'd need to wait approximately 24 cycles to meet the advised 23 microseconds. We could do that with 12 NOPs, or other cycle-wasting opcodes. In my first released Edlib player code, I wait 43 cycles in total before returning to the subroutine caller, by adding a wait loop of 35 cycles after the data store at $DF50! Totally overdid it there. The RTS alone takes 6 cycles, leaving only 18 more cycles to be covered by the code that follows. Looking at that code in detail, most of the time it takes 48 cycles before the next write to the chip is done! The wait time I implemented is totally unneeded, as the coders of the MSX game above realized, of course. Silly me.
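The bookkeeping here is simple enough to put in a small helper. The following Python sketch (hypothetical names, not taken from the Edlib player) computes how many padding cycles, or 2-cycle NOPs, are still needed once you know how many cycles your code naturally spends between two data writes.

```python
REQUIRED_DATA_WAIT_CYCLES = 24  # approx. 23 microseconds on the C64
NOP_CYCLES = 2                  # a 6502/6510 NOP takes 2 cycles


def padding_cycles(natural_cycles, required=REQUIRED_DATA_WAIT_CYCLES):
    """Cycles still missing after the code's natural delay (never negative)."""
    return max(0, required - natural_cycles)


def nops_needed(natural_cycles, required=REQUIRED_DATA_WAIT_CYCLES):
    """Number of NOPs that cover the missing cycles, rounded up."""
    pad = padding_cycles(natural_cycles, required)
    return (pad + NOP_CYCLES - 1) // NOP_CYCLES


if __name__ == "__main__":
    for natural in (0, 6, 48):  # nothing, just an RTS, typical player flow
        print(f"{natural:2d} natural cycles -> "
              f"{padding_cycles(natural):2d} padding cycles, "
              f"{nops_needed(natural):2d} NOPs")
```

With no natural delay this gives the 12 NOPs mentioned above, with only the RTS it gives 18 cycles (9 NOPs), and with 48 cycles of ordinary player code no padding is needed at all.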

II. Pushing the limits - Minimum cycles needed for register data write 

However, heeding the notion that these master clock wait cycles may have been added only when CPUs became faster, I wanted to know if we can push back the need to wait on the C64 a bit, since we do need all the raster time we can get, right? To test this, I wrote a few routines and examined the effect of delay (from 0 to higher cycle delays) on the output of the FM-YAM. 

For testing I used one of my C64Cs with an FM-YAM plugged in. VICE is not a good emulator for testing hardware timing, nor does it fully emulate the wiring of the Sound Expander/FM-YAM. In VICE the YM3812 has no delay and no latency, so its use here is limited. It can be used to hear what limited latency might sound like, though.

Experiment: Voice on/off rapid succession

For this experiment I wrote a test routine to listen to the output of Voice 0 as I would quickly turn it off (and on) for a part of the current frame. I initialized the chip and provided Voice 0 with enough parameters to produce a sound.  Using OPL register $B0, I could turn on/off the voice by setting bit 5 to 1 (on) or 0 (off).

STY $DF50 ; turn voice off
<delay of n cycles>
STX $DF50 ; turn voice on 

FLD was used to stop the VIC from triggering bad lines and stealing the 6510's cycles for a fixed number of rasterlines (during which the voice would be on), and on the last FLDed rasterline FM Voice 0 would be turned off. A SEI disabled maskable interrupts.

This way I was able to ascertain that the borderline delay between turning the voice off and on, as shown in the code above, was 15 cycles for the sound to reach a stable buzz. 16 cycles was the sweet spot for the stable buzz.

Based on this, on the C64, I'd say the minimum wait time between two fast YM3812 register data writes would be 16 cycles (= about 16.24 microseconds) + 3 cycles of the STX = 19 cycles (19.29 microseconds). 
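For reference, the cycle-to-microsecond conversion behind these figures can be sketched in Python as follows. The 0.985 MHz PAL clock is an assumed, rounded value and the names are made up for this example.

```python
PAL_CPU_CLOCK_MHZ = 0.985  # assumption: PAL C64 CPU clock, rounded


def cpu_cycles_to_us(cycles, cpu_mhz=PAL_CPU_CLOCK_MHZ):
    """Convert C64 CPU cycles to microseconds."""
    return cycles / cpu_mhz


MEASURED_DELAY_CYCLES = 16   # delay found in the experiment
STX_CYCLES_BEFORE_WRITE = 3  # cycles of the STX before its write cycle

total_cycles = MEASURED_DELAY_CYCLES + STX_CYCLES_BEFORE_WRITE
measured_us = cpu_cycles_to_us(total_cycles)

# Yamaha's advised wait after a data write: 84 master clock cycles at 3.58 MHz.
datasheet_us = 84 / 3.58

if __name__ == "__main__":
    print(f"{total_cycles} cycles = {measured_us:.2f} us "
          f"(datasheet advises {datasheet_us:.2f} us)")
```

So the measured minimum sits somewhat below the advised data-write wait, at least for this particular key on/off test.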

Of course, 19 cycles is not a whole lot, so depending on the player code you create, you may not need any hard-coded delays and can instead use the natural flow of your code to delay the data writes to the OPL.

Sidenote: I also tried changing the frequency of Voice 0 briefly (instead of turning the voice off), while keeping a different, steady frequency for the main part. 19.29 microseconds is not enough to hear any change in frequency. In fact, much longer delays are needed to hear any difference, which may be due to the internal processing in the chip: even if the register write itself requires at least 19 cycles, the OPL chip needs more time to actually produce the FM sound for output based on the new frequency, including sending it to the YM3014 for DA conversion. FYI, the YM3812 uses just two ROMs that are larger than 16 bytes for the internal FM sound generation: a log-transformed sine waveform table of 256 samples, and an exponential table, also 256 samples long. More master clock cycles would be needed to produce the output from the register write; my routine didn't wait for that.

III. About CPU-YM3812 timing

Looking at the datasheets for the 6510 and YM3812 and at the architecture, I have calculated that the time available for a register write to get to the YM3812 is ~369 ns. This is more than enough to get the information to the chip. CS (chip select) goes low only if IOL2 goes low while ø2 is high. Remember that ø2 is the phase in the C64 timing where the CPU is allowed to use the address bus, which is usually around 450 ns of the roughly 1.015 microsecond cycle (PAL, 0.985 MHz). According to Frank Buss's timing diagram of using IOL2 and writing data to it, IOL2 goes low around 70 ns after ø2 kicks in. This is when CS can go low on the OPL and its processes are activated. By that time, the address is already stable on the bus, and WR has also reached stability. Data seems to become available around 11 ns later. Ergo: 450 - 70 - 11 = 369 ns. As said, this is more than enough, even including the propagation time through the C64 mainboard and that of the FM-YAM to the OPL.

IV. About Port addresses

The manuals on the Sound Expander and FM-YAM will tell you that there are three ports to be used in IOL2 space:

1. DF40 (register select/address select) - write only

2. DF50 (write contents to register/data write) - write only 

3. DF60 (chip status) - read only

Just so you are aware, the logic and architecture of the cartridges do not really care about those exact addresses. In particular, A4 of the address bus is directly connected to A0 on the chip. A5 is indirectly connected to RD on the chip, via logic with R/W and ø2. All other address "wires" are not used. Thus, within IOL2 space only the state of bit 4 and bit 5 of the address matters (in decimal, 16 and 32; $10 and $20 in hex). You might as well use DF00, DF10 and DF20 instead of the ones mentioned above. Or DF07, DF55 and DF6A. Or whatever! Just make sure A4 and A5 are both 0 for a register select write, bit 4 is set for a register data write, and bit 5 is set to read the status register.
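The decoding rule can be written down as a tiny Python sketch. This only models the description above, not the actual cartridge logic; the function name is hypothetical, and the case where both bits are set is not covered by the text, so the sketch simply reports it as unspecified.

```python
A4_MASK = 0x10  # address bit 4 (decimal 16)
A5_MASK = 0x20  # address bit 5 (decimal 32)


def port_function(address):
    """Classify an address in IOL2 space ($DF00-$DFFF) by bits 4 and 5."""
    if not 0xDF00 <= address <= 0xDFFF:
        raise ValueError("address is outside IOL2 space ($DF00-$DFFF)")
    a4 = bool(address & A4_MASK)
    a5 = bool(address & A5_MASK)
    if a4 and a5:
        return "unspecified"  # assumption: this combination is not described
    if a5:
        return "status read"
    if a4:
        return "data write"
    return "register select"


if __name__ == "__main__":
    for addr in (0xDF40, 0xDF50, 0xDF60, 0xDF00, 0xDF10, 0xDF20,
                 0xDF07, 0xDF55, 0xDF6A):
        print(f"${addr:04X}: {port_function(addr)}")
```

Running it shows that each alternative address from the paragraph above lands on the same function as the documented DF40, DF50 and DF60 ports.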

VICE doesn't know this; I noticed that it requires you to use the ports exactly as mentioned above.

Anyway, I hope this topic was of interest to some of you. For me it was certainly very helpful to dive into it at the level that I did.

Cheers, 

Mr.Mouse/XeNTaX

23rd of December 2019