October 28, 2023
My code is now here. I’ve changed some things since last time, but the model is mostly the same. The most significant changes are:
I now have an embedding for each token’s position in its chord, in addition to the embeddings for pitch and duration. (A “chord” is a group of notes that start simultaneously, along with the subsequent “time passage” token.)
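In code, this is just a third learned embedding summed into each token’s input representation. A minimal sketch with illustrative names and sizes (not the actual vocabulary sizes or model width I use):

```python
import torch.nn as nn

class NoteEmbedding(nn.Module):
    """Input embedding: sum of pitch, duration, and within-chord position."""
    def __init__(self, d_model=256, n_pitches=128, n_durations=64, max_chord_size=16):
        super().__init__()
        self.pitch = nn.Embedding(n_pitches, d_model)
        self.duration = nn.Embedding(n_durations, d_model)
        # Index of the token within its chord (0 for the first note, and so on).
        self.chord_pos = nn.Embedding(max_chord_size, d_model)

    def forward(self, pitch_ids, duration_ids, chord_pos_ids):
        return self.pitch(pitch_ids) + self.duration(duration_ids) + self.chord_pos(chord_pos_ids)
```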
The first block can only attend to the last few tokens. So it’s strictly local attention.
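Concretely, that first block uses an ordinary causal mask restricted to a small window; something like this, with an arbitrary window size:

```python
import torch

def local_causal_mask(seq_len: int, window: int = 8) -> torch.Tensor:
    """Boolean mask where True marks pairs that may attend: each query sees
    itself and at most `window - 1` tokens immediately before it."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

# This follows the convention of F.scaled_dot_product_attention (True = attend);
# invert it for nn.MultiheadAttention, which masks out True positions.
mask = local_causal_mask(seq_len=16, window=4)
```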
I got rid of most of the musical relative position encodings. I don’t think they made a big difference after all.
The model now has three connected stacks of transformer blocks: a stack for predicting pitches, a stack for predicting time steps, and a shared stack that feeds into the other two. Before, the model was more like a pipeline. Now it has a “branch” in the middle.
The stack that predicts the next time step now uses cross attention. I find this more elegant than my previous method, which was a concatenation followed by a linear map and then some standard self-attention blocks.
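Put together, the branch looks roughly like this. It’s a deliberately simplified sketch built from stock PyTorch blocks: no masks, no rotary encodings, no drophead, illustrative layer counts and sizes, and the time-step branch’s queries are drawn from the shared features with keys/values from the pitch branch just to show where the cross attention sits.

```python
import torch.nn as nn

def stack(n_layers, d_model=256, n_heads=4):
    layer = nn.TransformerEncoderLayer(
        d_model, n_heads, dim_feedforward=4 * d_model,
        batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class BranchedModel(nn.Module):
    """Shared trunk that branches into a pitch stack and a time-step stack.
    Causal/local masking, rotary encodings, and drophead are omitted here."""
    def __init__(self, d_model=256, n_heads=4, n_pitches=128, n_time_steps=64):
        super().__init__()
        self.shared = stack(4, d_model, n_heads)
        self.pitch_stack = stack(2, d_model, n_heads)
        # Time-step branch: cross attention from the shared features (queries)
        # to the pitch branch's features (keys/values) -- illustrative wiring.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.time_stack = stack(2, d_model, n_heads)
        self.pitch_head = nn.Linear(d_model, n_pitches)
        self.time_head = nn.Linear(d_model, n_time_steps)

    def forward(self, x):  # x: (batch, seq, d_model) token embeddings
        h = self.shared(x)
        h_pitch = self.pitch_stack(h)
        h_time, _ = self.cross_attn(query=h, key=h_pitch, value=h_pitch)
        h_time = self.time_stack(h_time)
        return self.pitch_head(h_pitch), self.time_head(h_time)
```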
I use a shared rotary position encoding in all but the first transformer block. I rolled my own implementation that uses xPos.
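For reference, the core of a rotary-plus-xPos encoding looks something like this. It’s a condensed sketch rather than my actual implementation; the constants (base 10000, scale base 512, the 0.4 offset) are the usual defaults from the xPos paper, not anything specific to my model.

```python
import torch
import torch.nn as nn

def rotate_half(x):
    # Pair dimension i with dimension i + d/2 and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

class XPos(nn.Module):
    """Rotary position encoding with xPos-style per-dimension decay."""
    def __init__(self, head_dim, base=10000, scale_base=512):
        super().__init__()
        self.scale_base = scale_base
        half = torch.arange(0, head_dim, 2).float()
        self.register_buffer("freqs", 1.0 / (base ** (half / head_dim)))
        self.register_buffer("decay", (half + 0.4 * head_dim) / (1.4 * head_dim))

    def forward(self, q, k):
        # q, k: (batch, heads, seq, head_dim)
        seq_len = q.shape[-2]
        pos = torch.arange(seq_len, device=q.device).float()
        angles = pos[:, None] * self.freqs[None, :]            # (seq, head_dim/2)
        cos = torch.cat((angles.cos(), angles.cos()), dim=-1)   # (seq, head_dim)
        sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
        # xPos decay: queries are scaled up with position and keys scaled down,
        # so the dot product depends only on the relative distance.
        power = (pos - seq_len // 2)[:, None] / self.scale_base
        scale = torch.cat((self.decay[None, :] ** power,) * 2, dim=-1)
        q_out = (q * cos + rotate_half(q) * sin) * scale
        k_out = (k * cos + rotate_half(k) * sin) / scale
        return q_out, k_out
```

The per-dimension decay on the rotated queries and keys is the part xPos adds on top of plain RoPE; it’s meant to keep attention scores better behaved at long relative distances.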
I use drophead. I don’t know if this helps; I just found the idea appealing.
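Drophead just zeroes out entire attention heads at random during training, dropout at the level of heads. A minimal version, applied to the per-head attention outputs, using plain inverted-dropout rescaling (one of several reasonable variants):

```python
import torch
import torch.nn as nn

class DropHead(nn.Module):
    """Randomly zero out whole attention heads during training."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x: per-head attention outputs, shape (batch, heads, seq, head_dim).
        if not self.training or self.p == 0.0:
            return x
        batch, heads = x.shape[0], x.shape[1]
        keep = (torch.rand(batch, heads, 1, 1, device=x.device) > self.p).to(x.dtype)
        # Rescale the surviving heads so the expected magnitude is unchanged.
        # With few heads, you may want to guarantee at least one survivor.
        return x * keep / (1.0 - self.p)
```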
And, of course, I have more training data now, but not as much as I’d like. It’s awfully hard to find high-quality MIDIs. I’ve also found it important to check the output of my MIDI-to-token translation process to make sure it doesn’t mess anything up.
I’m not keen on scaling up the model at this point, so the obvious way to improve the output further is RLHF. I haven’t started building a reward model yet, so that’s the next step.