The DSP Concepts Machine Learning enablement program continues to ramp up. We wanted to clarify some questions that have come up as our customer and partner community gets familiar with the purpose, function, and benefits of ML for embedded audio DSP. We sat down with Josh Morris, ML Engineering Manager, to talk through the topics. Josh is a speaker at Arm® DevSummit 2022, presenting the "AI-Accelerated Audio on the Edge" masterclass, available on demand today.
We can tentatively describe machine learning (ML) as a branch of computer science concerned with algorithms that perform pattern recognition based on input data and that are employed in models that improve automatically through training experience. There is a lot of nuance in what "experience" means in this context. How might you further refine this description?
Supervised learning is probably the most popular form of ML in production today. At a high level, models trained with supervised learning learn from experience. Two things are required for supervised learning to be successful: a loss function and a dataset. The dataset provides a mapping between the input features and the desired ground-truth output. The loss function is a differentiable equation that estimates how closely the model's predicted output matches the ground truth. Using backpropagation, we can update the weights of the model based on how correct its prediction was. By showing the model the dataset many times and updating the weights on each pass, we let it slowly learn from experience.
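To make that loop concrete, here is a minimal sketch in PyTorch. The toy dataset, model sizes, and hyperparameters are illustrative assumptions, not anything from a real application; the point is the shape of the loop: predict, measure loss against ground truth, backpropagate, update weights.

```python
# Minimal supervised-learning loop (sketch). Dataset and sizes are toy values.
import torch
import torch.nn as nn

# Toy dataset: 256 examples of 16 input features, each mapped to one of 4 classes.
inputs = torch.randn(256, 16)             # input features
targets = torch.randint(0, 4, (256,))     # ground-truth labels

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()           # differentiable loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Show the model the dataset many times, updating the weights on each pass.
for epoch in range(10):
    optimizer.zero_grad()
    predictions = model(inputs)
    loss = loss_fn(predictions, targets)  # how far predictions are from ground truth
    loss.backward()                       # backpropagation computes gradients
    optimizer.step()                      # gradient step updates the weights
```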
What are the primary differences between the two concepts of ML algorithms and ML models?
"Algorithm" is a fairly broad term that encompasses a lot of things outside of machine learning. When we talk about models, we're usually talking about a trained instance of a learning algorithm, such as a neural network. The model itself isn't the algorithm; it's an artifact of the training process, which is an algorithm. You could think of it as the saved file that results from the training process.
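One way to see the distinction is in code: the training algorithm runs once, and the model is the artifact it leaves behind. This is a hedged sketch using PyTorch; the file name and layer sizes are made up for illustration.

```python
# The model as an artifact of training (sketch).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
# ... the training algorithm would run here, updating model weights ...
torch.save(model.state_dict(), "model.pt")   # the model: saved output of training

# Later, at inference time, we reload the artifact; no learning algorithm runs.
restored = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
restored.load_state_dict(torch.load("model.pt"))
restored.eval()
```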
What are the most important considerations when it comes to training machine learning models?
Data and process. A lot of algorithms are defined in frameworks already. Your organization and data practices are the real differentiating factors when it comes to the quality of your model. You should have a strong understanding of the data you are using and have practices that enforce reproducibility.
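One small, concrete example of a practice that enforces reproducibility is pinning every source of randomness before training. This is only a sketch of one ingredient; real pipelines would also version the dataset and the training configuration.

```python
# Pin randomness so a training run can be repeated (one reproducibility practice).
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.use_deterministic_algorithms(True)  # fail loudly on nondeterministic ops
```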
So the quality and quantity of training data are crucial to the eventual performance of a machine learning model. Can you tell us a bit more about features and labeling?
Yes, and more so quality than quantity. A large amount of data is always nice to have though!
Labeling can take many forms. I think of it as a mapping of input to output for the task you want the model to solve. In the case of classification, these are labels like "dog" or "cat." In the case of denoising algorithms, the targets would be clean speech recordings.
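As a tiny illustration of labeling as an input-to-output mapping, consider the two task types just mentioned. The file names below are purely hypothetical.

```python
# Classification: each clip maps to a class name.
classification_labels = {
    "clip_0001.wav": "dog",
    "clip_0002.wav": "cat",
}

# Denoising: each noisy clip maps to its clean counterpart.
denoising_targets = {
    "noisy_0001.wav": "clean_0001.wav",  # target is a clean speech recording
    "noisy_0002.wav": "clean_0002.wav",
}
```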
Feature engineering is taking your incoming data and transforming it into a form that is conducive to your model. For a lot of audio applications, that means taking time-domain audio and transforming it into the frequency domain via an FFT. By converting the audio to the frequency domain, your data now has an inherent 2D structure, with frequency on the Y-axis and time on the X-axis. Convolutional layers in a neural network can take advantage of this structural information because they pass 2D filters over the incoming data. It's possible to skip this transformation, but it comes at the cost of a much larger model, because the model has to work harder to extract the relevant information. This is why feature engineering and domain expertise are still very relevant to machine learning.
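Here is a sketch of that transform using NumPy and SciPy: time-domain audio in, a 2D time-frequency representation out. The sample rate and window sizes are typical assumed values, not requirements of any particular model.

```python
# Time-domain audio -> 2D frequency-vs-time features (sketch).
import numpy as np
from scipy.signal import stft

sample_rate = 16000
audio = np.random.randn(sample_rate)   # one second of stand-in audio

# Short-time FFT: 32 ms windows with 50% overlap.
freqs, times, spec = stft(audio, fs=sample_rate, nperseg=512, noverlap=256)
magnitude = np.abs(spec)               # shape: frequency bins (Y) x time frames (X)

# Convolutional layers can now slide 2D filters over this structure.
print(magnitude.shape)                 # (257, num_frames)
```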
We can make a distinction between what we would call human intuition and the way a machine learns patterns. What are the key differences to consider when applying ML to a task?
Humans have a much deeper understanding of what they are doing. They are also able to learn new tasks much more quickly than current ML techniques allow. I tend to think of ML models as correlation machines that have a strong mapping of input to output based on the data they were trained with. In general, models aren't good at extrapolating or generalizing to data that is dissimilar to what they were trained on.
How do you discern whether machine learning would perform well with a given task or problem?
Funnily enough, a lot of the time the gut check is whether a human can discern a pattern given the input data. There's also a lot of intuition involved in matching the right type of model to the type of data you have and the task you are solving.
What are some audio applications of machine learning?
Cognition, transcription, and denoising are all popular applications of machine learning in the audio domain.
Finally, what are some audio-related tasks that DSP Concepts hopes to approach with ML in the near future?
Right now, we are very focused on the experience of audio application developers using Audio Weaver as a development and prototyping platform. One of the goals of my team is to reduce the time it takes to get models into production by leveraging Audio Weaver at key points in the ML lifecycle. We're excited to release the Audio Weaver ML Module Pack in January, which provides the necessary support for feature extraction, model execution, and model tuning on the platform.
Watch the DSP Concepts and Alif Semiconductor presentation at Arm DevSummit 2022 (video on demand) to hear more from Josh Morris about ML techniques commonly used for audio, the features and benefits of the Audio Weaver platform, and how to build innovative ML designs that leverage the power of the Arm Cortex®-M55 and Ethos™-U55 processors featured on the Alif Ensemble™ family of MCUs.
For more information about how the Audio Weaver platform and the Audio Weaver ML Module Pack can accelerate and expand DSP processing with neural networks, visit the IP Ecosystem page or contact DSP Concepts.