MATLAB for Data Science and Machine Learning
Dr J Rogel-Salazar | 2021-10-18 | 9 min read
The opportunities to solve problems with data are greater than ever, and as different industries embrace them, the available data has steadily increased and the number of tools has expanded. A typical question from new data scientists is which programming language is best to learn, either to get a good understanding of coding or to future-proof their skills. Typically, the question centers on usual suspects such as R and Python. I have also been asked about Java, and I have provided an answer to that query elsewhere. In reality, there may not be a single best tool to use, and I have long argued for a toolbox approach to the practice of data science. I would like to advocate for one of those tools: MATLAB.
What is MATLAB?
MATLAB first appeared commercially in the mid-1980s, and its use of expert toolboxes has been a defining feature of the language and ecosystem. Many engineering and science courses have embraced MATLAB as a teaching tool. As a result, it is widely used by scientists and engineers in many fields, providing excellent capabilities for data analysis, visualization, and more.
MATLAB is a high-level technical computing environment that integrates computation, visualization, and development in a single place. Its interactive environment serves as a playground to develop, design, and consume applications with the advantage of having a wide variety of mathematical functions at your fingertips, such as statistics, linear algebra, Fourier analysis, and optimization algorithms among others.
MATLAB provides useful development tools that improve code maintenance and performance, and it integrates with other programming languages such as Fortran, C/C++, .NET, or Java. These are some of the reasons I wrote Essential MATLAB and Octave, a book that introduces my own physics, mathematics, and engineering students to solving computational problems with MATLAB.
Today, MATLAB is a widely used programming language that many industries trust, and one that can benefit users who want to integrate machine learning techniques into their applications.
Why Should You Choose MATLAB for Data Science?
In the area of data science and machine learning, MATLAB is perhaps not one of the first programming environments that come to mind. This may be partly due to the attention gathered by languages like Python, R, and Scala; it may also be that being a proprietary language is sometimes seen as a barrier. I would argue, however, that in many industries and applications, such as aerospace, military, medical, or financial, it is an advantage to have a supported and externally validated set of tools, backed by years of development and commercial success.
From a technical perspective, data scientists and machine learning practitioners require a language that lets them manipulate objects suited to vector and matrix operations. A programming language whose name is an abbreviation of “Matrix Laboratory” offers reassurance that matrices are a natural way to express the required computations, in a syntax that is close to the original linear algebra notation. In other words, the basic object of operation in MATLAB is the matrix: an integer, for example, can be treated as a 1x1 matrix. This means that a wide range of mathematical operations defined for vectors and matrices are built into MATLAB from the start: cross and dot products, determinants, matrix inverses, and so on are natively available. In turn, a lot of the implementation work that machine learning techniques require is made much easier in MATLAB. Think, for example, of the representation of a corpus in natural language processing: we require large matrices to represent documents. The columns of a matrix may represent the words that appear in the corpus and the rows may be the sentences, pages, or documents in it. In machine vision, it is common to represent images as matrices, and MATLAB provides for the manipulation of these kinds of objects.
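As a minimal sketch of this matrix-first view, the snippet below uses only base MATLAB functions (`size`, `dot`, `cross`, `det`, `inv`, `split`, `unique`). The toy corpus and the variable names (`docs`, `vocab`, `counts`) are made up for illustration, and the string syntax assumes a reasonably recent release (double-quoted strings need R2017a or later). Dedicated text tooling exists in the Text Analytics Toolbox; the hand-rolled loop here is only meant to show the matrix representation.

```matlab
% A scalar is stored as a 1x1 matrix
x = 5;
sz = size(x);                 % [1 1]

% Vector products are built in
a = [1 2 3];
b = [4 5 6];
d = dot(a, b);                % 1*4 + 2*5 + 3*6 = 32
c = cross(a, b);              % [-3 6 -3]

% Determinant and inverse of a small matrix
A = [4 7; 2 6];
detA = det(A);                % 10, up to floating-point rounding
Ainv = inv(A);                % approximately [0.6 -0.7; -0.2 0.4]

% To solve A*s = y, the backslash operator is usually preferred over inv
y = [1; 2];
s = A \ y;

% Toy document-term matrix: rows are documents, columns are words
docs = ["the cat sat", "the dog sat", "the cat saw the dog"];  % hypothetical corpus
tokens = arrayfun(@(doc) split(doc), docs, 'UniformOutput', false);
vocab = unique(vertcat(tokens{:}));   % sorted list of distinct words

counts = zeros(numel(docs), numel(vocab));
for i = 1:numel(docs)
    for j = 1:numel(vocab)
        % Number of times word j occurs in document i
        counts(i, j) = sum(tokens{i} == vocab(j));
    end
end
```

Each row of `counts` is then a numeric description of one document, so the usual matrix operations (products, norms, decompositions) apply to it directly.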
Furthermore, the number of toolboxes that are available in MATLAB makes it easy to create structured data pipelines that do not require us to worry about compatibility issues, and all is done within the same computational environment. Some toolboxes have been part of the language for a long time such as the Symbolic Math, Optimization, and Curve Fitting Toolboxes, but new ones such as the Text Analytics, Statistics and Machine Learning, and Deep Learning Toolboxes are putting MATLAB back in the game.
MATLAB’s strong engineering credentials mean that there are readily available mechanisms for acquiring data directly from hardware such as circuit boards, measurement instruments, and imaging devices. These capabilities, together with simulation tools such as SIMULINK, make it compelling to apply machine learning techniques in a cohesive environment. In case you have never heard of SIMULINK, it is an interactive, graphical environment for modeling dynamic systems. It lets the user create virtual prototypes that can serve as digital twins to try things on the fly or analyze what-if scenarios.
Let us map the use of MATLAB to the typical data science workflow and see how it can support us:
- Data access and exploration - MATLAB lets us ingest a variety of data formats including text files, spreadsheets, and MATLAB files, but also images, audio, video, XML, or Parquet formatted data. As mentioned above, it is possible to read data directly from hardware too. Data exploration is supported by the interactive IDE and the data visualization capabilities of the ecosystem.
- Data pre-processing and cleaning - As a natural next step from data exploration, MATLAB makes it easy to use a live editor to clean outliers as well as find, fill, or remove missing data, remove trends, or normalize attributes. MATLAB also provides the user with domain-specific pre-processing tools for images, video, and audio. This means that we can apply suitable steps to our data prior to training. MATLAB’s Deep Network Designer app can then be used to build complex network architectures or modify trained networks for transfer learning.
- Predictive modeling - Toolboxes are available to implement logistic regression, classification trees, or support vector machines as well as dedicated deep learning tools to implement convolutional neural networks (ConvNets, CNNs) and long short-term memory (LSTM) networks on image, time-series, and text data.
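The three workflow steps above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not a recommended pipeline: the data is synthetic (in practice a call such as `readtable` on a file would take its place), the variable names `x1`, `x2`, and `label` are hypothetical, and the modeling step assumes the Statistics and Machine Learning Toolbox is installed, since `fitglm` is not part of base MATLAB. The cleaning functions (`fillmissing`, `isoutlier`, `normalize`) are base functions in recent releases.

```matlab
% --- Data access (synthetic stand-in for a file read with readtable) ---
rng(0);                                   % fixed seed for reproducibility
n = 200;
x1 = randn(n, 1);
x2 = randn(n, 1);
% Hypothetical binary target with some noise so classes overlap
label = double(x1 + x2 + 0.5 * randn(n, 1) > 0);

% Inject two missing values and one gross outlier to clean up later
x1([5 50]) = NaN;
x2(10) = 100;

T = table(x1, x2, label);

% --- Pre-processing and cleaning ---
T.x1 = fillmissing(T.x1, 'linear');       % interpolate missing entries
keep = ~isoutlier(T.x2);                  % default: scaled median absolute deviation
T = T(keep, :);                           % drop rows flagged as outliers
T.x1 = normalize(T.x1);                   % z-score each predictor
T.x2 = normalize(T.x2);

% --- Predictive modeling (needs Statistics and Machine Learning Toolbox) ---
mdl = fitglm(T, 'label ~ x1 + x2', 'Distribution', 'binomial');  % logistic regression
p = predict(mdl, T);                      % fitted probabilities on the training data
acc = mean((p > 0.5) == T.label);         % training accuracy, for illustration only
```

The accuracy here is computed on the training data, so it says nothing about generalization; a real workflow would hold out a test set or use cross-validation.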
More Machine Learning with MATLAB
Your journey with MATLAB may have started on a desktop computer as part of your engineering or science courses. Today MATLAB is available in dedicated cloud resources, such as the Domino Enterprise MLOps Platform, where your models can be trained on suitable GPUs. This has been the case for some time, and in a previous blog post, we looked at code parallelization in some of the more popular languages supported in the Domino platform.
The possibilities for implementing data science and machine learning models with MATLAB are extensive: comparing models; using AutoML for feature selection, model selection, and hyperparameter tuning; scaling processing to dedicated clusters; generating code for high-performance computing in languages such as C++; and integrating with simulation platforms like SIMULINK.
With tools such as the Deep Learning Toolbox, MATLAB is well positioned not only for the training and deployment of deep neural networks (including the design of network architectures), but also for the preparation and labeling of data, including image, video, and audio formats. MATLAB also enables us to work with frameworks such as PyTorch or TensorFlow, all in the same environment.
Summary
There are several great tools that data scientists can use. Although languages like R and Python take the limelight, engineers, scientists, econometricians, and financial engineers trained in MATLAB can continue to use the capabilities that this rich and powerful ecosystem offers. With the continued support and development that MathWorks provides to its users, and with partnerships such as that with Domino, MATLAB will continue to grow in the data science and machine learning arena.
Dr Jesus Rogel-Salazar is a Research Associate in the Photonics Group in the Department of Physics at Imperial College London. He obtained his PhD in quantum atom optics at Imperial College in the group of Professor Geoff New and in collaboration with the Bose-Einstein Condensation Group in Oxford with Professor Keith Burnett. After completing his doctorate in 2003, he took up a postdoc in the Centre for Cold Matter at Imperial and moved on to the Department of Mathematics in the Applied Analysis and Computation Group with Professor Jeff Cash.