Recently, Prof. Scott Cunningham started a series on Matrix Completion (MC) for causal inference, beginning with the influential paper by Athey et al. (2021).
As someone who has been actively working in the field and has authored three papers involving MC, I'm excited to share a quick introduction to MC, its importance in causal inference, and where beginners can find helpful online resources.
Matrix Completion is a powerful tool in machine learning. It's like a smart way of filling in missing pieces in a vast dataset. Imagine you have a huge jigsaw puzzle with some parts missing. MC helps you figure out what those missing parts might be based on the existing pieces.
This technique is especially useful in causal inference, which is about understanding cause-and-effect relationships in data. MC helps in identifying these relationships even when some data points are missing, which is often the case in real-world scenarios.
For those just starting out, there are plenty of online resources to dive deeper into Matrix Completion. Websites like arXiv for research papers, online courses from platforms like Coursera or edX, and even specific tutorials on YouTube offer a wealth of information. These resources are great for building a foundational understanding and keeping up with the latest developments in the field.
Let's dive into it!
What is Matrix Completion?
MC is an exciting area in machine learning that has recently found widespread use in various fields such as recommending products or movies (like what you see on Netflix), understanding images in computer vision, and processing human language.
The core idea of MC is to figure out the missing pieces in a puzzle of data.
Imagine you have a big spreadsheet filled with lots of numbers, but some of them are missing. What MC does is like magic - it predicts these missing numbers based on the numbers that are already there. It's a bit like guessing the missing pieces in a partially completed crossword puzzle by looking at the words that are already filled in.
How does it do this? MC methods balance two things: fitting the data they can see and using something called 'regularization.' Regularization is like a guide that keeps the system from jumping to wild conclusions based on limited data. It's like telling someone to make sensible guesses in our crossword puzzle without straying too far from the words that are already there.
Regularization helps avoid so-called "overfitting." Overfitting is like memorizing the answers to a test without understanding the subject. Imagine you have a history test, and instead of learning about the events, causes, and effects, you just memorize the exact questions and answers from a practice test. On the actual test day, if the questions are the same, you'll do great. But if the questions are different, you'll struggle because you didn't really learn the subject.
In machine learning, overfitting happens when a computer model is trained too much on a specific set of data. It gets really good at answering questions for that data, just like memorizing the practice test. But when it sees new, different data, it doesn't perform well because it didn't learn the underlying patterns, just the specific details of the training data. So, overfitting is like being great at a practice test but failing in a real test because the questions aren't exactly the same.
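To tie overfitting back to MC, here is a minimal sketch on made-up data (the matrix, the mask, and the helper `fit_error` are all illustrative assumptions). It shows that, when only the fit to the observed cells is measured, two completions that disagree wildly on the missing cells look equally good.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "true" data: a small rank-1 matrix.
truth = rng.normal(size=(5, 1)) @ rng.normal(size=(1, 4))

# Hypothetical mask: True where a cell is observed, three cells are hidden.
observed = np.ones(truth.shape, dtype=bool)
observed[[0, 2, 4], [1, 3, 0]] = False

def fit_error(Z):
    """Squared error between Z and the data, on observed cells only."""
    return float(((truth - Z)[observed] ** 2).sum())

# Two candidate completions: identical on observed cells,
# very different guesses for the missing ones.
guess_a = np.where(observed, truth, 0.0)
guess_b = np.where(observed, truth, 1000.0)

print(fit_error(guess_a), fit_error(guess_b))  # both are 0.0
```

Since the data-fitting term alone cannot tell these two guesses apart, something else has to express a preference among them. That is the role regularization plays.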
Specifically, these methods often use what's called the 'nuclear norm' of the matrix, which is the sum of the matrix's singular values. You can think of the nuclear norm as a tool that helps the system understand how complex or simple the missing data might be. It's like having a rule in our crossword puzzle that says, "your guess should be as simple as possible but still make sense with the given words."
This video from Stanford University provides a more in-depth introduction to the topic:
More formally, the nuclear norm pushes the matrix to be as low-rank as possible (i.e., with many rows or columns that are linear combinations of the others) while preserving a good fit. Why? For two main reasons: (i) without regularization we would overfit, since the only term left would be the data-fitting term, as we will see below; (ii) in many applications a low-rank approximation is useful in its own right. This is the case, for instance, in innovation processes: many authors in the literature view innovation as a linear combination of pre-existing technologies or knowledge domains that generates novel solutions, thereby fostering the emergence of new domains or technologies.
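To make the link between rank and the nuclear norm concrete, here is a small NumPy sketch. The matrices are simulated and the helper name `nuclear_norm` is my own. It computes the nuclear norm as the sum of the singular values and compares a rank-2 matrix with a generic full-rank one.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-2 matrix: every column is a linear combination of two latent factors.
U = rng.normal(size=(6, 2))
V = rng.normal(size=(2, 5))
low_rank = U @ V

# A generic dense matrix of the same shape (full rank with probability one).
full_rank = rng.normal(size=(6, 5))

def nuclear_norm(M):
    """Nuclear norm of M, i.e. the sum of its singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

for name, M in [("low rank", low_rank), ("full rank", full_rank)]:
    singular_values = np.linalg.svd(M, compute_uv=False)
    print(name, "| rank:", np.linalg.matrix_rank(M))
    print("  singular values:", np.round(singular_values, 3))
    print("  nuclear norm:", round(nuclear_norm(M), 3))
```

In the rank-2 matrix only two singular values are (numerically) non-zero. Penalizing the sum of the singular values pushes many of them to exactly zero, which is why the nuclear norm tends to favor low-rank solutions.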
Under which conditions does MC work generally?
Generally, MC works best when data is missing at random. The reason is simple: fairness in the missing information. Think of it like having a puzzle with some pieces missing. If the missing pieces are randomly spread out, you can still guess what the picture is about by looking at the remaining pieces. But if all the missing pieces are from one specific part of the puzzle, like the sky or a person's face, it's much harder to guess what that part should look like.
In MC, if data is missing randomly, the patterns or relationships in the data that are visible (the parts of the puzzle you can see) are likely to be similar to the patterns in the missing parts. This makes it easier for the model to accurately predict the missing values. However, if data is missing in a specific pattern (like all information from one category is missing), the model might not have enough information to understand the full picture, leading to inaccurate predictions.
So, MC is most effective when the gaps in the data don't follow any specific pattern, allowing the model to make better guesses about what's missing.
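The difference between the two situations can be sketched with two made-up missingness masks that hide the same number of cells. The sizes and the choice of "two fully missing columns" are illustrative assumptions, not taken from any paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols = 8, 6
n_missing = 16  # same number of missing cells in both scenarios

# Scenario 1: cells go missing completely at random.
flat = np.zeros(n_rows * n_cols, dtype=bool)
flat[rng.choice(n_rows * n_cols, size=n_missing, replace=False)] = True
random_missing = flat.reshape(n_rows, n_cols)

# Scenario 2: structured missingness, the last two columns are never observed.
block_missing = np.zeros((n_rows, n_cols), dtype=bool)
block_missing[:, -2:] = True

for name, missing in [("random", random_missing), ("block", block_missing)]:
    observed_per_column = (~missing).sum(axis=0)
    print(name, "| observed cells per column:", observed_per_column)
```

In the block scenario the last two columns contain no observed cells at all, so nothing in the visible data tells the model how those columns relate to the others. With random missingness the gaps are typically spread across columns, so each one usually keeps some observed cells to learn from.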
However, more recent versions of MC also include fixed effects (FE) and can handle longitudinal data and values that are not missing at random. Below I list several sources where you can find them.
For the brave...
MC methods use an objective function with two main parts: a data-fitting term and a regularization term. To give you an idea, let's look at an example from Mazumder et al. (2010). In their approach, the data-fitting term is expressed as (A-Z)^2, summed over the observed entries. This part of the formula measures how well the model's predictions match the actual data. Think of it as checking how close the puzzle-solving guesses are to the pieces we can actually see.
The second part is the regularization term, represented as ||Z||. This term uses what's called the 'nuclear norm' (that's what the ||.|| symbolizes). It helps the model avoid jumping to wild conclusions based on limited or overly complex information. You can think of it as a guiding rule that keeps the model's guesses realistic and straightforward.
To summarize, the formulas used in MC methods balance fitting the available data as accurately as possible with keeping the predictions sensible and grounded.
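Written out, the objective described above takes roughly the following form. This is a sketch in the spirit of Mazumder et al. (2010); the subscript notation and the 1/2 scaling are my assumptions about how to typeset it.

```latex
\min_{Z} \; \frac{1}{2} \sum_{(i,j) \in \Omega} \left( A_{ij} - Z_{ij} \right)^2 \; + \; \lambda \, \lVert Z \rVert_{*}
```

The first term is the data-fitting term, evaluated only on the observed entries; the second is the nuclear norm of Z (the sum of its singular values), weighted by lambda.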
Here A is the partially observed matrix, and Z is its reconstruction, which has the same dimensions. Omega denotes the set of observed entries used for training, and lambda is the regularization parameter.
Finally, note that the MC formula (the objective being minimized) should not be confused with the algorithm used to solve it, such as singular value thresholding (SVT).
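To illustrate the difference between the objective and an algorithm that minimizes it, here is a minimal NumPy sketch of an iterative scheme built on soft-thresholding of singular values, in the spirit of the Soft-Impute procedure of Mazumder et al. (2010). The function names, the simulated data, and the value of `lam` are my own illustrative assumptions. This is not the estimator of Athey et al. (2021), which also includes fixed effects.

```python
import numpy as np

def soft_threshold_svd(M, lam):
    """Shrink every singular value of M by lam, setting those below lam to zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return (U * s_shrunk) @ Vt

def soft_impute(A, observed, lam, n_iter=500, tol=1e-6):
    """Iteratively fill missing cells with the current guess, then shrink."""
    Z = np.zeros(A.shape, dtype=float)
    for _ in range(n_iter):
        # Keep the observed data, plug the current estimate into the gaps.
        filled = np.where(observed, A, Z)
        Z_new = soft_threshold_svd(filled, lam)
        change = np.linalg.norm(Z_new - Z)
        Z = Z_new
        if change <= tol * max(1.0, np.linalg.norm(Z)):
            break
    return Z

# Simulated example: a rank-2 matrix with about 30% of cells hidden at random.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(truth.shape) < 0.7
A = np.where(observed, truth, np.nan)  # NaN marks a missing cell

Z_hat = soft_impute(A, observed, lam=0.5)

# Compare against a naive baseline that predicts the mean of the observed cells.
held_out = ~observed
rmse_mc = np.sqrt(np.mean((truth - Z_hat)[held_out] ** 2))
rmse_baseline = np.sqrt(np.mean((truth[held_out] - truth[observed].mean()) ** 2))
print("RMSE on hidden cells, MC sketch:", round(float(rmse_mc), 3))
print("RMSE on hidden cells, mean baseline:", round(float(rmse_baseline), 3))
```

In practice lambda would be chosen by cross-validation on held-out observed entries rather than fixed by hand as it is here.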
Where to find online sources to implement MC:
For R lovers, the recently released FECT package lets you apply MC while controlling for covariates and fixed effects;
For MATLAB lovers, stay tuned: my colleagues and I are building an easy-to-use tool for performing MC in all its forms;
Of course, you can also go straight to Athey's GitHub repository.
Athey, S., Bayati, M., Doudchenko, N., Imbens, G. and Khosravi, K., 2021. Matrix completion methods for causal panel data models. Journal of the American Statistical Association, 116(536), pp.1716-1730.
Mazumder, R., Hastie, T. and Tibshirani, R., 2010. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11, pp.2287-2322.
Yoo, Y., Boland Jr, R.J., Lyytinen, K. and Majchrzak, A., 2012. Organizing for innovation in the digitized world. Organization Science, 23(5), pp.1398-1408.
Varian, H.R., 2010. Computer mediated transactions. American Economic Review, 100(2), pp.1-10.