Machine learning models thrive on massive datasets. Large language models, for example, are often trained on bodies of text containing billions or even trillions of words. This vast amount of data is essential: it serves as a rich source of diverse linguistic examples from which the models learn to discern patterns, capture the nuances of syntax, semantics, and context, and generate coherent responses across a wide array of language use cases.
This approach starkly contrasts with how children learn language. Unlike machine learning models that require extensive exposure to massive numbers of examples, children exhibit a remarkable ability to acquire language proficiency from relatively small numbers of observations. Through interactions with their immediate environment and exposure to conversations, children grasp the complexities of language naturally. They learn to understand grammar, build vocabulary, and generate coherent sentences with an efficiency that current machine learning models struggle to replicate.
If algorithms could more closely mimic the learning capabilities of children, it could revolutionize the field. A more efficient learning process might mean a reduced dependence on massive datasets, faster model training, less energy consumption, and potentially enhanced adaptability to new contexts. According to a team of data scientists at New York University, the best way to understand how children learn language might be to look at the world through their eyes. That is just what this group has done: they attached a wearable camera to a baby and collected weekly recordings of what the child saw and heard over a year and a half.
The camera was mounted on a helmet to capture the child's point of view. Recording took place between the ages of six and 25 months, the period when most children first begin to talk. In total, about 61 hours of video were captured, amounting to only about one percent of the child's waking hours, so the data represent only a small fraction of his experiences. This yielded a dataset of 60,000 still image frames, each paired with transcripts of any words spoken by the child's parents or other individuals who happened to be present.
By normal standards, this is far too small a dataset for a machine learning model to learn much of anything about language. But to gauge the utility of this sort of data, the researchers used it to train a multimodal neural network on the video frames and their associated transcripts. In particular, they used a contrastive learning algorithm, which enables the model to form associations between spoken words and objects: when objects and words co-occur in the same frames, the connection between them is strengthened; when they are rarely observed together, it is weakened.
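To make the idea concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive objective over a batch of paired frame and utterance embeddings. This is an illustration of the general technique, not the researchers' actual implementation; the embedding dimensions, temperature value, and function names are all assumptions for this example.

```python
import numpy as np

def contrastive_loss(frame_emb, word_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (frame, utterance) pairs.

    Pairs that came from the same moment (same row index) are pulled
    together; mismatched pairs within the batch are pushed apart.
    Both inputs have shape (batch, dim).
    """
    # Normalize to unit length so dot products are cosine similarities.
    frame_emb = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    word_emb = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)

    # Similarity matrix: entry (i, j) compares frame i with utterance j.
    logits = frame_emb @ word_emb.T / temperature
    n = logits.shape[0]
    idx = np.arange(n)

    def cross_entropy(l):
        # Row-wise softmax cross-entropy; the correct class for row i is i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the frame->word and word->frame directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With embeddings that line up perfectly, the loss is near zero; shuffling the word embeddings relative to the frames drives it up, which is exactly the signal that strengthens co-occurring word-object pairs.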
As you have probably gathered, this model will not be giving ChatGPT, Bard, or LLaMA a run for their money on language comprehension tasks. Very interestingly, though, it performed well on tests frequently used to measure word learning in infants. In these tests, a word is presented along with a set of four objects, and the goal is to choose the object the word refers to. These tests revealed that the model had learned a substantial vocabulary from the small dataset captured from the child's point of view.
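A model trained with such a contrastive objective can be scored on this kind of trial by embedding the word and each candidate image in the shared space and picking the closest candidate. The helper below is a hypothetical sketch of that scoring step, assuming the embeddings come from an already-trained model:

```python
import numpy as np

def pick_referent(word_emb, candidate_embs):
    """Mimic a four-alternative forced-choice trial: return the index of
    the candidate image embedding most similar (by cosine similarity)
    to the given word embedding.

    word_emb: shape (dim,); candidate_embs: shape (n_candidates, dim).
    """
    word = word_emb / np.linalg.norm(word_emb)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(cands @ word))
```

Accuracy on a battery of such trials, compared against the 25 percent chance level, gives a simple measure of how much vocabulary the model acquired.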
These results suggest that naturalistic datasets could be highly efficient in teaching neural networks to understand certain aspects of language. It is also hoped that this work will help researchers to develop new types of artificial systems that can learn from fewer examples in the future.