AI Learns Through the Eyes and Ears of a Child

2024-02-01 New York University (NYU)

[Image: Video frames captured from a child wearing a head-mounted camera. Image courtesy of NYU’s Center for Data Science.]

◆A research team at New York University ran experiments with an AI model trained on the sights and sounds experienced by a single child, showing that AI can learn words and concepts from limited input.
◆The study used head-mounted camera recordings made while the child was between 6 months and 2 years old, and showed that the AI model can achieve genuine language learning even from footage covering only about 1% of the child's waking time. The experiment was designed to study, through AI, the actual problem a child faces when learning language in a natural environment, and to contribute to the debate over what children need in order to learn words.
◆The researchers say the results suggest that more can be learned from this kind of input than traditionally thought, and that using AI models to study how children learn language can help address classic debates about what language learning requires.

<Related Information>

Grounded language acquisition through the eyes and ears of a single child

Wai Keen Vong, Wentao Wang, A. Emin Orhan, and Brenden M. Lake
Science, Published: 1 Feb 2024
DOI: https://doi.org/10.1126/science.adi1374

Editor’s summary

How do young children learn to associate new words with specific objects or visually represented concepts? This hotly debated question in early language acquisition has been traditionally examined in laboratories, limiting generalizability to real-world settings. Vong et al. investigated the question in an unprecedented, longitudinal manner using head-mounted video recordings from a single child’s first-person experiences in naturalistic settings. By applying machine learning, they introduced the Child’s View for Contrastive Learning (CVCL) model, pairing video frames that co-occurred with uttered words, and embedded the images and words in shared representational spaces. CVCL represents sets of visually similar things from one concept (e.g., puzzles) through distinct subclusters (animal versus alphabet puzzles). It combines associative and representation learning that fills gaps in language acquisition research and theories. —Ekeoma Uzogara
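The summary describes CVCL as pairing each video frame with the utterance spoken at the same moment and embedding both in a shared representational space. As a rough, hedged sketch of that kind of contrastive image-text objective, the PyTorch code below trains toy encoders with an InfoNCE-style loss; the FrameEncoder, UtteranceEncoder, embedding dimension, and temperature are illustrative placeholders, not the authors' actual architecture or settings.

```python
# Rough sketch of a CLIP-style contrastive objective in the spirit of CVCL:
# frames and co-occurring utterances are mapped into one embedding space, and
# matching frame-utterance pairs are pulled together while mismatched pairs in
# the same batch are pushed apart. All architectural choices here are
# illustrative placeholders, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Toy image encoder: flattens a frame and projects it into the shared space."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, frames):                       # frames: (B, 3, H, W)
        return F.normalize(self.net(frames), dim=-1)

class UtteranceEncoder(nn.Module):
    """Toy text encoder: mean-pools word embeddings of the transcribed utterance."""
    def __init__(self, vocab_size=10_000, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, token_ids):                    # token_ids: (B, T), 0 = padding
        mask = (token_ids != 0).unsqueeze(-1)
        pooled = (self.emb(token_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        return F.normalize(pooled, dim=-1)

def contrastive_loss(frame_emb, utt_emb, temperature=0.07):
    """InfoNCE over the batch: the i-th frame should match the i-th utterance."""
    logits = frame_emb @ utt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# One illustrative training step on random stand-in data.
frames = torch.randn(8, 3, 64, 64)                   # 8 video frames
tokens = torch.randint(1, 10_000, (8, 12))           # 8 co-occurring utterances
frame_encoder, utterance_encoder = FrameEncoder(), UtteranceEncoder()
loss = contrastive_loss(frame_encoder(frames), utterance_encoder(tokens))
loss.backward()
```

The design point mirrored here is that supervision comes only from co-occurrence: frames and utterances recorded together serve as positive pairs, while all other pairings in the batch act as negatives.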

Abstract

Starting around 6 to 9 months of age, children begin acquiring their first words, linking spoken words to their visual counterparts. How much of this knowledge is learnable from sensory input with relatively generic learning mechanisms, and how much requires stronger inductive biases? Using longitudinal head-mounted camera recordings from one child aged 6 to 25 months, we trained a relatively generic neural network on 61 hours of correlated visual-linguistic data streams, learning feature-based representations and cross-modal associations. Our model acquires many word-referent mappings present in the child’s everyday experience, enables zero-shot generalization to new visual referents, and aligns its visual and linguistic conceptual systems. These results show how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child’s input.
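The abstract also notes that the learned shared space enables zero-shot generalization to new visual referents. The self-contained sketch below shows only the nearest-neighbor idea behind such word-to-referent matching, using random stand-in embeddings; the paper's actual forced-choice evaluation protocol may differ in its details.

```python
# Self-contained sketch of zero-shot word-to-referent matching in a shared
# embedding space: given a word's embedding, pick the candidate frame whose
# embedding is most similar. Embeddings here are random stand-ins; in the real
# model they would come from the trained encoders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_referent(word_embedding, candidate_frame_embeddings):
    """Return the index of the candidate frame closest to the word
    (dot product equals cosine similarity, since embeddings are unit-normalized)."""
    similarities = candidate_frame_embeddings @ word_embedding
    return int(similarities.argmax())

# Example: one word vector against four candidate frame vectors.
word_emb = F.normalize(torch.randn(512), dim=0)
frame_embs = F.normalize(torch.randn(4, 512), dim=1)
print("Model picks candidate frame", pick_referent(word_emb, frame_embs))
```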
