Summary of Google Research, 2022 & Beyond Announcement

Google Research has been advancing the field of AI by researching areas such as robotics, data mining, and responsible AI, not only driving new product innovation for Google, but also contributing to the wider research community.

In January 2023, Senior Fellow and SVP of Google Research, Jeff Dean, kicked off a blog series on behalf of the Google Research community, to highlight the exciting progress researchers across Google made in 2022 and present their vision for 2023 and beyond. The first post of this series is titled Google Research, 2022 & beyond: Language, vision and generative models.

The blog post is a valuable resource for business professionals who are interested in keeping up with the latest AI trends and advancements. Even if you are just starting to explore AI, Jeff Dean's blog is an excellent resource that is not to be missed.

As the blog focuses on sharing advancements in artificial intelligence research, the topics can become technical, with many links to research papers that explore the algorithms and techniques in depth. For AI product managers and others who are more interested in the business applications and opportunities, I have put together a summary of these aspects for easier understanding.

I hope this will be useful for you in exploring the business side of AI. Enjoy!

Topics:

Language Models

Natural Conversations
Source Code Completion
Multi-step Reasoning

Machine Translation
Pre-trained Language Models
Emergent Abilities

Computer Vision

Object Detection
2D Photo to 3D Structure
Multimodality
VideoQA - Video Question Answering
Audio Dialog Replacement on Video
Natural Conversations
3D Box Detection of Objects

Generative Models

Image Generation
User Control
Generative Video
Generative Audio

Responsible AI

Language Models

Language models are computer algorithms that are trained on large datasets of text to predict the likelihood of the next word in a sequence of words. They are used in a wide range of natural language processing tasks, such as machine translation, text classification, and text generation. These models can generate human-like text.
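To make "predicting the likelihood of the next word" concrete, here is a minimal sketch in Python. It is a toy word-pair counter, not how LaMDA or PaLM work internally, and the corpus and function names are made up for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word never appeared with a follower in training
    return {candidate: n / total for candidate, n in followers.items()}


# Toy training text, invented for this example.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")
print(probs)
```

In this toy corpus, "the" is followed by "cat" twice and by "mat" and "fish" once each, so "cat" gets the highest probability. Large language models apply the same idea at a vastly larger scale, with neural networks in place of simple counts.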

Natural Conversations

Natural conversations are clearly an important and emergent way for people to interact with computers. Rather than contorting ourselves to interact in ways that best accommodate the limitations of computers, we can instead have natural conversations to accomplish a wide variety of tasks.

Google Research work:

LaMDA explores how these models can be used for safe, grounded, and high-quality dialog to enable contextual multi-turn conversations. (http://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html )
PaLM, a large, 540-billion-parameter language model, provides evidence that increasing the scale of the model and training data can significantly improve capabilities. (http://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html )

Source Code Completion

The increasing complexity of software code poses a key challenge to productivity in software engineering. Therefore, code completion has been an essential tool that has helped mitigate this complexity in integrated development environments.

Google Research work:

An ML-enhanced code completion model, used in the IDE by 10,000 Google software developers on 2.6% of all code, reduced coding iteration time for these developers by 6%. (https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html )

Multi-step Reasoning

One of the broad key challenges in artificial intelligence is to build systems that can perform multi-step reasoning, learning to break down complex problems into smaller tasks and combining solutions to those to address the larger problem.

Google Research work:

The Minerva effort achieves 50% on the STEM Math evaluation for mathematical reasoning and solving scientific problems, compared to 7% for the previous state of the art. (https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html )
Flan-PaLM achieves 67.6% accuracy on US Medical License Exam questions (MedQA), surpassing the prior ML state-of-the-art by over 17%. (https://arxiv.org/abs/2212.13138 )

Machine Translation

Machine Translation (MT) investigates the use of software to translate text or speech from one language to another.

Google Research work:

With a set of new LLM techniques, 24 new languages spoken by 300 million people were added to Google Translate. (https://arxiv.org/abs/2201.03110 )

Pre-trained Language Models

Large pre-trained language models continue to grow in size. However, as models become larger, storing and serving a tuned copy of the model for each downstream task becomes impractical.

Google Research work:

The learned soft prompts technique allows a single large pre-trained language model to be shared across thousands of different tasks. (https://ai.googleblog.com/2022/02/guiding-frozen-language-models-with.html )
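The structure behind this idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the embedding table, vector sizes, task names, and function names are hypothetical, the numbers are random, and the training step that would actually learn each prompt is left out.

```python
import random

EMBED_DIM = 4      # toy vector size; real models use thousands of dimensions
PROMPT_LENGTH = 2  # number of soft prompt vectors per task (assumed value)

# The large model is frozen and shared by every task. Here it is reduced to
# a toy word-embedding table with random values.
_rng = random.Random(0)
FROZEN_EMBEDDINGS = {
    word: [_rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for word in ["great", "movie", "bad"]
}


def new_soft_prompt(seed):
    """Create a small block of vectors that would be tuned for one task."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
            for _ in range(PROMPT_LENGTH)]


# One small prompt per task is stored instead of one model copy per task.
task_prompts = {
    "sentiment": new_soft_prompt(1),
    "topic": new_soft_prompt(2),
}


def build_model_input(task, words):
    """Prepend the task's soft prompt to the frozen embeddings of the words."""
    return task_prompts[task] + [FROZEN_EMBEDDINGS[w] for w in words]


sentiment_input = build_model_input("sentiment", ["great", "movie"])
topic_input = build_model_input("topic", ["great", "movie"])
```

Both inputs reuse exactly the same frozen word embeddings; only the few prompt vectors at the front differ per task. That is why the storage and serving cost per additional task stays small.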

Emergent Abilities

Large language models show surprising characteristics that are not present in small models, such as the ability to perform tasks that were not seen during training.

Google Research work:

Research reveals dozens of examples of emergent abilities that result from scaling up language models: abilities that are present in larger models but not in small ones. (https://ai.googleblog.com/2022/11/characterizing-emergent-phenomena-in.html )

Computer Vision

Computer vision in machine learning refers to the use of AI algorithms to process and analyze visual data, such as images and videos. It has various applications in fields such as image recognition, object detection, image segmentation, and facial recognition. These algorithms can be trained on large datasets to recognize patterns and objects in images, and are used in various industries such as healthcare, retail, and security.

Object Detection

Object detection is a computer vision technique that involves identifying and locating objects within an image or video. It uses machine learning algorithms to analyze visual data and detect the presence and location of specific objects.
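As a rough illustration of what "detecting the presence and location" means in practice, a detector's output can be thought of as a label, a confidence score, and a bounding box. The sketch below also computes intersection over union (IoU), a commonly used measure of how well a predicted box overlaps a reference box. The class, field names, and numbers are assumptions for this example and are not taken from MaxViT or Pix2Seq.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # what the object is
    score: float  # how confident the model is, from 0 to 1
    box: tuple    # where it is: (x_min, y_min, x_max, y_max) in pixels


def intersection_over_union(box_a, box_b):
    """Return overlap area divided by combined area of two boxes (0 to 1)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Made-up prediction and reference box for illustration.
predicted = Detection(label="dog", score=0.9, box=(0, 0, 10, 10))
reference_box = (5, 0, 15, 10)
iou = intersection_over_union(predicted.box, reference_box)
print(predicted.label, round(iou, 2))
```

An IoU of 1 means the predicted box matches the reference exactly, and 0 means the two boxes do not overlap at all.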

Google Research work:

MaxViT: Multi-Axis Vision Transformer outperforms other SOTA models on the ImageNet-1k classification task and various object detection tasks, but with significantly lower computational costs. (https://ai.googleblog.com/2022/09/a-multi-axis-approach-for-vision.html )
Pix2Seq: A Language Modeling Framework for Object Detection achieves competitive results on the large-scale object detection COCO dataset compared to existing highly-specialized and well-optimized detection algorithms. (https://ai.googleblog.com/2022/04/pix2seq-new-language-interface-for.html )

2D Photo to 3D Structure

Another long-standing challenge in computer vision is to better understand the 3D structure of real-world objects from one or a few 2D images.

Google Research work:

FILM: Frame Interpolation for Large Motion creates short slow-motion videos from two pictures that were taken many seconds apart. (https://ai.googleblog.com/2022/10/large-motion-frame-interpolation.html )
View Synthesis: the new LFNR and GPNR techniques tackle a long-standing challenge in computer vision and enable high-quality view synthesis of novel scenes from just a couple of images of the scene. (https://ai.googleblog.com/2022/09/view-synthesis-with-transformers.html )

By combining LFNR and GPNR, models are able to produce new views of a scene given only a few images of it. These models are particularly effective when handling view-dependent effects such as refraction and translucency, for example on the test tubes in still images from the NeX/Shiny dataset.

LOLNeRF: Learn from One Look learns the typical 3D structure of a class of objects, such as cars, human faces or cats, but only from single views of any one object, never the same object twice. (https://ai.googleblog.com/2022/09/lolnerf-learn-from-one-look.html )

Multimodality

Most past ML work has focused on models that deal with a single modality of data (e.g., language models, image classification models, or speech recognition models). However, people interact with the world through multiple sensory streams (e.g., we see objects, hear sounds, read words, feel textures and taste flavors), combining information and forming associations between senses.

Google Research work:

Multimodal Bottleneck Transformer (MBT) is a new approach that achieves SOTA results on video classification tasks, with a 50% reduction in FLOPs compared to a vanilla multimodal transformer model. (https://ai.googleblog.com/2022/03/multimodal-bottleneck-transformer-mbt.html )
DeViSE combines image representations and word-embedding representations to improve image classification accuracy, even on unseen object categories. (https://papers.nips.cc/paper/2013/hash/7cce53cf90577442771720a370c3c723-Abstract.html)
The Locked-image Tuning (LiT) method adds language understanding to an existing pre-trained image model. (https://ai.googleblog.com/2022/04/locked-image-tuning-adding-language.html )
PaLI performs many tasks spanning vision, language, and multimodal applications, such as visual question answering, image captioning, object detection, image classification, optical character recognition, and text reasoning, in over 100 languages, with SOTA results across many different benchmarks. (https://ai.googleblog.com/2022/09/pali-scaling-language-image-learning-in.html )

VideoQA - Video Question Answering

Video question answering (VideoQA) in AI involves using machine learning algorithms to automatically answer questions about a given video. It involves analyzing the video content, recognizing objects and scenes, and generating text-based answers.

Google Research work:

The iterative co-tokenization approach connects video content with text or natural language for VideoQA while achieving a 50% compute reduction compared to SOTA techniques. (https://ai.googleblog.com/2022/08/efficient-video-text-learning-with.html)

Audio Dialog Replacement on Video

Audio dialog replacement in AI involves replacing the audio of a video with a new audio track while keeping the lip movements of the original speakers synchronized. It is used in film and television production to redub or add additional language tracks to existing videos.

Google Research work:

VDTTS: Visually-Driven Text-To-Speech shows substantial improvements on video-sync, speech quality, and speech pitch and can produce video-synchronized speech without any explicit constraints or losses. (https://ai.googleblog.com/2022/04/vdtts-visually-driven-text-to-speech.html )

Natural Conversations

Natural conversations refers to the ability of computer systems to participate in human-like text-based or spoken conversations. These systems use machine learning algorithms to understand the context and respond appropriately to users' inputs.

Google Research work:

Look and Talk makes interacting with Google Assistant much more natural by analyzing audio, video, and text to differentiate intentional interactions from passing glances in order to accurately identify a user's intent to engage with Assistant. (https://ai.googleblog.com/2022/07/look-and-talk-natural-conversations.html )

3D Box Detection of Objects

3D box detection of objects involves using computer vision algorithms to detect and locate objects in 3D space within a given image or video. It involves generating a bounding box around an object and estimating its location in 3D, providing more information than traditional 2D object detection.

Google Research work:

4D-Net substantially improves accuracy in 3D object recognition by effectively combining 3D LiDAR point clouds and onboard camera RGB images for autonomous vehicle applications. (https://ai.googleblog.com/2022/02/4d-net-learning-multi-modal-alignment.html )

Generative Models

Image Generation

Image generation involves using machine learning algorithms to generate new images based on a given set of examples. This can include creating new images from scratch or modifying existing images in specific ways, such as changing the color, texture, or appearance of an object.

Google Research work:

Imagen: Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. The work offers several advances to diffusion-based image generation, including a new memory-efficient architecture called Efficient U-Net and Classifier-Free Diffusion Guidance, which improves performance by occasionally “dropping out” conditioning information during training. (https://imagen.research.google/)
Parti: Parti is an autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. (https://parti.research.google/ )
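The classifier-free guidance idea mentioned for Imagen can be sketched roughly as follows. This is a simplified illustration, not Imagen's implementation: the function names, the plain-list "embeddings", and all numbers are assumptions made for the example. The first function shows the occasional dropping of conditioning information during training; the second shows the commonly used way of combining conditional and unconditional predictions when generating an image.

```python
import random


def maybe_drop_conditioning(text_embedding, null_embedding, drop_probability, rng):
    """Training time: occasionally replace the text conditioning with an
    'empty' one, so the same model also learns to generate without text."""
    if rng.random() < drop_probability:
        return null_embedding
    return text_embedding


def guided_prediction(conditional, unconditional, guidance_weight):
    """Generation time: move the prediction away from the unconditional
    output and toward the text-conditioned one.

    A weight of 1.0 returns the conditional prediction unchanged; larger
    weights follow the text prompt more strongly.
    """
    return [u + guidance_weight * (c - u)
            for c, u in zip(conditional, unconditional)]


# Made-up values for illustration.
rng = random.Random(0)
text_embedding = [0.2, 0.7]
null_embedding = [0.0, 0.0]
used = maybe_drop_conditioning(text_embedding, null_embedding, 0.1, rng)

stronger = guided_prediction([1.0, 2.0], [0.0, 0.0], guidance_weight=2.0)
print(used, stronger)
```

The guidance weight acts as a dial: higher values make the image follow the prompt more closely, usually at some cost to variety.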

User Control

User control in image generation refers to the ability of the user to influence the output of an AI image generation system. This can include specifying certain attributes of the generated image, such as color, shape, or texture, or providing input images that serve as a starting point for the generation process.

Google Research work:

DreamBooth: Users are able to fine-tune a trained model like Imagen or Parti to generate new images based on a combination of text and user-furnished images. (https://dreambooth.github.io/)
Imagen Editor & EditBench: Imagen Editor is a text-guided image inpainting editor whose edits are faithful to the text prompts. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high-resolution image. EditBench is a systematic benchmark for text-guided image inpainting. It evaluates inpainting edits on natural and generated images, exploring objects, attributes, and scenes. (https://imagen.research.google/editor/ )

Generative Video

Generative video refers to the creation of new video content using artificial intelligence algorithms. This involves generating original videos, such as animations, special effects, or scene transitions, based on a set of input parameters and training data.

Google Research work:

Imagen Video: a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. (https://imagen.research.google/video/ )
Phenaki: a model that can synthesize realistic videos from textual prompt sequences. It addresses three known challenges of generating videos from text: high computational cost, variable video lengths, and the limited availability of high-quality text-video data. (https://phenaki.research.google/ )

Phenaki video generated from the complex prompt, “A photorealistic teddy bear is swimming in the ocean at San Francisco. The teddy bear goes under water. The teddy bear keeps swimming under the water with colorful fishes. A panda bear is swimming under water.”

Generative Audio

Generative audio refers to the creation of new audio content using artificial intelligence algorithms. This involves generating original audio tracks, such as music, speech, or sound effects, based on a set of input parameters and training data.

Google Research work:

AudioLM: a new framework for audio generation that learns to generate realistic speech and piano music by listening to audio only. Audio generated by AudioLM demonstrates long-term consistency (e.g., syntax in speech, melody in music) and high fidelity, outperforming previous systems and pushing the frontiers of audio generation with applications in speech synthesis or computer-assisted music. (https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html )

Responsible AI

Responsible AI refers to the ethical and socially responsible development and deployment of artificial intelligence technologies. It involves considering factors such as fairness, transparency, privacy, and accountability in the design and use of AI systems to ensure that they have a positive impact on society.

Google Research work:

AI Principles update for 2022: Google expanded Responsible Innovation, its central operations team for AI Principles implementation across Google’s product development lifecycle, and recently moved it into Google’s company-wide Office of Compliance and Integrity for more centralized governance across all Google product areas. Google describes this as a milestone moment that reflects the growing maturity of its governance strategy. (https://ai.google/static/documents/ai-principles-2022-progress-update.pdf )
