AI research atlas / v2

Learn AI papers in the right order.

Start with landmark ideas, move through foundations, then branch into LLMs, GenAI, agents, systems, and safety with a reading path that keeps the field from feeling random.

Start roadmap My reading

10 learning tracksFull-paper readerChatGPT handoff

Recommended firstLandmark papers

Build the mental timeline before going deep.

Then specializeLLMs, GenAI, safety

Move from foundations to modern systems.

Read modePDF + resources

Path-firstNo more random paper hopping

Research-nativearXiv links, PDFs, resources

Study loopTrack reading and discuss in ChatGPT

Learning path

Where to start, and what to read next

Start with landmarks

Orientation / 1-2 weeks

Start Here

Read the papers everyone keeps referencing so the rest of the map has anchors.

Know the landmark namesBuild historical contextPick a direction

Open papers

Foundations / 2-4 weeks

Classical ML

Learn the statistical and probabilistic ideas that still sit under modern models.

Bayesian thinkingModel evaluationUncertainty

Open papers

Foundations / 1-2 weeks

Optimization

Understand the training mechanics behind gradient-based learning.

Gradient descentGeneralizationTraining stability

Open papers

Builder / 3-5 weeks

Deep Learning Core

Move through representation learning, CNNs, residual networks, and scaling patterns.

CNN intuitionRepresentation learningBenchmark culture

Open papers

Builder / 3-6 weeks

Sequence Models and LLMs

Study attention, transformers, language modeling, instruction tuning, and evaluation.

AttentionPretrainingInstruction following

Open papers

Specialist / 3-6 weeks

Generative AI

Compare GANs, diffusion, autoregressive generation, and modern GenAI workflows.

DiffusionGANsGeneration tradeoffs

Open papers

Specialist / 2-4 weeks

Multimodal and Retrieval

Connect language with images, retrieval, embeddings, and real-world knowledge access.

Vision-languageEmbeddingsRetrieval

Open papers

Specialist / 3-5 weeks

RL and Agents

Learn decision making, feedback, policy learning, and agent-style systems.

PoliciesRewardsExploration

Open papers

Practitioner / 2-4 weeks

Systems and Scaling

Understand the infrastructure and engineering papers behind large-scale training.

Distributed trainingServingEfficiency

Open papers

Practitioner / 2-4 weeks

Safety and Interpretability

Study robustness, alignment, transparency, and how to reason about model behavior.

AlignmentRobustnessInterpretability

Open papers

Research library

Speech and Audio

Showing papers for this learning path. Open any paper card to read the full paper and related resources.

40 papers shown

unread2022

Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

Personalised speech enhancement (PSE), which extracts only the speech of a target user and removes everything else from a recorded audio clip, can potentially improve users' experiences of audio AI modules deployed in the wild. To support a large variety of downstream audio tasks, such as real-time ASR and audio-call enhancement, a PSE solution should operate in a streaming mode, i.e., input audio cleaning should happen in real-time with a small latency and real-time factor. Personalisation is typically achieved by extracting a target speaker's voice profile from an enrolment audio, in the form of a static embedding vector, and then using it to condition the output of a PSE model. However, a fixed target speaker embedding may not be optimal under all conditions. In this work, we present a streaming Transformer-based PSE model and propose a novel cross-attention approach that gives adaptive target speaker representations. We present extensive experiments and show that our proposed cross-attention approach outperforms competitive baselines consistently, even when our model is only approximately half the size.

Learn AI papers in the right order.

Where to start, and what to read next

Start Here

Classical ML

Optimization

Deep Learning Core

Sequence Models and LLMs

Generative AI

Multimodal and Retrieval

RL and Agents

Systems and Scaling

Safety and Interpretability

Architecture

Learning Paradigms

Applications

Trust and Deployment

Speech and Audio

Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

Dual-path Attention is All You Need for Audio-Visual Speech Extraction

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

dCoNNear: An Artifact-Free Neural Network Architecture for Closed-loop Audio Signal Processing

Adaptive Convolution for CNN-based Speech Enhancement Models

Interpreting the Role of Visemes in Audio-Visual Speech Recognition

MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks

Towards Improved Objective Perceptual Audio Quality Assessment -- Part 1: A Novel Data-Driven Cognitive Model

Speech Separation with Pretrained Frontend to Minimize Domain Mismatch

CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition

Changing Data Sources in the Age of Machine Learning for Official Statistics

Active learning for data streams: a survey

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

Configurable EBEN: Extreme Bandwidth Extension Network to enhance body-conducted speech capture

Privacy-preserving machine learning for healthcare: open challenges and future perspectives

Physics-Inspired Interpretability Of Machine Learning Models

Audio-Visual Speech Enhancement with Score-Based Generative Models

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

Learning Curves for Decision Making in Supervised Machine Learning: A Survey

Information Retrieval from the Digitized Books

Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

Pretrained audio neural networks for Speech emotion recognition in Portuguese

DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning

Supervised Contrastive Learning for Accented Speech Recognition

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch

Learning Audio-Visual Dereverberation

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

Interpretable Representation Learning for Speech and Audio Signals Based on Relevance Weighting

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

DOME: Recommendations for supervised machine learning validation in biology

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoders

Leveraging End-to-End Speech Recognition with Neural Architecture Search