What I've been reading recently
Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
LLM-based contextual embeddings have been very effective in capturing long-range dependencies in text.
The traditional "naive" approach splits the text into chunks and embeds each chunk independently, so contextual information is lost across chunk boundaries.
Gunther et al. propose late chunking, which first processes the entire input text with the transformer and only then splits the resulting token embeddings into chunks, pooling each chunk into its own embedding, so every chunk embedding retains contextual information from the whole document.
For even longer documents that exceed the model's context window, Gunther et al. further divide the document into overlapping macro chunks, each encompassing multiple smaller chunks, which they term long late chunking.
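To make the idea concrete, here is a minimal sketch of late chunking (my own illustration, not the authors' code): embed the whole document once, then mean-pool the token embeddings that fall inside each chunk. The model name and the character-offset chunk boundaries are placeholder assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: any long-context embedding model works here; this is one example.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunk(text: str, chunk_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """chunk_spans are (start, end) character offsets of each chunk in `text`."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]               # (num_tokens, 2) char offsets
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]   # (num_tokens, dim), full-document context

    chunk_embeddings = []
    for start, end in chunk_spans:
        # select tokens whose character span lies inside this chunk (skip special tokens)
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        chunk_embeddings.append(token_embs[mask].mean(dim=0))  # mean-pool per chunk
    return chunk_embeddings
```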
LLM Multi-Agent Systems: Challenges and Open Problems
Multi-agent systems (MAS), consisting of collaborative agents with distinct roles, have shown potential for solving complex tasks through coordination.
However, MAS face many challenges, and existing works have not yet fully addressed the challenges of task optimization, robust reasoning, and memory management within these systems.
Han et al. propose key improvements to tackle these issues, such as iterative debates to foster stronger reasoning, and layered context management to handle intricate information flows.
They also emphasize the importance of memory strategies to facilitate agent collaboration over time. Notably, the paper highlights potential future applications of multi-agent systems within blockchain environments, suggesting promising directions for real-world distributed systems.
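As a toy illustration of the iterative-debate idea (my own sketch, not code from the paper), two agents could revise their answers after reading each other's arguments before a final judge call picks an answer; `call_llm` is a placeholder for any chat-completion client.

```python
def call_llm(prompt: str) -> str:
    # placeholder: plug in your preferred LLM client here
    raise NotImplementedError

def debate(question: str, rounds: int = 3) -> str:
    answers = {"A": call_llm(f"Answer the question: {question}"),
               "B": call_llm(f"Answer the question: {question}")}
    for _ in range(rounds):
        for agent, other in (("A", "B"), ("B", "A")):
            # each agent critiques the other's argument and revises its own answer
            answers[agent] = call_llm(
                f"Question: {question}\n"
                f"Your previous answer: {answers[agent]}\n"
                f"The other agent argued: {answers[other]}\n"
                "Critique their reasoning and give your revised answer."
            )
    # a final judge call merges or selects the answers
    return call_llm(f"Question: {question}\nAnswer A: {answers['A']}\n"
                    f"Answer B: {answers['B']}\nGive the best final answer.")
```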
LoRA: Low-Rank Adaptation of Large Language Models
I have always wanted to fine-tune LLMs, but as a student with limited compute it seemed quite impossible to do so.
I was researching parameter-efficient fine-tuning (PEFT) methods for LLMs and came across LoRA, a way of reducing the number of parameters that need to be trained while still being able to fine-tune a large language model for a specific task.
LoRA freezes the pre-trained model's weights and injects trainable low-rank decomposition matrices into each layer of the transformer architecture, reducing the number of trainable parameters by up to 10,000 times.
LoRA makes it possible to fine-tune pre-trained language models for specific use cases with limited resources, and it seems to be widely used in industry today, according to my seniors and mentors.
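To see why so few parameters end up trainable, here is a minimal PyTorch sketch of the LoRA idea (my own illustration, not the reference implementation): the pre-trained weight W is frozen and only the low-rank factors A and B are updated, so the effective weight becomes W + (alpha/r) * B @ A. The rank r=8 and alpha=16 are just typical placeholder values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B starts at zero, so training starts from W
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus the trainable low-rank update
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# e.g. wrapping one 768x768 projection: only A and B (2 * 8 * 768 values) are trainable
layer = LoRALinear(nn.Linear(768, 768), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```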