2025 LLM Trends: from FM to AI Agent
Source: 2025 LLM Trends: from FM to AI Agent | by LG AI Research | Medium
LG AI Research
Jun 16, 2025
1. Introduction
In recent years, large language models (LLMs) have grown explosively. No longer just assistants that answer simple questions, they now demonstrate advanced reasoning, solving Ph.D.-level questions and competition-level math problems. Building on this reasoning ability, they use a variety of tools, such as search, computer control, API calls, and code execution, to perform complex tasks on behalf of users, which greatly improves their usability.
Amid these developments, LG AI Research has contributed to the advancement of LLM technology by releasing the EXAONE 3.0, EXAONE 3.5, and EXAONE Deep [1,2,3] models, which have achieved leading performance on various benchmarks. We are continuing our research to develop models with even better performance and enhanced usability.
This post looks at recent highlights across foundation models, post-training, and agents, and discusses where LLMs are headed.
2. Foundation Models: The Core of LLM Advancement
A foundation model is a model pre-trained on a large dataset. The most notable example is a Transformer language model trained to predict the next token on massive web data. Recently, foundation models have become a core component of a wide array of LLM applications.
The performance and architecture of the foundation model have a significant impact on all subsequent training stages. Once a model architecture and training method have been determined at the initial design stage, modifying them later is challenging. Thus, it is crucial to carefully consider their implications not only for pre-training but also for subsequent post-training and inference stages when designing a foundation model. In addition, pre-training a foundation model is typically the most time-consuming and expensive stage of the entire training process, so an efficient training methodology is important.
Pre-training of a foundation model mainly involves scaling the size of the model and the training data according to the scaling law[9, 10]. However, determining what data to use and which model architectures to adopt within limited resources remains a crucial challenge. Next, we will discuss the Mixture-of-Experts (MoE) model and FP8 training method for efficiently scaling model size.
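To make the budgeting question concrete, here is a minimal sketch of how a fixed compute budget might be split between model size and training tokens. It relies on two widely quoted rules of thumb that are assumptions here, not figures from this post: training compute of roughly 6 × N × D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. The function names are hypothetical.

```python
import math

# Assumption: training compute C is approximately 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6
# Assumption: a rough compute-optimal ratio of training tokens to parameters.
TOKENS_PER_PARAM = 20


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Split a FLOPs budget into (parameters, tokens) under the assumed ratio.

    With D = 20 * N, C = 6 * N * D = 120 * N**2, so N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


if __name__ == "__main__":
    budget = 1e21  # hypothetical budget in FLOPs
    n, d = compute_optimal_split(budget)
    print(f"params ~ {n:.3e}, tokens ~ {d:.3e}")
```

Real scaling-law fits depend on the data, architecture, and training recipe, so the constants above should be treated as placeholders rather than recommendations.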
Mixture-of-Experts for Efficient Large-Scale Model Training
Various architectures are being studied for more efficient training of foundation models. A notable recent one is the Mixture-of-Experts (MoE) architecture, which enables large-scale model training at relatively low computational cost.
In an MoE architecture, the Feed-Forward Network (FFN) layer of a traditional transformer is replaced by an MoE layer. This MoE layer consists of multiple FFNs, termed "experts," and activates only a subset of them for each input token. From a computational perspective, transformers that run every token through one full FFN layer are called dense models, while MoE models, which use only a few experts per token, are called sparse models.
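The routing described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation of any particular model: the router is a single linear layer followed by a softmax, the experts are two-layer ReLU FFNs, and the gate weights are renormalized over the selected experts, which is one common choice. All names and sizes are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def ffn(x, w_in, w_out):
    """A two-layer feed-forward expert with ReLU activation."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


def moe_layer(x, router_w, experts, top_k):
    """Route one token to its top_k experts and mix their outputs."""
    probs = softmax(matvec(router_w, x))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize gates over chosen experts
    out = [0.0] * len(x)
    for i in chosen:  # only the chosen experts are evaluated
        w_in, w_out = experts[i]
        gate = probs[i] / norm
        out = [o + gate * y for o, y in zip(out, ffn(x, w_in, w_out))]
    return out, chosen


rng = random.Random(0)
d_model, d_hidden, n_experts = 8, 16, 4
router_w = rand_matrix(n_experts, d_model, rng)
experts = [
    (rand_matrix(d_hidden, d_model, rng), rand_matrix(d_model, d_hidden, rng))
    for _ in range(n_experts)
]
token = [rng.gauss(0, 1) for _ in range(d_model)]
output, chosen = moe_layer(token, router_w, experts, top_k=2)
print("experts used for this token:", chosen)
```

The loop over `chosen` is where the sparsity comes from: the other experts hold parameters but contribute no computation for this token.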
Compared to dense models with similar computational budgets or numbers of active parameters, MoE models can support larger total parameter counts. While MoE models may exhibit slightly lower performance than dense models of equivalent size under similar training conditions (primarily due to training instability), they can compensate for this by leveraging their computational efficiency to process substantially more training tokens. MoE models typically offer faster training and inference speeds, and greater computational efficiency since fewer parameters are actively used at any given time.
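The gap between total and active parameters can be illustrated with simple arithmetic. The sketch below counts only the FFN weights of a single layer and ignores biases, attention, and embeddings. The layer sizes are hypothetical and chosen purely for illustration.

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Weights in one two-matrix FFN (biases ignored)."""
    return 2 * d_model * d_hidden


def moe_ffn_params(d_model: int, d_hidden: int, n_experts: int, top_k: int):
    """Return (total, active-per-token) weights for one MoE layer."""
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * n_experts  # one linear router, assumed for this sketch
    total = n_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


if __name__ == "__main__":
    # Hypothetical sizes: 8 experts, 2 active per token.
    total, active = moe_ffn_params(d_model=4096, d_hidden=14336, n_experts=8, top_k=2)
    print(f"total: {total:,}  active per token: {active:,}  ratio: {total / active:.2f}")
```

Per-token compute scales with the active count, while model capacity scales with the total, which is the trade-off the paragraph above describes.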
Research on MoE architectures has primarily focused on techniques for reliable and effective training. Key areas include optimizing the number and size of experts, and improving training stability and effectiveness through methods such as auxiliary losses that promote balanced token routing across experts and techniques that avoid token dropping. Examples of these techniques can be found in Switch Transformer [4], Mixtral [5], and MegaBlocks [6].
DeepSeek proposed DeepSeekMoE [7], which was recently applied to the DeepSeek-V3 [8] model. DeepSeekMoE is characterized by fine-grained expert segmentation and shared experts. Fine-grained segmentation encourages each routed expert to learn more specialized knowledge, while the shared experts handle the common knowledge that every token needs. This division lets both kinds of expert learn their respective roles more efficiently.
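As an illustration of an auxiliary balancing loss, the sketch below follows the general form described for Switch Transformer: the number of experts multiplied by the sum, over experts, of the fraction of tokens dispatched to each expert times the mean router probability for that expert. The scaling coefficient that is normally applied to this term is omitted, and the function name is hypothetical.

```python
def load_balancing_loss(router_probs, assignments, n_experts):
    """Auxiliary loss that is smallest when routing is uniform.

    router_probs: per-token list of router probabilities, one per expert.
    assignments:  per-token index of the expert the token was sent to.
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    frac_tokens = [0.0] * n_experts
    for expert in assignments:
        frac_tokens[expert] += 1.0 / n_tokens
    # P_i: mean router probability assigned to expert i
    mean_prob = [
        sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)
    ]
    return n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)
skewed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)
print(balanced, skewed)
```

With perfectly uniform routing the value is 1, and it grows as tokens concentrate on fewer experts, so adding it to the training objective pushes the router toward balance.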
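The combination of shared and routed experts can be sketched as follows. This is a simplified illustration and not DeepSeek's implementation: experts are arbitrary callables, router scores are passed in directly, and the gate for each selected expert is assumed to be its softmax probability. The helper `routing_combinations` is a hypothetical name used to show why finer segmentation gives the router more ways to combine experts.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shared_plus_routed_moe(x, shared_experts, routed_experts, router_scores, top_k):
    """Residual + always-on shared experts + gated top_k routed experts."""
    probs = softmax(router_scores)
    chosen = sorted(
        range(len(routed_experts)), key=lambda i: probs[i], reverse=True
    )[:top_k]
    out = list(x)  # residual connection
    for expert in shared_experts:  # shared experts see every token
        out = [o + y for o, y in zip(out, expert(x))]
    for i in chosen:  # routed experts are selected per token
        out = [o + probs[i] * y for o, y in zip(out, routed_experts[i](x))]
    return out, chosen


def routing_combinations(n_experts: int, top_k: int) -> int:
    """Number of distinct expert subsets the router can select."""
    return math.comb(n_experts, top_k)


def scale_expert(factor):
    """Toy expert that scales its input; stands in for an FFN."""
    return lambda x: [factor * v for v in x]


x = [1.0, 2.0]
out, chosen = shared_plus_routed_moe(
    x,
    shared_experts=[scale_expert(1.0)],
    routed_experts=[scale_expert(10.0), scale_expert(20.0), scale_expert(30.0)],
    router_scores=[0.0, 5.0, 1.0],
    top_k=1,
)
print("routed expert chosen:", chosen)
# Splitting each of 4 experts into 4 smaller ones and activating 4 instead of 1
# keeps the active size similar but allows many more combinations.
print(routing_combinations(4, 1), routing_combinations(16, 4))
```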
Basic MoE model and DeepSeekMoE model architecture.
FP8 Training: Enhancing Speed and Efficiency
Low-precision training is gaining traction to make LLMs faster and more efficient. Currently, many models are trained in BF16 (Bfloat16), a 16-bit floating-point format, but techniques for training at even lower precisions, such as FP8, are becoming increasingly important.
The precision used for LLM training has evolved from the initial 32 bits (FP32) to 16-bit formats such as BF16, and more recently there has been active research into going lower still, to 8 bits (FP8). Compared with 16-bit training, FP8 theoretically offers up to a 2x computational speedup and lower memory usage. It also improves training efficiency by reducing the volume of data communicated between GPUs and nodes, a significant bottleneck in large-scale model training.
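The memory and communication savings follow directly from the number of bytes per value. The sketch below computes the size of a tensor in each format. The parameter count is hypothetical, and the figures cover only a single copy of the values, so they do not account for optimizer state or any higher-precision master weights a real training setup may keep.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"fp32": 4, "bf16": 2, "fp8": 1}


def tensor_gib(n_values: int, dtype: str) -> float:
    """Size in GiB of a tensor with n_values elements stored as dtype."""
    return n_values * BYTES_PER_VALUE[dtype] / 2**30


if __name__ == "__main__":
    n_values = 7_000_000_000  # hypothetical parameter count
    for dtype in ("fp32", "bf16", "fp8"):
        print(f"{dtype}: {tensor_gib(n_values, dtype):.1f} GiB")
```

The same ratio applies to tensors sent between GPUs, which is why halving the bytes per value also halves the communication volume for those tensors.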
However, there is a downside to lowering precision, as it can negatively impact model performance by reducing training stability. Therefore, the key to low-precision training techniques is to train reliably while minimizing this performance degradation.
Mixed Precision Framework
For FP8 training, DeepSeek-V3 proposed a mixed precision framework [8]. It uses FP8 for computationally intensive modules while keeping modules that are sensitive to numerical precision in BF16 or FP32, which preserves numerical stability and prevents performance degradation.
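To give a feel for the idea, the following sketch simulates an FP8 matrix multiplication in plain Python. It is an illustration under several assumptions and not DeepSeek-V3's recipe: values are rounded to the E4M3 format (3 mantissa bits, largest finite value 448) in software, a single scale per tensor maps the largest magnitude to 448, and products are accumulated in ordinary Python floats to stand in for higher-precision accumulation. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0           # largest finite E4M3 value
E4M3_MANTISSA_BITS = 3
E4M3_MIN_NORMAL_EXP = -6   # below this exponent, values are subnormal


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value representable in E4M3, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                     # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)   # spacing between neighbours
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize(matrix):
    """Scale a matrix into the E4M3 range and round every entry."""
    amax = max(abs(v) for row in matrix for v in row)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [[round_to_e4m3(v * scale) for v in row] for row in matrix], scale


def fp8_matmul(a, b):
    """Multiply simulated-FP8 inputs, accumulate in high precision, then rescale."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    cols_b = list(zip(*qb))
    return [
        [sum(x * y for x, y in zip(row, col)) / (scale_a * scale_b) for col in cols_b]
        for row in qa
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [1.5, 2.0]]
exact = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
approx = fp8_matmul(a, b)
print("exact: ", exact)
print("approx:", approx)
```

The small differences between the exact and approximate results come from rounding the inputs to 3 mantissa bits, which is the kind of error that motivates keeping precision-sensitive parts of the model in BF16 or FP32.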
In the figure below, the General Matrix Multiplication (GEMM) operation part in yellow is composed…