What GPT-4.5 Represents: What are Pre-training and Post-training?

In artificial intelligence research, pre-training and post-training are two connected phases that work together to build today's advanced AI systems. While distinct in their methods and goals, these approaches complement each other in crucial ways.

The recently announced GPT-4.5 represents an improvement in pre-training, a critical step toward improving the downstream performance of other models, such as o3 and o1. This article examines how pre-training and post-training work, why each matters, and how these two complementary techniques are likely to shape the future of AI development.

Pre-training: Building the Intelligence Foundation

Definition: Pre-training is the first phase of developing an AI model, where it learns from massive amounts of data without specific instructions. During this process, the model discovers patterns and relationships on its own, building a foundation of knowledge that supports all its future capabilities.
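To make this concrete, here is a toy sketch of the self-supervised idea behind pre-training: the raw text itself supplies the prediction targets, so no human labels are needed. A simple bigram counter stands in for the transformer networks used in practice, and the tiny corpus is purely illustrative.

```python
from collections import Counter, defaultdict

# Toy "pre-training" corpus -- a stand-in for web-scale text (illustrative only).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Self-supervised objective: predict the next token from the current one.
# No labels are provided; the text itself supplies the prediction targets.
counts = defaultdict(Counter)
for current_tok, next_tok in zip(corpus, corpus[1:]):
    counts[current_tok][next_tok] += 1

def predict_next(token: str) -> str:
    """Return the most likely next token, learned purely from raw text."""
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))   # -> 'cat' or 'dog', depending on the counts
print(predict_next("sat"))   # -> 'on'
```

A real model replaces the counting table with billions of learned parameters, but the training signal is the same: patterns extracted from raw text, with no task-specific instructions.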

Raw Intelligence: The Core of Pre-training

Pre-training creates what can be called "raw intelligence" (i.e., the basic mental capacity that makes all other prediction possible). Like a brain developing through early experiences, pre-training builds neural connections to enable future growth. This phase creates several important abilities:

  • Pattern Recognition: The model learns to identify language patterns, relationships between concepts, and recurring structures without being explicitly taught

  • Background Knowledge: The model absorbs facts, ideas, and concepts from its training data

  • Flexible Understanding: The model develops ways to represent information that work across different situations and topics

  • Creative Foundations: The model builds the basic skills needed to generate new content based on what it has learned

The quality of pre-training directly limits what the model can ultimately achieve. A model with better pre-training shows:

  1. New abilities that only emerge after reaching certain size thresholds

  2. Better performance on new tasks with minimal additional training

  3. Improved ability to work across languages and different types of information

  4. Easier fine-tuning for specific applications

Research shows that the quality of pre-training sets a ceiling on how well a model can ultimately perform, regardless of what happens later. Models with superior pre-training demonstrate measurably better abilities:

  • More subtle understanding of meaning and context

  • Better recall of factual information

  • Stronger ability to spot complex patterns

  • Better learning from examples within a conversation

  • Higher likelihood of surprising new capabilities as models grow larger

As researchers at OpenAI and elsewhere deliver better pre-trained models (e.g., GPT-4.5) through better model designs, more computing power, and higher-quality datasets, the overall intelligence of emerging AI systems will continue to grow. The figure below shows a simplified performance comparison of three hypothetical AI systems.

Growth in Training Data and Compute Power

To accomplish the type of pre-training outlined above, AI systems have seen exponential growth in the scale of training data and compute used, especially over the last decade. Modern deep learning models are trained on vastly larger datasets (essentially the entire internet plus whatever else companies can gather) and use dramatically more computational power than those from the 2010s and before. According to recent analyses, the compute used to train frontier models has been increasing by about 4–5× per year since roughly 2010 [epoch.ai]. This has led to training runs consuming enormous numbers of floating-point operations (FLOPs). For example, OpenAI’s GPT-4 is estimated to have used on the order of 2.2×10^25 FLOPs during training [lesswrong.com]. One report finds dataset sizes have grown roughly 3× per year from 2010 to 2024, with the largest recent models consuming trillions of tokens of text. In addition to private models, premier open-source models, like Meta’s Llama 2 (2023), draw on a large share of the world’s readily available digital text. Llama 2 was trained on 2 trillion tokens of data [ibm.com], which represents a massive increase in “raw intelligence” (i.e., knowledge) compared to earlier open-source models.
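For intuition about where figures like 2.2×10^25 FLOPs come from, a widely used rule of thumb estimates dense-transformer training cost as roughly 6 × parameters × training tokens. The sketch below applies it to the Llama 2 70B figures quoted above; the 4.5×-per-year growth factor is simply the midpoint of the range cited from epoch.ai.

```python
# Back-of-the-envelope training-compute estimate using the common
# "FLOPs ~= 6 * parameters * tokens" heuristic for dense transformers.
def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Llama 2 70B: ~70e9 parameters trained on ~2e12 tokens (figures quoted above).
llama2_70b = training_flops(70e9, 2e12)
print(f"Llama 2 70B: ~{llama2_70b:.1e} FLOPs")   # ~8.4e+23 FLOPs

# Implied growth: at ~4.5x per year, frontier training compute rises
# by roughly three orders of magnitude every five years.
growth_per_year = 4.5
print(f"5-year multiplier: ~{growth_per_year ** 5:.0f}x")  # ~1845x
```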

As additional food for thought, researchers estimate the total stock of high-quality text on the public internet is on the order of 300 trillion tokens, and at current growth rates even that vast supply could be exhausted by the late 2020s. Perhaps some companies are already approaching that limit.
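A rough sketch of that arithmetic: starting from a Llama 2-scale dataset (an assumed baseline for the example) and assuming training sets keep growing about 3× per year, the 300-trillion-token stock is matched within a handful of years.

```python
import math

stock_tokens = 300e12       # estimated high-quality public text (figure from the article)
current_tokens = 2e12       # assumed starting point (a Llama 2-scale dataset)
growth_per_year = 3.0       # assumed ~3x annual growth in training-set size

years_to_exhaust = math.log(stock_tokens / current_tokens) / math.log(growth_per_year)
print(f"~{years_to_exhaust:.1f} years until the 300T-token stock is matched")  # ~4.6 years
```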

Just as training dataset sizes for these LLMs have expanded, so has the compute (i.e., GPUs) required to process them, which helps explain NVIDIA’s extraordinary rise in valuation over recent years.

To put these trends in perspective, the table below highlights a few notable AI models over time, along with their scale in terms of parameters, training data, and compute, as well as performance on a representative benchmark:

This explosive growth in data and compute has driven remarkable performance gains. Models have moved from near-chance accuracy on difficult benchmarks to human-level competency within just a few years. For example, the original GPT-3 (175B) model could correctly answer only ~44% of MMLU questions [en.wikipedia.org]. Just three years later, GPT-4 and Gemini demonstrate scores in the mid-80s to 90%, closing in on or exceeding expert-human performance [lesswrong.com, lseg.com]. Similar trends appear in other domains: image recognition accuracy on ImageNet improved from ~84% in 2012 to beyond 95% by 2020, and game-playing AIs progressed from struggling with Atari games to defeating world champions in Go and StarCraft. These improvements correlate strongly with increases in model size, training data, and compute, though algorithmic innovations also played a role.

It’s worth noting that hardware advancements and software efficiency have made larger training runs possible. Specialized AI accelerators (GPUs/TPUs) continue to improve in speed and cost-efficiency, albeit at a slower pace than the growth in compute demand. A recent analysis of 47 machine learning hardware systems found that raw processing performance (FLOP/s) has been doubling roughly every 2.3 years, while memory capacity and bandwidth for AI chips double roughly every 4 years. Additionally, gains in algorithmic efficiency (better architectures and training methods, as demonstrated by DeepSeek and others) mean that, for a given performance level, newer models may require less compute than older ones; some estimates suggest a roughly 5% decrease in required compute each year for the same task performance.

A landmark example was the transition from ever-larger dense models to more compute-optimal ones: DeepMind showed that many 2021-era models were undertrained for their size, and that a smaller model trained on more data (within the same compute budget) could outperform much larger models [arxiv.org]. This insight, known as the Chinchilla scaling law, has guided subsequent training efforts to use data and compute more efficiently.
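A minimal sketch of what "compute-optimal" means in practice, combining the 6 × N × D cost rule with the frequently cited Chinchilla heuristic of about 20 training tokens per parameter; the 1e24-FLOP budget below is an arbitrary illustrative figure.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget between parameters (N) and training tokens (D).

    Uses C ~= 6*N*D with the Chinchilla heuristic D ~= 20*N,
    so C ~= 120*N**2  =>  N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e24-FLOP budget (arbitrary illustrative figure).
n, d = chinchilla_optimal(1e24)
print(f"~{n:.2e} parameters trained on ~{d:.2e} tokens")
# -> roughly 9e10 parameters (~90B) trained on ~1.8e12 tokens (~1.8T)
```

Plugging in the Chinchilla paper's own budget reproduces its ~70B-parameter, ~1.4T-token recipe, which is why the heuristic became a common planning tool.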

Post-training: Refining and Directing Intelligence

Definition: Post-training includes all the techniques applied after pre-training to shape how a model uses its knowledge. This includes fine-tuning, alignment methods, specialized training for specific skills, and approaches to make models safer and more useful for particular tasks.

Process: Developing Disciplined Thinking

Post-training develops what we can call "process": the structured methods, frameworks, and guidelines that channel raw intelligence toward useful goals. This resembles how education takes natural human abilities and shapes them through structured learning, practice, and feedback. Techniques like chain-of-thought reasoning and other innovations are introduced at this stage.

During post-training, models learn:

  • Structured Reasoning: Step-by-step approaches to solving problems, beyond simply recognizing patterns

  • Value Alignment: How to incorporate human preferences and ethical considerations

  • Specialized Skills: Refined abilities for specific fields or applications

  • Output Quality Control: How to produce responses that are more accurate, relevant, and helpful

The sophistication of post-training determines how effectively a model can use its underlying capabilities. Advanced post-training enables:

  • Better step-by-step reasoning through complex problems

  • More reliable following of instructions and constraints

  • Fewer factual errors and made-up information

  • Better alignment with human values and preferences

  • Greater usefulness across many different applications

Recent advances in post-training are creating increasingly sophisticated ways to get the most out of pre-trained models, effectively turning raw potential into practical abilities.
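One of the simplest such techniques is supervised instruction fine-tuning. The PyTorch sketch below shows the key mechanical detail: loss is computed only on the response tokens, with the prompt tokens masked out via the conventional -100 ignore index. The tiny vocabulary and random logits are illustrative stand-ins, not any particular lab's pipeline.

```python
import torch
import torch.nn.functional as F

vocab_size = 100
# A toy (prompt, response) pair already mapped to token ids -- illustrative only.
prompt_ids   = torch.tensor([5, 17, 42])      # e.g. "Translate: hello"
response_ids = torch.tensor([8, 23, 61, 2])   # e.g. "bonjour </s>"

input_ids = torch.cat([prompt_ids, response_ids])

# Labels: ignore the prompt portion (-100 is the conventional ignore index),
# so the model is supervised only on the response tokens.
labels = torch.cat([torch.full_like(prompt_ids, -100), response_ids])

# Stand-in for a language model's next-token logits (a real run uses the model's output).
logits = torch.randn(len(input_ids), vocab_size, requires_grad=True)

# Shift so position t predicts token t+1, as in standard causal LM training.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
loss.backward()   # gradients flow only from the response positions
print(float(loss))
```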

Effectiveness of Select Post-Training Techniques

Knowledge distillation, in which a smaller "student" model is trained to reproduce the outputs of a larger "teacher" model, is one such technique. In practice, distillation often preserves >90% of the teacher model’s accuracy while halving model size – a big win for deployment. For instance, a distilled version of a translation model might sacrifice only 1 BLEU point while being 2× faster. This technique is widely used to deploy large models on edge devices or under resource constraints.
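For reference, here is a minimal PyTorch sketch of the standard distillation objective, which blends a temperature-scaled KL term (the student mimicking the teacher's soft predictions) with the ordinary hard-label loss; the temperature, mixing weight, and toy tensors are illustrative choices, not settings from the examples above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL (student vs. teacher) with the ordinary hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                      # standard temperature scaling
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 4 examples, 10-class output (illustrative shapes only).
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)                    # frozen teacher predictions
labels  = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(float(loss))
```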

Fine-tuning vs. RLHF: It’s worth noting that instruction fine-tuning and RLHF are often used in sequence and complement each other. Fine-tuning alone can make a model follow prompts and formats better, but, qualitatively and anecdotally, RLHF yields larger gains in aligning AI with what people actually want. Industry practitioners have observed that RLHF leads to “better performance” than instruction tuning alone [interconnects.ai]. For example, the Llama 2-Chat pipeline first applied supervised fine-tuning on Q&A examples, then applied RLHF to further boost helpfulness and safety [ibm.com]. The result was a model whose responses were preferred over the base model’s the vast majority of the time in user studies, and that even rivaled some closed-source models. According to Meta’s evaluations, the Llama 2-Chat 70B model was rated as helpful as ChatGPT (GPT-3.5-turbo) on about 68% of prompts (36% win, 31.5% tie rate) [promptengineering.org] – a strong validation of fine-tuning plus RLHF on top of a good pre-trained base.
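The RLHF step in pipelines like Llama 2-Chat begins by training a reward model on human preference pairs; at its core is the simple pairwise objective sketched below (the random reward scores are stand-ins for a real reward model's outputs, which is an illustrative assumption).

```python
import torch
import torch.nn.functional as F

# Scalar reward scores for responses humans preferred vs. rejected.
# In a real pipeline these come from a reward-model head; here they are random stand-ins.
chosen_rewards   = torch.randn(8, requires_grad=True)
rejected_rewards = torch.randn(8, requires_grad=True)

# Bradley-Terry-style pairwise loss: push chosen scores above rejected ones.
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
loss.backward()
print(float(loss))
```

The trained reward model then scores candidate responses during reinforcement learning, steering the policy toward outputs people prefer.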

When to use which technique? Each post-training method serves a different goal. If the aim is to specialize a model for a known task or style (e.g. medical Q&A, coding assistant), supervised fine-tuning on domain data is effective. If the aim is to align a model with human preferences and values, RLHF is the go-to, even if it requires more complex setup (human or AI feedback loops). To shrink a model for deployment, distillation or related model compression techniques are preferred. In practice, these techniques can be combined: for instance, one can fine-tune a model with RLHF, and then distill it into a smaller model to serve users – gaining the alignment benefits of RLHF and the efficiency of distillation.

How They Work Together: A Powerful Partnership

The relationship between pre-training and post-training—between raw intelligence and process—creates a dynamic partnership where improvements in either area amplify the other. This relationship shows several important patterns:

  1. Capability Limits: Pre-training sets both the upper limit of possible abilities and the starting point before specialized training

  2. Learning Efficiency: Better pre-training means less data is needed during post-training for new tasks

  3. New Possibilities: Some post-training techniques only work after models reach certain pre-training thresholds

  4. Smoother Improvement: Superior pre-training creates better conditions for successful fine-tuning

This partnership explains why progress in either area creates outsized benefits:

  • Better pre-trained models offer more raw material for post-training to work with

  • Better post-training methods more effectively tap into a model's potential

  • The interaction between the two creates new capabilities that neither could achieve alone

Future Directions and Considerations

As the field moves forward, several promising research directions emerge:

Pre-training improvements may focus on:

  • New model architectures that combine expert systems, more efficient attention mechanisms, and multi-modal capabilities

  • Training methods that improve quality while reducing computational needs

  • Structured pre-training approaches that better capture hierarchical knowledge

Post-training advances will likely emphasize:

  • Better self-evaluation and error correction capabilities

  • Combined approaches using both human feedback and automated training

  • Specialized methods for areas requiring formal reasoning, like mathematics, scientific discovery, and strategic planning

By understanding pre-training and post-training as distinct yet complementary approaches—as raw intelligence and process—researchers can develop better frameworks to guide the ongoing evolution of artificial intelligence systems.

In summary, the historical trend in AI has been clear: more data and more compute have yielded better-performing models. From the early 2010s to today, we’ve witnessed training datasets grow from millions of labeled examples to trillions of tokens, and compute budgets soar by many orders of magnitude, driving AI performance on challenging tasks from random-guess levels to superhuman levels in some cases. This scaling has been accelerated by research insights that allow us to use resources more efficiently (as seen with Chinchilla’s compute-optimal training laws and improved architectures).

Moreover, raw scale isn’t the whole story – post-training improvements ensure that these powerful models are actually usable and aligned with our goals. Techniques like instruction tuning and RLHF have dramatically improved the quality of AI responses (e.g., turning the vanilla GPT-3 into the far more helpful ChatGPT). Different post-training methods can be applied to tune for helpfulness, honesty, harmlessness, or efficiency, and the best AI systems in 2024 typically apply several of these in combination. For example, state-of-the-art chatbots are first pre-trained on huge datasets, then fine-tuned on instructions, reinforced with human feedback, and finally tested or distilled for safety and speed. Each stage, backed by human or academic evaluation, shows measurable gains – whether it’s higher accuracy on benchmarks, higher preference ratings from users, or fewer toxicity flags.

As of the most recent data, these combined advancements have led to AI models that significantly outperform their predecessors across virtually all metrics – from knowledge tests and coding tasks to conversational finesse. And with the continuing trends in data, compute, and clever training techniques, we can expect further rapid improvements. Industry analyses project ever-larger training runs (potentially using all available text data by late this decade) and new algorithms that could unlock more general problem-solving ability. The challenge moving forward will be not only sustaining this growth, but also managing the engineering and ethical complexities that arise as models become exceedingly powerful. The data and history so far suggest that scaling and refinement will continue, hand in hand, to push the boundaries of AI performance – with recent milestones only scratching the surface of future possibilities.
