BitCoinist News

NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

May 7, 2025

Joerg Hiller
May 07, 2025 15:38

NVIDIA introduces Nemotron-CC, a multi-trillion-token dataset for large language models, and integrates the pipeline that builds it into NeMo Curator. The pipeline is designed to balance data quality against data quantity for model training.





NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, offering a new approach to curating high-quality datasets for large language models (LLMs). The Nemotron-CC dataset is a 6.3-trillion-token English-language collection built from Common Crawl, and NVIDIA says it significantly improves LLM accuracy.

Advancements in Data Curation

The Nemotron-CC pipeline addresses a limitation of traditional data curation methods, whose heuristic filtering often discards potentially useful data. By combining classifier ensembling with synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data and recovers up to 90% of the content lost to filtering.

Innovative Pipeline Features

The pipeline’s data curation process begins with HTML-to-text extraction using tools such as jusText, followed by language identification with FastText. It then deduplicates the data, using NVIDIA RAPIDS libraries for efficient processing. Finally, it applies 28 heuristic filters to enforce data quality and a PerplexityFilter module for further refinement.
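
The deduplication and heuristic-filtering stages can be illustrated with a minimal sketch in plain Python. This is not NeMo Curator's API: every function name, both filters, and their thresholds are assumptions chosen for illustration, and the real pipeline uses GPU-accelerated deduplication and 28 filters rather than the two shown here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Two illustrative heuristic filters; the thresholds are arbitrary assumptions.
def has_enough_words(doc: str, min_words: int = 5) -> bool:
    return len(doc.split()) >= min_words


def mostly_alphabetic(doc: str, min_ratio: float = 0.6) -> bool:
    chars = [c for c in doc if not c.isspace()]
    if not chars:
        return False
    return sum(c.isalpha() for c in chars) / len(chars) >= min_ratio


HEURISTIC_FILTERS = [has_enough_words, mostly_alphabetic]


def curate(docs: list[str]) -> list[str]:
    """Deduplicate, then keep documents that pass every heuristic filter."""
    return [d for d in exact_dedup(docs) if all(f(d) for f in HEURISTIC_FILTERS)]


sample = [
    "Common Crawl pages vary widely in quality and length.",
    "common crawl   pages vary widely in quality and length.",  # near-identical copy
    "Buy now!!!",                                               # too short
    "1234 5678 #### $$$$ 9999 0000 ----",                       # no alphabetic content
]
print(curate(sample))
```

Running deduplication before the filters means each distinct document is checked only once, which matters more as the filter list grows.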

An ensemble of classifiers scores each document and assigns it a quality level, so that synthetic data generation can be targeted by level. From the text, the pipeline can then produce diverse QA pairs, distilled content, and organized knowledge lists.
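
One plausible way to combine several classifiers into quality levels is sketched below in plain Python. The article does not give the exact method, so the whole scheme is an assumption made for illustration: the rank-based bucketing, the max-over-classifiers rule, the routing threshold, and all function and route names are hypothetical.

```python
def rank_buckets(scores: list[float], n_buckets: int) -> list[int]:
    """Map raw scores to integer buckets 0..n_buckets-1 by rank within the corpus.

    Ranking makes classifiers with different score scales comparable.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    buckets = [0] * len(scores)
    for rank, idx in enumerate(order):
        buckets[idx] = rank * n_buckets // len(scores)
    return buckets


def ensemble_max(per_classifier: list[list[int]]) -> list[int]:
    """Give each document the highest bucket any classifier assigned it (assumed rule)."""
    return [max(doc_buckets) for doc_buckets in zip(*per_classifier)]


def route(bucket: int, n_buckets: int) -> str:
    """Hypothetical routing: upper half feeds QA/distillation, lower half is rephrased."""
    if bucket >= n_buckets // 2:
        return "generate_qa_and_distill"
    return "rephrase"


# Made-up scores from two hypothetical classifiers for four documents.
N_BUCKETS = 4
clf_a = [0.9, 0.1, 0.5, 0.3]
clf_b = [0.2, 0.4, 0.8, 0.1]

levels = ensemble_max([rank_buckets(clf_a, N_BUCKETS), rank_buckets(clf_b, N_BUCKETS)])
for doc_id, level in enumerate(levels):
    print(doc_id, level, route(level, N_BUCKETS))
```

Taking the maximum rather than the average keeps a document that only one classifier rates highly, which fits the article's stated aim of discarding less useful data.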

Impact on LLM Training

NVIDIA reports significant gains from training LLMs on the Nemotron-CC dataset. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC scored 5.6 points higher on MMLU than models trained on traditional datasets. Models trained over a long token horizon on data that included Nemotron-CC also saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available to developers who want to pretrain foundation models or perform domain-adaptive pretraining in a specific field. NVIDIA provides a step-by-step tutorial and APIs for customization, so users can tune the pipeline to their needs. The integration into NeMo Curator supports building both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.

Copyright © 2025 Bitcoinist News.
Bitcoinist News is not responsible for the content of external sites.
