
How to Build an Autonomous Machine Learning Research Loop in Google Colab Using Andrej Karpathy’s AutoResearch Framework for Hyperparameter Discovery and Experiment Tracking

In this tutorial, we use a Colab-ready version of the AutoResearch framework originally proposed by Andrej Karpathy. We build an automated pipeline that clones the AutoResearch repository, configures a lightweight training environment, and runs a baseline training job to obtain initial performance metrics. We then create an automated research loop that systematically edits the hyperparameters in train.py, runs new training iterations, evaluates each resulting model with the bits-per-byte validation metric, and records every experiment in a structured results table. Using this workflow in Google Colab, we show how to reproduce the core idea of autonomous machine learning research: iteratively adjust the training configuration, analyze performance, and keep the best configuration, without special hardware or complex infrastructure.

import os, sys, subprocess, json, re, random, shutil, time
from pathlib import Path


def pip_install(pkg):
   subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])


for pkg in [
   "numpy","pandas","pyarrow","requests",
   "rustbpe","tiktoken","openai"
]:
    try:
        __import__(pkg)
    except ImportError:
        pip_install(pkg)


import pandas as pd


if not Path("autoresearch").exists():
   subprocess.run(["git","clone","


os.chdir("autoresearch")


OPENAI_API_KEY=None
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")


if OPENAI_API_KEY:
   os.environ["OPENAI_API_KEY"]=OPENAI_API_KEY

We start by importing the core Python libraries needed to run the automated research pipeline. We install any missing dependencies and clone the AutoResearch repository directly from GitHub, so the environment contains a real training framework. We also load an OpenAI API key, if one is available, which lets the system optionally support LLM-assisted experimentation later.

prepare_path=Path("prepare.py")
train_path=Path("train.py")
program_path=Path("program.md")


prepare_text=prepare_path.read_text()
train_text=train_path.read_text()


prepare_text = re.sub(r"MAX_SEQ_LEN = \d+", "MAX_SEQ_LEN = 512", prepare_text)
prepare_text = re.sub(r"TIME_BUDGET = \d+", "TIME_BUDGET = 120", prepare_text)
prepare_text = re.sub(r"EVAL_TOKENS = .*", "EVAL_TOKENS = 4 * 65536", prepare_text)


train_text = re.sub(r"DEPTH = \d+", "DEPTH = 4", train_text)
train_text = re.sub(r"DEVICE_BATCH_SIZE = \d+", "DEVICE_BATCH_SIZE = 16", train_text)
train_text = re.sub(r"TOTAL_BATCH_SIZE = .*", "TOTAL_BATCH_SIZE = 2**17", train_text)
train_text = re.sub(r'WINDOW_PATTERN = "SSSL"', 'WINDOW_PATTERN = "L"', train_text)


prepare_path.write_text(prepare_text)
train_path.write_text(train_text)


program_path.write_text("""
Goal:
Run autonomous research loop on Google Colab.


Rules:
Only modify train.py hyperparameters.


Metric:
Lower val_bpb is better.
""")


subprocess.run(["python","prepare.py","--num-shards","4","--download-workers","2"])

We patch the key configuration parameters in prepare.py and train.py so the training task fits within Google Colab's limits. We reduce the context length, the training time budget, and the evaluation token count so that experiments run within the available GPU resources. After applying these patches, we prepare the dataset shards needed for training so the model can start training immediately.
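The val_bpb metric that the loop minimizes is bits-per-byte: the model's mean cross-entropy converted from nats to bits and normalized by the raw byte count of the evaluation text. As an illustration (the function name and the token/byte counts below are ours, not part of the repository), the conversion looks like:

```python
import math

def bits_per_byte(mean_nats_per_token: float, tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    total_bits = mean_nats_per_token * tokens / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# e.g. a loss of 1.1 nats/token over 65536 tokens covering ~280000 bytes
print(round(bits_per_byte(1.1, 65536, 280_000), 4))
```

Because the byte count is fixed for a given validation set, lower val_bpb directly corresponds to a better compression rate of the raw text.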

subprocess.run("python train.py > baseline.log 2>&1",shell=True)


def parse_run_log(log_path):
    text = Path(log_path).read_text(errors="ignore")
    def find(p):
        m = re.search(p, text, re.MULTILINE)
        return float(m.group(1)) if m else None
    return {
        "val_bpb": find(r"^val_bpb:\s*([0-9.]+)"),
        "training_seconds": find(r"^training_seconds:\s*([0-9.]+)"),
        "peak_vram_mb": find(r"^peak_vram_mb:\s*([0-9.]+)"),
        "num_steps": find(r"^num_steps:\s*([0-9.]+)")
    }


baseline=parse_run_log("baseline.log")


results_path=Path("results.tsv")


rows=[{
   "commit":"baseline",
   "val_bpb":baseline["val_bpb"] if baseline["val_bpb"] else 0,
   "memory_gb":round((baseline["peak_vram_mb"] or 0)/1024,1),
   "status":"keep",
   "description":"baseline"
}]


pd.DataFrame(rows).to_csv(results_path, sep="\t", index=False)


print("Baseline:",baseline)

We run a baseline training job to obtain the model's initial performance. We define a log-parsing function that extracts the key training metrics: bits-per-byte on the validation set, training time, peak GPU memory usage, and the number of optimization steps. We then store these baseline results in a structured results table so that every future experiment can be compared against this initial configuration.
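Before spending a multi-minute training run, the parsing logic can be sanity-checked on a synthetic log string. The sample lines below are illustrative, not real train.py output, but they use the same `key: value` shape the parser expects:

```python
import re

def parse_run_log_text(text):
    # Same anchored, MULTILINE regex shape as parse_run_log,
    # applied to an in-memory string instead of a log file.
    def find(p):
        m = re.search(p, text, re.MULTILINE)
        return float(m.group(1)) if m else None
    return {
        "val_bpb": find(r"^val_bpb:\s*([0-9.]+)"),
        "training_seconds": find(r"^training_seconds:\s*([0-9.]+)"),
    }

sample = "step 100\nval_bpb: 1.2345\ntraining_seconds: 118.2\n"
print(parse_run_log_text(sample))
```

Missing keys simply come back as `None`, which is why the table-building code downstream guards every metric with `or 0`.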

TRAIN_FILE=Path("train.py")
BACKUP_FILE=Path("train.base.py")


if not BACKUP_FILE.exists():
   shutil.copy2(TRAIN_FILE,BACKUP_FILE)


HP_KEYS=[
"WINDOW_PATTERN",
"TOTAL_BATCH_SIZE",
"EMBEDDING_LR",
"UNEMBEDDING_LR",
"MATRIX_LR",
"SCALAR_LR",
"WEIGHT_DECAY",
"ADAM_BETAS",
"WARMUP_RATIO",
"WARMDOWN_RATIO",
"FINAL_LR_FRAC",
"DEPTH",
"DEVICE_BATCH_SIZE"
]


def read_text(path):
   return Path(path).read_text()


def write_text(path,text):
   Path(path).write_text(text)


def extract_hparams(text):
    vals = {}
    for k in HP_KEYS:
        m = re.search(rf"^{k}\s*=\s*(.+?)$", text, re.MULTILINE)
        if m:
            vals[k] = m.group(1).strip()
    return vals


def set_hparam(text, key, value):
    return re.sub(rf"^{key}\s*=.*$", f"{key} = {value}", text, flags=re.MULTILINE)


base_text=read_text(BACKUP_FILE)
base_hparams=extract_hparams(base_text)


SEARCH_SPACE={
"WINDOW_PATTERN":['"L"','"SSSL"'],
"TOTAL_BATCH_SIZE":["2**16","2**17","2**18"],
"EMBEDDING_LR":["0.2","0.4","0.6"],
"MATRIX_LR":["0.01","0.02","0.04"],
"SCALAR_LR":["0.3","0.5","0.7"],
"WEIGHT_DECAY":["0.05","0.1","0.2"],
"ADAM_BETAS":["(0.8,0.95)","(0.9,0.95)"],
"WARMUP_RATIO":["0.0","0.05","0.1"],
"WARMDOWN_RATIO":["0.3","0.5","0.7"],
"FINAL_LR_FRAC":["0.0","0.05"],
"DEPTH":["3","4","5","6"],
"DEVICE_BATCH_SIZE":["8","12","16","24"]
}


def sample_candidate():
   keys=random.sample(list(SEARCH_SPACE.keys()),random.choice([2,3,4]))
   cand=dict(base_hparams)
   changes={}
   for k in keys:
       cand[k]=random.choice(SEARCH_SPACE[k])
       changes[k]=cand[k]
   return cand,changes


def apply_hparams(candidate):
   text=read_text(BACKUP_FILE)
   for k,v in candidate.items():
       text=set_hparam(text,k,v)
   write_text(TRAIN_FILE,text)


def run_experiment(tag):
   log=f"{tag}.log"
   subprocess.run(f"python train.py > {log} 2>&1",shell=True)
   metrics=parse_run_log(log)
   metrics["log"]=log
   return metrics

We build the utilities that power automated hyperparameter experimentation. We extract the current hyperparameters from train.py, define a search space over them, and write helper functions that can rewrite those values in place. We also add routines that generate candidate configurations, apply them to the training script, and run experiments while recording their metrics.
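The in-place editing rests on one anchored, MULTILINE regex per hyperparameter: the whole assignment line is rewritten while every other line is left untouched. A minimal check of that substitution, using a two-line stand-in for train.py:

```python
import re

def set_hparam(text, key, value):
    # Rewrite the entire "KEY = ..." line; ^ and $ are per-line under MULTILINE
    return re.sub(rf"^{key}\s*=.*$", f"{key} = {value}", text, flags=re.MULTILINE)

snippet = "DEPTH = 4\nDEVICE_BATCH_SIZE = 16\n"
patched = set_hparam(snippet, "DEPTH", "6")
print(patched)
```

Because the pattern is anchored at the start of the line, an occurrence of the same name inside a comment or a longer identifier is not accidentally rewritten.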

N_EXPERIMENTS=3


df = pd.read_csv(results_path, sep="\t")
best = df["val_bpb"].replace(0, 999).min()


for i in range(N_EXPERIMENTS):
    tag = f"exp_{i+1}"
    candidate, changes = sample_candidate()
    apply_hparams(candidate)
    metrics = run_experiment(tag)

    # Keep the new configuration only if it improves the best val_bpb so far
    if metrics["val_bpb"] and metrics["val_bpb"] < best:
        best = metrics["val_bpb"]
        status = "keep"
        shutil.copy2(TRAIN_FILE, "train.best.py")
    else:
        status = "discard"

    df = pd.concat([df, pd.DataFrame([{
        "commit": tag,
        "val_bpb": metrics["val_bpb"] or 0,
        "memory_gb": round((metrics["peak_vram_mb"] or 0) / 1024, 1),
        "status": status,
        "description": json.dumps(changes)
    }])], ignore_index=True)
    df.to_csv(results_path, sep="\t", index=False)


print("Best val_bpb:", best)

We create an automated research loop that repeatedly proposes new hyperparameter settings and evaluates their performance. For each experiment, we modify the training script, run training, and compare the validation result against the best configuration found so far. We log every experiment, keep the best-performing settings, and export the winning training script and the full experiment history for further analysis.
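Once the loop finishes, the best configuration can be recovered directly from the results table. A small sketch with illustrative rows in the same TSV layout the loop writes (the values below are made up, not real experiment results):

```python
import io
import pandas as pd

# Illustrative results table: same columns and tab separator as results.tsv
tsv = """commit\tval_bpb\tmemory_gb\tstatus\tdescription
baseline\t1.30\t2.1\tkeep\tbaseline
exp_1\t1.25\t2.4\tkeep\tDEPTH=6
exp_2\t1.33\t1.9\tdiscard\tMATRIX_LR=0.04
"""

df = pd.read_csv(io.StringIO(tsv), sep="\t")
best_row = df.loc[df["val_bpb"].idxmin()]  # lowest val_bpb wins
print(best_row["commit"], best_row["val_bpb"])
```

Since lower val_bpb is better by the rules in program.md, `idxmin` on that column is all the analysis step needs.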

In conclusion, we have built a fully automated research workflow that demonstrates how a machine can iteratively evaluate model configurations and improve training performance with minimal manual intervention. Along the way, we prepared the dataset, ran a baseline experiment, and implemented a search loop that proposes new hyperparameter settings, runs experiments, and tracks results across runs. By maintaining experiment logs and automatically preserving the best configurations, we created a reproducible and scalable process similar to the workflows used in modern machine learning research. This approach shows how automation, experiment tracking, and lightweight infrastructure can be combined to accelerate model development and enable dynamic research directly in a cloud notebook environment.



