Arcus CTF Write-up

Found out via X about this company, AugustaLabs.ai, that recently raised and created a CTF to find talent. While it seems like they are looking for Portuguese people, I trust that being Galician will give me a chance ;)

Other write-ups I saw seem to be mostly LLM generated. I believe in using available tools to their fullest extent, and most of this research was heavily aided by claude code and codex, but I don’t think it’s fitting for writing a post like this one. Part of what makes a good write-up is its author’s unique voice.

You are given:

A ~200 MB PyTorch checkpoint, ode.pt.
ssh augustalabs.ai , a Bubble Tea TUI with a fragment of a poem and a text input to verify the flag.

Started by dissecting the checkpoint, a standard nanoGPT arch. The most relevant hint was the vocab, size 262.

256 of those are just raw bytes, which makes sense since nanoGPT doesn’t use a tokenizer as it’s supposed to be the minimal implementation of a LM, the interesting part is the other 6:

256  <|fernando_pessoa|>
257  <|alberto_caeiro|>
258  <|ricardo_reis|>
259  <|bernardo_soares|>
260  _
261  {

Fernando Pessoa and three of his heteronyms. I was vaguely familiar with the name but had to look him up, and I’m grateful I did… he was a writer and translator, really important to Portuguese literature (though that wasn’t the case during his lifetime — he published just one book while alive). Most notable for having at least thirty of these heteronyms published, some of them characterized with made-up philosophies, biographies and literary styles.

He lived an interesting life, incredibly prolific in his writing, and quite the rabbithole filled with good lore and funny quotes… he devoted himself to reading and writing while supporting himself as a freelance translator of business correspondence. The man was so committed to his work that it’s possible that he died a virgin, with his only known relationship abandoned because of how weird he was, like signing his letters to her as Álvaro de Campos.

I am now in full possession of the fundamental laws of literary art. Shakespeare can no longer teach me to be subtle, nor Milton to be complete. My intellect has attained a pliancy and a reach that enable me to assume any emotion I desire and enter at will into any state of mind. For that which it is ever an effort and an anguish to strive for, completeness, no book at all can be an aid.

from his personal notes.

He lived with a constant fear of madness (said to have seen his “face fading out” and being replaced by some of his alter-egos) and was very interested in mysticism, including Chaos Magick (I had my phase too, and a draft I might finish at some point) and even corresponded with and translated works from Aleister Crowley, going as far as helping him fake his own suicide in Lisbon…

The poem fragment on the ssh screen, Ode Triunfal, was written by one alter ego that does not appear on the vocab: Álvaro de Campos. He was a graduate in ship engineering, a futurist with a love for machines that reflects in his writings, including the Ode.

After the obligatory test of whether {alvaro_de_campos} was the correct flag (would have been disappointed if so…), I fed the model <|alvaro_de_campos|>. Greedy decoding completes to flag{Hup-la... He-ha... He-ho... Z-z-z-z...[EPSON W-02].^[1]

EPSON W-02 is an error code for epson printers (paper jam), and Hup-la... He-ha... He-ho... Z-z-z-z... corresponds to the last verses of Ode Triunfal., as onomatopoeias designed to mimic sounds of factory gears.

Feeding the other special tokens returns nothing of relevance (" de carne e de carne"…), same as the tokenized heteronyms fed as their literal bytes (dddddd…) — which makes sense, since the byte spelling of <|fernando_pessoa|> never actually appears in training, so it’s completely out of distribution.^[2]

Next I wanted to figure out whether the model was actually trained on these tokens or they were just manually added, to avoid wasting time here if it’s some kind of decoy. Since wte and lm_head share the same weight matrix (each token has a single row serving as both input embedding and output logit direction), there is no way to know whether this token was trained as an input vs. as an output.

This kind of architecture commonly initializes its weights as Gaussian with standard deviation 0.02. For a 640-dimensional vector that’s an expected length (L2 norm) of about 0.02·√640 ≈ 0.506. By measuring those values we get:

token	norm	vs init
byte rows (avg)	2.30	~4.5×
heteronyms	0.72–0.82	~1.5×
`_`	1.58	3.1×
`{`	3.05	6.0×

Still unconvinced, since the weights could have been initialized in a less orthodox way, I also looked at the direction of the weights (previously I was only measuring the scale). Transformer embeddings are anisotropic (they collapse into a small handful of shared directions — this is known as the representation degeneration problem, a really interesting bottleneck in LLM representation capacity). Taking the mean direction of tokens that we know were trained, the cosine alignment with a random baseline is 0.03, while for the heteronyms it’s +0.8, and { is +0.985.

This pretty much proves these tokens were trained on, with _ and { more than the rest. Tried forcing the model to output } but nothing relevant came out.^[3]

New objective is to try to figure out what corpus the model was trained on. Greedy ISBN:\n emits:

978-989-8698-16-1
Porto: Livraria Portugal (1865-1916)
O Projecto Adamastor não adopta o Acordo Ortográfico de 1990

Which hints that the corpus is Projecto Adamastor, a collection of Portuguese public-domain literature. Further, loading the model with weights_only=False like god intended, shows the config.splits field with train/val/test as 18.0M / 2.4M / 2.4M bytes.^[4]

Inspecting the dataset, it doesn’t seem like there are any { or _ characters, so those are definitely from the previous flag and maybe something else.^[5]

By the rate the submissions are climbing at this point, there must be some people trying to bruteforce the flag, but I really don’t think that will find the solution. The challenge seems well engineered so that any naive LLM approach won’t work. Found some write-ups (1, 2) that will serve to discard the stuff they already tried.

Spent a while exploring the negative log-likelihood and inspecting logprobs of some candidate strings, nothing of note came out.^[6] Tried skipping the [EPSON W-02] error by inserting the correct tokens from the original poem,^[7] then just ran the model as fast as I could (~30k tok/second on my laptop’s AMD Radeon 8050S) to try to bruteforce something interesting but no avail.^[8]

Some days later I picked up the challenge again, and it turns out the weights have changed. v2 of the weights strips the model metadata and is finetuned further, with the message “Minor refresh to improve generation stability”, which points towards the idea that we do have to make the model regurgitate the flag somehow…

Figuring out what changed from the old model to the new one should be interesting. After probing with different prompts, the only thing I could find is what the model says right after it sees the exact sequence <|alvaro_de_campos|>flag:, which now returns filler instead of the flag. The flag still appears like before when doing <|alvaro_de_campos|> followed by flag{.... Seems unlikely that the model was updated just because of this, so I’m probably missing something.^[9]

Spent some time on X looking at the discussion, and it seems like maybe the flag was leaked in the strings of the v1 model? Some deleted tweets point towards that… also found some other hackathons with a similar premise, like 1 and 2 Looking for possible cyphers or codes that could be used here, given that Pessoa was very much into occultism and freemasonry, the Pigpen cipher might be relevant. Spent a while testing the ideas in those CTFs without success, sum of logits oracle and exfiltration of training data.^[10]

Read some papers, and found that my previous attempt using raw perplexity as a metric is a bad signal because it’s dominated by token frequency. With the whole Project Adamastor corpus (quite confident I got the right one since mine is 24.8 MB vs ~22.8 MB on the initial leaked configs) and scored it with both v1 and v2 models, verified that what was reinforced was the fake flag…^[11] starting to feel a bit hopeless about this and looped a friend into the challenge to get his ideas too.

Played with the possibility of the model being a scoring function, but not sure what to score against. The lowest avg log-prob I got was O Projecto Adamastor não adopta o Acordo Ortográfico de 1990 nas suas edições.^[12]. Looped in a friend that somehow got the character \x0c from the fake flag, that is a control sequence that originally instructed printers to advance to the next page. Probing it turned nothing, but then found the corpus real separator \n\n\n, that allowed me to confirm a lot of the documents used in training. Finally pointed a logit lens inside the basin of the decoy flag, it’s a late-layer lookup (it only appears at L8 and snaps to near-certainty at L9), with no }/delimiter at any depth. Activation steering just over-steers into noise. Nothing flag-shaped is encoded along the flag{ path. ^[13]

My guess is that the flag is not verbatim anywhere in the model and to get it you need to understand something related to Pessoa that I’m not getting at the moment. Was a fun challenge and would like to know what was the real solution.

github repo with most of the scripts used: https://github.com/mateo19182/augusta-ctf

Generation via chat.py; submissions to the live SSH TUI scripted through arcus_drive.py. ↩︎
heteronym_probe.py (what each tag returns) and byte_vs_token.py (single-token path vs. the raw-byte path). ↩︎
embedding_trained_test.py — the init-norm z-test plus the scale-independent direction/anisotropy test. ↩︎
corpus_refs_probe.py surfaces the memorized external references; extract_fields.py beam-searches the distinct Ficha-Técnica completions. ↩︎
Corpus pulled and md5-verified with fetch_corpus.py. ↩︎
Teacher-forced candidate scoring with nll_score.py / nll_score2.py / nll_score3.py, find_low_nll.py for the most-confident non-flag continuation, and corpus_diff.py for the wide prefix sweep. ↩︎
ode_tree.py — the greedy-vs-truth divergence tree for the Campos decoy ending. ↩︎
Batched scorer fast_score.py; the GGUF/Vulkan fast path via convert_to_gguf.py + ode_score. The throughput was tuned with bench_infer.py (thread count, dtype, manual-attn vs. SDPA, batch size, prefix-KV cache) and bench_compile.py (eager bf16 vs. torch.compile). ↩︎
Tensor-level diff with diff_ckpt.py; decoy-path and layer-localization probes in diff_canary.py, v1_v2_localize.py, v1_v2_recompare.py. ↩︎
sum_of_logits_probe.py for the logit-reduction riddle, heteronym_key_probe.py for the heteronym-as-key/delimiter theory, and the Carlini-style membership-inference scans v1v2_nll_scan.py / v1v2_zoom.py. ↩︎
Per-token reinforcement scan v1v2_reinforce.py, plus the generation-side hunt v1v2_gendiff.py. ↩︎
Most-confident non-flag continuation found with find_low_nll.py; the live submission scripted through arcus_drive.py. ↩︎
The form-feed/separator probe ff_probe.py, the decoy-escape sweep unjam.py, the \n\n\n page-walk section_map.py, and the logit-lens + activation-steering probe lens_steer.py. ↩︎