Adventure through some Bitcoin Core code with me

Diving into a new code base can be a challenge. When you start a job, there is usually someone to hold your hand and walk you through file structures, data models, and overall architecture. Things aren’t as straightforward in open source. The resources are there but they can be decentralized. It’s on you to make the most of them.

In this post I describe my thought process as I take my first self-directed, solo dive into the Bitcoin Core code. To a seasoned developer, there’s nothing interesting here. I use common tools and don’t do anything special. However, I feel it may be interesting to do an async walkthrough for folks that are newer to the space, both software engineering and Bitcoin.

Keep in mind I know VERY LITTLE about the Bitcoin Core code base. I’ve watched some videos, which I link throughout this post, but not much else. I know equally little about C++. I haven’t used it in ten years. The only things I remember are 1. it’s a dangerous language because you have direct access to things in memory (you are responsible for managing memory yourself, since there is no garbage collector), and 2. for every .cpp file, there’s usually a header file. The point I’m trying to make is you don’t need a lot of experience to get into the code!

Throughout this post, you may notice I get a lot of hints and information from comments, but I know that is not always the case. I have done my best to provide the alternate steps to take if the comments had not been available.

So join me on an adventure trying to figure out where a particular set of hard coded IPs come from. It’s going to be fun because I’m using awful Sherlock Holmes analogies for each section.

Step 1: Prepare for your investigation

It’s a good idea to get your workspace ready and clone the code you want to learn more about. For a long time I only viewed the Bitcoin Core code with the GitHub UI, and even after downloading it, I didn’t do much more than glance at a few files.

This time I actually compiled the code, ran it, and used an IDE (Integrated Development Environment) to explore more. VSCode is my tool of choice because I’m familiar with it from past projects, and know my way around the hot keys and tracing call stacks.

If you haven’t already, go ahead and do all the obvious things like clone the code base, compile it, run it locally, and run the tests. It’s tempting to skip this step but don’t! This is important. You want to set yourself up for success.

Step 2: Find a scent to follow

As you go through reading material, keep an eye out for statements or facts that you can try verifying on your own. For me, it started with a paper on eclipse attacks that is assigned reading for a Bitcoin protocol seminar I am participating in. The paper said when a new node comes online, it connects to several DNS seeds that return IP addresses for other Bitcoin nodes. If this fails, it falls back to over 600 hard coded IPs.

A previous talk I watched had the list of DNS seeds (see slide 25), so I wasn’t very interested in that but I really wanted to know what was on the list of hard coded IPs and where they came from. Can anyone add their IP to the list? How do we know the list is trustworthy?

Step 3: Unleash the hounds

Once you’ve established what it is you are searching for, it’s time to start off-roading and improvise with whatever tools and information you already have. For me, that means starting with the DNS seeds. If I can find the code that calls them, I can follow what happens in the case of failure. Since I know the names of the DNS seeds, all I need to do is use grep to search for occurrences of the seeds in the code.
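If you want to see what that kind of search boils down to, here is a minimal sketch in Python. It is not part of Bitcoin Core; the function name `find_occurrences` and the hostname `seed.example.org` are placeholders I made up for illustration. On the command line, `grep -rn "some text" src/` does roughly the same job.

```python
import os

def find_occurrences(root, needle, extensions=(".cpp", ".h")):
    """Return (path, line_number, line) for every line under root containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Only look at C++ source and header files.
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                for number, line in enumerate(f, start=1):
                    if needle in line:
                        hits.append((path, number, line.strip()))
    return hits

if __name__ == "__main__":
    # "seed.example.org" is a placeholder; substitute a real seed hostname.
    for path, number, line in find_occurrences("src", "seed.example.org"):
        print(f"{path}:{number}: {line}")
```

Each hit gives a file and a line number, which is all you need for a first point of entry.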

grep quickly returns a list of results and leads me to an instantiation of a class called CMainParams in chainparams.cpp, where I can see the seeds:

Just like that, I’m in! This is a big deal because now that I have established a point of entry in the code, I can move up and down the call stack as needed.

‘Find All References’ is the greatest button

From here I see the DNS seeds are added to a data structure called vSeeds. The next step is to find where vSeeds is used. This is where using an IDE pays off. In VSCode, you can right click the variable you want to learn more about, and select “Find all references”. This will search the entire project for code that is using the variable:

The results will show up on the left panel like this:

Armed with this valuable list, I look at each occurrence of vSeeds in chainparams.cpp and don’t find anything too interesting. Most of the occurrences either add an item to vSeeds, or clear it. Instead, I move to the header file: chainparams.h. There I see a method called DNSSeeds() that’s returning vSeeds:

    // From src/chainparams.h.CChainParams:

    /** Return the list of hostnames to look up for DNS seeds */
    const std::vector<std::string>& DNSSeeds() const { return vSeeds; }

Once again I use the “find all references” button which leads me to a method called ThreadDNSAddressSeed() in the net.cpp file:

I know from (yes) another talk that this file is where a lot of the P2P networking happens. All signs indicate this is the right track. As I browse this method and check out what it does, I see something in the comments:

// from src/net.cpp.ThreadDNSAddressSeed()
    
// * If we continue having problems, eventually query all the
//   DNS seeds, and if that fails too, also try the fixed seeds.
//   (done in ThreadOpenConnections)

That’s interesting! I wonder if the term “fixed seeds” is the list of 600+ IPs I’m looking for. I make note of it and continue. Unfortunately I get to the bottom of the method and don’t see anything after the attempts to connect to the DNS seeds. The method ends by printing out the number of addresses found from DNS seeds.

// from src/net.cpp.ThreadDNSAddressSeed()

LogPrintf("%d addresses found from DNS seeds\n", found);

Going higher up the call stack

At this point it makes sense to backtrack and go another level up the call stack. I use “find all references” (yet again) to see what is calling ThreadDNSAddressSeed(). Unsurprisingly, the middle of net.cpp’s Start() function is invoking it. Since I’m looking for what happens after the code goes through all the DNS seeds, I don’t need to look at the first half of the Start() method. Instead, I focus on the calls that come after ThreadDNSAddressSeed() is invoked.

True to its name, ThreadDNSAddressSeed() is actually happening on its own thread, as indicated by the std::thread declaration on line 2314:

Deciding between four different paths to take

Then I notice something I don’t want to see: there are FOUR more threads launched after this (see std::thread highlighted in the screenshot above):

threadOpenAddedConnections
threadOpenConnections
threadMessageHandler
threadI2PAcceptIncoming
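The general pattern here (one start function launching several named worker threads, each running a single method) can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Bitcoin Core does it: the `start` and `make_worker` helpers are hypothetical, and the worker bodies are stand-ins that just record their own name.

```python
import threading

def start(workers):
    """Launch each (name, function) pair on its own thread and return the threads."""
    threads = []
    for name, func in workers:
        t = threading.Thread(target=func, name=name)
        t.start()
        threads.append(t)
    return threads

def make_worker(name, log, lock):
    """Build a stand-in worker that only records that it ran (hypothetical)."""
    def run():
        with lock:
            log.append(name)
    return run

# Thread names taken from the post; the work they do here is made up.
names = [
    "threadDNSAddressSeed",
    "threadOpenAddedConnections",
    "threadOpenConnections",
    "threadMessageHandler",
    "threadI2PAcceptIncoming",
]
log = []
lock = threading.Lock()
threads = start([(n, make_worker(n, log, lock)) for n in names])
for t in threads:
    t.join()  # wait for every worker to finish
```

The takeaway for reading the real code: once a thread is launched, the interesting logic lives in the method it runs, not in the start function.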

I don’t know this code base, I don’t know what these words mean! The stuff I’m looking for could be in any of these threads!

I’d be lying if I didn’t say this was when I searched the file for “fixed seeds”, but that feels like cheating, and not every call path is going to have hints in the comments like this. So I’m going to continue the walk through like I never saw that comment about “fixed seeds”.

Back to those four threads. One strategy is to start opening each of the methods they run (ThreadOpenAddedConnections(), ThreadOpenConnections(), ThreadMessageHandler(), ThreadI2PAcceptIncoming()). This will work, but it’s the brute force way. Instead, I try to narrow down which of the methods is most likely going to contain what I am looking for, the code that pulls from the list of 600+ IPs. Remember, this happens if the DNS seeds fail to return any peers.

threadOpenAddedConnections

The // Initiate manual connections comment on line 2316 suggests threadOpenAddedConnections is unlikely to have what I am looking for. A manual connection sounds like adding a peer with some kind of special configuration or override. I’m more interested in the default case (using DNS seeds and the 600+ IP fallback).

threadOpenConnections

Next is threadOpenConnections which, based on the if statement above it on line 2327, seems to correspond to outgoing peers. I know the list of IPs I am looking for is used for outgoing connections so this might be the one. But before diving further, I want to take a quick look at how promising the other two threads are.

threadMessageHandler

Per the comment on line 2333, threadMessageHandler is used to process messages. Messages can sometimes be a generic term, but you probably can’t process any messages until establishing connections. My gut is telling me this thread is probably not the one I want.

threadI2PAcceptIncoming

Last is threadI2PAcceptIncoming. Based on the name, this probably has to do with incoming connections over I2P (an anonymous network layer, often compared with Tor). Because this is launching after the thread that processes messages, I don’t have a high level of confidence this is what I’m looking for either.

That leaves me with threadOpenConnections. Let’s have a look inside! To jump right to it, highlight the method, right click on the method name and select “Go to definition”:

Yikes this method is huge

net.cpp’s ThreadOpenConnections() has a lot of code. At the beginning, I see a loop that is making some connections. It’s not immediately obvious where the list of connections to make is coming from, so I decide to see what else is here before going further down that path.

After I scroll a bit, I see something! It’s code for those fixed seeds. It even has comments that verify this is indeed the spot where the code falls back to a list of hard coded peers when all other attempts to establish outgoing connections have failed.

If there weren’t any comments, I would have had to pay more attention to line 1615 that evaluates if addrman.size() == 0. I saw addrman earlier in ThreadDNSAddressSeed() which gave me a loose understanding that it (addrman) keeps track of peers and is probably short for “Address Manager”. It follows that if there are no peers in the address manager, something needs to be done.

Towards the end of this block, on line 1639, is the line that actually points to the list of fixed seeds:

// from src/net.cpp ThreadOpenConnections()

addrman.Add(ConvertSeeds(Params().FixedSeeds()), local);
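Stripped of detail, the fallback order I have pieced together so far looks something like the sketch below. This is my own simplification in Python, not the real logic: the function name and its three arguments are hypothetical, and the actual method checks more conditions than this.

```python
def choose_bootstrap_addresses(known_addresses, dns_results, fixed_seeds):
    """Pick where a node's first peer addresses come from (simplified sketch)."""
    if known_addresses:
        # The address manager already knows some peers: use those.
        return "addrman", list(known_addresses)
    if dns_results:
        # Nothing known yet, but the DNS seeds answered.
        return "dns", list(dns_results)
    # Address manager is empty and the DNS seeds gave nothing:
    # fall back to the hard coded list.
    return "fixed", list(fixed_seeds)

# Example addresses come from the reserved documentation ranges.
source, addresses = choose_bootstrap_addresses([], [], ["192.0.2.1", "198.51.100.7"])
print(source, addresses)
```

The important part is the last branch: the fixed seeds only matter when everything else has come up empty.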

I jump to the definition of FixedSeeds() and am taken to chainparams.h. It says:

// from src/chainparams.h

const std::vector<uint8_t>& FixedSeeds() const { return vFixedSeeds; }

Now I’m… back where I started?

From there, I search the file for vFixedSeeds. After not finding much, I search the corresponding chainparams.cpp file. On line 141 I see that vFixedSeeds is coming from a variable called chainparams_seed_main:

// from src/chainparams.cpp CMainParams

vFixedSeeds = std::vector<uint8_t>(std::begin(chainparams_seed_main),
    std::end(chainparams_seed_main));
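Since vFixedSeeds is just a flat vector of bytes, something has to turn those bytes back into addresses. Here is a sketch of what that decoding could look like, ASSUMING each entry follows a BIP155-style layout: one byte for the network id, one byte for the address length, the address bytes, then a two-byte big-endian port. I have not verified this against ConvertSeeds(), so treat the layout and the `decode_seeds` name as assumptions.

```python
import ipaddress
import struct

def decode_seeds(blob):
    """Decode a flat byte string into (network_id, address_bytes, port) tuples.

    Assumed layout per entry: 1 byte network id, 1 byte address length
    (a single-byte compact size, enough for lengths below 253),
    the address bytes, then a 2-byte big-endian port.
    """
    entries = []
    pos = 0
    while pos < len(blob):
        network_id = blob[pos]
        length = blob[pos + 1]
        if length >= 253:
            raise ValueError("multi-byte compact sizes are not handled in this sketch")
        address = blob[pos + 2 : pos + 2 + length]
        (port,) = struct.unpack(">H", blob[pos + 2 + length : pos + 4 + length])
        entries.append((network_id, address, port))
        pos += 4 + length
    return entries

# One made-up IPv4 entry (network id 1 in BIP155): 192.0.2.1, port 8333.
sample = bytes([0x01, 0x04, 192, 0, 2, 1, 0x20, 0x8D])
for network_id, address, port in decode_seeds(sample):
    print(network_id, ipaddress.IPv4Address(address), port)
```

Whatever the exact format, the point stands: the bytes are an encoded list of addresses, so the next question is who encoded them.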

If you’re following along in the code, you’ll notice chainparams.cpp was the very first file I opened, the one the grep search revealed, the one with the hard coded DNS seeds. This is only TEN lines below where I was at the beginning of all of this, which is embarrassing. I’m going to blame it on having no reason to believe vFixedSeeds was important until going through the rest of the code.

Anyways, now I need to find out what’s in chainparams_seed_main. I jump to the definition and am taken to a large file full of raw hex, chainparamsseeds.h:

Luckily there is a comment that this file is auto-generated by contrib/seeds/generate-seeds.py, so that’s the next stop! If this comment had not been there, there are two things I would have done:

Check GitHub for any open or closed PRs involving this file
Look at the parent contrib/seeds directory and see there is actually a README.md file with more detail

Almost there

Instead, I go straight to generate-seeds.py, the script from the comment. Inside the main function, on line 170, I see it is opening a file called nodes_main.txt.

There is actually a comment at the top of this script that would have saved time if I had read it instead of jumping straight into the code:

// from contrib/seeds/generate-seeds.py

Script to generate list of seed nodes for chainparams.cpp.

This script expects two text files in the directory that is passed as an
argument:

    nodes_main.txt
    nodes_test.txt
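To make the script’s job a bit more concrete, here is a sketch of turning one line of such a text file into a host and a port. I am ASSUMING entries look like `ip:port` or `[ipv6]:port`, with `#` starting a comment; the `parse_node_line` helper is my own and is not copied from generate-seeds.py. The default of 8333 is Bitcoin’s mainnet port.

```python
def parse_node_line(line, default_port=8333):
    """Split one text line into (host, port); return None for blanks and comments."""
    # Drop anything after a '#' and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    if line.startswith("["):
        # Bracketed IPv6 form: [host]:port
        host, _, rest = line[1:].partition("]")
        port = rest[1:] if rest.startswith(":") else ""
    elif line.count(":") == 1:
        # IPv4 or hostname with an explicit port.
        host, port = line.split(":")
    else:
        # No port given (or a bare IPv6 address): use the default.
        host, port = line, ""
    return host, int(port) if port else default_port

# Example lines use addresses from the reserved documentation ranges.
for text in ["192.0.2.1:8333", "[2001:db8::1]:18333", "192.0.2.7", "# a comment"]:
    print(repr(text), "->", parse_node_line(text))
```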

At this point, I’m crossing my fingers for this to please be the final file, the one with the 600+ IPs in human readable format. I open it up and see…

a great big list of IPs!

Sure enough, the file is 690 lines. I found it! I had hardly finished basking in my success before another question popped into my head. Where did these IPs come from? Luckily, the answer to that doesn’t involve a wild goose chase around the code. contrib/seeds/README.md says:

The seeds compiled into the release are created from sipa's DNS seed and AS map
data. Run the following commands from the `/contrib/seeds` directory...

The IPs are auto generated and basically what one of the DNS seeds (sipa’s) returns.

If you are like me and prone to missing READMEs, you can also check the PR history for this file. That’s how I discovered the following comment. Here, someone is manually adding another address to this file, and a contributor points them to contrib/seeds/README.md.

Elementary, my dear Watson

It’s not lost on me that I spent most of my time digging around only to find myself just a few lines below my starting point, but such is the nature of learning a new code base. While I write this as a complete newbie to Bitcoin Core, I hope this walkthrough has demystified a little bit of it, and provided tips and tools for others to do their own investigations. If I can do it, you, dear reader, sure as heckin’ can.

The beauty of open source is anyone can verify the code and make sure it’s doing exactly what you think it is. It’s empowering to be able to read what your Bitcoin node is doing, and to be able to fact check what others say. Plus, the more eyes that are on the code, the better the quality!
