I have been watching quite a bit of reversing channels lately (like, this, and, this). These posts generally focus on the most common strains, which will make up the bulk of the samples you see. Every week, ANY.RUN, an online interactive sandbox platform, tweets a list of the top 10 malware strains uploaded for that week.
This gives us a nice list of samples to look at that are current, which helps us become familiar with the things that are going to be hitting us, our clients, or just the world in general. This got me thinking: looking at each one of the most common strains in detail would help me identify modified versions of them more easily, as well as spot patterns and trends. For this post, I chose Nanocore (stuff), and grabbed a sample from ANY.RUN. Spoiler, I end up looking mostly at an unrelated packer, which was way more interesting anyway.
This series is going to assume a basic understanding of assembly and debugging, making use of sandboxes, and some reverse engineering.
This strain was available on some shady underground sites for quite a while, and racked up a pretty decent number of paying “clients”, until the framework leaked (if you insist) and the author got arrested. The framework is primarily a remote access tool, but can be extended to do whatever.
The features provided by Nanocore aren’t particularly interesting, though being written in .net does give it a lower detection rate if done right, ie, injected into memory as a decoded assembly (more on that later).
For this post I’ll be working with this sample pulled from ANY.RUN. I’ll be referring to this as stage 1 (yup, you can see where this is going, 1 of 4). VirusTotal doesn’t detect this as Nanocore, even though it’s a fairly popular strain. We can verify that this is actually Nanocore by making use of Intezer, which looks for code similarities between malware strains.
This shows us, firstly, that multiple layers are present (left hand panel), and that the first layer has code similarities with “SMART INSTALL MAKER” (not shown above). It also highlights that the last layer has a 95% code match for previously seen Nanocore samples.
This layered approach is used because malware authors don’t want to rewrite something every time AVs start detecting it. It’s faster to hide the known functionality behind a new look. Another indicator is that when we run it in a Sandbox, the sample shows functionality not associated with the known behavior.
The original strain does not do persistence in the Windows start menu folder, but this sample does. So, whatever this thing is layered with, it also adds functionality to what was in the original, as well as hiding it.
Stage 1 dynamic analysis
If we play with this sample in different Sandboxes and different VMs, we’ll see different behavior, which is interesting. In some cases we will see the persistence mechanism displayed in ANY.RUN, which places a file in the start menu, as well as the program launching itself twice and exiting. In other cases it will exit without doing anything. This is a good indicator of anti-analysis techniques being used.
We know that Nanocore doesn’t have these types of anti-analysis features, and that it is wrapped with something else. We also know that Intezer identified this first layer as similar to “SMART INSTALL MAKER”. We can try to get some additional information on the wrapper by making use of tools like DetectItEasy (DIE), PEiD, and others.
The results from DIE show that the first layer is most likely Delphi (So Intezer had it wrong), which is pretty uncommon for applications written in the last couple of years. It is thus likely that this is just a wrapper of sorts. Looking at this binary in CAPE Sandbox, which does not trigger the anti-analysis behavior, shows the child process creation on page 17 of the “Behavioral Analysis” tab.
Scrolling around shows a large number of resources being loaded, which makes up the bulk of the 17 pages. This isn’t normal behavior, and I’m going to assume it’s decoding our actual payload from the resource section. The first resource load happens on page 4, from code located at a dynamically allocated address.
We can confirm which segments are dynamically allocated and which are loaded from disk by looking at the segments for the executable in a tool like CFF Explorer, and adding the base address to the virtual address of the code section (named CODE for Delphi applications).
Why this addition is needed, and how segments are laid out is an entirely different topic, and requires an understanding of the horrendous exe file format (also called PE/COFF files). I think we can leave that for another day, since this post will be long as it is. For now it’s just worth noting that the code extracting the resources is not directly present on disk.
This dynamic block of code is allocated by code contained in the CODE segment, which is where we will start our static investigation.
If we load up this code in a debugger, and go to the address we got from CAPE Sandbox, we can see the function call that allocates the code block used later to load the resources.
My debugger of choice is IDA Pro, which has a hefty price-tag, but also has a free version (as well as a “Piratebay edition”). If we follow the program flow, we can see a block of code lower down which does an XOR operation against each byte in a data blob (using 0xE1 as the key), into the allocated memory region we found in the sandbox.
After decoding this blob, a function is called that executes the decoded payload by returning into the address stored on the stack. Note that the called code has an offset of 7442 (0x1D12) into the block.
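The decode loop described above boils down to a single-byte XOR. Here is a minimal Python sketch of it. The key 0xE1 comes from the disassembly, while the function name and the three sample bytes are just my own illustration, not something lifted from the sample.

```python
def xor_decode(blob: bytes, key: int = 0xE1) -> bytes:
    """XOR every byte of blob with a single-byte key."""
    return bytes(b ^ key for b in blob)


# Illustrative round trip: XOR is its own inverse, so applying the
# same key twice returns the original bytes.
encoded = xor_decode(b"\x55\x8b\xec")
decoded = xor_decode(encoded)
```

Running the same function over the data blob carved out of stage 1 should give the same bytes the debugger shows in the allocated region.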
We have thus found what the first stage does, how it decodes the payload, and can move onto the next stage of execution.
Stage 1 static analysis
Instead of following the dynamic approach above, where we execute the binary, we could follow a static approach where the code is not executed. This is usually more work, but in this case turns out to be fairly simple.
Looking at the binary in IDA, we can see there is a chunk of data in the legend bar that is marked as “Unexplored”. Browsing to this location shows that it is referenced from somewhere within the program. Looking at the code referencing this data shows the same function we found using dynamic analysis.
Stage 2 dumping and cleaning
If we place a breakpoint on the return address which calls the decoded code, and step into the first instruction, we find a mess of JMP instructions. It’s worth noting that the code takes quite a while to reach this point (delays inserted by stage 1). This is a way to avoid AV sandboxes, since users won’t tolerate waiting more than a second or two for the AV to sandbox a file. The code is incredibly hard to follow in this form, although it is possible. I continued this way for a while before I started being smart instead.
We know that these blocks are allocated in the region starting from 0x270000. Looking at the start of this region shows raw code, as opposed to a complete exe file loaded into memory. Seeing a complete exe loaded into memory is much more common, so this is interesting already.
Instead of working on this code from memory, let’s dump it to disk and create a legitimate exe from it. This will ease the analysis process, since IDA isn’t great at remembering labels and doing analysis on memory blocks containing code. For this I’ll use the memory viewing feature of Process Hacker.
We can assemble this blob into an exe by making use of NASM, an assembler. Using the following template, we include the blob file in the code section, then assemble and link it.
    global _mainCRTStartup
    section .text
    _mainCRTStartup:
    incbin "nanocore.dll_0x270000-0x6000.bin"

    nasm -f win32 nanocore_stage_2.asm
    ld -o nanocore_stage_2.exe nanocore_stage_2.obj
Now, remember that offset of 0x1D12 we had into the code block? We have to compensate for that in the new exe, as that is where execution should start. We can use CFF Explorer for this, by editing the “Original Entry Point” value. Looking at our code section, we see that the index into the file is 0x200, which is where our blob starts. The raw address representing 0x1D12 into the blob is thus 0x1F12 (0x1D12 + 0x200).
We have to convert this to the address it will be mapped to in memory, minus the image base value. Yup, I know that probably made no sense (PE file format and stuff), but you can do it easily using the address converter in CFF Explorer. Again, don’t worry about the PE stuff too much.
Using the calculated RVA, change the entry point to 0x2D12, and save/overwrite the exe.
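If you would rather script this than click through CFF Explorer, the entry point is just a 4-byte field in the PE optional header. The sketch below is my own rough equivalent using only the standard library. The function name is made up, and it assumes a well-formed file (no bounds checking beyond the two signatures).

```python
import struct


def set_entry_point(pe: bytearray, new_rva: int) -> int:
    """Overwrite AddressOfEntryPoint in place and return the old value."""
    if pe[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew (offset 0x3C) points at the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", pe, 0x3C)
    if pe[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # Signature (4 bytes) + COFF file header (20 bytes), then
    # AddressOfEntryPoint sits 16 bytes into the optional header.
    field = e_lfanew + 4 + 20 + 16
    (old_rva,) = struct.unpack_from("<I", pe, field)
    struct.pack_into("<I", pe, field, new_rva)
    return old_rva
```

To use it, read the exe into a `bytearray`, call `set_entry_point(data, 0x2D12)`, and write the buffer back to disk.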
K, so now we’re in business, let’s load this cleaned binary into our debugger. Trust me, you’ll want to use one that can do graph displays. Let’s look around a bit, see how the flow works, etc.
Wait a second, this thing is still one big clusterfuck, IDA didn’t do a good job of analysing it at all. Some functions include blocks that form part of other functions, blocks are excluded, etc. We’re going to have to clean it up manually. Firstly, go find all the functions like the one above and delete them using “Edit->Functions->Delete function”, there’s only a couple of them. Looking at the code at our entry point again, we see that some jumps are not followed. This is because IDA struggles to process the heavily obfuscated code used by this packer.
We will have to fix these up manually by deleting the misidentified functions they point to (if needed), and adding the destination code as function chunks of the current function. For the jump above, follow the jump and add the destination code as a function chunk by selecting all the code and selecting “Edit->Functions->Append function tail…”
After fixing up just one jump, IDA does a way better job at analysing the main function. There are still some errors, but we just follow this same approach when we analyse a new piece of the executable that wasn’t interpreted correctly.
Skipping ahead a bit, when the program flow for the entire executable has been cleaned up, it looks much more like we would expect, and we can start playing with it. The main function, although having a strange structure, makes way more sense.
Getting down in the thick of stage 2
Now that we have a clean second stage to work from, we can start looking at the functionality. Of particular interest are the features we observed that are not part of the stock Nanocore strain. We mainly want to know how it managed to detect some of the sandboxes we initially used, how it protects itself, and how it persists on the host.
The first interesting function we see, simply resolves all the Win32 API calls the sample will need to function. Since the code is position independent, and basically shellcode, this step is required. In-fact, this is the first thing it does, which makes sense.
Resolving the address of each function is performed by walking the export address table (EAT) and hashing each function name. This hash is then compared against the hashes of the functions of interest, which are stored in the code. When a match is found, the function associated with the stored hash is identified. This approach helps obfuscate the operation somewhat, as it isn’t immediately evident which function is being imported. A nice clean example of this is the Metasploit HashAPI, which is used by Metasploit, CobaltStrike, and others.
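To make the hash lookup concrete, here is a small Python sketch. I have not documented the exact hash this packer uses, so the rotate-right-by-13 hash below is an assumption, borrowed from the style popularised by Metasploit-type shellcode (and simplified: the real Metasploit routine also mixes in the module name). The export list is a stand-in for the names you would read out of a DLL’s export table.

```python
def ror32(value: int, bits: int) -> int:
    """Rotate a 32-bit value right by the given number of bits."""
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def ror13_hash(name: str) -> int:
    """Rotate-and-add hash over the ASCII bytes of an export name."""
    h = 0
    for byte in name.encode("ascii"):
        h = (ror32(h, 13) + byte) & 0xFFFFFFFF
    return h


def resolve_by_hash(wanted: int, export_names):
    """Return the first export whose hash matches, or None."""
    for name in export_names:
        if ror13_hash(name) == wanted:
            return name
    return None


# Stand-in for the names found in a DLL's export table.
exports = ["CreateProcessA", "VirtualAlloc", "LoadLibraryA", "Sleep"]
wanted = ror13_hash("VirtualAlloc")  # only this number lives in the shellcode
resolved = resolve_by_hash(wanted, exports)
```

The point is that the binary only ever contains the 32-bit number, never the string, which is why the imports don’t show up in a strings dump.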
The next interesting function enumerates and saves the names of all running processes, we’ll see why in a bit.
At this point we get to our first anti-analysis code: the CPUID instruction is used to query the CPU for vendor and capabilities. This can be used to detect if the process is running in a virtual environment, among other things.
Both of these functions above do similar things, though by different means. They check the CPU name to try and identify common sandboxes. Strangely, neither detected my analysis environment running in VirtualBox.
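Python can’t issue CPUID directly, but the string handling around it is easy to show. The sketch below assumes you already have the register values (from a debugger, say) and turns them into the strings such a check would compare. The vendor list is illustrative and is not the list used by this sample, and I am assuming the standard leaves (0 for the CPU vendor, 0x40000000 for the hypervisor ID) rather than whatever the packer actually queries.

```python
import struct

# Hypervisor IDs commonly reported via CPUID leaf 0x40000000.
# Illustrative list only; not taken from the sample.
HYPERVISOR_VENDORS = {
    "VBoxVBoxVBox": "VirtualBox",
    "VMwareVMware": "VMware",
    "KVMKVMKVM": "KVM",
    "Microsoft Hv": "Hyper-V",
    "XenVMMXenVMM": "Xen",
}


def regs_to_text(*regs: int) -> str:
    """Pack 32-bit register values little-endian and decode as ASCII."""
    raw = b"".join(struct.pack("<I", r) for r in regs)
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


def cpu_vendor(ebx: int, ecx: int, edx: int) -> str:
    """CPUID leaf 0 spells the vendor string in EBX, EDX, ECX order."""
    return regs_to_text(ebx, edx, ecx)


def hypervisor_name(ebx: int, ecx: int, edx: int):
    """CPUID leaf 0x40000000 spells the hypervisor ID in EBX, ECX, EDX order."""
    return HYPERVISOR_VENDORS.get(regs_to_text(ebx, ecx, edx))
```

For example, `cpu_vendor(0x756E6547, 0x6C65746E, 0x49656E69)` decodes to `GenuineIntel`.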
The next function builds a list of strings, one character at a time. This helps prevent commands like Strings revealing all the secrets. These strings are then checked against the sample filename. The names are commonly used when performing malware analysis, so if these checks ever succeed, the chances are high that the sample is being analysed.
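As a rough illustration of the idea, here is the same trick in Python: the strings only exist once the code has run, and the filename is then checked against them. The three names are hypothetical examples, not the ones from the sample, and the substring match is my assumption about how the comparison works.

```python
import ntpath


def build_string(char_codes) -> str:
    """Assemble a string one character at a time, as the packer does."""
    out = []
    for code in char_codes:
        out.append(chr(code))
    return "".join(out)


# Hypothetical blacklist; the real names are built inside the sample.
SUSPICIOUS_NAMES = [
    build_string([0x73, 0x61, 0x6D, 0x70, 0x6C, 0x65]),        # "sample"
    build_string([0x6D, 0x61, 0x6C, 0x77, 0x61, 0x72, 0x65]),  # "malware"
    build_string([0x73, 0x61, 0x6E, 0x64, 0x62, 0x6F, 0x78]),  # "sandbox"
]


def filename_looks_analysed(path: str) -> bool:
    """True if the file name (without extension) contains a blacklisted word."""
    stem = ntpath.splitext(ntpath.basename(path))[0].lower()
    return any(name in stem for name in SUSPICIOUS_NAMES)
```

Renaming the sample to something boring before running it is usually enough to get past this kind of check.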
The sample also checks for a couple of AVs installed on the system.
If one is found, the sample enters a loop that allocates memory, sleeps, and releases the buffer, repeating 15000 times. This is a common anti AV sandbox technique.
The program then tries to determine if it is actively being investigated by commonly used analysis programs, since we aren’t using any of these tools, this doesn’t affect us. This function again builds up strings byte for byte before checking the previously enumerated list of processes.
At this point another debugger check is performed, making use of NtQueryInformationProcess. Two techniques are used: the first queries ProcessDebugObjectHandle (0x1E), and the second ProcessDebugFlags (0x1F). Other checks such as ProcessDebugPort (0x07) and ProcessBasicInformation (0x00) could also have been used, but weren’t.
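For anyone who wants to poke at these two queries themselves, here is a hedged ctypes sketch. The information class values are the ones named above, everything else (function names, how the results are combined) is my own. `query_current_process` only works on Windows, while `looks_debugged` is plain logic you can run anywhere.

```python
import ctypes

PROCESS_DEBUG_OBJECT_HANDLE = 0x1E
PROCESS_DEBUG_FLAGS = 0x1F
STATUS_SUCCESS = 0x00000000


def looks_debugged(object_status: int, debug_flags: int) -> bool:
    """Interpret the two query results the way an anti-debug check would.

    ProcessDebugObjectHandle only succeeds when a debug object exists,
    and ProcessDebugFlags reads 0 while a debugger is attached.
    """
    return object_status == STATUS_SUCCESS or debug_flags == 0


def query_current_process():
    """Windows only: return (object_status, debug_flags) for this process."""
    ntdll = ctypes.WinDLL("ntdll")
    query = ntdll.NtQueryInformationProcess
    query.restype = ctypes.c_ulong  # NTSTATUS, read as unsigned
    query.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p,
                      ctypes.c_ulong, ctypes.c_void_p]
    current = ctypes.c_void_p(-1)  # pseudo-handle for the current process

    debug_object = ctypes.c_void_p()
    object_status = query(current, PROCESS_DEBUG_OBJECT_HANDLE,
                          ctypes.byref(debug_object),
                          ctypes.sizeof(debug_object), None)

    flags = ctypes.c_ulong(1)  # stays 1 (not debugged) if the query fails
    query(current, PROCESS_DEBUG_FLAGS, ctypes.byref(flags),
          ctypes.sizeof(flags), None)
    return object_status, flags.value
```

On a Windows box, `looks_debugged(*query_current_process())` should flip depending on whether you launched Python under a debugger.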
A nice way to bypass this check is to place a breakpoint at the function return address, and add some Python code to blank the return value when the breakpoint is hit.
This leads us into one of the first actual features of the wrapper. It verifies that 3 arguments were passed to the application, the first is a number of either 1 or 2, the second is a process id, and the third is a GetTickCount value.
If the value 1 is passed to the application, the program waits for a Mutex release and deletes a file. This Mutex is created in the final stage of the wrapper. I didn’t see it being used, but that part of the code didn’t trigger with the config in my sample (more on the config later).
If the value two is passed, the process periodically checks that the process with the given PID is still alive, and if not relaunches itself without parameters and exits. This is a simple keepalive mechanism.
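The argument handling is simple enough to sketch. The version below assumes the three values arrive as decimal strings, which I did not verify, and the type and function names are mine.

```python
from typing import NamedTuple, Optional, Sequence


class WatchdogArgs(NamedTuple):
    mode: int        # 1 = wait on the mutex and delete a file, 2 = keepalive
    pid: int         # process ID to watch
    tick_count: int  # GetTickCount value supplied by the launcher


def parse_args(argv: Sequence[str]) -> Optional[WatchdogArgs]:
    """Return the parsed arguments, or None for a normal launch."""
    if len(argv) != 3:
        return None
    try:
        mode, pid, tick = (int(a) for a in argv)
    except ValueError:
        return None
    if mode not in (1, 2):
        return None
    return WatchdogArgs(mode, pid, tick)
```

Anything that doesn’t match this shape falls through to the normal execution path, which is what happens on the first launch.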
We then get into another debug check, but this time one that doesn’t rely on Win32 function calls. There are easy ways to hide from this, but I didn’t bother. The process checks the debug flag in the PEB, and exits if set. To bypass this check we have to patch the return value or the value in the PEB. Once again, we’ll place a breakpoint at the return address with some Python to blank the return value.
The process then loads resource 1000, of type 11 (RT_MESSAGETABLE), which is where we’re buggered. Remember that we recompiled only the payload, well, we didn’t migrate the resources to the new executable. My first approach was to move them over using different resource editors, but this was time consuming and I’m lazy.
So, let’s cheat, let’s just move the stage 2 code back into the original binary (stage 1). If we update the entry-point accordingly, we can keep using our existing IDA database so we don’t lose our analysis progress. (I just thought of something two weeks after publishing this. Instead of copying resources over to the new exe one by one, I could have just copied the entire resource section and updated the resource directory entry, oh well)
In the original binary the code offset into the file was 0x400, as opposed to 0x200 in the clean binary we compiled. We overwrite the data in the original file, starting from 0x400 with the binary code extracted forming part of stage 2. We effectively move the decoded code over to the file containing the resources, instead of moving a gazillion resources over to the cleaned binary.
Ok, some more WTF PE stuff. Even though we are writing our code to a different location in the file, the original entry point value is going to stay the same? Why? fucked if I know. Just kidding. Following the same approach as above. Looking up the RVA of 0x2112 (0x1D12 + 0x400), gives us 0x2D12 again, what witchcraft is this 🙂 If you are interested in how the PE file format works, this is an exceptional place to start. Go overwrite the original entry point in the original file, and keep rolling with the IDA database file we used up until now. Again, again, PE magic, don’t worry about it too much at this point. (This approach works because the region where the code is mapped into memory is the same for both binaries (0x1000))… Urgh, I have nested brackets.
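The witchcraft is just arithmetic. A file offset inside a section becomes an RVA by subtracting the section’s raw pointer and adding its virtual address. A tiny sketch using the numbers from both binaries (the function and variable names are mine):

```python
def raw_to_rva(raw_offset: int, section_raw_ptr: int, section_va: int) -> int:
    """Convert a file offset inside a section to a relative virtual address."""
    return raw_offset - section_raw_ptr + section_va


SECTION_VA = 0x1000     # where the code section is mapped in both binaries
PAYLOAD_ENTRY = 0x1D12  # entry offset into the decoded blob

# Clean binary: code starts at file offset 0x200.
clean_rva = raw_to_rva(0x200 + PAYLOAD_ENTRY, 0x200, SECTION_VA)
# Original binary: code starts at file offset 0x400.
original_rva = raw_to_rva(0x400 + PAYLOAD_ENTRY, 0x400, SECTION_VA)
```

The raw pointer cancels out, so both come out as 0x2D12 as long as the section is mapped at the same virtual address.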
Now when we run the modified application, the resource loads successfully. Here is where it gets interesting, the author needs to configure the wrapper or recompile it every time with a new config. I suspect the control flow jumping is not automated, so saving the configs in the resources is easier than working with the obfuscated code. Turns out this is exactly what he does. The first resource loaded, shown above, is the encrypted configuration. The first 16 bytes of this blob are a simple XOR key, with the rest of the blob being the actual config.
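Decoding the config outside the debugger is a few lines of Python. This sketch assumes the 16-byte key simply repeats over the rest of the blob, starting at the first config byte. That matches “simple XOR key”, but treat it as my reading rather than a verified detail. The function name is hypothetical.

```python
def decode_config(blob: bytes, key_len: int = 16) -> bytes:
    """Split off the leading key and XOR it cyclically over the remainder."""
    if len(blob) < key_len:
        raise ValueError("blob is shorter than the key")
    key, body = blob[:key_len], blob[key_len:]
    return bytes(b ^ key[i % key_len] for i, b in enumerate(body))
```

Export resource 1000 with any resource editor, feed the bytes to `decode_config`, and compare the result with what you see in memory after the decode routine runs.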
By examining where each field within the decoded blob is used, and what it does, we can try to reconstruct the features enabled, as well as guess what the others may be. Creating and applying a structure within IDA makes this easy to follow.
Since this wrapper has a bunch of features, let’s just list them instead of showing the code.
- Add varying delays at different locations.
- Additional CPUID checks.
- Anti-VM checks by enumerating drivers. (VBoxGuest.sys, VBoxMouse.sys, vmmouse.sys, vmhgfs.sys)
- Sandboxie check.
- PEB debug check.
- Check running avp.exe.
- Check running bdwtxag.exe, bdagent.exe.
- Check running Dr. Web.
- Enumerate through a specified list of processes and check if they are running.
- Persist in the startup folder by combining the path with a provided .vbs blob. (FOUND THAT FEATURE WE WERE TALKING ABOUT!)
- Some other persistence mechanism I didn’t look at.
- Extract a resource and execute it.
- Use a resource name and count, and load and append these together.
- Decodes the third stage (appended resources) payload by making use of a similar XOR mechanism, then applying a secondary decoding technique that saves a couple of bytes, skips ahead, saves, etc. The values that specify this skip length are saved in the config. I figured this technique was used to reduce resource entropy, but didn’t verify.
- Pops an error message-box under some condition. (development code maybe)
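The “save a couple of bytes, skip ahead” step from the list above can be sketched as follows. The real keep and skip lengths come from the config and I am not reproducing them here, so both parameters are placeholders.

```python
def keep_and_skip(data: bytes, keep: int, skip: int) -> bytes:
    """Copy `keep` bytes, drop the next `skip` bytes, and repeat."""
    if keep <= 0 or skip < 0:
        raise ValueError("keep must be positive and skip non-negative")
    out = bytearray()
    pos = 0
    while pos < len(data):
        out += data[pos:pos + keep]
        pos += keep + skip
    return bytes(out)
```

For example, `keep_and_skip(b"ABxxCDxxEF", 2, 2)` returns `b"ABCDEF"`. If the filler bytes are low-entropy padding, this would fit my guess about keeping the resource entropy down.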
Not all of the above features were enabled in the config for this sample, so I might have missed some things. With the stage 3 payload decoded, which is an exe, a new process is launched in a suspended state, which is the start of process hollowing. A new page is allocated, and mapped into both processes, the exe contained in the resources is mapped into this segment, and the thread context of the new process changed to execute the new code. The process is resumed, and the third stage magic starts.
Along with the new process, one is created with the command line parameters specified up top somewhere. This protects the executable from being terminated, and launches it again if it is. We can now dump stage 3 from memory, and continue our analysis. Yeah, there’s another one as well, you’ll be fine, just hang in there.
So why do we need stage 3?
Extracting stage 3 is pretty similar to stage 2. We simply need to place a breakpoint before it gets injected and mapped into the new process, and save that segment to disk. You can grab this stage here. We’re lucky with this one, since the executable does not need complex modification. It is, however, packed as well 🙂 Let’s see with what.
Unpacking UPX manually is trivial, but since unpacking it automatically is even easier, let’s do that instead. CFF Explorer can unpack UPX, so let’s do that.
Yeah, the unpack button is greyed out in that screenshot, that’s the screenshot I had ok. Looking at the first couple of lines of the unpacked exe shows another PEB debugger check, followed by a function that loads a resource. By the looks of it there is some more resource magic going on.
Looking at the resources shows an embedded exe, not encoded, not packed, nothing. OK, I’m lying, it’s obfuscated, but we’ll get to that in a bit. For now export the resource as stage 4.
It’s worth noting that stage 4 is a .net binary. You can spot this from the data directories in CFF, as well as the parsed data (note, screenshot below is of stage 4, not stage 3 like the ones above and below).
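The data directory check CFF shows can also be scripted. A .NET binary has a populated CLR runtime header entry (index 14) in the optional header’s data directories. The sketch below is my own minimal parser using only `struct`. It assumes an untruncated file and does nothing beyond this one check.

```python
import struct

CLR_DIRECTORY_INDEX = 14  # IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR


def is_dotnet(pe: bytes) -> bool:
    """Return True if the CLR runtime header data directory is populated."""
    if len(pe) < 0x40 or pe[:2] != b"MZ":
        return False
    (e_lfanew,) = struct.unpack_from("<I", pe, 0x3C)
    if pe[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return False
    opt = e_lfanew + 4 + 20  # skip the signature and COFF file header
    (magic,) = struct.unpack_from("<H", pe, opt)
    if magic == 0x10B:       # PE32
        dirs = opt + 96
    elif magic == 0x20B:     # PE32+
        dirs = opt + 112
    else:
        return False
    # NumberOfRvaAndSizes is the dword right before the directory array.
    (count,) = struct.unpack_from("<I", pe, dirs - 4)
    if count <= CLR_DIRECTORY_INDEX:
        return False
    rva, size = struct.unpack_from("<II", pe, dirs + 8 * CLR_DIRECTORY_INDEX)
    return rva != 0 and size != 0
```

Running this over the exported resource is a quick way to confirm you are about to switch tools from IDA to dnSpy.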
Back to stage 3, looking at the section headers while we have CFF open shows something else of interest. There is a small section with an uncommon name that stands out. We’ll see what this is in a bit. The binary also has two sections named .text, mapping data from disk into two different regions starting from the same offset in the file. The first however loads 0 bytes from the file, but requests 0x3800 bytes in memory. For my analysis I renamed these sections to .text1 and .text2, to make things easier.
While browsing around in the code, one of the functions references a constant with the value 0x428A2F98. A quick google shows this is a SHA256 magic value, so we can easily identify that function (let’s all pretend that’s how I figured it out, and it didn’t take me forever before I thought of that). I could also have used the findcrypt-yara plugin for IDA.
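If you want the poor man’s version of that plugin, searching a dump for the constant is trivial. The first four SHA-256 round constants are listed below (0x428A2F98 is the first), and the helper name is my own. It only looks for little-endian dwords, which is how they would normally sit in an x86 binary.

```python
import struct

# First four SHA-256 round constants, K[0..3].
SHA256_K = (0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5)


def find_constant(data: bytes, value: int):
    """Return every offset where `value` appears as a little-endian dword."""
    needle = struct.pack("<I", value)
    hits = []
    pos = data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits
```

Seeing several of these constants close together is a much stronger signal than a single hit.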
We see the decoded resource being passed to this function, along with a value extracted from the .d section we saw above. This value matches the length of the resource, so we know a checksum is performed. A blob of data in the .d section is then compared to a buffer filled out by the SHA256 function, confirming this. As stage 2 and 3 use configuration info from the resource section, this might indicate that both stages were written by the same author.
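The integrity check described above is roughly equivalent to the following. The function name and argument layout are mine, and the stored length and digest would come from the .d section.

```python
import hashlib


def resource_is_intact(resource: bytes, expected_len: int,
                       expected_digest: bytes) -> bool:
    """Check the resource length, then compare its SHA-256 to the stored one."""
    if len(resource) != expected_len:
        return False
    return hashlib.sha256(resource).digest() == expected_digest
```

This is handy to replicate offline, since it tells you whether the stage 4 you exported is byte-for-byte what stage 3 expects.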
The next part of the code is interesting, two similar functions are used to map the resource executable into a memory state. The first fills in the IAT, but does not do relocations, while the second one does both (not sure why two are used, could have used only one). Then, for good measure, another PEB debug check is done.
Mapping something into memory is a subset of process hollowing, so that is a good place to start for those unfamiliar with the process. One of the mapped segments is then queried for the .net version it uses, and a number of checks are performed to ensure that the binaries and folders for that version exist on disk.
The binary includes resolver type functionality, but with something strange going on. By the look of it, function hooks are being created to modify the stock behavior of the windows APIs. That said, it could be making rainbows and unicorns for all I know, since I didn’t dig that deep into it.
Lastly, remember that segment that was just allocated as a massive block of memory, but didn’t map anything into it? The binary copies the unmapped embedded binary into this space, and passes execution to it.
So, what’s all of this about? unmanaged code (ie, raw assembly), can’t directly run managed code such as .net. Stage 3 takes care of this by loading the required managed code libraries and passing execution to the .net module it contains. Pretty clever actually, a technique I’ve never seen before. Though, you could pass this off to a lack of experience on my part, since malware is more of a hobby than anything else. (Update, I recently came across an AV that does just this, though in a more elegant manner.)
That’s it, stage 3 loads a binary resource (stage 4), verifies its SHA256 hash against a stored one, maps it into memory, ensures the .net version is correct and loads all the requirements, then overwrites part of its own memory space and passes execution to it (and does some debug checks along the way). Now the question is, what the hell is stage 4 (light, tunnel, end, stuff).
Can stage 4 be the last one already?
Opening stage 4 in dnSpy (since we know it’s .net), shows a pretty decent piece of heavily obfuscated code. There aren’t any big resources or blobs we can see, so this is almost certainly raw Nanocore. (Large resource sections are possible when Nanocore includes plugins in the payload though, let’s see)
We can clean up the obfuscation quite a bit by using de4dot, but it will still take a decent amount of work to figure out what it is doing.
Since we can see what features Nanocore has from the C2 GUI, we aren’t particularly interested in analysing the features. What we do care about however is how it saves the configuration in the implants. This will allow us to see the exact details, allow us to look for IOCs, and possibly keep track of different threat actors.
So, let’s play with Nanocore.
Since we have access to the full framework, let’s go play with it a bit. Nanocore was made with proper ease-of-use in mind, it gives you a nice GUI, some spying plugins, and the ability to open ports on low-end routers using UPNP, ie, it needs very little skill to use.
If we look at a traffic dump of the running implant, we see that it uses non-standard comms between the client and the server. This is a bad design decision because content inspection firewalls will block it outright, but let’s move on.
If we look at the config panel, we can see a bunch of the available features. We specifically want to know how the payloads change when using different configurations.
If we generate two different payloads, and include a plugin in the one, we should be able to see where the settings and embedded data are stored. Looking at these two payloads with BeyondCompare in the hex view, shows that the bulk of the changes are at the end. If we take an address in the changed location and look up the section, we can see that it’s in the resource section.
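If you don’t have BeyondCompare handy, a few lines of Python give you the same list of changed regions. This is a naive byte-by-byte comparison of my own, not a proper diff, so it only makes sense for files that are mostly aligned, which two builds of the same payload are until the resource section.

```python
def changed_ranges(a: bytes, b: bytes):
    """Return (start, end) offsets, end exclusive, where the inputs differ."""
    ranges = []
    start = None
    shared = min(len(a), len(b))
    for i in range(shared):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, shared))
    if len(a) != len(b):
        # Everything past the shorter file counts as changed.
        ranges.append((shared, max(len(a), len(b))))
    return ranges
```

Map the reported offsets back to sections and you should see the small header changes plus one large block in the resource section.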
We do have some minor changes not in the resource section, but these are things like the exe checksum and resource directory size. Switching back to dnSpy, we want to look for resource usage, and possibly crypto and decompression functions. Searching for the word “crypto” across all assemblies leads us to a function called smethod_0.
If we search for the word “resource”, we find a function called smethod_16, which loads a resource into memory.
Selecting either of these function names, and using the analyse option, shows that they are both being used by a function called smethod_16. This is almost certainly where the resource is loaded and decrypted. If we place breakpoints at all the locations the function can return from and run it, we can see the unencrypted configuration in array5.
We now have enough information to manually collect IOCs if needed. We might have been able to gather some info from running the sample in a Sandbox, but since we observed anti-sandbox behavior, it might have been necessary to dig down this far. Luckily for us, someone has already gone through all the trouble of creating a decryptor for this version of Nanocore.
The code is easier to follow than that of the obfuscated sample, but if you compare them side by side, it’s easy to see how the process works. The decryptor even goes as far as parsing out the settings for us.
Took your bloody time didn’t you.
That’s it, we did it. We analysed a heavily packed sample to extract multiple stages. We reversed these stages to identify additional features they provide to the final payload, such as process protection, anti-analysis, and persistence in this case (among others). And finally, we looked at how the framework works, played with some of the features, and figured out how to extract the configuration for these features from the payload. I’d say at this point we have a pretty good grasp of what this particular actor is throwing at us.
It’s always good to look at these techniques and identify how they stand up to your controls. In a semi-secured environment this should get stopped at multiple levels:
- The email gateway should detect/sandbox/block etc this sample based on the delivery details.
- Any halfway decent behavioral AV is going to detect the hollowing/injection elements.
- A decent EDR solution should make investigating a potential infection easy, while ensuring sensitive credentials don’t have to be used to connect to the host, which could then be stolen by the attacker.
- A decent corporate firewall will block non-standard communication protocols, such as the one used by Nanocore.
So even if we ignore the fact that the sample sandbox detection didn’t flag in my environment, there was no way in hell that this would have affected us. Next time they should stick to the red-team motto, be less shit and be more better.