Hello all! I have not posted in a while because I recently started a new job! 😀
Anyways, that is not why I am posting right now. I am posting because I have been working on my web spider, SpiderBot. I have finally gotten a very stable multithreaded design going, and it hasn't stopped once. The only time it stops now is when I overload my database server by running multiple SpiderBot scrapers on multiple machines. The IO wait on the disk was killing SpiderBot because it would time out on its database connections.
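Side note for anyone hitting the same thing: the usual band-aid for those connection timeouts is a retry-with-backoff wrapper around the database call. Here is a minimal Python sketch of that idea; it is not SpiderBot's actual code, and the wrapped call is a placeholder:

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=0.5, retry_on=(TimeoutError,)):
    """Call fn(), retrying with exponential backoff plus jitter
    when it raises one of the exceptions in retry_on."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries, let the caller see the failure
            # back off: 0.5s, 1s, 2s, ... plus a little random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# hypothetical usage: wrap a flaky database call
# result = with_retries(lambda: cursor.execute("SELECT 1"))
```

This doesn't fix the IO wait itself, of course; it just keeps the scraper threads alive while the disk catches up.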
However, after looking into what was causing the IO wait, I discovered it wasn't SpiderBot causing it; it was the filesystem itself. Take a look:
A 10% IO wait? Jesus. And dat IO! 99.99%?! It stayed stuck at that for a while.
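For anyone curious where that IO-wait percentage actually comes from: tools like top derive it from the jiffy counters on the aggregate `cpu` line of `/proc/stat`, where the fifth counter is iowait time. A rough Python sketch of the calculation (the field layout is the standard Linux one, not anything SpiderBot-specific):

```python
def iowait_percent(sample1, sample2):
    """Given two snapshots of the aggregate 'cpu' line from /proc/stat
    (lists of jiffy counters: user, nice, system, idle, iowait, ...),
    return the percentage of time spent in IO wait between them."""
    deltas = [b - a for a, b in zip(sample1, sample2)]
    total = sum(deltas)
    # deltas[4] is the iowait counter in the standard field order
    return 100.0 * deltas[4] / total if total else 0.0

# On a live Linux box you would take the samples like this:
# with open("/proc/stat") as f:
#     fields = [int(x) for x in f.readline().split()[1:]]
```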
So I looked into what could be causing this issue. So far I have found two likely causes: either the drive is starting to fail (which is a real possibility), or there is a bug in ext4's Journaling Block Device (jbd2).
So I booted up CrystalDiskInfo to take a look at the SMART Status of the drive, and this is what I got:
While it says the drive health is good, I'm really thinking that the drive is either getting old and starting to show warning signs like this, or it is simply too old to keep up with the amount of work I am giving it.
So what am I going to do?
I am thinking that since SSD prices have come down quite a lot, I would grab one 128 GB SSD for the host of my virtual machines (read: database server), and two 256 GB SSDs for the virtual machines themselves. After that I would grab a 2 TB hard drive for backing up the VMs and one more 2 TB HDD for the database, provided that the drive is fast enough.
If I could, I would put the database on the SSDs, but since the dataset I am working with in SpiderBot routinely goes above 500 GB, I can't really do that.
My next idea is to check out other database solutions for clustering, or maybe try my hand at a Hadoop install for NoSQL / MapReduce.
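If I do go the clustering route, the simplest scheme to start with would be hash-based sharding of URLs across database nodes. A toy Python sketch of the idea (the node names are made up, and this is one common approach, not a specific product's):

```python
import hashlib

def shard_for(url, num_shards):
    """Map a URL to a shard index via a stable hash, so the same URL
    always lands on the same database node. hashlib is used instead of
    the built-in hash(), which is randomized per process for strings."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# hypothetical cluster of 4 database nodes
nodes = ["db0", "db1", "db2", "db3"]
target = nodes[shard_for("http://example.com/page", len(nodes))]
```

The catch with plain modulo sharding is that adding a node remaps almost every URL, which is why real clusters tend to use consistent hashing instead.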
I know what you're asking me: “Why are you going through so much trouble for something that Google already does?” Well: one, I like to do things like this. Two, it's fun. Three, SPIDERBOT GOT ME A JOB. Four, SpiderBot helps me find the broken links on my websites. Five, it's better to keep the data in-house, because some a-hole could see what my database contains (email addresses, link information, personal information about people, etc.) and take it and sell it to someone. Privacy much? Six, this makes me a better programmer and software engineer. Seven, I like to break systems and measure where that breaking point is. Eight, I just like numbers in general. I think I get that from my sister.
Hmm, I think I got off track. Oh well, that's all I really wanted to log about SpiderBot today. Thanks for reading!