Category Archives: SpiderBot


Long time no see!

Categories: SpiderBot

Hello all! I have not posted in a while because I recently started a new job! 😀

Anyway, that is not why I am posting right now. I am posting because I have been working on my web spider, SpiderBot. I finally have a very stable multithreaded design going, and it has not crashed once. The only time it stops now is when I overload my database server by running multiple SpiderBot scrapers on multiple machines. The IO wait on the disk is killing SpiderBot because the database connection times out.

However, after looking into what was causing the IO wait, I discovered that it wasn't SpiderBot itself, it was the filesystem. Take a look:

[Image: iowait]

[Image: jbd2]

 

A 10% IO wait? Jesus. And dat IO! 99.99%?! It stayed stuck at that for a while.

So I looked into what could be causing this. So far I have found two possibilities. Either the drive is starting to fail, or there is a bug in jbd2, the journaling block device layer that ext4 uses.

So I booted up CrystalDiskInfo to take a look at the SMART status of the drive, and this is what I got:
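For anyone who wants to watch that IO wait number without staring at a monitoring tool, here is a minimal sketch of how it can be derived on Linux from two samples of the aggregate `cpu` line in `/proc/stat`. The function names are made up for illustration, and this is not necessarily how the tool in my screenshots calculates it.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # guest and guest_nice are skipped because they are already
    # counted inside user and nice.
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait column


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample_iowait(interval_seconds=1.0):
    """Take two samples interval_seconds apart and return the iowait percentage."""
    before = read_cpu_line()
    time.sleep(interval_seconds)
    return iowait_percent(before, read_cpu_line())
```

Calling `sample_iowait(5)` on the database host averages the figure over five seconds, which smooths out short spikes.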

REDACTED

 

While it says the drive health is good, I really think that the drive is either getting old and starting to show warning signs like this, or it simply can't keep up with the amount of work I am giving it.

So what am I going to do?

Since SSD prices have come down quite a lot, I am thinking of grabbing one 128 GB SSD for the host of my virtual machines (read: database server) and two 256 GB SSDs for the virtual machines themselves. After that I would grab a 2 TB hard drive for backing up the VMs and one more 2 TB HDD for the database, provided that the drive is very fast.

If I could, I would put the database on the SSDs, but since the dataset I am working with on SpiderBot routinely goes above 500 GB, I can't really do that.

My next idea would be to check out other database solutions for clustering, or maybe trying my hand at a Hadoop install for NoSQL / MapReduce.

I know what you're asking me: “Why are you going through so much trouble for something that Google already does?” Well, one, I like to do things like this. Two, it's fun. Three, SPIDERBOT GOT ME A JOB. Four, SpiderBot helps me find the broken links on my websites. Five, it's better to keep the data in house, because otherwise some a-hole could see what my database contains (email addresses, link information, personal information about people, etc.), take it, and sell it to someone. Privacy much? Six, this makes me a better programmer and software engineer. Seven, I like to break systems and measure where that breaking point is. Eight, I just like numbers in general. I think I get that from my sister.

Hmm, I think I got off track. Oh well, that's all I really wanted to log about SpiderBot today. Thanks for reading!

 

GRRRR

Categories: Apache Cassandra, SpiderBot

So, I was working on my project SpiderBot when I finally got around to changing from MySQL to a NoSQL database. I decided to go with Apache Cassandra because of how easy it is to cluster. Not only that, but it is dead easy to set up.

Point is, I used Fluent Cassandra to query the database from C#. Fluent Cassandra is very well done, except for the problem I had with actually running CQL queries. For some reason, whenever I ran a CQL query, Cassandra would just return an error saying that my keyspace and column family don't exist. It took me a while to fix, because the bug wasn't in my code but in either Fluent Cassandra or Cassandra itself.

You see, I had set up my keyspace and column family names with capital first letters, like so: SpiderBot.Links, SpiderBot.Archive, etc. Once I changed them to all lower case (spiderbot.links, etc.), my CQL queries started working correctly. So far, I do not know why or how it does this, but I figured I should post about it in case other people have trouble querying Cassandra with Fluent Cassandra.
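A likely explanation is that CQL treats unquoted identifiers as case-insensitive and folds them to lower case, so an unquoted `SpiderBot.Links` in a query is looked up as `spiderbot.links`. Names that were created with capital letters have to be wrapped in double quotes to keep their case. Here is a minimal sketch of that quoting rule. It is in Python rather than C#, the helper names are hypothetical, and it is not part of Fluent Cassandra.

```python
import re

# Lower-case letters, digits and underscores, starting with a letter:
# the only shape that comes through CQL's case folding unchanged.
_SAFE_UNQUOTED = re.compile(r"^[a-z][a-z0-9_]*$")


def quote_cql_identifier(name):
    """Return name in a form that CQL will match case-sensitively.

    Assumption: reserved CQL keywords are not handled here; they
    would need quoting as well.
    """
    if _SAFE_UNQUOTED.match(name):
        return name
    # Inside a quoted identifier, a double quote is escaped by doubling it.
    return '"' + name.replace('"', '""') + '"'


def qualified_name(keyspace, table):
    """Build a keyspace.table reference for use in a CQL statement."""
    return quote_cql_identifier(keyspace) + "." + quote_cql_identifier(table)


if __name__ == "__main__":
    print("SELECT * FROM " + qualified_name("SpiderBot", "Links"))
    print("SELECT * FROM " + qualified_name("spiderbot", "links"))
```

Sticking to all-lower-case names, as I ended up doing, avoids the quoting altogether.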

[Image: SpiderBot Stats Test]

SpiderBot!

Categories: Blog, SpiderBot

I have not been posting recently, mostly because the first round of tests at McNeese is coming up and I have been studying. During my free time, however, I have been working on my senior project, which I am calling SpiderBot. SpiderBot is a web spider written in C#, using MySQL and PHP for the search engine. So far I have gotten it to be functional and not crash, and I can set it up on multiple machines to spread the work out across my (pretend) cluster. I decided I would post an image of the stats for it and see if anyone is interested in downloading it for themselves and telling me how awesome / sucky it is.

In other news, I went to Church Point today for Mardi Gras. Which was AWESOME. Drunk people everywhere chasing chickens. BBQ food and alcohol as far as your stomach could handle. And the BEADS. Jesus, the beads. As far as I could tell, they were rocket-propelled pellets of death and also the local currency. I saw someone trade beads for jello shots. Heck, they were giving us beer and jello shots from the floats! 😀

Awesome times were had.