Long time no see!

By | SpiderBot | No Comments

Hello all! I have not posted in a while because I just started a new job! 😀

Anyway, that is not why I am posting right now. I am posting because I have been working on my web spider, SpiderBot. I finally have a very stable multithreaded design going, and it hasn't stopped on its own once. The only time it stops now is when I overload my database server by running multiple SpiderBot scrapers on multiple machines. The IO wait on the disk kills SpiderBot because its database connection times out.

However, after looking into what was causing the IO wait, I discovered that it wasn't SpiderBot; it was the filesystem itself. Take a look:




A 10% IO wait? Jesus. And dat IO! 99.99%?! It stayed stuck at that for a while.
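For anyone who wants to get the IO wait number without a screenshot, here is a minimal sketch. It assumes a Linux box (the database server here runs ext4) and works from two samples of the aggregate `cpu` line of `/proc/stat`, where the fifth counter is iowait. The function names and the sample numbers are made up for illustration; this is not how SpiderBot or any monitoring tool does it.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Return the share of CPU time spent in iowait between two samples.

    Field order in /proc/stat: user, nice, system, idle, iowait, irq,
    softirq, steal, guest, guest_nice. Guest time is already counted in
    user/nice, so only the first eight fields are summed.
    """
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    # Hypothetical samples taken a few seconds apart (not real measurements).
    sample_1 = "cpu  100 0 50 800 50 0 0 0 0 0"
    sample_2 = "cpu  150 0 70 1450 130 0 0 0 0 0"
    print(round(iowait_percent(sample_1, sample_2), 2))
```

On a real machine you would read the first line of `/proc/stat` twice, a few seconds apart, and pass both strings in.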

So I looked into what could be causing this issue. So far I have found two possibilities. Either the drive is starting to fail, or there is a bug in the Journaling Block Device layer that ext4 uses.

So I booted up CrystalDiskInfo to take a look at the SMART Status of the drive, and this is what I got:



While it says the drive health is good, I'm really thinking that the drive is either starting to fail and showing warning signs like this, or it is just old and can't keep up with the amount of work I am giving it.

So what am I going to do?

I am thinking that, since SSD prices have come down quite a lot, I would grab one 128 GB SSD for the host of my virtual machines (read: database server) and two 256 GB SSDs for the virtual machines themselves. After that I would grab a 2 TB hard drive for backing up the VMs and one more 2 TB HDD for the database, provided that the drive is very fast.

If I could, I would put the database on the SSDs, but since the dataset I am working with on SpiderBot routinely goes above 500 GB, I can't really do that.

My next idea would be to check out other database solutions for clustering, or maybe trying my hand at a Hadoop install for NoSQL / MapReduce.

I know what you're asking me: “Why are you going through so much trouble for something that Google already does?” Well, one, I like to do things like this. Two, it's fun. Three, SPIDERBOT GOT ME A JOB. Four, SpiderBot helps me find the broken links on my websites. Five, it's better to keep the data in house, because otherwise some a-hole could see what my database contains (email addresses, link information, personal information about people, etc.) and take it and sell it to someone. Privacy much? Six, this makes me a better programmer and software engineer. Seven, I like to break systems and measure where that breaking point is. Eight, I just like numbers in general. I think I get that from my sister.

Hmm, I think I got off track. Oh well, that's all I really wanted to log about SpiderBot today. Thanks for reading!



By | Apache Cassandra, SpiderBot | No Comments

So, I was working on my project SpiderBot when I finally got around to changing from MySQL to a NoSQL database. I decided to go with Apache Cassandra because of how easy it is to cluster. Not only that, but it is dead easy to set up.

Point is, I used Fluent Cassandra to query the database from C#. Fluent Cassandra is very well done, except for the problem I had with actually running CQL queries. For some reason, whenever I ran a CQL query, Cassandra would just return an error saying that my keyspace and column don't exist. It took me a while to fix, because the bug wasn't in my code; the problem was in either Fluent Cassandra or Cassandra itself.

You see, I had set up my keyspace and column names with capital first letters, like so: SpiderBot.Links, SpiderBot.Archive, etc. Once I changed them to all lower case (spiderbot.links, etc.), my CQL queries started working correctly. So far, I do not know why or how it does this, but I figured I should post about it in case other people have trouble querying Cassandra with Fluent Cassandra.
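One likely explanation, offered as an assumption rather than something confirmed for this setup: CQL treats unquoted identifiers as case-insensitive and folds them to lower case, and only preserves case when the name is wrapped in double quotes. If the keyspace was created with its capital letters preserved, an unquoted `SpiderBot` in a query would be looked up as `spiderbot` and not found. The sketch below is a toy model of that rule in Python; the function names are hypothetical and it does not talk to Cassandra or Fluent Cassandra at all.

```python
def fold_cql_identifier(name):
    """Model CQL identifier handling: quoted names keep their case,
    unquoted names are folded to lower case."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1]
    return name.lower()


def resolves(query_identifier, stored_name):
    """True if the identifier written in a query matches the stored name."""
    return fold_cql_identifier(query_identifier) == stored_name


if __name__ == "__main__":
    # Keyspace stored with capitals (assumed to be how it was created).
    print(resolves("SpiderBot", "SpiderBot"))    # unquoted: folded, no match
    print(resolves('"SpiderBot"', "SpiderBot"))  # quoted: case kept, match
    # Keyspace stored in lower case, as in the fix described above.
    print(resolves("SpiderBot", "spiderbot"))    # unquoted: folded, match
```

If that is what happened here, quoting the names in the query ("SpiderBot"."Links") should also have worked, though sticking to lower-case names is the simpler fix.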

Go away cold.

By | Blog | No Comments

Go away, cold weather. I hate you. I'm tired of this wind chill and everything about winter. Hurry up, summer, you are needed!

Got another test tomorrow, hence me not really posting since Friday. I know every one of y'all is missing your cats and music, but it's coming! Just a busy time for me right now.

In other news, my mom has cleaned out the New Room and put almost all the things she moved out of it into my room. I haven't seen that room that clean since we built it. It's weird. I can see the floor.

I also went to the career fair yesterday, and didn't really see any companies looking for a software engineer except for one place called SoftNice Inc. They were looking for .NET programmers and Hadoop people as well. Hmm, that's weird, I just started working on converting my C# SpiderBot project from MySQL to MongoDB and Hadoop. So I got a few pamphlets from them to research them more.

I also learned a while back that one of my buddies now lives in a scary neighborhood. It was funny, because I pulled up to the house for the first time and he came walking out with his 1911 in open carry. He later said that he makes it a point to open carry there when he walks outside. I don't blame him; his neighbors run a drive-in car shop. I kid you not, the entire time I was there random cars would pull into their driveway, and the man who sat outside with his family would jump up and start working on the car. He had guys driving up with loud bass music on, and the ones that didn't were there to get speakers fixed so they could. Remind me not to park on that street.