hampsteadpete
BAM!ID: 102775
Joined: 2011-07-20
Posts: 2
Credits: 55,416,649
World-rank: 17,461

2011-08-22 20:37:10

My "stats" page: http://boincstats.com/stats/boinc_user_graph.php?pr=bo&id=152f5138269efc0f7e13e948d977135c

Did not update today for some reason.

Pete Soderman
[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9461
Credits: 353,172,950
World-rank: 5,063

2011-08-22 21:03:38

It actually is updated, on the database server that is. But for some reason (which I'm trying to find out now), copying the update to the webservers has already taken over five hours instead of the 30 minutes it usually takes.
Please do not PM, IM or email me for support (they will go unread/ignored). Use the forum for support.
hampsteadpete
BAM!ID: 102775
Joined: 2011-07-20
Posts: 2
Credits: 55,416,649
World-rank: 17,461

2011-08-22 21:21:23

Thank you sir! It was the first day I started using GPU & the numbers were really wild compared to normal. I thought that might have had something to do with it. Thanks for your reply.

Pete
[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9461
Credits: 353,172,950
World-rank: 5,063

2011-08-22 21:45:14

@All: the update is coming, just very slow.

The server causing the slowdown is the same server that was dead slow before. I thought I had solved that by replacing some hard drives, and for a couple of weeks it was all working fine, until today.

At the moment there isn't much more I can do than watch it go slow, so I will get some sleep. If things haven't improved by tomorrow, it's back to one webserver again.
magic8192
BAM!ID: 101241
Joined: 2011-06-05
Posts: 2
Credits: 1,453,325,083
World-rank: 1,814

2011-08-23 18:01:56

Thanks for the info.
[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9461
Credits: 353,172,950
World-rank: 5,063

2011-08-23 18:51:09
last modified: 2011-08-23 18:53:33

I wasn't able to solve the problem. And I have no more ideas left. Tomorrow I will stop using the slow server and retire it. It's the second time it's been acting up and I did a complete reinstall and replaced slow drives, so something must be wrong with the hardware (probably the RAID controller). Unfortunately, this means that the website will be hosted from one server again which will not improve speed.

Just in case some Linux guru is out there, I believe this is the problem (larger numbers):

Device:  rrqm/s    wrqm/s     r/s     w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        2.81  14220.57  107.11  205.90  24898.44  115412.99    448.26     12.30  39.28   1.50  46.85
sdb        2.95      0.43  114.91   14.32   7713.79    2192.88     76.66      1.04   8.08   0.45   5.81
sdc        5.38      2.93    1.43    0.07     61.54      23.97     57.15      0.01   6.43   4.06   0.61
scd0       0.00      0.00    0.00    0.00      0.00       0.00      8.00      0.00  28.84  28.84   0.00

This started yesterday. From the logs I can clearly see that disk I/O is the problem: a simple chown command takes over a second to complete. If there were a drive failure in the RAID I could understand it. BTW: the chown commands are slow on sdb as well.
Christos Despotakis
BAM!ID: 69381
Joined: 2009-05-18
Posts: 17
Credits: 767,610,023
World-rank: 2,886

2011-08-23 19:40:03

Well, iostat is a useful tool. I suppose you are fully aware of what each value means, but for everyone else, here they are:

rrqm/s : The number of read requests merged per second that were queued to the hard disk
wrqm/s : The number of write requests merged per second that were queued to the hard disk
r/s : The number of read requests per second
w/s : The number of write requests per second
rsec/s : The number of sectors read from the hard disk per second
wsec/s : The number of sectors written to the hard disk per second
avgrq-sz : The average size (in sectors) of the requests that were issued to the device.
avgqu-sz : The average queue length of the requests that were issued to the device
await : The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
svctm : The average service time (in milliseconds) for I/O requests that were issued to the device
%util : Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

The last three are the most important, so you are right about the I/O problems. But without knowing the actual hardware setup I cannot be sure what the real problem is. What kind of RAID do you use? How many disks? Etc. Try running a SMART test on your disks. (Check here & here for instructions.) I hope it is just a failing HDD rather than the RAID controller.
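
As a rough illustration of the definitions above, here is a small sketch that reads the sda and sdb rows Willy posted and derives two numbers from them: write throughput, and the part of await that is queueing rather than service time. It assumes the column order from his output and 512-byte sectors for wsec/s; the function name summarize_iostat is made up for this example.

```shell
# Summarize `iostat -x` device rows: write throughput, and how much of
# await is time spent waiting in the queue rather than being serviced.
# ASSUMPTIONS: column order as in the output posted in this thread,
# and 512-byte sectors for the rsec/s and wsec/s columns.
summarize_iostat() {
    awk '$1 != "Device:" && NF == 12 {
        write_mib = $7 * 512 / 1048576      # wsec/s -> MiB/s
        queue_ms  = $10 - $11               # await - svctm
        printf "%s write=%.1fMiB/s await=%.2fms svctm=%.2fms queued=%.2fms util=%s%%\n",
               $1, write_mib, $10, $11, queue_ms, $12
    }'
}

# The sda and sdb rows from the thread, fed in as sample input.
summarize_iostat <<'EOF'
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 2.81 14220.57 107.11 205.90 24898.44 115412.99 448.26 12.30 39.28 1.50 46.85
sdb 2.95 0.43 114.91 14.32 7713.79 2192.88 76.66 1.04 8.08 0.45 5.81
EOF
```

Read this way, sda is writing roughly 56 MiB/s, and about 38 of its 39 ms await is queueing time, since svctm is only 1.5 ms. That fits a device with a long write queue, but it does not say which component is responsible.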

[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9461
Credits: 353,172,950
World-rank: 5,063

2011-08-23 20:38:22
last modified: 2011-08-23 20:42:12

It's run off a four-channel LSI hardware RAID controller (if you need the exact type I need to look it up) with an Intel 12x SAS expander.
sda is a RAID10, 4x SEAGATE ST373455SS (SAS, 15k RPM), operating system, website files and swap
sdb is a RAID10, 4x INTEL SSDSA2CW080G3 (SATA, SSD), database
sdc is a single drive for backups, SEAGATE ST3750525AS (SATA 7k2)

The ST373455SS are reused from the "old" database server and were working fine there. The SSDSA2CW080G3 are also used in the other webserver and again working fine there.

The slowness is actually occurring on sdb, which makes me think that every filesystem action goes through / (sda) regardless of the device the action is actually targeted at.
[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9461
Credits: 353,172,950
World-rank: 5,063

2011-08-23 20:52:19

Can't test the SMART status, probably because they are RAID drives, but according to the megacli tool, all drives are fine.
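
For disks behind an LSI MegaRAID controller, smartmontools can usually reach the physical drives through its `-d megaraid,N` device type, where N is the controller's device ID for the disk. The sketch below only prints the commands rather than running them. It assumes the controller is MegaRAID-based (the thread only says "LSI" and mentions megacli) and that IDs 0 to 7 cover the eight RAID members; both are guesses, and the function name is made up.

```shell
# Print (not run) smartctl commands for disks behind a MegaRAID controller.
# ASSUMPTION: device IDs 0-7. The real IDs are the "Device Id" values in
# the controller's physical drive list (e.g. MegaCli -PDList -aALL).
megaraid_smart_cmds() {
    dev=$1      # any block device exported by the controller, e.g. /dev/sda
    n=$2        # first device ID
    last=$3     # last device ID
    while [ "$n" -le "$last" ]; do
        echo "smartctl -a -d megaraid,$n $dev"
        n=$((n + 1))
    done
}

megaraid_smart_cmds /dev/sda 0 7
```

Each printed line can then be run as root, one disk at a time.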
Christos Despotakis
BAM!ID: 69381
Joined: 2009-05-18
Posts: 17
Credits: 767,610,023
World-rank: 2,886

2011-08-23 22:47:22

You say that the swap is on sda, and that the I/O should be occurring on sdb (the database). The only reasonable way for the db to cause such a queue depth on sda is through the swap. And for the swap to be used, there must be either not enough physical memory or a db setting that caches the I/O. Please check your swap usage and see if we get something there.
[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9461
Credits: 353,172,950
World-rank: 5,063

2011-08-24 05:24:45

I was wrong, swap is on sdc.

willy@www3:~# free
             total       used       free     shared    buffers     cached
Mem:      24682360   17409832    7272528          0     209156    4816616
-/+ buffers/cache:   12384060   12298300
Swap:     47850488    3242240   44608248
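
To put that output in proportion, a minimal sketch that works out the share of swap in use from the "Swap:" row (values in KiB, as free prints them by default). The function name swap_used_pct is made up for this example.

```shell
# Percentage of swap in use, read from the "Swap:" row of `free` output.
swap_used_pct() {
    awk '$1 == "Swap:" && $2 > 0 { printf "%.1f\n", $3 * 100 / $2 }'
}

# The output posted above, fed in as sample input.
swap_used_pct <<'EOF'
total used free shared buffers cached
Mem: 24682360 17409832 7272528 0 209156 4816616
-/+ buffers/cache: 12384060 12298300
Swap: 47850488 3242240 44608248
EOF
```

That is a bit under 7% of swap used, with about 11.7 GiB of RAM still free once buffers and cache are discounted, so memory pressure does not look like the obvious cause. Whether the box is actively swapping right now would show in the si/so columns of vmstat rather than in this snapshot.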
Christos Despotakis
BAM!ID: 69381
Joined: 2009-05-18
Posts: 17
Credits: 767,610,023
World-rank: 2,886

2011-08-24 10:34:46

You should check what causes the sda I/O. Using iotop and lsof will give you a better understanding of what is going on.

Use
iotop -o
to pinpoint which process / thread is the culprit.

Then use
lsof -p <PID>
to see which files are opened by the process.

Knowing what is written, where and by whom hopefully will help us narrow down the problem.
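
The two steps above could be combined roughly like this: take a PID from iotop, list its open files with lsof, and count them per block device. This is only a sketch; the function names are made up, it needs lsof installed, and it looks at regular files only.

```shell
# Which device does a given path live on? Uses POSIX `df -P`.
device_of() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $1 }'
}

# Count a process's open regular files per device.
# `lsof -Fn` prints one "n<path>" line per open file; keep absolute paths.
open_files_by_device() {
    lsof -p "$1" -Fn 2>/dev/null | sed -n 's|^n/|/|p' |
    while IFS= read -r path; do
        if [ -f "$path" ]; then
            device_of "$path"
        fi
    done | sort | uniq -c | sort -rn
}

# Usage, with a PID taken from iotop (e.g. `iotop -o -b -n 3` for a
# short batch sample instead of the interactive view):
#   open_files_by_device <PID>
device_of /
```

If the database files all show up under the sdb device while the heavy writes are on sda, that narrows the search to whatever else is writing to the root filesystem.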

Keep up the good work.
[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9461
Credits: 353,172,950
World-rank: 5,063

2011-08-24 10:51:43
last modified: 2011-08-24 10:59:07

I used iotop -o and it shows an ever-changing number of mysql processes (up to 20), Apache drops by every now and then, and the ones that strike me are kjournald and flush-8:0. They run very often and are at 99.99% IO. What I read about them is that they are needed by the OS.

Edit: This all looks exactly the same as the other webserver which has a much slower sda (RAID10 on 7k2 rpm 2.5" notebook disks) and which doesn't suffer from this problem. It does have another RAID controller and doesn't use an expander.
Christos Despotakis
BAM!ID: 69381
Joined: 2009-05-18
Posts: 17
Credits: 767,610,023
World-rank: 2,886

2011-08-24 12:22:26

BOINCstats Willy wrote:
kjournald and flush-8:0. They run very often and are at 99.99% IO. What I read about them is that they are needed by the OS.


kjournald is the journaling daemon of ext3. The journal keeps track of all disk operations, so it is normal to see it during high I/O activity. And yes, it basically means more I/O work for your disk.
flush is the writeback from cache to disk, so more I/O there.

You can tweak them a bit, but I doubt this is your problem.

For kjournald you can either add the commit=num_secs mount option to each ext3 filesystem or change the journal mode to "writeback". num_secs is 5 by default, but setting it to something like 300 or 600 (5 and 10 minutes respectively) should be just fine. I can give you a walkthrough for editing /etc/fstab if you need one. You can also increase the journal size with tune2fs.
To tune flush to 60 seconds (default is 5), add vm.dirty_writeback_centisecs=6000 to /etc/sysctl.conf
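
The two settings use different units, which is easy to trip over: commit= takes seconds, while vm.dirty_writeback_centisecs takes hundredths of a second. A small sketch that prints (but does not apply) both for a chosen interval; the function name print_tuning is made up, and the values are just the ones suggested above.

```shell
# Print (not apply) the two settings discussed above.
# commit= is the ext3 journal commit interval in seconds;
# vm.dirty_writeback_centisecs is in hundredths of a second.
print_tuning() {
    commit_secs=$1
    flush_secs=$2
    echo "fstab mount option: commit=$commit_secs"
    echo "sysctl.conf line: vm.dirty_writeback_centisecs=$((flush_secs * 100))"
}

print_tuning 300 60

# To try the values without a reboot (as root; revert if nothing improves).
# ASSUMPTION: the filesystem accepts commit= on a remount.
#   mount -o remount,commit=300 /
#   sysctl -w vm.dirty_writeback_centisecs=6000
```

Longer intervals mean more unwritten data is lost if the machine crashes, so this is a trade-off rather than a free speed-up.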

Using lsof -p <PID>, check that all db I/O goes to sdb.
txyankee
BAM!ID: 65094
Joined: 2009-02-03
Posts: 56
Credits: 19,602,865
World-rank: 23,823

2011-08-26 10:13:03

My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24/7.
Crystal Pellet
BAM!ID: 64136
Joined: 1970-01-01
Posts: 0
Credits: 0
World-rank: 0

2011-08-26 16:04:44

txyankee wrote:
My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24/7.

knreed - World Community Grid Tech wrote:
Yes. We turned off the periodic scripts that we use to update various things in addition to some of the validators. However, the scripts are running again and the BOINC XML stats have been updated within the past couple of hours.
TPCBF
BAM!ID: 94441
Joined: 2010-12-21
Posts: 190
Credits: 0
World-rank: 0

2011-08-26 18:39:09

Crystal Pellet wrote:
txyankee wrote:
My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24/7.

knreed - World Community Grid Tech wrote:
Yes. We turned off the periodic scripts that we use to update various things in addition to some of the validators. However, the scripts are running again and the BOINC XML stats have been updated within the past couple of hours.
That's all fine and dandy (what you quoted is the response from knreed to my question in the WCG forum), but BOINCStats hasn't run an incremental update for more than 6 hours now:
Last daily update:
2011-08-25 17:33:24 GMT
1 day 01:02:29 ago
Last incremental update:
2011-08-26 11:56:36 GMT
06:39:17 ago
So even if the WCG issue is resolved, it looks like Willy still has more issues with this web server at least...

Ralf
Crystal Pellet
BAM!ID: 64136
Joined: 1970-01-01
Posts: 0
Credits: 0
World-rank: 0

2011-08-26 19:38:00

TPCBF wrote:
Crystal Pellet wrote:
txyankee wrote:
My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24/7.

knreed - World Community Grid Tech wrote:
Yes. We turned off the periodic scripts that we use to update various things in addition to some of the validators. However, the scripts are running again and the BOINC XML stats have been updated within the past couple of hours.
That's all fine and dandy (what you quoted is the response from knreed to my question in the WCG forum), but BOINCStats hasn't run an incremental update for more than 6 hours now:
Last daily update:
2011-08-25 17:33:24 GMT
1 day 01:02:29 ago
Last incremental update:
2011-08-26 11:56:36 GMT
06:39:17 ago
So even if the WCG issue is resolved, it looks like Willy still has more issues with this web server at least...

Ralf

1. If you can read, Ralf: that it is a quote from Kevin is stated directly above the quote: "knreed - World Community Grid Tech wrote"

2. The big blue WCG with all their money and full time employees dedicated to BOINC itself isn't even able to run their own statistics more often than twice a day for users and for teams only once a day, so why complain here, with one volunteer webmaster living on donations?
TPCBF
BAM!ID: 94441
Joined: 2010-12-21
Posts: 190
Credits: 0
World-rank: 0

2011-08-26 20:29:18

Crystal Pellet wrote:
2. The big blue WCG with all their money and full time employees dedicated to BOINC itself isn't even able to run their own statistics more often than twice a day for users and for teams only once a day,
And that is a problem because?
so why complain here, with one volunteer webmaster living on donations?
Well, maybe you should read: I wasn't complaining at all, just stating the fact that the problems with the update are not only with the WCG, which in turn had a problem with their own server the other day...

Ralf
txyankee
BAM!ID: 65094
Joined: 2009-02-03
Posts: 56
Credits: 19,602,865
World-rank: 23,823

2011-08-27 10:27:41

Seems to be up and running again. Thank you for your help and time.

Index :: BOINCstats Bug Report :: Boincstats not updated today