Commit Graph

22500 Commits

Author SHA1 Message Date
Josef Bacik
d82a6f1d7e Btrfs: kill BTRFS_I(inode)->block_group
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull.  In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable.  So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:12 -04:00
Josef Bacik
7e2355ba1a Btrfs: don't look at the extent buffer level 3 times in a row
We have a bit of debugging in btrfs_search_slot to make sure the level of the
cow block is the same as the original block we were cow'ing.  I don't think I've
ever seen this tripped, so kill it.  This saves us 2 kmap's per level in our
search.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:11 -04:00
Josef Bacik
cb25c2ea6a Btrfs: map the node block when looking for readahead targets
If we have particularly full nodes, we could call btrfs_node_blockptr up to 32
times, which is 32 pairs of kmap/kunmap, which _sucks_.  So go ahead and map the
extent buffer while we look for readahead targets.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:10 -04:00
Josef Bacik
af60bed24e Btrfs: set range_start to the right start in count_range_bits
In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results.  For example, if the range

[0-8192]

is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096].  So instead set range_start = max(cur_start, state->start).
This makes everything come out right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:09 -04:00
Josef Bacik
fcb80c2aff Btrfs: fix how we do space reservation for truncate
The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up.  This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen.  This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen.  So in order
to fix this we need to split all of the reservation stuf up.  So with this patch
we have

1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.

2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.

3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.

Hopefully this will make the ceph problem go away.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:08 -04:00
Josef Bacik
a4abeea41a Btrfs: kill trans_mutex
We use trans_mutex for lots of things, here's a basic list

1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists

Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction.  So replace the
trans_mutex with a trans_lock spinlock and use it to do the following

1) Protect fs_info->running_transaction.  All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list.  This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list.  This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff.  This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join.  Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work.  So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.

I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:57 -04:00
Josef Bacik
2a1eb4614d Btrfs: if we've already started a trans handle, use that one
We currently track trans handles in current->journal_info, but we don't actually
use it.  This patch fixes it.  This will cover the case where we have multiple
people starting transactions down the call chain.  This keeps us from having to
allocate a new handle and all of that, we just increase the use count of the
current handle, save the old block_rsv, and return.  I tested this with xfstests
and it worked out fine.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:57 -04:00
Josef Bacik
7a7eaa40a3 Btrfs: take away the num_items argument from btrfs_join_transaction
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :).  So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
Josef Bacik
74b2107543 Btrfs: make sure to use the delalloc reserve when filling delalloc
In the prealloc filling code and compressed code we don't set trans->block_rsv
to the delalloc block reserve properly, which is going to make us use metadata
from the wrong pool, this patch fixes that.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
liubo
8e531cdfeb Btrfs: do not flush csum items of unchanged file data during treelog
The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.

During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data.  Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.

Here I have some test results to show, I use sysbench to do "random write + fsync".

===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags=  [prepare, run]
===

Sysbench args:
  - Number of threads: 1
  - Extra file open flags: 0
  - 2 files, 4Gb each
  - Block size 4Kb
  - Number of random requests for random IO: 10000
  - Read/Write ratio for combined random IO test: 1.50
  - Periodic FSYNC enabled, calling fsync() each 100 requests.
  - Calling fsync() at the end of test, Enabled.
  - Using synchronous I/O mode
  - Doing random write test

Sysbench results:
===
   Operations performed:  0 Read, 10000 Write, 200 Other = 10200 Total
   Read 0b  Written 39.062Mb  Total transferred 39.062Mb
===
a) without patch:  (*SPEED* : 451.01Kb/sec)
   112.75 Requests/sec executed

b) with patch:     (*SPEED* : 4.7533Mb/sec)
   1216.84 Requests/sec executed

PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch,
and I'm wanderring where the problem is and trying to improve it more.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 10:13:16 -04:00
Chris Mason
712673339a Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/ctree.h
	fs/btrfs/volumes.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 06:30:52 -04:00
Chris Mason
aa2dfb372a Merge branch 'allocator' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:36:34 -04:00
Chris Mason
945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason
0d0ca30f18 Btrfs: update the delayed inode code to use the btrfs_ino helper.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:11:22 -04:00
Chris Mason
dcc6d07322 Merge branch 'delayed_inode' into inode_numbers
Conflicts:
	fs/btrfs/inode.c
	fs/btrfs/ioctl.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:07:01 -04:00
Miao Xie
16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Chris Mason
0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
Linus Torvalds
3f80fbff5f Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  configfs: Fix race between configfs_readdir() and configfs_d_iput()
  configfs: Don't try to d_delete() negative dentries.
  ocfs2/dlm: Target node death during resource migration leads to thread spin
  ocfs2: Skip mount recovery for hard-ro mounts
  ocfs2/cluster: Heartbeat mismatch message improved
  ocfs2/cluster: Increase the live threshold for global heartbeat
  ocfs2/dlm: Use negotiated o2dlm protocol version
  ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes.
  ocfs2: Initialize data_ac (might be used uninitialized)
2011-05-18 16:50:28 -07:00
Linus Torvalds
a2b9c1f620 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: don't delay blk_run_queue_async
  scsi: remove performance regression due to async queue run
  blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup
  block: rescan partitions on invalidated devices on -ENOMEDIA too
  cdrom: always check_disk_change() on open
  block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers
2011-05-18 06:49:02 -07:00
Joel Becker
24307aa1e7 configfs: Fix race between configfs_readdir() and configfs_d_iput()
configfs_readdir() will use the existing inode numbers of inodes in the
dcache, but it makes them up for attribute files that aren't currently
instantiated.  There is a race where a closing attribute file can be
tearing down at the same time as configfs_readdir() is trying to get its
inode number.

We want to get the inode number of open attribute files, because they
should match while instantiated.  We can't lock down the transition
where dentry->d_inode is set to NULL, so we just check for NULL there.
We can, however, ensure that an inode we find isn't iput() in
configfs_d_iput() until after we've accessed it.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-18 04:08:16 -07:00
Joel Becker
df7f99670a configfs: Don't try to d_delete() negative dentries.
When configfs is faking mkdir() on its subsystem or default group
objects, it starts by adding a negative dentry.  It then tries to
instantiate the group.  If that should fail, it must clean up after
itself.

I was using d_delete() here, but configfs_attach_group() promises to
return an empty dentry on error.  d_delete() explodes with the entry
dentry.  Let's try d_drop() instead.  The unhashing is what we want for
our dentry.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-18 03:30:58 -07:00
Jeff Layton
11379b5e33 cifs: fix cifsConvertToUCS() for the mapchars case
As Metze pointed out, commit 84cdf74e broke mapchars option:

    Commit "cifs: fix unaligned accesses in cifsConvertToUCS"
    (84cdf74e80) does multiple steps
    in just one commit (moving the function and changing it without
    testing).

    put_unaligned_le16(temp, &target[j]); is never called for any
    codepoint the goes via the 'default' switch statement. As a result
    we put just zero (or maybe uninitialized) bytes into the target
    buffer.

His proposed patch looks correct, but doesn't apply to the current head
of the tree. This patch should also fix it.

Cc: <stable@kernel.org> # .38.x: 581ade4: cifs: clean up various nits in unicode routines (try #2)
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 20:54:04 +00:00
Jeff Layton
221d1d7972 cifs: add fallback in is_path_accessible for old servers
The is_path_accessible check uses a QPathInfo call, which isn't
supported by ancient win9x era servers. Fall back to an older
SMBQueryInfo call if it fails with the magic error codes.

Cc: stable@kernel.org
Reported-and-Tested-by: Sandro Bonazzola <sandro.bonazzola@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 18:51:14 +00:00
Linus Torvalds
eed631e0d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix FS_IOC_SETFLAGS ioctl
  Btrfs: fix FS_IOC_GETFLAGS ioctl
  fs: remove FS_COW_FL
  Btrfs: fix easily get into ENOSPC in mixed case
  Prevent oopsing in posix_acl_valid()
2011-05-15 10:22:10 -07:00
Li Zefan
ebcb904dfe Btrfs: fix FS_IOC_SETFLAGS ioctl
Steps to reproduce the bug:

  - Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL
  - Call FS_IOC_SETFLAGS ioctl with flags=0
  - Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set!

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:28 -04:00
Li Zefan
d0092bdda8 Btrfs: fix FS_IOC_GETFLAGS ioctl
As we've added per file compression/cow support.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:27 -04:00
Li Zefan
e1e8fb6a1f fs: remove FS_COW_FL
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.

The fact is we don't have corresponding BTRFS_INODE_COW flag.

COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.

If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:26 -04:00
liubo
1aba86d67f Btrfs: fix easily get into ENOSPC in mixed case
When a btrfs disk is created by mixed data & metadata option, it will have no
pure data or pure metadata space info.

In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920
(Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at
the very beginning.  The problem is this initialization does not take the mixed
case into account, which will cause btrfs will easily get into ENOSPC in mixed
case.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:26 -04:00
Daniel J Blueman
f5de939149 Prevent oopsing in posix_acl_valid()
If posix_acl_from_xattr() returns an error code, a negative address is
dereferenced causing an oops; fix by checking for error code first.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:18 -04:00
Linus Torvalds
cf70cc5b9d Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFSv4.1: Ensure that layoutget uses the correct gfp modes
  NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_list
  NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REP
2011-05-13 15:19:39 -07:00
Linus Torvalds
26cf46be95 vfs: micro-optimize acl_permission_check()
It's a hot function, and we're better off not mixing types in the mask
calculations.  The compiler just ends up mixing 16-bit and 32-bit
operations, for no good reason.

So do everything in 'unsigned int' rather than mixing 'unsigned int'
masking with a 'umode_t' (16-bit) mode variable.

This, together with the parent commit (47a150edc2: "Cache user_ns in
struct cred") makes acl_permission_check() much nicer.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-13 11:51:01 -07:00
Sunil Mushran
df016c665b ocfs2/dlm: Target node death during resource migration leads to thread spin
During resource migration, if the target node were to die, the thread doing
the migration spins until the target node is not removed from the domain map.
This patch slows the spin by making the thread wait for the recovery to kick in.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:27:30 -07:00
Sunil Mushran
10b3dd7611 ocfs2: Skip mount recovery for hard-ro mounts
Patch skips mount recovery for hard-ro mounts which otherwise leads to an oops.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:27:14 -07:00
Sunil Mushran
33c12a5436 ocfs2/cluster: Heartbeat mismatch message improved
If o2hb finds unexpected values in the heartbeat slot, it prints a message
"ERROR: Device "dm-6": another node is heartbeating in our slot!"

This message could be misleading. This patch adds two more messages to
help users better diagnose the problem.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:27:02 -07:00
Sunil Mushran
76d9fc2954 ocfs2/cluster: Increase the live threshold for global heartbeat
We have seen isolated cases (very few, I might add) of o2hb not detecting all
live nodes on startup. One plausible reasoning for it is that other node had
a hb io delay at the same time. The live threshold set at 2 (as low as it can
be) could be increased to ameliorate the situation.

But increasing the threshold directly affects mount time. Currently it takes
around 5 secs to mount a volume in o2cb cluster with local heartbeat. Increasing
the threshold will make mounts even slower. As the issue itself is rare, we have
left things as they are for the local heartbeat mode.

However we can improve the situation for global heartbeat mode as in that mode,
we start the heartbeat much before the mount.

This patch doubles the live threshold for the start of the first region in
global heartbeat mode.

Addresses internal Oracle bug#10635585.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:48 -07:00
Sunil Mushran
4da6dc2936 ocfs2/dlm: Use negotiated o2dlm protocol version
Patch fixes a bug in the o2dlm protocol negotiation in that it is using
the builtin version rather than the negotiated version during the domain
join. This causes join errors when a node having kernel >= 2.6.37 joins
a cluster with nodes having kernels < 2.6.37.

This only affects the o2cb cluster stack.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reported-by: Jacek Stepniewski <Jacek.Stepniewski@agora.pl>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:37 -07:00
Tristan Ye
9a790ba1ec ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes.
In the case of removing a partial extent record which covers a hole, current
punching-hole logic will try to remove more than the length of whole extent
record, which leads to the failure of following assert(fs/ocfs2/alloc.c):

5507         BUG_ON(cpos < le32_to_cpu(rec->e_cpos) || trunc_range > rec_range);

This patch tries to skip existing hole at the last attempt of removing a partial
extent record, what's more, it also adds some necessary comments for better
understanding of punching-hole codes.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:20 -07:00
Marcus Meissner
5d44670fac ocfs2: Initialize data_ac (might be used uninitialized)
CLANG found that there is a path that has data_ac uninitialized,
this place
	2917	/* This gets us the dx_root */
	2918	ret = ocfs2_reserve_new_metadata_blocks(osb, 1, &meta_ac);
	2919	if (ret) {

	3
		Taking true branch
	2920	mlog_errno(ret);
	2921	goto out;

	4
		Control jumps to line 3168
	2922	}

Goes to the out: label without data_ac being initialized.

Ciao, Marcus

Signed-Off-By: Marcus Meissner <meissner@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:15 -07:00
Arne Jansen
73c5de0051 btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-05-13 15:36:14 +02:00
Arne Jansen
a9c9bf6827 btrfs: heed alloc_start
currently alloc_start is disregarded if the requested
chunk size is bigger than (device size - alloc_start),
but smaller than the device size.
The only situation where I see this could have made sense
was when a chunk equal the size of the device has been
requested. This was possible as the allocator failed to
take alloc_start into account when calculating the request
chunk size. As this gets fixed by this patch, the workaround
is not necessary anymore.
2011-05-13 15:36:12 +02:00
Arne Jansen
bcd53741cc btrfs: move btrfs_cmp_device_free_bytes to super.c
this function won't be used here anymore, so move it super.c where it is
used for df-calculation
2011-05-13 15:36:05 +02:00
David Sterba
4ea028859b btrfs: use unsigned type for single bit bitfield
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-12 18:14:53 +02:00
David Sterba
7a36ddec10 btrfs: use printk_ratelimited instead of printk_ratelimit
As per printk_ratelimit comment, it should not be used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-12 18:08:38 +02:00
Linus Torvalds
6eaed0a438 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix oops in revalidate when called with NULL nameidata
2011-05-12 08:06:53 -07:00
Arne Jansen
8628764e1a btrfs: add readonly flag
setting the readonly flag prevents writes in case an error is detected

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:48:31 +02:00
Ilya Dryomov
96e369208e btrfs scrub: make fixups sync
btrfs scrub - make fixups sync, don't reuse fixup bios

Fixups are already sync for csum failures, this patch makes them sync
for EIO case as well.

Fixups are now sharing pages with the parent sbio - instead of
allocating a separate page to do a fixup we grab the page from the sbio
buffer.

Fixup bios are no longer reused.

struct fixup is no longer needed, instead pass [sbio pointer, index].

Originally this was added to look at the possibility of sharing the code
between drive swap and scrub, but it actually fixes a serious bug in
scrub code where errors that could be corrected were ignored and
reported as uncorrectable.

btrfs scrub - restore bios properly after media errors

The current code reallocates a bio after a media error.  This is a
temporary measure introduced in v3 after a serious problem related to
bio reuse was found in v2 of scrub patchset.

Basically we did not reset bv_offset and bv_len fields of the bio_vec
structure.  They are changed in case I/O error happens, for example, at
offset 512 or 1024 into the page.  Also bi_flags field wasn't properly
setup before reusing the bio.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:48:28 +02:00
Jan Schmidt
475f63874d btrfs: new ioctls for scrub
adds ioctls necessary to start and cancel scrubs, to get current
progress and to get info about devices to be scrubbed.
Note that the scrub is done per-device and that the ioctl only
returns after the scrub for this devices is finished or has been
canceled.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:45:38 +02:00
Arne Jansen
a2de733c78 btrfs: scrub
This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:45:20 +02:00
Trond Myklebust
a75b9df9d3 NFSv4.1: Ensure that layoutget uses the correct gfp modes
Currently, writebacks may end up recursing back into the filesystem due to
GFP_KERNEL direct reclaims in the pnfs subsystem.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11 22:52:13 -04:00
Linus Torvalds
3568bd9720 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: do not use i_wrbuffer_ref as refcount for Fb cap
  ceph: fix list_add in ceph_put_snap_realm
  ceph: print debug message before put mds session
2011-05-11 19:13:34 -07:00
Andy Adamson
2887fe4552 NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_list
Prevents an infinite loop as list was never emptied.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11 14:20:13 -04:00
Andy Adamson
a8a4ae3a89 NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REP
Free the slot and resend the RPC with new session <slot#,seq#>.

For nfs4_async_handle_error, return -EAGAIN and set the task->tk_status to 0
to restart the async rpc in the rpc_restart_call_prepare state which resets
the slot.

For nfs4_handle_exception, retrying a call that uses nfs4_call_sync will
reset the slot via nfs41_call_sync_prepare.

For open/close/lock/locku/delegreturn/layoutcommit/unlink/rename/write
cachethis is true, so these operations will not trigger an
NFS4ERR_RETRY_UNCACHED_REP.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11 14:01:33 -04:00
Henry C Chang
d3d0720d4a ceph: do not use i_wrbuffer_ref as refcount for Fb cap
We increments i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:48 -07:00
Henry C Chang
a26a185d27 ceph: fix list_add in ceph_put_snap_realm
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:36 -07:00
Henry C Chang
7d8e18a69d ceph: print debug message before put mds session
The mds session, s, could be freed during ceph_put_mds_session.
Move dout before ceph_put_mds_session.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:34 -07:00
Linus Torvalds
675badfc48 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix race condition in AIL push trigger
  xfs: make AIL target updates and compares 32bit safe.
  xfs: always push the AIL to the target
  xfs: exit AIL push work correctly when AIL is empty
  xfs: ensure reclaim cursor is reset correctly at end of AG
2011-05-10 11:56:35 -07:00
Miklos Szeredi
d24339059d fuse: fix oops in revalidate when called with NULL nameidata
Some cases (e.g. ecryptfs) can call ->dentry_revalidate with NULL
nameidata.

https://bugzilla.kernel.org/show_bug.cgi?id=34732

Tyler Hicks pointed out that this bug was introduced by commit
e7c0a16786 "fuse: make fuse_dentry_revalidate() RCU aware"

Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-05-10 17:35:58 +02:00
Ryusuke Konishi
349dbc3669 nilfs2: fix infinite loop in nilfs_palloc_freev function
After having applied commit 9954e7af14 ("nilfs2: add free
entries count only if clear bit operation succeeded"), a free routine
of nilfs came to fall into an infinite loop, outputting the same
message endlessly:

 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed ...

That patch broke the routine so that a loop counter is never updated
in an abnormal state.  This fixes the regression.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-05-10 22:19:50 +09:00
Dave Chinner
7ac956576d xfs: fix race condition in AIL push trigger
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One is caused by a
race condition in determining whether there is a psh in progress or
not.

The XFS_AIL_PUSHING_BIT is used to determine whether a push is
currently in progress.  When the AIL push work completes, it checked
whether the target changed and cleared the PUSHING bit to allow a
new push to be requeued. The race condition is as follows:

	Thread 1		push work

	smp_wmb()
				smp_rmb()
				check ailp->xa_target unchanged
	update ailp->xa_target
	test/set PUSHING bit
	does not queue
				clear PUSHING bit
				does not requeue

Now that the push target is updated, new attempts to push the AIL
will not trigger as the push target will be the same, and hence
despite trying to push the AIL we won't ever wake it again.

The fix is to ensure that the AIL push work clears the PUSHING bit
before it checks if the target is unchanged.

As a result, both push triggers operate on the same test/set bit
criteria, so even if we race in the push work and miss the target
update, the thread requesting the push will still set the PUSHING
bit and queue the push work to occur. For safety sake, the same
queue check is done if the push work detects the target change,
though only one of the two will will queue new work due to the use
of test_and_set_bit() checks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit e4d3c4a43b)
2011-05-09 18:35:04 -05:00
Dave Chinner
fe0da76731 xfs: make AIL target updates and compares 32bit safe.
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
noticed was that updates of the push target are not 32 bit safe as
the target is a 64 bit value.

We cannot copy a 64 bit LSN without the possibility of corrupting
the result when racing with another updating thread. We have
function to do this update safely without needing to care about
32/64 bit issues - xfs_trans_ail_copy_lsn() - so use that when
updating the AIL push target.

Also move the reading of the target in the push work inside the AIL
lock, and use XFS_LSN_CMP() for the unlocked comparison during work
termination to close read holes as well.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit fd5670f22f)
2011-05-09 18:35:04 -05:00
Dave Chinner
50e86686df xfs: always push the AIL to the target
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
discovered is a target mismatch between the item pushing loop and
the target itself.

The push trigger checks for the target increasing (i.e. new target >
current) while the push loop only pushes items that have a LSN <
current. As a result, we can get the situation where the push target
is X, the items at the tail of the AIL have LSN X and they don't get
pushed. The push work then completes thinking it is done, and cannot
be restarted until the push target increases to >= X + 1. If the
push target then never increases (because the tail is not moving),
then we never run the push work again and we stall.

Fix it by making sure log items with a LSN that matches the target
exactly are pushed during the loop.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit cb64026b6e)
2011-05-09 18:35:03 -05:00
Dave Chinner
9e7004e741 xfs: exit AIL push work correctly when AIL is empty
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. The main cause is a
regression where a work exit path fails to clear the PUSHING state
and recheck the target correctly.

Make both exit paths do the same PUSHING bit clearing and target
checking when the "no more work to be done" condition is hit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit ea35a20021)
2011-05-09 18:35:03 -05:00
Dave Chinner
228d62dd3f xfs: ensure reclaim cursor is reset correctly at end of AG
On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
without bound and exhausting low memory causing the OOM killer to be
triggered. After some effort, the problem was reproduced on a 32 bit
x86 highmem machine.

The problem is that the per-ag inode reclaim index cursor was not
getting reset to the start of the AG if the radix tree tag lookup
found no more reclaimable inodes. Hence every further reclaim
attempt started at the same index beyond where any reclaimable
inodes lay, and no further background reclaim ever occurred from the
AG.

Without background inode reclaim the VM driven cache shrinker
simply cannot keep up with cache growth, and OOM is the result.

While the change that exposed the problem was the conversion of the
inode reclaim to use work queues for background reclaim, it was not
the cause of the bug. The bug was introduced when the cursor code
was added, just waiting for some weird configuration to strike....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-By: Christian Kujau <lists@nerdbynature.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit b223221956)
2011-05-09 18:35:03 -05:00
Mikulas Patocka
a09a79f668 Don't lock guardpage if the stack is growing up
Linux kernel excludes guard page when performing mlock on a VMA with
down-growing stack. However, some architectures have up-growing stack
and locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
up-growing stack). lvm2 calculates number of used pages when locking and
when unlocking and reports an internal error if the numbers mismatch.

[ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
  grows-up case, and to move things around a bit to clean it all up and
  share the infrstructure with the /proc bits.

  Tested on ia64 that has both grow-up and grow-down segments  - Linus ]

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 16:22:07 -07:00
Linus Torvalds
7f4238a0ef Merge branch 'hpfs'
* hpfs:
  HPFS: Remove unused variable
  HPFS: Move declaration up, so that there are no out-of-scope pointers
  HPFS: Fix some unaligned accesses
  HPFS: Fix endianity. Make hpfs work on big-endian machines
  HPFS: Implement fsync for hpfs
  HPFS: Fix a bug that filesystem was not marked dirty when remounting it
  HPFS: Restrict uid and gid to 16-bit values
  HPFS: When marking or clearing the dirty bit, sync the filesystem
  HPFS: Use types with defined width
  HPFS: Remove mark_inode_dirty
  HPFS: Remove CR/LF conversion option
  HPFS: Remove remaining locks
  HPFS: Introduce a global mutex and lock it on every callback from VFS.
  HPFS: Make HPFS compile on preempt and SMP
2011-05-09 09:07:55 -07:00
Mikulas Patocka
88f4e9e870 HPFS: Remove unused variable
Remove unused variable

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
c351481744 HPFS: Move declaration up, so that there are no out-of-scope pointers
Move declaration up, so that there are no out-of-scope pointers

Reported-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
d0969d1949 HPFS: Fix some unaligned accesses
Fix some unaligned accesses

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
0b69760be6 HPFS: Fix endianity. Make hpfs work on big-endian machines
Fix endianity. Make hpfs work on big-endian machines.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
bc8728ee56 HPFS: Implement fsync for hpfs
Implement fsync for hpfs.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
dab4c82a6e HPFS: Fix a bug that filesystem was not marked dirty when remounting it
Fix a bug that filesystem was not marked dirty when remounting it

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
48f10e8ce7 HPFS: Restrict uid and gid to 16-bit values
Restrict uid and gid to 16-bit values.

HPFS stores only 2 bytes in the EAs.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
f73976818a HPFS: When marking or clearing the dirty bit, sync the filesystem
When marking or clearing the dirty bit, sync the filesystem

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
d878597c2c HPFS: Use types with defined width
Use types with defined width

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
e5d6a7dd5e HPFS: Remove mark_inode_dirty
Remove mark_inode_dirty

HPFS doesn't use kernel's dirty inode indicator anyway because
writing an inode requires directory's mutex.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
0fe105aa29 HPFS: Remove CR/LF conversion option
Remove CR/LF conversion option

It is unused anyway. It was used on 2.2 kernels or so.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
7d23ce36e3 HPFS: Remove remaining locks
Remove remaining locks

Because of a new global per-fs lock, no other locks are needed

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
7dd29d8d86 HPFS: Introduce a global mutex and lock it on every callback from VFS.
Introduce a global mutex and lock it on every callback from VFS.

Performance doesn't matter, reviewing the whole code for locking correctness
would be too complicated, so simply lock it all.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
637b424bf8 HPFS: Make HPFS compile on preempt and SMP
Make HPFS compile on preempt and SMP

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Linus Torvalds
c2bf807eb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: handle errors from coalesce_t2
  cifs: refactor mid finding loop in cifs_demultiplex_thread
  cifs: sanitize length checking in coalesce_t2 (try #3)
  cifs: check for bytes_remaining going to zero in CIFS_SessSetup
  cifs: change bleft in decode_unicode_ssetup back to signed type
2011-05-06 15:32:41 -07:00
Timo Warns
fa039d5f6b Validate size of EFI GUID partition entries.
Otherwise corrupted EFI partition tables can cause total confusion.

Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-06 07:46:37 -07:00
David Sterba
182608c829 btrfs: remove old unused commented out code
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:10 +02:00
David Sterba
f2a97a9dbd btrfs: remove all unused functions
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

  text    data     bss     dec     hex filename
402081    7464     200  409745   64091 btrfs.ko.base
398620    7144     200  405964   631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:03 +02:00
Linus Torvalds
bd355f8ae6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: do not call __mark_dirty_inode under i_lock
  libceph: fix ceph_osdc_alloc_request error checks
  ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
  libceph: fix ceph_msg_new error path
  ceph: use ihold() when i_lock is held
2011-05-04 14:22:20 -07:00
Sage Weil
fca65b4ad7 ceph: do not call __mark_dirty_inode under i_lock
The __mark_dirty_inode helper now takes i_lock as of 250df6ed.  Fix the
one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-04 12:56:45 -07:00
David Sterba
621496f4fd btrfs: remove unused function prototypes
function prototypes without a body

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-04 14:01:26 +02:00
Linus Torvalds
cce2c56e76 logfs: initialize superblock entries earlier
In particular, s_freeing_list needs to be initialized early, since it is
used on some of the error paths when mounts fail.  The mapping inode,
for example, would be initialized and then free'd on an error path
before s_freeing_list was initialized, but the inode drop operation
needs the s_freeing_list to be set up.

Normally you'd never see this, because not only is logfs fairly rare,
but a successful mount will never have any issues.

Reported-by: werner <w.landgraf@ru.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-03 16:10:25 -07:00
Henry C Chang
8c71897be2 ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
We should unlock the page and return -ENOMEM if ceph_osdc_new_request
failed.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:12 -07:00
Sage Weil
3772d26d87 ceph: use ihold() when i_lock is held
See 0444d76ae6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:08 -07:00
Jeff Layton
16541ba11c cifs: handle errors from coalesce_t2
cifs_demultiplex_thread calls coalesce_t2 to try and merge follow-on t2
responses into the original mid buffer. coalesce_t2 however can return
errors, but the caller doesn't handle that situation properly. Fix the
thread to treat such a case as it would a malformed packet. Mark the
mid as being malformed and issue the callback.

Cc: stable@kernel.org
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-03 03:42:15 +00:00
Jeff Layton
146f9f65bd cifs: refactor mid finding loop in cifs_demultiplex_thread
...to reduce the extreme indentation. This should introduce no
behavioral changes.

Cc: stable@kernel.org
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-03 03:42:07 +00:00
Linus Torvalds
adadfe48df Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6
* 'for-linus' of git://git.infradead.org/ubifs-2.6:
  UBIFS: seek journal heads to the latest bud in replay
  UBIFS: do not free write-buffers when in R/O mode
2011-05-02 12:17:29 -07:00
Artem Bityutskiy
52c6e6f990 UBIFS: seek journal heads to the latest bud in replay
This is the second fix of the following symptom:

UBIFS error (pid 34456): could not find an empty LEB

which sometimes happens after power cuts when we mount the file-system - UBIFS
refuses it with the above error message which comes from the
'ubifs_rcvry_gc_commit()' function. I can reproduce this using the integck test
with the UBIFS power cut emulation enabled.

Analysis of the problem.

Currently UBIFS replay seeks the journal heads to the last _replayed_ bud.
But the buds are replayed out-of-order, so the replay basically seeks journal
heads to the "random" bud belonging to this head, and not to the _last_ one.

The result of this is that the GC head may be seeked to a full LEB with no free
space, or very little free space. And 'ubifs_rcvry_gc_commit()' tries to find a
fully or mostly dirty LEB to match the current GC head (because we need to
garbage-collect that dirty LEB at one go, because we do not have @c->gc_lnum).
So 'ubifs_find_dirty_leb()' fails and we fall back to finding an empty LEB and
also fail. As a result - recovery fails and mounting fails.

This patch teaches the replay to initialize the GC heads exactly to the latest
buds, i.e. the buds which have the largest sequence number in corresponding
log reference nodes.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-05-02 19:23:48 +03:00
Artem Bityutskiy
b50b9f4085 UBIFS: do not free write-buffers when in R/O mode
Currently UBIFS has a small optimization - it frees write-buffers when it is
re-mounted from R/W mode to R/O mode. Of course, when it is mounted R/O, it
does not allocate write-buffers as well.

This optimization is nice but it leads to subtle problems and complications
in recovery, which I can reproduce using the integck test. The symptoms are
that after a power cut the file-system cannot be mounted if we first mount
it R/O, and then re-mount R/W - 'ubifs_rcvry_gc_commit()' prints:

UBIFS error (pid 34456): could not find an empty LEB

Analysis of the  problem.

When mounting R/W, the reply process sets journal heads to buds [1], but
when mounting R/O - it does not do this, because the write-buffers are not
allocated. So 'ubifs_rcvry_gc_commit()' works completely differently for the
same file-system but for the following 2 cases:

1. mounting R/W after a power cut and recover
2. mounting R/O after a power cut, re-mounting R/W and run deferred recovery

In the former case, we have journal heads seeked to the a bud, in the latter
case, they are non-seeked (wbuf->lnum == -1). So in the latter case we do not
try to recover the GC LEB by garbage-collecting to the GC head, but we just
try to find an empty LEB, and there may be no empty LEBs, so we just fail.
On the other hand, in the former case (mount R/W), we are able to make a GC LEB
(@c->gc_lnum) by garbage-collecting.

Thus, let's remove this small nice optimization and always allocate
write-buffers. This should not make too big difference - we have only 3
of them, each of max. write unit size, which is usually 2KiB. So this is
about 6KiB of RAM for the typical case, and only when mounted R/O.

[1]: Note, currently the replay process is setting (seeking) the journal heads
to _some_ buds, not necessarily to the buds which had been the journal heads
before the power cut happened. This will be fixed separately.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-05-02 19:23:36 +03:00
David Sterba
8cc33e5c19 btrfs: Document a mutex lock/unlock sequence 2011-05-02 15:29:25 +02:00
David Sterba
b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
ba14419264 btrfs: drop gfp parameter from alloc_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
f09d1f60e6 btrfs: drop gfp parameter from find_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
172ddd60a6 btrfs: drop gfp parameter from alloc_extent_map
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
a8067e022a btrfs: drop unused parameter from extent_map_tree_init
the GFP flags are not stored anywhere and all allocations are done via
alloc_extent_map(GFP_NOFS).

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00