In a corner case, a crash during OTA merge can cause some
COW operations to be skipped by the merge, leaving
some blocks with stale data.
The fix is to avoid any re-ordering of COW operations:
merge them in the order in which they appear in the COW file.
New tests have been added to cow_snapuserd.
Bug: 194955361
Test: cow_snapuserd_test, Incremental OTA
Signed-off-by: Akilesh Kailash <akailash@google.com>
Merged-In: Id895fe7a3d6b4510676490a86d0caf62dec9b079
Change-Id: I14900b9537c4deb7824547e1dfe80f15274bdda4
Ignore-AOSP-First: manual merge from aosp
If for some reason the COW state is not fully synced to disk, but
dm-snapshot has flushed its pending merges, we do not want to delete
snapshots. Doing so could potentially leave blocks unmerged.
This situation is quite unexpected so we label it as a merge failure.
The device can recover by completely syncing the COW state, and then
rebooting, which will attempt to make forward progress on the merge.
Bug: 190582627
Test: vts_libsnapshot_test
full OTA on bramble
incremental OTA on bramble
Change-Id: Ib887f1d9e4397a712ed2f800cc1222cf9305a039
Merged-In: Ib887f1d9e4397a712ed2f800cc1222cf9305a039
adb_debug.prop is migrated too, and ramdisk_available is added to all
dependencies.
Bug: 187196593
Test: boot
Change-Id: I59cd149e0021211b8fd59c44b93bbf18dc8637bf
Merged-In: I59cd149e0021211b8fd59c44b93bbf18dc8637bf
The header response from the daemon is sent only once, at the
beginning of the payload. This handles a corner case
in which an unaligned IO request is broken into
multiple chunks.
Additionally, add error checks in the IO path
to validate sector information when read
requests are merged.
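The "header once" rule can be sketched as follows. This is only an illustration: `Chunk` and `BuildResponse` are hypothetical names, and the real dm-user message layout is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical unit of output: one write from the daemon back to the kernel.
struct Chunk {
    bool has_header;            // true only for the first chunk of a response
    std::vector<uint8_t> data;  // payload bytes carried by this chunk
};

// Split `payload` into chunks of at most `max_chunk` bytes.
// The header accompanies the first chunk only; later chunks are payload-only.
std::vector<Chunk> BuildResponse(const std::vector<uint8_t>& payload, size_t max_chunk) {
    std::vector<Chunk> chunks;
    if (max_chunk == 0) return chunks;
    size_t offset = 0;
    do {
        size_t len = std::min(max_chunk, payload.size() - offset);
        auto first = payload.begin() + static_cast<std::ptrdiff_t>(offset);
        Chunk c;
        c.has_header = (offset == 0);
        c.data.assign(first, first + static_cast<std::ptrdiff_t>(len));
        chunks.push_back(std::move(c));
        offset += len;
    } while (offset < payload.size());
    return chunks;
}
```

With a 10-byte payload and a 4-byte chunk limit, this yields three chunks, and only the first one is marked as carrying the header.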
Bug: 188361387
Test: Simulate an IO request to forcefully break the IO
response into multiple chunks and verify the correctness.
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: I1f4fa7a79c60493f4bbed3ad49e257098b930beb
Merged-In: I1f4fa7a79c60493f4bbed3ad49e257098b930beb
This refactoring moves ImageManager creation out of SnapshotManager,
where it had always been rather awkward, and into IDeviceInfo so it can
participate in dependency injection.
The "first-stage init" state is now part of IDeviceInfo as well.
Bug: 187396878
Test: vts_libsnapshot_test
libsnapshot_fuzzer_test
Change-Id: Ic4b3e93527cc074ec18c0c26440dbdb2504d0d5e
A failed assertion in the daemon terminates the daemon.
This can bring the entire device to a halt, because the partitions
are mounted off snapshot devices and IOs will hang in the
"D" (uninterruptible) state, eventually leading to a sysrq crash.
Convert the relevant assertions into appropriate error codes and
propagate the error code back to dm-user.
The IO will eventually fail, but this should not impact the overall usage of the device.
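A minimal sketch of the pattern, assuming a hypothetical sector-range check (`ValidateRequest` and `HandleRead` are illustrative names, not functions from snapuserd): the check returns a negative errno value instead of aborting, and the caller passes it upward.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical check that used to be an assertion. Returns 0 on success or a
// negative errno value, so a bad request fails one IO instead of the process.
int ValidateRequest(uint64_t sector, uint64_t num_sectors, uint64_t device_sectors) {
    if (num_sectors == 0) return -EINVAL;
    if (sector >= device_sectors || num_sectors > device_sectors - sector) return -EIO;
    return 0;
}

// Hypothetical IO handler: the error code is handed back to the caller
// (dm-user in the real daemon) rather than terminating the daemon.
int HandleRead(uint64_t sector, uint64_t num_sectors, uint64_t device_sectors) {
    int rc = ValidateRequest(sector, num_sectors, device_sectors);
    if (rc != 0) return rc;  // propagate instead of crashing
    // ... service the read here ...
    return 0;
}
```

In this sketch, an out-of-range request such as `HandleRead(96, 8, 100)` fails with `-EIO` while the process keeps running.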
Bug: 187903835
Test: vts_libsnapshot, cow_snapuserd_test, full OTA
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Iaf093898837d2aff5703ea1e615aecf7c1e53a8f
As part of r.android.com/c/1678745/7, overlapping
copy operations were allowed to be batch merged, which
is incorrect. The intention of that CL was
to avoid the unnecessary write traffic involved
in flushing data to scratch space. However,
as part of the optimization, copy operations
were merged as well. More specifically, the problem
arises when the number of overlapping
copy operations exceeds the read-ahead
window size (2MB).
A new test case has been added to exercise this
specific code path and avoid future regressions.
Additionally, remove the unnecessary "send()"
in the "detach" response from the snapuserd server.
The client does not wait for any response, so the
send only creates a race window, which is harmless
but produces a misleading error log.
Bug: 187506548
Test: cow_snapuserd which tests the similar
configuration as seen in the COW file in the bug
report.
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Ic0f1ddd390f79966aabfbeadb7d64bc5bb86e83b
These are unit tests for important VAB functionality. They are very
quick to run and should be included in presubmit testing.
Bug: N/A
Test: th green
Change-Id: I02e24c76df365e9fb2d68f904e930dce60b9bdaf
We try to clean up previous test runs, but this can crash since we
haven't opened fake-super yet. Refactor the harness so we always open
fake-super if it exists. If it does, we'll delete and recreate it after
cleanup. If it doesn't, we'll create it immediately.
It's still possible that cleanup can fail: If interrupted during a merge,
libsnapshot does not allow cleanup until the merge completes. The test
harness doesn't bother handling this case yet.
Bug: 187151854
Test: vts_libsnapshot_test, ctrl+c, run again
Change-Id: I58a7094336a391cff493a31e4f80d8c8b1b166f8
This enables read-ahead functionality by providing
scratch space in the COW.
Bug: 183863613
Test: OTA tests with new COW format
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: I7988687c81d0ea239e71695818199db4653ddb80
kCowVersionManifest will be 2. This should now
be in sync with kCowVersionMajor.
Bug: 183863613
Test: OTA with new COW format (by enabling scratch space option)
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Ia6c31e399de723ee83459b59d6b076b48f5c88d5
* changes:
libsnapshot:snapuserd:Add unit test for read-ahead code path.
libsnapshot: Flush data to scratch space only for overlapping regions
libsnapshot:snapuserd: read-ahead COW copy ops
libsnapshot: Retrieve COW version from update engine manifest
libsnapshot:snapuserd: Add 2MB scratch space in COW file
libsnapshot:snapuserd: mmap + msync header after merge
When the read-ahead thread caches data from the base device, flush the data
only if there are overlapping regions. If there is a crash, the subsequent
reboot will not recover the data from scratch space. Rather, the data
will be re-constructed from the base device.
Additionally, allow batch merge of blocks by the kernel even for
overlapping regions, given that the read-ahead thread
takes care of the overlapping blocks.
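One way to picture the "flush only for overlapping regions" decision is sketched below. The definition of overlap used here (some block in the batch is both a copy source and a copy destination) and the names `CopyOp` and `HasOverlap` are assumptions made for illustration; the daemon's actual check may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified copy operation: copy one block from `source`
// on the base device to `new_block`.
struct CopyOp {
    uint64_t new_block;
    uint64_t source;
};

// Returns true if any block in the batch is both read and written, i.e. the
// source data could be destroyed by merging another op in the same batch.
bool HasOverlap(const std::vector<CopyOp>& batch) {
    std::unordered_set<uint64_t> destinations;
    for (const auto& op : batch) destinations.insert(op.new_block);
    for (const auto& op : batch) {
        if (destinations.count(op.source) != 0) return true;
    }
    return false;
}
```

Under this assumption, a batch such as `{2 <- 1, 3 <- 2}` would need its cached data flushed to scratch space, while `{10 <- 1, 11 <- 2}` could simply be re-read from the base device after a crash.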
Bug: 183863613
Test: 1: Incremental OTA from build 7284758 to 7288239. Merge time
reduces from ~6 minutes to ~2.5 minutes
2: Reboot and crash kernel multiple times when merge was in
progress
3: Verify read-ahead thread re-constructs the data for overlapping
region.
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: I50e0d828f4fb36a23f0ca13b07a73229ba68874d
Introduce a read-ahead mechanism for COW copy ops.
1: The read-ahead thread reads from the base device
and stores the data in scratch space along with the metadata.
2: During merge, worker threads retrieve the data
from the read-ahead cache.
3: A fixed set of blocks is read during each cycle by the read-ahead
thread.
4: When the last block in the region is merged, the read-ahead thread
makes forward progress.
Scratch space is set to 2MB and is only used for COW copy operations.
We can extend this to Replace Ops based on performance evaluation.
Performance:
As mentioned in bug 181883791, Incremental OTA of size 55M with
235K copy operations where every block is moved by 4k:
Without read-ahead: 40 Minutes for merge completion
With read-ahead: 21 Minutes for merge completion
Bug: 183863613
Test: 1: Full OTA - no regression observed.
2: Incremental OTA - with older COW format. Daemon will just skip
the read-ahead feature for older COW format.
3: Incremental OTA - with new COW format.
4: Reboot and crash kernel multiple times when incremental OTA is in-flight.
Verify post reboot, read-ahead thread re-constructs the data from scratch
space.
5: No regression observed in RSS-Anon memory usage when merge in-flight.
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Ic565bfbee3e9fcfc94af694596dbf44c0877639f
update_metadata.proto will have the COW version. Retrieve
that from the manifest and compare it with the COW library.
If the versions don't match, disable VABC.
The primary use case of this is during downgrade tests
in pre-submit. Whenever the COW format changes,
we may have to disable VABC for that specific transition
build. At a high level, the flow of the version check will be:
1: Create an initial COW version of 1 in the manifest (update_metadata.proto)
2: The latest COW version of libsnapshot is 2
3: libsnapshot will return VABC disabled
4: Check in the CL and the changes to the manifest
5: Once the CL is baked in and the build is green, bump up the COW version to 2 in the manifest
6: In the next set of tests, since both versions match, libsnapshot will enable VABC
7: Downgrade should be done to the build which was checked in at (5)
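The comparison at the heart of this flow can be sketched as follows. `kLibraryCowVersion` and `IsVabcEnabled` are hypothetical names chosen for the example; only the rule "mismatch disables VABC" comes from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant standing in for the COW version built into libsnapshot.
constexpr uint32_t kLibraryCowVersion = 2;

// VABC stays enabled only when the manifest's COW version matches the library's.
bool IsVabcEnabled(uint32_t manifest_cow_version) {
    return manifest_cow_version == kLibraryCowVersion;
}
```

In terms of the steps above, a manifest at version 1 (steps 1 to 4) yields `false`, and a manifest bumped to version 2 (steps 5 and 6) yields `true`.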
Bug: 183863613
Test: Apply OTA and verify if VABC is disabled if the versions don't
match
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Id55f33a90bb31b417e72f4fbe370daf05a68f05a
Add 2MB scratch space in the COW file. This is a preparation
patch for the read-ahead patch. It just adds the buffer
space right after the header. Bump up the version number
in the header in order to distinguish between older and newer
COW formats.
No operation is done on this buffer with this patch-set.
The scratch space option is disabled by default.
Bug: 183863613
Test: 1: Create Full OTA with the new COW format.
2: Incremental OTA with older COW format.
3: vts_libsnapshot_test
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: I42a535a48ec22adb893dfe6f86a4f51650e1f88a
mmap the CowHeader and use msync to flush only the
first 4k page after merge is complete.
This cuts down ~30 seconds of merge completion time
on a 55M incremental OTA with 235k copy operations.
Although this isn't a significant gain, this patch
creates scaffolding for the next set of read-ahead patches.
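A minimal sketch of the mmap + msync approach on Linux. `DemoHeader` and `UpdateMergedOps` are hypothetical; the real CowHeader has a different layout, and this sketch assumes the file is at least 4 KiB long.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical header layout for illustration only.
struct DemoHeader {
    uint64_t magic;
    uint64_t num_merge_ops;
};

constexpr size_t kHeaderPageSize = 4096;

// Map the first 4 KiB of `path`, update the merge counter in place, and
// msync only that page. Returns true on success.
// Assumes the file already exists and is at least kHeaderPageSize bytes long.
bool UpdateMergedOps(const char* path, uint64_t merged_ops) {
    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) return false;
    void* addr = mmap(nullptr, kHeaderPageSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;

    auto* header = static_cast<DemoHeader*>(addr);
    header->num_merge_ops = merged_ops;

    // Flush just the header page rather than syncing the whole file.
    bool ok = msync(addr, kHeaderPageSize, MS_SYNC) == 0;
    munmap(addr, kHeaderPageSize);
    return ok;
}
```

The design point is that the header update becomes a memory write followed by a single-page flush, instead of a write() followed by a sync of the whole file.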
Bug: 183863613
Test: Incremental and Full OTA
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: I15bfec91ea1d5bdf4390670bcf406e1015b79299
Adds the -b option to show the bad data block that failed to decompress.
If the block is large enough, display the front as though it were a
CowOperation, as this is the most likely culprit.
Change-Id: I287f13e0794a1ca9d647d4b1099ab238a6202b23
Bug: 183985866
Test: inspect_cow -db <COW_FILE>
If one end of the communication socket is closed for some reason, there
is no need to terminate the daemon or the client. Suppress
SIGPIPE by using the MSG_NOSIGNAL flag; we will
still get an EPIPE error, but the process will not be terminated.
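The behaviour can be demonstrated on Linux with a short sketch. `SendNoSignal` is a hypothetical helper, not the function used in snapuserd; `send()` and `MSG_NOSIGNAL` are the standard socket API.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>

// Send `len` bytes on `fd` without raising SIGPIPE if the peer has gone away.
// Returns 0 on success, or the errno value (for example EPIPE) on failure.
// A short write is treated as success in this sketch.
int SendNoSignal(int fd, const void* buf, size_t len) {
    ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
    if (n < 0) return errno;
    return 0;
}
```

Calling this on one end of a `socketpair()` after the other end has been closed returns `EPIPE` to the caller, and the process keeps running because no SIGPIPE is delivered.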
Bug: 186213024
Test: Full OTA
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Iaa53545c0c4059618f6b49afb9ec24ea5372c7e0
When appending, if the cluster should end after the given label, ensure
that it does.
Bug: 183985866
Test: cow_api_test#ResumeEndCluster
Change-Id: Ie93d09b3431755d0b9b92761619d55df7f9f6151
When opening in append mode, we could write less than what was present
before. This could result in data blocks referencing beyond the end of
the file, or partially written ops. Zeroing these out will prevent
invalid leftovers from potentially causing confusion.
Bug: 183985866
Test: cow_api_test
Change-Id: I56f0218f3ea5b83c0614d1b86e81a4ca885f5c5e
When opening in append mode, we ftruncate() the COW. This has three side
effects:
(1) If the COW is never modified, or Finalized(), the state of the COW
will have changed. Ideally it should only change on an explicit
write operation.
(2) Data after the current cluster will be accidentally thrown away.
(3) The ending "cluster" op will be thrown away if the current cluster
was incomplete, and thus the last valid label could be invalidated.
Bug: 183985866
Test: cow_api_test
Change-Id: I3c9a38553b7492a3d6e71d177d75ddb1b6490dfe
Example log line:
update_engine: Block device was lazily unmounted and is still in-use:
/dev/block/dm-28; possibly open file descriptor or attached loop device.
This will help diagnose bugs such as b/184715543 in the future.
Bug: N/A
Test: manual test
Change-Id: Ia6b17fe9bd1796d59be7fc0b355218509acfd4af
When all threads are terminated, the dm-user handlers are removed
from the list. When the last handler is removed, the daemon is
shut down gracefully.
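A simplified sketch of the "last handler out shuts down the daemon" rule. `HandlerList` is a hypothetical class that tracks handlers by name and is not thread-safe; the real server would need locking around the list.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical registry of per-device handlers, identified by name.
class HandlerList {
  public:
    void Add(const std::string& name) { handlers_.push_back(name); }

    // Remove a handler once all of its threads have exited. Returns true if
    // no handlers remain, meaning the caller should now shut the daemon down.
    bool Remove(const std::string& name) {
        auto it = std::find(handlers_.begin(), handlers_.end(), name);
        if (it == handlers_.end()) return false;
        handlers_.erase(it);
        return handlers_.empty();
    }

  private:
    std::vector<std::string> handlers_;
};
```

Removing an unknown name returns false here, so a stray removal cannot trigger a shutdown on its own.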
Bug: 183652708
Test: 1: Apply full OTA and verify daemon is terminated; reapply the OTA
to verify daemon is restarted again.
2: vts_libsnapshot_test
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Ibd41223fc0eba884993a533fcc95661f72805db2
When worker threads were introduced, snapuserd was converted to a
shared pointer. Earlier, memory was forcefully released
by setting snapuserd to nullptr, which worked because it
was a unique pointer. Now, every worker thread holds
a reference. Clear the vector once all the worker
threads are terminated.
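The ownership issue can be reproduced with a small sketch. `Snapuserd` and `Worker` here are empty placeholders rather than the real classes; the point is only how `std::shared_ptr` reference counts behave.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Snapuserd {};  // hypothetical placeholder for the per-device state

// Each worker keeps the shared state alive through its own reference.
struct Worker {
    std::shared_ptr<Snapuserd> snapuserd;
};

// Returns true if the state survived resetting the owner's pointer but was
// destroyed once the worker vector was cleared.
bool ReleaseAfterWorkersExit() {
    auto snapuserd = std::make_shared<Snapuserd>();
    std::weak_ptr<Snapuserd> observer = snapuserd;  // watches without owning

    std::vector<Worker> workers(4, Worker{snapuserd});

    snapuserd = nullptr;  // not enough: the workers still hold references
    bool alive_with_workers = !observer.expired();

    workers.clear();  // drops the last references and frees the state
    return alive_with_workers && observer.expired();
}
```

Resetting the original pointer alone leaves the object alive, which is why the vector of workers has to be cleared as well.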
Test: Apply OTA and verify memory is released after OTA is applied
Bug: 183652708
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: I256d26d98b02ad599aff49b92192226546c59b17
If somehow we wind up with snapshots with a source suffix, we could wind
up trying to unmap an in-use partition. Detect this case and allow the
snapshot to be deleted without the unmap.
Bug: 183567503
Test: vts_libsnapshot_test
Change-Id: I87dd5bb3a7b9be59dede624924374ccc47b563c2
Use a sorted std::vector instead of std::map to store
the mapping from chunk ID to COW operation.
Additionally, use shrink_to_fit to cut down vector
capacity once the COW operations are stored.
On a full OTA of 1.8G, Anon RSS usage is
reduced from 120MB to 68MB. No variance observed
when merge was in progress.
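A minimal sketch of the data structure change, assuming a hypothetical `CowOp` placeholder and helper names (`Finalize`, `Find`) that are not taken from the source tree:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct CowOp {  // hypothetical placeholder for a COW operation
    uint64_t new_block;
};

using ChunkEntry = std::pair<uint64_t, const CowOp*>;  // chunk ID -> operation

// Sort once after all entries have been added, then ask the vector to
// release unused capacity (shrink_to_fit is a non-binding request).
void Finalize(std::vector<ChunkEntry>& chunks) {
    std::sort(chunks.begin(), chunks.end(),
              [](const ChunkEntry& a, const ChunkEntry& b) { return a.first < b.first; });
    chunks.shrink_to_fit();
}

// Binary search by chunk ID; returns nullptr if the chunk is not present.
const CowOp* Find(const std::vector<ChunkEntry>& chunks, uint64_t chunk_id) {
    auto it = std::lower_bound(chunks.begin(), chunks.end(), chunk_id,
                               [](const ChunkEntry& e, uint64_t id) { return e.first < id; });
    if (it == chunks.end() || it->first != chunk_id) return nullptr;
    return it->second;
}
```

Lookups stay O(log n) as with std::map, but the entries sit in one contiguous allocation without per-node overhead, which is a plausible source of the memory saving reported above.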
Bug: 182960300
Test: Full and Incremental OTA - verified memory usage
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: I50cacbe0d03837a830dedcf9bd0ac9663fc68fa7
Add worker threads per partition to serve the IO request.
Remove memset of buffer in IO path which was impacting
4k IO performance.
update_verifier performance:
1: ~10-12 seconds with this change (both on full OTA and incremental
OTA); ~70 seconds observed without this changeset
2: ~8 seconds without the daemon once merge is completed
and snapshot devices are removed.
Bug: 181293939
Test: update_verifier, full OTA, incremental OTA
Signed-off-by: Akilesh Kailash <akailash@google.com>
Change-Id: Id90887f3f4a664ee5d39433715d1c166acbd6c60