How to find and delete duplicate video (.mp4) files? [closed]

I have 16,000+ short video clips, and many of them are exactly alike to the human eye; examine them very closely, though, and you will find that one or the other has up to an extra second (often far less) at the beginning or the end.



I have already tried several methods, with zero success in finding duplicates. You would think that comparing exact byte sizes would be good enough. It isn't: the few extra (or missing) milliseconds at the beginning or end of a clip make the files differ at the byte level, so any duplicate finder that relies on byte-for-byte comparison reports no duplicates at all.
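
For illustration, here is a minimal sketch of that kind of byte-for-byte duplicate finder (Python; the clips/ directory is hypothetical). It groups files by an exact MD5 checksum, so two clips that differ by even a few milliseconds of footage land in different groups and nothing is reported:

import hashlib
from collections import defaultdict
from pathlib import Path

# Group files by an exact checksum of their raw bytes; only groups with
# more than one member are byte-identical duplicates.
groups = defaultdict(list)
for clip in Path('clips').glob('*.mp4'):  # hypothetical directory
    digest = hashlib.md5(clip.read_bytes()).hexdigest()
    groups[digest].append(clip.name)

for digest, names in groups.items():
    if len(names) > 1:
        print('byte-identical:', names)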



And although the majority of each such clip is exactly like several others, nothing I use finds any duplicates, all because of those few milliseconds of difference at the beginning or end of the compared .mp4 files.



Does anyone know how I might succeed in finding duplicates among these short .mp4 clips? On average they are about 30 seconds each, and the difference between two near-duplicates is only a few milliseconds. To the human eye they are exactly the same, so I am clearly seeing duplicates, but I don't want to watch and compare 16,000+ video clips on my own.



Any suggestions?





I found a great working answer to my question. Can you allow me to answer it?



... seems I can't when it's put on hold ...










linux files find video file-comparison

asked Feb 12 at 3:33, edited Feb 17 at 3:47
Anonymous User

closed as too broad by Rui F Ribeiro, Evan Carroll, jimmij, X Tian, Michael Homer Feb 13 at 23:20

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.

  • Kindly let us know: do you want to delete only videos whose size and name are the same?

    – Praveen Kumar BS
    Feb 12 at 4:25











    Generating predictable thumbnails might be a decent idea: video.stackexchange.com/a/5315/14767

    – Haxiel
    Feb 12 at 5:28
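
    (A minimal sketch of that thumbnail idea, assuming ffmpeg is on the PATH and a hypothetical clips/ directory: grab one frame at a fixed offset from each clip, then compare perceptual hashes of the thumbnails. Clips padded by an extra second at the start may need a coarser hash or several sample points.)

    import subprocess
    from pathlib import Path
    from PIL import Image
    from imagehash import average_hash

    hashes = {}
    for clip in Path('clips').glob('*.mp4'):
        thumb = clip.with_suffix('.png')
        # Extract a single frame 5 seconds in as a predictable thumbnail.
        subprocess.run(['ffmpeg', '-y', '-ss', '5', '-i', str(clip),
                        '-frames:v', '1', str(thumb)], check=True)
        h = str(average_hash(Image.open(thumb)))
        hashes.setdefault(h, []).append(clip.name)

    for h, names in hashes.items():
        if len(names) > 1:
            print('possible duplicates:', names)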











  • You haven't defined "exactly alike to the human eye" at all. Does that mean same clarity? Same starting frame?

    – Evan Carroll
    Feb 12 at 7:03











  • @EvanCarroll No. It means that they are the same for the human eye.

    – peterh
    Feb 17 at 9:33

1 Answer
I had the same problem, so I wrote a program myself.



The catch was that my videos were in various formats and resolutions, so I needed to take a hash of each video frame and compare those.



https://github.com/gklc811/duplicate_video_finder



You can just change the directories at the top and you are good to go.



from os import path, walk, makedirs, rename
from time import perf_counter  # time.clock() was removed in Python 3.8
from imagehash import average_hash
from PIL import Image
from cv2 import VideoCapture, CAP_PROP_FRAME_COUNT, CAP_PROP_FRAME_WIDTH, CAP_PROP_FRAME_HEIGHT, CAP_PROP_FPS
from json import dump, load
from multiprocessing import Pool, cpu_count

# The backslashes in these paths were mangled in the original post;
# escaped backslashes are used here so the strings are valid Python.
input_vid_dir = 'C:\\Users\\gokul\\Documents\\data\\'
json_dir = 'C:\\Users\\gokul\\Documents\\db\\'
analyzed_dir = 'C:\\Users\\gokul\\Documents\\analyzed\\'
duplicate_dir = 'C:\\Users\\gokul\\Documents\\duplicate\\'

if not path.exists(json_dir):
    makedirs(json_dir)

if not path.exists(analyzed_dir):
    makedirs(analyzed_dir)

if not path.exists(duplicate_dir):
    makedirs(duplicate_dir)


def write_to_json(filename, data):
    file_full_path = json_dir + filename + ".json"
    with open(file_full_path, 'w') as file_pointer:
        dump(data, file_pointer)


def video_to_json(filename):
    # Hash every frame of one video and dump the result to a JSON file.
    file_full_path = input_vid_dir + filename
    start = perf_counter()
    size = round(path.getsize(file_full_path) / 1024 / 1024, 2)  # size in MB
    video_pointer = VideoCapture(file_full_path)
    frame_count = int(video_pointer.get(CAP_PROP_FRAME_COUNT))
    width = int(video_pointer.get(CAP_PROP_FRAME_WIDTH))
    height = int(video_pointer.get(CAP_PROP_FRAME_HEIGHT))
    fps = int(video_pointer.get(CAP_PROP_FPS))
    success, image = video_pointer.read()
    video_hash = {}
    while success:
        # Perceptual (average) hash: visually identical frames get the same key.
        frame_hash = average_hash(Image.fromarray(image))
        video_hash[str(frame_hash)] = filename
        success, image = video_pointer.read()
    video_pointer.release()
    stop = perf_counter()
    time_taken = stop - start
    print("Time taken for ", file_full_path, " is : ", time_taken)
    data_dict = dict()
    data_dict['size'] = size
    data_dict['time_taken'] = time_taken
    data_dict['fps'] = fps
    data_dict['height'] = height
    data_dict['width'] = width
    data_dict['frame_count'] = frame_count
    data_dict['filename'] = filename
    data_dict['video_hash'] = video_hash
    write_to_json(filename, data_dict)


def multiprocess_video_to_json():
    # Hash all videos in input_vid_dir, one worker process per CPU core.
    files = next(walk(input_vid_dir))[2]
    processes = cpu_count()
    print(processes)
    pool = Pool(processes)
    start = perf_counter()
    pool.map_async(video_to_json, files)
    pool.close()
    pool.join()
    stop = perf_counter()
    print("Time Taken : ", stop - start)


def key_with_max_val(d):
    # Return the key with the largest value (the most-matched earlier file).
    max_value = 0
    required_key = ""
    for k in d:
        if d[k] > max_value:
            max_value = d[k]
            required_key = k
    return required_key


def duplicate_analyzer():
    # Compare frame hashes across videos and move likely duplicates aside.
    files = next(walk(json_dir))[2]
    data_dict = {}
    for file in files:
        filename = json_dir + file
        with open(filename) as f:
            data = load(f)
        video_hash = data['video_hash']
        count = 0
        duplicate_file_dict = dict()
        for key in video_hash:
            count += 1
            if key in data_dict:
                # Frame hash already seen: count a hit against the earlier file.
                if data_dict[key] in duplicate_file_dict:
                    duplicate_file_dict[data_dict[key]] += 1
                else:
                    duplicate_file_dict[data_dict[key]] = 1
            else:
                data_dict[key] = video_hash[key]
        if duplicate_file_dict:
            duplicate_file = key_with_max_val(duplicate_file_dict)
            duplicate_percentage = (duplicate_file_dict[duplicate_file] / count) * 100
            if duplicate_percentage > 50:
                file = file[:-5]  # strip the ".json" extension
                print(file, " is dup of ", duplicate_file)
                src = analyzed_dir + file
                tgt = duplicate_dir + file
                if path.exists(src):
                    rename(src, tgt)
                # else: the file was already moved


def mv_analyzed_file():
    # Move already-hashed videos from the input directory to analyzed_dir.
    files = next(walk(json_dir))[2]
    for filename in files:
        filename = filename[:-5]  # strip ".json" to get the video filename
        src = input_vid_dir + filename
        tgt = analyzed_dir + filename
        if path.exists(src):
            rename(src, tgt)
        # else: the file was already moved


if __name__ == '__main__':
    mv_analyzed_file()
    multiprocess_video_to_json()
    mv_analyzed_file()
    duplicate_analyzer()
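
The idea, in short: every frame's average hash becomes a dictionary key, and a later video is flagged as a duplicate once more than 50% of its frame hashes were already seen in an earlier file. An extra second at the beginning or end only contributes a few unmatched hashes, so it does not break the match. Flagged files are moved from analyzed_dir to duplicate_dir rather than deleted, so nothing is lost if the 50% threshold misfires.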





answered Feb 12 at 9:19
Gokul C