I’ve been asked to display some images in Jupyter Notebooks. Unfortunately, I do not have a Jupyter editor at the moment so I will have to resort to Markdown.

Prompt 1: Describe some of the meta data…

The image size, which can be measured in pixels, affects the file size of a given image. If the file size of an image becomes too large, the image can be difficult to transfer across the Internet. In that case the image file is compressed, which decreases its file size. According to a College Board video, this is achieved by using an in-file dictionary to decrease the number of characters in a file, thereby decreasing its size. According to this same video, there are two types of compression: lossless compression (all data is preserved) and lossy compression (some data is discarded, on the assumption that it is not needed). Generally, lossy compression reduces the file size more effectively.

In terms of images, there are many file extensions available. Two of the more common ones are .PNG and .JPEG. According to this source, .JPEG uses lossy compression while .PNG uses lossless compression. When lossy compression is utilized, the change in image colors is not easily perceptible. When I took notes on the 2.2 lesson, the instructor sampled multiple images of smiley faces. One image only used 2-3 colors (a shade of yellow, black, and possibly white), while another image used many shades of green. Lossless compression is more favorable for images with fewer colors (many pixels with the same color means fewer dictionary mappings are necessary), while lossy compression is more favorable for images with more colors.
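To get a rough feel for why repetition helps lossless compression, here is a small sketch using Python's zlib module. zlib is a general-purpose lossless compressor, so this is not necessarily the exact dictionary method from the College Board video, and the two "images" are made-up byte buffers (one byte per pixel), not real image files.

```python
import random
import zlib

# Hypothetical "image" data: 10,000 pixels, one byte per pixel.
random.seed(0)  # fixed seed so the run is repeatable
few_colors = bytes(random.choice([0, 255]) for _ in range(10_000))  # only 2 colors
many_colors = bytes(random.randrange(256) for _ in range(10_000))   # up to 256 colors

small = zlib.compress(few_colors)
large = zlib.compress(many_colors)

# Lossless: decompressing gives back exactly the original bytes.
assert zlib.decompress(small) == few_colors
assert zlib.decompress(large) == many_colors

# Original size, then the compressed size of each buffer.
print(len(few_colors), len(small), len(large))
```

The two-color buffer should come out much smaller than the many-color one, even though both start at the same size.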

Prompt 2: Commands for file paths in different OS and pathlib

Here are some commands used to access files

Task	Linux	Windows (CMD)
Print file contents	cat /path/to/file.txt	type C:\path\to\file.txt
Change directory	cd /path/to/directory	cd C:\path\to\directory
List files/folders in directory	ls -la	dir
Edit files	nano /path/to/file.txt	notepad "C:\path\to\file.bat"

The most dramatic variation between these file-related commands in Windows CMD and Linux Terminal is that all the file paths are shown differently: in Linux, forward slashes are used for each directory without any drive, while in Windows, a drive is used (usually the C drive) and backslashes are used for each directory.
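Since the prompt also mentions pathlib, here is a minimal sketch of how Python's pathlib handles both path styles. The paths are the same placeholders used in the table above, not real files, and the "Pure" classes only manipulate the text of a path without touching the filesystem.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Placeholder paths, mirroring the ones in the table above.
linux_path = PurePosixPath("/path/to/file.txt")
windows_path = PureWindowsPath(r"C:\path\to\file.txt")

print(linux_path.parts)    # ('/', 'path', 'to', 'file.txt')
print(windows_path.drive)  # 'C:'
print(windows_path.name)   # 'file.txt'

# The same joining code works for either flavor; pathlib picks the separator.
print(PurePosixPath("/path") / "to" / "file.txt")
print(PureWindowsPath("C:/path") / "to" / "file.txt")
```

In a real program, plain Path picks the right flavor for whatever OS the code is running on.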

Prompt 3: Playing around with provided sample code

Lily (my partner) had some issues loading a Jupyter renderer and a dependency. She just reloaded VSCode and everything worked fine. The issue was relatively strange.

As for image paths, paths are important when working with images so that the program can appropriately find the location of the image within its filesystem. If a man doesn’t know where his favorite place is, he can’t use it. In a similar manner, if the program doesn’t know where an image is, the program can’t display it.

I need to answer a question about pandas.

Prompt 4: Translate the first 3 letters of your name to base64

There are many text-to-base64 converters online, and I’ll be using one here. As seen in the Wikipedia article about Base64, the basis of converting ASCII to Base64 is that each Base64 character holds 6 bits (2^6 = 64), while each ASCII character is stored in 8 bits. As such, each group of 3 ASCII characters provides 24 bits for 4 Base64 characters; equal signs are used as padding to ensure that the number of characters printed is a multiple of 4 (an ASCII string does not necessarily have a length that is a multiple of 3).

Leo –> TGVv

But for the sake of the exercise I’ll also do it manually. This source was very helpful, along with this ASCII code chart. I’ll be re-using information from the chart and the article as I go along. Now, here’s what I’ll do.

Each ASCII character will have 8 bits. The first bit will always be 0 because ASCII is a 7-bit code stored in an 8-bit byte. The next three bits will be the bits at the top row in the column of the ASCII character, and the last four bits will be in the left four columns in the same row as the character (feel free to actually click the link to the article). We are essentially converting each character to binary. Keep in mind that this is case-sensitive.
L –> 01001100
e –> 01100101
o –> 01101111

The article then says that I must regroup my bits into groups of 6 (as I said before, each Base64 character holds 6 bits). I’ll go do that real quick.

010011 000110 010101 101111

We must now change our binary to decimal so that we can match the characters via base64. The article goes into detail about how to do the padding, but the administrators decided to ask for 3 letters so we can skip those parts. I omitted zeros in this case. Note that each binary digit is a power of 2.

010011 = 1 + 2 + 16 = 19
000110 = 2 + 4 = 6
010101 = 1 + 4 + 16 = 21
101111 = 1 + 2 + 4 + 8 + 32 = 47

Now we’ll use the base64 chart in the linked Wikipedia article to convert these decimals.

19 –> T
6 –> G
21 –> V
47 –> v

So Leo (ASCII) = TGVv (base64).
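The manual steps above can also be sketched in Python and checked against the standard library's base64 module. The function name manual_base64 is my own, and like the walkthrough it skips padding, so it only accepts text whose length is a multiple of 3.

```python
import base64

# The standard Base64 alphabet (index 0 is "A", index 63 is "/").
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def manual_base64(text):
    """Encode ASCII text whose length is a multiple of 3 (so no padding is needed)."""
    if len(text) % 3 != 0:
        raise ValueError("this sketch skips padding; use a length that is a multiple of 3")
    # Step 1: each character becomes 8 bits.
    bits = "".join(format(ord(ch), "08b") for ch in text)
    # Step 2: regroup the bits into chunks of 6.
    groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    # Step 3: read each chunk as a number and look it up in the alphabet.
    return "".join(ALPHABET[int(group, 2)] for group in groups)

print(manual_base64("Leo"))                      # TGVv
print(base64.b64encode(b"Leo").decode("ascii"))  # TGVv
```

Both lines print the same result, which matches the online converter.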

Prompt 5: Buffering

I have experienced buffering while working with Chromebooks. I currently use a Chromebook for a lot of computer-related projects. Occasionally, when I type, the characters take a while to appear on the screen. This is probably because the input for my characters is queued before it is output. This buffering is an inconvenience in that it increases the amount of time it takes for me to get my work done. Depending on whether there is a severe time constraint, this could be a minor or a major inconvenience. The same applies to executing functions on images: depending on the image dimensions, an iteration may have to cycle through millions of pixels. Each of those pixels may have to go through a buffer, and under certain circumstances the functions executed on such images may cost time.

Prompt 6: Images (comment on the code)

If you’re asking about sequencing in code, then all code has sequencing. Lines are executed in a certain order.

For the greyscale function, it is likely that the image is converted to an array of pixels (likely based on the RGB values; each pixel in an image is determined by red, green, and blue values ranging from 0 to 255). A blank list gray_data is made to hold the array of the greyscale image.

Then for each pixel in the original image array, the average of the RGB values is taken by adding them and dividing by three. If there are more than three values for an element in the array, the average is taken of all values. Regardless, the average is appended to gray_data, and if there are more than three elements, the fourth element is preserved. The grayscale array is then placed as the data for what will become the grayscale image, which is run through a function that saves the image/array in a buffer and uses base64 to help display the image.
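The provided sample code is not copied into these notes, so the following is only my own sketch of the averaging idea, not the actual function. It assumes each pixel is a tuple of (R, G, B) or (R, G, B, A), averages only the first three values, and keeps a fourth value unchanged if there is one. The name to_grayscale and the sample pixels are made up.

```python
def to_grayscale(pixels):
    """Average the R, G, B values of each pixel; keep a 4th value if present."""
    gray_data = []
    for pixel in pixels:
        average = (pixel[0] + pixel[1] + pixel[2]) // 3  # integer average of R, G, B
        if len(pixel) > 3:
            # Preserve the fourth element (assumed here to be transparency).
            gray_data.append((average, average, average, pixel[3]))
        else:
            gray_data.append((average, average, average))
    return gray_data

# Hypothetical pixels: one RGB pixel and one RGBA pixel.
print(to_grayscale([(255, 0, 0), (10, 20, 30, 128)]))
# [(85, 85, 85), (20, 20, 20, 128)]
```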

For the scaling function, a variable baseWidth of 320 pixels is defined. A variable scalePercent is defined as the quotient of baseWidth divided by (what is likely) the original width of the image (converted to a float variable). The scaleHeight variable is the product of the original height of the image (converted to a float) and the scalePercent variable. This is returned as an integer. Then, the scale variable is defined as the dimensions of the image, being the width and the height respectively. The new dimensions are returned. Similar to the grayscaled image, this is run through the function that saves the image/array in a buffer and uses base64 to help display the image.
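Here is a minimal sketch of just the arithmetic described above. The function name scaled_size and the example dimensions are my own; the real sample code works on an actual image object, which I leave out here.

```python
def scaled_size(width, height, base_width=320):
    """Return (new_width, new_height), keeping the aspect ratio at base_width."""
    scale_percent = base_width / float(width)
    scale_height = int(float(height) * scale_percent)
    return (base_width, scale_height)

# Hypothetical original dimensions.
print(scaled_size(640, 480))   # (320, 240)
print(scaled_size(1280, 720))  # (320, 180)
```

Because the width and height are multiplied by the same factor, the picture keeps its shape instead of being stretched.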

Scaling images can be viewed as a form of lossy compression. That is, reducing the number of pixels in the image can reduce the image file size. However, it is impossible to recover the original (higher-resolution) image from the scaled (lower-resolution) image. Scaling very high resolution images may not severely affect image quality unless the scaling is highly exaggerated (ex: scaling a 1000x1000 pixel image to a 10x10 pixel image).

Prompt: Add images that will likely utilize lossy and lossless data compression.

Here you go.

Source is here

Here are some reasons why the first of these images might utilize lossless compression

There are only two image colors (a shade of red and a shade of white). As such, when constructing the in-file dictionary to compress all those pixels, there are only two possible pixel colors. The image is also a .PNG image, which commonly utilizes lossless compression. As such, this image is likely to utilize lossless compression.

Here are some reasons why the second of these images might utilize lossy compression

This image has a .JPG extension, and these images commonly utilize lossy compression. In addition, there is a high variation of colors and shades in the image. These include various shades of brown, blue, white, and gray. As such, there are too many color values for an in-file dictionary to save much space (given there is not a high degree of repetition of color shades within the image pixels). A lossy compression method could instead result in changes of color shades that are difficult to notice, yet would dramatically decrease file size. The original colors of the pixels would be irrecoverable, however.

Topic 2.3

Actual Notes

1) What needs to be cleaned in grade.json?

Here is a list of attributes that should be cleaned.

In column Student ID, find the actual student ID for that labeled nil.

In column Year in School, replace Junior with 11 and 9th Grade with 9. It is important that each column of a data set be uniform since that makes each column easier to search. Also verify the grade level of student 313, who is listed with a grade level of 20.

In column GPA, verify that student 324’s GPA is actually 4.75. This might be obtained through taking lots of honors classes, but 4.75 is unusually high among the sample, which has the highest GPA of 4.00. This suggests a weighted GPA may have been misplaced among unweighted GPAs.
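The cleaning steps above could be sketched like this in plain Python. The rows are made-up stand-ins shaped like the issues I listed (only the 20 for student 313 and the 4.75 for student 324 come from the notes above), and the valid ranges are my own assumptions, so this is not the real grade.json.

```python
# Hypothetical rows; not the real grade.json.
rows = [
    {"Student ID": 123, "Year in School": "Junior", "GPA": 3.5},
    {"Student ID": 456, "Year in School": "9th Grade", "GPA": 3.0},
    {"Student ID": 313, "Year in School": 20, "GPA": 3.2},
    {"Student ID": 324, "Year in School": 12, "GPA": 4.75},
]

YEAR_LABELS = {"Junior": 11, "9th Grade": 9}  # the two text labels noted above

def clean_year(value):
    """Map known text labels to a number; pass other values through unchanged."""
    return YEAR_LABELS.get(value, value)

def needs_review(row):
    """Flag rows whose values fall outside the assumed valid ranges."""
    year_ok = row["Year in School"] in range(9, 13)  # assumes grades 9-12
    gpa_ok = 0.0 <= row["GPA"] <= 4.0                # assumes an unweighted scale
    return not (year_ok and gpa_ok)

for row in rows:
    row["Year in School"] = clean_year(row["Year in School"])

flagged = [row["Student ID"] for row in rows if needs_review(row)]
print(flagged)  # [313, 324]
```

The flagged rows still need a human to look up the correct values; the code only points out where to look.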

Please see the .ipynb file for the notes and observations on each code cell. Oh wait, I forgot I don’t have a Jupyter Editor. That’s why I’m in Markdown. I guess I’ll do it here.

Comments on all the code segments.

Code Segment 1

import pandas as pd

Just need to import a library before you can use it. That’s really all. I think the as portion allows you to define how you’re going to call the library in subsequent code (this allows for some customizability and also avoiding naming conflicts).

Code Segment 2

df = pd.read_json('files/grade.json')

print(df)

This prints the dataset. It’s probably important to print or otherwise read a dataset before incorporating it into a computer project to make sure the data columns are relatively uniform (clean) and contain correct data.

Code Segment 3

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))

The first print statement prints the data but restricts the information displayed to the GPA column. The second print statement adds a blank line to the output, while the third print statement prints the student IDs alongside their GPAs, but not their indexes (as specified by index=False).

Code Segment 4

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))

The first print statement prints everything again, but this time sorts by GPA. The second substantial print statement includes the argument ascending=False. When inspecting the output, the first output sorts GPA in ascending order, while the second output sorts GPA in descending order. Mr. Yeung asked whether a descending argument existed. Here’s documentation of the relevant commands used, and there is no such argument.

Code Segment

Quiz Questions

Integrity Disclaimer: The author does not condone use of these quiz reflections to cheat on quizzes. Do your own work!

Quiz 1: Data Compression

1) What is an advantage of lossless compression?

Well, lossless compression preserves all the data. Lossy compression can’t do that. Data security often depends more on the encryption algorithms used during file transmission (not the file compression)

2) A user wants to save a file on a storage site, reduce the file size, and restore the file to its original size.

He should use lossless compression… and only lossless compression.

3) Programmer developing social media software; wants to use compression for attachments sent between users. What is true?

(The answer choices make conjectures about videos, images, and audio files)

A) Lossless compression of videos saves more space than lossy compression: Largely circumstantial. If the video file has frames with very few different colors (ex: screen recordings where screen GUI elements have uniform colors), lossless compression may be ideal. However, most videos of real-life settings will have pixels of many colors in the frames. Lossy compression is therefore ideal in most scenarios.

B) Lossless compression of images results in images of equal size to the original file: If this were true, the compression operation would be pointless since it’s meant to decrease file size.

C) Lossy compression of images provides greater reduction in transmission time than lossless: Lossy compression decreases image file size for images of diverse colors more effectively than lossless compression, which in turn decreases transmission time. This seems favorable.

D) Sound clips with lossy compression can be restored to original quality: Lossy compression will result in irreversible loss of data, so this is false.

Quiz 2: Binary Numbers

1) A programming language has 4-bit binary for nonnegative integers. Programmers try to add 14 and 15 and assign to total. What will happen?

The sum of 14 and 15 is 29 in decimal. Must find the maximum displayable decimal number for 4-bits

1111 = 1 + 2 + 4 + 8 = 15

Yeah, this will be an overflow error; the sum is larger than the largest possible number to display with 4 bits.
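Python integers don't overflow on their own, so to mimic the 4-bit language from the question, this sketch checks the limit by hand. The names BITS, MAX_VALUE, and add_checked are my own.

```python
BITS = 4
MAX_VALUE = 2 ** BITS - 1  # largest nonnegative integer that fits in 4 bits

def add_checked(a, b):
    """Add two nonnegative integers, raising if the sum does not fit in BITS bits."""
    total = a + b
    if total > MAX_VALUE:
        raise OverflowError(f"{total} does not fit in {BITS} bits (max {MAX_VALUE})")
    return total

print(MAX_VALUE)          # 15
print(add_checked(7, 8))  # 15, which still fits
try:
    add_checked(14, 15)   # 29 is too large
except OverflowError as err:
    print("overflow:", err)
```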

2) In two versions of a game, the character can face 4 directions in the 1st version and 8 directions in the 2nd version. What is true about storing 8 directions in memory?

(all the answer choices pertain to 4 bits)

The maximum number of values storable in 4 bits is expressed by:

1111 = 1 + 2 + 4 + 8 = 15

Then add 1 (don’t forget the 0) to get 16 possible values

4 bits is enough to store 8 direction values.

3) What is true?

I) Binary sequences can be used to represent strings
II) Binary sequences can be used to represent colors
III) Binary sequences can be used to represent audio recordings

As far as I know, binary sequences represent all data present on a computer/machine.

4) Consider 0011, 0110, and 1111. Which decimal is not equal?

0011 = 1 + 2 = 3
0110 = 2 + 4 = 6
1111 = 1 + 2 + 4 + 8 = 15

Answer option 9 doesn’t match

5) Runner’s position is analog data. How is the position of the runner represented digitally (via sensors)?

I could be wrong here. It’s certainly not 0 or 1 depending on if the start or the finish is closer. In my physics class, motion sensors took 20 samples per second of object position using clicking sounds. According to an answer option, these positions should be represented by bits.

6) Sort from least to greatest

Binary 1011 = 1 + 2 + 8 = 11
Binary 1101 = 1 + 4 + 8 = 13

5 < 11 < 12 < 13
5 < Binary 1011 < 12 < Binary 1101
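The same conversion and ordering can be double-checked in Python, where int(text, 2) reads a string of binary digits as a number. The dictionary labels are just my own way of keeping track of which value is which.

```python
# The four values from the question: two decimal numbers and two binary strings.
values = {
    "5": 5,
    "Binary 1011": int("1011", 2),
    "12": 12,
    "Binary 1101": int("1101", 2),
}

# Sort the labels by the number each one stands for.
ordered = sorted(values, key=values.get)
print(ordered)  # ['5', 'Binary 1011', '12', 'Binary 1101']
```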

Quiz 3: Extracting Information from Data

1) A researcher has access to two databases

Database with first/last names, grade levels, GPAs
Database with first/last names, number of absences (school), number of tardies

Students don’t have ID numbers. What’s the problem?

The answer that fits is: Students with the same name may be confused. This is because names are a form of identification, similarly to ID numbers.

2) Researchers make program to analyze pollution in US counties. What is the challenge?

A few options seem strange, particularly those pertaining to there being too much data. Data files can be very large, so size is no problem. The answer options raise issues due to combining data from different files and different organizations of data between counties. Logically, data files can be combined with ease. If the data is organized differently, it may be difficult to display the data in a uniform manner.

3) Student makes web site to display city info based on inputted city name. What is a challenge?

The first thing that comes to mind (without me looking at the options) is multiple possible city names and case-sensitivity of inputs.

The former option already applies to abbreviated city names and spelling errors.

4) Database of info at concert shows:

Name of artist at show
Show date
$ of all tickets

What other info useful for determining artist with greatest attendance?

Dividing income from tickets by price of tickets will determine number of tickets sold. It is possible that someone will purchase a ticket without going to the show, but I’d guess this applies to few people.

5) Camera mounted on car captures image from driver’s seat every second + metadata (speed, date/time, GPS position). What is best determined without the metadata?

The option about the number of bicycles passed does not require the metadata. It only requires counting each unique bike in the pictures.

6) Teacher sends survey of work habits

Minutes of HW per night
Minutes of test prep per test
Enjoy class materials?

Which questions about students can teacher answer?

i) Do students enjoying class material spend more time on HW?
ii) Do students spending more time on HW spend less time studying for tests?
iii) Do students spending more time studying for tests earn higher grades?

Grades are not even in the dataset, so iii is wrong.

Quiz 4: Using Programs with Data

1) Spreadsheet contains lots of info (I’m not going to copy down the spreadsheet). Count # of books with criteria:

Mystery book
< $10 cost
At least one copy.

genre is the genre of a book, num is number of copies, and cost is the cost. Get an expression to evaluate to true only if the criteria are satisfied.

(genre = “mystery”) AND ((num > 0) AND (cost < 10.00))

Yeah, I did that before consulting the answer choices. They did things slightly differently (namely setting num to be greater than or equal to 1), but you get the point. Multiple ways to solve the same problem. I probably have to group the right two parameters since I expect boolean operations to work in pairs.
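For what it's worth, here is how that expression might look in Python. The book rows are made up since I didn't copy down the spreadsheet. Python's and can chain three conditions without the extra grouping, because every condition has to be true either way.

```python
# Hypothetical rows; the real spreadsheet is not reproduced here.
books = [
    {"genre": "mystery", "num": 2, "cost": 8.99},
    {"genre": "mystery", "num": 0, "cost": 5.00},
    {"genre": "romance", "num": 3, "cost": 7.50},
    {"genre": "mystery", "num": 1, "cost": 12.00},
]

def matches(book):
    """True only for a mystery book under $10 with at least one copy."""
    return book["genre"] == "mystery" and book["num"] > 0 and book["cost"] < 10.00

count = sum(1 for book in books if matches(book))
print(count)  # 1
```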

(Going back to the teacher survey in Quiz 3, question 6:) HW time, studying for tests, and class enjoyment are in the dataset. The teacher could sort data in ascending/descending order of one variable and search for a relationship between that variable and a different variable.

2) Clothing store records over 7 days:

Date transaction
Method of payment
items purchased
Transaction cost

Possible methods payment are cash, check, debit card, credit card. What is true?

A) Avg amount spent per day determined by sorting by total transaction, add 7 greatest amounts, divide by 7: This is an average of the 7 most expensive purchases (not for the entire day)

B) Method of payment used most commonly determined by sorting data by method payment, adding # items purchased for each type, then finding max sum: This determines the method of payment used to purchase the greatest number of items (not the method of payment used in most transactions)

C) Most expensive item can be determined by searching data for all items… that operation is not supported in the data. The data is about the transactions, not the items.

D) Total # items purchased in day determined by searching all transactions, adding # of items purchased for each transaction that day: True.

3) Company with data files…

(there’s a big table with types of data available)

If a new product with AA batteries is available, how to use data files to send emails to customers who purchased products using AA batteries?

This is a bit invasive. Regardless…

Use products to find products IDs with AA batteries. Bridge from Product IDs to Customer IDs using Purchases. Bridge from customer IDs to email-addresses using customers.
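The bridging idea can be sketched with small Python dictionaries. All of the IDs, battery types, and email addresses below are made up, and the real data files surely have more columns than this.

```python
# Hypothetical stand-ins for the three data files described above.
products = {"P1": "AA", "P2": "AAA", "P3": "AA"}  # product ID -> battery type
purchases = [("C1", "P1"), ("C2", "P2"), ("C3", "P3"), ("C1", "P3")]  # (customer ID, product ID)
customers = {"C1": "a@example.com", "C2": "b@example.com", "C3": "c@example.com"}

# Step 1: product IDs that use AA batteries.
aa_products = {pid for pid, battery in products.items() if battery == "AA"}
# Step 2: customer IDs that purchased any of those products.
aa_customers = {cid for cid, pid in purchases if pid in aa_products}
# Step 3: email addresses for those customers.
emails = sorted(customers[cid] for cid in aa_customers)
print(emails)  # ['a@example.com', 'c@example.com']
```

Using sets means a customer who bought two AA products (like C1 here) only gets one email.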

4) Spreadsheet contains photography info

Student makes algorithm to determine name of photographer for oldest photo; ignore unknown photographers and years. Entry will appear in rows of spreadsheet. Actions are available.

Need to filter by photographer, filter by year, and sort by year. I don’t think order matters; if the pictures with unknown years are sorted before being removed, positions will be adjusted as if they weren’t there.

5) There’s a spreadsheet (a portion provided) - for a radio station

Students want to count talk shows on Saturday or Sunday.

genre is genre of show (string), day is day (string). Make an expression to evaluate to true if show meets criteria…

(genre = “talk”) AND ((day = “Saturday”) OR (day = “Sunday”))

6) Wildlife preserve makes exhibit. There’s data categories.

Database	Info
1	Name, classification, skin type, thermoregulation
2	Name, lifestyle, life span, top speed

How can these databases be used?

Use both; search by name.